Bug: Upgrading a PostgreSQL replication node fails due to command timeout

Summary

Upgrading a Patroni replica node fails. After investigation, I found that when the upgrade command pg-upgrade is executed, the replica database node uses pg_basebackup to pull a fresh base backup from the leader node. Because GitLab did not detect that the running database was available within the allowed time, the reconfigure step timed out and the in-progress pg_basebackup was immediately interrupted. I added postgresql['max_service_checks'] = 20 and postgresql['service_check_interval'] = 60 to gitlab.rb (neither setting is present there by default), which raised the command timeout from 3 minutes to 10 minutes, but 10 minutes is still not enough. I do not know which parameter in gitlab.rb can raise it further; from the source code, 600 s appears to be the limit for the running command, and that is not enough for a database of 100 GB+.
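
For reference, this is how I set the two parameters in /etc/gitlab/gitlab.rb (gitlab.rb is plain Ruby; the values below are simply the ones I tried, and neither line is present in the file by default):

    # /etc/gitlab/gitlab.rb -- values I used while testing, not defaults
    postgresql['max_service_checks'] = 20      # number of readiness checks before giving up
    postgresql['service_check_interval'] = 60  # seconds to wait between checks

Even with these values the reconfigure step still gave up after roughly 10 minutes, which matches the 600 s limit I found in the source.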

Steps to reproduce

  1. Create a Patroni cluster.
  2. Limit network traffic to simulate a large database (so the base backup takes a long time).
  3. Upgrade the leader node.
  4. Upgrade the replica node; the error occurs.

Example Project

What is the current bug behavior?

Upgrading a Patroni replica node fails when the database contains a large amount of data.

What is the expected correct behavior?

The Patroni replica node upgrades successfully.

Relevant logs and/or screenshots

Command output error logs:

    ================================================================================
    Error executing action `run` on resource 'ruby_block[wait for postgresql to start]'
    ================================================================================

    RuntimeError
    ------------
    PostgreSQL did not respond before service checks were exhausted

    Cookbook Trace:
    ---------------
    /opt/gitlab/embedded/cookbooks/cache/cookbooks/gitlab/libraries/helpers/pg_status_helper.rb:56:in `ready?'
    /opt/gitlab/embedded/cookbooks/cache/cookbooks/gitlab/libraries/helpers/base_pg_helper.rb:28:in `is_ready?'
    /opt/gitlab/embedded/cookbooks/cache/cookbooks/patroni/recipes/enable.rb:93:in `block (2 levels) in from_file'

    Resource Declaration:
    ---------------------
    # In /opt/gitlab/embedded/cookbooks/cache/cookbooks/patroni/recipes/enable.rb

     92: ruby_block 'wait for postgresql to start' do
     93:   block { pg_helper.is_ready? }
     94:   only_if { omnibus_helper.should_notify?(patroni_helper.service_name) }
     95: end
     96:

    Compiled Resource:
    ------------------
    # Declared in /opt/gitlab/embedded/cookbooks/cache/cookbooks/patroni/recipes/enable.rb:92:in `from_file'

    ruby_block("wait for postgresql to start") do
      action [:run]
      default_guard_interpreter :default
      declared_type :ruby_block
      cookbook_name "patroni"
      recipe_name "enable"
      block #<Proc:0x00000000044db008 /opt/gitlab/embedded/cookbooks/cache/cookbooks/patroni/recipes/enable.rb:93>
      block_name "wait for postgresql to start"
      only_if { #code block }
    end

    System Info:
    ------------
    chef_version=15.14.0
    platform=centos
    platform_version=7.9.2009
    ruby=ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-linux]
    program_name=/opt/gitlab/embedded/bin/chef-client
    executable=/opt/gitlab/embedded/bin/chef-client


Running handlers:
Running handlers complete
Chef Infra Client failed. 4 resources updated in 01 minutes 39 seconds
===STDERR===
There was an error running gitlab-ctl reconfigure:

ruby_block[wait for postgresql to start] (patroni::enable line 92) had an error: RuntimeError: PostgreSQL did not respond before service checks were exhausted

======
== Fatal error ==
Error updating PostgreSQL configuration. Please check the output
== Reverting ==
ok: down: patroni: 1s, normally up
Symlink correct version of binaries: OK
ok: run: patroni: (pid 23741) 1s
== Reverted ==
== Reverted to 11.11. Please check output for what went wrong ==

Patroni logs:

2022-02-26_07:01:43.24105 2022-02-26 15:01:43,240 WARNING: Could not register service: unknown role type uninitialized
2022-02-26_07:01:43.24108 2022-02-26 15:01:43,240 INFO: bootstrap from leader 'postgresql-02' in progress
2022-02-26_07:01:43.33090 pg_basebackup: error: could not create directory "/mnt/postgresql/data/pg_wal": Permission denied
2022-02-26_07:01:43.33115 pg_basebackup: removing contents of data directory "/mnt/postgresql/data"
2022-02-26_07:01:43.33170 2022-02-26 15:01:43,331 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2022-02-26_07:01:43.33171 2022-02-26 15:01:43,331 WARNING: Trying again in 5 seconds
2022-02-26_07:01:48.64477 pg_basebackup: error: could not create directory "/mnt/postgresql/data/pg_wal": Permission denied
2022-02-26_07:01:48.64479 pg_basebackup: removing contents of data directory "/mnt/postgresql/data"
2022-02-26_07:01:48.64601 2022-02-26 15:01:48,645 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2022-02-26_07:01:48.64602 2022-02-26 15:01:48,645 ERROR: failed to bootstrap from leader 'postgresql-02'

Output of checks

Results of GitLab environment info

Expand for output related to GitLab environment info

(For installations with omnibus-gitlab package run and paste the output of:
`sudo gitlab-rake gitlab:env:info`)

(For installations from source run and paste the output of:
`sudo -u git -H bundle exec rake gitlab:env:info RAILS_ENV=production`)

Results of GitLab application Check

Expand for output related to the GitLab application check

(For installations with omnibus-gitlab package run and paste the output of: sudo gitlab-rake gitlab:check SANITIZE=true)

(For installations from source run and paste the output of: sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production SANITIZE=true)

(we will only investigate if the tests are passing)

Possible fixes
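
I have not found a supported setting that raises the wait beyond the 600 s mentioned above. Based on the backtrace (ready? in pg_status_helper.rb:56) and the two settings, my understanding of the failing check is roughly the sketch below; it is a simplified standalone illustration, not the actual omnibus-gitlab source, and the pg_isready call is only an example of a readiness probe:

    # Simplified illustration (not the real omnibus-gitlab code) of the wait loop behind
    # "PostgreSQL did not respond before service checks were exhausted".
    # Both parameters come from gitlab.rb; the block stands in for the real status check,
    # which a replica cannot answer until pg_basebackup from the leader has finished.
    def wait_for_postgresql(max_service_checks, service_check_interval)
      max_service_checks.times do
        return true if yield                 # e.g. pg_isready against the local instance
        sleep service_check_interval
      end
      raise 'PostgreSQL did not respond before service checks were exhausted'
    end

    # With my settings this allows at most 20 * 60 = 1200 s of checking, yet the
    # command was still cut off after about 600 s.
    wait_for_postgresql(20, 60) { system('pg_isready -q') }

A possible fix would be to make this wait (or the overall 600 s command limit) configurable to much larger values, or to have the replica upgrade wait until the Patroni bootstrap / pg_basebackup from the leader has completed, so that nodes with 100 GB+ of data can finish the base backup before the check gives up.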
