Bug: Upgrading a PostgreSQL replication node fails due to a command timeout
Summary
When I upgrade a Patroni replica node, the upgrade fails. After investigation, I found that when the upgrade command pg-upgrade is executed, the replica database node uses the pg_basebackup command to pull a fresh base backup from the leader node. Because GitLab did not detect that the running database was available within the allotted time, the reconfigure was aborted by a timeout, and the pg_basebackup command was interrupted along with it. I therefore set postgresql['max_service_checks'] = 20 and postgresql['service_check_interval'] = 60 in gitlab.rb (these settings are not present there by default), which raised the command timeout from 3 minutes to 10 minutes, but 10 minutes is still not enough. I do not know which parameter in gitlab.rb can raise it further; from the source code I found that 600 s is the limit for a running command, and 10 minutes is not enough for a database of 100 GB+.
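For reference, these are the two settings I changed. They go in /etc/gitlab/gitlab.rb and take effect after `sudo gitlab-ctl reconfigure`; the values below are the ones I used, not defaults:

```ruby
# /etc/gitlab/gitlab.rb
# The total wait is roughly max_service_checks * service_check_interval seconds.
postgresql['max_service_checks'] = 20
postgresql['service_check_interval'] = 60
```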
Steps to reproduce
- Create a Patroni cluster
- Limit network traffic to simulate a large amount of data in the database (i.e. a slow base backup)
- Upgrade the leader node
- Upgrade a replica node; the error occurs
Example Project
What is the current bug behavior?
Upgrading a Patroni replica node fails when the database contains a large amount of data.
What is the expected correct behavior?
The Patroni replica node upgrades successfully.
Relevant logs and/or screenshots
Command output error logs:
================================================================================
Error executing action `run` on resource 'ruby_block[wait for postgresql to start]'
================================================================================
RuntimeError
------------
PostgreSQL did not respond before service checks were exhausted
Cookbook Trace:
---------------
/opt/gitlab/embedded/cookbooks/cache/cookbooks/gitlab/libraries/helpers/pg_status_helper.rb:56:in `ready?'
/opt/gitlab/embedded/cookbooks/cache/cookbooks/gitlab/libraries/helpers/base_pg_helper.rb:28:in `is_ready?'
/opt/gitlab/embedded/cookbooks/cache/cookbooks/patroni/recipes/enable.rb:93:in `block (2 levels) in from_file'
Resource Declaration:
---------------------
# In /opt/gitlab/embedded/cookbooks/cache/cookbooks/patroni/recipes/enable.rb
92: ruby_block 'wait for postgresql to start' do
93: block { pg_helper.is_ready? }
94: only_if { omnibus_helper.should_notify?(patroni_helper.service_name) }
95: end
96:
Compiled Resource:
------------------
# Declared in /opt/gitlab/embedded/cookbooks/cache/cookbooks/patroni/recipes/enable.rb:92:in `from_file'
ruby_block("wait for postgresql to start") do
action [:run]
default_guard_interpreter :default
declared_type :ruby_block
cookbook_name "patroni"
recipe_name "enable"
block #<Proc:0x00000000044db008 /opt/gitlab/embedded/cookbooks/cache/cookbooks/patroni/recipes/enable.rb:93>
block_name "wait for postgresql to start"
only_if { #code block }
end
System Info:
------------
chef_version=15.14.0
platform=centos
platform_version=7.9.2009
ruby=ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-linux]
program_name=/opt/gitlab/embedded/bin/chef-client
executable=/opt/gitlab/embedded/bin/chef-client
Running handlers:
Running handlers complete
Chef Infra Client failed. 4 resources updated in 01 minutes 39 seconds
===STDERR===
There was an error running gitlab-ctl reconfigure:
ruby_block[wait for postgresql to start] (patroni::enable line 92) had an error: RuntimeError: PostgreSQL did not respond before service checks were exhausted
======
== Fatal error ==
Error updating PostgreSQL configuration. Please check the output
== Reverting ==
ok: down: patroni: 1s, normally up
Symlink correct version of binaries: OK
ok: run: patroni: (pid 23741) 1s
== Reverted ==
== Reverted to 11.11. Please check output for what went wrong ==
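The failing resource simply polls `pg_helper.is_ready?` until a bounded number of service checks is exhausted. A minimal sketch of that pattern (not the actual GitLab source; the method name is illustrative, only the error message mirrors the output above):

```ruby
# Hypothetical sketch of a bounded readiness loop like the one behind
# ruby_block 'wait for postgresql to start'. The caller's block stands in
# for the real pg_isready check.
def wait_until_ready(max_service_checks:, service_check_interval:)
  max_service_checks.times do
    return true if yield            # e.g. pg_isready succeeded
    sleep service_check_interval    # total wait: checks * interval seconds
  end
  raise 'PostgreSQL did not respond before service checks were exhausted'
end
```

With a loop like this, it is the check budget, not pg_basebackup itself, that decides how long the replica bootstrap is allowed to take, which would explain why a 100 GB+ base backup over a slow link exhausts it.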
patroni logs:
2022-02-26_07:01:43.24105 2022-02-26 15:01:43,240 WARNING: Could not register service: unknown role type uninitialized
2022-02-26_07:01:43.24108 2022-02-26 15:01:43,240 INFO: bootstrap from leader 'postgresql-02' in progress
2022-02-26_07:01:43.33090 pg_basebackup: error: could not create directory "/mnt/postgresql/data/pg_wal": Permission denied
2022-02-26_07:01:43.33115 pg_basebackup: removing contents of data directory "/mnt/postgresql/data"
2022-02-26_07:01:43.33170 2022-02-26 15:01:43,331 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2022-02-26_07:01:43.33171 2022-02-26 15:01:43,331 WARNING: Trying again in 5 seconds
2022-02-26_07:01:48.64477 pg_basebackup: error: could not create directory "/mnt/postgresql/data/pg_wal": Permission denied
2022-02-26_07:01:48.64479 pg_basebackup: removing contents of data directory "/mnt/postgresql/data"
2022-02-26_07:01:48.64601 2022-02-26 15:01:48,645 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2022-02-26_07:01:48.64602 2022-02-26 15:01:48,645 ERROR: failed to bootstrap from leader 'postgresql-02'
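Note that separately from the timeout, the Patroni log above shows pg_basebackup failing with "Permission denied" on /mnt/postgresql/data/pg_wal, so directory permissions on the custom data mount are worth checking before retrying. A small sketch that verifies a directory is writable (the path and the gitlab-psql owner are assumptions taken from the log, not confirmed defaults for every setup):

```shell
# Check that the replica's data directory is writable before retrying the
# bootstrap. DATADIR defaults to /tmp here so the sketch is safely runnable;
# point it at the real mount, e.g. DATADIR=/mnt/postgresql/data.
# A likely fix when this reports NOT writable (assumption: omnibus default
# database user): chown -R gitlab-psql:gitlab-psql /mnt/postgresql/data
DATADIR="${DATADIR:-/tmp}"
if [ -d "$DATADIR" ] && [ -w "$DATADIR" ] && [ -x "$DATADIR" ]; then
  echo "writable: $DATADIR"
else
  echo "NOT writable: $DATADIR"
fi
```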
Output of checks
Results of GitLab environment info
Results of GitLab application Check