db_load_balancing malfunction causes the Patroni cluster to lose high availability
Summary
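The cluster is configured according to the 3k cluster configuration documentation. JiHu GitLab version: 15.11.2-jh
Currently, enabling db_load_balancing in a high-availability cluster actually causes the Patroni cluster to lose its high availability: even if only a single Patroni replica node goes down, the entire GitLab cluster becomes unavailable and requests return 502.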
Steps to reproduce
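Deploy a 3k high-availability cluster using the JiHu GitLab 15.11.2-jh package and configure database_load_balancing for Rails and Sidekiq. Once the cluster is running normally, start the high-availability test.
Test method: forcibly power off any one Patroni node. Note: the node must be forcibly powered off to trigger this bug. Stopping the Patroni node with `gitlab-ctl stop` alone does not trigger it.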
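The difference between the two shutdown methods can be observed at the TCP level from a Rails or Sidekiq node. As a general rule, a host that is still running but has stopped its database service rejects new connections immediately, while a powered-off host usually does not answer at all, so the client waits until its timeout expires. Whether this is what stalls the load balancer here is not confirmed. The following is a minimal diagnostic sketch; the function name `probe` and the host and port in the usage comment are hypothetical and not taken from the report.

```python
import socket


def probe(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify how a TCP endpoint answers a connection attempt.

    Returns "open" if the connection succeeds, "refused" if the host
    actively rejects it (the OS is up but nothing listens on the port),
    and "unreachable" for timeouts and other network errors (typical
    for a host that has been powered off).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        # Covers timeouts, "no route to host", and name resolution errors.
        return "unreachable"


# Example usage (hypothetical address and port, adjust to your cluster):
#   print(probe("10.0.0.21", 5432))
```

Running the probe against each host listed in the db_load_balancing configuration, once after `gitlab-ctl stop` and once after a forced power-off, shows whether the two cases differ in the way described above.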
Example Project
What is the current bug behavior?
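After one Patroni node is forcibly powered off, the entire GitLab cluster becomes unavailable and high availability is lost.
Observing `/var/log/gitlab/gitlab-rails/database_load_balancing.log` shows the following: after a Patroni node is stopped with `gitlab-ctl stop`, the log reports the corresponding node as offline, which is the expected behavior. However, after a Patroni node is forcibly powered off, database_load_balancing.log no longer contains any entries for checks against that node, even though the failed node is still present in the db_load_balancing configuration. I suspect this may be the cause of the bug.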
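To check which hosts have stopped appearing in the load balancing log, the log can be summarized per host. The sketch below assumes the log is written as one JSON object per line and that each entry carries the host in a `db_host` field, a timestamp in `time`, and an event name in `event`. These field names are assumptions and should be checked against the actual log; the function name `last_seen_per_host` is hypothetical.

```python
import json
from typing import Dict, Iterable


def last_seen_per_host(
    lines: Iterable[str],
    host_key: str = "db_host",  # assumed field name
    time_key: str = "time",     # assumed field name
) -> Dict[str, dict]:
    """Return the most recent log entry summary for every host.

    Lines are assumed to be in chronological order, so the last entry
    seen for a host wins. Lines that are not JSON objects are skipped.
    """
    latest: Dict[str, dict] = {}
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict):
            continue
        host = entry.get(host_key)
        if host is None:
            continue
        latest[host] = {"time": entry.get(time_key), "event": entry.get("event")}
    return latest


# Example usage:
#   with open("/var/log/gitlab/gitlab-rails/database_load_balancing.log") as f:
#       for host, info in last_seen_per_host(f).items():
#           print(host, info)
```

A host whose last entry is much older than those of the other hosts, and that never received an offline entry, would match the behavior described above.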
When this failure occurs, Puma logs the following errors:
==> /var/log/gitlab/puma/current <==
2023-06-08_06:40:25.89442 {"timestamp":"2023-06-08T06:40:25.894Z","pid":59844,"message":"* Master PID: 59844"}
2023-06-08_06:40:25.89443 {"timestamp":"2023-06-08T06:40:25.894Z","pid":59844,"message":"* Workers: 8"}
2023-06-08_06:40:25.89444 {"timestamp":"2023-06-08T06:40:25.894Z","pid":59844,"message":"* Restarts: (✔) hot (✖) phased"}
2023-06-08_06:40:25.89445 {"timestamp":"2023-06-08T06:40:25.894Z","pid":59844,"message":"* Preloading application"}
2023-06-08_06:41:01.53819 {"timestamp":"2023-06-08T06:41:01.538Z","pid":59844,"message":"* Listening on unix:///var/opt/gitlab/gitlab-rails/sockets/gitlab.socket"}
2023-06-08_06:41:01.53830 {"timestamp":"2023-06-08T06:41:01.538Z","pid":59844,"message":"* Listening on http://0.0.0.0:8080"}
2023-06-08_06:41:01.53834 {"timestamp":"2023-06-08T06:41:01.538Z","pid":59844,"message":"! WARNING: Detected 2 Thread(s) started in app boot:"}
2023-06-08_06:41:01.53838 {"timestamp":"2023-06-08T06:41:01.538Z","pid":59844,"message":"! #\u003cThread:0x00007f82d2cd5080 /opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/rack-timeout-0.6.3/lib/rack/timeout/support/scheduler.rb:73 sleep\u003e - /opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/rack-timeout-0.6.3/lib/rack/timeout/support/scheduler.rb:91:in `sleep'"}
2023-06-08_06:41:01.53842 {"timestamp":"2023-06-08T06:41:01.538Z","pid":59844,"message":"! #\u003cThread:0x00007f82d9d9e3e8 /opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/sentry-ruby-5.8.0/lib/sentry/session_flusher.rb:81 sleep\u003e - /opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/sentry-ruby-5.8.0/lib/sentry/session_flusher.rb:83:in `sleep'"}
2023-06-08_06:41:01.53848 {"timestamp":"2023-06-08T06:41:01.538Z","pid":59844,"message":"Use Ctrl-C to stop"}
==> /var/log/gitlab/puma/puma_stdout.log <==
Note: GC compacting is currently disabled. Refer to `config/initializers_before_autoloader/003_gc_compact.rb` for details.
{"timestamp":"2023-06-08T06:41:01.653Z","pid":59844,"message":"! Friendly fork preparation complete."}
{"timestamp":"2023-06-08T06:41:02.266Z","pid":59844,"message":"- Worker 0 (PID: 60001) booted in 0.6s, phase: 0"}
{"timestamp":"2023-06-08T06:41:02.276Z","pid":59844,"message":"- Worker 1 (PID: 60003) booted in 0.6s, phase: 0"}
{"timestamp":"2023-06-08T06:41:02.278Z","pid":59844,"message":"- Worker 2 (PID: 60005) booted in 0.6s, phase: 0"}
{"timestamp":"2023-06-08T06:41:02.278Z","pid":59844,"message":"- Worker 3 (PID: 60007) booted in 0.58s, phase: 0"}
{"timestamp":"2023-06-08T06:41:02.313Z","pid":59844,"message":"- Worker 4 (PID: 60009) booted in 0.6s, phase: 0"}
{"timestamp":"2023-06-08T06:41:02.325Z","pid":59844,"message":"- Worker 6 (PID: 60013) booted in 0.59s, phase: 0"}
{"timestamp":"2023-06-08T06:41:02.325Z","pid":59844,"message":"- Worker 5 (PID: 60011) booted in 0.61s, phase: 0"}
{"timestamp":"2023-06-08T06:41:02.342Z","pid":59844,"message":"- Worker 7 (PID: 60015) booted in 0.6s, phase: 0"}
==> /var/log/gitlab/puma/puma_stderr.log <==
source=rack-timeout id=01H2CWBPPP716GGSW7VKTFT0HN timeout=60000ms service=60000ms state=timed_out at=error
source=rack-timeout id=01H2CWBXHHN5QT7187PFJ4BVX5 timeout=60000ms service=60000ms state=timed_out at=error
source=rack-timeout id=01H2CWBZ8DBY3WDPRWA857RHSX timeout=60000ms service=60000ms state=timed_out at=error
source=rack-timeout id=01H2CWC4CB29E7K6JV8XQZ1K5S timeout=60000ms service=60000ms state=timed_out at=error
source=rack-timeout id=01H2CWCB76K8BDZZ26ZXN88EY4 timeout=60000ms service=60000ms state=timed_out at=error
Workhorse logs the following errors:
{"correlation_id":"01H2CVY97PA6G546CKJRQJCYJ4","duration_ms":1999,"error":"badgateway: failed to receive response: context canceled","level":"error","method":"GET","msg":"","time":"2023-06-08T14:43:42+08:00","uri":"/-/readiness"}
{"content_type":"application/json; charset=utf-8","correlation_id":"01H2CVY97PA6G546CKJRQJCYJ4","duration_ms":1999,"host":"jihu-ha.futureman.xin","level":"info","method":"GET","msg":"access","proto":"HTTP/1.1","referrer":"","remote_addr":"127.0.0.1:0","remote_ip":"127.0.0.1","route":"^/-/(readiness|liveness)$","status":499,"system":"http","time":"2023-06-08T14:43:42+08:00","ttfb_ms":1999,"uri":"/-/readiness","user_agent":"","written_bytes":26}
{"correlation_id":"01H2CVYG2HGRXATFG259H02WNX","duration_ms":2000,"error":"badgateway: failed to receive response: context canceled","level":"error","method":"GET","msg":"","time":"2023-06-08T14:43:49+08:00","uri":"/-/readiness"}
{"content_type":"application/json; charset=utf-8","correlation_id":"01H2CVYG2HGRXATFG259H02WNX","duration_ms":2000,"host":"jihu-ha.futureman.xin","level":"info","method":"GET","msg":"access","proto":"HTTP/1.1","referrer":"","remote_addr":"127.0.0.1:0","remote_ip":"127.0.0.1","route":"^/-/(readiness|liveness)$","status":499,"system":"http","time":"2023-06-08T14:43:49+08:00","ttfb_ms":2000,"uri":"/-/readiness","user_agent":"","written_bytes":26}
{"correlation_id":"01H2CVYPXCDRZNT4MRRAVSHE30","duration_ms":2000,"error":"badgateway: failed to receive response: context canceled","level":"error","method":"GET","msg":"","time":"2023-06-08T14:43:56+08:00","uri":"/-/readiness"}
{"content_type":"application/json; charset=utf-8","correlation_id":"01H2CVYPXCDRZNT4MRRAVSHE30","duration_ms":2001,"host":"jihu-ha.futureman.xin","level":"info","method":"GET","msg":"access","proto":"HTTP/1.1","referrer":"","remote_addr":"127.0.0.1:0","remote_ip":"127.0.0.1","route":"^/-/(readiness|liveness)$","status":499,"system":"http","time":"2023-06-08T14:43:56+08:00","ttfb_ms":2001,"uri":"/-/readiness","user_agent":"","written_bytes":26}
{"error":"keywatcher: pubsub receive: EOF","level":"error","msg":"","time":"2023-06-08T14:43:58+08:00"}
{"address":"10.12.1.6:6379","level":"info","msg":"redis: dialing","network":"tcp","time":"2023-06-08T14:43:58+08:00"}
{"correlation_id":"01H2CVYHFF9XHN9FKD03SQ5ZV4","duration_ms":10000,"error":"badgateway: failed to receive response: context canceled","level":"error","method":"GET","msg":"","time":"2023-06-08T14:43:59+08:00","uri":"/-/readiness"}
{"content_type":"application/json; charset=utf-8","correlation_id":"01H2CVYHFF9XHN9FKD03SQ5ZV4","duration_ms":10000,"host":"jihu-ha.futureman.xin","level":"info","method":"GET","msg":"access","proto":"HTTP/1.1","referrer":"","remote_addr":"127.0.0.1:0","remote_ip":"127.0.0.1","route":"^/-/(readiness|liveness)$","status":499,"system":"http","time":"2023-06-08T14:43:59+08:00","ttfb_ms":10000,"uri":"/-/readiness","user_agent":"","written_bytes":26}
{"correlation_id":"01H2CVYXR8NEPCVQ2JZDTTEVJZ","duration_ms":2000,"error":"badgateway: failed to receive response: context canceled","level":"error","method":"GET","msg":"","time":"2023-06-08T14:44:03+08:00","uri":"/-/readiness"}
{"content_type":"application/json; charset=utf-8","correlation_id":"01H2CVYXR8NEPCVQ2JZDTTEVJZ","duration_ms":2001,"host":"jihu-ha.futureman.xin","level":"info","method":"GET","msg":"access","proto":"HTTP/1.1","referrer":"","remote_addr":"127.0.0.1:0","remote_ip":"127.0.0.1","route":"^/-/(readiness|liveness)$","status":499,"system":"http","time":"2023-06-08T14:44:03+08:00","ttfb_ms":2000,"uri":"/-/readiness","user_agent":"","written_bytes":26}
What is the expected correct behavior?
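Regardless of why a Patroni node goes down, it should not affect the high availability of the whole cluster. Configuring db_load_balancing should not reduce the cluster's availability.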
Relevant logs and/or screenshots
Output of checks
Results of GitLab environment info
(For installations with omnibus-gitlab package run and paste the output of: `sudo gitlab-rake gitlab:env:info`) (For installations from source run and paste the output of: `sudo -u git -H bundle exec rake gitlab:env:info RAILS_ENV=production`)
Results of GitLab application Check
(For installations with omnibus-gitlab package run and paste the output of: `sudo gitlab-rake gitlab:check SANITIZE=true`) (For installations from source run and paste the output of: `sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production SANITIZE=true`) (we will only investigate if the tests are passing)