GitLab KAS fails to connect to the K8s cluster when working in "CI/CD workflow" mode

Summary

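GitLab's current CD offering is based on KAS and has two working modes: the GitOps workflow and the CI/CD workflow.

The CI/CD workflow is a "push" model: it pushes local manifests to the target K8s cluster. The connection to the cluster is created and maintained by KAS. Specifically, KAS generates a KUBECONFIG environment variable that points at the target cluster, so that commands such as kubectl, helmfile, and helm can connect to the cluster and issue commands to it.

Taking the project jihulab/incubation-engineering/posthog as an example, the value of the KUBECONFIG environment variable is /builds/jihulab/incubation-engineering/posthog.tmp/KUBECONFIG, and the file's contents are as follows: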

apiVersion: v1
clusters:
- cluster:
    server: https://kas.jihulab.com/k8s-proxy
  name: gitlab
contexts:
- context:
    cluster: gitlab
    user: agent:364
  name: jihulab/incubation-engineering/posthog:posthog
current-context: jihulab/incubation-engineering/posthog:posthog
kind: Config
preferences: {}
users:
- name: agent:364
  user:
    token: ci:364:[MASKED]
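
To confirm which API endpoint a job's generated kubeconfig points at, the `server:` field can be read out directly. The following is a minimal sketch, not part of the original report: the function name `kubeconfig_server` is hypothetical, and it assumes the file has the simple single-cluster layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the first `server:` value found in a kubeconfig file.
# Assumes a single-cluster file laid out like the one shown above.
kubeconfig_server() {
  awk '$1 == "server:" { print $2; exit }' "$1"
}

# Only inspect the file when KUBECONFIG is set and points at a real file,
# as is expected inside a CI job connected through the agent.
if [ -n "${KUBECONFIG:-}" ] && [ -f "$KUBECONFIG" ]; then
  echo "kubeconfig server: $(kubeconfig_server "$KUBECONFIG")"
fi
```

For the file above, this would report the KAS proxy address rather than the cluster's own API server, which is why every kubectl and helm call depends on KAS being reachable.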

Steps to reproduce

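  1. Start a K8s cluster, for example on TKE
  2. In JiHu SaaS, create a project, then connect to and manage this K8s cluster through KAS (using the CI/CD workflow mode; for installation and configuration details, see the link above)
  3. Create a .gitlab-ci.yml file in the project with the following contents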
image: registry.gitlab.com/gitlab-org/cluster-integration/helm-install-image/releases/3.7.2-kube-1.21.5-alpine-3.15

stages:
  - init
  - destroy

variables:
  KUBE_CONTEXT: jihulab/incubation-engineering/posthog:posthog
  HELM_RELEASE_NAME: posthog

  
.kube-context:
  before_script:
    - if [ -n "$KUBE_CONTEXT" ]; then kubectl config use-context "$KUBE_CONTEXT"; fi

init:
  extends: [.kube-context]
  stage: init
  script:
    - |
      # Create the posthog namespace only if it does not already exist.
      ret=$(kubectl get ns posthog --ignore-not-found=true | wc -l)
      if [[ $ret -eq 0 ]]; then
        kubectl create ns posthog
      fi
 
destroy:
  extends: [.kube-context]
  stage: destroy
  script:
    - |
      echo "begin"
      helm  list -n posthog | awk '{print $1}'| grep ${HELM_RELEASE_NAME} | wc -l
      echo "step1 complete"
      ret=`helm  list -n posthog | awk '{print $1}'| grep ${HELM_RELEASE_NAME} | wc -l`

      if [[ $ret -ne 0 ]]; then
        helm  uninstall ${HELM_RELEASE_NAME} -n posthog
      else
        echo "${HELM_RELEASE_NAME} was not installed successfully; please delete it manually"
      fi
      
  when: manual
  4. Run the pipeline

What is the current bug behavior?

kubectl and helm commands fail at a high rate, frequently with the following error:

Error: Kubernetes cluster unreachable: Get "https://kas.jihulab.com/k8s-proxy/version?timeout=32s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
helm.go:88: [debug] Get "https://kas.jihulab.com/k8s-proxy/version?timeout=32s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Kubernetes cluster unreachable
helm.sh/helm/v3/pkg/kube.(*Client).IsReachable
	helm.sh/helm/v3/pkg/kube/client.go:121
helm.sh/helm/v3/pkg/action.(*History).Run
	helm.sh/helm/v3/pkg/action/history.go:48
main.newUpgradeCmd.func2
	helm.sh/helm/v3/cmd/helm/upgrade.go:102
github.com/spf13/cobra.(*Command).execute
	github.com/spf13/cobra@v1.2.1/command.go:856
github.com/spf13/cobra.(*Command).ExecuteC
	github.com/spf13/cobra@v1.2.1/command.go:974
github.com/spf13/cobra.(*Command).Execute
	github.com/spf13/cobra@v1.2.1/command.go:902
main.main
	helm.sh/helm/v3/cmd/helm/helm.go:87
runtime.main
	runtime/proc.go:225
runtime.goexit
	runtime/asm_amd64.s:1371

Many reruns are needed before a job succeeds. For example, the screenshot shows the result of the destroy job: [image]
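
Because reruns eventually succeed, one possible stopgap (not a fix for the underlying KAS connectivity problem) is to wrap the flaky commands in a retry helper. The sketch below is only an assumption of how that could look: the `retry` function and its parameters are hypothetical and are not part of GitLab, kubectl, or helm.

```shell
#!/bin/sh
# Hypothetical helper: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "command failed after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}

# Example use inside a job script (commented out because it needs a cluster):
# retry 5 10 kubectl get ns posthog --ignore-not-found=true
```

The helper uses only POSIX shell features, so it should also work in the Alpine-based image used by the pipeline above. It hides the symptom at the cost of longer job times and does not address the errors in the agent logs.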

It appears that after executing

echo "begin"
helm  list -n posthog | awk '{print $1}'| grep ${HELM_RELEASE_NAME} | wc -l

the job fails, and the subsequent echo "step1 complete" statement is never reached.
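
One plausible explanation for why the second `echo` is never reached, stated here as an assumption rather than a confirmed diagnosis: if the runner executes the script with `set -e` and `set -o pipefail`, a failing `helm list` makes the whole pipeline fail and aborts the script at that point. The sketch below illustrates the difference, using `false` as a stand-in for the failing helm call; the `demo` function is hypothetical.

```shell
#!/usr/bin/env bash
demo() {
  # $1: options passed to `set` in a fresh bash process, e.g. "-e" or "-eo pipefail".
  # `false` stands in for a `helm list` call that fails (for example, on a timeout).
  bash -c "
    set $1
    echo 'begin'
    false | wc -l > /dev/null
    echo 'step1 complete'
  " || true   # ignore the child's exit status; only its output matters here
}

echo "--- with set -e only ---"
demo "-e"
echo "--- with set -eo pipefail ---"
demo "-eo pipefail"
```

With `set -e` alone, the pipeline's status is that of `wc`, so both lines are printed. With `pipefail` added, the failure of the first command propagates and only `begin` is printed, which matches the behavior seen in the destroy job.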

What is the expected correct behavior?

Under normal circumstances, both init and destroy should run successfully.

Relevant logs and/or screenshots

At this point, the logs of the gitlab-agent deployed on the K8s cluster contain many errors:

{"level":"error","time":"2022-07-06T05:34:57.944Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T05:35:45.246Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"warn","time":"2022-07-06T05:36:18.708Z","msg":"GetConfiguration.Recv failed","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T05:36:51.055Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T05:38:30.472Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T05:39:08.604Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"warn","time":"2022-07-06T05:40:07.672Z","msg":"GetConfiguration.Recv failed","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T05:44:21.142Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T05:47:42.553Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"warn","time":"2022-07-06T05:48:01.748Z","msg":"GetConfiguration.Recv failed","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T05:51:29.669Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T05:53:12.581Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"Connect(): rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing failed to WebSocket dial: expected handshake response status code 101 but got 404\""}
{"level":"warn","time":"2022-07-06T05:56:43.647Z","msg":"GetConfiguration.Recv failed","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T06:00:52.981Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T06:01:28.049Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"warn","time":"2022-07-06T06:05:26.897Z","msg":"GetConfiguration.Recv failed","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T06:10:07.584Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T06:10:34.849Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"warn","time":"2022-07-06T06:11:10.023Z","msg":"GetConfiguration.Recv failed","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T06:15:40.962Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T06:17:58.401Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"warn","time":"2022-07-06T06:18:06.847Z","msg":"GetConfiguration.Recv failed","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T06:21:29.165Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T06:24:39.659Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"warn","time":"2022-07-06T06:25:16.972Z","msg":"GetConfiguration failed","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing failed to WebSocket dial: expected handshake response status code 101 but got 404\""}
{"level":"error","time":"2022-07-06T06:28:08.452Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T06:32:21.658Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"warn","time":"2022-07-06T06:32:51.445Z","msg":"GetConfiguration.Recv failed","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T06:33:28.235Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}

gitlab-agent was installed with helm, roughly as follows:

 helm  upgrade --install posthog gitlab/gitlab-agent \
    --namespace gitlab-agent \
    --create-namespace \
    --set image.tag=v15.1.0 \
    --set config.token=xxxxxxxxx \
    --set config.kasAddress=wss://kas.jihulab.com

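Impact and analysis

The KAS CI/CD workflow is practically unusable! With the current GitLab version, it is also not possible to fall back to the certificate-based mode.

In fact, the same problem also exists with the GitOps workflow: gitlab-agent logs a large number of similar errors. My guess is that, because GitOps is a pull model, a failed attempt is simply retried on the next cycle, so the user does not notice the failure itself. What the user does notice is that changes take a long time to take effect, which matches what we observed in testing.

Related issues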

https://gitlab.com/gitlab-org/cluster-integration/gitlab-agent/-/issues/138

https://gitlab.com/gitlab-org/cluster-integration/gitlab-agent/-/issues/255

Possible fixes