GitLab KAS fails to connect to the K8s cluster when running in "CI/CD workflow" mode
Summary
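GitLab's current CD is built on KAS and has two working modes: GitOps workflow and CI/CD workflow.
The CI/CD workflow is a "push" model: it pushes local manifests to the target k8s cluster. KAS creates and maintains the connection to k8s. Specifically, it generates a KUBECONFIG environment variable that points at the target cluster, so that commands such as kubectl, helmfile, and helm can connect to k8s and issue commands to it.
Taking the jihulab/incubation-engineering/posthog project as an example, the KUBECONFIG environment variable has the value /builds/jihulab/incubation-engineering/posthog.tmp/KUBECONFIG, and the file contains: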
apiVersion: v1
clusters:
- cluster:
    server: https://kas.jihulab.com/k8s-proxy
  name: gitlab
contexts:
- context:
    cluster: gitlab
    user: agent:364
  name: jihulab/incubation-engineering/posthog:posthog
current-context: jihulab/incubation-engineering/posthog:posthog
kind: Config
preferences: {}
users:
- name: agent:364
  user:
    token: ci:364:[MASKED]
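To confirm which endpoint a job is actually pointed at, the server entry can be read back from the generated file. The following is a minimal sketch: `kubeconfig_servers` is a hypothetical helper name, and it uses plain `awk` text matching rather than a real YAML parser, so it assumes each `server:` key sits on its own line as in the file above.

```shell
# Hypothetical helper (not part of GitLab or kubectl): print every API server
# URL recorded in a kubeconfig file. Defaults to the file named by $KUBECONFIG.
kubeconfig_servers() {
  # awk splits on whitespace, so the YAML indentation before "server:" is ignored.
  awk '$1 == "server:" { print $2 }' "${1:-${KUBECONFIG:-/dev/null}}"
}

# Inside a CI job, show where kubectl and helm will send their requests.
if [ -n "${KUBECONFIG:-}" ] && [ -r "$KUBECONFIG" ]; then
  kubeconfig_servers
fi
```

Where kubectl is available, `kubectl config view --minify` shows the same information for the current context through kubectl itself.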
Steps to reproduce
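- Start a K8s cluster, for example TKE
- Create a project in JiHu SaaS, then connect to and manage this k8s cluster through KAS (using the CI/CD workflow mode; see the link above for installation and configuration details)
- Create a .gitlab-ci.yml file in the project with the following content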
image: registry.gitlab.com/gitlab-org/cluster-integration/helm-install-image/releases/3.7.2-kube-1.21.5-alpine-3.15
stages:
  - init
  - destroy
variables:
  KUBE_CONTEXT: jihulab/incubation-engineering/posthog:posthog
  HELM_RELEASE_NAME: posthog
.kube-context:
  before_script:
    - if [ -n "$KUBE_CONTEXT" ]; then kubectl config use-context "$KUBE_CONTEXT"; fi
init:
  extends: [.kube-context]
  stage: init
  script:
    - |
      ret=`kubectl get ns posthog --ignore-not-found=true | wc -l`
      if [[ $ret -eq 0 ]]; then
        kubectl create ns posthog
      fi
destroy:
  extends: [.kube-context]
  stage: destroy
  script:
    - |
      echo "begin"
      helm list -n posthog | awk '{print $1}'| grep ${HELM_RELEASE_NAME} | wc -l
      echo "step1 complete"
      ret=`helm list -n posthog | awk '{print $1}'| grep ${HELM_RELEASE_NAME} | wc -l`
      if [[ $ret -ne 0 ]]; then
        helm uninstall ${HELM_RELEASE_NAME} -n posthog
      else
        echo "${HELM_RELEASE_NAME} is not installed successfully, please delete it manual "
      fi
  when: manual
- Run the pipeline
What is the current bug behavior?
The kubectl and helm commands fail very often, frequently with errors like the following:
Error: Kubernetes cluster unreachable: Get "https://kas.jihulab.com/k8s-proxy/version?timeout=32s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
helm.go:88: [debug] Get "https://kas.jihulab.com/k8s-proxy/version?timeout=32s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Kubernetes cluster unreachable
helm.sh/helm/v3/pkg/kube.(*Client).IsReachable
helm.sh/helm/v3/pkg/kube/client.go:121
helm.sh/helm/v3/pkg/action.(*History).Run
helm.sh/helm/v3/pkg/action/history.go:48
main.newUpgradeCmd.func2
helm.sh/helm/v3/cmd/helm/upgrade.go:102
github.com/spf13/cobra.(*Command).execute
github.com/spf13/cobra@v1.2.1/command.go:856
github.com/spf13/cobra.(*Command).ExecuteC
github.com/spf13/cobra@v1.2.1/command.go:974
github.com/spf13/cobra.(*Command).Execute
github.com/spf13/cobra@v1.2.1/command.go:902
main.main
helm.sh/helm/v3/cmd/helm/helm.go:87
runtime.main
runtime/proc.go:225
runtime.goexit
runtime/asm_amd64.s:1371
The command has to be re-run many times before it succeeds. For example, the result of the destroy job here is shown in the screenshot below.

The job appears to fail right after executing
echo "begin"
helm list -n posthog | awk '{print $1}'| grep ${HELM_RELEASE_NAME} | wc -l
and the following echo "step1 complete" statement is never reached.
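One plausible explanation, assuming the job script runs with `errexit` and `pipefail` enabled (GitLab Runner's generated scripts normally do this, but that is an assumption about this runner, not something shown in the job log): the failing `helm list` makes the whole pipeline return non-zero, and the shell aborts before the next `echo`. The sketch below reproduces that behaviour with a stand-in function, `fake_helm_list`, in place of the real `helm` call.

```shell
# Stand-in for the job script. fake_helm_list is a hypothetical placeholder that
# fails the way `helm list` does when the cluster is unreachable.
job_script='
set -eo pipefail
fake_helm_list() { echo "Error: Kubernetes cluster unreachable" >&2; return 1; }
echo "begin"
fake_helm_list | awk "{print \$1}" | wc -l
echo "step1 complete"
'

# Run it in a child bash so the abort does not terminate the current shell.
status=0
output=$(bash -c "$job_script" 2>/dev/null) || status=$?
printf '%s\n' "$output"
echo "exit status: $status"
```

Under these settings the child script prints `begin` and the line count, then exits non-zero without reaching the second `echo`, which matches the symptom described above.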
What is the expected correct behavior?
Normally, both init and destroy should execute correctly.
Relevant logs and/or screenshots
At this point, the gitlab-agent deployed on k8s logs many errors:
{"level":"error","time":"2022-07-06T05:34:57.944Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T05:35:45.246Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"warn","time":"2022-07-06T05:36:18.708Z","msg":"GetConfiguration.Recv failed","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T05:36:51.055Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T05:38:30.472Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T05:39:08.604Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"warn","time":"2022-07-06T05:40:07.672Z","msg":"GetConfiguration.Recv failed","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T05:44:21.142Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T05:47:42.553Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"warn","time":"2022-07-06T05:48:01.748Z","msg":"GetConfiguration.Recv failed","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T05:51:29.669Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T05:53:12.581Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"Connect(): rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing failed to WebSocket dial: expected handshake response status code 101 but got 404\""}
{"level":"warn","time":"2022-07-06T05:56:43.647Z","msg":"GetConfiguration.Recv failed","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T06:00:52.981Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T06:01:28.049Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"warn","time":"2022-07-06T06:05:26.897Z","msg":"GetConfiguration.Recv failed","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T06:10:07.584Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T06:10:34.849Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"warn","time":"2022-07-06T06:11:10.023Z","msg":"GetConfiguration.Recv failed","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T06:15:40.962Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T06:17:58.401Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"warn","time":"2022-07-06T06:18:06.847Z","msg":"GetConfiguration.Recv failed","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T06:21:29.165Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T06:24:39.659Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"warn","time":"2022-07-06T06:25:16.972Z","msg":"GetConfiguration failed","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing failed to WebSocket dial: expected handshake response status code 101 but got 404\""}
{"level":"error","time":"2022-07-06T06:28:08.452Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T06:32:21.658Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"warn","time":"2022-07-06T06:32:51.445Z","msg":"GetConfiguration.Recv failed","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
{"level":"error","time":"2022-07-06T06:33:28.235Z","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"rpc error: code = Unavailable desc = error reading from server: failed to get reader: failed to read frame header: EOF"}
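To gauge how often each kind of failure occurs, log lines like those above can be tallied by their `msg` field. This is a rough sketch: `tally_agent_msgs` is a hypothetical helper name, and it relies on `sed` text matching rather than a JSON parser, so it assumes the one-object-per-line format shown in the log.

```shell
# Hypothetical helper: read agent log lines (one JSON object per line) on stdin
# and print "<count><TAB><msg>" for each distinct "msg" value, most frequent first.
tally_agent_msgs() {
  # Extract the msg string, count identical values, then sort by count.
  sed -n 's/.*"msg":"\([^"]*\)".*/\1/p' |
    awk '{ count[$0]++ } END { for (m in count) print count[m] "\t" m }' |
    sort -rn
}

# Example use (commented out: needs cluster access; <agent-pod> is a placeholder):
# kubectl logs -n gitlab-agent <agent-pod> | tally_agent_msgs
```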
gitlab-agent was installed with helm, roughly as follows:
helm upgrade --install posthog gitlab/gitlab-agent \
--namespace gitlab-agent \
--create-namespace \
--set image.tag=v15.1.0 \
--set config.token=xxxxxxxxx \
--set config.kasAddress=wss://kas.jihulab.com
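Impact and analysis
The KAS CI/CD workflow is almost unusable! And the current GitLab version cannot go back to the certificate-based mode.
In fact, the same problem exists with the GitOps workflow: gitlab-agent logs a large number of similar errors. My guess is that, because GitOps is a pull model, a failed attempt is simply retried on the next deployment attempt, so the user does not notice the failure itself. What the user does notice is that changes take a long time to take effect, which matches what we see in testing.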
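Because a push-mode job has no such built-in second attempt, one stopgap while the tunnel is unstable is to retry the failing command inside the job. This is a sketch of a workaround, not a fix for the underlying connection errors; `retry`, its attempt count, and its delay are hypothetical names and values chosen for illustration.

```shell
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between attempts, and return the exit status of the last attempt.
retry() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while true; do
    "$@" && return 0
    status=$?   # exit status of the failed command
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return "$status"
    fi
    echo "attempt $attempt failed (exit $status), retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}

# Example use in the destroy job's script (commented out: needs a cluster):
# retry 5 10 helm list -n posthog
```

GitLab CI also offers a job-level `retry` keyword, which reruns the whole job rather than a single command.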
### Related issues
https://gitlab.com/gitlab-org/cluster-integration/gitlab-agent/-/issues/138
https://gitlab.com/gitlab-org/cluster-integration/gitlab-agent/-/issues/255