Failure testing and drill practice in SaaS

Background

The issue mentions that we need to conduct chaos engineering: the practice of experimenting on a distributed system to confirm that it can withstand unexpected disruptions, with a focus on random and unpredictable behavior.

Before carrying out automated chaos engineering, however, we need a basic understanding of the GitLab distributed system. We can start with basic failure testing (deliberately breaking components) to find the likely weaknesses of the system, to verify that the system and the people operating it can cope with unexpected problems in a real, complex environment, and to improve the system's resilience. As a precondition for chaos engineering, we therefore first need to sort out the basic failure testing scenarios, recovery plans, and backup recovery plans. Later, we will conduct fully automated, randomized chaos experiments.

Failure testing process

Before the failure testing

Build the scenario and select the assumptions for this experiment, such as:
- The business is not affected when a downstream service hangs.
- The business is not affected when a pod is killed suddenly.
- Failures can be recovered within 10 minutes by following the recovery plan.
- When a core downstream dependency hangs, the degradation (fallback) solution takes effect and its side effects are acceptable.
Write failure recovery plans for specific failure scenarios
Define the expected impact
Set up monitoring and alerting metrics and establish SLIs
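
The preparation steps above could be captured as a small checklist so that a test does not start with items missing. This is only a minimal sketch: the `FailureTestPlan` class, its field names, and the example values are hypothetical and not taken from any existing GitLab tooling.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FailureTestPlan:
    """Preparation checklist for one failure test (all field names are hypothetical)."""
    scenario: str                  # what is deliberately broken
    hypothesis: str                # the assumption being tested
    recovery_plan: str             # where the written recovery steps live
    recovery_target_minutes: int   # e.g. 10, as in the assumptions above
    expected_impact: str           # impact the team agrees to tolerate
    slis: List[str] = field(default_factory=list)  # indicators watched during the test

    def missing_items(self) -> List[str]:
        """Return the preparation items that have not been filled in yet."""
        missing = []
        for name in ("scenario", "hypothesis", "recovery_plan", "expected_impact"):
            if not getattr(self, name).strip():
                missing.append(name)
        if self.recovery_target_minutes <= 0:
            missing.append("recovery_target_minutes")
        if not self.slis:
            missing.append("slis")
        return missing


plan = FailureTestPlan(
    scenario="Kill one application pod",
    hypothesis="The business is not affected when a pod is killed suddenly",
    recovery_plan="",  # not written yet
    recovery_target_minutes=10,
    expected_impact="Brief rise in latency, no failed requests",
    slis=["request error rate", "p99 latency"],
)
print(plan.missing_items())  # prints ['recovery_plan']
```

A test would only go ahead once `missing_items()` returns an empty list.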

During the failure testing

Monitor the relevant metrics to determine whether they stay within the expected impact range, and check that the alerts fire as intended.
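
One possible way to make this check explicit is sketched below. It assumes the observed metrics and the fired alerts have already been collected from whatever monitoring system is in use; the function name, metric names, limits, and alert names are all illustrative assumptions.

```python
from typing import Dict, Set


def evaluate_observation(
    observed: Dict[str, float],
    limits: Dict[str, float],
    expected_alerts: Set[str],
    fired_alerts: Set[str],
) -> Dict[str, object]:
    """Compare observed metrics and alerts against what was planned."""
    # A metric breaches when it exceeds its agreed upper limit.
    breaches = {
        name: value
        for name, value in observed.items()
        if name in limits and value > limits[name]
    }
    return {
        "within_expected_impact": not breaches,
        "breaches": breaches,
        # Alerts that should have fired but did not.
        "missed_alerts": sorted(expected_alerts - fired_alerts),
        # Alerts that fired although the scenario should not trigger them.
        "unexpected_alerts": sorted(fired_alerts - expected_alerts),
    }


# Hypothetical values for a single test run.
result = evaluate_observation(
    observed={"error_rate": 0.03, "p99_latency_ms": 420.0},
    limits={"error_rate": 0.01, "p99_latency_ms": 500.0},
    expected_alerts={"HighErrorRate"},
    fired_alerts={"HighErrorRate", "DiskAlmostFull"},
)
print(result["within_expected_impact"])  # prints False
```

Here the error rate exceeds its limit, so the run falls outside the expected impact, and `DiskAlmostFull` is flagged as an alert to review afterwards.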

After the failure testing

Identify system weaknesses and derive improvements
Evaluate the effectiveness of monitoring alerts and weed out invalid alerts
Evaluate failure recovery plans
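
A recovery plan can be evaluated in part by comparing the measured recovery time with the target agreed before the test (10 minutes in the example assumption). The sketch below assumes the injection and recovery timestamps were recorded by hand or by tooling; the function name and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def recovery_within_target(
    injected_at: datetime,
    recovered_at: datetime,
    target: timedelta = timedelta(minutes=10),
) -> bool:
    """Return True if the service recovered within the target duration."""
    if recovered_at < injected_at:
        raise ValueError("recovery time is earlier than injection time")
    return recovered_at - injected_at <= target


# Hypothetical timestamps: recovery took 8 minutes.
start = datetime(2021, 9, 6, 12, 0)
print(recovery_within_target(start, start + timedelta(minutes=8)))  # prints True
```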

Failure testing scenarios

Middleware
Kubernetes (k8s)
Load balancing
CDN

Edited by Jiaxin Qi on Sep 06, 2021