Failure testing and drill practice in SaaS
Background
The issue mentions that we need to conduct chaos engineering: the practice of testing a distributed computing system to ensure that it can withstand unexpected disruptions, with a focus on random and unpredictable behavior.
Before we automate chaos engineering, we first need a basic understanding of the GitLab distributed system. We can start with basic failure testing (deliberately breaking components) to find possible weaknesses, to verify that both the system and the people operating it can cope with unexpected problems in a real, complex environment, and to improve the resilience of the system. As a precondition for chaos engineering, we therefore need to first define the basic failure testing scenarios, recovery plans, and backup recovery plans. Later, we will conduct fully automated, randomized chaos experiments.
Failure testing process
Before the failure testing
- Build the scenario and select the assumptions for this experiment, such as:
  - The business will not be affected when a downstream service hangs.
  - The business will not be affected when a pod is killed suddenly.
  - Failures can be recovered from within 10 minutes using the recovery plan.
  - When a core downstream dependency hangs, the downgrade solution is effective and has acceptable side effects.
- Write failure recovery plans for specific failure scenarios
- Define the expected impact
- Set up monitoring and alerting metrics and establish SLIs
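
The preparation checklist above can be captured as a small record per experiment, so that a test is not started with a missing recovery plan or no alerts defined. The following is a minimal sketch, not an existing tool: the class names, field names, and threshold values are all hypothetical, and the 10-minute default simply mirrors the recovery assumption listed above.

```python
from dataclasses import dataclass, field


@dataclass
class ExpectedImpact:
    """Upper bounds the SLIs may reach during the test (hypothetical metrics)."""
    max_error_rate: float       # fraction of failed requests, e.g. 0.01 = 1%
    max_p99_latency_ms: float   # 99th percentile latency in milliseconds


@dataclass
class FailureExperiment:
    """One failure test, described before it is executed."""
    name: str
    hypothesis: str             # the assumption being tested
    recovery_plan: str          # link to or text of the recovery plan
    expected_impact: ExpectedImpact
    max_recovery_minutes: int = 10
    alerts_expected: list[str] = field(default_factory=list)

    def missing_items(self) -> list[str]:
        """Return the names of checklist items that are still empty."""
        missing = []
        if not self.hypothesis.strip():
            missing.append("hypothesis")
        if not self.recovery_plan.strip():
            missing.append("recovery_plan")
        if not self.alerts_expected:
            missing.append("alerts_expected")
        return missing


# Example: a pod-kill experiment that is not yet ready to run.
pod_kill = FailureExperiment(
    name="pod-kill",
    hypothesis="The business will not be affected when a pod is killed suddenly.",
    recovery_plan="",
    expected_impact=ExpectedImpact(max_error_rate=0.01, max_p99_latency_ms=800.0),
)
print(pod_kill.missing_items())
```

Here `missing_items()` reports `recovery_plan` and `alerts_expected`, so the experiment would be held back until both are filled in.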
During the failure testing
- Monitor the relevant metrics to confirm that they stay within the expected impact range, and check that alerts fire as intended
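
As an illustration of the "within the expected impact range" check, the sketch below compares observed metric values against predefined limits. It assumes the values have already been collected from the monitoring system; the metric names and limits are hypothetical. A metric that has a limit but no observed value is also reported, on the assumption that missing data during a test is itself worth investigating.

```python
from typing import Optional


def impact_violations(
    observed: dict[str, float],
    limits: dict[str, float],
) -> dict[str, tuple[Optional[float], float]]:
    """Return {metric: (observed_value, limit)} for every metric that
    exceeds its limit or has no observed value at all."""
    violations: dict[str, tuple[Optional[float], float]] = {}
    for metric, limit in limits.items():
        value = observed.get(metric)
        if value is None or value > limit:
            violations[metric] = (value, limit)
    return violations


# Hypothetical limits defined before the test and values sampled during it.
limits = {"error_rate": 0.01, "p99_latency_ms": 800.0}
observed = {"error_rate": 0.004, "p99_latency_ms": 1250.0}

result = impact_violations(observed, limits)
print(result)
```

In this example only `p99_latency_ms` is reported, because 1250.0 exceeds its limit of 800.0 while the error rate stays below 0.01.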
After the failure testing
- Identify system vulnerabilities and the improvements needed to address them
- Evaluate the availability of monitoring alarms and clean up invalid alarms
- Evaluate failure recovery plans
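
The review steps above can be sketched as two small helpers: one that sorts alarms into effective, missed, and unexpected, and one that compares the measured recovery time with the target. The alert names are hypothetical, and the 10-minute default reflects the recovery assumption from the preparation phase rather than a fixed requirement.

```python
def review_alerts(expected: set[str], fired: set[str]) -> dict[str, set[str]]:
    """Classify alerts after a failure test.

    effective:  expected and fired
    missed:     expected but did not fire (a monitoring gap)
    unexpected: fired but not expected (candidates for invalid alarms)
    """
    return {
        "effective": expected & fired,
        "missed": expected - fired,
        "unexpected": fired - expected,
    }


def recovery_within_target(recovery_minutes: float, target_minutes: float = 10) -> bool:
    """Return True if the recovery plan restored service within the target."""
    return recovery_minutes <= target_minutes


# Hypothetical alert names recorded for one test run.
expected_alerts = {"PodCrashLooping", "HighErrorRate"}
fired_alerts = {"HighErrorRate", "DiskAlmostFull"}

review = review_alerts(expected_alerts, fired_alerts)
print(review)
print(recovery_within_target(14.5))
```

In this run `PodCrashLooping` was missed and `DiskAlmostFull` fired unexpectedly, so both would be reviewed; a recovery that took 14.5 minutes also misses the 10-minute target, which suggests the recovery plan needs revising.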
Failure testing scenarios
- Middleware
- Kubernetes (k8s)
- Load balancing
- CDN