Geo database replication
Background
The current SaaS environment backup is based on the following policy:
- Full backup once a day
- Archiving WALs once a minute
With WAL archiving in place, the currently achievable RPO is under 1 minute: the PostgreSQL cluster loses at most 1 minute of data after recovering from downtime.
However, if the PostgreSQL cluster fails and has to be restored from backup, the recovery can take a very long time because of the large volume of data in the cluster. For example, gitlab.com lost its entire PostgreSQL cluster due to a mistake by the DBA, and the full recovery from backup took 2 hours. The RTO of the current backup solution could therefore be more than 1 hour.
We need a backup recovery solution with a shorter RTO.
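To make the RTO concern concrete, the following is a rough back-of-the-envelope sketch, not a measurement. The function name `estimate_restore_minutes` and all figures passed to it are hypothetical; it assumes restore time is dominated by copying the full backup at a steady throughput, plus some time to replay archived WALs.

```python
def estimate_restore_minutes(data_gib, restore_mib_per_s, wal_replay_minutes=0.0):
    """Rough restore-time estimate for recovering from a full backup.

    data_gib           -- size of the full backup in GiB (hypothetical input)
    restore_mib_per_s  -- sustained restore throughput in MiB/s (hypothetical input)
    wal_replay_minutes -- extra time spent replaying archived WALs (assumption)
    """
    if restore_mib_per_s <= 0:
        raise ValueError("restore throughput must be positive")
    # Convert GiB to MiB, divide by throughput to get seconds, then minutes.
    copy_seconds = data_gib * 1024 / restore_mib_per_s
    return copy_seconds / 60 + wal_replay_minutes


if __name__ == "__main__":
    # Illustrative numbers only: a 2 TiB backup restored at 300 MiB/s.
    minutes = estimate_restore_minutes(2048, 300)
    print(f"estimated restore time: {minutes:.0f} minutes")
```

Under these assumed numbers the copy alone takes well over an hour, which is why restoring from backup cannot meet an RTO of a few minutes regardless of how small the RPO is.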
Proposal
By building a PostgreSQL cluster with Geo streaming replication, we aim for the following goals:
- RPO < 1 min
- RTO < 10 min
The specific steps are:
- Build a Geo database replication cluster based on Patroni. It is a read-only cluster that receives data through streaming replication from the primary site cluster. Because of the communication delay between regions, replication should be asynchronous, but the replication delay should be monitored and should not exceed 1 minute.
- Before GitLab Geo is online, the Geo database replication cluster is a cold standby and will be smaller than the primary cluster. It should, however, be able to scale to the same size as the primary PostgreSQL cluster within 10 minutes and take all the traffic from the primary site.
- After GitLab Geo is online, the cluster can be used as a read-only database for the Geo site, and its specifications can be expanded accordingly.
- Failover is performed by switching the DNS records of the primary and Geo database replication clusters. The SRE performs the switch manually, so the SRE must have the ability to adjust the DNS configuration. Before the DNS switch, make sure that the Geo database replication cluster is able to take all the traffic from the primary site.
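The lag monitoring and the pre-switch check described in the steps above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the proposed implementation: `ReplicaStatus`, `lag_within_rpo`, and `ready_for_dns_switch` are hypothetical names, and "same size as the primary" is modelled simply as a node count. The only PostgreSQL-specific piece is the built-in function `pg_last_xact_replay_timestamp()`, which would be queried on the Geo replica.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Query that could be run on the Geo replica to measure replay delay in seconds.
# pg_last_xact_replay_timestamp() is a built-in PostgreSQL function. Caveat: when
# the primary is idle, this value keeps growing even though no data is missing.
REPLAY_LAG_SQL = "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())"

MAX_LAG_SECONDS = 60.0  # derived from the RPO < 1 min goal


@dataclass
class ReplicaStatus:
    """Hypothetical snapshot of the Geo cluster, filled in by monitoring."""
    lag_seconds: Optional[float]  # None if the lag could not be measured
    replica_nodes: int            # current size of the Geo cluster (assumed metric)
    primary_nodes: int            # current size of the primary cluster


def lag_within_rpo(lag_seconds: Optional[float],
                   max_lag: float = MAX_LAG_SECONDS) -> bool:
    """Return True if the measured replication lag satisfies the RPO goal."""
    return lag_seconds is not None and lag_seconds <= max_lag


def ready_for_dns_switch(status: ReplicaStatus) -> Tuple[bool, List[str]]:
    """Check the preconditions an SRE would verify before switching DNS."""
    reasons = []
    if not lag_within_rpo(status.lag_seconds):
        reasons.append("replication lag is unknown or above 1 minute")
    if status.replica_nodes < status.primary_nodes:
        reasons.append("Geo cluster has not been scaled to the primary's size")
    return (not reasons, reasons)


if __name__ == "__main__":
    ok, reasons = ready_for_dns_switch(ReplicaStatus(12.0, 1, 3))
    print("ready" if ok else "not ready: " + "; ".join(reasons))
```

Keeping the check as a pure function over a status snapshot means it can be wired to whatever monitoring source is eventually chosen, and the same function can serve both as an alert condition and as a manual pre-switch checklist.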
FYI: