Adding documentation for geo failover with secondary runners

* Adds steps you may need to take to avoid potential load issues during failover

944f0bb2 · Ian Baum · GitLab · 00afb5aa · 944f0bb2 · 944f0bb2
--- a/doc/administration/geo/disaster_recovery/planned_failover.md
+++ b/doc/administration/geo/disaster_recovery/planned_failover.md
@@ -186,6 +186,10 @@ On the **primary** site:
   takes to finish syncing.
 1. Select **Add broadcast message**.
+### Runner failover
+If you have any runners connected to your current secondary, see [how to handle them](../secondary_proxy/runners.md#handling-a-planned-failover-with-secondary-runners) during the failover.
 ## Prevent updates to the **primary** site
 To ensure that all data is replicated to a secondary site, updates (write requests) need to

--- a/doc/administration/geo/secondary_proxy/runners.md
+++ b/doc/administration/geo/secondary_proxy/runners.md
@@ -43,3 +43,44 @@ Using separate secondary URLs, the runners should be:
 1. Registered with the secondary external URL.
 1. Configured with [`clone_url`](https://docs.gitlab.com/runner/configuration/advanced-configuration.html#how-clone_url-works) set to the `external_url` of the secondary instance.
+## Handling a planned failover with secondary runners
+When executing [a planned failover](../disaster_recovery/planned_failover.md), secondary runners try to keep talking to their local instance. This can reduce runner capacity during the failover, so you might need to plan for it.
+### With Location Aware public URL
+When using the [Location Aware public URL](location_aware_external_url.md), all runners automatically connect to the closest Geo site.
+When failing over to a new primary:
+- While the old primary is still in the DNS record, any runners previously connected to your old primary still attempt to pick up jobs from it. If it is unreachable, the runners [detect this](https://docs.gitlab.com/runner/configuration/advanced-configuration.html#how-unhealthy_requests_limit-and-unhealthy_interval-works) and stop requesting jobs. This pause can last for an extended period of time after the instance returns.
+- If you have [multiple secondary nodes](../disaster_recovery/index.md#promoting-secondary-geo-replica-in-multi-secondary-configurations), after the initial failover the remaining secondaries are in an unhealthy state until they are [replicated](../disaster_recovery/index.md#step-2-initiate-the-replication-process) with the new primary. The runners attached to them are then unable to check in, and their health check also kicks in.
+- If you remove any of the unhealthy nodes from the Geo DNS entry, the runners pick the next closest instance. Depending on your architecture, this may not be what you want, as you could overwhelm your site in its reduced state.
+To alleviate any of these issues, you can [pause](#pausing-runners) or shut down some of the runners until the site is back to full capacity.
+If you are not concerned about these issues, there is nothing to do here.
+### With separate URLs
+- If you are returning the old primary to service, you can pause the old primary's runners until that site is back online. This prevents the health check from kicking in.
+- If the old primary is not returning, or you want to avoid temporarily reduced runner capacity, the primary runners should be reconfigured to connect to the new primary.
+- If multiple secondaries are being used, the runners attached to the remaining secondaries should be [paused](#pausing-runners), shut down, or reconfigured to connect to the new primary while those secondaries are being replicated to the new primary.
+### Pausing runners
+You must have administrator access to use any of the following methods:
+- Through the Admin Area:
+  1. On the left sidebar, at the bottom, select **Admin Area**.
+  1. Select **Settings > Runners**.
+  1. Identify the runners you would like to pause.
+  1. Select the `pause` button next to each runner you would like to pause.
+  1. After the failover is complete, unpause the runners you paused in the previous step.
+- Use the [Runners API](../../../api/runners.md):
+  1. Fetch or create a [Personal Access Token](../../../user/profile/personal_access_tokens.md) with administrator access.
+  1. Get the list of runners. You can filter the list [using the API](../../../api/runners.md#list-all-runners).
+  1. Identify the runners you would like to pause, and make note of their `id`.
+  1. [Follow the API documentation](../../../api/runners.md#pause-a-runner) to pause each runner.
+  1. After the failover is complete, unpause the list of runners using the API by setting `paused=false`.
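
The Runners API steps in the "Pausing runners" section could be scripted roughly as follows. This is a minimal sketch, not part of the change itself: it assumes the runners to pause share a tag (the `geo-secondary` tag, the `gitlab.example.com` host, and all helper names are hypothetical), and it only reads the first page of results.

```python
import json
import urllib.parse
import urllib.request

API_PREFIX = "/api/v4"


def list_runners_request(base_url, token, **filters):
    """Build a GET request for the admin-only 'list all runners' endpoint."""
    url = f"{base_url.rstrip('/')}{API_PREFIX}/runners/all"
    query = urllib.parse.urlencode(filters)
    if query:
        url += f"?{query}"
    return urllib.request.Request(
        url, headers={"PRIVATE-TOKEN": token}, method="GET"
    )


def set_paused_request(base_url, token, runner_id, paused):
    """Build a PUT request that sets the `paused` attribute of one runner."""
    url = f"{base_url.rstrip('/')}{API_PREFIX}/runners/{runner_id}"
    body = urllib.parse.urlencode({"paused": "true" if paused else "false"})
    return urllib.request.Request(
        url, data=body.encode(), headers={"PRIVATE-TOKEN": token}, method="PUT"
    )


def send(request):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def set_paused_for_tag(base_url, token, tag, paused):
    """Pause or unpause every runner carrying `tag`; return the affected ids.

    Only the first page (up to 100 runners) is read. A larger fleet would
    need to follow the API's pagination.
    """
    runners = send(
        list_runners_request(base_url, token, tag_list=tag, per_page=100)
    )
    runner_ids = [runner["id"] for runner in runners]
    for runner_id in runner_ids:
        send(set_paused_request(base_url, token, runner_id, paused))
    return runner_ids
```

Before the failover, a call such as `set_paused_for_tag("https://gitlab.example.com", token, "geo-secondary", True)` would pause the matching runners. Keeping the returned ids makes it straightforward to send `paused=false` to exactly the same runners after the failover is complete.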