Several tables have already been split in this way. For example:
- `search_data` is split from `issues`.
- `project_pages_metadata` is split from `projects`.
- `merge_request_diff_details` is split from `merge_request_diffs`.

## Data model trade-offs
Certain tables, like `users`, `namespaces`, and `projects`, can get very wide.
These tables are usually central to the application, and used very often.
Why is this a problem?
- Many of these columns are included in indexes, which leads to index write amplification.
When the number of indexes on the table is more than 16, it affects query planning,
and may lead to [light-weight lock (LWLock) contention](https://gitlab.com/groups/gitlab-org/-/epics/11543).
- Updates in PostgreSQL are implemented as a combination of delete and insert. This means that each column,
even a rarely used one, is copied again on every update. This increases the amount of generated
write-ahead log (WAL).
- When there is a column that is frequently updated, each update results in all table columns
being copied. Again, this increases the amount of generated WAL and creates more work for
autovacuum.
- PostgreSQL stores data as rows (tuples) in pages. Wide rows reduce the number of tuples per page,
which affects read performance.
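
For illustration, you can check how many indexes a wide table carries from a Rails console. This is a minimal sketch; the exact numbers depend on the instance and schema version.

```ruby
# Minimal sketch, run in a Rails console: count the indexes on a wide table.
# A non-HOT UPDATE (for example, one that changes any indexed column) must
# write a new entry into every one of these indexes, which is where index
# write amplification comes from.
indexes = ActiveRecord::Base.connection.indexes(:users)

puts "users has #{indexes.size} indexes"
indexes.each { |index| puts index.name }
```
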
A possible solution to this problem is to keep only the most important columns on the main table,
and extract the rest into separate tables that have a one-to-one relationship with the main table.
Good candidates for extraction are columns that are either updated very frequently, for example `last_activity_at`,
or columns that are rarely updated or used, like activation tokens.
The trade-off that comes with such extraction is that index-only scans are no longer possible.
Instead, the application must either join to the new table or execute an additional query. The performance impacts
of this should be weighed against the benefits of the vertical table split.
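
As a minimal sketch of what such a split looks like in the application (the model and column names below are hypothetical, not the actual GitLab schema), the extracted table ends up in a one-to-one relationship with the main table, and reads of the extracted columns need a join or an extra query:

```ruby
# Hypothetical models illustrating a vertical split; the real GitLab
# schema, naming, and validations differ.
class User < ApplicationRecord
  has_one :user_detail # the extracted one-to-one table
end

class UserDetail < ApplicationRecord
  belongs_to :user # user_details.user_id is the unique foreign key
end

# An index-only scan on `users` can no longer cover the extracted columns.
# The application either joins to the new table...
User.joins(:user_detail).where(user_details: { otp_secret_expires_at: nil })

# ...or issues an additional query per record:
User.first&.user_detail&.otp_secret_expires_at
```
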
This topic is covered in more depth on the [PostgresFM](https://postgres.fm) podcast,
where @NikolayS of [PostgresAI](https://postgres.ai/) and @michristofides of [PgMustard](https://www.pgmustard.com/)
discuss [data model trade-offs](https://postgres.fm/episodes/data-model-trade-offs).
### Example
Let's look at the `users` table, which at the time of writing has 75 columns.
We can see a few groups of columns that match the criteria above and are good candidates
for extraction:
- OTP-related columns, like `encrypted_otp_secret` and `otp_secret_expires_at`.
There are a few of these columns, and once populated they should not be updated often (if at all).
- Columns related to email confirmation - `confirmation_token`, `confirmation_sent_at`,
and `confirmed_at`. Once populated, these are most likely never updated.
- Timestamps like `password_expires_at`, `last_credential_check_at`, and `admin_email_unsubscribed_at`.
Such columns are either updated very often or not at all, so they are better kept in a separate table.
- Various tokens (and columns related to them), like `unlock_token`, `incoming_email_token`, and `feed_token`.
Let's focus on `users.incoming_email_token` - every user on GitLab.com has one set, and this token is rarely updated.
To extract it from `users` into a new table, we have to do the following:
1. Release M [example](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/141561):
   - Create the new table (a sketch follows this list).
   - Update the application to read from the new table, falling back to the original column when there is no data yet.
   - Start backfilling the new table.
1. Release N [example](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/141833):
   - Finalize the background migration that performs the backfill. This should be done in the next release *after* a [required stop](../../update/index.md#required-upgrade-stops).
1. Release N + 1 [example](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/141835):
   - Update the application to read and write from the new table only.
   - Ignore the original column (see the second sketch after this list). This starts the process of safely removing database columns, as described in our [guides](avoiding_downtime_in_migrations.md#dropping-columns).
1. Release N + 2 [example](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/142086):
   - Drop the original column.
1. Release N + 3 [example](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/142087):
   - Remove the ignore rule for the original column.
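
For illustration, the Release M changes might look roughly like the sketch below. This is only a sketch: the table name, milestone, and the omitted pieces (text limit constraint, foreign key, the backfill migration itself) are assumptions, so refer to the linked merge requests for the real implementation.

```ruby
# Release M, schema side (illustrative only).
class CreateUserIncomingEmailTokens < Gitlab::Database::Migration[2.2]
  milestone '16.9' # placeholder

  def change
    create_table :user_incoming_email_tokens, id: false do |t|
      # users.id doubles as the primary key: a strict one-to-one split.
      t.bigint :user_id, primary_key: true
      t.timestamps_with_timezone null: false
      # A text limit constraint and a foreign key to users are also
      # required by GitLab conventions; omitted here for brevity.
      t.text :incoming_email_token, null: false
    end
  end
end

# Release M, application side: read from the new table and fall back to
# the legacy column until the backfill has finished.
class User < ApplicationRecord
  has_one :user_incoming_email_token_record, class_name: 'UserIncomingEmailToken'

  def incoming_email_token
    user_incoming_email_token_record&.incoming_email_token ||
      read_attribute(:incoming_email_token)
  end
end
```
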
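Similarly, a rough sketch of releases N + 1 through N + 3, following the column-dropping process linked above (milestone and date values are placeholders):

```ruby
# Release N + 1: stop using the legacy column and mark it as ignored.
class User < ApplicationRecord
  include IgnorableColumns

  ignore_column :incoming_email_token, remove_with: '17.1', remove_after: '2024-06-20'
end

# Release N + 2: a post-deployment migration drops the legacy column.
# High-traffic tables such as users may need additional safeguards
# (for example, lock retries); see the linked merge requests.
class RemoveUsersIncomingEmailToken < Gitlab::Database::Migration[2.2]
  milestone '17.1' # placeholder

  def up
    remove_column :users, :incoming_email_token
  end

  def down
    # Rolling back must recreate the column with its original definition;
    # any limit, default, or index is omitted here for brevity.
    add_column :users, :incoming_email_token, :text
  end
end

# Release N + 3: remove the ignore_column rule from the model again.
```
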
While this is a lengthy process, it is needed to perform the extraction
without disrupting the application. Once completed, the original column and the related index will
no longer exist on the `users` table, which results in improved performance.