From 44ab239de91b0a31bd1e69c6cda477a352fa65a0 Mon Sep 17 00:00:00 2001
From: Christian Couder <chriscool@tuxfamily.org>
Date: Sun, 7 Apr 2024 23:39:29 +0000
Subject: [PATCH] doc: Add Git monorepo performance info

Our monorepo guide doesn't explain very well what happens on the Git
side when repositories are too large in general.

Let's try to fix that by explaining what packfiles are, when they are
used, and the fact that creating them is expensive in CPU and memory.
---
 .../project/repository/monorepos/index.md     | 49 +++++++++++++++----
 1 file changed, 39 insertions(+), 10 deletions(-)

diff --git a/doc/user/project/repository/monorepos/index.md b/doc/user/project/repository/monorepos/index.md
index 128f8d99cdd3..f3b8edb85e1c 100644
--- a/doc/user/project/repository/monorepos/index.md
+++ b/doc/user/project/repository/monorepos/index.md
@@ -22,15 +22,28 @@ Monorepos can be large for [many reasons](https://about.gitlab.com/blog/2022/09/
 
 Large repositories pose a performance risk when used in GitLab, especially if a large monorepo receives many clones or pushes a day, which is common for them.
 
-Git itself has performance limitations when it comes to handling
-monorepos.
+### Git performance issues with large repositories
 
-Monorepos can also impact notably on hardware, in some cases hitting limitations such as vertical scaling and network or disk bandwidth limits.
+Git uses [packfiles](https://git-scm.com/book/en/v2/Git-Internals-Packfiles)
+to store its objects so that they take up as little space as
+possible. Packfiles are also used to transfer objects when cloning,
+fetching, or pushing between a Git client and a Git server. Using packfiles is
+usually good because it reduces the amount of disk space and network
+bandwidth required.
+
+However, creating packfiles requires a lot of CPU and memory to compress object
+content. So when repositories are large, every Git operation
+that requires creating packfiles becomes expensive and slow as more
+and larger objects need to be processed and transferred.
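+
+For example, you can check how many of a repository's objects are loose
+and how many are stored in packfiles, and how much disk space the
+packfiles use, with
+[`git count-objects`](https://git-scm.com/docs/git-count-objects):
+
+```shell
+# Show loose and packed object counts and sizes (sizes are in KiB)
+git count-objects -v
+```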
+
+### Consequences for GitLab
 
 [Gitaly](https://gitlab.com/gitlab-org/gitaly) is our Git storage service built
 on top of [Git](https://git-scm.com/). This means that any limitations of
 Git are experienced in Gitaly, and in turn by end users of GitLab.
 
+Monorepos can also have a notable impact on hardware, in some cases hitting vertical scaling limits and network or disk bandwidth limits.
+
 ## Optimize GitLab settings
 
 You should use as many of the following strategies as possible to minimize
@@ -39,9 +52,9 @@ fetches on the Gitaly server.
 ### Rationale
 
 The most resource intensive operation in Git is the
-[`git-pack-objects`](https://git-scm.com/docs/git-pack-objects) process. It is
-responsible for figuring out all of the commit history and files to send back to
-the client.
+[`git-pack-objects`](https://git-scm.com/docs/git-pack-objects)
+process, which is responsible for creating packfiles after figuring out
+all of the commit history and files to send back to the client.
 
 The larger the repository, the more commits, files, branches, and tags that a
 repository has and the more expensive this operation is. Both memory and CPU
@@ -332,10 +345,26 @@ when doing an object graph walk.
 
 ### Large blobs
 
-The presence of large files (called blobs in Git), can be problematic for Git
-because it does not handle large binary files efficiently. If there are blobs over
-10 MB or instance in the `git-sizer` output, this probably means there is binary
-data in your repository.
+Blobs are the [Git objects](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects)
+that are used to store and manage the content of the files that users
+have committed into Git repositories.
+
+#### Issues with large blobs
+
+Large blobs can be problematic for Git because Git does not handle
+large binary data efficiently. Blobs over 10 MB in the `git-sizer` output
+probably indicate that there is large binary data in your repository.
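+
+For example, assuming a repository where you suspect large binary
+files, you can list all blobs over 10 MB (10485760 bytes) using plain
+Git plumbing commands:
+
+```shell
+# List every blob reachable from any ref that is larger than 10 MB,
+# printing its type, object ID, size in bytes, and path
+git rev-list --objects --all |
+  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
+  awk '$1 == "blob" && $3 > 10485760'
+```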
+
+While source code can usually be efficiently compressed, binary data
+is often already compressed. This means that Git is unlikely to be
+successful when it tries to compress large blobs when creating packfiles.
+This results in larger packfiles and higher CPU, memory, and bandwidth
+usage on both Git clients and servers.
+
+On the client side, because Git stores blob content in both packfiles
+(usually under `.git/objects/pack/`) and regular files (in
+[worktrees](https://git-scm.com/docs/git-worktree)), much more disk
+space is usually required than for source code.
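+
+For example, you can compare the space used by packfiles with the
+total space used by the repository (the paths shown assume the default
+repository layout):
+
+```shell
+# Disk space used by packfiles
+du -sh .git/objects/pack
+# Total disk space, including the working tree copy of each file
+du -sh .
+```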
 
 #### Use LFS for large blobs
 
-- 
GitLab