From 44ab239de91b0a31bd1e69c6cda477a352fa65a0 Mon Sep 17 00:00:00 2001
From: Christian Couder <chriscool@tuxfamily.org>
Date: Sun, 7 Apr 2024 23:39:29 +0000
Subject: [PATCH] doc: Add Git monorepo performance info

Our monorepo guide doesn't explain very well what happens on the Git
side when repositories are too large. Let's try to fix that by
explaining what packfiles are, when they are used, and the fact that
creating them is expensive in CPU and memory.
---
 .../project/repository/monorepos/index.md     | 49 +++++++++++++++----
 1 file changed, 39 insertions(+), 10 deletions(-)

diff --git a/doc/user/project/repository/monorepos/index.md b/doc/user/project/repository/monorepos/index.md
index 128f8d99cdd3..f3b8edb85e1c 100644
--- a/doc/user/project/repository/monorepos/index.md
+++ b/doc/user/project/repository/monorepos/index.md
@@ -22,15 +22,28 @@ Monorepos can be large for [many reasons](https://about.gitlab.com/blog/2022/09/
 Large repositories pose a performance risk when used in GitLab, especially if
 a large monorepo receives many clones or pushes a day, which is common for them.
 
-Git itself has performance limitations when it comes to handling
-monorepos.
+### Git performance issues with large repositories
 
-Monorepos can also impact notably on hardware, in some cases hitting limitations such as vertical scaling and network or disk bandwidth limits.
+Git uses [packfiles](https://git-scm.com/book/en/v2/Git-Internals-Packfiles)
+to store its objects so that they take up as little space as
+possible. Packfiles are also used to transfer objects when cloning,
+fetching, or pushing between a Git client and a Git server. Using packfiles is
+usually beneficial because it reduces the amount of disk space and network
+bandwidth required.
+
+However, creating packfiles requires a lot of CPU and memory to compress object
+content. So when repositories are large, every Git operation
+that requires creating packfiles becomes expensive and slow as more
+and bigger objects need to be processed and transferred.
+
+### Consequences for GitLab
 
 [Gitaly](https://gitlab.com/gitlab-org/gitaly) is our Git storage service built
 on top of [Git](https://git-scm.com/). This means that any limitations of
 Git are experienced in Gitaly, and in turn by end users of GitLab.
 
+Monorepos can also have a notable impact on hardware, in some cases hitting limitations such as vertical scaling and network or disk bandwidth limits.
+
 ## Optimize GitLab settings
 
 You should use as many of the following strategies as possible to minimize
@@ -39,9 +52,9 @@ fetches on the Gitaly server.
 ### Rationale
 
 The most resource intensive operation in Git is the
-[`git-pack-objects`](https://git-scm.com/docs/git-pack-objects) process. It is
-responsible for figuring out all of the commit history and files to send back to
-the client.
+[`git-pack-objects`](https://git-scm.com/docs/git-pack-objects)
+process, which is responsible for creating packfiles after figuring out
+all of the commit history and files to send back to the client.
 
 The larger the repository, the more commits, files, branches, and tags that a
 repository has and the more expensive this operation is. Both memory and CPU
@@ -332,10 +345,26 @@ when doing an object graph walk.
 
 ### Large blobs
 
-The presence of large files (called blobs in Git), can be problematic for Git
-because it does not handle large binary files efficiently. If there are blobs over
-10 MB or instance in the `git-sizer` output, this probably means there is binary
-data in your repository.
+Blobs are the [Git objects](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects)
+that are used to store and manage the content of the files that users
+have committed into Git repositories.
+
+#### Issues with large blobs
+
+Large blobs can be problematic for Git because Git does not handle
+large binary data efficiently. Blobs over 10 MB in the `git-sizer` output
+probably mean that there is large binary data in your repository.
+
+While source code can usually be compressed efficiently, binary data
+is often already compressed. This means that Git is unlikely to
+compress large blobs effectively when creating packfiles.
+This results in larger packfiles and higher CPU, memory, and bandwidth
+usage on both Git clients and servers.
+
+On the client side, because Git stores blob content in both packfiles
+(usually under `.git/objects/pack/`) and regular files (in
+[worktrees](https://git-scm.com/docs/git-worktree)), much more disk
+space is usually required than for source code.
 
 #### Use LFS for large blobs
 
-- 
GitLab
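
A note to accompany the "Large blobs" guidance above: besides `git-sizer`, the largest blobs in a repository can be listed with plain Git plumbing commands. This is a generic sketch using standard `git rev-list` and `git cat-file` options, not part of the patch itself:

```shell
# List the five largest blobs reachable from any ref, biggest first.
# Output columns: type, object name, size in bytes, path.
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  awk '$1 == "blob"' |
  sort -k3 -n -r |
  head -n 5
```

Any blob listed here with a multi-megabyte size is a candidate for the LFS migration described in the "Use LFS for large blobs" section.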