diff --git a/doc/user/project/repository/monorepos/index.md b/doc/user/project/repository/monorepos/index.md
index 128f8d99cdd33dca87c6f6c31b26291fa8465d88..f3b8edb85e1cba4964c2e501afd7fd9925db7489 100644
--- a/doc/user/project/repository/monorepos/index.md
+++ b/doc/user/project/repository/monorepos/index.md
@@ -22,15 +22,28 @@ Monorepos can be large for [many reasons](https://about.gitlab.com/blog/2022/09/
 Large repositories pose a performance risk when used in GitLab, especially if
 a large monorepo receives many clones or pushes a day, which is common for them.
 
-Git itself has performance limitations when it comes to handling
-monorepos.
+### Git performance issues with large repositories
 
-Monorepos can also impact notably on hardware, in some cases hitting limitations such as vertical scaling and network or disk bandwidth limits.
+Git uses [packfiles](https://git-scm.com/book/en/v2/Git-Internals-Packfiles)
+to store its objects so that they take up as little space as
+possible. Packfiles are also used to transfer objects when cloning,
+fetching, or pushing between a Git client and a Git server. Using packfiles is
+usually good because it reduces the amount of disk space and network
+bandwidth required.
+
+However, creating packfiles requires a lot of CPU and memory to compress object
+content. So when repositories are large, every Git operation
+that requires creating packfiles becomes expensive and slow, because more
+and bigger objects must be processed and transferred.
+
+### Consequences for GitLab
 
 [Gitaly](https://gitlab.com/gitlab-org/gitaly) is our Git storage service built
 on top of [Git](https://git-scm.com/). This means that any limitations of
 Git are experienced in Gitaly, and in turn by end users of GitLab.
 
+Monorepos can also have a notable impact on hardware, in some cases hitting limitations such as vertical scaling and network or disk bandwidth limits.
+
 ## Optimize GitLab settings
 
 You should use as many of the following strategies as possible to minimize
@@ -39,9 +52,9 @@ fetches on the Gitaly server.
 ### Rationale
 
 The most resource intensive operation in Git is the
-[`git-pack-objects`](https://git-scm.com/docs/git-pack-objects) process. It is
-responsible for figuring out all of the commit history and files to send back to
-the client.
+[`git-pack-objects`](https://git-scm.com/docs/git-pack-objects)
+process, which is responsible for creating packfiles after figuring out
+all of the commit history and files to send back to the client.
 
 The larger the repository, the more commits, files, branches, and tags that a
 repository has and the more expensive this operation is. Both memory and CPU
@@ -332,10 +345,26 @@ when doing an object graph walk.
 ### Large blobs
 
-The presence of large files (called blobs in Git), can be problematic for Git
-because it does not handle large binary files efficiently. If there are blobs over
-10 MB or instance in the `git-sizer` output, this probably means there is binary
-data in your repository.
+Blobs are the [Git objects](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects)
+that are used to store and manage the content of the files that users
+have committed into Git repositories.
+
+#### Issues with large blobs
+
+Large blobs can be problematic for Git because Git does not handle
+large binary data efficiently. Blobs over 10 MB in the `git-sizer` output
+probably mean that there is large binary data in your repository.
+
+While source code can usually be compressed efficiently, binary data
+is often already compressed. This means that Git is unlikely to
+compress large blobs effectively when creating packfiles.
+This results in larger packfiles and higher CPU, memory, and bandwidth
+usage on both Git clients and servers.
+
+On the client side, because Git stores blob content in both packfiles
+(usually under `.git/objects/pack/`) and regular files (in
+[worktrees](https://git-scm.com/docs/git-worktree)), much more disk
+space is usually required than for source code.
 
 #### Use LFS for large blobs
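For the "Use LFS for large blobs" section this hunk introduces: with the separately installed [Git LFS](https://git-lfs.com/) extension, `git lfs track` records patterns in `.gitattributes` so that matching files are stored as small pointer files instead of blobs in the repository's packfiles. The file types below are illustrative examples, not a recommendation from the change above:

```plaintext
# .gitattributes — entries as written by `git lfs track "*.psd"`
# and `git lfs track "*.mp4"`
*.psd filter=lfs diff=lfs merge=lfs -text
*.mp4 filter=lfs diff=lfs merge=lfs -text
```

Because only a pointer enters the object database, large binaries no longer inflate packfile creation on clone, fetch, or push.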
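The packfile behavior described in the first hunk can be demonstrated locally. This is a minimal sketch, assuming `git` is installed and on `PATH`; the throwaway repository, file name, and committer identity are illustrative only:

```shell
# Watch Git turn loose objects into a packfile. `git gc` drives
# git-pack-objects under the hood, which is the CPU- and
# memory-intensive step the documentation change describes.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init --quiet .

# A few commits so that loose objects exist.
for i in 1 2 3; do
  echo "change $i" > file.txt
  git add file.txt
  git -c user.name=example -c user.email=example@example.com \
      commit --quiet -m "commit $i"
done

git count-objects -v   # "count:" reports loose objects
git gc --quiet         # packs them into .git/objects/pack/
git count-objects -v   # "in-pack:" now accounts for the objects
```

After `git gc`, the loose objects have been compressed into a single `.pack` file (with a companion `.idx`), which is the same representation Git sends over the wire on clone and fetch.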
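The compression asymmetry described in the large-blobs hunk is easy to reproduce. A sketch, assuming `git`, `dd`, and `/dev/urandom` are available; the 1 MiB sizes and file names are arbitrary choices for the demo:

```shell
# Compare how packfile compression treats text vs. incompressible
# binary data of the same size.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init --quiet .

# 1 MiB of repetitive text: compresses extremely well in a packfile.
yes "a fairly typical line of source code" | head -c 1048576 > source.txt
# 1 MiB of random bytes: stands in for already-compressed binary assets.
dd if=/dev/urandom of=asset.bin bs=1024 count=1024 2>/dev/null

git add source.txt asset.bin
git -c user.name=example -c user.email=example@example.com \
    commit --quiet -m "text vs binary"
git gc --quiet

# The pack is dominated by the binary blob; the 1 MiB of text
# shrinks to a few kilobytes.
wc -c .git/objects/pack/*.pack
```

On a typical run the pack stays close to 1 MiB: the random `asset.bin` barely compresses, while the repetitive `source.txt` nearly disappears, so 2 MiB of input packs to roughly half its size only because of the text.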