Commit f68d4861, authored by Sami Hiltunen, committed by Evan Read

Document Praefect derived replica paths

This commit documents Praefect's new storage layout along with some
helpful tips to debug issues with it.
Parent 700c7dee
@@ -232,6 +232,137 @@ variable replication factor is tracked in [this issue](https://gitlab.com/groups ...
As with normal Gitaly storages, virtual storages can be sharded.
### Storage layout
WARNING:
The storage layout is an internal detail of Gitaly Cluster and is not guaranteed to remain stable between releases.
The information here is provided only to help with debugging. Changing the
repositories directly on the disk is not supported and may lead to breakage or the changes being overwritten.
Gitaly Cluster's virtual storages provide an abstraction that looks like a single storage but actually consists of
multiple physical storages. Gitaly Cluster has to replicate each operation to each physical storage. Operations
may succeed on some of the physical storages but fail on others.
Partially applied operations can cause problems with other operations and leave the system in a state it can't recover from.
To avoid these types of problems, each operation should either fully apply or not apply at all. This property of operations is called
[atomicity](https://en.wikipedia.org/wiki/Atomicity_(database_systems)).
GitLab controls the storage layout on the repository storages. GitLab instructs the repository storage where to create,
delete, and move repositories. These operations create atomicity issues when they are being applied to multiple physical storages.
For example:
- GitLab deletes a repository while one of its replicas is unavailable.
- GitLab later recreates the repository.
As a result, the stale replica that was unavailable at the time of deletion may cause conflicts and prevent
recreation of the repository.
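The failure mode described above can be illustrated with a toy model. This is only a sketch: the storage names, the relative path, and the `create` and `delete` helpers are hypothetical and do not correspond to Gitaly's actual implementation.

```python
# Hypothetical model: each physical storage is a set of relative paths.
storages = {"gitaly-1": set(), "gitaly-2": set(), "gitaly-3": set()}


def create(relative_path, available):
    """Create the repository on every available storage, failing on conflicts."""
    for name in available:
        if relative_path in storages[name]:
            raise FileExistsError(f"{relative_path} already exists on {name}")
    for name in available:
        storages[name].add(relative_path)


def delete(relative_path, available):
    """Delete the repository from the storages that are reachable."""
    for name in available:
        storages[name].discard(relative_path)


path = "@hashed/ab/cd/abcd.git"  # hypothetical relative path
create(path, available=list(storages))
delete(path, available=["gitaly-1", "gitaly-2"])  # gitaly-3 is unavailable

# gitaly-3 is back and still holds the stale replica, so recreation conflicts.
try:
    create(path, available=list(storages))
    conflict = None
except FileExistsError as err:
    conflict = str(err)
```

Because the client chooses the path, the stale replica and the new repository compete for the same location on disk.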
These atomicity issues have caused multiple problems in the past with:
- Geo syncing to a secondary site with Gitaly Cluster.
- Backup restoration.
- Repository moves between repository storages.
Gitaly Cluster provides atomicity for these operations by storing repositories on the disk in a special layout that prevents
conflicts that could occur due to partially applied operations.
#### Client-generated replica paths
Repositories are stored in the storages at the relative path determined by the [Gitaly client](#gitaly-architecture). These paths
do not begin with the `@cluster` prefix, which distinguishes them from Praefect-generated replica paths. The relative paths
follow the [hashed storage](../repository_storage_types.md#hashed-storage) schema.
#### Praefect-generated replica paths (GitLab 15.0 and later)
> Introduced in GitLab 15.0 behind [a feature flag](https://gitlab.com/gitlab-org/gitaly/-/issues/4218) named `gitaly_praefect_generated_replica_paths`. Disabled by default.
FLAG:
On self-managed GitLab, by default this feature is not available. To make it available, ask an administrator to [enable the feature flag](../feature_flags.md)
named `gitaly_praefect_generated_replica_paths`. On GitLab.com, this feature is available but can be configured by GitLab.com administrators only. The feature is not ready for production use.
When Gitaly Cluster creates a repository, it assigns the repository a unique and permanent ID called the _repository ID_. The repository ID is
internal to Gitaly Cluster and doesn't relate to any IDs elsewhere in GitLab. If a repository is removed from Gitaly Cluster and later moved
back, the repository is assigned a new repository ID and is a different repository from Gitaly Cluster's perspective. The sequence of repository IDs
always increases, but there may be gaps in the sequence.
The repository ID is used to derive a unique storage path called _replica path_ for each repository on the cluster. The replicas of
a repository are all stored at the same replica path on the storages. The replica path is distinct from the _relative path_:
- The relative path is the name the Gitaly client uses to identify a repository. Together with the virtual storage, it uniquely identifies the repository to the client.
- The replica path is the actual physical path in the physical storages.
When handling client requests, Praefect translates the repository in each RPC from the virtual `(virtual storage, relative path)`
identifier into the physical `(storage, replica_path)` identifier.
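The translation can be pictured as a lookup keyed by the virtual identifier. The following is a minimal sketch, assuming an in-memory dictionary in place of the PostgreSQL metadata store. The `Record` type, the `translate` function, the storage names, and the relative path are all hypothetical.

```python
from typing import NamedTuple


class Record(NamedTuple):
    repository_id: int
    replica_path: str
    storages: tuple  # physical storages that hold a replica


# Hypothetical stand-in for the metadata store.
metadata = {
    ("default", "@hashed/ab/cd/abcd.git"): Record(
        repository_id=54771,
        replica_path="@cluster/repositories/6f/96/54771",
        storages=("gitaly-1", "gitaly-2", "gitaly-3"),
    ),
}


def translate(virtual_storage, relative_path):
    """Map the virtual identifier to one physical identifier per replica."""
    record = metadata[(virtual_storage, relative_path)]
    return [(storage, record.replica_path) for storage in record.storages]


targets = translate("default", "@hashed/ab/cd/abcd.git")
```

Every replica shares the same replica path, so only the storage name differs between the resulting identifiers.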
The format of the replica path for:
- Object pools is `@cluster/pools/<xx>/<xx>/<repository ID>`. Object pools are stored in a different directory than other repositories.
They must be identifiable by Gitaly to avoid pruning them as part of housekeeping. Pruning object pools can cause data loss in the linked
repositories.
- Other repositories is `@cluster/repositories/<xx>/<xx>/<repository ID>`
For example, `@cluster/repositories/6f/96/54771`.
The last component of the replica path, `54771`, is the repository ID. This can be used to identify the repository on the disk.
`<xx>/<xx>` are the first four hex digits of the SHA256 hash of the string representation of the repository ID. This is used to balance
the repositories evenly into subdirectories to avoid overly large directories that might cause problems on some file
systems. In this case, `54771` hashes to `6f960ab01689464e768366d3315b3d3b2c28f38761a58a70110554eb04d582f7` so the
first four digits are `6f` and `96`.
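The derivation described above can be sketched in a few lines of Python. This is an illustration based only on the description in this section, not Praefect's actual code. The function names are hypothetical.

```python
import hashlib


def derive_replica_path(repository_id: int, pool: bool = False) -> str:
    """Build a replica path from a repository ID as described above."""
    # SHA256 of the decimal string representation of the repository ID.
    digest = hashlib.sha256(str(repository_id).encode("ascii")).hexdigest()
    kind = "pools" if pool else "repositories"
    # The first four hex digits form the two subdirectory levels.
    return f"@cluster/{kind}/{digest[0:2]}/{digest[2:4]}/{repository_id}"


def repository_id_from_replica_path(replica_path: str) -> int:
    """The last path component is the repository ID."""
    return int(replica_path.rsplit("/", 1)[-1])


example = derive_replica_path(54771)
```

If the sketch matches the description, `derive_replica_path(54771)` should reproduce the example path given above, and `repository_id_from_replica_path` recovers `54771` from it.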
#### Identify repositories on disk
Use the [`praefect metadata`](troubleshooting.md#view-repository-metadata) subcommand to:
- Retrieve a repository's virtual storage and relative path from the metadata store. After you have the hashed storage path, you can use the Rails
  console to retrieve the project path.
- Find where a repository is stored in the cluster with either:
  - The virtual storage and relative path.
  - The repository ID.
The repository on disk also contains the project path in the Git configuration file. The configuration file can be used to determine
the project's location even if the repository's metadata has been deleted. Follow the
[instructions in hashed storage's documentation](../repository_storage_types.md#from-hashed-path-to-project-name).
#### Atomicity of operations
Gitaly Cluster uses the PostgreSQL metadata store with the storage layout to ensure atomicity of repository creation,
deletion, and move operations. The disk operations can't be atomically applied across multiple storages. However, PostgreSQL guarantees
the atomicity of the metadata operations. Gitaly Cluster models the operations so that failed operations always leave
the metadata consistent. The disks may contain stale state even after successful operations. This is expected and the leftover state
won't interfere with future operations but may use up disk space unnecessarily until a cleanup is performed.
There is ongoing work on a [background crawler](https://gitlab.com/gitlab-org/gitaly/-/issues/3719) that cleans up the leftover
repositories from the storages.
##### Repository creations
When creating repositories, Praefect:
1. Reserves a repository ID from PostgreSQL. This is atomic and no two creations receive the same ID.
1. Creates replicas on the Gitaly storages in the replica path derived from the repository ID.
1. Creates metadata records after the repository is successfully created on disk.
Even if two concurrent operations create the same repository, they'd be stored in different directories on the storages and not
conflict. The first to complete creates the metadata record and the other operation fails with an "already exists" error.
The failing creation leaves leftover repositories on the storages. There is ongoing work on a
[background crawler](https://gitlab.com/gitlab-org/gitaly/-/issues/3719) that cleans up the leftover repositories from the storages.
The repository IDs are generated from the `repositories_repository_id_seq` in PostgreSQL. In the above example, the failing operation took
one repository ID without successfully creating a repository with it. Failed repository creations are expected to lead to gaps in the repository IDs.
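The creation flow and the resulting gap in repository IDs can be simulated with a toy model. This is a sketch under stated assumptions: `itertools.count` stands in for `repositories_repository_id_seq`, a dictionary stands in for the metadata records, and every function name and the relative path are hypothetical.

```python
import hashlib
import itertools


class AlreadyExists(Exception):
    pass


id_sequence = itertools.count(1)  # stand-in for repositories_repository_id_seq
metadata = {}  # (virtual storage, relative path) -> replica path
disk = set()  # replica paths present on the storages


def replica_path_for(repository_id):
    digest = hashlib.sha256(str(repository_id).encode("ascii")).hexdigest()
    return f"@cluster/repositories/{digest[:2]}/{digest[2:4]}/{repository_id}"


def reserve_and_write():
    """Steps 1 and 2: reserve an ID, then create the replica on disk."""
    replica_path = replica_path_for(next(id_sequence))
    disk.add(replica_path)
    return replica_path


def record_metadata(virtual_storage, relative_path, replica_path):
    """Step 3: create the metadata record, or fail if one already exists."""
    key = (virtual_storage, relative_path)
    if key in metadata:
        raise AlreadyExists(relative_path)
    metadata[key] = replica_path


# Two concurrent creations of the same repository write to different directories.
first = reserve_and_write()
second = reserve_and_write()
record_metadata("default", "@hashed/ab/cd/abcd.git", first)
try:
    record_metadata("default", "@hashed/ab/cd/abcd.git", second)
    second_failed = False
except AlreadyExists:
    second_failed = True

leftover = disk - set(metadata.values())  # awaits cleanup; its ID is never reused
```

In this model the losing creation consumed ID 2, so the next successful creation receives ID 3 and the sequence of stored repository IDs has a gap.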
##### Repository deletions
A repository is deleted by removing its metadata record. The repository ceases to logically exist as soon as the metadata record is deleted.
PostgreSQL guarantees the atomicity of the removal and a concurrent delete fails with a "not found" error. After successfully deleting
the metadata record, Praefect attempts to remove the replicas from the storages. This may fail and leave leftover state in the storages.
The leftover state is eventually cleaned up.
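A deletion can be modeled the same way. This is a hedged sketch, not Praefect's implementation: the dictionary stands in for the metadata store, and `delete_repository`, `NotFound`, the storage names, and the relative path are hypothetical.

```python
class NotFound(Exception):
    pass


replica_path = "@cluster/repositories/6f/96/54771"
metadata = {("default", "@hashed/ab/cd/abcd.git"): replica_path}
disk = {"gitaly-1": {replica_path}, "gitaly-2": {replica_path}}
unreachable = {"gitaly-2"}  # this storage is down during the deletion


def delete_repository(virtual_storage, relative_path):
    # Removing the metadata record is the point at which the repository
    # logically ceases to exist.
    try:
        path = metadata.pop((virtual_storage, relative_path))
    except KeyError:
        raise NotFound(relative_path) from None
    # Best-effort removal of the replicas; failures leave leftover state.
    for storage, paths in disk.items():
        if storage not in unreachable:
            paths.discard(path)


delete_repository("default", "@hashed/ab/cd/abcd.git")

# A second (for example, concurrent) delete finds no record.
try:
    delete_repository("default", "@hashed/ab/cd/abcd.git")
    second_delete_failed = False
except NotFound:
    second_delete_failed = True
```

The stale replica on `gitaly-2` does not block a later recreation, because the new repository would receive a new repository ID and therefore a different replica path.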
##### Repository moves
Unlike Gitaly, Gitaly Cluster doesn't move the repositories in the storages but only virtually moves the repository by updating the
relative path of the repository in the metadata store.
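A virtual move then amounts to rekeying the metadata record. The sketch below assumes a dictionary in place of the metadata store. The `move_repository` function and both relative paths are hypothetical.

```python
metadata = {
    ("default", "@hashed/aa/aa/old.git"): {
        "repository_id": 54771,
        "replica_path": "@cluster/repositories/6f/96/54771",
    },
}


def move_repository(virtual_storage, old_relative_path, new_relative_path):
    """Update only the relative path; nothing on the storages is touched."""
    new_key = (virtual_storage, new_relative_path)
    if new_key in metadata:
        raise FileExistsError(new_relative_path)
    metadata[new_key] = metadata.pop((virtual_storage, old_relative_path))


move_repository("default", "@hashed/aa/aa/old.git", "@hashed/bb/bb/new.git")
moved = metadata[("default", "@hashed/bb/bb/new.git")]
```

The repository ID and replica path stay the same after the move, so there is no disk operation that could be partially applied.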
### Moving beyond NFS
Engineering support for NFS for Git repositories is deprecated. Technical support is planned to be unavailable starting
......