---
status: proposed
creation-date: "2023-04-13"
authors: [ "@andrewn" ]
coach: "@grzesiek"
---
# GitLab Service-Integration: AI and Beyond
This document is an abbreviated proposal for Service-Integration to allow teams within GitLab to rapidly build new application features that leverage AI, ML, and data technologies.
## Executive Summary
This document proposes a service-integration approach to setting up infrastructure that allows teams within GitLab to build new application features leveraging AI, ML, and data technologies at a rapid pace. The scope of the document is limited specifically to internally hosted features, not third-party APIs.

The current application architecture runs most GitLab application features in Ruby. However, many ML/AI experiments require different resources and tools, implemented in different languages, with huge libraries that do not always play nicely together, and with different hardware requirements. Adding all of these features to the existing infrastructure would rapidly increase the size of the GitLab application container, resulting in slower startup times, a larger number of dependencies, greater security risk, reduced development velocity, and additional complexity due to differing hardware requirements.

As an alternative, this proposal suggests adding services to avoid overloading GitLab's main workloads. These services would run independently, each with isolated resources and dependencies. By adding services, GitLab can maintain the availability and security of GitLab.com while enabling engineers to iterate rapidly on new ML/AI experiments.
## Scope
The infrastructure, platform, and other changes related to ML/AI experiments are broad. This blueprint is limited specifically to the following scope:
- Production workloads, running (directly or indirectly) as a result of requests into the GitLab application (`gitlab.com`) or an associated subdomain (for example, `codesuggestions.gitlab.com`).
- Excludes requests from the GitLab application made to third-party APIs outside of our infrastructure. From an infrastructure point of view, external AI/ML API requests are no different from other (non-ML/AI) API requests and generally follow the existing guidelines that are in place for calling external APIs.
- Excludes training and tuning workloads not directly connected to our production workloads. Training and tuning workloads are distinct from production workloads and will be covered by their own blueprint(s).
## Running Production ML/AI experiment workloads
### Why Not Simply Continue To Use The Existing Application Architecture?
Let's start with some background on how the application is deployed:
- Most GitLab application features are implemented in Ruby and run in one of two types of Ruby deployments: broadly, Rails and Sidekiq (although we do partition this traffic further for different workloads).
- These Ruby workloads have two main container images: `gitlab-webservice-ee` and `gitlab-sidekiq-ee`. All the code, libraries, binaries, and other resources that we use to support the main Ruby part of the codebase are embedded within these images.
- There are thousands of pods running these containers in production for GitLab.com at any moment in time. They are started up and shut down at a high rate throughout the day as traffic demands on the site fluctuate.
- For most new features developed, any new supporting resources need to be added to one or both of these containers.
Many of the initial discussions focus on adding supporting resources to these existing containers (example). Choosing this approach would have many downsides, both in terms of the velocity at which new features can be iterated on and in terms of the availability of GitLab.com.
Many of the AI experiments that GitLab is considering integrating into the application are substantially different from other libraries and tools that have been integrated in the past.
- ML toolkits are implemented in a plethora of languages, each requiring separate runtimes. Python, C, C++ are the most common, but there is a long tail of languages used.
- There are a very large number of tools that we're looking to integrate with and no single tool will support all the features that are being investigated. Tensorflow, PyTorch, Keras, Scikit-learn, Alpaca are just a few examples.
- These libraries are huge: TensorFlow's container image with GPU support is 3 GB, PyTorch's is 5 GB, Keras's is about 300 MB, and Prophet's is roughly 250 MB.
- Many of these libraries do not play nicely together: they may have dependencies that are not compatible, or require different versions of Python, or GPU driver versions.
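To make the bloat concrete, the Dockerfile fragment below is a purely illustrative sketch of what bundling just two of these toolkits into the existing webservice image could look like. The base image reference is a hypothetical placeholder, not the actual build process for `gitlab-webservice-ee`:

```dockerfile
# Hypothetical sketch only: NOT how gitlab-webservice-ee is actually built.
FROM gitlab-webservice-ee-base:latest

# Each ML toolkit pulls in its own multi-gigabyte dependency tree.
RUN pip install tensorflow   # ~3 GB with GPU support
RUN pip install torch        # ~5 GB

# Each toolkit may also pin conflicting versions of shared dependencies
# (for example numpy, protobuf, or CUDA libraries), so these layers may
# not even build together, let alone coexist safely at runtime.
```

Every experiment added this way grows the image that all Rails and Sidekiq pods must pull, which is the core of the problem described below.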
It's likely that in the next few months, GitLab will experiment with many different features, using many different libraries.
Trying to deploy all of these features into the existing infrastructure would have many downsides:
- The size of the GitLab application container would expand very rapidly as each new experiment introduces a new set of supporting libraries, each as big as, or bigger than, the existing GitLab application within the container.
- Startup times for new workloads would increase, potentially impacting the availability of GitLab.com during high-traffic periods.
- The number of dependencies within the container would increase rapidly, putting pressure on the engineering teams to keep ahead of exploits and vulnerabilities.
- The security attack surface within the container would be greatly increased with each new dependency. These containers include secrets which, if leaked via an exploit, would require costly application-wide secret rotation.
- Development velocity will be negatively impacted as engineers work to avoid dependency conflicts between libraries.
- Additionally, different libraries have different hardware requirements (GPUs, TPUs, and specific CUDA and driver versions), adding further complexity.
- Our Kubernetes workloads have been tuned for the existing multithreaded Ruby request (Rails) and message (Sidekiq) processes. Adding extremely resource-intensive applications into these workloads would affect unrelated requests, starving them of CPU and memory and requiring complex tuning to ensure fairness. Failure to do so would impact the availability of GitLab.com.
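By contrast, the service-based alternative runs each experiment as its own workload with isolated resources. As a rough sketch only (the service name, image, resource values, and node pool label below are all hypothetical), a Kubernetes Deployment for such a service could declare its own GPU and memory requirements, leaving the Rails and Sidekiq workloads untouched:

```yaml
# Hypothetical sketch of an independently deployed ML service.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: code-suggestions            # hypothetical service name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: code-suggestions
  template:
    metadata:
      labels:
        app: code-suggestions
    spec:
      containers:
        - name: model-server
          image: registry.example.com/code-suggestions:latest  # hypothetical image
          resources:
            requests:
              memory: "8Gi"
              cpu: "2"
            limits:
              memory: "16Gi"
              nvidia.com/gpu: 1     # GPU isolated to this workload only
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4  # hypothetical GPU node pool
```

Because the service's dependencies, hardware profile, and scaling behavior live in its own image and manifest, an experiment can be iterated on, or torn down, without touching the `gitlab-webservice-ee` and `gitlab-sidekiq-ee` containers.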