diff --git a/doc/development/ai_features/embeddings.md b/doc/development/ai_features/embeddings.md
index 7da6565a444183f48623e2641aaee460c17a2f0c..9f22b511e450fbd2fe615d15b8f97b71db49fcb2 100644
--- a/doc/development/ai_features/embeddings.md
+++ b/doc/development/ai_features/embeddings.md
@@ -76,3 +76,48 @@ The following process outlines the steps to get embeddings generated and stored
 1. Add a new unit primitive: [here](https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/merge_requests/918) and [here](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/155835).
 1. Use `Elastic::ApplicationVersionedSearch` to access callbacks and add the necessary checks for when to generate embeddings. See [`Search::Elastic::IssuesSearch`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/models/concerns/search/elastic/issues_search.rb) for an example.
 1. Backfill embeddings: [example](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/154940).
+
+## Adding issue embeddings locally
+
+### Prerequisites
+
+1. [Make sure Elasticsearch is running](../advanced_search.md#setting-up-development-environment).
+1. If you have an existing Elasticsearch setup, make sure the `AddEmbeddingToIssues` migration has been completed by executing the following until it returns:
+
+   ```ruby
+   Elastic::MigrationWorker.new.perform
+   ```
+
+1. Make sure you can run [GitLab Duo features on your local environment](../ai_features/index.md#instructions-for-setting-up-gitlab-duo-features-in-the-local-development-environment).
+1. Ensure running the following in a rails console outputs an embedding (a vector of 768 dimensions). If not, there is a problem with the AI setup.
+
+   ```ruby
+   Gitlab::Llm::VertexAi::Embeddings::Text.new('text', user: nil, tracking_context: {}, unit_primitive: 'semantic_search_issue').execute
+   ```
+
+### Running the backfill
+
+To backfill issue embeddings for a project's issues, run the following in a rails console:
+
+```ruby
+Gitlab::Duo::Developments::BackfillIssueEmbeddings.execute(project_id: project_id)
+```
+
+The task adds the issues to a queue and processes them in batches, indexing embeddings into Elasticsearch.
+It respects a rate limit of 450 embeddings per minute. Reach out to `@maddievn` or `#g_global_search` in Slack if there are any issues.
+
+### Verify
+
+If the following returns 0, all issues for the project have embeddings:
+
+<details><summary>Expand</summary>
+
+```shell
+curl "http://localhost:9200/gitlab-development-issues/_count" \
+--header "Content-Type: application/json" \
+--data '{"query": {"bool": {"filter": [{"term": {"project_id": PROJECT_ID}}], "must_not": [{"exists": {"field": "embedding"}}]}}}' | jq '.count'
+```
+
+</details>
+
+Replace `PROJECT_ID` with your project ID.
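Review note on the verify step: as a minimal sketch, the body of the `_count` request could also be built in Ruby, so that the project ID is interpolated instead of hand-edited into the JSON string. The helper name `missing_embedding_count_query` and the example ID `42` are hypothetical and not part of this change; the query structure is copied from the `curl` command in the documentation.

```ruby
require 'json'

# Hypothetical helper (not part of GitLab): builds the body of the `_count`
# request from the docs, matching documents of one project that have no
# `embedding` field.
def missing_embedding_count_query(project_id)
  {
    query: {
      bool: {
        filter: [{ term: { project_id: project_id } }],
        must_not: [{ exists: { field: 'embedding' } }]
      }
    }
  }
end

# 42 is a placeholder project ID; the output can be passed to curl via --data.
puts JSON.generate(missing_embedding_count_query(42))
```

Generating the body this way avoids leaving the literal `PROJECT_ID` placeholder in the request, which would make the JSON invalid.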
diff --git a/ee/lib/gitlab/duo/developments/backfill_issue_embeddings.rb b/ee/lib/gitlab/duo/developments/backfill_issue_embeddings.rb
new file mode 100644
index 0000000000000000000000000000000000000000..4fdd352523b6f00f4900d4b10fef1feeecc9e909
--- /dev/null
+++ b/ee/lib/gitlab/duo/developments/backfill_issue_embeddings.rb
@@ -0,0 +1,34 @@
+# frozen_string_literal: true
+
+module Gitlab
+  module Duo
+    module Developments
+      class BackfillIssueEmbeddings
+        def self.execute(project_id:)
+          issues_to_backfill = Project.find(project_id).issues
+
+          puts "Adding #{issues_to_backfill.count} issue embeddings to the queue"
+
+          issues_to_backfill.each_batch do |batch|
+            batch.each do |issue|
+              ::Search::Elastic::ProcessEmbeddingBookkeepingService.track_embedding!(issue)
+            end
+          end
+
+          while ::Search::Elastic::ProcessEmbeddingBookkeepingService.queue_size > 0
+            puts "Queue size: #{::Search::Elastic::ProcessEmbeddingBookkeepingService.queue_size}"
+
+            ::Search::Elastic::ProcessEmbeddingBookkeepingService.new.execute
+
+            if ::Search::Elastic::ProcessEmbeddingBookkeepingService.queue_size > 0
+              puts 'Sleeping for 1 minute...'
+              sleep(60)
+            end
+          end
+
+          puts "Finished processing the queue.\nAll issues for project (#{project_id}) now have embeddings."
+        end
+      end
+    end
+  end
+end
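Review note on the processing loop: a rough sketch of how long the backfill might spend sleeping for a given number of issues. It assumes each `execute` call processes up to 450 embeddings (the rate limit quoted in the documentation; the per-run limit itself is not shown in this diff) and ignores the time spent inside `execute`. The constant and method names are hypothetical.

```ruby
# Assumption: one `execute` run handles at most this many embeddings.
EMBEDDINGS_PER_RUN = 450
# Taken from the `sleep(60)` call in the backfill loop.
SLEEP_SECONDS = 60

# Hypothetical helper: the loop sleeps after every run except the last one,
# so the total sleep time is (number of runs - 1) * 60 seconds.
def estimated_sleep_seconds(issue_count, per_run: EMBEDDINGS_PER_RUN)
  return 0 if issue_count <= 0

  runs = (issue_count + per_run - 1) / per_run # integer ceiling division
  (runs - 1) * SLEEP_SECONDS
end

puts estimated_sleep_seconds(1_000) # 3 runs, 2 sleeps => 120
```

Under these assumptions, a project with 450 issues or fewer finishes without sleeping, and each additional 450 issues adds about one minute.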