diff --git a/doc/development/database/batched_background_migrations.md b/doc/development/database/batched_background_migrations.md index f5393b26a3265a34a5ba54f27794de0b986453a5..f3ea82b5c61fb389fb71baadfa59bc9fb5060ee7 100644 --- a/doc/development/database/batched_background_migrations.md +++ b/doc/development/database/batched_background_migrations.md @@ -374,6 +374,77 @@ module Gitlab end ``` +### Adding filters to the initial batching + +By default, when creating background jobs to perform the migration, batched background migrations will iterate over the full specified table. This is done using the [`PrimaryKeyBatchingStrategy`](https://gitlab.com/gitlab-org/gitlab/-/blob/c9dabd1f4b8058eece6d8cb4af95e9560da9a2ee/lib/gitlab/database/migrations/batched_background_migration_helpers.rb#L17). This means if there are 1000 records in the table and the batch size is 100, there will be 10 jobs. For illustrative purposes, `EachBatch` is used like this: + +```ruby +# PrimaryKeyBatchingStrategy +Projects.all.each_batch(of: 100) do |relation| + relation.where(foo: nil).update_all(foo: 'bar') # this happens in each background job +end +``` + +There are cases where we only need to look at a subset of records. Perhaps we only need to update 1 out of every 10 of those 1000 records. It would be best if we could apply a filter to the initial relation when the jobs are created: + +```ruby +Projects.where(foo: nil).each_batch(of: 100) do |relation| + relation.update_all(foo: 'bar') +end +``` + +In the `PrimaryKeyBatchingStrategy` example, we do not know how many records will be updated in each batch. In the filtered example, we know exactly 100 will be updated with each batch. + +The `PrimaryKeyBatchingStrategy` contains [a method that can be overwritten](https://gitlab.com/gitlab-org/gitlab/-/blob/dd1e70d3676891025534dc4a1e89ca9383178fe7/lib/gitlab/background_migration/batching_strategies/primary_key_batching_strategy.rb#L38-52) to apply additional filtering on the initial `EachBatch`. + +We can accomplish this by: + +1. Create a new class that inherits from `PrimaryKeyBatchingStrategy` and overrides the method using the desired filter (this may be the same filter used in the sub-batch): + + ```ruby + # frozen_string_literal: true + + module GitLab + module BackgroundMigration + module BatchingStrategies + class FooStrategy < PrimaryKeyBatchingStrategy + def apply_additional_filters(relation, job_arguments: [], job_class: nil) + relation.where(foo: nil) + end + end + end + end + end + ``` + +1. In the post-deployment migration that queues the batched background migration, specify the new batching strategy using the `batch_class_name` parameter: + + ```ruby + class BackfillProjectsFoo < Gitlab::Database::Migration[2.0] + MIGRATION = 'BackfillProjectsFoo' + DELAY_INTERVAL = 2.minutes + BATCH_CLASS_NAME = 'FooStrategy' + + restrict_gitlab_migration gitlab_schema: :gitlab_main + + def up + queue_batched_background_migration( + MIGRATION, + :routes, + :id, + job_interval: DELAY_INTERVAL, + batch_class_name: BATCH_CLASS_NAME + ) + end + + def down + delete_batched_background_migration(MIGRATION, :routes, :id, []) + end + end + ``` + +When applying a batching strategy, it is important to ensure the filter properly covered by an index to optimize `EachBatch` performance. See [the `EachBatch` docs for more information](../iterating_tables_in_batches.md). + ## Testing Writing tests is required for: