---
info: Any user with at least the Maintainer role can merge updates to this content. For details, see https://docs.gitlab.com/ee/development/development_processes.html#development-guidelines-review.
---
# Deduplicate database records in a database table
This guide describes a strategy for introducing a database-level uniqueness constraint (unique index) to existing database tables that already contain data.
Requirements:
- Attribute modifications (`INSERT`, `UPDATE`) related to the columns happen only via ActiveRecord (the technique depends on AR callbacks).
- Duplicates are rare and mostly happen due to concurrent record creation. You can verify this by checking the production database table through Teleport (reach out to a database maintainer for help).
The total runtime mainly depends on the number of records in the database table. The migration must scan all records; to fit into the
post-deployment migration runtime limit (about 10 minutes), a database table with fewer than 10 million rows can be considered a small table.
## Deduplication strategy for small tables
The strategy requires 3 milestones. As an example, we're going to deduplicate the `issues` table based on the `title` column where the `title` must be unique for a given `project_id` column.
Milestone 1:
1. Add a new database index (not unique) to the table in a post-deployment migration (if not present already).
1. Add model-level uniqueness validation to reduce the likelihood of duplicates (if not present already).
1. Add a transaction-level [advisory lock](https://www.postgresql.org/docs/current/explicit-locking.html#ADVISORY-LOCKS) to prevent creating duplicate records.
The second step on its own does not prevent duplicate records. See the [Rails guides](https://guides.rubyonrails.org/active_record_validations.html#uniqueness) for more information.
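The second and third steps of Milestone 1 can be sketched as follows. This is a minimal illustration, not the definitive implementation: the `AdvisoryLockSql` module and the lock key format are hypothetical, and the helper only builds the SQL string so that the sketch runs without a database. The commented `Issue` model shows how a callback named `prevent_concurrent_inserts` (the name used later in this guide) might call it.

```ruby
# Hypothetical helper (not part of GitLab or Rails): builds the statement that
# takes a transaction-level advisory lock for one (project_id, title) pair.
module AdvisoryLockSql
  module_function

  # Minimal SQL string quoting for this sketch only. In a real model, prefer
  # the quoting provided by the database connection.
  def quote(value)
    "'#{value.to_s.gsub("'", "''")}'"
  end

  # `hashtext` turns the text key into an integer, and
  # `pg_advisory_xact_lock` holds the lock until the transaction ends.
  def statement(table, *values)
    lock_key = [table, *values].join('-')
    "SELECT pg_advisory_xact_lock(hashtext(#{quote(lock_key)}))"
  end
end

# How the model might use it (assumes a Rails application, so it is left
# commented out here):
#
#   class Issue < ApplicationRecord
#     validates :title, uniqueness: { scope: :project_id }
#     before_validation :prevent_concurrent_inserts
#
#     private
#
#     # Blocks while another transaction holds the lock for the same pair.
#     # After the lock is released, the uniqueness validation can see the
#     # record committed by the other transaction.
#     def prevent_concurrent_inserts
#       return if project_id.nil? || title.nil?
#
#       sql = AdvisoryLockSql.statement('issues', project_id, title)
#       self.class.connection.execute(sql)
#     end
#   end

puts AdvisoryLockSql.statement('issues', 42, "Fix the O'Brien bug")
```

Because the lock is scoped to the transaction, it is released automatically on commit or rollback, so no explicit unlock call is needed.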
Milestone 2:
1. Implement the deduplication logic in a post-deployment migration.
1. Replace the existing index with a unique index.
How to resolve duplicates (e.g., merge attributes, keep the most recent record) depends on the features built on top of the database table. In this example, we keep the most recent record.
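```ruby
def up
  model = define_batchable_model('issues')
  # Single pass over the table
  model.each_batch do |batch|
    # Find duplicated (project_id, title) pairs. The source example is cut off after the
    # `where` clause; the grouping below is an assumed completion.
    duplicates = model
      .where("(project_id, title) IN (#{batch.select(:project_id, :title).to_sql})")
      .group(:project_id, :title)
      .having('COUNT(*) > 1')
      .pluck(:project_id, :title)
    # Resolve each pair in `duplicates` here, keeping the most recent record.
  end
end
```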
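The "keep the most recent record" rule described above can be illustrated in plain Ruby, independent of the database. This is only a sketch: the `Row` struct and the `ids_to_delete` method are hypothetical names, and it assumes that a higher `id` means a more recent record.

```ruby
# Row stand-in for this sketch; a real migration works on database rows.
Row = Struct.new(:id, :project_id, :title)

# Returns the ids to delete: every row except the one with the highest id
# in each (project_id, title) group. Assumes a higher id means a newer row.
def ids_to_delete(rows)
  rows
    .group_by { |row| [row.project_id, row.title] }
    .values
    .flat_map { |group| group.sort_by(&:id)[0...-1].map(&:id) }
end

rows = [
  Row.new(1, 10, 'Crash on login'),
  Row.new(2, 10, 'Crash on login'),
  Row.new(3, 10, 'Update docs'),
  Row.new(4, 11, 'Crash on login'),
  Row.new(5, 10, 'Crash on login')
]

p ids_to_delete(rows) # => [1, 2]
```

Rows 1, 2, and 5 share the same pair, so only row 5 is kept. Row 4 has the same title but belongs to a different project, so it is not a duplicate.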
Milestone 3:
1. Remove the advisory lock by removing the `prevent_concurrent_inserts` ActiveRecord callback method.
NOTE:
This milestone must be after a [required stop](required_stops.md).
## Deduplication strategy for large tables
When deduplicating a large table, we can move the batching and the deduplication logic into a [batched background migration](batched_background_migrations.md).
Milestone 1:
1. Add a new database index (not unique) to the table in a post-deployment migration.
1. Add model-level uniqueness validation to reduce the likelihood of duplicates (if not present already).
1. Add a transaction-level [advisory lock](https://www.postgresql.org/docs/current/explicit-locking.html#ADVISORY-LOCKS) to prevent creating duplicate records.
Milestone 2:
1. Implement the deduplication logic in a batched background migration and enqueue it in a post-deployment migration.
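As a rough sketch of what the deduplication step for one sub-batch might execute, the helper below builds a `DELETE` statement for an inclusive `id` range of the example `issues` table. The method name `dedup_sql_for_range` is hypothetical, the sketch assumes each sub-batch is identified by an `id` range, and it uses an `EXISTS` subquery as one possible way to keep the row with the highest `id` per `(project_id, title)` pair. It is not taken from the GitLab codebase.

```ruby
# Hypothetical helper: builds a DELETE statement for one sub-batch, identified
# by an inclusive id range. A row is deleted when a newer row (higher id) with
# the same (project_id, title) pair exists, so the newest row always survives.
def dedup_sql_for_range(start_id, end_id)
  # Integer() raises for non-numeric input, so the ids are safe to interpolate.
  first = Integer(start_id)
  last = Integer(end_id)

  <<~SQL
    DELETE FROM issues
    WHERE issues.id BETWEEN #{first} AND #{last}
      AND EXISTS (
        SELECT 1
        FROM issues newer
        WHERE newer.project_id = issues.project_id
          AND newer.title = issues.title
          AND newer.id > issues.id
      )
  SQL
end

puts dedup_sql_for_range(1, 1_000)
```

The statement is idempotent: running it again for the same range deletes nothing once the duplicates are gone, which suits a job that may be retried.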
Milestone 3:
1. Finalize the batched background migration.
1. Replace the existing index with a unique index.
1. Remove the advisory lock by removing the `prevent_concurrent_inserts` ActiveRecord callback method.
NOTE:
This milestone must be after a [required stop](required_stops.md).