diff --git a/doc/development/database/index.md b/doc/development/database/index.md
index b427f54ff3cb977b2aa4c52d42040d8ce4da3183..884967282853e596a0c845da17d64359d443c01f 100644
--- a/doc/development/database/index.md
+++ b/doc/development/database/index.md
@@ -30,7 +30,7 @@ info: To determine the technical writer assigned to the Stage/Group associated w
 - [Testing Rails migrations](../testing_guide/testing_migrations_guide.md) guide
 - [Post deployment migrations](post_deployment_migrations.md)
 - [Background migrations](background_migrations.md)
-- [Swapping tables](../swapping_tables.md)
+- [Swapping tables](swapping_tables.md)
 - [Deleting migrations](deleting_migrations.md)
 - [Partitioning tables](table_partitioning.md)
 
@@ -47,7 +47,7 @@ info: To determine the technical writer assigned to the Stage/Group associated w
 - [`NOT NULL` constraints](not_null_constraints.md)
 - [Strings and the Text data type](strings_and_the_text_data_type.md)
 - [Single table inheritance](../single_table_inheritance.md)
-- [Polymorphic associations](../polymorphic_associations.md)
+- [Polymorphic associations](polymorphic_associations.md)
 - [Serializing data](../serializing_data.md)
 - [Hash indexes](../hash_indexes.md)
 - [Storing SHA1 hashes as binary](../sha1_as_binary.md)
diff --git a/doc/development/database/polymorphic_associations.md b/doc/development/database/polymorphic_associations.md
new file mode 100644
index 0000000000000000000000000000000000000000..ac4dc7720a58eb0c1f8d831747ef623b6db328dd
--- /dev/null
+++ b/doc/development/database/polymorphic_associations.md
@@ -0,0 +1,152 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Polymorphic Associations
+
+**Summary:** always use separate tables instead of polymorphic associations.
+
+Rails makes it possible to define so called "polymorphic associations". This
+usually works by adding two columns to a table: a target type column, and a
+target ID. For example, at the time of writing we have such a setup for
+`members` with the following columns:
+
+- `source_type`: a string defining the model to use, can be either `Project` or
+  `Namespace`.
+- `source_id`: the ID of the row to retrieve based on `source_type`. For
+  example, when `source_type` is `Project` then `source_id` contains a
+  project ID.
+
+While such a setup may appear to be useful, it comes with many drawbacks; enough
+that you should avoid this at all costs.
+
+## Space Wasted
+
+Because this setup relies on string values to determine the model to use, it
+wastes a lot of space. For example, for `Project` and `Namespace` the
+maximum size is 9 bytes, plus 1 extra byte for every string when using
+PostgreSQL. While this may only be 10 bytes per row, given enough tables and
+rows using such a setup we can end up wasting quite a bit of disk space and
+memory (for any indexes).
+
+## Indexes
+
+Because our associations are broken up into two columns this may result in
+requiring composite indexes for queries to be performed efficiently. While
+composite indexes are not wrong at all, they can be tricky to set up as the
+ordering of columns in these indexes is important to ensure optimal performance.
+
+## Consistency
+
+One really big problem with polymorphic associations is being unable to enforce
+data consistency on the database level using foreign keys. For consistency to be
+enforced on the database level one would have to write their own foreign key
+logic to support polymorphic associations.
+
+Enforcing consistency on the database level is absolutely crucial for
+maintaining a healthy environment, and thus is another reason to avoid
+polymorphic associations.
+
+## Query Overhead
+
+When using polymorphic associations you always need to filter using both
+columns. For example, you may end up writing a query like this:
+
+```sql
+SELECT *
+FROM members
+WHERE source_type = 'Project'
+AND source_id = 13083;
+```
+
+Here PostgreSQL can perform the query quite efficiently if both columns are
+indexed. As the query gets more complex, it may not be able to use these
+indexes effectively.
+
+## Mixed Responsibilities
+
+Similar to functions and classes, a table should have a single responsibility:
+storing data with a certain set of pre-defined columns. When using polymorphic
+associations, you are storing different types of data (possibly with
+different columns set) in the same table.
+
+## The Solution
+
+Fortunately, there is a solution to these problems: use a
+separate table for every type you would otherwise store in the same table. Using
+a separate table allows you to use everything a database may provide to ensure
+consistency and query data efficiently, without any additional application logic
+being necessary.
+
+Let's say you have a `members` table storing both approved and pending members,
+for both projects and groups, and the pending state is determined by the column
+`requested_at` being set or not. Schema wise such a setup can lead to various
+columns only being set for certain rows, wasting space. It's also possible that
+certain indexes are only set for certain rows, again wasting space. Finally,
+querying such a table requires less than ideal queries. For example:
+
+```sql
+SELECT *
+FROM members
+WHERE requested_at IS NULL
+AND source_type = 'GroupMember'
+AND source_id = 4
+```
+
+Instead such a table should be broken up into separate tables. For example, you
+may end up with 4 tables in this case:
+
+- project_members
+- group_members
+- pending_project_members
+- pending_group_members
+
+This makes querying data trivial. For example, to get the members of a group
+you'd run:
+
+```sql
+SELECT *
+FROM group_members
+WHERE group_id = 4
+```
+
+To get all the pending members of a group in turn you'd run:
+
+```sql
+SELECT *
+FROM pending_group_members
+WHERE group_id = 4
+```
+
+If you want to get both you can use a `UNION`, though you need to be explicit
+about what columns you want to `SELECT` as otherwise the result set uses the
+columns of the first query. For example:
+
+```sql
+SELECT id, 'Group' AS target_type, group_id AS target_id
+FROM group_members
+
+UNION ALL
+
+SELECT id, 'Project' AS target_type, project_id AS target_id
+FROM project_members
+```
+
+The above example is perhaps a bit silly, but it shows that there's nothing
+stopping you from merging the data together and presenting it on the same page.
+Selecting columns explicitly can also speed up queries as the database has to do
+less work to get the data (compared to selecting all columns, even ones you're
+not using).
+
+Our schema also becomes easier. No longer do we need to both store and index the
+`source_type` column, we can define foreign keys easily, and we don't need to
+filter rows using the `IS NULL` condition.
+
+To summarize: using separate tables allows us to use foreign keys effectively,
+create indexes only where necessary, conserve space, query data more
+efficiently, and scale these tables more easily (for example, by storing them on
+separate disks). A nice side effect of this is that code can also become easier,
+as a single model isn't responsible for handling different kinds of
+data.
diff --git a/doc/development/database/swapping_tables.md b/doc/development/database/swapping_tables.md
new file mode 100644
index 0000000000000000000000000000000000000000..efb481ccf35c65eba4b707dfce8ed1321d3825f4
--- /dev/null
+++ b/doc/development/database/swapping_tables.md
@@ -0,0 +1,51 @@
+---
+stage: Data Stores
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Swapping Tables
+
+Sometimes you need to replace one table with another. For example, when
+migrating data in a very large table it's often better to create a copy of the
+table and insert & migrate the data into this new table in the background.
+
+Let's say you want to swap the table `events` with `events_for_migration`. In
+this case you need to follow 3 steps:
+
+1. Rename `events` to `events_temporary`
+1. Rename `events_for_migration` to `events`
+1. Rename `events_temporary` to `events_for_migration`
+
+Rails allows you to do this using the `rename_table` method:
+
+```ruby
+rename_table :events, :events_temporary
+rename_table :events_for_migration, :events
+rename_table :events_temporary, :events_for_migration
+```
+
+This does not require any downtime as long as the 3 `rename_table` calls are
+executed in the _same_ database transaction. Rails by default uses database
+transactions for migrations, but if it doesn't you need to start one
+manually:
+
+```ruby
+Event.transaction do
+  rename_table :events, :events_temporary
+  rename_table :events_for_migration, :events
+  rename_table :events_temporary, :events_for_migration
+end
+```
+
+Once swapped you _have to_ reset the primary key of the new table. For
+PostgreSQL you can use the `reset_pk_sequence!` method like so:
+
+```ruby
+reset_pk_sequence!('events')
+```
+
+Failure to reset the primary keys results in newly created rows starting
+with an ID value of 1. Depending on the existing data this can then lead to
+duplicate key constraints from popping up, preventing users from creating new
+data.
diff --git a/doc/development/polymorphic_associations.md b/doc/development/polymorphic_associations.md
index bbeaab40a9088462d7adfda11448a6716489d471..6b9158b8408c780780b57aea14f69263e700af2b 100644
--- a/doc/development/polymorphic_associations.md
+++ b/doc/development/polymorphic_associations.md
@@ -1,152 +1,11 @@
 ---
-stage: none
-group: unassigned
-info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+redirect_to: 'database/polymorphic_associations.md'
+remove_date: '2022-11-04'
 ---
 
-# Polymorphic Associations
+This document was moved to [another location](database/polymorphic_associations.md).
 
-**Summary:** always use separate tables instead of polymorphic associations.
-
-Rails makes it possible to define so called "polymorphic associations". This
-usually works by adding two columns to a table: a target type column, and a
-target ID. For example, at the time of writing we have such a setup for
-`members` with the following columns:
-
-- `source_type`: a string defining the model to use, can be either `Project` or
-  `Namespace`.
-- `source_id`: the ID of the row to retrieve based on `source_type`. For
-  example, when `source_type` is `Project` then `source_id` contains a
-  project ID.
-
-While such a setup may appear to be useful, it comes with many drawbacks; enough
-that you should avoid this at all costs.
-
-## Space Wasted
-
-Because this setup relies on string values to determine the model to use, it
-wastes a lot of space. For example, for `Project` and `Namespace` the
-maximum size is 9 bytes, plus 1 extra byte for every string when using
-PostgreSQL. While this may only be 10 bytes per row, given enough tables and
-rows using such a setup we can end up wasting quite a bit of disk space and
-memory (for any indexes).
-
-## Indexes
-
-Because our associations are broken up into two columns this may result in
-requiring composite indexes for queries to be performed efficiently. While
-composite indexes are not wrong at all, they can be tricky to set up as the
-ordering of columns in these indexes is important to ensure optimal performance.
-
-## Consistency
-
-One really big problem with polymorphic associations is being unable to enforce
-data consistency on the database level using foreign keys. For consistency to be
-enforced on the database level one would have to write their own foreign key
-logic to support polymorphic associations.
-
-Enforcing consistency on the database level is absolutely crucial for
-maintaining a healthy environment, and thus is another reason to avoid
-polymorphic associations.
-
-## Query Overhead
-
-When using polymorphic associations you always need to filter using both
-columns. For example, you may end up writing a query like this:
-
-```sql
-SELECT *
-FROM members
-WHERE source_type = 'Project'
-AND source_id = 13083;
-```
-
-Here PostgreSQL can perform the query quite efficiently if both columns are
-indexed. As the query gets more complex, it may not be able to use these
-indexes effectively.
-
-## Mixed Responsibilities
-
-Similar to functions and classes, a table should have a single responsibility:
-storing data with a certain set of pre-defined columns. When using polymorphic
-associations, you are storing different types of data (possibly with
-different columns set) in the same table.
-
-## The Solution
-
-Fortunately, there is a solution to these problems: use a
-separate table for every type you would otherwise store in the same table. Using
-a separate table allows you to use everything a database may provide to ensure
-consistency and query data efficiently, without any additional application logic
-being necessary.
-
-Let's say you have a `members` table storing both approved and pending members,
-for both projects and groups, and the pending state is determined by the column
-`requested_at` being set or not. Schema wise such a setup can lead to various
-columns only being set for certain rows, wasting space. It's also possible that
-certain indexes are only set for certain rows, again wasting space. Finally,
-querying such a table requires less than ideal queries. For example:
-
-```sql
-SELECT *
-FROM members
-WHERE requested_at IS NULL
-AND source_type = 'GroupMember'
-AND source_id = 4
-```
-
-Instead such a table should be broken up into separate tables. For example, you
-may end up with 4 tables in this case:
-
-- project_members
-- group_members
-- pending_project_members
-- pending_group_members
-
-This makes querying data trivial. For example, to get the members of a group
-you'd run:
-
-```sql
-SELECT *
-FROM group_members
-WHERE group_id = 4
-```
-
-To get all the pending members of a group in turn you'd run:
-
-```sql
-SELECT *
-FROM pending_group_members
-WHERE group_id = 4
-```
-
-If you want to get both you can use a `UNION`, though you need to be explicit
-about what columns you want to `SELECT` as otherwise the result set uses the
-columns of the first query. For example:
-
-```sql
-SELECT id, 'Group' AS target_type, group_id AS target_id
-FROM group_members
-
-UNION ALL
-
-SELECT id, 'Project' AS target_type, project_id AS target_id
-FROM project_members
-```
-
-The above example is perhaps a bit silly, but it shows that there's nothing
-stopping you from merging the data together and presenting it on the same page.
-Selecting columns explicitly can also speed up queries as the database has to do
-less work to get the data (compared to selecting all columns, even ones you're
-not using).
-
-Our schema also becomes easier. No longer do we need to both store and index the
-`source_type` column, we can define foreign keys easily, and we don't need to
-filter rows using the `IS NULL` condition.
-
-To summarize: using separate tables allows us to use foreign keys effectively,
-create indexes only where necessary, conserve space, query data more
-efficiently, and scale these tables more easily (for example, by storing them on
-separate disks). A nice side effect of this is that code can also become easier,
-as a single model isn't responsible for handling different kinds of
-data.
+<!-- This redirect file can be deleted after <2022-11-04>. -->
+<!-- Redirects that point to other docs in the same project expire in three months. -->
+<!-- Redirects that point to docs in a different project or site (for example, link is not relative and starts with `https:`) expire in one year. -->
+<!-- Before deletion, see: https://docs.gitlab.com/ee/development/documentation/redirects.html -->
diff --git a/doc/development/swapping_tables.md b/doc/development/swapping_tables.md
index efb481ccf35c65eba4b707dfce8ed1321d3825f4..eaa6568dc367986108f840dfe1d3bed430bedc22 100644
--- a/doc/development/swapping_tables.md
+++ b/doc/development/swapping_tables.md
@@ -1,51 +1,11 @@
 ---
-stage: Data Stores
-group: Database
-info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+redirect_to: 'database/swapping_tables.md'
+remove_date: '2022-11-04'
 ---
 
-# Swapping Tables
+This document was moved to [another location](database/swapping_tables.md).
 
-Sometimes you need to replace one table with another. For example, when
-migrating data in a very large table it's often better to create a copy of the
-table and insert & migrate the data into this new table in the background.
-
-Let's say you want to swap the table `events` with `events_for_migration`. In
-this case you need to follow 3 steps:
-
-1. Rename `events` to `events_temporary`
-1. Rename `events_for_migration` to `events`
-1. Rename `events_temporary` to `events_for_migration`
-
-Rails allows you to do this using the `rename_table` method:
-
-```ruby
-rename_table :events, :events_temporary
-rename_table :events_for_migration, :events
-rename_table :events_temporary, :events_for_migration
-```
-
-This does not require any downtime as long as the 3 `rename_table` calls are
-executed in the _same_ database transaction. Rails by default uses database
-transactions for migrations, but if it doesn't you need to start one
-manually:
-
-```ruby
-Event.transaction do
-  rename_table :events, :events_temporary
-  rename_table :events_for_migration, :events
-  rename_table :events_temporary, :events_for_migration
-end
-```
-
-Once swapped you _have to_ reset the primary key of the new table. For
-PostgreSQL you can use the `reset_pk_sequence!` method like so:
-
-```ruby
-reset_pk_sequence!('events')
-```
-
-Failure to reset the primary keys results in newly created rows starting
-with an ID value of 1. Depending on the existing data this can then lead to
-duplicate key constraints from popping up, preventing users from creating new
-data.
+<!-- This redirect file can be deleted after <2022-11-04>. -->
+<!-- Redirects that point to other docs in the same project expire in three months. -->
+<!-- Redirects that point to docs in a different project or site (for example, link is not relative and starts with `https:`) expire in one year. -->
+<!-- Before deletion, see: https://docs.gitlab.com/ee/development/documentation/redirects.html -->