@craig-gomes Can you please comment on the two possible plans that @icbd mentioned in the description? Basically, we need a long-term solution for the JH team to add DB migration files and initializers in the JH repo. Thanks!
By Qian Zhang on 2021-08-11T07:52:18 (imported from GitLab)
Initializers should be a separate topic, unrelated to database. Can we make another issue for that? And I don't think there's an issue having jh/config/initializers. We just need to implement it.
By Lin Jen-Shin on 2021-08-11T08:57:00 (imported from GitLab)
The decision around migration files is secondary to the larger decision of how to manage a separate schema from mainline GitLab.
In the linked comment: https://gitlab.com/gitlab-jh/gitlab/-/issues/114#note_627059593, it sounds like in this example the decision was to implement the functionality in upstream GitLab instead of the JH codebase. Is this not possible to do with other desired changes as well?
I don't have the historical context of problems encountered when GitLab had separate schemas between CE and EE, but taking a similar split here seems like it would be difficult to manage long term.
By Patrick Bair on 2021-08-11T14:39:13 (imported from GitLab)
> And I don't think there's an issue having jh/config/initializers. We just need to implement it.
@godfat-gitlab Could you please elaborate a bit more? What do we need to implement?
> In the linked comment: https://gitlab.com/gitlab-jh/gitlab/-/issues/114#note_627059593, it sounds like in this example the decision was to implement the functionality in upstream GitLab instead of the JH codebase. Is this not possible to do with other desired changes as well?
@pbair Yeah, for the Zentao integration we decided to implement it directly in the upstream repo, since we all think it should be a CE & EE feature. But in the future there will be JH-only features which may also need to introduce new migration files; that's why we have this issue to determine the long-term solution for JH migration files.
By Qian Zhang on 2021-08-12T13:50:05 (imported from GitLab)
@qianzhangxa I didn't check where to change yet, but Rails is loading files inside config/initializers and there must be a way to make it also load files inside jh/config/initializers. We just need to make that happen.
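For illustration, the "also load files inside jh/config/initializers" idea can be sketched as a plain glob-and-load helper (the helper name and directory layout are assumptions; in a real Rails app this would hook into config/application.rb's initializer machinery):

```ruby
# Sketch: load initializer files from the core directory first, then from the
# jh/ counterpart, mirroring how Rails sorts initializers alphabetically.
# Returns the list of files that were loaded.
def load_initializers(*dirs)
  dirs.flat_map do |dir|
    Dir[File.join(dir, '*.rb')].sort.each { |file| load file }
  end
end
```

Calling `load_initializers('config/initializers', 'jh/config/initializers')` would then run the core initializers before the JH-specific ones.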
By Lin Jen-Shin on 2021-08-12T16:05:07 (imported from GitLab)
Thanks for raising this and discussing it during our 1:1. Please reply to the thread above with the challenges we see with split migrations, the path forward and the support we need from the Database team.
By Kyle Wiebers on 2021-08-12T14:43:06 (imported from GitLab)
I think I'll use this thread because this is new evidence :P
Given that it's unlikely that we can put https://gitlab.com/gitlab-jh/gitlab/-/merge_requests/67 into upstream, because it requires integrating with the Tencent SMS service and sending user phone numbers to Tencent, I think we probably just have to accept that JH will need a different database schema.
> I don't have the historical context of problems encountered when GitLab had separate schemas between CE and EE, but taking a similar split here seems like it would be difficult to manage long term.
@pbair As described in https://gitlab.com/gitlab-jh/gitlab/-/issues/114#note_627560888, I think the most difficult part is making downgrading and upgrading between editions work. This is especially difficult when we're moving features between EE and CE.
Of the other problems, the most notable might be conflicts in db/schema.rb (that's how it worked back then). I think the better approach here could be introducing jh/db/structure.sql; when running under JH, it would simply ignore db/structure.sql and use jh/db/structure.sql as the source of truth.
All CI jobs related to database schema could do the same, that it'll just look at the counterpart in jh/ and ignore the original one.
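As a sketch, the source-of-truth switch could be as small as a path lookup keyed on the running edition (the jh_edition flag is hypothetical; the real check would be whatever mechanism detects a JH installation):

```ruby
# Sketch: resolve which structure file is the schema source of truth.
# Under JH, db/structure.sql is ignored in favor of jh/db/structure.sql;
# CI jobs could use the same lookup for any schema-related file.
def structure_sql_path(jh_edition)
  jh_edition ? 'jh/db/structure.sql' : 'db/structure.sql'
end
```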
Do you think this could be feasible moving forward?
By Lin Jen-Shin on 2021-08-12T17:25:09 (imported from GitLab)
@godfat-gitlab @icbd I also agree with the idea of keeping a separate jh/db/structure.sql to avoid conflicts with upstream.
I think it would be useful to keep JH migrations in their own jh/db/migrate to separate them from EE. This makes an initial conversion from EE -> JH easier, as it can be done by only running the JH migrations on top of an existing EE database.
If upgrading and downgrading between JH and EE is not a requirement, that makes this general problem much simpler. But I think it can be managed if necessary, in the same way EE is currently managed, as long as JH remains mostly a superset of EE.
We can think of JH migrations adding columns, tables, etc. to the schema so new models and functionality can be "patched" on top of the existing EE code. In the event of a downgrade, the "patched" code is removed while maintaining the same schema, and for the most part the additive changes should be easily ignored.
By Patrick Bair on 2021-08-25T12:00:59 (imported from GitLab)
> I think it would be useful to keep JH migrations in their own jh/db/migrate to separate them from EE. This makes an initial conversion from EE -> JH easier, as it can be done by only running the JH migrations on top of an existing EE database.
@pbair This does not always work, because when an instance upgrades from EE to JH it might be long after the original migrations were introduced. Consider this scenario:
- EE added a "books" table in version X
- JH added an "isbn" column to "books" in version X
- One month later, EE renamed the "books" table to "articles" in version X+1
- An instance running EE version X+1 therefore has "articles" rather than "books"
- Now this instance wants to run the JH migrations, but it cannot add the "isbn" column to "books" because that table no longer exists
To maintain compatibility here, when "books" is renamed to "articles", JH will need to update its migrations: if "books" exists, add the column to that table; otherwise, if "articles" exists, add it there instead. We need to maintain both paths because an instance could be upgrading from a newer EE or from an older JH.
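The dual-path idea can be simulated in plain Ruby (an illustration only, not actual migration code; a real Rails migration would use table_exists? and column_exists? instead):

```ruby
# Simulate the compatibility-preserving JH migration from the scenario above:
# add "isbn" to "books" if that table still exists, otherwise to its renamed
# successor "articles". The schema is modeled as table name => column list.
def add_isbn(schema)
  target = schema.key?('books') ? 'books' : 'articles'
  schema[target] |= ['isbn'] # idempotent: skip if the column already exists
  schema
end
```

Each EE-side rename forces another branch like this into the JH migration, which is exactly the maintenance burden described above.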
It's not impossible to maintain this, but it's going to be more and more tricky moving forward, and we'll need to define the upgrading/downgrading boundaries very clearly, and explicitly test against those scenarios.
I suppose we have only two choices:

1. If we want to allow upgrade/downgrade between EE and JH, we have to maintain a single schema, meaning that JH migrations will need to be included in EE.
2. If we're OK not officially supporting upgrade/downgrade between EE and JH, then we can just go with jh/db/structure.sql and jh/db/migrate.
Does this make sense to you? If so, I believe we can start implementing jh/db/structure.sql and jh/db/migrate and fix tests along the way.
By Lin Jen-Shin on 2021-08-25T12:00:50 (imported from GitLab)
@godfat-gitlab That's a valid scenario that would be difficult to manage. Using individual migrations would probably not be the best long-term solution to manage upgrades/downgrades for the reasons you mentioned, but it's also the only tool that exists at the moment. I agree that if upgrades and downgrades are allowed with a separate schema, it would have to be through only explicit scenarios that can be maintained and tested in a manageable way.
Even without the upgrade/downgrade there are other potential pitfalls I can think of. For example, there are sometimes lengthy data migrations which use triggers to sync data within the same table or multiple tables over many releases. If a concurrent migration modifies one of the tables in an incompatible way, this can break the migration, but may not be immediately apparent. Most migration methods are not aware of these long-running migrations, and we don't have an automated way to control this, but instead rely on manual processes.
This is a good opportunity to consider adding tooling to prevent these scenarios, which is also desired in the EE codebase but not previously prioritized.
> Does this make sense to you? If so, I believe we can start implementing jh/db/structure.sql and jh/db/migrate and fix tests along the way.
Yes, this does. Having separate structure and migrations makes sense to me as the simplest solution to begin work now.
By Patrick Bair on 2021-08-23T15:54:57 (imported from GitLab)
> For example, there are sometimes lengthy data migrations which use triggers to sync data within the same table or multiple tables over many releases.
Maybe we need a way to lock a specific table's schema during that period. Perhaps a CI job which detects whether the table has changed? When we want to lock the table, we add it to a list; when we consider the migration done, we remove it from the list.
However, I am not sure how to decide when it is "done", given that other instances might be running in different states. I suppose the boundaries are versions, and we need to make that clear.
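A sketch of what such a CI check might look like, assuming the lock list is a plain array and migration files are scanned as text (all names here are illustrative):

```ruby
# Sketch: fail CI when a new migration mentions a table whose schema is
# locked while a long-running data migration is in flight.
LOCKED_TABLES = %w[users merge_requests].freeze # example lock list

# migrations: { 'path/to/file.rb' => source_code }
# Returns [[file, table], ...] for every locked table a migration touches.
def schema_lock_violations(migrations)
  migrations.flat_map do |file, source|
    LOCKED_TABLES.select { |table| source.match?(/\b#{table}\b/) }
                 .map { |table| [file, table] }
  end
end
```

A text scan is crude (it would flag comments too); a stricter version could parse the migration DSL, but the list-in, list-out workflow is the same.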
> Yes, this does. Having separate structure and migrations makes sense to me as the simplest solution to begin work now.
@icbd Since this is fully under jh/, do you think you would be able to do it? Do you need help to implement jh/db/migrate and jh/db/structure.sql?
By Lin Jen-Shin on 2021-08-23T16:11:42 (imported from GitLab)
> Maybe we need a way to lock a specific table in terms of schema during the period. Perhaps a CI job which detects if the table does not change? When we want to lock the table, add it to the list. When we consider the migration done, we remove it from the list.
This topic has gotten additional attention recently since we had two migrations on EE which conflicted with one another. I created an issue to track and restart the conversation: gitlab-org/gitlab#339276
By Patrick Bair on 2021-08-25T09:50:52 (imported from GitLab)
> I think the better approach here could be introducing jh/db/structure.sql and when it's running under JH, it'll just ignore db/structure.sql and use jh/db/structure.sql as the source of truth.
To simplify the discussion, the table contents and versions below are mocked:
| CE/EE | JH | version |
| --- | --- | --- |
| 01.rb Create users table | | 1.0 |
| 02.rb Add name to users | | 1.1 |
| | 03.rb Add phone to users | 1.2 (JH first release) |
| 04.rb Rename users to accounts | | 1.3 |
| 06.rb Create orders table | | 1.4 |
Issue: a user wants to upgrade from CE 1.4 to JH 1.4.
The user needs to execute all the migration files they have not yet run, which is 03.rb.
This user no longer has a users table, so the 03.rb migration will fail.
Plan one
We add a constraint to the JH repo, so that JH code can only create new tables or update JH-only tables.
| CE/EE | JH | version |
| --- | --- | --- |
| 01.rb Create users table | | 1.0 |
| 02.rb Add name to users | | 1.1 |
| | 03.rb Create user_phones table | 1.2 (JH first release) |
| 04.rb Rename users to accounts | | 1.3 |
| | 05.rb Update user_phones foreign key | 1.3 |
| 06.rb Create orders table | | 1.4 |
The user needs to execute 03.rb and 05.rb when upgrading, without failures.
Although this plan solves the upgrading problem, it obviously adds a lot of inconvenience to subsequent work.
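The Plan-one constraint could also be checked mechanically. A plain-Ruby sketch (the ownership list and names are illustrative) of the rule that a JH migration may create new tables or modify JH-owned tables, but never CE/EE tables:

```ruby
# Sketch: given the tables a JH migration creates and modifies, flag any
# modified table that is neither newly created nor already JH-owned.
JH_OWNED_TABLES = %w[user_phones].freeze # example ownership list

def plan_one_violations(created:, modified:)
  modified - created - JH_OWNED_TABLES
end
```

A migration adding a column to user_phones passes; one touching the shared users table is flagged as a violation.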
Plan two
Just like EE, the JH-specific migration files are also added to db/migrate of the upstream repository.
| CE/EE | JH | version |
| --- | --- | --- |
| 01.rb Create users table | | 1.0 |
| 02.rb Add name to users | | 1.1 |
| 03.rb Add phone to users | | 1.2 (JH first release) |
| 04.rb Rename users to accounts | | 1.3 |
| 06.rb Create orders table | | 1.4 |
Functionally, this plan is perfect. However, it requires us to submit migration files to the upstream repository in advance, which incurs communication overhead.
@icbd My understanding so far is that it's unlikely we can accept JH migrations in EE, because the idea dropped out of the discussions pretty quickly. But indeed, I have never seen a definite NO so far. I think this is more a policy decision than a technical decision.
On the other hand, adding JH migrations in EE will make reviewing difficult, and it does not align with all the other mechanisms we're building right now.
So maybe the question comes down to how strongly we want to push for converting between EE and JH.
I can see that if JH migrations are limited to only adding new tables and updating those new tables, then compatibility can likely be kept. (Unless EE suddenly adds a new table with the same name as an old JH table, but I think we can avoid that.)
By Lin Jen-Shin on 2021-08-27T07:33:36 (imported from GitLab)
I'm going to defer to @pbair & @godfat-gitlab to engage relevant team members on this.
How often do you believe JiHu users would be upgrading from CE to JH? I want to be cautious about increasing the cognitive load in gitlab-org/gitlab if this is an infrequent circumstance and start with the original plan.
I agree that Plan 2 would simplify development, but it adds a source of potential confusion in gitlab-org/gitlab with migrations that are not related to CE/EE features.
By Kyle Wiebers on 2021-08-30T19:25:42 (imported from GitLab)
> How often do you believe JiHu users would be upgrading from CE to JH?
I realize the term CE can be ambiguous now: is it the CE built from EE, or the CE built from JH? :P What if people run JH without a license? It should work like the JH version of CE, so it may already have JH-only migrations in place, and then there should be no issue upgrading/downgrading between CE (JH) <=> JH.
But if it's CE (EE) <=> JH, then there would be issues.
> I think plan two will simplify our development a lot.
I think the software complexity will be lower, but development might not be simpler; it could even be much more complex, because we would need to split the code between two different repositories, which definitely makes reviewing much harder.
It may be simpler to fully test on JH and then bring the changes to EE, but that would require all the reviewers from GitLab Inc to get an account on .cn and review from there. Otherwise they have to review something completely out of context.
I believe there will be a lot of workflow process to work out.
By Lin Jen-Shin on 2021-09-02T00:40:59 (imported from GitLab)
Plan one has its challenges. Adding separate tables for new data can negatively impact performance and increase the complexity of the code that queries it. In the example above, adding a phone number for a user could probably be managed easily in a new table, but new data that is more tightly integrated into application logic would be harder to implement this way.
But I also agree plan two could add a lot of process overhead and slow development on JH. Reviews of migrations can already require multiple back-and-forth rounds in EE to work within our guidelines. More importantly, from a policy standpoint this is a complex decision, and a bigger one than I can give an answer to. Migrations and query changes are already a source of instability within the EE codebase, so further complicating that requires input from a broader group.
@craig-gomes knowing existing database-related changes we have, do you think the above plan two would be likely to be accepted as a solution?
By Patrick Bair on 2021-08-31T21:31:41 (imported from GitLab)
@craig-gomes - This was mentioned during the 2021-09-01 JiHu Product and Engineering sync and the JiHu team needs a decision by end of week as a few features are dependent on this.
One requirement that is critical for the JiHu strategy is to support the CE -> JH upgrade path in a seamless manner as there is large CE usage in the JiHu market. Please keep this requirement in mind when evaluating the options above and the decisions.
We may need a modified Plan Two option where the JiHu migration is in the gitlab-org/gitlab repo and the usage context is consistent and clear in the migration to handle the upgrade requirement.
Craig - Jerome and I are happy to have a sync call on this if needed. Based on the urgency for JiHu we'd like a decision from the Database team by the end of this week.
@kwiebers I scheduled a meeting to catch up with @pbair and members of the DB group that could attend tomorrow. I have also added you and Jerome. Feel free to invite others as needed. I'll start my comments in the agenda doc.
By Craig Gomes on 2021-09-01T13:54:03 (imported from GitLab)
@qianzhangxa @icbd - @godfat-gitlab, @jeromezng and I met with @craig-gomes and @abrandl to discuss, and we decided to go with Plan 2 for the current feature. We will re-evaluate as needed after starting here, but we see Plan 2 as the best option.
This method will also allow the JiHu team to leverage the expertise of the GitLab Inc DB maintainers on how to model the data based on high-SaaS-usage data patterns, and to ensure that the migrations do not have a negative effect on GitLab.com.
There was a follow-up item from @jeromezng: "How do you ensure that the GitLab maintainers have that context? Can structure.sql flag those items added by the JiHu team?" I've added a discussion issue here: JiHu - How to identify JiHu specific structure changes - https://gitlab.com/gitlab-org/database-team/team-tasks/-/issues/192
By Craig Gomes on 2021-09-01T22:18:30 (imported from GitLab)
@craig-gomes Eventually we'll need to be able to use both options:

- Plan 1: Introduce database objects specifically only for JH
- Plan 2: Send database migrations upstream from JH to the GitLab codebase
A reason we also need to be able to do (1) is that we should not add unwarranted overhead to the GitLab database. A good example is indexes: when JiHu adds columns to existing tables and needs an index on them, this can create significant overhead for us, e.g. on GitLab.com. The index is potentially large (indexing an existing table takes resources to create and store over time), needs to be maintained (write overhead on the primary), but will never be used on GitLab.com (hence unwarranted). With Plan 1, those indexes can be created specifically for JiHu and won't be present in a GitLab installation (e.g. on GitLab.com).
Another benefit of having the ability to use Plan 1 is that it allows for quicker cycles for JiHu for cases where additional tables need to be created anyways. Those don't necessarily need to go through upstream code review. An example is adding the phone number, which shouldn't go into the existing users table in any case and we might as well create a JiHu-specific table to capture it. So if we had Plan 1 in place too, JiHu could have merged the change without having to send it upstream (and without us needing to coordinate at all).
Capturing this here as we discussed about this today, so it doesn't get lost. cc @pbair
By Andreas Brandl on 2021-09-01T22:42:59 (imported from GitLab)
Not related to how we organize migrations - phone can be considered personally identifiable information (PII), do we want to store it encrypted (using something like TokenAuthenticatable)?
By Krasimir Angelov on 2021-09-02T05:03:16 (imported from GitLab)
> We need to add an index for phone in the user table, since we need to check whether the phone number is available, similar to the username. See
@memorycancel Ah, that's a great example for the indexing topic. Simplifying, the migration would look like below and execute also on GitLab.com:
```sql
alter table user_details add column phone_number varchar(100) null;
create unique index on user_details (phone_number);
```
On GitLab.com, user_details is a 500 MB table with about 3.5M records. The resulting index we created above will be 73 MB - even though all records will have NULL for their phone_number.
This is part of the overhead that I meant above. We have to create, replicate, store and maintain this index on GitLab.com, even though we absolutely don't need it. For larger tables, this will be much more significant in terms of size (we have btree indexes with hundreds of GBs in size). This is not a one-time cost, but we add write overhead since this index needs to be maintained (updated/vacuumed, etc.) continuously.
I can see two ways of working around this:

1. Introduce this index specifically only for JiHu ("Plan 1" above)
2. Turn the index into a partial index
Following (2), we would create the index only for non-null values and end up with an empty index:
```sql
create unique index phone_partial on user_details (phone_number) where phone_number is not null;
```
I find this a rather hacky approach compared to introducing the index only for JiHu.
It also shows another problem: Attributes added to existing tables have to be optional, we can't specify NOT NULL without a default value.
This is because GitLab's code won't maintain the column at all. If we were to introduce a separate table for this, e.g. user_details_jihu (user_id integer.., phone_number varchar not null) - we regain the freedom to add NOT NULL constraints.
By Andreas Brandl on 2021-09-02T17:28:54 (imported from GitLab)
> so what would you suggest? Will we have to go back to Plan 1?
@qianzhangxa For the in-flight change (user/phone number etc.) it's been decided to go with Plan 2 as far as I understand, and we can adapt and reconcile later, too. Other than that I defer to @craig-gomes to steer the discussion (DRI) further.
That said, I think it's clear that we want to be able to use "Plan 2" - as upstreaming changes is always an option.
What we need to figure out is whether we want to pay the overhead of always having to upstream database changes. This overhead is:

- For JiHu: not being able to ship database migrations on their own limits flexibility and adds communication overhead for any database-related change (long review times can significantly lengthen cycle time).
- For GitLab: having to accept operational overhead for database objects (mostly indexes) we absolutely don't need. This incurs cost and reduces vertical scalability for GitLab.com.
- For GitLab: reviewing all database changes for JiHu increases the cognitive load on maintainers even more (missing context, and I imagine we won't have comparable insight into the operational SaaS aspects of GitLab.cn as we do for GitLab.com, which drives a lot of aspects of code review).
In my mind, I would expect we benefit from being able to use "Plan 1", i.e. shipping database changes specifically only for JiHu, in cases where this makes more sense.
By Andreas Brandl on 2021-09-03T03:59:40 (imported from GitLab)
> On GitLab.com, user_details is a 500 MB table with about 3.5M records. The resulting index we created above will be 73 MB - even though all records will have NULL for their phone_number.
@abrandl Thanks for providing the details. I am not clear on why the index on phone with all null values would be 73 MB. Is this estimated or measured data? From my point of view, every index is generated as a b-tree; if the field's data is null, the b-tree entries are also null, and this should not increase GitLab.com's disk usage because of the phone index. Please correct me if I am wrong.
> I am not clear about why the index of phone with null will be 73MB.
@memorycancel Even though all the values in the btree are null, the index still needs to maintain pointers to the heap for all records (including for null values). This is what generates the overhead and explains why the index size is not zero. Since all records in the table are being indexed, these pointers also will need to be updated along with heap updates (hence the write overhead).
The numbers I provided are based on the actual, current production data for GitLab.com.
> And I flipped through the history of db/migrations/*.rb and found that there is precedent
Yes, that's a good example and not uncommon. We have a few indexes excluding null values because of this.
> I copied and changed it to add an index on phone in user_details
Yes this is what I meant (strategy discussion aside).
By Andreas Brandl on 2021-09-03T08:39:03 (imported from GitLab)