@craig-gomes Can you please comment on the two possible plans that @icbd mentioned in the description? Basically, we need a long-term solution for the JH team to add DB migration files and initializers in the JH repo. Thanks!
By Qian Zhang on 2021-08-11T07:52:18 (imported from GitLab)
Initializers should be a separate topic, unrelated to database. Can we make another issue for that? And I don't think there's an issue having jh/config/initializers. We just need to implement it.
By Lin Jen-Shin on 2021-08-11T08:57:00 (imported from GitLab)
The decision around migration files is secondary to the larger decision of how to manage a separate schema from mainline GitLab.
In the linked comment: https://gitlab.com/gitlab-jh/gitlab/-/issues/114#note_627059593, it sounds like in this example the decision was to implement the functionality in upstream GitLab instead of the JH codebase. Is this not possible to do with other desired changes as well?
I don't have the historical context of problems encountered when GitLab had separate schemas between CE and EE, but taking a similar split here seems like it would be difficult to manage long term.
By Patrick Bair on 2021-08-11T14:39:13 (imported from GitLab)
> And I don't think there's an issue having jh/config/initializers. We just need to implement it.
@godfat-gitlab Could you please elaborate a bit more? What do we need to implement?
> In the linked comment: https://gitlab.com/gitlab-jh/gitlab/-/issues/114#note_627059593, it sounds like in this example the decision was to implement the functionality in upstream GitLab instead of the JH codebase. Is this not possible to do with other desired changes as well?
@pbair Yeah, for the Zentao integration we decided to implement it directly in the upstream repo, since we all think it should be a CE & EE feature. But in the future there will be JH-only features which may also need to introduce new migration files; that's why we have this issue to determine the long-term solution for JH migration files.
By Qian Zhang on 2021-08-12T13:50:05 (imported from GitLab)
@qianzhangxa I didn't check where to change yet, but Rails is loading files inside config/initializers and there must be a way to make it also load files inside jh/config/initializers. We just need to make that happen.
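For illustration, the "also load files inside jh/config/initializers" idea can be sketched as a plain glob-and-load helper (the helper name and directory layout are assumptions; in a real Rails app this would hook into config/application.rb's initializer machinery):

```ruby
# Sketch: load initializer files from the core directory first, then from the
# jh/ counterpart, mirroring how Rails sorts initializers alphabetically.
# Returns the list of files that were loaded.
def load_initializers(*dirs)
  dirs.flat_map do |dir|
    Dir[File.join(dir, '*.rb')].sort.each { |file| load file }
  end
end
```

Calling `load_initializers('config/initializers', 'jh/config/initializers')` would then run the core initializers before the JH-specific ones.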
By Lin Jen-Shin on 2021-08-12T16:05:07 (imported from GitLab)
Thanks for raising this and discussing it during our 1:1. Please reply to the thread above with the challenges we see with split migrations, the path forward and the support we need from the Database team.
By Kyle Wiebers on 2021-08-12T14:43:06 (imported from GitLab)
I think I'll use this thread because this is new evidence :P
Given that it's unlikely that we can put https://gitlab.com/gitlab-jh/gitlab/-/merge_requests/67 into upstream, because it requires integrating with the Tencent SMS service and sending user phone numbers to Tencent, I think we probably just have to accept that JH will need a different database schema.
> I don't have the historical context of problems encountered when GitLab had separate schemas between CE and EE, but taking a similar split here seems like it would be difficult to manage long term.
@pbair As described in https://gitlab.com/gitlab-jh/gitlab/-/issues/114#note_627560888, I think the most difficult part is making downgrading and upgrading between editions work. This is especially difficult when we're moving features between EE and CE.
Of the other problems, the most notable might be conflicts in db/schema.rb (that's how it worked back then). I think the better approach here could be introducing jh/db/structure.sql; when running under JH, it would simply ignore db/structure.sql and use jh/db/structure.sql as the source of truth.
All CI jobs related to database schema could do the same, that it'll just look at the counterpart in jh/ and ignore the original one.
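As a sketch, the source-of-truth switch could be as small as a path lookup keyed on the running edition (the jh_edition flag is hypothetical; the real check would be whatever mechanism detects a JH installation):

```ruby
# Sketch: resolve which structure file is the schema source of truth.
# Under JH, db/structure.sql is ignored in favor of jh/db/structure.sql;
# CI jobs could use the same lookup for any schema-related file.
def structure_sql_path(jh_edition)
  jh_edition ? 'jh/db/structure.sql' : 'db/structure.sql'
end
```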
Do you think this could be feasible moving forward?
By Lin Jen-Shin on 2021-08-12T17:25:09 (imported from GitLab)
@godfat-gitlab @icbd I also agree with the idea of keeping a separate jh/db/structure.sql to avoid conflicts with upstream.
I think it would be useful to keep JH migrations in their own jh/db/migrate to separate them from EE. This makes an initial conversion from EE -> JH easier, as it can be done by only running the JH migrations on top of an existing EE database.
If upgrading and downgrading between JH and EE is not a requirement, that makes this general problem much simpler. But I think it can be managed if necessary, in the same way EE is currently managed, as long as JH remains mostly a superset of EE.
We can think of JH migrations adding columns, tables, etc. to the schema so new models and functionality can be "patched" on top of the existing EE code. In the event of a downgrade, the "patched" code is removed while maintaining the same schema, and for the most part the additive changes should be easily ignored.
By Patrick Bair on 2021-08-25T12:00:59 (imported from GitLab)
> I think it would be useful to keep JH migrations in their own jh/db/migrate to separate them from EE. This makes an initial conversion from EE -> JH easier, as it can be done by only running the JH migrations on top of an existing EE database.
@pbair This does not always work, because when an instance upgrades from EE to JH it might be long after the original migrations were introduced. Consider this scenario:
- EE added a "books" table in version X
- JH added an "isbn" column to "books" in version X
- One month later, EE renamed the "books" table to "articles" in version X+1
- An instance running EE version X+1 therefore has "articles" rather than "books"
- Now this instance wants to run the JH migrations, but it cannot add the "isbn" column to "books" because that table no longer exists
To maintain compatibility here, when "books" is renamed to "articles", JH will need to update its migrations: if "books" exists, add the column to that table; otherwise, if "articles" exists, add it there instead. We need to maintain both paths because an instance could be upgrading from a newer EE or from an older JH.
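The dual-path idea can be simulated in plain Ruby (an illustration only, not actual migration code; a real Rails migration would use table_exists? and column_exists? instead):

```ruby
# Simulate the compatibility-preserving JH migration from the scenario above:
# add "isbn" to "books" if that table still exists, otherwise to its renamed
# successor "articles". The schema is modeled as table name => column list.
def add_isbn(schema)
  target = schema.key?('books') ? 'books' : 'articles'
  schema[target] |= ['isbn'] # idempotent: skip if the column already exists
  schema
end
```

Each EE-side rename forces another branch like this into the JH migration, which is exactly the maintenance burden described above.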
It's not impossible to maintain this, but it's going to be more and more tricky moving forward, and we'll need to define the upgrading/downgrading boundaries very clearly, and explicitly test against those scenarios.
I suppose we have only two choices:

1. If we want to allow upgrade/downgrade between EE and JH, we have to maintain a single schema, meaning that JH migrations will need to be included in EE.
2. If we're OK not officially supporting upgrade/downgrade between EE and JH, then we can just go with jh/db/structure.sql and jh/db/migrate.
Does this make sense to you? If so, I believe we can start implementing jh/db/structure.sql and jh/db/migrate and fix tests along the way.
By Lin Jen-Shin on 2021-08-25T12:00:50 (imported from GitLab)
@godfat-gitlab That's a valid scenario that would be difficult to manage. Using individual migrations would probably not be the best long-term solution to manage upgrades/downgrades for the reasons you mentioned, but it's also the only tool that exists at the moment. I agree that if upgrades and downgrades are allowed with a separate schema, it would have to be through only explicit scenarios that can be maintained and tested in a manageable way.
Even without the upgrade/downgrade there are other potential pitfalls I can think of. For example, there are sometimes lengthy data migrations which use triggers to sync data within the same table or multiple tables over many releases. If a concurrent migration modifies one of the tables in an incompatible way, this can break the migration, but may not be immediately apparent. Most migration methods are not aware of these long-running migrations, and we don't have an automated way to control this, but instead rely on manual processes.
This is a good opportunity to consider adding tooling to prevent these scenarios, which is also desired in the EE codebase but not previously prioritized.
> Does this make sense to you? If so, I believe we can start implementing jh/db/structure.sql and jh/db/migrate and fix tests along the way.
Yes, this does. Having separate structure and migrations makes sense to me as the simplest solution to begin work now.
By Patrick Bair on 2021-08-23T15:54:57 (imported from GitLab)
> For example, there are sometimes lengthy data migrations which use triggers to sync data within the same table or multiple tables over many releases.
Maybe we need a way to lock a specific table's schema during that period. Perhaps a CI job which detects whether the table has changed? When we want to lock the table, we add it to a list; when we consider the migration done, we remove it from the list.
However, I am not sure how to decide when it is "done", given that other instances might be running in different states. I suppose the boundaries are versions, and we need to make that clear.
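A sketch of what such a CI check might look like, assuming the lock list is a plain array and migration files are scanned as text (all names here are illustrative):

```ruby
# Sketch: fail CI when a new migration mentions a table whose schema is
# locked while a long-running data migration is in flight.
LOCKED_TABLES = %w[users merge_requests].freeze # example lock list

# migrations: { 'path/to/file.rb' => source_code }
# Returns [[file, table], ...] for every locked table a migration touches.
def schema_lock_violations(migrations)
  migrations.flat_map do |file, source|
    LOCKED_TABLES.select { |table| source.match?(/\b#{table}\b/) }
                 .map { |table| [file, table] }
  end
end
```

A text scan is crude (it would flag comments too); a stricter version could parse the migration DSL, but the list-in, list-out workflow is the same.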
> Yes, this does. Having separate structure and migrations makes sense to me as the simplest solution to begin work now.
@icbd Since this is fully under jh/, do you think you would be able to do it? Do you need help to implement jh/db/migrate and jh/db/structure.sql?
By Lin Jen-Shin on 2021-08-23T16:11:42 (imported from GitLab)
> Maybe we need a way to lock a specific table in terms of schema during the period. Perhaps a CI job which detects if the table does not change? When we want to lock the table, add it to the list. When we consider the migration done, we remove it from the list.
This topic has gotten additional attention recently since we had two migrations on EE which conflicted with one another. I created an issue to track and restart the conversation: gitlab-org/gitlab#339276
By Patrick Bair on 2021-08-25T09:50:52 (imported from GitLab)
> I think the better approach here could be introducing jh/db/structure.sql and when it's running under JH, it'll just ignore db/structure.sql and use jh/db/structure.sql as the source of truth.
To simplify the discussion, the table contents and versions below are mocked:
| CE/EE | JH | version |
| --- | --- | --- |
| 01.rb Create users table | | 1.0 |
| 02.rb Add name to users | | 1.1 |
| | 03.rb Add phone to users | 1.2 (JH first release) |
| 04.rb Rename users to accounts | | 1.3 |
| 06.rb Create orders table | | 1.4 |
Issue: a user wants to upgrade from CE 1.4 to JH 1.4.
The user needs to execute all the migration files they have not yet run, which is 03.rb.
This user no longer has a users table, so the 03.rb migration will fail.
Plan one
We add a constraint to the JH repo, so that JH code can only create new tables or update JH-only tables.
| CE/EE | JH | version |
| --- | --- | --- |
| 01.rb Create users table | | 1.0 |
| 02.rb Add name to users | | 1.1 |
| | 03.rb Create user_phones table | 1.2 (JH first release) |
| 04.rb Rename users to accounts | | 1.3 |
| | 05.rb Update user_phones foreign key | 1.3 |
| 06.rb Create orders table | | 1.4 |
The user needs to execute 03.rb and 05.rb when upgrading, without failures.
Although this plan solves the upgrading problem, it obviously adds a lot of inconvenience to subsequent work.
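The Plan-one constraint could also be checked mechanically. A plain-Ruby sketch (the ownership list and names are illustrative) of the rule that a JH migration may create new tables or modify JH-owned tables, but never CE/EE tables:

```ruby
# Sketch: given the tables a JH migration creates and modifies, flag any
# modified table that is neither newly created nor already JH-owned.
JH_OWNED_TABLES = %w[user_phones].freeze # example ownership list

def plan_one_violations(created:, modified:)
  modified - created - JH_OWNED_TABLES
end
```

A migration adding a column to user_phones passes; one touching the shared users table is flagged as a violation.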
Plan two
Just like EE, the JH-specific migration files are also added to db/migrate of the upstream repository.
| CE/EE | JH | version |
| --- | --- | --- |
| 01.rb Create users table | | 1.0 |
| 02.rb Add name to users | | 1.1 |
| 03.rb Add phone to users | | 1.2 (JH first release) |
| 04.rb Rename users to accounts | | 1.3 |
| 06.rb Create orders table | | 1.4 |
Functionally, this plan is perfect. However, it requires us to submit migration files to the upstream repository in advance, which incurs communication overhead.
@icbd My understanding so far is that it's unlikely we can accept JH migrations in EE, because the idea dropped out of the discussions pretty quickly. But indeed, I have never seen a definite NO so far. I think this is more a policy decision than a technical decision.
On the other hand, adding JH migrations in EE will make reviewing difficult, and it does not align with all the other mechanisms we're building right now.
So maybe the question comes down to how strongly we want to push for converting between EE and JH.
I can see that if JH migrations are limited to only adding new tables and updating those new tables, then compatibility can likely be kept. (Unless EE suddenly adds a new table with the same name as an old JH table, but I think we can avoid that.)
By Lin Jen-Shin on 2021-08-27T07:33:36 (imported from GitLab)
I'm going to defer to @pbair & @godfat-gitlab to engage relevant team members on this.
How often do you believe JiHu users would be upgrading from CE to JH? I want to be cautious about increasing the cognitive load in gitlab-org/gitlab if this is an infrequent circumstance and start with the original plan.
I agree that Plan 2 would simplify development, but it adds a source of potential confusion in gitlab-org/gitlab with migrations that are not related to CE/EE features.
By Kyle Wiebers on 2021-08-30T19:25:42 (imported from GitLab)
> How often do you believe JiHu users would be upgrading from CE to JH?
I realize the term CE can be ambiguous now: is it the CE built from EE, or the CE built from JH? :P What if people run JH without a license? It should work like the JH version of CE, so it may already have JH-only migrations in place, and then there should be no issue upgrading/downgrading between CE (JH) <=> JH.
But if it's CE (EE) <=> JH, then there would be issues.
> I think plan two will simplify our development a lot.
I think the software complexity will be lower, but development might not be simpler; it could even be much more complex, because we would need to split the code between two different repositories, which definitely makes reviewing much harder.
It may be simpler to fully test on JH and then bring the changes to EE, but that would require all the reviewers from GitLab Inc to get an account on .cn and review from there. Otherwise they have to review something completely out of context.
I believe there will be a lot of workflow process to work out.
By Lin Jen-Shin on 2021-09-02T00:40:59 (imported from GitLab)
Plan one has its challenges. Adding separate tables for new data can negatively impact performance and increase the complexity of the code that queries it. In the example above, adding a phone number for a user could probably be managed easily in a new table, but new data that is more tightly integrated into application logic would be harder to implement this way.
But I also agree plan two could add a lot of process overhead and slow development on JH. Reviews of migrations can already require multiple back-and-forth rounds in EE to work within our guidelines. More importantly, from a policy standpoint this is a complex decision, and a bigger one than I can give an answer to. Migrations and query changes are already a source of instability within the EE codebase, so further complicating that requires input from a broader group.
@craig-gomes knowing existing database-related changes we have, do you think the above plan two would be likely to be accepted as a solution?
By Patrick Bair on 2021-08-31T21:31:41 (imported from GitLab)
@craig-gomes - This was mentioned during the 2021-09-01 JiHu Product and Engineering sync and the JiHu team needs a decision by end of week as a few features are dependent on this.
One requirement that is critical for the JiHu strategy is to support the CE -> JH upgrade path in a seamless manner as there is large CE usage in the JiHu market. Please keep this requirement in mind when evaluating the options above and the decisions.
We may need a modified Plan Two option where the JiHu migration is in the gitlab-org/gitlab repo and the usage context is consistent and clear in the migration to handle the upgrade requirement.
Craig - Jerome and I are happy to have a sync call on this if needed. Based on the urgency for JiHu we'd like a decision from the Database team by the end of this week.
@kwiebers I scheduled a meeting to catch up with @pbair and members of the DB group that could attend tomorrow. I have also added you and Jerome. Feel free to invite others as needed. I'll start my comments in the agenda doc.
By Craig Gomes on 2021-09-01T13:54:03 (imported from GitLab)
@qianzhangxa @icbd - @godfat-gitlab, @jeromezng and I met with @craig-gomes and @abrandl to discuss, and we decided to go with Plan 2 for the current feature. We will re-evaluate as needed after starting here, but we see Plan 2 as the best option.
This method will also allow the JiHu team to leverage the expertise of the GitLab Inc DB maintainers on how to model the data based on high-SaaS-usage data patterns, and to ensure that the migrations do not have a negative effect on GitLab.com.
There was a follow-up item from @jeromezng: "How do you ensure that the GitLab maintainers have that context? Can structure.sql flag those items added by the JiHu team?" I've added a discussion issue here: JiHu - How to identify JiHu specific structure changes - https://gitlab.com/gitlab-org/database-team/team-tasks/-/issues/192
By Craig Gomes on 2021-09-01T22:18:30 (imported from GitLab)
@craig-gomes Eventually we'll need to be able to use both options:

- Plan 1: Introduce database objects specifically only for JH
- Plan 2: Send database migrations upstream from JH to the GitLab codebase
A reason we also need to be able to do (1) is that we should not add unwarranted overhead to the GitLab database. A good example is indexes: when JiHu adds columns to existing tables and needs an index on them, this can create significant overhead for us, e.g. on GitLab.com. The index is potentially large (indexing an existing table takes resources to create and store over time), needs to be maintained (write overhead on the primary), but will never be used on GitLab.com (hence unwarranted). With Plan 1, those indexes can be created specifically for JiHu and won't be present in a GitLab installation (e.g. on GitLab.com).
Another benefit of having the ability to use Plan 1 is that it allows for quicker cycles for JiHu for cases where additional tables need to be created anyways. Those don't necessarily need to go through upstream code review. An example is adding the phone number, which shouldn't go into the existing users table in any case and we might as well create a JiHu-specific table to capture it. So if we had Plan 1 in place too, JiHu could have merged the change without having to send it upstream (and without us needing to coordinate at all).
Capturing this here as we discussed about this today, so it doesn't get lost. cc @pbair
By Andreas Brandl on 2021-09-01T22:42:59 (imported from GitLab)
Not related to how we organize migrations - phone can be considered personally identifiable information (PII), do we want to store it encrypted (using something like TokenAuthenticatable)?
By Krasimir Angelov on 2021-09-02T05:03:16 (imported from GitLab)
> We need to add an index for phone in the user table, since we need to check whether the phone number is available, similar to the username. See
@memorycancel Ah, that's a great example for the indexing topic. Simplifying, the migration would look like below and execute also on GitLab.com:
```sql
alter table user_details add column phone_number varchar(100) null;
create unique index on user_details (phone_number);
```
On GitLab.com, user_details is a 500 MB table with about 3.5M records. The resulting index we created above will be 73 MB - even though all records will have NULL for their phone_number.
This is part of the overhead that I meant above. We have to create, replicate, store and maintain this index on GitLab.com, even though we absolutely don't need it. For larger tables, this will be much more significant in terms of size (we have btree indexes with hundreds of GBs in size). This is not a one-time cost, but we add write overhead since this index needs to be maintained (updated/vacuumed, etc.) continuously.
I can see two ways of working around this:

1. Introduce this index specifically only for JiHu ("Plan 1" above)
2. Turn the index into a partial index
Following (2), we would create the index only for non-null values and end up with an empty index:
```sql
create unique index phone_partial on user_details (phone_number) where phone_number is not null;
```
I find this a rather hacky approach compared to introducing the index only for JiHu.
It also shows another problem: Attributes added to existing tables have to be optional, we can't specify NOT NULL without a default value.
This is because GitLab's code won't maintain the column at all. If we were to introduce a separate table for this, e.g. user_details_jihu (user_id integer.., phone_number varchar not null) - we regain the freedom to add NOT NULL constraints.
By Andreas Brandl on 2021-09-02T17:28:54 (imported from GitLab)
> so what would you suggest? Will we have to go back to Plan 1?
@qianzhangxa For the in-flight change (user/phone number etc.) it's been decided to go with Plan 2 as far as I understand, and we can adapt and reconcile later, too. Other than that I defer to @craig-gomes to steer the discussion (DRI) further.
That said, I think it's clear that we want to be able to use "Plan 2" - as upstreaming changes is always an option.
What we need to figure out is whether we want to pay the overhead of always having to upstream database changes. This overhead is:

- For JiHu: not being able to ship database migrations on their own limits flexibility and adds communication overhead for any database-related change (long review times can significantly lengthen cycle time).
- For GitLab: having to accept operational overhead for database objects (mostly indexes) we absolutely don't need. This incurs cost and reduces vertical scalability for GitLab.com.
- For GitLab: reviewing all database changes for JiHu increases the cognitive load on maintainers even more (missing context, and I imagine we won't have comparable insight into the operational SaaS aspects of GitLab.cn as we do for GitLab.com, which drives a lot of aspects of code review).
In my mind, I would expect we benefit from being able to use "Plan 1", i.e. shipping database changes specifically only for JiHu, in cases where this makes more sense.
By Andreas Brandl on 2021-09-03T03:59:40 (imported from GitLab)
> On GitLab.com, user_details is a 500 MB table with about 3.5M records. The resulting index we created above will be 73 MB - even though all records will have NULL for their phone_number.
@abrandl Thanks for providing the details. I am not clear on why the index on phone with all null values would be 73 MB. Is this estimated or measured data? From my point of view, every index is generated as a b-tree; if the field's data is null, the b-tree entries are also null, and this should not increase GitLab.com's disk usage because of the phone index. Please correct me if I am wrong.
> I am not clear about why the index of phone with null will be 73MB.
@memorycancel Even though all the values in the btree are null, the index still needs to maintain pointers to the heap for all records (including for null values). This is what generates the overhead and explains why the index size is not zero. Since all records in the table are being indexed, these pointers also will need to be updated along with heap updates (hence the write overhead).
The numbers I provided are based on the actual, current production data for GitLab.com.
> And I flipped through the history of db/migrations/*.rb and found that there is precedent
Yes, that's a good example and not uncommon. We have a few indexes excluding null values because of this.
> I copied and changed it to add an index on phone in user_details
Yes this is what I meant (strategy discussion aside).
By Andreas Brandl on 2021-09-03T08:39:03 (imported from GitLab)