Skip to content
代码片段 群组 项目
提交 702afc1e 编辑于 作者: Adam Hegyi's avatar Adam Hegyi
浏览文件

Add offset pagination optimization docs

上级 18dc5fe8
No related branches found
No related tags found
无相关合并请求
......@@ -79,6 +79,7 @@ info: Any user with at least the Maintainer role can merge updates to this conte
- [Ordering table columns](ordering_table_columns.md)
- [Pagination guidelines](pagination_guidelines.md)
- [Pagination performance guidelines](pagination_performance_guidelines.md)
- [Offset pagination optimization](offset_pagination_optimization.md)
- [Polymorphic associations](polymorphic_associations.md)
- [Query count limits](query_count_limits.md)
- [Query performance guidelines](query_performance.md)
......
---
stage: Data Stores
group: Database
info: Any user with at least the Maintainer role can merge updates to this content. For details, see https://docs.gitlab.com/ee/development/development_processes.html#development-guidelines-review.
---
# Offset pagination optimization
In many REST APIs endpoints, we use [offset-based pagination](pagination_guidelines.md#offset-pagination) which uses the `page` URL parameter to paginate through the results. Offset pagination [scales linearly](pagination_guidelines.md#offset-on-a-large-dataset), the higher the page number, the slower the database query gets. This means that for large page numbers, the database query can time out. This usually happens when third-party integrations and scripts interact with the system as users are unlikely to deliberately visit a high page number.
The ideal way of dealing with scalability problems related to offset pagination is switching to [keyset pagination](pagination_guidelines.md#keyset-pagination) however, this is means a breaking API change. To provide a temporary, stop-gap measure, you can use the [`Gitlab::Pagination::Offset::PaginationWithIndexOnlyScan`](https://gitlab.com/gitlab-org/gitlab/-/tree/master/lib/gitlab/pagination/offset/pagination_with_index_only_scan.rb) class. The optimization can help in certain cases to improve the performance of offset-paginated queries when high `OFFSET` value is present. The performance improvement means that the queries will continue to scale linearly with improved query timings, which means that reaching database timeout will happen at a much higher `page` number if it happens at all.
## Requirements for using the optimization
The optimization avoids calling `SELECT *` when determining the records based on the `ORDER BY`, `OFFSET`, and `LIMIT` clauses and attempts to use an index only scan to reduce database I/O. To use the optimization, the same requirements must be met as for keyset pagination:
- `ORDER BY` clause is present.
- The `ORDER BY` clause uniquely identifies one database column.
- Good, uses the primary key: `ORDER BY id`
- Bad, `created_at` not unique: `ORDER BY created_at`
- Good, there is a [tie-breaker](pagination_performance_guidelines.md#tie-breaker-column): `ORDER BY created_at, id`
- The query is well-covered with a database index.
## How to use the optimizator class
The optimizator class can be used with `ActiveRecord::Relation` objects, as a result, it will return an optimized, [kaminari-paginated](https://github.com/kaminari/kaminari) `ActiveRecord::Relation` object. In case the optimization cannot be applied, the original `ActiveRecord::Relation` object will be used for the pagination.
Basic usage:
```ruby
scope = Issue.where(project_id: 1).order(:id)
records = Gitlab::Pagination::Offset::PaginationWithIndexOnlyScan.new(scope: scope, page: 5, per_page: 100).paginate_with_kaminari
puts records.to_a
```
Optimizations should be always rolled out with feature flags, you can also target the usage of the optimization when certain conditions are met.
```ruby
# - Only apply optimization for large page number lookups
# - When label_names filter parameter is given, the optimziation will not have effect (complex JOIN).
if params[:page] > 100 && params[:label_names].blank? && Feature.enabled?(:my_optimized_offet_query)
Gitlab::Pagination::Offset::PaginationWithIndexOnlyScan.new(scope: scope, page: params[:page], per_page: params[:per_page]).paginate_with_kaminari
else
scope.page(params[:page]).per(params[:per_page])
end
```
## How does the optimization work
The optimization takes the passed `ActiveRecord::Relation` object and moves it into a CTE. Within the CTE, the original query is altered to only
select the `ORDER BY` columns. This will make it possible for the database to use index only scan.
When the query is executed, the query within the CTE is evaluated first, the CTE will contain `LIMIT` number of rows with the selected columns.
Using the `ORDER BY` values, a `LATERAL` query will locate the full rows one by one. `LATERAL` query is used here in order to force out
a nested loop: for each row in the CTE, look up a full row in the table.
Original query:
- Reads `OFFSET + LIMIT` number of entries from the index.
- Reads `OFFSET + LIMIT` number of rows from the table.
Optimized query:
- Reads `OFFSET + LIMIT` number of entries from the index.
- Reads `LIMIT` number of rows from the table.
## Determine if the optimization helps
By running `EXPLAIN (buffers, analyze)` on the database query with a high (100_000) `OFFSET` value, we can clearly see if the optimization helps.
Look for the following:
- The optimized query plan must have an index only scan node.
- Comparing the cached buffer count and timing should be lower.
- This can be done when executing the same query 2 or 3 times.
Consider the following query:
```sql
SELECT issues.*
FROM issues
ORDER BY id
OFFSET 100000
LIMIT 100
```
It produces an execution plan which uses an index scan:
```plaintext
Limit (cost=27800.96..27828.77 rows=100 width=1491) (actual time=138.305..138.470 rows=100 loops=1)
Buffers: shared hit=73212
I/O Timings: read=0.000 write=0.000
-> Index Scan using issues_pkey on public.issues (cost=0.57..26077453.90 rows=93802448 width=1491) (actual time=0.063..133.688 rows=100100 loops=1)
Buffers: shared hit=73212
I/O Timings: read=0.000 write=0.000
Time: 143.779 ms
- planning: 5.222 ms
- execution: 138.557 ms
- I/O read: 0.000 ms
- I/O write: 0.000 ms
Shared buffers:
- hits: 73212 (~572.00 MiB) from the buffer pool
- reads: 0 from the OS file cache, including disk I/O
- dirtied: 0
- writes: 0
```
The optimized query:
```sql
WITH index_only_scan_pagination_cte AS MATERIALIZED
(SELECT id
FROM issues
ORDER BY id ASC
LIMIT 100
OFFSET 100000)
SELECT issues.*
FROM
(SELECT id
FROM index_only_scan_pagination_cte) index_only_scan_subquery,
LATERAL
(SELECT issues.*
FROM issues
WHERE issues.id = index_only_scan_subquery.id
LIMIT 1) issues
```
Execution plan:
```plaintext
Nested Loop (cost=2453.51..2815.44 rows=100 width=1491) (actual time=23.614..23.973 rows=100 loops=1)
Buffers: shared hit=56167
I/O Timings: read=0.000 write=0.000
CTE index_only_scan_pagination_cte
-> Limit (cost=2450.49..2452.94 rows=100 width=4) (actual time=23.590..23.621 rows=100 loops=1)
Buffers: shared hit=55667
I/O Timings: read=0.000 write=0.000
-> Index Only Scan using issues_pkey on public.issues issues_1 (cost=0.57..2298090.72 rows=93802448 width=4) (actual time=0.070..20.412 rows=100100 loops=1)
Heap Fetches: 1063
Buffers: shared hit=55667
I/O Timings: read=0.000 write=0.000
-> CTE Scan on index_only_scan_pagination_cte (cost=0.00..2.00 rows=100 width=4) (actual time=23.593..23.641 rows=100 loops=1)
Buffers: shared hit=55667
I/O Timings: read=0.000 write=0.000
-> Limit (cost=0.57..3.58 rows=1 width=1491) (actual time=0.003..0.003 rows=1 loops=100)
Buffers: shared hit=500
I/O Timings: read=0.000 write=0.000
-> Index Scan using issues_pkey on public.issues (cost=0.57..3.58 rows=1 width=1491) (actual time=0.003..0.003 rows=1 loops=100)
Index Cond: (issues.id = index_only_scan_pagination_cte.id)
Buffers: shared hit=500
I/O Timings: read=0.000 write=0.000
Time: 29.562 ms
- planning: 5.506 ms
- execution: 24.056 ms
- I/O read: 0.000 ms
- I/O write: 0.000 ms
Shared buffers:
- hits: 56167 (~438.80 MiB) from the buffer pool
- reads: 0 from the OS file cache, including disk I/O
- dirtied: 0
- writes: 0
```
......@@ -170,7 +170,7 @@ When we paginate over a large dataset, we might notice that the response time ge
From the user point of view, this might not be always noticeable. As the user paginates forward, the previous rows might be still in the buffer cache of the database. If the user shares the link with someone else and it's opened after a few minutes or hours, the response time might be significantly higher or it would even time out.
When requesting a large page number, the database needs to read `PAGE * PAGE_SIZE` rows. This makes offset pagination **unsuitable for large database tables**.
When requesting a large page number, the database needs to read `PAGE * PAGE_SIZE` rows. This makes offset pagination **unsuitable for large database tables** however, with an [optimization technique](offset_pagination_optimization.md) the overall performance of the database queries can be slightly improved.
Example: listing users on the Admin Area
......
......@@ -17,7 +17,7 @@ module Offset
#
# scope = Issue.where(project_id: 1).order(:id)
#
# records = PaginationWithIndexOnlyScan.new(scope: scope, page: 5, per_page: 100).pageinate_with_kaminari
# records = PaginationWithIndexOnlyScan.new(scope: scope, page: 5, per_page: 100).paginate_with_kaminari
# puts records.to_a
# puts records.total_count
class PaginationWithIndexOnlyScan
......@@ -32,7 +32,7 @@ def initialize(scope:, page:, per_page:)
@original_order_values = scope.order_values
end
def pageinate_with_kaminari
def paginate_with_kaminari
original_kaminari_query = scope.page(page).per(per_page)
# Check for keyset pagination support
......
......@@ -17,7 +17,7 @@
let(:per) { 2 }
let(:page) { 1 }
subject(:result) { optimization.pageinate_with_kaminari }
subject(:result) { optimization.paginate_with_kaminari }
RSpec::Matchers.define :match_pagination_results do |expected|
match do |actual|
......
0% 加载中 .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册