MDEV-32067 InnoDB linear read ahead had better be logical#4600

Open
Thirunarayanan wants to merge 1 commit into main from main_MDEV-32067

@Thirunarayanan Thirunarayanan commented Jan 28, 2026

MDEV-32067 InnoDB linear read ahead had better be logical

The traditional linear read-ahead, enabled by innodb_read_ahead_threshold=56,
only works if pages are allocated on adjacent page numbers, which is not
always the case for B-tree leaf pages.
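
To illustrate the adjacency problem, here is a minimal sketch (not InnoDB code; `adjacent_links` and the page numbers used with it are made up for illustration). It walks a leaf-page chain in key order and counts how many successors happen to sit on the physically next page number, which is the only case a purely linear heuristic can exploit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: leaf_chain lists leaf page numbers in key order,
// as one would see them by following the next-page links.
// Returns how many links lead to the physically adjacent page.
inline std::size_t adjacent_links(const std::vector<std::uint32_t> &leaf_chain)
{
  std::size_t adjacent = 0;
  for (std::size_t i = 1; i < leaf_chain.size(); i++)
    if (leaf_chain[i] == leaf_chain[i - 1] + 1)
      adjacent++;
  // Sanity check: a chain of n pages has at most n - 1 links.
  assert(leaf_chain.empty() || adjacent < leaf_chain.size());
  return adjacent;
}
```

For a chain such as 7, 42, 9, 15 no link is adjacent, so a heuristic based on page-number order would have nothing to go on even though the scan is perfectly sequential in key order.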

After this change, the exact nonzero values of
innodb_read_ahead_threshold matter only for the read-ahead of
undo log pages.

Introduce Multi-Range Read (MRR) aware read-ahead that collects the
actual leaf page numbers during B-tree traversal.

buf_read_ahead_undo(): Renamed from buf_read_ahead_linear().
This function will no longer be invoked on any BLOB pages
(for which FIL_PAGE_PREV and FIL_PAGE_NEXT were not initialized
consistently) nor on any index pages. For index leaf pages,
we will introduce buf_read_ahead_one() and buf_read_ahead_pages().

buf_read_ahead_one(): Read ahead one (sibling leaf) page.
This logic cannot be disabled.

buf_read_ahead_pages(): Read ahead B-tree index leaf pages.

buf_read_ahead_random(): Split the function into two parts: one
that determines which range of pages should be read, and another
that actually initiates a read of the pages.
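
A rough sketch of what such a split can look like, assuming an aligned read-ahead area clamped to the tablespace size. The names `page_range`, `read_ahead_range` and `submit_reads` are hypothetical and are not the functions in this patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical half-open range [low, high) of page numbers.
struct page_range { std::uint32_t low; std::uint32_t high; };

// Part 1: decide which pages to read. Pure computation, no I/O.
// The range is the area-aligned block around page_no, clamped to
// the tablespace size; it is empty if page_no lies beyond the end.
inline page_range read_ahead_range(std::uint32_t page_no, std::uint32_t area,
                                   std::uint32_t space_size)
{
  assert(area > 0);
  const std::uint64_t low = page_no - page_no % area;
  const std::uint64_t high = std::min<std::uint64_t>(low + area, space_size);
  return {static_cast<std::uint32_t>(low), static_cast<std::uint32_t>(high)};
}

// Part 2: initiate the reads. submit(page_no) stands in for posting an
// asynchronous read and returns whether a request was actually issued.
template<typename Submit>
inline unsigned submit_reads(page_range r, Submit submit)
{
  unsigned count = 0;
  for (std::uint32_t p = r.low; p < r.high; p++)
    if (submit(p))
      count++;
  return count;
}
```

Keeping the first part free of I/O makes it possible to reuse the second part for any caller that already knows which pages it wants.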

btr_pcur_move_to_next_page(): Invoke buf_read_ahead_one()
instead of buf_read_ahead_linear().

btr_pcur_move_backward_from_page(): Implement a fast path of
trying to acquire a latch on the previous page without waiting,
and invoke buf_read_ahead_one() on the preceding page, with the
assumption that we may be accessing that page in the near future.
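
The fast path can be pictured roughly as follows. This is a simplified, self-contained sketch with a toy latch; `latch_sketch`, `page_sketch` and `move_backward` are hypothetical, and the `prefetch_requested` flag merely stands in for a call such as buf_read_ahead_one().

```cpp
#include <atomic>
#include <cassert>

// Toy latch: state -1 means exclusively held, n > 0 means n shared holders.
class latch_sketch
{
  std::atomic<int> state{0};
public:
  bool try_lock_shared()
  {
    int s = state.load();
    while (s >= 0)
      if (state.compare_exchange_weak(s, s + 1))
        return true;
    return false;
  }
  void unlock_shared() { state.fetch_sub(1); }
  bool try_lock() { int expected = 0; return state.compare_exchange_strong(expected, -1); }
  void unlock() { state.store(0); }
};

struct page_sketch
{
  latch_sketch latch;
  page_sketch *prev = nullptr;
  bool prefetch_requested = false; // stand-in for an issued read-ahead
};

enum class step { fast_path, slow_path, at_start };

// The caller holds a shared latch on cur. On the fast path the cursor
// moves to the previous page without ever waiting for a latch.
inline step move_backward(page_sketch *&cur)
{
  page_sketch *prev = cur->prev;
  if (!prev)
    return step::at_start;
  if (!prev->latch.try_lock_shared())
    return step::slow_path; // caller must fall back to the waiting logic
  cur->latch.unlock_shared();
  cur = prev;
  if (cur->prev)
    cur->prev->prefetch_requested = true; // we may need that page soon
  return step::fast_path;
}
```

The try-lock matters because a backward step goes against the usual left-to-right latching order, so waiting for the latch while still holding the current page could deadlock.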

btr_copy_blob_prefix(): Simplify the logic. On BLOB pages other than
those of ROW_FORMAT=COMPRESSED, the FIL_PAGE_NEXT field is not
meaningfully initialized, and the FIL_PAGE_PREV field does not point
to anything meaningful either. buf_read_ahead_linear() expects
both fields to be set meaningfully. Only the non-default setting
innodb_random_read_ahead=ON might be meaningful here.

btr_cur_t::search_leaf(): Add MRR read-ahead context to collect
leaf page numbers at PAGE_LEVEL=1 during B-tree traversal.
The collected page numbers represent actual leaf pages that
will be accessed, enabling more targeted read-ahead than one
that assumes consecutive page numbers.

mrr_readahead_ctx_t: New structure for passing MRR context
through the call chain ha_innobase -> row_search_mvcc()
-> btr_pcur_open() -> search_leaf(). The context is limited
to READ_AHEAD_PAGES=64 pages.
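
As an illustration of the idea only: the sketch below collects child page numbers from a level-1 node for a key range, with a fixed capacity of 64. `mrr_ctx_sketch`, `node_ptr_sketch` and `collect_leaf_pages` are hypothetical names; the layout of the real mrr_readahead_ctx_t is not shown in this description, and integer keys are an assumption made for brevity.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical context: a bounded list of leaf page numbers.
struct mrr_ctx_sketch
{
  static constexpr std::size_t READ_AHEAD_PAGES = 64;
  std::array<std::uint32_t, READ_AHEAD_PAGES> pages{};
  std::size_t n_pages = 0;

  // Returns false once the limit has been reached.
  bool add(std::uint32_t page_no)
  {
    if (n_pages == READ_AHEAD_PAGES)
      return false;
    pages[n_pages++] = page_no;
    return true;
  }
};

// Hypothetical node pointer on a level-1 page: the smallest key of the
// child leaf page, and the page number of that child.
struct node_ptr_sketch { int min_key; std::uint32_t child_page; };

// Collect the leaf pages whose key ranges can intersect [lo, hi].
// Child i covers the keys from its min_key up to the next min_key.
inline std::size_t collect_leaf_pages(const std::vector<node_ptr_sketch> &level1,
                                      int lo, int hi, mrr_ctx_sketch &ctx)
{
  assert(lo <= hi);
  const std::size_t before = ctx.n_pages;
  for (std::size_t i = 0; i < level1.size(); i++)
  {
    const bool starts_in_time = level1[i].min_key <= hi;
    const bool ends_late_enough =
      i + 1 == level1.size() || level1[i + 1].min_key > lo;
    if (starts_in_time && ends_late_enough && !ctx.add(level1[i].child_page))
      break; // limit reached
  }
  return ctx.n_pages - before;
}
```

The collected numbers need not be adjacent or even ascending, which is exactly the case that page-number-based read-ahead cannot handle.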

@Thirunarayanan Thirunarayanan force-pushed the main_MDEV-32067 branch 5 times, most recently from 1fc14cb to 7fa9c70 February 2, 2026 07:21
@Thirunarayanan Thirunarayanan requested a review from dr-m February 3, 2026 09:03
@Thirunarayanan Thirunarayanan marked this pull request as ready for review February 3, 2026 09:03

@dr-m dr-m left a comment

I only reviewed a small part of this so far. Please debug this with undo tablespace truncation enabled.

const unsigned zip_size= space->zip_size();
ulint count;

if (high_1.page_no() > space->last_page_number())

We seem to have a race condition with undo tablespace truncation here. I understood that mtr_t::commit_shrink() would change space->committed_size as part of mtr_memo_slot_t::release(). There, that field is protected by exclusive space->latch. Here, we are not holding that latch, hence we could theoretically post a read-ahead request for a portion of an undo tablespace that is being truncated.

Acquiring space->latch here would lead to a significant performance regression. We will need to prevent this glitch in a different way. A possible way might be to set the STOPPING_READS flag in the mtr_t::commit_shrink() code path and to clear it after release(). This would allow space.acquire() to return false and therefore prevent us from entering this code path.
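
To make the suggestion concrete, here is a minimal stand-alone sketch of such a gate. It is not the fil_space_t implementation: `space_sketch`, its members, and the choice of the top bit for the flag are assumptions for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical model: one atomic word holds a reference count in the
// low bits and a "stop reads" flag in the top bit.
class space_sketch
{
  static constexpr std::uint32_t STOPPING_READS = 1U << 31;
  std::atomic<std::uint32_t> pending{0};
public:
  // Try to register a reader; fails without blocking while reads are stopped.
  bool acquire()
  {
    std::uint32_t p = pending.load(std::memory_order_relaxed);
    do
    {
      if (p & STOPPING_READS)
        return false;
    }
    while (!pending.compare_exchange_weak(p, p + 1,
                                          std::memory_order_acquire,
                                          std::memory_order_relaxed));
    return true;
  }
  void release() { pending.fetch_sub(1, std::memory_order_release); }

  // Would be called around the shrink: set the flag before, clear it after.
  void stop_reads() { pending.fetch_or(STOPPING_READS); }
  void resume_reads() { pending.fetch_and(~STOPPING_READS); }

  std::uint32_t referenced() const { return pending.load() & ~STOPPING_READS; }
};
```

A read-ahead path that calls acquire() first would simply skip the request while the flag is set, without taking any latch. Readers that acquired a reference before the flag was set are not covered by this sketch; the shrinking thread would presumably also have to wait until referenced() drops to zero.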
