
Data migrations worker rework #27014


Merged

Conversation

bashtanov
Contributor

@bashtanov bashtanov commented Jul 28, 2025

Previously, the work tracking logic was flawed. When work was already running
for an NTP and the backend requested another work item (e.g. for a different
stage or for a new migration), the request was handled inappropriately:

  1. the existing running work was only aborted by triggering the abort source,
    but was not waited on to actually complete;
  2. the work info, which is supplementary data a work item uses, was
    overwritten by the new one; the old work, which was still running, might
    access deallocated or reused memory where the old work info used to be;
  3. per-work abort sources were not in use; only the main one was

Reorganised logic (a simplified sketch follows the list):

  1. allow no more than one running work item per NTP;
  2. store its belongings separately from those of the requested work if they
    are different;
  3. use both the main and individual (per-work) abort sources
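
A minimal sketch of the resulting per-NTP bookkeeping, in plain C++20 rather
than the actual Redpanda/Seastar types (std::stop_source stands in for an
abort source; work_info, work_slot and ntp_state are illustrative names, not
the real identifiers):

#include <optional>
#include <stop_token>
#include <string>

// Supplementary data a work item uses ("work info"); fields are illustrative.
struct work_info {
    std::string migration_id;
    std::string stage; // e.g. "prepare", "execute", "cancel"
};

// A work item together with its own abort source.
struct work_slot {
    work_info info;
    std::stop_source abort; // per-work abort source
};

// Per-NTP state: at most one running and at most one requested work item,
// each keeping its own work info, so a new request never overwrites the data
// a still-running work item may be reading.
struct ntp_state {
    std::optional<work_slot> running;
    std::optional<work_slot> requested;
};

Keeping the two slots separate is what removes the overwrite-while-running
hazard described above.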

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.2.x
  • v25.1.x
  • v24.3.x
  • v24.2.x

Release Notes

  • none

@vbotbuildovich
Collaborator

vbotbuildovich commented Jul 28, 2025

CI test results

test results on build#69755
test_class test_method test_arguments test_kind job_url test_status passed reason
EndToEndCloudTopicsTxTest test_write null integration https://buildkite.com/redpanda/redpanda/builds/69755#0198516c-59d7-4bbc-9397-756af75d70eb FLAKY 20/21 upstream reliability is '94.84066767830045'. current run reliability is '95.23809523809523'. drift is -0.39743 and the allowed drift is set to 50. The test should PASS
test results on build#69827
test_class test_method test_arguments test_kind job_url test_status passed reason
FeaturesMultiNodeTest test_license_upload_and_query null integration https://buildkite.com/redpanda/redpanda/builds/69827#01985586-8aec-4d81-98fe-560b0d76c8ef FLAKY 16/21 upstream reliability is '95.67567567567568'. current run reliability is '76.19047619047619'. drift is 19.4852 and the allowed drift is set to 50. The test should PASS
RandomNodeOperationsTest test_node_operations {"cloud_storage_type": 2, "compaction_mode": "adjacent_merge", "enable_failures": false, "mixed_versions": true, "with_iceberg": false} integration https://buildkite.com/redpanda/redpanda/builds/69827#01985586-8ae8-4feb-9bba-6ad374e9610a FLAKY 20/21 upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS
test results on build#69949
test_class test_method test_arguments test_kind job_url test_status passed reason
DataMigrationsApiTest test_creating_and_listing_migrations null integration https://buildkite.com/redpanda/redpanda/builds/69949#01985bb5-af59-49c4-97ad-54e561968789 FLAKY 19/21 upstream reliability is '100.0'. current run reliability is '90.47619047619048'. drift is 9.52381 and the allowed drift is set to 50. The test should PASS
DataMigrationsApiTest test_higher_level_migration_api null integration https://buildkite.com/redpanda/redpanda/builds/69949#01985bb3-ad2f-44f5-a757-8a7637903b2d FLAKY 16/21 upstream reliability is '100.0'. current run reliability is '76.19047619047619'. drift is 23.80952 and the allowed drift is set to 50. The test should PASS
RandomNodeOperationsTest test_node_operations {"cloud_storage_type": 2, "compaction_mode": "adjacent_merge", "enable_failures": false, "mixed_versions": false, "with_iceberg": false} integration https://buildkite.com/redpanda/redpanda/builds/69949#01985bb3-ad36-4f21-975e-d01b47933cf8 FLAKY 19/21 upstream reliability is '98.7551867219917'. current run reliability is '90.47619047619048'. drift is 8.279 and the allowed drift is set to 50. The test should PASS
test results on build#70067
test_class test_method test_arguments test_kind job_url test_status passed reason
DataMigrationsApiTest test_higher_level_migration_api null integration https://buildkite.com/redpanda/redpanda/builds/70067#01986553-ee36-45a8-9562-fe51d5d2502c FLAKY 11/21 upstream reliability is '100.0'. current run reliability is '52.38095238095239'. drift is 47.61905 and the allowed drift is set to 50. The test should PASS
EnterpriseFeaturesTest test_enable_features {"disable_trial": true, "feature": 5, "install_license": true} integration https://buildkite.com/redpanda/redpanda/builds/70067#0198659a-c268-4936-9c44-733ee9d2a905 FLAKY 16/21 upstream reliability is '100.0'. current run reliability is '76.19047619047619'. drift is 23.80952 and the allowed drift is set to 50. The test should PASS
RandomNodeOperationsTest test_node_operations {"cloud_storage_type": 1, "compaction_mode": "chunked_sliding_window", "enable_failures": false, "mixed_versions": false, "with_iceberg": false} integration https://buildkite.com/redpanda/redpanda/builds/70067#01986553-ee35-465d-bb06-e51d634ee93e FLAKY 20/21 upstream reliability is '98.93617021276596'. current run reliability is '95.23809523809523'. drift is 3.69807 and the allowed drift is set to 50. The test should PASS
src/v/cluster_link/tests/base_task_test src/v/cluster_link/tests/base_task_test unit https://buildkite.com/redpanda/redpanda/builds/70067#01986523-09f4-4e88-af4f-6d0f03da4a79 FAIL 0/1

Member

@dotnwat dotnwat left a comment


is this fixing a bug we can reference to justify the backport, or is there more context around the motivation for backporting this?

@bashtanov bashtanov force-pushed the data-migrations-worker-rework branch from 03bd809 to ee1c027 Compare July 29, 2025 08:41
@bashtanov
Contributor Author

@dotnwat just the bug. It invokes UB, and we're lucky (or ignorant) that it didn't result in anything serious. Should I add a release note line about it?

@mmaslankaprv
Member

@bashtanov can you add a bit more detail to the commit message for the second commit in this PR? Please provide the motivation for the changes and describe the idea behind the new work tracking logic.

@bashtanov bashtanov force-pushed the data-migrations-worker-rework branch 2 times, most recently from 3acb6f0 to e272f5f Compare July 30, 2025 13:54
@dotnwat
Member

dotnwat commented Jul 30, 2025

It invokes UB,

Can you expand on this, and on why a refactor is needed as opposed to backporting a fix for the UB and refactoring upstream?

@bashtanov
Contributor Author

@dotnwat I've updated the main commit message and the PR description accordingly. I cannot think of a way to eliminate the UB without major changes in the logic. We do need to store up to two "work info" objects per NTP, and the rest of the change is about juggling them correctly.

@dotnwat
Member

dotnwat commented Jul 31, 2025

I'm probably vastly simplifying things, but

  1. work info, which is supplementary data a work uses, was overwritten
    by the new one; the old work which was still running might access
    deallocated or reused memory where the old work info was;

Sounds like it's just a matter of protecting a shared data structure?

@bashtanov
Contributor Author

Invalid memory access is not the only problem that needs to be fixed. Allowing concurrent work items on the same NTP is also wrong, because they can conflict or run in the wrong order. E.g. imagine a migration progressing up to some point and then being cancelled. Quite possibly the operations on the affected partitions will be opposites of each other, e.g. block writes when preparing to migrate away and then unblock them to cancel the migration.

Sounds like it's just a matter of protecting a shared data structure?

Well, it is not shared; we need to store both separately, as one is needed for the running task and the other for the requested one. We need them both physically on the shard, so we either need to alter the ntp_state structure to accommodate them (which is what I did) or pass a shared pointer or a value copy to do_work (which would be a major change too).

To protect against concurrent execution we would need a mutex. It would introduce an implicit queue of waiters, while in reality we need simpler logic, as only the last one in the queue is needed.

All in all, my attempts to make fewer changes resulted in logic that was still quite broken, or at least gave much less confidence in its correctness.
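
To make the "only the last request matters" point concrete, here is a rough,
self-contained sketch in plain C++20 (not the real code: start_work is a
hypothetical hook standing in for launching do_work, and the struct names are
illustrative, not the real identifiers):

#include <optional>
#include <stop_token>
#include <utility>

struct work_info { int migration_id; int stage; };            // illustrative fields
struct work_slot { work_info info; std::stop_source abort; }; // per-work abort source
struct ntp_state { std::optional<work_slot> running, requested; };

// Hypothetical hook that actually launches the work described by the running slot.
void start_work(ntp_state&) { /* launch the worker; omitted in this sketch */ }

// A new request aborts the running work and overwrites any previously queued
// request, so the "queue" never grows beyond one element and only the most
// recent request survives; a mutex/semaphore wait list would not give us that.
void request_work(ntp_state& st, work_info info) {
    if (!st.running) {
        st.running.emplace(work_slot{std::move(info), std::stop_source{}});
        start_work(st);
        return;
    }
    st.running->abort.request_stop();                                     // ask it to stop...
    st.requested.emplace(work_slot{std::move(info), std::stop_source{}}); // ...but keep its own data intact
}

// Called only when the running work has actually completed (not merely been
// signalled to abort); only then is the queued request promoted and started.
void on_work_finished(ntp_state& st) {
    st.running.reset();
    if (st.requested) {
        st.running = std::move(st.requested);
        st.requested.reset();
        start_work(st);
    }
}

The sketch only shows the bookkeeping; in the real backend the running work
also has to be awaited before the promoted request is started.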

@dotnwat
Member

dotnwat commented Aug 1, 2025

@bashtanov please capture our discussions in the commit message. A person reading the commit in the future should be able to

  1. easily understand what the UB was and how it was caused
  2. understand what the fix was

Previously, the work tracking logic was flawed. When work was already running
for an NTP and the backend requested another work item (e.g. for a different
stage or for a new migration), the request was handled inappropriately.
1) The existing running work was only aborted by triggering the abort source,
but was not waited on to actually stop. As a result, work items on the same
NTP could run concurrently, in particular in the wrong order. This might lead
to incorrect results, e.g. topic writes unblocked and then blocked instead of
the other way round.
2) The work info, which is supplementary data a work item uses, was
overwritten by the new one. The old work, which was still running, might
access deallocated or reused memory where the old work info used to be, thus
triggering UB.
3) Per-work abort sources were not in use; only the main one was.

Reorganised logic:
1) Allow no more than one running work item per NTP and no more than one
queued. Semaphores and mutexes would not help here, as they maintain a
potentially long queue of waiters; we allow only one waiter.
2) Store the running work's belongings separately from those of the requested
work if they are different.
3) Use both the main and individual abort sources.
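
As an illustration of point 3, linking a per-work abort source to the main one
could look roughly like the following in plain C++20 (std::stop_source and
std::stop_callback stand in for Seastar's abort_source; linked_abort is a
made-up name, not a type from the codebase):

#include <functional>
#include <stop_token>
#include <utility>

// Forwards a stop request from the main (shard-wide) source to the work's own
// source, so each work item only has to watch its individual token.
struct linked_abort {
    std::stop_source per_work;
    std::stop_callback<std::function<void()>> forward;

    explicit linked_abort(std::stop_token main_token)
      : forward(std::move(main_token), [this] { per_work.request_stop(); }) {}

    std::stop_token token() const { return per_work.get_token(); }
};

int main() {
    std::stop_source main_abort;
    linked_abort work_abort(main_abort.get_token());

    main_abort.request_stop(); // aborting the main source...
    // ...also cancels the per-work token, while aborting per_work alone would
    // not affect the main source or any other work item.
    return work_abort.token().stop_requested() ? 0 : 1;
}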
@bashtanov bashtanov force-pushed the data-migrations-worker-rework branch from e272f5f to 12a4420 Compare August 1, 2025 10:16
@bashtanov bashtanov merged commit 98cf9eb into redpanda-data:dev Aug 1, 2025
18 checks passed
@vbotbuildovich
Collaborator

/backport v25.2.x

@vbotbuildovich
Collaborator

/backport v25.1.x

@vbotbuildovich
Collaborator

/backport v24.3.x

@vbotbuildovich
Collaborator

/backport v24.2.x

@vbotbuildovich
Collaborator

Failed to create a backport PR to v24.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-27014-v24.3.x-634 remotes/upstream/v24.3.x
git cherry-pick -x 3beb2f0d98 d9f07a7bc2 12a4420c7a

Workflow run logs.

@vbotbuildovich
Collaborator

Failed to create a backport PR to v25.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-27014-v25.1.x-237 remotes/upstream/v25.1.x
git cherry-pick -x 3beb2f0d98 d9f07a7bc2 12a4420c7a

Workflow run logs.

@vbotbuildovich
Collaborator

Failed to create a backport PR to v24.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-27014-v24.2.x-37 remotes/upstream/v24.2.x
git cherry-pick -x 3beb2f0d98 d9f07a7bc2 12a4420c7a

Workflow run logs.

@vbotbuildovich
Collaborator

Failed to create a backport PR to v25.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-27014-v25.2.x-967 remotes/upstream/v25.2.x
git cherry-pick -x 3beb2f0d98 d9f07a7bc2 12a4420c7a

Workflow run logs.
