
Data migrations worker rework #27014


Merged

Conversation

bashtanov
Contributor

@bashtanov bashtanov commented Jul 28, 2025

Previously, the work tracking logic was flawed. When work was already running
for an NTP and the backend requested another work item (e.g. for a different
stage or for a new migration), the request was handled inappropriately:

  1. the existing running work was only aborted by triggering the abort source,
    but was not waited on to actually complete;
  2. the work info, which is supplementary data a work item uses, was
    overwritten by the new one; the old work, which was still running, might
    access deallocated or reused memory where the old work info used to be;
  3. per-work abort sources were not in use; only the main one was

Reorganised logic (a simplified sketch follows the list):

  1. allow no more than one running work item per NTP;
  2. store its belongings separately from those of the requested work if they
    are different;
  3. use both the main and individual (per-work) abort sources
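
A minimal sketch of the resulting per-NTP bookkeeping, in plain C++20 rather
than the actual Redpanda/Seastar types (std::stop_source stands in for an
abort source; work_info, work_slot and ntp_state are illustrative names, not
the real identifiers):

#include <optional>
#include <stop_token>
#include <string>

// Supplementary data a work item uses ("work info"); fields are illustrative.
struct work_info {
    std::string migration_id;
    std::string stage; // e.g. "prepare", "execute", "cancel"
};

// A work item together with its own abort source.
struct work_slot {
    work_info info;
    std::stop_source abort; // per-work abort source
};

// Per-NTP state: at most one running and at most one requested work item,
// each keeping its own work info, so a new request never overwrites the data
// a still-running work item may be reading.
struct ntp_state {
    std::optional<work_slot> running;
    std::optional<work_slot> requested;
};

Keeping the two slots separate is what removes the overwrite-while-running
hazard described above.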

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.2.x
  • v25.1.x
  • v24.3.x
  • v24.2.x

Release Notes

  • none

@vbotbuildovich
Collaborator

vbotbuildovich commented Jul 28, 2025

CI test results

test results on build#69755
test_class test_method test_arguments test_kind job_url test_status passed reason
EndToEndCloudTopicsTxTest test_write null integration https://buildkite.com/redpanda/redpanda/builds/69755#0198516c-59d7-4bbc-9397-756af75d70eb FLAKY 20/21 upstream reliability is '94.84066767830045'. current run reliability is '95.23809523809523'. drift is -0.39743 and the allowed drift is set to 50. The test should PASS
test results on build#69827
test_class test_method test_arguments test_kind job_url test_status passed reason
FeaturesMultiNodeTest test_license_upload_and_query null integration https://buildkite.com/redpanda/redpanda/builds/69827#01985586-8aec-4d81-98fe-560b0d76c8ef FLAKY 16/21 upstream reliability is '95.67567567567568'. current run reliability is '76.19047619047619'. drift is 19.4852 and the allowed drift is set to 50. The test should PASS
RandomNodeOperationsTest test_node_operations {"cloud_storage_type": 2, "compaction_mode": "adjacent_merge", "enable_failures": false, "mixed_versions": true, "with_iceberg": false} integration https://buildkite.com/redpanda/redpanda/builds/69827#01985586-8ae8-4feb-9bba-6ad374e9610a FLAKY 20/21 upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS
test results on build#69949
test_class test_method test_arguments test_kind job_url test_status passed reason
DataMigrationsApiTest test_creating_and_listing_migrations null integration https://buildkite.com/redpanda/redpanda/builds/69949#01985bb5-af59-49c4-97ad-54e561968789 FLAKY 19/21 upstream reliability is '100.0'. current run reliability is '90.47619047619048'. drift is 9.52381 and the allowed drift is set to 50. The test should PASS
DataMigrationsApiTest test_higher_level_migration_api null integration https://buildkite.com/redpanda/redpanda/builds/69949#01985bb3-ad2f-44f5-a757-8a7637903b2d FLAKY 16/21 upstream reliability is '100.0'. current run reliability is '76.19047619047619'. drift is 23.80952 and the allowed drift is set to 50. The test should PASS
RandomNodeOperationsTest test_node_operations {"cloud_storage_type": 2, "compaction_mode": "adjacent_merge", "enable_failures": false, "mixed_versions": false, "with_iceberg": false} integration https://buildkite.com/redpanda/redpanda/builds/69949#01985bb3-ad36-4f21-975e-d01b47933cf8 FLAKY 19/21 upstream reliability is '98.7551867219917'. current run reliability is '90.47619047619048'. drift is 8.279 and the allowed drift is set to 50. The test should PASS
test results on build#70067
test_class test_method test_arguments test_kind job_url test_status passed reason
DataMigrationsApiTest test_higher_level_migration_api null integration https://buildkite.com/redpanda/redpanda/builds/70067#01986553-ee36-45a8-9562-fe51d5d2502c FLAKY 11/21 upstream reliability is '100.0'. current run reliability is '52.38095238095239'. drift is 47.61905 and the allowed drift is set to 50. The test should PASS
EnterpriseFeaturesTest test_enable_features {"disable_trial": true, "feature": 5, "install_license": true} integration https://buildkite.com/redpanda/redpanda/builds/70067#0198659a-c268-4936-9c44-733ee9d2a905 FLAKY 16/21 upstream reliability is '100.0'. current run reliability is '76.19047619047619'. drift is 23.80952 and the allowed drift is set to 50. The test should PASS
RandomNodeOperationsTest test_node_operations {"cloud_storage_type": 1, "compaction_mode": "chunked_sliding_window", "enable_failures": false, "mixed_versions": false, "with_iceberg": false} integration https://buildkite.com/redpanda/redpanda/builds/70067#01986553-ee35-465d-bb06-e51d634ee93e FLAKY 20/21 upstream reliability is '98.93617021276596'. current run reliability is '95.23809523809523'. drift is 3.69807 and the allowed drift is set to 50. The test should PASS
src/v/cluster_link/tests/base_task_test src/v/cluster_link/tests/base_task_test unit https://buildkite.com/redpanda/redpanda/builds/70067#01986523-09f4-4e88-af4f-6d0f03da4a79 FAIL 0/1

Member

@dotnwat dotnwat left a comment


is this fixing a bug we can reference to justify the backport, or is there more context around the motivation for backporting this?

@bashtanov bashtanov force-pushed the data-migrations-worker-rework branch from 03bd809 to ee1c027 Compare July 29, 2025 08:41
@bashtanov
Contributor Author

@dotnwat just the bug. It invokes UB, and we're lucky (or ignorant) that it didn't result in anything serious. Should I add a release note line about it?

@mmaslankaprv
Member

@bashtanov can you add a bit more detail to the commit message for the second commit in this PR? Please provide the motivation for the changes and describe the idea behind the new work tracking logic.

@bashtanov bashtanov force-pushed the data-migrations-worker-rework branch 2 times, most recently from 3acb6f0 to e272f5f Compare July 30, 2025 13:54
@dotnwat
Member

dotnwat commented Jul 30, 2025

It invokes UB,

Can you expand on this, and on why a refactor is needed as opposed to backporting a fix for the UB and refactoring upstream?

@bashtanov
Contributor Author

@dotnwat I've updated the main commit message and the PR description accordingly. I cannot think of a way to eliminate the UB without major changes in the logic. We do need to store up to two "work info" objects per NTP, and the rest of the change is about juggling them correctly.

@dotnwat
Member

dotnwat commented Jul 31, 2025

I'm probably vastly simplifying things, but

  1. work info, which is supplementary data a work uses, was overwritten
    by the new one; the old work which was still running might access
    deallocated or reused memory where the old work info was;

Sounds like it's just a matter of protecting a shared data structure?

@bashtanov
Contributor Author

Invalid memory access is not the only problem that needs to be fixed. Allowing concurrent work items on the same NTP is also wrong, because they can conflict or run in the wrong order. E.g. imagine a migration progressing up to some point and then being cancelled. Quite possibly the operations on the affected partitions will be opposites of each other, e.g. block writes when preparing to migrate away and then unblock them to cancel the migration.

Sounds like it's just a matter of protecting a shared data structure?

Well, it is not shared; we need to store both separately, as one is needed for the running task and the other for the requested one. We need them both physically on the shard, so we either need to alter the ntp_state structure to accommodate them (which is what I did) or pass a shared pointer or a value copy to do_work (which would be a major change too).

To protect against concurrent execution we would need a mutex. It would introduce an implicit queue of waiters, while in reality we need simpler logic, as only the last one in the queue is needed.

All in all, my attempts to make fewer changes resulted in logic that was still quite broken, or at least gave much less confidence in its correctness.
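
To make the "only the last request matters" point concrete, here is a rough,
self-contained sketch in plain C++20 (not the real code: start_work is a
hypothetical hook standing in for launching do_work, and the struct names are
illustrative, not the real identifiers):

#include <optional>
#include <stop_token>
#include <utility>

struct work_info { int migration_id; int stage; };            // illustrative fields
struct work_slot { work_info info; std::stop_source abort; }; // per-work abort source
struct ntp_state { std::optional<work_slot> running, requested; };

// Hypothetical hook that actually launches the work described by the running slot.
void start_work(ntp_state&) { /* launch the worker; omitted in this sketch */ }

// A new request aborts the running work and overwrites any previously queued
// request, so the "queue" never grows beyond one element and only the most
// recent request survives; a mutex/semaphore wait list would not give us that.
void request_work(ntp_state& st, work_info info) {
    if (!st.running) {
        st.running.emplace(work_slot{std::move(info), std::stop_source{}});
        start_work(st);
        return;
    }
    st.running->abort.request_stop();                                     // ask it to stop...
    st.requested.emplace(work_slot{std::move(info), std::stop_source{}}); // ...but keep its own data intact
}

// Called only when the running work has actually completed (not merely been
// signalled to abort); only then is the queued request promoted and started.
void on_work_finished(ntp_state& st) {
    st.running.reset();
    if (st.requested) {
        st.running = std::move(st.requested);
        st.requested.reset();
        start_work(st);
    }
}

The sketch only shows the bookkeeping; in the real backend the running work
also has to be awaited before the promoted request is started.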

@dotnwat
Member

dotnwat commented Aug 1, 2025

@bashtanov please capture our discussions in the commit message. A person reading the commit in the future should be able to

  1. easily understand what the UB was and how it was caused
  2. understand what the fix was

Previously, the work tracking logic was flawed. When work was already running
for an NTP and the backend requested another work item (e.g. for a different
stage or for a new migration), the request was handled inappropriately.
1) The existing running work was only aborted by triggering the abort source,
but was not waited on to actually stop. As a result, work items on the same
NTP could run concurrently, in particular in the wrong order. This might lead
to incorrect results, e.g. topic writes unblocked and then blocked instead of
the other way round.
2) The work info, which is supplementary data a work item uses, was
overwritten by the new one. The old work, which was still running, might
access deallocated or reused memory where the old work info used to be, thus
triggering UB.
3) Per-work abort sources were not in use; only the main one was.

Reorganised logic:
1) Allow no more than one running work item per NTP and no more than one
queued. Semaphores and mutexes would not help here, as they maintain a
potentially long queue of waiters; we allow only one waiter.
2) Store the running work's belongings separately from those of the requested
work if they are different.
3) Use both the main and individual abort sources.
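
As an illustration of point 3, linking a per-work abort source to the main one
could look roughly like the following in plain C++20 (std::stop_source and
std::stop_callback stand in for Seastar's abort_source; linked_abort is a
made-up name, not a type from the codebase):

#include <functional>
#include <stop_token>
#include <utility>

// Forwards a stop request from the main (shard-wide) source to the work's own
// source, so each work item only has to watch its individual token.
struct linked_abort {
    std::stop_source per_work;
    std::stop_callback<std::function<void()>> forward;

    explicit linked_abort(std::stop_token main_token)
      : forward(std::move(main_token), [this] { per_work.request_stop(); }) {}

    std::stop_token token() const { return per_work.get_token(); }
};

int main() {
    std::stop_source main_abort;
    linked_abort work_abort(main_abort.get_token());

    main_abort.request_stop(); // aborting the main source...
    // ...also cancels the per-work token, while aborting per_work alone would
    // not affect the main source or any other work item.
    return work_abort.token().stop_requested() ? 0 : 1;
}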
@bashtanov bashtanov force-pushed the data-migrations-worker-rework branch from e272f5f to 12a4420 Compare August 1, 2025 10:16
@bashtanov bashtanov merged commit 98cf9eb into redpanda-data:dev Aug 1, 2025
18 checks passed
@vbotbuildovich
Collaborator

/backport v25.2.x

@vbotbuildovich
Collaborator

/backport v25.1.x

@vbotbuildovich
Collaborator

/backport v24.3.x

@vbotbuildovich
Collaborator

/backport v24.2.x

@vbotbuildovich
Collaborator

Failed to create a backport PR to v24.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-27014-v24.3.x-634 remotes/upstream/v24.3.x
git cherry-pick -x 3beb2f0d98 d9f07a7bc2 12a4420c7a

Workflow run logs.

@vbotbuildovich
Collaborator

Failed to create a backport PR to v25.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-27014-v25.1.x-237 remotes/upstream/v25.1.x
git cherry-pick -x 3beb2f0d98 d9f07a7bc2 12a4420c7a

Workflow run logs.

@vbotbuildovich
Collaborator

Failed to create a backport PR to v24.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-27014-v24.2.x-37 remotes/upstream/v24.2.x
git cherry-pick -x 3beb2f0d98 d9f07a7bc2 12a4420c7a

Workflow run logs.

@vbotbuildovich
Collaborator

Failed to create a backport PR to v25.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-27014-v25.2.x-967 remotes/upstream/v25.2.x
git cherry-pick -x 3beb2f0d98 d9f07a7bc2 12a4420c7a

Workflow run logs.
