[fix] shared production_status tensor across data partitions#127

Merged
0oshowero0 merged 4 commits into
Ascend:mainfrom
zTonyZhao:fix-shared-production_status
Jun 22, 2026

Conversation

@zTonyZhao

Contributor

Create a fresh production_status tensor for each DataPartitionStatus instance.

Root Cause

production_status was declared with a tensor as its dataclass default value. That tensor was created once, at class definition time, so multiple DataPartitionStatus instances could end up sharing the same tensor object. When one partition reused a released global index and marked it ready in place, another, already-cleared partition could observe the same ready bit.
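
The sharing described above can be reproduced without torch. The sketch below is an illustration only: FakeStatus, SharedPartition and IsolatedPartition are hypothetical names, and FakeStatus merely stands in for a mutable tensor. Because it is hashable by identity, dataclasses accepts it as a default, unlike a list, which raises ValueError. Using default_factory is one common way to get a fresh object per instance; the actual change in this PR may be implemented differently.

```python
from dataclasses import dataclass, field


class FakeStatus:
    """Hypothetical stand-in for a mutable tensor (not a torch API)."""

    def __init__(self, size: int) -> None:
        self.bits = [0] * size

    def mark_ready(self, index: int) -> None:
        # In-place update, analogous to tensor[index] = 1.
        self.bits[index] = 1


@dataclass
class SharedPartition:
    # Buggy pattern: the default object is built once, when the class is defined.
    production_status: FakeStatus = FakeStatus(4)


@dataclass
class IsolatedPartition:
    # Fixed pattern: default_factory is called once per instance.
    production_status: FakeStatus = field(default_factory=lambda: FakeStatus(4))


# Two "partitions" using the buggy pattern share one status object.
p1, p2 = SharedPartition(), SharedPartition()
p2.production_status.mark_ready(0)
shared_leak = p1.production_status.bits[0]  # p1 sees the bit set by p2

# With default_factory, each instance owns its own status object.
q1, q2 = IsolatedPartition(), IsolatedPartition()
q2.production_status.mark_ready(0)
isolated_leak = q1.production_status.bits[0]  # q1 is unaffected
```

Here shared_leak ends up as 1 while isolated_leak stays 0, which mirrors the before and after behaviour reported in the PoC.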

Impact

A cleared partition could incorrectly return stale ready metadata. In the KV path, because storage keys are generated as global_index@field, this could cause reads from one partition to include data written by another partition.
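
To make the KV impact concrete, here is a minimal sketch that assumes only the key format quoted above (global_index@field). The storage_key helper and the in-memory dict are hypothetical stand-ins for illustration, not the project's storage backend.

```python
def storage_key(global_index: int, field_name: str) -> str:
    # Key format taken from the PR description: global_index@field.
    # Note that the partition id is not part of the key.
    return f"{global_index}@{field_name}"


# Hypothetical in-memory stand-in for the KV storage.
store: dict[str, int] = {}

store[storage_key(0, "x")] = 1  # p1 writes sample 0
del store[storage_key(0, "x")]  # p1 clears it; global index 0 is released
store[storage_key(0, "x")] = 2  # p2 reuses global index 0

# If stale metadata still reports index 0 as ready for p1,
# a read on behalf of p1 resolves to the same key and returns p2's value.
p1_read = store.get(storage_key(0, "x"))
```

Since the key carries no partition id, correct per-partition readiness metadata is the only thing that keeps p1 from reading the value written by p2.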

PoC

def client_api_poc():
    import torch
    import transfer_queue as tq
    from tensordict import TensorDict

    client = tq.get_client()
    field = "x"

    # p1 writes then clears one sample.
    # p1_meta.global_indexes == [0]
    p1_meta = client.put(
        data=TensorDict({field: torch.tensor([[1]])}, batch_size=[1]),
        partition_id="p1",
    )
    client.clear_samples(p1_meta)

    # p2 may reuse the released global_index 0.
    # Before the fix, this updates the shared production_status tensor
    # and makes the already-cleared p1 look ready again.
    client.put(
        data=TensorDict({field: torch.tensor([[2]])}, batch_size=[1]),
        partition_id="p2",
    )

    leaked = client.get_meta(
        data_fields=[field],
        batch_size=1,
        partition_id="p1",
        mode="fetch",
        task_name="repro",
    )

    # Before fix:
    #   leaked.size == 1
    #   leaked.global_indexes == [0]
    #   leaked.partition_ids == ["p1"]
    #   leaked.field_names == []
    #   leaked.is_ready == True
    #   leaked.production_status.tolist() == [1]
    #
    # After fix:
    #   leaked.size == 0
    #   leaked.global_indexes == []
    #   leaked.partition_ids == []
    #   leaked.field_names == []
    #   leaked.is_ready == False
    #   leaked.production_status.tolist() == []

    assert leaked.size == 0

Tests

  • python -m pytest tests/test_controller_data_partitions.py -q
  • python -m pytest tests/e2e/test_kv_interface_e2e.py::TestKVClearE2E::test_kv_clear_does_not_leak_reused_index_across_partitions -q

@ascend-robot

CLA Signature Guide

@zTonyZhao, thanks for your pull request.

The following commit(s) are not associated with a signed Contributor License Agreement (CLA).

Commit: 038db8d [fix] shared production_status ...
Reason: The email used in the commit is not linked to a signed CLA. Please verify that it matches the email you used when signing the CLA.

To sign CLA, click here.

To check if your email is configured correctly, refer to the FAQs.

Once you've signed the CLA or updated your email, please comment /check-cla to revalidate CLA status.

@ascend-robot

CLA Signature Guide

@zTonyZhao, thanks for your pull request.

The following commit(s) are not associated with a signed Contributor License Agreement (CLA).

Commit: 5ff7cd3 [fix] typo
Reason: The email used in the commit is not linked to a signed CLA. Please verify that it matches the email you used when signing the CLA.

To sign CLA, click here.

To check if your email is configured correctly, refer to the FAQs.

Once you've signed the CLA or updated your email, please comment /check-cla to revalidate CLA status.

@zTonyZhao

Contributor Author

/check-cla

@0oshowero0

Collaborator

Thank you for your contribution! Please:

  1. sign the CLA
  2. run pre-commit with pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
  3. [optional] we recommend using git commit -s to pass the DCO check

@zTonyZhao

Contributor Author

/check-cla

@ascend-robot


CLA Signature Pass

zTonyZhao, thanks for your pull request. All authors of the commits have signed the CLA. 👍

@dpj135

dpj135 commented Jun 22, 2026

Contributor

Please use git commit -s, or amend your commits to add a Signed-off-by line.

@0oshowero0 0oshowero0 merged commit cef0dc2 into Ascend:main Jun 22, 2026
7 of 8 checks passed
4 participants