[recipe] Add recipe demo to use StreamingDataset & StreamingDataLoader #93
Jixixi2020 wants to merge 4 commits into Ascend:main
Conversation
CLA Signature Guide
@Jixixi2020, thanks for your pull request. The following commit(s) are not associated with a signed Contributor License Agreement (CLA).
To sign the CLA, click here. To check whether your email is configured correctly, refer to the FAQs. Once you've signed the CLA or updated your email, please comment.
@NINGBENZHE Please help review this recipe~
Pull request overview
Adds a new recipe-style demo (streaming_dataloader_demo.py) showing how to connect multiple asynchronous RL-like pipeline stages via StreamingDataset + StreamingDataLoader, where each stage reads only the fields it needs and writes derived fields back into the same partition.
Changes:
- Introduces a Ray-based, decentralized “worker-per-stage” pipeline demo (rollout/ref/actor/reward/update).
- Demonstrates field-level streaming reads with StreamingDataset(data_fields=...) and writes back via tq_client.put(..., metadata=batch_meta) (a sketch follows this list).
- Adds a small driver loop that inserts prompts, waits for stage completion, simulates weight sync, and clears partitions.
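To make the field-level pattern concrete, here is a minimal sketch of what one stage could look like. BaseStageWorker and the field names are illustrative assumptions, not taken from the diff:

```python
import torch


class BaseStageWorker:  # stub standing in for the demo's actual worker base class
    pass


class RolloutWorker(BaseStageWorker):
    stage_name = "rollout"

    def input_fields(self) -> list[str]:
        # This stage reads only the prompt tokens and their sample ids.
        return ["prompt", "sample_id"]

    def compute(self, batch: dict, batch_meta) -> tuple[dict, list[str]]:
        # Fake generation: derive a "response" field from the prompt.
        output = {"response": torch.zeros_like(batch["prompt"])}
        # Downstream stages (ref/reward/...) read "response" back out of
        # the same partition through their own data_fields lists.
        return output, list(output.keys())
```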
Excerpts from the diff of streaming_dataloader_demo.py:

@@ -0,0 +1,428 @@

```python
import argparse
```

```python
num_steps: int
pipeline_depth: int
global_batch_size: int
micro_batch_size: int
prompt_length: int
response_length: int
weight_sync_seconds: float
empty_poll_log_interval: int
num_data_storage_units: int
```
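These fields read like the demo's step/batch configuration. A minimal sketch of how they might be grouped, assuming a dataclass; the name and default values below are illustrative, not from the diff:

```python
from dataclasses import dataclass


@dataclass
class DemoConfig:
    # Pipeline shape: how many global steps to run and how many steps
    # may be in flight at once (relevant for off-policy variants).
    num_steps: int = 4
    pipeline_depth: int = 2
    # Batch geometry for the streaming dataset/dataloader.
    global_batch_size: int = 8
    micro_batch_size: int = 2
    # Token-length knobs for the fake rollout data.
    prompt_length: int = 16
    response_length: int = 32
    # Simulated weight-sync latency and logging cadence for empty polls.
    weight_sync_seconds: float = 1.0
    empty_poll_log_interval: int = 10
    # Number of TransferQueue storage units backing the demo.
    num_data_storage_units: int = 1
```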
```python
def _run_step(self, step: int) -> None:
    partition_id = f"{self.cfg_demo.partition_prefix}_{step}"
    dataloader = self._build_dataloader(partition_id)

    for batch, batch_meta in dataloader:
        sample_ids = batch["sample_id"].view(-1).tolist()
        logger.info(f"[{self.worker_name}] step={step} consumed sample_ids={sample_ids}")

        output, written_fields = self.compute(batch, batch_meta)
        self.tq_client.put(output, metadata=batch_meta)

        count = ray.get(self.tracker.record.remote(self.stage_name, step, len(sample_ids)))
        logger.info(
            f"[{self.worker_name}] step={step} done -> written_fields={written_fields}, "
            f"{self.stage_name}_count={count}/{self.cfg_demo.global_batch_size}"
        )

    ray.get(self.tracker.record_done.remote(self.stage_name, step))
    logger.info(f"[{self.worker_name}] step={step} worker_done recorded")
```
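Note that passing batch_meta back into tq_client.put is what writes the derived fields back to the same samples in the same partition, so downstream stages can pick them up through their own data_fields (per the overview above).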
```python
def _build_dataloader(self, partition_id: str) -> StreamingDataLoader:
    dataset = StreamingDataset(
        config=self.cfg,
        batch_size=self.cfg_demo.micro_batch_size,
        micro_batch_size=self.cfg_demo.micro_batch_size,
        data_fields=self.input_fields(),
        partition_id=partition_id,
        task_name=f"{self.cfg_demo.task_name_prefix}_{self.stage_name}",
        dp_rank=self.worker_id,
        should_check_consumption_status=True,
    )
    return StreamingDataLoader(dataset=dataset, num_workers=0, prefetch_factor=None)
```
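A note on the loader arguments: with num_workers=0, batches are fetched in the worker process itself, and (as with the standard PyTorch DataLoader, which rejects a prefetch_factor when there are no loader worker processes) prefetch_factor is left as None. Assuming StreamingDataLoader follows that convention, this keeps the demo single-process and easy to trace.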
```python
ray.get(refs)
logger.info("demo done!")
return []
```

```python
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s.%(msecs)03d - %(levelname)s - %(message)s",
    datefmt="%H:%M:%S",
)
logger = logging.getLogger(__name__)
```
```python
for step in range(self.config.num_steps):
    self._put_prompt(step)
    self._wait_complete(step)
```
The data flow here is largely fine, but as written the demo is an on-policy scenario and doesn't exercise any off-policy logic. Consider enriching it.
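Picking up this reviewer's point, here is a minimal sketch of an off-policy variant of the driver loop above, assuming the pipeline_depth config field bounds the number of in-flight steps (_put_prompt/_wait_complete are the demo's own helpers; the restructuring itself is only a suggestion):

```python
# Hypothetical off-policy driver: allow up to pipeline_depth steps in
# flight instead of blocking on each step before starting the next.
in_flight: list[int] = []
for step in range(self.config.num_steps):
    self._put_prompt(step)
    in_flight.append(step)
    if len(in_flight) >= self.config.pipeline_depth:
        # Only wait for the oldest outstanding step once the pipeline is
        # full, so later steps consume data produced under stale weights.
        self._wait_complete(in_flight.pop(0))
# Drain whatever is still in flight at the end of training.
for step in in_flight:
    self._wait_complete(step)
```

With pipeline_depth=1 this degenerates to the original on-policy loop, so one knob could cover both scenarios.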
```python
ray.init()
try:
    demo = DecentralizedInheritedWorkerPipelineDemo(cfg, build_tq_config(cfg))
```
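The hunk cuts off inside the try block. A plausible completion, assuming the entry point just runs fit() and shuts Ray down afterwards (parse_args is a hypothetical helper; build_tq_config is taken from the snippet):

```python
def main() -> None:
    cfg = parse_args()  # hypothetical: builds the demo config from argparse
    ray.init()
    try:
        demo = DecentralizedInheritedWorkerPipelineDemo(cfg, build_tq_config(cfg))
        demo.fit()
    finally:
        # Assumption: tear down the local Ray cluster even if the demo fails.
        ray.shutdown()
```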
```python
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```
Suggest referencing Relax here.
And we could call it relax_demo.py directly?
```python
def fit(self) -> list[dict]:
    logger.info("=" * 72)
    logger.info("TransferQueue StreamingDataLoader Decentralized Inherited Worker Pipeline Demo")
```
Remember to modify this banner when changing the file name.
```python
if __name__ == "__main__":
    main()
```
Please refer to recipe-check.yml to add this file to the workflow.
Summary
Adds a new demo to illustrate how to use StreamingDataset and StreamingDataLoader in a simple data-centric, asynchronous RL-style workflow. The demo shows a decentralized worker-per-stage pipeline where each stage independently consumes the fields it needs from the queue and writes its outputs back for downstream stages. It is designed as a readable example of how streaming data access can connect multiple RL pipeline stages without tightly coupling execution to a centralized stage-by-stage scheduler.
Changes
- StreamingDataset and StreamingDataLoader usage