[Data] Add time to first batch metric for dataset iterators#55758

Merged
justinvyu merged 12 commits into ray-project:master from xinyuangui2:xgui/add-first-batch-stats
Aug 25, 2025

Conversation

xinyuangui2 (Contributor) commented Aug 19, 2025

Why are these changes needed?

The time to first batch is usually much longer than for subsequent batches because it includes the time needed for the pipeline to warm up: the iterator only receives a batch once the first few blocks have made it through all stages of the data pipeline and been piped to the train worker consumers.

Once prefetching is active and the pipeline reaches a steady state, the time to produce subsequent batches is much lower.

This PR adds a metric to track the time to first batch.
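Conceptually, the metric is a timestamp taken when iteration starts, compared against the moment the first batch arrives. A minimal plain-Python sketch of that idea (illustrative only; `FirstBatchTimer` is a hypothetical name, not part of Ray's API or this PR's implementation):

```python
import time


class FirstBatchTimer:
    """Wrap a batch iterator and record how long the first batch takes.

    Illustrative sketch of the time-to-first-batch concept; not Ray's code.
    """

    def __init__(self, batch_iter):
        self._it = iter(batch_iter)
        self._start = time.perf_counter()  # iteration start time
        self.time_to_first_batch = None
        self.batches = 0

    def __iter__(self):
        return self

    def __next__(self):
        batch = next(self._it)  # raises StopIteration when exhausted
        if self.time_to_first_batch is None:
            # Only the first batch pays the pipeline warm-up cost.
            self.time_to_first_batch = time.perf_counter() - self._start
        self.batches += 1
        return batch
```

After iteration, `time_to_first_batch` holds the warm-up wait while the per-batch loop timing covers the steady state.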

Example:

Operator 0 FromItems: 1 tasks executed, 200 blocks produced in 0.28s
* Remote wall time: 930.56us min, 62.8ms max, 1.31ms mean, 261.67ms total
* Remote cpu time: 1.14ms min, 64.22ms max, 1.59ms mean, 318.9ms total
* UDF time: 0us min, 0us max, 0.0us mean, 0us total
* Peak heap memory usage (MiB): 0.0 min, 0.0 max, 0 mean
* Output num rows per block: 10 min, 10 max, 10 mean, 2000 total
* Output size bytes per block: 240 min, 240 max, 240 mean, 48000 total
* Output rows per task: 2000 min, 2000 max, 2000 mean, 1 tasks used
* Tasks per node: 1 min, 1 max, 1 mean; 1 nodes used
* Operator throughput:
        * Ray Data throughput: 7199.477407213313 rows/s
        * Estimated single node throughput: 7643.302102326079 rows/s

Dataset iterator time breakdown:
    * Total time in Ray Data iterator initialization code: 74.52ms
    * Total time user thread is blocked by Ray Data iter_batches: 167.44ms
    * Total time spent waiting for the first batch after starting iteration: 10.92ms
    * Total execution time for user thread: 424.62ms
* Batch iteration time breakdown (summed across prefetch threads):
    * In ray.get(): 537.44us min, 8.98ms max, 984.63us avg, 158.53ms total
    * In batch creation: 16.77us min, 513.99us max, 53.56us avg, 85.97ms total
    * In batch formatting: 66.41us min, 1.28ms max, 264.5us avg, 424.26ms total

Dataset throughput:
        * Ray Data throughput: 7199.477407213313 rows/s
        * Estimated single node throughput: 7643.302102326079 rows/s
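The "summed across prefetch threads" breakdown above reflects the prefetching that keeps subsequent batches cheap: a background thread fills a bounded buffer while the consumer works. A generic sketch of that pattern (a simplification; Ray Data's actual prefetcher is more involved):

```python
import queue
import threading


def prefetching_batches(produce_batch, num_batches, prefetch=2):
    """Yield batches produced on a background thread.

    After the first batch, the consumer usually finds the next batch already
    buffered, so its wait per batch is small. Illustrative sketch only.
    """
    q = queue.Queue(maxsize=prefetch)  # bounded buffer caps memory use
    sentinel = object()  # signals end of the stream

    def worker():
        for i in range(num_batches):
            q.put(produce_batch(i))  # blocks once `prefetch` batches are buffered
        q.put(sentinel)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()  # only the first get typically waits long
        if item is sentinel:
            return
        yield item
```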

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: xgui <xgui@anyscale.com>
@xinyuangui2 xinyuangui2 requested a review from a team as a code owner August 19, 2025 23:26
gemini-code-assist[bot] (Contributor) left a comment

Code Review

This pull request introduces a valuable new statistic for tracking the blocking time of the first batch, which is a common performance bottleneck. The implementation is sound, but I've identified a couple of areas for improvement. I've suggested a refactoring in iter_batches.py to enhance code clarity and reduce duplication. Additionally, I've pointed out minor typos in the test expectations that should be corrected for consistency.

xinyuangui2 and others added 2 commits August 20, 2025 00:08
Signed-off-by: xgui <xgui@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
@ray-gardener ray-gardener bot added data Ray Data-related issues observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling labels Aug 20, 2025
@xinyuangui2 xinyuangui2 requested a review from justinvyu August 20, 2025 16:54
justinvyu (Contributor) left a comment

Can you file a ticket for a follow-up PR to add this metric as a dashboard panel?

xinyuangui2 (Contributor, Author) replied:

Can you file a ticket for a follow-up PR to add this metric as a dashboard panel?

Added: https://anyscale1.atlassian.net/browse/TRAIN-626?atlOrigin=eyJpIjoiOTFjNzVmYzZiMGIyNGRiZGFjMGY1NGMwMWJmNjQ3NTkiLCJwIjoiamlyYS1zbGFjay1pbnQifQ

xinyuangui2 and others added 2 commits August 22, 2025 15:25
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
@xinyuangui2 xinyuangui2 requested a review from justinvyu August 22, 2025 22:30
justinvyu (Contributor) left a comment

I realized that this tracks time to first batch across epochs, so wanted to clarify that this is a cumulative metric. Good to merge after this
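The cumulative semantics noted above (first-batch wait summed over epochs rather than reset each epoch) can be sketched like this; `CumulativeFirstBatchWait` is a hypothetical name for illustration, not the PR's actual class:

```python
import time


class CumulativeFirstBatchWait:
    """Accumulate the first-batch wait across epochs.

    Hypothetical sketch of a cumulative metric; not Ray's implementation.
    """

    def __init__(self):
        self.total_first_batch_wait_s = 0.0

    def iterate_epoch(self, batch_iter):
        start = time.perf_counter()
        first = True
        for batch in batch_iter:
            if first:
                # Each epoch contributes its own warm-up wait to the total.
                self.total_first_batch_wait_s += time.perf_counter() - start
                first = False
            yield batch
```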

xinyuangui2 and others added 3 commits August 23, 2025 18:07
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
xinyuangui2 and others added 2 commits August 25, 2025 11:59
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
@justinvyu justinvyu enabled auto-merge (squash) August 25, 2025 21:33
@justinvyu justinvyu disabled auto-merge August 25, 2025 21:33
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Aug 25, 2025
@justinvyu justinvyu changed the title [Data] Add stats for the first batch blocking time [Data] Add time to first batch metric for dataset iterators Aug 25, 2025
@justinvyu justinvyu enabled auto-merge (squash) August 25, 2025 21:34
@justinvyu justinvyu merged commit db9b20d into ray-project:master Aug 25, 2025
8 checks passed
liulehui pushed a commit to liulehui/ray that referenced this pull request Aug 26, 2025
tohtana pushed two commits to tohtana/ray that referenced this pull request Aug 29, 2025
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025