[Data] Add time to first batch metric for dataset iterators#55758

Merged
justinvyu merged 12 commits into ray-project:master from xinyuangui2:xgui/add-first-batch-stats
Aug 25, 2025

Conversation

xinyuangui2 (Contributor) commented Aug 19, 2025

Why are these changes needed?

The time to first batch is usually much longer than for subsequent batches because it includes the time needed for the pipeline to warm up: the iterator only receives a batch once the first few blocks have made it through all stages of the data pipeline and been piped to the train worker consumers.

Once prefetching is active and the pipeline reaches a steady state, the time to produce subsequent batches is much lower.

This PR adds a metric to track the time to first batch.
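Conceptually, the metric is a timestamp taken when iteration starts, compared against the moment the first batch arrives. A minimal plain-Python sketch of that idea (illustrative only; `FirstBatchTimer` is a hypothetical name, not part of Ray's API or this PR's implementation):

```python
import time


class FirstBatchTimer:
    """Wrap a batch iterator and record how long the first batch takes.

    Illustrative sketch of the time-to-first-batch concept; not Ray's code.
    """

    def __init__(self, batch_iter):
        self._it = iter(batch_iter)
        self._start = time.perf_counter()  # iteration start time
        self.time_to_first_batch = None
        self.batches = 0

    def __iter__(self):
        return self

    def __next__(self):
        batch = next(self._it)  # raises StopIteration when exhausted
        if self.time_to_first_batch is None:
            # Only the first batch pays the pipeline warm-up cost.
            self.time_to_first_batch = time.perf_counter() - self._start
        self.batches += 1
        return batch
```

After iteration, `time_to_first_batch` holds the warm-up wait while the per-batch loop timing covers the steady state.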

Example:

Operator 0 FromItems: 1 tasks executed, 200 blocks produced in 0.28s
* Remote wall time: 930.56us min, 62.8ms max, 1.31ms mean, 261.67ms total
* Remote cpu time: 1.14ms min, 64.22ms max, 1.59ms mean, 318.9ms total
* UDF time: 0us min, 0us max, 0.0us mean, 0us total
* Peak heap memory usage (MiB): 0.0 min, 0.0 max, 0 mean
* Output num rows per block: 10 min, 10 max, 10 mean, 2000 total
* Output size bytes per block: 240 min, 240 max, 240 mean, 48000 total
* Output rows per task: 2000 min, 2000 max, 2000 mean, 1 tasks used
* Tasks per node: 1 min, 1 max, 1 mean; 1 nodes used
* Operator throughput:
        * Ray Data throughput: 7199.477407213313 rows/s
        * Estimated single node throughput: 7643.302102326079 rows/s

Dataset iterator time breakdown:
    * Total time in Ray Data iterator initialization code: 74.52ms
    * Total time user thread is blocked by Ray Data iter_batches: 167.44ms
    * Total time spent waiting for the first batch after starting iteration: 10.92ms
    * Total execution time for user thread: 424.62ms
* Batch iteration time breakdown (summed across prefetch threads):
    * In ray.get(): 537.44us min, 8.98ms max, 984.63us avg, 158.53ms total
    * In batch creation: 16.77us min, 513.99us max, 53.56us avg, 85.97ms total
    * In batch formatting: 66.41us min, 1.28ms max, 264.5us avg, 424.26ms total

Dataset throughput:
        * Ray Data throughput: 7199.477407213313 rows/s
        * Estimated single node throughput: 7643.302102326079 rows/s
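The "summed across prefetch threads" breakdown above reflects the prefetching that keeps subsequent batches cheap: a background thread fills a bounded buffer while the consumer works. A generic sketch of that pattern (a simplification; Ray Data's actual prefetcher is more involved):

```python
import queue
import threading


def prefetching_batches(produce_batch, num_batches, prefetch=2):
    """Yield batches produced on a background thread.

    After the first batch, the consumer usually finds the next batch already
    buffered, so its wait per batch is small. Illustrative sketch only.
    """
    q = queue.Queue(maxsize=prefetch)  # bounded buffer caps memory use
    sentinel = object()  # signals end of the stream

    def worker():
        for i in range(num_batches):
            q.put(produce_batch(i))  # blocks once `prefetch` batches are buffered
        q.put(sentinel)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()  # only the first get typically waits long
        if item is sentinel:
            return
        yield item
```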

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: xgui <xgui@anyscale.com>
@xinyuangui2 xinyuangui2 requested a review from a team as a code owner August 19, 2025 23:26
gemini-code-assist[bot] (Contributor) left a comment

Code Review

This pull request introduces a valuable new statistic for tracking the blocking time of the first batch, which is a common performance bottleneck. The implementation is sound, but I've identified a couple of areas for improvement. I've suggested a refactoring in iter_batches.py to enhance code clarity and reduce duplication. Additionally, I've pointed out minor typos in the test expectations that should be corrected for consistency.

xinyuangui2 and others added 2 commits August 20, 2025 00:08
Signed-off-by: xgui <xgui@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
@ray-gardener ray-gardener bot added data Ray Data-related issues observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling labels Aug 20, 2025
@xinyuangui2 xinyuangui2 requested a review from justinvyu August 20, 2025 16:54
justinvyu (Contributor) left a comment

Can you file a ticket for a follow-up PR to add this metric as a dashboard panel?

xinyuangui2 (Contributor, Author) replied:

Can you file a ticket for a follow-up PR to add this metric as a dashboard panel?

Added: https://anyscale1.atlassian.net/browse/TRAIN-626?atlOrigin=eyJpIjoiOTFjNzVmYzZiMGIyNGRiZGFjMGY1NGMwMWJmNjQ3NTkiLCJwIjoiamlyYS1zbGFjay1pbnQifQ

xinyuangui2 and others added 2 commits August 22, 2025 15:25
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
@xinyuangui2 xinyuangui2 requested a review from justinvyu August 22, 2025 22:30
justinvyu (Contributor) left a comment

I realized that this tracks time to first batch across epochs, so wanted to clarify that this is a cumulative metric. Good to merge after this
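The cumulative semantics noted above (first-batch wait summed over epochs rather than reset each epoch) can be sketched like this; `CumulativeFirstBatchWait` is a hypothetical name for illustration, not the PR's actual class:

```python
import time


class CumulativeFirstBatchWait:
    """Accumulate the first-batch wait across epochs.

    Hypothetical sketch of a cumulative metric; not Ray's implementation.
    """

    def __init__(self):
        self.total_first_batch_wait_s = 0.0

    def iterate_epoch(self, batch_iter):
        start = time.perf_counter()
        first = True
        for batch in batch_iter:
            if first:
                # Each epoch contributes its own warm-up wait to the total.
                self.total_first_batch_wait_s += time.perf_counter() - start
                first = False
            yield batch
```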

xinyuangui2 and others added 3 commits August 23, 2025 18:07
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
xinyuangui2 and others added 2 commits August 25, 2025 11:59
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
@justinvyu justinvyu enabled auto-merge (squash) August 25, 2025 21:33
@justinvyu justinvyu disabled auto-merge August 25, 2025 21:33
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Aug 25, 2025
@justinvyu justinvyu changed the title [Data] Add stats for the first batch blocking time [Data] Add time to first batch metric for dataset iterators Aug 25, 2025
@justinvyu justinvyu enabled auto-merge (squash) August 25, 2025 21:34
@justinvyu justinvyu merged commit db9b20d into ray-project:master Aug 25, 2025
8 checks passed
liulehui pushed a commit to liulehui/ray that referenced this pull request Aug 26, 2025
tohtana pushed two commits to tohtana/ray that referenced this pull request Aug 29, 2025
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025