[Data] Add time to first batch metric for dataset iterators#55758
justinvyu merged 12 commits into ray-project:master
Conversation
Signed-off-by: xgui <xgui@anyscale.com>
Code Review
This pull request introduces a valuable new statistic for tracking the blocking time of the first batch, which is a common performance bottleneck. The implementation is sound, but I've identified a couple of areas for improvement. I've suggested a refactoring in iter_batches.py to enhance code clarity and reduce duplication. Additionally, I've pointed out minor typos in the test expectations that should be corrected for consistency.
Signed-off-by: xgui <xgui@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
justinvyu
left a comment
Can you file a ticket for a follow-up PR to add this metric as a dashboard panel?
Signed-off-by: xgui <xgui@anyscale.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
justinvyu
left a comment
I realized that this tracks time to first batch across epochs, so I wanted to clarify that this is a cumulative metric. Good to merge after this.
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
…ect#55758) The time to first batch is usually longer than for subsequent batches because it includes the time needed for the pipeline to warm up: the iterator receives the batch only once the first few blocks have made it through all stages of the data pipeline and been piped to the train worker consumers. Once prefetching is underway and the data pipeline reaches a steady state, subsequent batches are produced much faster. This PR adds a metric to track the time to first batch. --------- Signed-off-by: xgui <xgui@anyscale.com> Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Lehui Liu <lehui@anyscale.com>
Why are these changes needed?
The time to first batch is usually longer than for subsequent batches because it includes the time needed for the pipeline to warm up: the iterator receives the batch only once the first few blocks have made it through all stages of the data pipeline and been piped to the train worker consumers.
Once prefetching is underway and the data pipeline reaches a steady state, subsequent batches are produced much faster.
This PR adds a metric to track the time to first batch.
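To illustrate the idea (not the PR's actual implementation), the sketch below shows one way to measure cumulative time to first batch by wrapping a batch iterator. The class name `FirstBatchTimer` and its attributes are hypothetical; the real metric lives in Ray Data's iterator stats machinery. As noted in the review, the metric is cumulative across epochs, which the sketch mirrors by accumulating into a running total:

```python
import time


class FirstBatchTimer:
    """Hypothetical sketch: accumulate time spent blocked on the first batch.

    `time_to_first_batch_s` is cumulative across epochs, matching the
    reviewer's note that this is a cumulative metric.
    """

    def __init__(self):
        self.time_to_first_batch_s = 0.0

    def wrap(self, batch_iterable):
        """Wrap one epoch's batch iterable, timing until the first yield."""
        start = time.perf_counter()
        first = True
        for batch in batch_iterable:
            if first:
                # Warm-up cost: the first few blocks must flow through every
                # pipeline stage before the first batch reaches the consumer.
                self.time_to_first_batch_s += time.perf_counter() - start
                first = False
            yield batch


timer = FirstBatchTimer()
for epoch in range(2):
    # range(3) stands in for a real batch source such as ds.iter_batches().
    consumed = list(timer.wrap(range(3)))
```

After two epochs, `time_to_first_batch_s` holds the sum of both epochs' first-batch delays, so a dashboard panel built on it (as suggested above) would show a monotonically increasing counter.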
Example:
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.