Bugfix: batch_size_warmup_scheduler was taking too long #205
`BatchSizeWarmupScheduler` was taking too long to initialize, or was effectively unusable for real-world `max_batch_size` values
When trying to run the training script, it produced no output for a long while. So I started reading the code and saw that it was using the `sum(range(x, y))` idiom to sum the values over a range. This is O(y − x), which is inefficient for large `y` and effectively impossible when `y` is on the order of 50B.
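For a sense of the cost, here is a minimal sketch contrasting the linear scan with the closed-form arithmetic-series sum (function names here are illustrative, not from the repo):

```python
import time

def range_sum_linear(x: int, y: int) -> int:
    # The original idiom: iterates once per value, O(y - x).
    return sum(range(x, y))

def range_sum_closed_form(x: int, y: int) -> int:
    # Sum of x, x+1, ..., y-1 via the arithmetic series formula, O(1).
    n = y - x
    return n * (x + y - 1) // 2

x, y = 0, 10**8  # at y ~ 5e10 the linear version would run for hours
t0 = time.perf_counter()
slow = range_sum_linear(x, y)
t1 = time.perf_counter()
fast = range_sum_closed_form(x, y)
t2 = time.perf_counter()
assert slow == fast
print(f"linear: {t1 - t0:.2f}s, closed form: {t2 - t1:.6f}s")
```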
Changes
Simplify BatchSizeWarmupScheduler Implementation
Summary
This PR simplifies the batch size warmup scheduling logic by replacing the step-based threshold calculation with a more straightforward token-based approach. The new implementation provides a more intuitive and mathematically precise way to handle batch size warmup during training.
Changes
- Replaced `_calculate_step_thresholds()` with `_calculate_tokens_per_batch_size()`
- Changed progress tracking from `current_step` to `current_token_count`

Technical Details
The new implementation computes token thresholds in closed form using the arithmetic series sum (n(a₁ + aₙ))/2 instead of summing over a range, so the work is O(1) regardless of how large `max_batch_size` or the warmup token budget is.
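As a rough sketch of how this can work (the class name, `current_token_count`, and the token-based accounting come from this PR; the equal-steps-per-size policy and the exact signatures below are my assumptions, not necessarily what the repo does):

```python
import math

class BatchSizeWarmupScheduler:
    """Ramp the batch size from min_batch_size to max_batch_size.

    If each batch size b is held for a fixed number of steps, the cumulative
    sample count after finishing size m is an arithmetic series,
        C(m) = steps_per_size * (m - min + 1)(min + m) / 2,
    which can be inverted in O(1) with the quadratic formula -- no
    sum(range(...)) scan, no matter how large max_batch_size is.
    """

    def __init__(self, min_batch_size: int, max_batch_size: int, steps_per_size: int):
        self.min_batch_size = min_batch_size
        self.max_batch_size = max_batch_size
        self.steps_per_size = steps_per_size
        self.current_token_count = 0  # counted in samples here; tokens work the same

    def _batch_size_at(self, count: int) -> int:
        # Solve C(m) = count for m (see class docstring), then clamp to range.
        a = self.min_batch_size
        t = count / self.steps_per_size
        m = (-1.0 + math.sqrt(1.0 + 4 * a * a - 4 * a + 8 * t)) / 2.0
        return max(self.min_batch_size, min(self.max_batch_size, int(m) + 1))

    def step(self, batch_samples: int) -> int:
        """Record the samples just consumed; return the batch size to use next."""
        self.current_token_count += batch_samples
        return self._batch_size_at(self.current_token_count)
```

Both construction and each `step` call are constant time, so startup cost no longer depends on the size of the warmup schedule.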
Benefits
- Scheduler construction no longer hangs: thresholds are computed in closed form rather than by scanning, so cost is independent of `max_batch_size` and the warmup token budget.
- Token-based tracking makes batch size transitions exact rather than approximated from step counts.