Commit e04e1c2
fix: port wrap_data_iterator pattern from PR #4659 to fix DCP test

The Phase 1 merge of #4716 took main's version of training.py, per the
skill's "Files to Override from Main" rule. Main's version uses the
HybridCPDataLoaderWrapper class, applied once outside train_step. That
broke the DCP test (gpt3_mcore_te_tp2_pp1_cp4_dcp) in two ways:
1. RuntimeError: Trying to resize storage that is not resizable -
   fixed by c3dbea7 (rename args.hybrid_context_parallel ->
   args.dynamic_context_parallel).
2. AssertionError: data iterator is not wrapped with RerunDataIterator -
   the outside-train_step wrap converted train_data_iterator from a
   RerunDataIterator to a plain iterator, but rerun_state_machine's
   should_run_forward_backward asserts the wrap.
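
The second failure can be reproduced in miniature. The sketch below is not the Megatron code: RerunDataIterator and should_run_forward_backward are simplified stand-ins that model only the type assertion, and pack_outside is a hypothetical name for the outside-train_step wrap.

```python
class RerunDataIterator:
    """Simplified stand-in: a thin wrapper around an iterator."""

    def __init__(self, iterable):
        self._it = iter(iterable)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._it)


def should_run_forward_backward(data_iterator):
    """Stand-in for the rerun state machine check; only the assertion is modeled."""
    assert isinstance(
        data_iterator, RerunDataIterator
    ), "data iterator is not wrapped with RerunDataIterator"
    return True


def pack_outside(data_iterator):
    """Hypothetical outside-train_step wrap: returns a plain generator."""
    return (batch for batch in data_iterator)


train_data_iterator = RerunDataIterator([1, 2, 3])
# Rebinding the training-loop variable discards the RerunDataIterator type.
train_data_iterator = pack_outside(train_data_iterator)

failed = False
message = None
try:
    should_run_forward_backward(train_data_iterator)
except AssertionError as exc:
    failed = True
    message = str(exc)
```

Because the rebinding happens once, before the loop, every later call to the check sees the plain iterator.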
PR #4659 resolved this by keeping dev's wrap_data_iterator pattern
instead of main's HybridCPDataLoaderWrapper, calling wrap_data_iterator
INSIDE train_step (after should_run_forward_backward) and inside the
eval loop. That keeps the original RerunDataIterator visible to the
assertion and only swaps in the packed iterator for the
forward_backward_func call.
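
A minimal sketch of that ordering, under stated assumptions: every function here is a stand-in, the argument list of wrap_data_iterator is invented for illustration (the commit only shows that it returns a 4-tuple), and "packing" is reduced to pulling a few sequence lengths.

```python
from types import SimpleNamespace


class RerunDataIterator:
    """Simplified stand-in: a thin wrapper around an iterator."""

    def __init__(self, iterable):
        self._it = iter(iterable)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._it)


def should_run_forward_backward(data_iterator):
    """Stand-in for the rerun state machine check; only the assertion is modeled."""
    assert isinstance(data_iterator, RerunDataIterator)
    return True


def wrap_data_iterator(data_iterator, num_microbatches):
    """Hypothetical signature. Returns a plain iterator plus batch statistics."""
    seqlens = []
    for _ in range(num_microbatches):
        seqlens.append(next(data_iterator))
    return iter(seqlens), len(seqlens), sum(seqlens), sum(s * s for s in seqlens)


def forward_backward_func(data_iterator, num_microbatches):
    """Stand-in: consumes num_microbatches items and returns them."""
    return [next(data_iterator) for _ in range(num_microbatches)]


def train_step(data_iterator, config, num_microbatches):
    # The assertion runs first and sees the caller's original iterator.
    should_run_forward_backward(data_iterator)
    seqlen_sum = seqlen_squared_sum = None
    if config.sequence_packing_scheduler is not None:
        # Only the local name is rebound; the caller's variable is untouched.
        (data_iterator, num_microbatches, seqlen_sum, seqlen_squared_sum) = (
            wrap_data_iterator(data_iterator, num_microbatches)
        )
    out = forward_backward_func(data_iterator, num_microbatches=num_microbatches)
    return out, seqlen_sum, seqlen_squared_sum


config = SimpleNamespace(sequence_packing_scheduler="stub")
train_data_iterator = RerunDataIterator([4, 8, 16, 32])
first = train_step(train_data_iterator, config, 2)
second = train_step(train_data_iterator, config, 2)
```

The second call passes the assertion too, because the swap to the packed iterator never leaves train_step.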
Port that pattern verbatim from PR #4659's training.py:
- Replace the HybridCPDataLoaderWrapper import with wrap_data_iterator.
- Remove the outside-train_step wrap (was at lines 3000-3001).
- Inside train_step: add the "if config.sequence_packing_scheduler is
  not None" block before forward_backward_func, unpacking (data_iterator,
  num_microbatches, seqlen_sum_this_global_batch,
  seqlen_squared_sum_this_global_batch); pass
  num_microbatches=num_microbatches to forward_backward_func.
- Inside the eval loop: add the same wrap with try/except StopIteration,
  using packed_data_iterator and scheduled_eval_num_microbatches.
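
The eval-loop item can be sketched roughly as follows. This is an illustration, not the ported code: the evaluate signature, the wrap_data_iterator arguments, and the choice to break out of the loop on StopIteration are all assumptions, since the commit message does not say what the except branch does.

```python
def wrap_data_iterator(data_iterator, num_microbatches):
    """Hypothetical signature. Raises StopIteration if the data runs out."""
    seqlens = []
    for _ in range(num_microbatches):
        seqlens.append(next(data_iterator))
    return iter(seqlens), len(seqlens), sum(seqlens), sum(s * s for s in seqlens)


def forward_backward_func(data_iterator, num_microbatches):
    """Stand-in: consumes num_microbatches items and returns them."""
    return [next(data_iterator) for _ in range(num_microbatches)]


def evaluate(data_iterator, eval_iters, eval_num_microbatches, packing_enabled=True):
    outputs = []
    for _ in range(eval_iters):
        # Defaults used when packing is disabled.
        packed_data_iterator = data_iterator
        scheduled_eval_num_microbatches = eval_num_microbatches
        if packing_enabled:
            try:
                (packed_data_iterator, scheduled_eval_num_microbatches, _, _) = (
                    wrap_data_iterator(data_iterator, eval_num_microbatches)
                )
            except StopIteration:
                # Assumption: stop evaluating once the eval data is exhausted.
                break
        outputs.append(
            forward_backward_func(
                packed_data_iterator,
                num_microbatches=scheduled_eval_num_microbatches,
            )
        )
    return outputs


# Five samples, two per step: the third step runs out of data and stops.
outputs = evaluate(iter([1, 2, 3, 4, 5]), eval_iters=4, eval_num_microbatches=2)
```

Using separate names (packed_data_iterator, scheduled_eval_num_microbatches) keeps the original eval iterator and microbatch count available for the next loop iteration.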
Note: this leaves HybridCPDataLoaderWrapper and its imports (Any, List,
BalancedCPScheduler) as dead code in megatron/core/datasets/data_schedule.py.
Cleanup of that file (and of the remaining structural diff in
training.py / data_samplers.py / utils.py vs PR #4659's tree) is left
to follow-up.

1 parent c3dbea7 commit e04e1c2
1 file changed
Lines changed: 44 additions & 7 deletions
| Original file lines | New file lines | Change (line content not preserved) |
|---|---|---|
| 184 | 184 | 1 line replaced |
| (none) | 2033-2053 | 21 lines added |
| 2044 | 2065 | 1 line replaced |
| 3000-3002 | (none) | 3 lines removed |
| (none) | 3720-3738 | 19 lines added |
| 3704 | 3741 | 1 line replaced |
| 3706 | 3743 | 1 line replaced |