chore: nightly sync main into dev (10_05_2026) #4716
Merged
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: john2 <john2@jrlogin01.jureca>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Co-authored-by: root <root@nvl72098-T17.cm.cluster> Co-authored-by: William Dykas <wdykas@oci-hsg-cs-001-vscode-03.cm.cluster> Co-authored-by: root <root@nvl72160-T13.cm.cluster>
…classmethod (#3812) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
#4403) Co-authored-by: Antoni-Joan Solergibert <asolergibert@nvidia.com>
Co-authored-by: Philip Petrakian <ppetrakian@nvidia.com>
Signed-off-by: Maanu Grover <maanug@nvidia.com> Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
Co-authored-by: Siddharth Singh <sidsingh@nvidia.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nference cuda graph scope for hybrid models (#4440)
…ss curve gaps for latent MoE models (#4433) Signed-off-by: root <jiemingz@nvidia.com>
…4158) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…4422) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rprenger <rprenger@nvidia.com>
Signed-off-by: qiyuw <qiyuw@nvidia.com> Co-authored-by: Antoni-Joan Solergibert <asolergibert@nvidia.com>
Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
dc4ca1f to bfce39d
Member
/ok to test bfce39d
Contributor
/ok to test fb297f0
FDecaYed approved these changes May 11, 2026
FDecaYed (Contributor) left a comment
fixed last test fail. LGTM
…llel

The Phase 1 merge took main's versions of these three files (per the nightly-sync skill's "Files to Override from Main" list), but main still references the deprecated arg name `hybrid_context_parallel`. Dev renamed it to `dynamic_context_parallel` in commit cde56a4 ("Fix for rope when enabling THD + Dynamic-CP; use the naming Dynamic-CP").

model_parallel_config.py keeps both attributes as a deprecation shim and copies `hybrid_context_parallel=True -> dynamic_context_parallel=True` in __post_init__, but not the other direction, so when a test passes `--dynamic-context-parallel: true`, args.hybrid_context_parallel is False and any code that gates on it silently takes the wrong branch.

This caused gpt3_mcore_te_tp2_pp1_cp4_dcp to fail in CI: data_samplers.py:109 used `if args.hybrid_context_parallel:` to select the pass-through `collate_fn=lambda x: x`. With the test's --dynamic-context-parallel flag, the condition was False, so torch's default-collate ran on tensors from the SFT-mock THD path and hit non-resizable shared-memory storage:

    RuntimeError: Trying to resize storage that is not resizable
    in torch/utils/data/_utils/collate.py:274 collate_tensor_fn

Renames 8 references across 3 files:
- megatron/training/datasets/data_samplers.py: 2 references
- megatron/training/utils.py: 5 references (TP-broadcast batch shape)
- megatron/training/training.py: 1 reference (HybridCPDataLoaderWrapper iter wrap)

All 3 files pass ast.parse; zero remaining `args.hybrid_context_parallel` references in megatron/. CODEOWNERS and the pyproject.toml/uv.lock/Dockerfile.ci.dev triple unchanged from origin/dev.
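The one-way shim failure described in this commit message can be reproduced in a few lines. This is a minimal sketch, not the actual `model_parallel_config.py` code: the class `OneWayShimConfig` and the helper `pick_collate_fn` are hypothetical stand-ins, and only the two flag names come from the commit message.

```python
from dataclasses import dataclass


@dataclass
class OneWayShimConfig:
    """Hypothetical stand-in for a config that keeps a deprecated flag alive."""

    hybrid_context_parallel: bool = False   # deprecated name
    dynamic_context_parallel: bool = False  # current name

    def __post_init__(self):
        # Copies old -> new only, as the commit message describes.
        # Nothing copies new -> old, so the old flag can stay False.
        if self.hybrid_context_parallel:
            self.dynamic_context_parallel = True


def pick_collate_fn(cfg, flag_name):
    """Return which collate path a gate on `flag_name` would select."""
    return "pass-through" if getattr(cfg, flag_name) else "default-collate"


# A run that sets only the new flag, as the failing test did.
cfg = OneWayShimConfig(dynamic_context_parallel=True)
stale_gate = pick_collate_fn(cfg, "hybrid_context_parallel")   # gate on old name
fixed_gate = pick_collate_fn(cfg, "dynamic_context_parallel")  # gate on new name

# A legacy run that sets only the old flag still works via the shim.
legacy_cfg = OneWayShimConfig(hybrid_context_parallel=True)
legacy_gate = pick_collate_fn(legacy_cfg, "dynamic_context_parallel")

print(stale_gate, fixed_gate, legacy_gate)
```

Gating on the new name is correct for both legacy and current callers, which is why the fix renames the call sites rather than extending the shim.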
Member
/ok to test c3dbea7
The Phase 1 merge of #4716 took main's version of training.py per the skill's "Files to Override from Main" rule, which uses the HybridCPDataLoaderWrapper class wrapped once outside train_step. That broke the DCP test (gpt3_mcore_te_tp2_pp1_cp4_dcp) in two ways:

1. RuntimeError: Trying to resize storage that is not resizable - fixed by c3dbea7 (rename args.hybrid_context_parallel -> args.dynamic_context_parallel).
2. AssertionError: data iterator is not wrapped with RerunDataIterator - the outside-train_step wrap converted train_data_iterator from a RerunDataIterator to a plain iterator, but rerun_state_machine's should_run_forward_backward asserts the wrap.

PR #4659 resolved this by keeping dev's wrap_data_iterator pattern instead of main's HybridCPDataLoaderWrapper, calling wrap_data_iterator INSIDE train_step (after should_run_forward_backward) and inside the eval loop. That keeps the original RerunDataIterator visible to the assertion and only swaps in the packed iterator for the forward_backward_func call.

Port that pattern verbatim from PR #4659's training.py:
- Replace HybridCPDataLoaderWrapper import with wrap_data_iterator
- Remove the outside-train_step wrap (was at line 3000-3001)
- Inside train_step: add the `if config.sequence_packing_scheduler is not None` block before forward_backward_func, unpacking (data_iterator, num_microbatches, seqlen_sum_this_global_batch, seqlen_squared_sum_this_global_batch); pass num_microbatches=num_microbatches to forward_backward_func
- Inside eval loop: add the same wrap with try/except StopIteration, using packed_data_iterator and scheduled_eval_num_microbatches

Note: this leaves HybridCPDataLoaderWrapper and its imports (Any, List, BalancedCPScheduler) as dead code in megatron/core/datasets/data_schedule.py. Cleanup of that file (and of the remaining structural diff in training.py / data_samplers.py / utils.py vs PR #4659's tree) is left to follow-up.
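The ordering problem behind the second failure can be shown with a toy model. Everything below is a simplified stand-in written for illustration: the real `RerunDataIterator`, `should_run_forward_backward`, and `wrap_data_iterator` live in Megatron-LM and have richer signatures (the real wrap returns four values, for instance). Only the names and the wrap-inside-versus-outside ordering are taken from the commit message.

```python
class RerunDataIterator:
    """Toy stand-in: marks an iterator as rerun-capable."""

    def __init__(self, iterable):
        self._it = iter(iterable)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._it)


def should_run_forward_backward(data_iterator):
    # Mirrors the assertion quoted in the commit message.
    if not isinstance(data_iterator, RerunDataIterator):
        raise AssertionError("data iterator is not wrapped with RerunDataIterator")
    return True


def wrap_data_iterator(data_iterator, pack_size=2):
    """Toy packer: groups samples and returns a plain iterator plus a count."""
    samples = list(data_iterator)
    packed = [samples[i:i + pack_size] for i in range(0, len(samples), pack_size)]
    return iter(packed), len(packed)


def train_step(data_iterator):
    # The check runs first, so it still sees the original RerunDataIterator.
    should_run_forward_backward(data_iterator)
    # Only afterwards is the packed iterator swapped in for the forward pass.
    packed_iterator, num_microbatches = wrap_data_iterator(data_iterator)
    return list(packed_iterator), num_microbatches


# Wrap inside train_step: the assertion passes.
batches, num_microbatches = train_step(RerunDataIterator([1, 2, 3, 4]))

# Wrap outside train_step: the assertion sees a plain iterator and fails.
try:
    outside, _ = wrap_data_iterator(RerunDataIterator([1, 2, 3, 4]))
    train_step(outside)
    outside_wrap_ok = True
except AssertionError:
    outside_wrap_ok = False

print(batches, num_microbatches, outside_wrap_ok)
```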
Member
/ok to test e04e1c2
Author
Superseded by today's nightly sync.
svcnvidia-nemo-ci added a commit that referenced this pull request May 12, 2026
Merges 8 commits from main into dev. Dev already contains yesterday's sync (PR #4716) plus follow-up fixes, so this PR only carries main commits made after that sync.

Notable changes:
- 434368c build(deps): bump nvidia-modelopt to 0.43 (#4723)
- e42e2fa ci: Major refactor of release-workflows (#4602)
- 33d47e0 [ci] fix: treat cancelled run-main-script step as failure (#4727)
- 5123f6a ci: revert bad uv.lock bump and label future bumps with Run functional tests (#4730)
- ad58411 Add Python-side guardrail for DeepEP IB limits (#4719)
- e93755e chore(beep boop): Bump (main) (2026-05-11)
- a2ec5c1 Revert Add Python-side guardrail for HybridEP IB limit (#4718)
- 5e31514 Create a Protocol for the MLP layer of TransformerLayer (#3435)

Kept dev's pyproject.toml, uv.lock, docker/Dockerfile.ci.dev, and .github/CODEOWNERS (per nightly-sync skill). Ran black + isort on changed Python files.
This was referenced May 12, 2026
Phlip79 added a commit to Phlip79/Megatron-LM that referenced this pull request May 13, 2026
Adds a deterministic gate step (preservation workflow) and a prompt instruction (main sync workflow) to ensure the sync bot never modifies any `golden_values*` file. Reference outputs from successful training runs cannot be regenerated by the bot; PR NVIDIA#4716 modified ~89,652 lines across dozens of `golden_values_*.json` files.

- Preservation gate: a pre-audit step fails the workflow if any file whose basename starts with `golden_values` differs from origin/dev. Reports the offending paths inline as annotations and emits the exact `git checkout` one-liner to restore them.
- Sync prompt: instructs the agent to run that one-liner after the Phase 1 merge and before every push (Phase 1, Phase 3 fixes, amends).
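The basename check behind the preservation gate is simple enough to sketch. The following is an illustration only, assuming the list of files that differ from origin/dev has already been collected. The function names, the sample paths, and the exact shape of the restore command are assumptions, not taken from the workflow itself.

```python
import posixpath
import shlex


def golden_value_violations(changed_paths):
    """Return changed paths whose basename starts with 'golden_values'."""
    return sorted(
        p for p in changed_paths
        if posixpath.basename(p).startswith("golden_values")
    )


def restore_command(paths, ref="origin/dev"):
    """Build a git checkout command that restores `paths` from `ref`."""
    quoted = " ".join(shlex.quote(p) for p in paths)
    return f"git checkout {ref} -- {quoted}"


# Hypothetical list of files that differ from origin/dev.
changed = [
    "megatron/training/training.py",
    "tests/example/golden_values_dev.json",
    "docs/golden_values_notes.md",
]

violations = golden_value_violations(changed)
command = restore_command(violations)

# A real gate would exit non-zero here; this sketch only reports.
print(violations)
print(command)
```

Matching on the basename rather than the full path means a directory that happens to be named `golden_values` does not trigger the gate, while any file with that prefix does, whatever its extension.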
Phlip79 added a commit to Phlip79/Megatron-LM that referenced this pull request May 14, 2026
…y schedule
Two changes to the main-to-dev nightly sync workflow:
1. Add a "Dev-feature preservation audit" to the existing pre-push
invariant checks in the nightly-sync skill
(`.claude/skills/nightly-sync/SKILL.md`). The audit is bash that
the sync bot must run before every push (Phase 1 and every Phase 3
fix-push):
    for f in $(git diff --name-only "$BASE"...HEAD ...); do
      missing=$(comm -23 \
          <(git show "origin/dev:$f" | sort -u) \
          <(git show "origin/main:$f" | sort -u) \
        | comm -23 - <(sort -u "$f") \
        | grep -E '[[:alnum:]_]' )
      [ -n "$missing" ] && record_violation
    done
Lines present on `origin/dev`, absent from `origin/main`, and absent
from the merged tree are the textbook "main2dev silently reverted a
dev-only feature" pattern that bit NVIDIA#4659 and NVIDIA#4716. The audit
exempts the skill's "Files to Override from Main" list (the bot may
legitimately keep main's version there) and skips the dependency
triple plus CODEOWNERS (already checked separately).
Recent regressions the audit would have caught:
- `transformer_layer.py` `_forward_mlp_router(input_ids=None)`
- `token_dispatcher.py` `num_sms_preprocessing_api=...` kwarg
- `moe_layer.py` `self._maybe_record_overload_factor(...)` call
- `gpt_dynamic_inference_with_coordinator.py`
`parse_and_validate_args` import
- `datasets/readme.md` "Packing Scheduler" section
- `data_samplers.py` / `utils.py` / `training.py`
`args.dynamic_context_parallel` references
2. Schedule change in `.github/workflows/nightly-sync-main-to-dev.yml`:
was `cron: '0 21 * * *'` (daily at 21:00 UTC); now
`cron: '0 15 * * 1,4'` (Mondays and Thursdays at 15:00 UTC, i.e.
8 AM PDT / 7 AM PST since GitHub Actions cron is UTC-only).
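The preservation audit from item 1 boils down to two set differences followed by a filter. Here is a minimal Python sketch of the same logic, assuming the three versions of a file are already loaded as strings. The function name `dev_only_lines_lost` and the sample text are illustrative and not part of the skill.

```python
import re


def dev_only_lines_lost(dev_text, main_text, merged_text):
    """Lines on dev, absent from main, and absent from the merged file.

    Equivalent in spirit to the two chained `comm -23` calls: whole-line,
    order-insensitive comparison on de-duplicated lines.
    """
    dev_lines = set(dev_text.splitlines())
    main_lines = set(main_text.splitlines())
    merged_lines = set(merged_text.splitlines())

    missing = dev_lines - main_lines - merged_lines
    # Keep only lines with a letter, digit, or underscore, like the grep step
    # (ASCII approximation of [[:alnum:]_]).
    return sorted(line for line in missing if re.search(r"[A-Za-z0-9_]", line))


dev_text = (
    "def forward(x):\n"
    "    y = route(x, input_ids=None)\n"
    "    # ---\n"
    "    return y\n"
)
main_text = (
    "def forward(x):\n"
    "    y = route(x)\n"
    "    return y\n"
)

# Merge that took main wholesale: the dev-only kwarg is silently dropped.
lost = dev_only_lines_lost(dev_text, main_text, merged_text=main_text)

# Merge that kept dev's version: nothing is reported.
kept = dev_only_lines_lost(dev_text, main_text, merged_text=dev_text)

print(lost, kept)
```

Because the comparison is line-based, a dev-only line that was intentionally reworded in the merge is also flagged, so the output is a list for review and not proof of a regression.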
Summary
Nightly sync of `main` into `dev` for 10_05_2026.

- `main` into `dev`.
- `git merge origin/main -X theirs` with surgical resolution of squash-merge divergences and per-file overrides documented below.

Files taken from `main` wholesale

These files have known semantic conflicts where `dev`'s versions reference args/APIs that `main` removed or renamed. Taken via `git checkout origin/main -- <file>`:

- `megatron/training/training.py`
- `megatron/training/initialize.py`
- `megatron/training/utils.py`
- `megatron/training/datasets/data_samplers.py`
- `megatron/core/optimizer/layer_wise_optimizer.py`

Files deleted in dev but restored from main

- `megatron/core/pipeline_parallel/hybrid_cp_schedule.py`: restored because `main`'s `HybridCPDataLoaderWrapper` (appended to `data_schedule.py`) imports `BalancedCPScheduler` from it.

Modify/delete conflicts resolved

These files were deleted on `main` and modified on `dev`; we accepted `main`'s deletion (only the deleted files themselves used them, and `megatron.legacy.fp16_deprecated` survives independently):

- `megatron/legacy/model/__init__.py` (deleted)
- `megatron/legacy/model/transformer.py` (deleted)
- `tools/checkpoint/loader_legacy.py` (deleted)
- `tools/checkpoint/loader_llama_mistral.py` (deleted)

This file was deleted on `dev` (PR #3576) and modified on `main`; we kept `dev`'s deletion per intent:

- `.github/workflows/multi-approval-bot.yml` (deleted)

Special handling

- `megatron/core/datasets/data_schedule.py`: `main` and `dev` have completely different classes. Kept `dev`'s file (`BasePackingScheduler`, `DpBalancedScheduler`, `DefaultDynamicCPScheduler`, `wrap_data_iterator`, `get_batch_on_this_rank_for_sequence_packing`) and appended `main`'s `HybridCPDataLoaderWrapper` plus the `Any`, `List` typing imports and `BalancedCPScheduler` import.

Dependency triple preserved from `dev`

`pyproject.toml`, `uv.lock`, and `docker/Dockerfile.ci.dev` were kept at `dev`'s versions per the merge policy. Audit of `[tool.uv.sources]` diff:

- `flash_mla`, `transformer-engine`, `nemo-run`, `emerging_optimizers`: identical revisions, no action needed.
- `fast-hadamard-transform`: dev-only, kept.
- `nvidia-resiliency-ext`: `dev` pins `15a85156…` (Apr 6 2026), `main` pins `b2bb3d72…` (Apr 16 2026). Kept `dev`'s revision after verifying all submodules imported in the merged tree exist there (`get_write_results_queue` present, all `checkpointing.{async,local}.*` and `shared_utils.inject_fault` paths resolved).

API mismatch audit (post-merge)

Verified caller/implementation alignment in hotspots called out by the sync skill:

- `multi_latent_attention.py` (from main) → `FineGrainedActivationOffloadingInterface` ctor `(offload, tensor, name)`, `group_offload`, `group_commit`: ✅ match
- `mamba_model.py`: no offload calls in current tree.
- `init_chunk_handler` callers (`hybrid_model.py`, `gpt_model.py`): full 7-keyword-arg form matches definition. ✅
- `gated_delta_net.py`: `_resolve_cu_seqlens` both called and defined inside the file. ✅
- `distrib_optimizer.py`: `_is_distopt_quantized_param` and `_expand_quantized_param_shard_for_cast` both present and referenced consistently. ✅

CODEOWNERS

Unchanged from `dev` (verified `git diff origin/dev -- .github/CODEOWNERS` is empty).

Formatting

Ran `black==24.10.0` then `isort==5.13.2` on all 253 changed Python files in the merged tree.

Test plan