chore: nightly sync main into dev (10_05_2026) by svcnvidia-nemo-ci · Pull Request #4716 · NVIDIA/Megatron-LM

svcnvidia-nemo-ci · 2026-05-10T21:21:43Z

Summary

Nightly sync of main into dev for 10_05_2026.

Merged 120 commits from main into dev.
Python lines: +36212 / -10810 across 269 files
Merge strategy: `git merge origin/main -X theirs`, with surgical resolution of squash-merge divergences and the per-file overrides documented below.

Files taken from `main` wholesale

These files have known semantic conflicts: dev's versions reference args/APIs that main removed or renamed. Taken via `git checkout origin/main -- <file>`:

megatron/training/training.py
megatron/training/initialize.py
megatron/training/utils.py
megatron/training/datasets/data_samplers.py
megatron/core/optimizer/layer_wise_optimizer.py

Files deleted in dev but restored from main

megatron/core/pipeline_parallel/hybrid_cp_schedule.py — restored because main's HybridCPDataLoaderWrapper (appended to data_schedule.py) imports BalancedCPScheduler from it.

Modify/delete conflicts resolved

These files were deleted on main and modified on dev; we accepted main's deletion (their only users were the deleted files themselves, and megatron.legacy.fp16_deprecated survives independently):

megatron/legacy/model/__init__.py (deleted)
megatron/legacy/model/transformer.py (deleted)
tools/checkpoint/loader_legacy.py (deleted)
tools/checkpoint/loader_llama_mistral.py (deleted)

This file was deleted on dev (PR #3576) and modified on main; we kept dev's deletion per intent:

.github/workflows/multi-approval-bot.yml (deleted)

Special handling

megatron/core/datasets/data_schedule.py: main and dev have completely different classes. Kept dev's file (BasePackingScheduler, DpBalancedScheduler, DefaultDynamicCPScheduler, wrap_data_iterator, get_batch_on_this_rank_for_sequence_packing) and appended main's HybridCPDataLoaderWrapper plus the Any, List typing imports and BalancedCPScheduler import.

Dependency triple preserved from `dev`

pyproject.toml, uv.lock, and docker/Dockerfile.ci.dev were kept at dev's versions per the merge policy. Audit of [tool.uv.sources] diff:

flash_mla, transformer-engine, nemo-run, emerging_optimizers: identical revisions, no action needed.
fast-hadamard-transform: dev-only, kept.
nvidia-resiliency-ext: dev pins 15a85156… (Apr 6 2026), main pins b2bb3d72… (Apr 16 2026). Kept dev's revision after verifying all submodules imported in the merged tree exist there (get_write_results_queue present, all checkpointing.{async,local}.* and shared_utils.inject_fault paths resolved).
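The import check described for nvidia-resiliency-ext could be approximated as below. This is a minimal sketch, not the audit the bot actually ran: the helper names and the `examplepkg` package are hypothetical, and the set of available modules is assumed to be collected separately from the pinned revision.

```python
import ast


def imported_modules(source: str, prefix: str) -> set[str]:
    """Collect absolute module names imported by `source` that start with `prefix`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import a.b.c` records "a.b.c"
            found.update(a.name for a in node.names if a.name.startswith(prefix))
        elif isinstance(node, ast.ImportFrom):
            # `from a.b import c` records "a.b"; relative imports are skipped
            if node.level == 0 and node.module and node.module.startswith(prefix):
                found.add(node.module)
    return found


def missing_modules(source: str, prefix: str, available: set[str]) -> set[str]:
    """Modules the source imports that the pinned revision does not provide."""
    return imported_modules(source, prefix) - available


# Hypothetical example: `examplepkg` stands in for the audited dependency.
example_source = (
    "import examplepkg.shared_utils.inject_fault\n"
    "from examplepkg.checkpointing.local import something\n"
    "import os\n"
)
available_at_pin = {"examplepkg.shared_utils.inject_fault"}
print(missing_modules(example_source, "examplepkg", available_at_pin))
```

The sketch only looks at module paths. Confirming that a specific attribute such as get_write_results_queue exists would need a second pass over the names imported from each module.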

API mismatch audit (post-merge)

Verified caller/implementation alignment in hotspots called out by the sync skill:

multi_latent_attention.py (from main) → FineGrainedActivationOffloadingInterface ctor (offload, tensor, name), group_offload, group_commit: ✅ match
mamba_model.py: no offload calls in current tree.
init_chunk_handler callers (hybrid_model.py, gpt_model.py): full 7-keyword-arg form matches definition. ✅
gated_delta_net.py: _resolve_cu_seqlens both called and defined inside the file. ✅
distrib_optimizer.py: _is_distopt_quantized_param and _expand_quantized_param_shard_for_cast both present and referenced consistently. ✅
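One way to check caller/implementation alignment without running the model is to bind a call's arguments against the callee's signature. The sketch below uses the standard-library `inspect.signature(...).bind`; `HypotheticalInterface` is a stand-in whose parameter names come from the audit line above, not the real class.

```python
import inspect


def call_matches(func, *args, **kwargs) -> bool:
    """Return True if func(*args, **kwargs) would bind without a TypeError."""
    try:
        inspect.signature(func).bind(*args, **kwargs)
    except TypeError:
        return False
    return True


class HypotheticalInterface:
    """Stand-in with the ctor shape (offload, tensor, name) from the audit."""

    def __init__(self, offload, tensor, name):
        self.offload, self.tensor, self.name = offload, tensor, name


# A caller using the expected keywords binds cleanly ...
print(call_matches(HypotheticalInterface, offload=True, tensor=None, name="mla"))
# ... while a caller using a renamed or removed keyword does not.
print(call_matches(HypotheticalInterface, offload=True, tensor=None, label="mla"))
```

Binding only validates argument names and counts. It says nothing about whether the values have the types or semantics the callee expects.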

CODEOWNERS

Unchanged from dev (verified that `git diff origin/dev -- .github/CODEOWNERS` is empty).

Formatting

Ran black==24.10.0 then isort==5.13.2 on all 253 changed Python files in the merged tree.

Test plan

Unit tests pass
Integration tests pass (or are skipped per maintainer policy)
Internal GitLab functional tests pass

Phlip79 · 2026-05-11T05:26:11Z

/ok to test bfce39d

FDecaYed · 2026-05-11T10:28:10Z

/ok to test fb297f0

FDecaYed

Fixed the last test failure. LGTM

…llel

The Phase 1 merge took main's versions of these three files (per the nightly-sync skill's "Files to Override from Main" list), but main still references the deprecated arg name `hybrid_context_parallel`. Dev renamed it to `dynamic_context_parallel` in commit cde56a4 ("Fix for rope when enabling THD + Dynamic-CP; use the naming Dynamic-CP").

model_parallel_config.py keeps both attributes as a deprecation shim and copies `hybrid_context_parallel=True -> dynamic_context_parallel=True` in __post_init__, but not the other direction. So when a test passes `--dynamic-context-parallel: true`, args.hybrid_context_parallel is False and any code that gates on it silently takes the wrong branch.

This caused gpt3_mcore_te_tp2_pp1_cp4_dcp to fail in CI: data_samplers.py:109 used `if args.hybrid_context_parallel:` to select the pass-through `collate_fn=lambda x: x`. With the test's --dynamic-context-parallel flag, the condition was False, so torch's default-collate ran on tensors from the SFT-mock THD path and hit non-resizable shared-memory storage:

    RuntimeError: Trying to resize storage that is not resizable
      in torch/utils/data/_utils/collate.py:274 collate_tensor_fn

Renames 8 references across 3 files:
- megatron/training/datasets/data_samplers.py: 2 references
- megatron/training/utils.py: 5 references (TP-broadcast batch shape)
- megatron/training/training.py: 1 reference (HybridCPDataLoaderWrapper iter wrap)

All 3 files pass ast.parse; zero remaining `args.hybrid_context_parallel` references in megatron/. CODEOWNERS and the pyproject.toml/uv.lock/Dockerfile.ci.dev triple unchanged from origin/dev.
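The one-directional shim described above can be reproduced in isolation. The following is a toy sketch: `ToyConfig` and `pick_collate` are hypothetical stand-ins, and only the two attribute names and the old-to-new copy direction are taken from the commit message.

```python
from dataclasses import dataclass


@dataclass
class ToyConfig:
    hybrid_context_parallel: bool = False   # deprecated name
    dynamic_context_parallel: bool = False  # current name

    def __post_init__(self):
        # One-way deprecation shim: old name implies new name, never the reverse.
        if self.hybrid_context_parallel:
            self.dynamic_context_parallel = True


def pick_collate(cfg: ToyConfig, gate_attr: str) -> str:
    """Return which collate path a gate on `gate_attr` would select."""
    return "passthrough" if getattr(cfg, gate_attr) else "default"


# A run that sets only the new flag, as the failing test did.
new_style = ToyConfig(dynamic_context_parallel=True)

# Gating on the deprecated name silently picks the default collate path.
print(pick_collate(new_style, "hybrid_context_parallel"))   # default
# Gating on the current name picks the intended pass-through path.
print(pick_collate(new_style, "dynamic_context_parallel"))  # passthrough
```

Because the shim copies old to new, code that gates on the new name works for both spellings of the flag, which is why the rename is sufficient on its own.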

Phlip79 · 2026-05-11T15:43:51Z

/ok to test c3dbea7

The Phase 1 merge of #4716 took main's version of training.py per the skill's "Files to Override from Main" rule, which uses the HybridCPDataLoaderWrapper class wrapped once outside train_step. That broke the DCP test (gpt3_mcore_te_tp2_pp1_cp4_dcp) in two ways:

1. RuntimeError: Trying to resize storage that is not resizable - fixed by c3dbea7 (rename args.hybrid_context_parallel -> args.dynamic_context_parallel).
2. AssertionError: data iterator is not wrapped with RerunDataIterator - the outside-train_step wrap converted train_data_iterator from a RerunDataIterator to a plain iterator, but rerun_state_machine's should_run_forward_backward asserts the wrap.

PR #4659 resolved this by keeping dev's wrap_data_iterator pattern instead of main's HybridCPDataLoaderWrapper, calling wrap_data_iterator INSIDE train_step (after should_run_forward_backward) and inside the eval loop. That keeps the original RerunDataIterator visible to the assertion and only swaps in the packed iterator for the forward_backward_func call.

Port that pattern verbatim from PR #4659's training.py:
- Replace HybridCPDataLoaderWrapper import with wrap_data_iterator
- Remove the outside-train_step wrap (was at line 3000-3001)
- Inside train_step: add the `if config.sequence_packing_scheduler is not None` block before forward_backward_func, unpacking (data_iterator, num_microbatches, seqlen_sum_this_global_batch, seqlen_squared_sum_this_global_batch); pass num_microbatches=num_microbatches to forward_backward_func
- Inside eval loop: add the same wrap with try/except StopIteration, using packed_data_iterator and scheduled_eval_num_microbatches

Note: this leaves HybridCPDataLoaderWrapper and its imports (Any, List, BalancedCPScheduler) as dead code in megatron/core/datasets/data_schedule.py. Cleanup of that file (and of the remaining structural diff in training.py / data_samplers.py / utils.py vs PR #4659's tree) is left to follow-up.
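The ordering issue behind the second failure can be illustrated with a toy model. Everything here is a hypothetical stand-in (`RerunIterator`, `should_run`, `pack`, `train_step`); the sketch only shows why the type check must see the original wrapper before any packing step replaces it with a plain iterator.

```python
class RerunIterator:
    """Stand-in for an iterator type that a state machine insists on seeing."""

    def __init__(self, it):
        self._it = iter(it)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._it)


def should_run(data_iterator) -> bool:
    # Mirrors the kind of assertion described above.
    assert isinstance(data_iterator, RerunIterator), "data iterator is not wrapped"
    return True


def pack(data_iterator, group=2):
    """Hypothetical packing step: returns (plain iterator of groups, microbatch count)."""
    items = [next(data_iterator) for _ in range(group)]
    return iter([items]), 1


def train_step(data_iterator):
    should_run(data_iterator)                        # sees the original wrapper
    packed, num_microbatches = pack(data_iterator)   # swap happens only afterwards
    return list(packed), num_microbatches


it = RerunIterator(range(4))
print(train_step(it))  # packing inside the step passes the check

# Packing once outside the step hands train_step a plain iterator instead.
packed_outside, _ = pack(RerunIterator(range(4)))
try:
    train_step(packed_outside)
except AssertionError as err:
    print("outside wrap fails:", err)
```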

Phlip79 · 2026-05-11T20:17:38Z

/ok to test e04e1c2

svcnvidia-nemo-ci · 2026-05-11T21:26:45Z

Superseded by today's nightly sync.

Merges 8 commits from main into dev. Dev already contains yesterday's sync (PR #4716) plus follow-up fixes, so this PR only carries main commits made after that sync.

Notable changes:
- 434368c build(deps): bump nvidia-modelopt to 0.43 (#4723)
- e42e2fa ci: Major refactor of release-workflows (#4602)
- 33d47e0 [ci] fix: treat cancelled run-main-script step as failure (#4727)
- 5123f6a ci: revert bad uv.lock bump and label future bumps with Run functional tests (#4730)
- ad58411 Add Python-side guardrail for DeepEP IB limits (#4719)
- e93755e chore(beep boop): Bump (main) (2026-05-11)
- a2ec5c1 Revert Add Python-side guardrail for HybridEP IB limit (#4718)
- 5e31514 Create a Protocol for the MLP layer of TransformerLayer (#3435)

Kept dev's pyproject.toml, uv.lock, docker/Dockerfile.ci.dev, and .github/CODEOWNERS (per nightly-sync skill). Ran black + isort on changed Python files.

Adds a deterministic gate step (preservation workflow) and a prompt instruction (main sync workflow) to ensure the sync bot never modifies any `golden_values*` file. Reference outputs from successful training runs cannot be regenerated by the bot; PR NVIDIA#4716 modified ~89,652 lines across dozens of `golden_values_*.json` files.

- Preservation gate: a pre-audit step fails the workflow if any file whose basename starts with `golden_values` differs from origin/dev. Reports the offending paths inline as annotations and emits the exact `git checkout` one-liner to restore them.
- Sync prompt: instructs the agent to run that one-liner after the Phase 1 merge and before every push (Phase 1, Phase 3 fixes, amends).
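The gate's core rule (a basename prefix match plus a restore command) is small enough to sketch. This is an illustration under assumptions: the function names are hypothetical, the changed-path list is assumed to come from something like `git diff --name-only origin/dev`, and the exact one-liner the workflow emits is not shown in this PR, so the command format below is a guess at its shape.

```python
import posixpath
import shlex


def golden_value_violations(changed_paths):
    """Return changed paths whose basename starts with 'golden_values'."""
    return sorted(
        p for p in changed_paths
        if posixpath.basename(p).startswith("golden_values")
    )


def restore_command(paths, ref="origin/dev"):
    """Build a single git checkout command restoring `paths` from `ref`."""
    if not paths:
        return ""
    quoted = " ".join(shlex.quote(p) for p in paths)
    return f"git checkout {shlex.quote(ref)} -- {quoted}"


# Hypothetical changed-file list for illustration.
changed = [
    "tests/a/golden_values_dev.json",
    "tests/a/model_config.yaml",
    "megatron/training/training.py",
]
violations = golden_value_violations(changed)
if violations:
    print("golden values modified:", violations)
    print(restore_command(violations))
```

Matching on the basename rather than the full path keeps the rule independent of where a test directory lives in the tree.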

…y schedule

Two changes to the main-to-dev nightly sync workflow:

1. Add a "Dev-feature preservation audit" to the existing pre-push invariant checks in the nightly-sync skill (`.claude/skills/nightly-sync/SKILL.md`). The audit is bash that the sync bot must run before every push (Phase 1 and every Phase 3 fix-push):

    for f in $(git diff --name-only "$BASE"...HEAD ...); do
      missing=$(comm -23 \
        <(git show "origin/dev:$f" | sort -u) \
        <(git show "origin/main:$f" | sort -u) \
        | comm -23 - <(sort -u "$f") \
        | grep -E '[[:alnum:]_]'
      )
      [ -n "$missing" ] && record_violation
    done

   Lines present on `origin/dev`, absent from `origin/main`, and absent from the merged tree are the textbook "main2dev silently reverted a dev-only feature" pattern that bit NVIDIA#4659 and NVIDIA#4716. The audit exempts the skill's "Files to Override from Main" list (the bot may legitimately keep main's version there) and skips the dependency triple plus CODEOWNERS (already checked separately).

   Recent regressions the audit would have caught:
   - `transformer_layer.py` `_forward_mlp_router(input_ids=None)`
   - `token_dispatcher.py` `num_sms_preprocessing_api=...` kwarg
   - `moe_layer.py` `self._maybe_record_overload_factor(...)` call
   - `gpt_dynamic_inference_with_coordinator.py` `parse_and_validate_args` import
   - `datasets/readme.md` "Packing Scheduler" section
   - `data_samplers.py` / `utils.py` / `training.py` `args.dynamic_context_parallel` references

2. Schedule change in `.github/workflows/nightly-sync-main-to-dev.yml`: was `cron: '0 21 * * *'` (daily at 21:00 UTC); now `cron: '0 15 * * 1,4'` (Mondays and Thursdays at 15:00 UTC, i.e. 8 AM PDT / 7 AM PST since GitHub Actions cron is UTC-only).
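The set logic of the preservation audit (lines on dev, minus lines on main, minus lines in the merged file, filtered to lines containing a word character) can be restated in Python for a single file. This is a sketch of the same idea, not the skill's implementation; the function names are hypothetical and the three texts are passed in directly rather than read via `git show`.

```python
import re

# Approximates grep -E '[[:alnum:]_]': keep lines with at least one word character.
_WORD_CHAR = re.compile(r"[A-Za-z0-9_]")


def meaningful(lines):
    """Drop lines that are only punctuation or whitespace."""
    return {ln for ln in lines if _WORD_CHAR.search(ln)}


def dropped_dev_lines(dev_text: str, main_text: str, merged_text: str):
    """Lines unique to dev (not on main) that the merged file no longer contains."""
    dev = set(dev_text.splitlines())
    main = set(main_text.splitlines())
    merged = set(merged_text.splitlines())
    return sorted(meaningful(dev - main - merged))


# Toy inputs: dev added one feature line that the merge lost.
dev_version = "x = 1\nif args.dynamic_context_parallel:\n    run()\n)\n"
main_version = "x = 1\n    run()\n"
merged_version = "x = 1\n    run()\n"

print(dropped_dev_lines(dev_version, main_version, merged_version))
```

Like the bash version, this compares unique lines as sets, so it ignores ordering and duplicates. A dev-only line that was moved elsewhere in the file would not be flagged, and a dev-only line that was intentionally rewritten would be.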

minitu and others added 30 commits April 22, 2026 18:02

Fix nvtx_decorator to check _nvtx_enabled at call time (#4184)

9a3c927

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Xin Yao <xiny@nvidia.com>

fix merges_file typo in megatron_hf_tokenizer (#4392)

60f71e1

Co-authored-by: john2 <john2@jrlogin01.jureca>

Enable NullTokenizer for pretraining to reduce I/O access (#4057)

c9dfe34

docs: Add SECURITY.md (#4431)

7073492

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Mamba inference opt (#4414)

40627d0

Co-authored-by: root <root@nvl72098-T17.cm.cluster> Co-authored-by: William Dykas <wdykas@oci-hsg-cs-001-vscode-03.cm.cluster> Co-authored-by: root <root@nvl72160-T13.cm.cluster>

DDP refactoring: Extract parameter layout computation into optimizer …

55b8111

…classmethod (#3812) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Update PR template with explicit request for issue (#4409)

90e09b6

Misc inference fixes (#4397)

ab2b33d

Rename Mamba to Hybrid outside megatron/core (#4159)

60408d5

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Include mtp layers in token per expert logging (#4412)

a52014c

fix: NVRx async compatibility and defer resiliency import (#4420)

32275b2

Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>

ci: add base_sha to codecov/codecov-action upload step (#4445)

9bb35a8

Signed-off-by: oliver könig <okoenig@nvidia.com>

Update copy-pr-bot.yaml [skip ci]

3034d86

fix(checkpoint_inspector): allow empty --param-to-param-group-map-json (

f78ed05

#4403) Co-authored-by: Antoni-Joan Solergibert <asolergibert@nvidia.com>

Add the YARN support for hybrid_model (#4244)

4d6cdd5

Co-authored-by: Philip Petrakian <ppetrakian@nvidia.com>

[training migration] Add container class for config dataclasses (#4227)

41ffa83

Signed-off-by: Maanu Grover <maanug@nvidia.com> Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

Inference: Fix broken functional tests on gitlab (#4454)

a1165fa

SafeUnpickler class for safe pickle usage (#4319)

d4cacef

Signed-off-by: dimapihtar <dpykhtar@nvidia.com>

get rid of weights_only=False (#4434)

109feda

Signed-off-by: dimapihtar <dpykhtar@nvidia.com>

Inference | Per-block MoE routing storage for prefix caching (#4301)

64870c1

Co-authored-by: Siddharth Singh <sidsingh@nvidia.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add troubleshooting tip for 'access forbidden' (#4449)

017e684

Fix checkpoint loading with rerun state machine (#4448)

3d7bcd3

Add misc CUDA graph sugar to CudaGraphManager (#4425)

9b02206

Inference: Add the embedding and output layer in the full_iteration_i…

35f76df

…nference cuda graph scope for hybrid models (#4440)

Important bugfixes in local CG implementation that were leading to lo…

481efd0

…ss curve gaps for latent MoE models (#4433) Signed-off-by: root <jiemingz@nvidia.com>

fix: Replace polynomial rolling hash with SHA-256 for prefix caching (#…

e9abb6c

…4158) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat(ckpt): expose validate_access_integrity knob on dist-ckpt load (#…

377af02

…4422) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fix multivalidation (#3388)

241a5ca

Signed-off-by: rprenger <rprenger@nvidia.com>

Add missing knob for reduce_scatter_with_fp32_accumulation (#4410)

f2dcd42

Signed-off-by: qiyuw <qiyuw@nvidia.com> Co-authored-by: Antoni-Joan Solergibert <asolergibert@nvidia.com>

Enable CUDA graphs for MTP inference (#4260)

03f4111

Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>

Phlip79 force-pushed the main2dev/10_05_2026 branch from dc4ca1f to bfce39d Compare May 11, 2026 05:25

copy-pr-bot Bot temporarily deployed to test May 11, 2026 05:27 Inactive

Phlip79 marked this pull request as ready for review May 11, 2026 05:27

Phlip79 requested review from a team as code owners May 11, 2026 05:27

svcnvidia-nemo-ci added the complexity: high label May 11, 2026

fix

fb297f0

Signed-off-by: Deyu Fu <deyuf@nvidia.com>

FDecaYed approved these changes May 11, 2026

copy-pr-bot Bot temporarily deployed to test May 11, 2026 10:29 Inactive

copy-pr-bot Bot temporarily deployed to test May 11, 2026 15:45 Inactive

copy-pr-bot Bot temporarily deployed to test May 11, 2026 20:19 Inactive

svcnvidia-nemo-ci closed this May 11, 2026

Phlip79 reopened this May 11, 2026

balasaajay merged commit d338cc5 into dev May 11, 2026
178 of 180 checks passed

balasaajay deleted the main2dev/10_05_2026 branch May 11, 2026 23:57

This was referenced May 12, 2026

chore: nightly sync main into dev (11_05_2026) #4739

Closed

chore: nightly sync main into dev (12_05_2026) #4744

Closed

Victarry mentioned this pull request May 14, 2026

[Dev] Fix full CUDA graph capture reverted by pull main #4792

Merged

Phlip79 mentioned this pull request May 19, 2026

Add dev-feature preservation gate and change schedule #4773

Merged

balasaajay merged 126 commits into dev from main2dev/10_05_2026

20 participants
