chore: nightly sync main into dev (06_05_2026) by svcnvidia-nemo-ci · Pull Request #4659 · NVIDIA/Megatron-LM

svcnvidia-nemo-ci · 2026-05-06T21:33:36Z

Summary

Nightly sync of main into dev.

102 commits merged from main
Python lines: +33575 / -8262 across 253 files
Merge strategy: git merge origin/main -X theirs --no-edit, with manual reconciliation for known conflicts.

Files taken from main

megatron/core/optimizer/layer_wise_optimizer.py (no-op; identical between main and dev currently)

Files kept on dev (overriding the skill's default of taking main's version)

The skill recommends taking main's version of these files for known semantic conflicts. In this sync the situation is reversed: dev's versions are the more current ones. Main's versions reference args.hybrid_context_parallel, but dev renamed that flag to args.dynamic_context_parallel (commit cde56a4 "Fix for rope when enabling THD + Dynamic-CP; use the naming Dynamic-CP"). Taking main's versions would lead to AttributeError at runtime.

megatron/training/training.py
megatron/training/utils.py
megatron/training/initialize.py
megatron/training/datasets/data_samplers.py
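
The rename hazard described above can be checked mechanically before a push. The following is a minimal sketch, not part of the sync skill: it walks a module's AST and reports every `args.<name>` access whose name is in a set of retired flags. The helper name `find_stale_flag_uses` and the sample source are hypothetical; only the two flag names come from this PR.

```python
import ast


def find_stale_flag_uses(source: str, stale_names: set[str]) -> list[tuple[int, str]]:
    """Return sorted (line, attribute) pairs for every `args.<stale name>` access."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id == "args"
            and node.attr in stale_names
        ):
            hits.append((node.lineno, node.attr))
    return sorted(hits)


# Hypothetical snippet standing in for a file taken from main.
sample = (
    "def build(args):\n"
    "    if args.hybrid_context_parallel:\n"
    "        return 1\n"
    "    return args.dynamic_context_parallel\n"
)
stale_hits = find_stale_flag_uses(sample, {"hybrid_context_parallel"})
print(stale_hits)  # [(2, 'hybrid_context_parallel')]
```

A scan like this only sees direct `args.<flag>` reads; accesses through `getattr` or a differently named namespace object would need a separate check.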

Files deleted in main, accepted as deletion

These were legacy GPT loaders removed in main #4322 ("remove legacy GPT code"). Nothing in the merged tree references them.

tools/checkpoint/loader_legacy.py
tools/checkpoint/loader_llama_mistral.py

Files deleted in dev, NOT restored

megatron/core/pipeline_parallel/hybrid_cp_schedule.py was intentionally removed in dev (commit cde56a4) as part of the dynamic-CP refactor. It is not restored: the merged tree uses dev's wrap_data_iterator mechanism, and no caller imports BalancedCPScheduler or HybridCPDataLoaderWrapper.

Dependency triple kept on dev

Per the skill's hard rule: pyproject.toml, uv.lock, docker/Dockerfile.ci.dev were restored from origin/dev. Dev's nvidia-resiliency-ext pinned revision (15a8515) was verified to contain all APIs the merged tree imports (get_write_results_queue, CheckpointMetadataCache, CachedMetadataFileSystemReader, etc.). No git-source reconciliation required.
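
A check of this kind can be sketched as an attribute probe on an imported module. This is an illustration only: the PR does not say which nvidia-resiliency-ext submodules export the listed names, so the example runs against the standard-library `json` module instead, and `missing_attributes` is a hypothetical helper name.

```python
import importlib


def missing_attributes(module_name: str, names: list[str]) -> list[str]:
    """Return the names that the imported module does not expose."""
    module = importlib.import_module(module_name)
    return [name for name in names if not hasattr(module, name)]


# Demonstrated on a standard-library module so the sketch runs anywhere.
# For the real audit, the module path and symbol list would come from the
# imports in the merged tree (not reproduced here).
missing = missing_attributes("json", ["dumps", "loads", "not_a_real_api"])
print(missing)  # ['not_a_real_api']
```

An empty result corresponds to the "no git-source reconciliation required" outcome reported above.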

API mismatch detection

After main's versions of these files were taken (and later reverted), the following call sites were audited:

multi_latent_attention.py calls off_interface.group_offload() and off_interface.group_commit() — both exist on dev's FineGrainedActivationOffloadingInterface
gpt_model.py and hybrid_model.py call init_chunk_handler(6 kwargs) — matches dev's signature
_resolve_cu_seqlens exists on dev's GatedDeltaNet
_is_distopt_quantized_param exists on dev's DistributedOptimizer
CudaGraphScope exists in dev's enums.py

No active mismatches remain.
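
One way to automate a signature audit like the one above is `inspect.signature(...).bind(...)`, which raises `TypeError` when a call shape does not fit the callee. The sketch below assumes a stand-in function; the real `init_chunk_handler` parameters are not listed in this PR, so `chunk_handler_stub` and its keyword names are hypothetical.

```python
import inspect


def call_matches_signature(func, *args, **kwargs) -> bool:
    """Return True if func could be called with these arguments."""
    try:
        inspect.signature(func).bind(*args, **kwargs)
    except TypeError:
        return False
    return True


# Hypothetical stand-in for a dev-side function with keyword-only parameters.
def chunk_handler_stub(*, first, second=None):
    return first, second


ok_call = call_matches_signature(chunk_handler_stub, first=1, second=2)
bad_call = call_matches_signature(chunk_handler_stub, first=1, third=3)
print(ok_call, bad_call)  # True False
```

Binding checks only the argument shape, not argument types or runtime behavior, so it complements rather than replaces the functional tests.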

Linting

black --config pyproject.toml (24.10.0): no diff
isort (5.13.2): no diff
pylint on changed megatron/core/ files (84 files): 10.00/10

Remerge diff

Remerge diff stat (file-level summary)

Date:   Wed May 6 21:32:39 2026 +0000

    chore: nightly sync main into dev (06_05_2026)

 .github/workflows/cicd-main.yml                    |    5 -
 docker/Dockerfile.ci.dev                           |    4 -
 docs/conf.py                                       |   18 +-
 .../detxoify_lm/generate_samples_gpt.py            |   76 +-
 .../gpt/gpt_dynamic_inference_with_coordinator.py  |    6 +-
 examples/mimo/train.py                             |    6 +-
 examples/multimodal/layer_specs.py                 |    2 +-
 examples/multimodal/model.py                       |   85 +-
 examples/post_training/modelopt/convert_model.py   |   19 +-
 examples/post_training/modelopt/export.py          |    5 +-
 examples/post_training/modelopt/finetune.py        |   67 +-
 examples/post_training/modelopt/generate.py        |   27 +-
 examples/post_training/modelopt/mmlu.py            |   45 +-
 .../modelopt/offline_feature_extract.py            |   56 +-
 examples/post_training/modelopt/prune.py           |   13 +-
 examples/post_training/modelopt/quantize.py        |   55 +-
 examples/post_training/modelopt/validate.py        |   32 +-
 gpt_builders.py                                    |   77 +-
 hybrid_builders.py                                 |    4 +-
 megatron/core/datasets/readme.md                   |   64 --
 megatron/core/transformer/mlp.py                   |    4 -
 megatron/core/transformer/moe/fused_a2a.py         |   13 -
 megatron/core/transformer/moe/moe_layer.py         |    8 -
 megatron/core/transformer/moe/token_dispatcher.py  |    4 -
 megatron/core/transformer/transformer_config.py    |   27 -
 megatron/core/transformer/transformer_layer.py     |   13 -
 megatron/elastification/arguments.py               |    6 +-
 megatron/elastification/flextron_utils.py          |   11 +-
 megatron/elastification/pretrain_hybrid_flex.py    |  136 ++-
 .../elastification/router/hybrid_flex_router.py    |    7 +-
 megatron/legacy/model/__init__.py                  |    5 -
 megatron/post_training/arguments.py                |    7 +-
 megatron/post_training/model_builder.py            |   55 +-
 megatron/training/activation_logging.py            |   37 +-
 megatron/training/argument_utils.py                |   90 +-
 megatron/training/arguments.py                     |  589 +----------
 megatron/training/async_utils.py                   |    4 +-
 megatron/training/checkpointing.py                 |   33 +-
 megatron/training/config/__init__.py               |   27 +-
 megatron/training/config/container.py              |   40 +-
 megatron/training/config/instantiate_utils.py      |   46 +-
 megatron/training/config/training_config.py        |   24 +-
 megatron/training/config/utils.py                  |   13 +-
 megatron/training/config/yaml_utils.py             |   10 +-
 megatron/training/datasets/data_samplers.py        |   51 +-
 megatron/training/training.py                      |  261 +----
 megatron/training/utils.py                         |    9 -
 model_provider.py                                  |   12 +-
 pretrain_bert.py                                   |   32 +-
 pretrain_gpt.py                                    |   42 +-
 pretrain_hybrid.py                                 |   65 +-
 pretrain_mamba.py                                  |  363 -------
 pretrain_t5.py                                     |    2 +-
 pretrain_vlm.py                                    |   10 +-
 pyproject.toml                                     |   19 +-
 .../unit_tests/fusions/test_mla_yarn_rope_apply.py |   10 -
 tests/unit_tests/models/test_hybrid_moe_model.py   |   16 -
 tools/checkpoint/checkpoint_inspector.py           |    9 +-
 tools/checkpoint/convert.py                        |   62 +-
 tools/checkpoint/dist_checkpoint_io.py             |   45 +-
 tools/checkpoint/gpt_hybrid_conversion.py          |  171 +--
 tools/checkpoint/loader_legacy.py                  |  416 --------
 tools/checkpoint/loader_llama_mistral.py           |  751 -------------
 tools/checkpoint/loader_mixtral_hf.py              |   12 +-
 tools/checkpoint/remap_gpt_dsa_to_mamba.py         |    5 -
 tools/prepare_cache.py                             |    9 +-
 tools/preprocess_data.py                           |  217 ++--
 tools/preprocess_mmdata.py                         |  160 ++-
 train_rl.py                                        |   20 +-
 uv.lock                                            | 1114 +++-----------------
 70 files changed, 1258 insertions(+), 4500 deletions(-)

Full diff omitted to keep the PR body compact (~10k lines). Reviewers can run git show --remerge-diff 431ac5df05104bc1d5015f5ac1842285d1c5e6ee locally or browse the merge commit on GitHub.

svcnvidia-nemo-ci · 2026-05-07T03:05:10Z

/ok to test 46ee761

svcnvidia-nemo-ci · 2026-05-07T21:18:46Z

Superseded by today's nightly sync.

Phlip79 · 2026-05-08T05:41:32Z

/ok to test 676f3fa

FDecaYed · 2026-05-08T08:09:03Z

/ok to test 0cb4ec3

Phlip79 · 2026-05-08T16:33:42Z

/ok to test 2207908

svcnvidia-nemo-ci · 2026-05-08T21:19:34Z

Superseded by today's nightly sync.

svcnvidia-nemo-ci · 2026-05-09T21:13:16Z

Superseded by today's nightly sync.

Brings forward six regressions previously fixed manually on PR #4659 (commits 676f3fa, d019432, 2207908) that the new sync of main into dev re-introduced, plus one further dropped import that surfaced as a runtime NameError on gpt_dynamic_inference CI:

- transformer_layer.py: add input_ids=None to _forward_mlp_router so the body's input_ids reference (line 2103) resolves; callers already pass input_ids from _forward_mlp.
- param_and_grad_buffer.py: replace inline fp8_params else-branch in finish_param_sync with self._post_param_sync() to match the three other call sites and dev's refactor (PR #4563).
- moe_layer.py: call _maybe_record_overload_factor() on dispatched_input before the routed-experts conditional, as on dev.
- token_dispatcher.py: pass num_sms_preprocessing_api=self.config.moe_hybridep_num_sms_preprocessing to _HybridEPManager constructor.
- datasets/readme.md: restore the dev-only "Packing Scheduler" documentation section before main's "Offline cache preparation".
- gpt_dynamic_inference_with_coordinator.py: restore `from megatron.training.arguments import parse_and_validate_args` import; the call at line 216 raised NameError on the gpt_dynamic_inference_tp2_pp2_dp2_583m_logitsmatch_zmq CI job.

Verified: ast.parse OK on all 5 changed .py files; ruff F821 reports only pre-existing TYPE_CHECKING-guarded names; CODEOWNERS and the pyproject.toml/uv.lock/Dockerfile.ci.dev triple unchanged from origin/dev.
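
The ast.parse verification mentioned above can be sketched as follows. This is a minimal illustration with hypothetical in-memory sources rather than the changed files themselves. It also shows why a second check such as ruff F821 is useful: ast.parse validates syntax only, so a file that calls a function whose import was dropped still parses cleanly.

```python
import ast


def syntax_errors(sources: dict[str, str]) -> dict[str, str]:
    """Map each file name that fails to parse to a short error description."""
    errors = {}
    for name, text in sources.items():
        try:
            ast.parse(text, filename=name)
        except SyntaxError as exc:
            errors[name] = f"line {exc.lineno}: {exc.msg}"
    return errors


# Hypothetical sources: one valid, one with a dropped import, one malformed.
sources = {
    "ok.py": "def f(x, input_ids=None):\n    return input_ids\n",
    "missing_import.py": "args = parse_and_validate_args()\n",
    "broken.py": "if True\n    pass\n",
}
errors = syntax_errors(sources)
print(sorted(errors))  # ['broken.py']
```

Only `broken.py` is reported; `missing_import.py` would fail at run time with NameError, which is the class of problem an undefined-name lint is meant to catch.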

The Phase 1 merge of #4716 took main's version of training.py per the skill's "Files to Override from Main" rule, which uses the HybridCPDataLoaderWrapper class wrapped once outside train_step. That broke the DCP test (gpt3_mcore_te_tp2_pp1_cp4_dcp) in two ways:

1. RuntimeError: Trying to resize storage that is not resizable - fixed by c3dbea7 (rename args.hybrid_context_parallel -> args.dynamic_context_parallel).
2. AssertionError: data iterator is not wrapped with RerunDataIterator - the outside-train_step wrap converted train_data_iterator from a RerunDataIterator to a plain iterator, but rerun_state_machine's should_run_forward_backward asserts the wrap.

PR #4659 resolved this by keeping dev's wrap_data_iterator pattern instead of main's HybridCPDataLoaderWrapper, calling wrap_data_iterator INSIDE train_step (after should_run_forward_backward) and inside the eval loop. That keeps the original RerunDataIterator visible to the assertion and only swaps in the packed iterator for the forward_backward_func call.

Port that pattern verbatim from PR #4659's training.py:

- Replace HybridCPDataLoaderWrapper import with wrap_data_iterator
- Remove the outside-train_step wrap (was at line 3000-3001)
- Inside train_step: add the if config.sequence_packing_scheduler is not None block before forward_backward_func, unpacking (data_iterator, num_microbatches, seqlen_sum_this_global_batch, seqlen_squared_sum_this_global_batch); pass num_microbatches=num_microbatches to forward_backward_func
- Inside eval loop: add the same wrap with try/except StopIteration, using packed_data_iterator and scheduled_eval_num_microbatches

Note: this leaves HybridCPDataLoaderWrapper and its imports (Any, List, BalancedCPScheduler) as dead code in megatron/core/datasets/data_schedule.py. Cleanup of that file (and of the remaining structural diff in training.py / data_samplers.py / utils.py vs PR #4659's tree) is left to follow-up.
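
The ordering issue described above can be reproduced with toy stand-ins. The sketch below is not Megatron code: `RerunDataIterator`, `should_run_forward_backward`, `wrap_data_iterator`, and `train_step` are simplified, hypothetical versions that keep only the type check and the point at which packing happens, and the toy wrapper returns two values rather than the four the real one unpacks.

```python
class RerunDataIterator:
    """Toy stand-in for the real wrapper; only its type matters here."""

    def __init__(self, iterable):
        self._it = iter(iterable)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._it)


def should_run_forward_backward(data_iterator) -> bool:
    assert isinstance(
        data_iterator, RerunDataIterator
    ), "data iterator is not wrapped with RerunDataIterator"
    return True


def wrap_data_iterator(data_iterator, global_batch_size, pack_size):
    """Toy packer: pull one global batch and group it into packed microbatches."""
    samples = [next(data_iterator) for _ in range(global_batch_size)]
    packed = [samples[i:i + pack_size] for i in range(0, len(samples), pack_size)]
    return iter(packed), len(packed)


def train_step(data_iterator, global_batch_size=4, pack_size=2):
    # The assertion still sees the original RerunDataIterator...
    should_run_forward_backward(data_iterator)
    # ...and only the forward/backward work sees the packed iterator.
    packed_iterator, num_microbatches = wrap_data_iterator(
        data_iterator, global_batch_size, pack_size
    )
    return [sum(microbatch) for microbatch in packed_iterator], num_microbatches


train_data_iterator = RerunDataIterator(range(8))
first = train_step(train_data_iterator)   # ([1, 5], 2)
second = train_step(train_data_iterator)  # ([9, 13], 2)

# Wrapping once OUTSIDE train_step hands the assertion a plain iterator.
outside_error = None
try:
    outside, _ = wrap_data_iterator(RerunDataIterator(range(4)), 4, 2)
    train_step(outside)
except AssertionError as exc:
    outside_error = str(exc)
print(first, second, outside_error)
```

The design point is that the wrap is local to each step, so the long-lived iterator object keeps its original type across steps.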

…y schedule

Two changes to the main-to-dev nightly sync workflow:

1. Add a "Dev-feature preservation audit" to the existing pre-push invariant checks in the nightly-sync skill (`.claude/skills/nightly-sync/SKILL.md`). The audit is bash that the sync bot must run before every push (Phase 1 and every Phase 3 fix-push):

       for f in $(git diff --name-only "$BASE"...HEAD ...); do
         missing=$(comm -23 \
             <(git show "origin/dev:$f" | sort -u) \
             <(git show "origin/main:$f" | sort -u) \
           | comm -23 - <(sort -u "$f") \
           | grep -E '[[:alnum:]_]' )
         [ -n "$missing" ] && record_violation
       done

   Lines present on `origin/dev`, absent from `origin/main`, and absent from the merged tree are the textbook "main2dev silently reverted a dev-only feature" pattern that bit NVIDIA#4659 and NVIDIA#4716. The audit exempts the skill's "Files to Override from Main" list (the bot may legitimately keep main's version there) and skips the dependency triple plus CODEOWNERS (already checked separately).

   Recent regressions the audit would have caught:

   - `transformer_layer.py` `_forward_mlp_router(input_ids=None)`
   - `token_dispatcher.py` `num_sms_preprocessing_api=...` kwarg
   - `moe_layer.py` `self._maybe_record_overload_factor(...)` call
   - `gpt_dynamic_inference_with_coordinator.py` `parse_and_validate_args` import
   - `datasets/readme.md` "Packing Scheduler" section
   - `data_samplers.py` / `utils.py` / `training.py` `args.dynamic_context_parallel` references

2. Schedule change in `.github/workflows/nightly-sync-main-to-dev.yml`: was `cron: '0 21 * * *'` (daily at 21:00 UTC); now `cron: '0 15 * * 1,4'` (Mondays and Thursdays at 15:00 UTC, i.e. 8 AM PDT / 7 AM PST since GitHub Actions cron is UTC-only).
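
The set logic behind the audit can be restated in Python for readers less familiar with `comm`. This is a sketch of the same idea operating on in-memory text, not the skill's implementation; `silently_dropped_lines` and the sample inputs are hypothetical.

```python
import re


def silently_dropped_lines(dev_text: str, main_text: str, merged_text: str) -> list[str]:
    """Lines on dev, absent from main, and absent from the merged file."""
    dev_only = set(dev_text.splitlines()) - set(main_text.splitlines())
    dropped = dev_only - set(merged_text.splitlines())
    # Mirror the grep filter: ignore lines with no letter, digit, or underscore.
    return sorted(line for line in dropped if re.search(r"[A-Za-z0-9_]", line))


# Hypothetical file contents: dev added a keyword argument that main lacks,
# and the merge result silently took main's line.
dev = "def route(x, input_ids=None):\n    return x\n    )\n"
main = "def route(x):\n    return x\n"
merged = "def route(x):\n    return x\n"

violations = silently_dropped_lines(dev, main, merged)
print(violations)  # ['def route(x, input_ids=None):']
```

Like the bash version, this compares whole lines, so a dev-only line that was intentionally reworded in the merge would also be reported and needs a human look.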

minitu and others added 30 commits April 22, 2026 18:02

Fix nvtx_decorator to check _nvtx_enabled at call time (#4184)

9a3c927

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Xin Yao <xiny@nvidia.com>

fix merges_file typo in megatron_hf_tokenizer (#4392)

60f71e1

Co-authored-by: john2 <john2@jrlogin01.jureca>

Enable NullTokenizer for pretraining to reduce I/O access (#4057)

c9dfe34

docs: Add SECURITY.md (#4431)

7073492

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Mamba inference opt (#4414)

40627d0

Co-authored-by: root <root@nvl72098-T17.cm.cluster> Co-authored-by: William Dykas <wdykas@oci-hsg-cs-001-vscode-03.cm.cluster> Co-authored-by: root <root@nvl72160-T13.cm.cluster>

DDP refactoring: Extract parameter layout computation into optimizer …

55b8111

…classmethod (#3812) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Update PR template with explicit request for issue (#4409)

90e09b6

Misc inference fixes (#4397)

ab2b33d

Rename Mamba to Hybrid outside megatron/core (#4159)

60408d5

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Include mtp layers in token per expert logging (#4412)

a52014c

fix: NVRx async compatibility and defer resiliency import (#4420)

32275b2

Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>

ci: add base_sha to codecov/codecov-action upload step (#4445)

9bb35a8

Signed-off-by: oliver könig <okoenig@nvidia.com>

Update copy-pr-bot.yaml [skip ci]

3034d86

fix(checkpoint_inspector): allow empty --param-to-param-group-map-json (

f78ed05

#4403) Co-authored-by: Antoni-Joan Solergibert <asolergibert@nvidia.com>

Add the YARN support for hybrid_model (#4244)

4d6cdd5

Co-authored-by: Philip Petrakian <ppetrakian@nvidia.com>

[training migration] Add container class for config dataclasses (#4227)

41ffa83

Signed-off-by: Maanu Grover <maanug@nvidia.com> Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

Inference: Fix broken functional tests on gitlab (#4454)

a1165fa

SafeUnpickler class for safe pickle usage (#4319)

d4cacef

Signed-off-by: dimapihtar <dpykhtar@nvidia.com>

get rid of weights_only=False (#4434)

109feda

Signed-off-by: dimapihtar <dpykhtar@nvidia.com>

Inference | Per-block MoE routing storage for prefix caching (#4301)

64870c1

Co-authored-by: Siddharth Singh <sidsingh@nvidia.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add troubleshooting tip for 'access forbidden' (#4449)

017e684

Fix checkpoint loading with rerun state machine (#4448)

3d7bcd3

Add misc CUDA graph sugar to CudaGraphManager (#4425)

9b02206

Inference: Add the embedding and output layer in the full_iteration_i…

35f76df

…nference cuda graph scope for hybrid models (#4440)

Important bugfixes in local CG implementation that were leading to lo…

481efd0

…ss curve gaps for latent MoE models (#4433) Signed-off-by: root <jiemingz@nvidia.com>

fix: Replace polynomial rolling hash with SHA-256 for prefix caching (#…

e9abb6c

…4158) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat(ckpt): expose validate_access_integrity knob on dist-ckpt load (#…

377af02

…4422) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fix multivalidation (#3388)

241a5ca

Signed-off-by: rprenger <rprenger@nvidia.com>

Add missing knob for reduce_scatter_with_fp32_accumulation (#4410)

f2dcd42

Signed-off-by: qiyuw <qiyuw@nvidia.com> Co-authored-by: Antoni-Joan Solergibert <asolergibert@nvidia.com>

Enable CUDA graphs for MTP inference (#4260)

03f4111

Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>

copy-pr-bot Bot temporarily deployed to test May 7, 2026 03:07 Inactive

Phlip79 marked this pull request as ready for review May 7, 2026 07:02

Phlip79 requested review from a team as code owners May 7, 2026 07:02

svcnvidia-nemo-ci added the complexity: high label May 7, 2026

svcnvidia-nemo-ci closed this May 7, 2026

Phlip79 reopened this May 8, 2026

Merge remote-tracking branch 'origin/dev' into main2dev/06_05_2026

676f3fa

# Conflicts: # megatron/core/distributed/param_and_grad_buffer.py

copy-pr-bot Bot temporarily deployed to test May 8, 2026 05:42 Inactive

FDecaYed added 2 commits May 8, 2026 16:04

restore some missing changes post merge due to PR not merged to main …

d019432

…but in dev Signed-off-by: Deyu Fu <deyuf@nvidia.com>

Merge branch 'dev' into main2dev/06_05_2026

0cb4ec3

fix: correct misplaced colon in moe_layer.py inference guard

2207908

copy-pr-bot Bot temporarily deployed to test May 8, 2026 16:35 Inactive

svcnvidia-nemo-ci closed this May 8, 2026

Phlip79 reopened this May 8, 2026

svcnvidia-nemo-ci closed this May 9, 2026

svcnvidia-nemo-ci mentioned this pull request May 12, 2026

chore: nightly sync main into dev (12_05_2026) #4744

Closed

Phlip79 mentioned this pull request May 19, 2026

Add dev-feature preservation gate and change schedule #4773

Merged

svcnvidia-nemo-ci wants to merge 108 commits into dev from main2dev/06_05_2026

20 participants