chore: nightly sync main into dev (27_05_2026) by svcnvidia-nemo-ci · Pull Request #5029 · NVIDIA/Megatron-LM

svcnvidia-nemo-ci · 2026-05-27T22:24:13Z

Summary

Nightly sync of main into dev. Brings in 119 commits from main since the last sync.

Python-only line stats: +20029 / -3729 across 244 files.

Files where main's version was taken (in addition to skill list)

The dev-feature audit's strict line-by-line comparison flagged several files in which main intentionally rewrote code that dev still carried in an older form. Each was verified by tracing it back to a specific main-only commit:

| Files | Main commit(s) |
| --- | --- |
| `examples/post_training/modelopt/.py`, `.md`, `.sh`, `megatron/post_training/.py` | `14aaa7e0e` Modernize post-training modelopt example scripts (#4807) |
| `megatron/core/resharding/**`, `tests/unit_tests/resharding/test_planner.py` | `20bf831da` refit clean up and refactoring (#4762) |
| `megatron/core/transformer/moe/experts.py` (`delay_wgrad_compute` / `_make_fused_ops` refactor), `megatron/core/transformer/moe/moe_layer.py` (`InferenceMode.is_active()`), `megatron/core/transformer/transformer_layer.py` (`as_mlp_submodule` / `MlpBuilder`) | TE op-fuser / mlp-builder refactors on main |
| `megatron/core/extensions/transformer_engine.py` (`TEFusedMLP.as_mlp_submodule`, `_normalize_grouped_parameter_keys` class method, `TEFusedMLPWithGroupedLinear`), `megatron/core/models/gpt/gpt_layer_specs.py` (rename `dense_grouped_gemm` → `use_grouped_gemm_for_dense_mlp` and switch to `as_mlp_submodule`), `megatron/core/transformer/transformer_config.py` rename, `tests/unit_tests/transformer/test_te_fused_mlp_with_grouped_linear_spec.py` (new, replaces deleted `test_te_fused_dense_mlp_spec.py`) | `fa7a23bad` Add TEFusedDenseMLP for Dense+Grouped GEMM fusion on SM100+ (#4318) (#4786) |
| `megatron/core/transformer/transformer_block.py` (moved `_checkpointed_forward` to `megatron/core/recompute.py:checkpointed_forward`), addition of `megatron/core/recompute.py` | CUDA-graph / recompute refactor on main |
| `megatron/core/optimizer/optimizer.py` (extracted deferred MXFP8 param-sync into helper methods), `megatron/core/optimizer/distrib_optimizer.py` (`start_param_sync_for_bucket_group_subset`, FP8/FP4 path split) | LayerWise distributed optimizer / MXFP8 refactor on main |
| `megatron/training/argument_utils.py` (`hybrid_config_from_args`), `megatron/training/training.py`, `megatron/training/datasets/data_samplers.py`, `megatron/training/initialize.py`, `megatron/training/utils/common_utils.py`, `megatron/core/optimizer/layer_wise_optimizer.py` | Skill-list overrides (existing convention) |
| `tests/test_utils/recipes/h100/gpt.yaml` (new `gpt3_mcore_te_tp2_pp1_gdn_no_nvrx_*` cases alongside dev's `mhc` case), `tests/test_utils/recipes/moe2.0.yaml` (kept dev's `pretrain_gpt.py` script path) | Test-recipe refactors on main |

Dev-only additions preserved across the merge

HyperConnectionTransformerLayer + mHC config block (enable_hyper_connections, num_residual_streams, mhc_*, use_fused_mhc, mhc_recompute_layer_num) in transformer_config.py / transformer_layer.py / gpt_layer_specs.py / experimental_attention_variant_module_specs.py.
MoEOverloadFactorTracker, record_dispatch_token_counts, and RecordDispatchTokenCountsFunction in megatron/core/transformer/moe/moe_logging.py and moe_utils.py.
self._maybe_record_overload_factor(...) invocation in moe_layer.py (paired with main's new InferenceMode.is_active() inference-dispatcher selection).
LinearCrossEntropyModule import + usage in gpt_model.py.
mhc_multistream, mhc_enabled, e_proj / h_proj paths in multi_token_prediction.py (combined with main's padding_mask plumbing through _get_embeddings).
cp_group swap in MTPBlock.forward driven by packed_seq_params.cp_group.
input_ids plumbing through MoE forward stack (router.py, moe_layer.py, transformer_layer.py, and the recompute path via a new input_ids kwarg added to megatron/core/recompute.py:checkpointed_forward).
THD sequence-packing padding in HybridEPTokenDispatcher.setup_metadata (_original_num_tokens, _padded_num_tokens, group-wide max via all-reduce).
Full 6-param init_chunk_handler(pp_rank, vp_size, vp_stage, min_offloaded_tensor_size, delta_offload_bytes_across_pp_ranks, activation_offload_fraction, ...) signature (and matching callers in gpt_model.py / hybrid_model.py).
CheckpointManager mHC recompute plan inside HybridStack.forward (_build_mhc_recompute_layer_plan, _finalize_mhc_recompute_layer).
delay_offload_until_cuda_graph, delta_offload_bytes_across_pp_ranks, activation_offload_fraction config fields.
paged_stash_init_chunk_handler import in gpt_model.py.
--dynamic-context-parallel / --min-dynamic-context-parallel-size plumbing across arguments.py, initialize.py, data_samplers.py, plus rename of args.hybrid_context_parallel → args.dynamic_context_parallel in training.py.
Dev's hybrid-attention _resolve_cu_seqlens test extended with main's cp_size divisibility check.
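
The THD sequence-packing padding item above can be illustrated with a minimal, single-process sketch. It assumes that padding means each rank extends its local token count to the group-wide maximum; a real dispatcher would obtain that maximum with an all-reduce (MAX) across the group, which is simulated here with Python's `max`. `PaddingMetadata` and `compute_padding` are hypothetical names, not Megatron-LM APIs. Only the field names `_original_num_tokens` and `_padded_num_tokens` come from the description, and their exact semantics are assumed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PaddingMetadata:
    """Hypothetical container mirroring the two fields named in the PR."""

    original_num_tokens: int
    padded_num_tokens: int

    @property
    def num_pad_tokens(self) -> int:
        return self.padded_num_tokens - self.original_num_tokens


def compute_padding(tokens_per_rank: List[int]) -> List[PaddingMetadata]:
    """Pad every rank to the group-wide maximum token count.

    In a real job each rank only knows its own count and would obtain the
    maximum via an all-reduce with a MAX reduction. Here the whole group is
    visible, so a plain max() stands in for that collective.
    """
    if not tokens_per_rank:
        raise ValueError("tokens_per_rank must not be empty")
    group_max = max(tokens_per_rank)  # stand-in for all-reduce(MAX)
    return [PaddingMetadata(n, group_max) for n in tokens_per_rank]


metadata = compute_padding([1000, 1024, 960, 1024])
for rank, m in enumerate(metadata):
    print(f"rank {rank}: {m.original_num_tokens} -> {m.padded_num_tokens} (+{m.num_pad_tokens})")
```

Padding to a shared size keeps every rank's buffers the same shape, at the cost of some wasted compute on the ranks that hold fewer real tokens.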

Files restored / handled specially

tests/functional_tests/test_cases/gpt/gpt_dynamic_inference_tp1_pp1_dp8_583m_throughputtest_zmq/ removed (deleted on main; dev had local modifications to its golden values).
tests/test_utils/recipes/h100/t5.yaml removed (deleted on dev; main had modifications).
tests/unit_tests/transformer/test_te_fused_dense_mlp_spec.py removed and replaced by main's test_te_fused_mlp_with_grouped_linear_spec.py (renamed class).
megatron/post_training/generate.py reverted to main's modernized version (replaces dev's simple_generate with simple_speculative_generate from main).
Kept dev's pyproject.toml, uv.lock, docker/Dockerfile.ci.dev, and .github/CODEOWNERS verbatim per skill convention.

API mismatches resolved

attention.py / multi_latent_attention.py: switched to main's apply_module(self.linear_proj) + off_interface.group_commit(...) pattern; added a static group_commit method to the merged-tree's FineGrainedActivationOffloadingInterface so both dev's instance group_offload and main's static group_commit callers work.
gated_delta_net.py: took main's _resolve_cu_seqlens(..., cp_size=...) signature including the divisibility check, since callers now pass cp_size=self.cp_size.
transformer_layer.py: switched to main's as_mlp_submodule / submodules.mlp(...) construction pattern; kept dev's **moe_kwargs (including input_ids) plumbing through _forward_mlp / _forward_mlp_router.
gpt_layer_specs.py get_mlp_module_spec_for_backend: combined dev's hyper-connection / MTP plumbing with main's rename + not_none(TEFusedMLPWithGroupedLinear) dispatch.
recompute.py:checkpointed_forward: added input_ids kwarg (and popped it for non-TransformerLayer layers) so the new free function still supports dev's hash-based MoE routing through the full-recompute path.
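
The last item can be sketched as follows. This is a simplified illustration, not the actual `megatron/core/recompute.py` code: `TransformerLayer`, `OtherLayer`, and this `checkpointed_forward` signature are stand-ins defined only for the example, and the activation checkpointing itself is omitted. It shows only the kwarg handling, where `input_ids` is forwarded to `TransformerLayer` instances and dropped for every other layer type.

```python
from typing import Any, List, Optional


class TransformerLayer:
    """Stand-in for a layer whose forward accepts input_ids (assumption)."""

    def __call__(self, hidden: List[int], input_ids: Optional[List[int]] = None) -> List[int]:
        # Pretend routing depends on input_ids: add them when present.
        if input_ids is None:
            return hidden
        return [h + i for h, i in zip(hidden, input_ids)]


class OtherLayer:
    """Stand-in for a layer whose forward does NOT accept input_ids."""

    def __call__(self, hidden: List[int]) -> List[int]:
        return [h * 2 for h in hidden]


def checkpointed_forward(layers: List[Any], hidden: List[int], **kwargs: Any) -> List[int]:
    """Run layers in order, passing input_ids only where it is supported."""
    for layer in layers:
        layer_kwargs = dict(kwargs)  # copy so the caller's kwargs stay intact
        if not isinstance(layer, TransformerLayer):
            # Drop the kwarg for layers that would reject it.
            layer_kwargs.pop("input_ids", None)
        hidden = layer(hidden, **layer_kwargs)
    return hidden


out = checkpointed_forward([TransformerLayer(), OtherLayer()], [1, 2], input_ids=[10, 20])
print(out)  # [22, 44]
```

Popping the kwarg per layer, rather than once up front, lets a mixed stack keep passing `input_ids` to the layers that need it.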

How this diff was produced

Conflict resolution followed the nightly-sync skill: started from origin/dev, ran git merge origin/main --no-edit, resolved ~67 conflicting paths surgically (preserving dev-only features unless I could identify a specific main commit removing them), ran black --config pyproject.toml + isort over the 244 changed Python files, kept dev's uv.lock / pyproject.toml / docker/Dockerfile.ci.dev / .github/CODEOWNERS verbatim, and audited the result with the pre-push hook before pushing.
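
The steps above can be sketched as a command plan. `build_sync_plan` and the branch name `sync-branch` are hypothetical, and the script only builds the command list without running anything. Restoring dev's copies with `git checkout origin/dev -- <path>` is an assumption about how the "kept verbatim" step could be done, and invoking `isort` with no extra options is likewise assumed; the skill's actual mechanism is not described here.

```python
from typing import List, Sequence

# Files the PR description says are kept from dev verbatim.
KEEP_FROM_DEV = (
    "uv.lock",
    "pyproject.toml",
    "docker/Dockerfile.ci.dev",
    ".github/CODEOWNERS",
)


def build_sync_plan(changed_py_files: Sequence[str]) -> List[List[str]]:
    """Return the commands for a main -> dev sync, in order, without running them."""
    plan: List[List[str]] = [
        ["git", "checkout", "-B", "sync-branch", "origin/dev"],  # branch name is hypothetical
        ["git", "merge", "origin/main", "--no-edit"],
    ]
    # Conflicts are resolved by hand at this point (not modeled here).
    # Assumed mechanism for "kept verbatim": restore dev's copy of each file.
    plan += [["git", "checkout", "origin/dev", "--", path] for path in KEEP_FROM_DEV]
    if changed_py_files:
        plan.append(["black", "--config", "pyproject.toml", *changed_py_files])
        plan.append(["isort", *changed_py_files])
    return plan


for cmd in build_sync_plan(["megatron/core/recompute.py"]):
    print(" ".join(cmd))
```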

CI status / external failures

Latest full CI for 32229e7353b6e74ed80bd939dba902d3f9f9f91d:

Megatron-LM CI run https://github.com/NVIDIA/Megatron-LM/actions/runs/26861405973 completed with Nemo_CICD_Test passing. Install, docs, wheels, lint, unit, and functional/integration jobs passed, including the previously failing DCP, dynamic-inference, FSDP DTensor, DSv4, and scoped-cudagraph cases.
codecov/patch is red at 70.81% vs the 80.00% patch target. Coverage contexts are exempt in the nightly-sync gate, and Coverage (unit-test) passed.
cicd-mbridge-testing failed in downstream Megatron-Bridge run https://github.com/NVIDIA-NeMo/Megatron-Bridge/actions/runs/26861493475. The failure is during MBridge unit-test collection, before Megatron-LM code executes, with pydantic_core.ValidationError: RunnablePassthrough.name Field required from langchain_core. MBridge main at b236c13319656d7c518ee57a7b87c671e1d069ea resolves pydantic 2.14.0a1 / pydantic-core 2.47.0.
This MBridge check is also red on recent merged dev PRs while the Megatron-LM gate passes, for example [Dev] Skip identity alltoall chunk sort #5102 (https://github.com/NVIDIA/Megatron-LM/actions/runs/26810627892), chore: Update Docker image version to 26.04-py3 on dev #5051 (https://github.com/NVIDIA/Megatron-LM/actions/runs/26641021478), and [dev] [fix] [DeepSeek-v4] fix dense loss and rope type in DSv4 Hybrid Attention #5018 (https://github.com/NVIDIA/Megatron-LM/actions/runs/26575319597).
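
The gate behaviour described above, where coverage contexts are exempt, can be sketched like this. `gate_passes`, the `EXEMPT_PREFIXES` tuple, and the shape of the status dictionary are assumptions for illustration; the actual nightly-sync gate is not shown in this PR. Whether `cicd-mbridge-testing` is a required check is also not stated, so the example evaluates it both ways.

```python
from typing import Dict, Iterable

# Assumed: contexts starting with these prefixes never block the gate.
EXEMPT_PREFIXES = ("codecov/",)


def gate_passes(statuses: Dict[str, str], required: Iterable[str]) -> bool:
    """Return True when every required, non-exempt context reports 'success'."""
    for context in required:
        if context.startswith(EXEMPT_PREFIXES):
            continue  # exempt contexts are ignored even when red
        if statuses.get(context) != "success":
            return False
    return True


statuses = {
    "Nemo_CICD_Test": "success",
    "codecov/patch": "failure",         # 70.81% vs the 80.00% target
    "cicd-mbridge-testing": "failure",  # downstream Megatron-Bridge failure
}

print(gate_passes(statuses, ["Nemo_CICD_Test", "codecov/patch"]))  # True
print(gate_passes(statuses, ["Nemo_CICD_Test", "cicd-mbridge-testing"]))  # False
```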

🤖 Generated with Claude Code

Phlip79 · 2026-06-02T21:01:41Z

/ok to test ce959b5

Phlip79 · 2026-06-02T23:57:03Z

/ok to test d6aa731

Phlip79 · 2026-06-02T23:59:38Z

/ok to test 16fda8d

Phlip79 · 2026-06-03T01:31:52Z

/ok to test 68c9447

Phlip79 · 2026-06-03T02:07:04Z

/ok to test d1a0bca

Phlip79 · 2026-06-03T03:12:35Z

/ok to test 32229e7

Phlip79 · 2026-06-03T04:50:00Z

CI update for latest SHA 32229e7353b6e74ed80bd939dba902d3f9f9f91d:

Nemo_CICD_Test passed on run https://github.com/NVIDIA/Megatron-LM/actions/runs/26861405973.
Install, docs, wheels, linting, unit tests, and MCore functional tests passed.
The previously failing DCP/dynamic-inference/FSDP/DSv4 jobs are now green.
cicd-mbridge-testing failed, but the failure is downstream in Megatron-Bridge dependency resolution: MBridge is resolving pydantic 2.14.0a1/pydantic-core 2.47.0, and unit-test collection fails with pydantic_core.ValidationError: RunnablePassthrough.name Field required from langchain_core.
This is not a Megatron-LM code traceback. Recent merged dev PRs also have cicd-mbridge-testing red while Nemo_CICD_Test passes, e.g. PR [Dev] Skip identity alltoall chunk sort #5102 run https://github.com/NVIDIA/Megatron-LM/actions/runs/26810627892 and PR chore: Update Docker image version to 26.04-py3 on dev #5051 run https://github.com/NVIDIA/Megatron-LM/actions/runs/26641021478.
codecov/patch is red at 70.81% vs target 80.00%; the nightly-sync CI gate treats codecov/coverage as exempt, and the MCore coverage job itself passed.

No further Megatron-LM code failure is currently visible in the latest completed CI run.

Phlip79 · 2026-06-03T05:00:56Z

/ok to test fb257b4

Phlip79 · 2026-06-03T08:16:58Z

/ok to test ec339b425a4f7bfae395cae5393a30ec56004368

copy-pr-bot · 2026-06-03T08:17:02Z

/ok to test ec339b425a4f7bfae395cae5393a30ec56004368

@Phlip79, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

Phlip79 · 2026-06-03T08:17:04Z

/ok to test ec339b4

Phlip79 · 2026-06-03T08:20:31Z

/ok to test 7ebdee6

Phlip79 · 2026-06-03T10:13:13Z

/ok to test a4a4457

Phlip79 · 2026-06-03T11:17:52Z

/ok to test e8356b0

Phlip79 · 2026-06-03T12:25:45Z

/ok to test 40f39e5

Phlip79 · 2026-06-03T14:08:20Z

/ok to test 8d9d7ac

Phlip79 · 2026-06-04T15:13:13Z

/ok to test 9f50973

Phlip79 · 2026-06-04T16:21:29Z

/ok to test b87cdc0

FDecaYed · 2026-06-05T09:07:44Z

Added a couple of changes:

Fixed a merge issue with linear_proj offloading in attention / multi_latent_attention.
Cherry-picked a change needed for DSv4, fix(layer_wise): tag MTP-stage word_embeddings as is_embedding_or_output_parameter #5034, which just missed the pull window.
Cherry-picked [Dev] Add diagnostic warnings to TEGroupedMLP fused impl checks #4269, which was dropped by the merge.

FDecaYed · 2026-06-05T09:08:05Z

/ok to test e4ce64d

FDecaYed · 2026-06-05T11:22:02Z

/ok to test 9f8c466

Phlip79 · 2026-06-05T20:49:12Z

/ok to test 954ab1c

svcnvidia-nemo-ci · 2026-06-06T16:00:53Z

Superseded by today's nightly sync.

nschank and others added 30 commits May 10, 2026 20:59

Create a Protocol for the MLP layer of TransformerLayer (#3435)

5e31514

Co-authored-by: Antoni-Joan Solergibert <asolergibert@nvidia.com>

Revert "Add Python-side guardrail for HybridEP InfiniBand limit and r…

a2ec5c1

…ename seq_len (#4094)" (#4718) Signed-off-by: oliver könig <okoenig@nvidia.com>

chore(beep boop 🤖): Bump (main) (2026-05-11)

e93755e

Add Python-side guardrail for DeepEP IB limits (#4719)

ad58411

ci: revert bad uv.lock bump and label future bumps with `Run function…

5123f6a

…al tests` (#4730) Signed-off-by: oliver könig <okoenig@nvidia.com>

[ci] fix: treat cancelled run-main-script step as failure (#4727)

33d47e0

Signed-off-by: oliver könig <okoenig@nvidia.com>

ci: Major refactor of release-workflows (#4602)

e42e2fa

Signed-off-by: oliver könig <okoenig@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

build(deps): bump nvidia-modelopt to 0.43 (#4723)

434368c

Signed-off-by: oliver könig <okoenig@nvidia.com>

fix(fsdp): recognize legacy GDN TP metadata (#4664)

74687fe

Signed-off-by: Shivanjan Chakravorty <shivanjanc@nvidia.com> Co-authored-by: Cory Ye <44509866+cspades@users.noreply.github.com>

Fixes for Nemotron3 Super release test config (#4544)

9718f7d

Signed-off-by: Maanu Grover <maanug@nvidia.com> Signed-off-by: oliver könig <okoenig@nvidia.com> Co-authored-by: oliver könig <okoenig@nvidia.com>

feat(gpt): add output postprocess hook (#4686)

97f3bce

Signed-off-by: Shivanjan Chakravorty <shivanjanc@nvidia.com>

Add bump-base-image skill and update golden value comparison (#4733)

f744215

Signed-off-by: Ajay Balasa <abalasa@nvidia.com> Co-authored-by: Cursor <cursoragent@cursor.com>

Guard omegaconf imports (#4685)

6486d52

Signed-off-by: Maanu Grover <maanug@nvidia.com>

Fix a regression introduced by #4625 for nightly runs (#4734)

7d24b28

Support transfomers 5.x.x for text generation server (#4732)

86bf476

Update transformer-engine dependency to version 2.15.0 (#4682)

815c83d

Increase CG cover from max_requests to max_tokens (#4214)

72dd053

Co-authored-by: Siddharth Singh <sidsingh@nvidia.com>

chore: rotate oncall schedule

f8c942b

fully remove legacy code (#4759)

10b514b

Signed-off-by: dimapihtar <dpykhtar@nvidia.com>

fix legacy torch save when tensor_model_parallel_size > expert_model_…

fc41581

…parallel_size * expert_tensor_parallel_size (#4678) Signed-off-by: dimapihtar <dpykhtar@nvidia.com>

Wire --rl-inference-parsers into MRL (#4768)

d802f09

Co-authored-by: Jorge Albericio <jalbericiola@nvidia.com>

Integrate LayerWiseDistributedOptimizer with DDP buffer infrastructure (

c1e938b

#4509) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

[training migration] Migrate mamba builder (#4550)

0dc36df

Signed-off-by: Maanu Grover <maanug@nvidia.com>

NCCL UB fix: reduce memory cost and correctly deregister NCCL mem pool (

e35d4e5

#4492) Signed-off-by: Xiaowei Ren <xren@nvidia.com>

fix: use no_mask in local ViT layer spec (#4395)

1ba0aa9

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

refit clean up and refactoring (#4762)

20bf831

Co-authored-by: William Dykas <wdykas@oci-hsg-cs-001-vscode-03.cm.cluster>

Make weight and optimizer memory estimation take into account expert …

a7c9e8c

…parallelism correctly (#4687)

Support recomputing in HybridModel (#4496)

118933a

One single flag that determines if we are in inference (#4617)

925422c

fix: post-CI corrections

8d9d7ac

Merge remote-tracking branch 'origin/dev' into main2dev/27_05_2026

9f50973

# Conflicts: # megatron/training/training.py # tests/unit_tests/transformer/test_multi_token_prediction.py

fix: align MTP test helper return signature

b87cdc0

Wohox mentioned this pull request Jun 5, 2026

[Dev] fix(layer_wise): tag MTP-stage word_embeddings as is_embedding_or_output_parameter #5180

Closed

Pick in some changes dropped due to merge

e4ce64d

Signed-off-by: Deyu Fu <deyuf@nvidia.com>

Fix cherry-pick errors

9f8c466

fix: restore MoE activation offload manager

954ab1c


20 participants
