chore: nightly sync main into dev (27_05_2026) #5029

Closed
svcnvidia-nemo-ci wants to merge 126 commits into dev from main2dev/27_05_2026

svcnvidia-nemo-ci commented May 27, 2026

Summary

Nightly sync of main into dev. Brings in 119 commits from main since the last sync.

Python-only line stats: +20029 / -3729 across 244 files.

Files where main's version was taken (in addition to skill list)

The dev-feature audit's strict line-by-line comparison flagged a number of files where main intentionally rewrote the code that dev's older versions had. Each was verified by tracing back to a specific main-only commit:

| Files | Main commit(s) |
| --- | --- |
| `examples/post_training/modelopt/*.py`, `*.md`, `*.sh`, `megatron/post_training/*.py` | 14aaa7e0e Modernize post-training modelopt example scripts (#4807) |
| `megatron/core/resharding/**`, `tests/unit_tests/resharding/test_planner.py` | 20bf831da refit clean up and refactoring (#4762) |
| `megatron/core/transformer/moe/experts.py` (delay_wgrad_compute / _make_fused_ops refactor), `megatron/core/transformer/moe/moe_layer.py` (InferenceMode.is_active()), `megatron/core/transformer/transformer_layer.py` (as_mlp_submodule / MlpBuilder) | TE op-fuser / mlp-builder refactors on main |
| `megatron/core/extensions/transformer_engine.py` (TEFusedMLP.as_mlp_submodule, _normalize_grouped_parameter_keys class method, TEFusedMLPWithGroupedLinear), `megatron/core/models/gpt/gpt_layer_specs.py` (rename dense_grouped_gemm to use_grouped_gemm_for_dense_mlp and switch to as_mlp_submodule), `megatron/core/transformer/transformer_config.py` rename, `tests/unit_tests/transformer/test_te_fused_mlp_with_grouped_linear_spec.py` (new, replaces deleted test_te_fused_dense_mlp_spec.py) | fa7a23bad Add TEFusedDenseMLP for Dense+Grouped GEMM fusion on SM100+ (#4318) (#4786) |
| `megatron/core/transformer/transformer_block.py` (moved _checkpointed_forward to megatron/core/recompute.py:checkpointed_forward), addition of `megatron/core/recompute.py` | CUDA-graph / recompute refactor on main |
| `megatron/core/optimizer/optimizer.py` (extracted deferred MXFP8 param-sync into helper methods), `megatron/core/optimizer/distrib_optimizer.py` (start_param_sync_for_bucket_group_subset, FP8/FP4 path split) | LayerWise distributed optimizer / MXFP8 refactor on main |
| `megatron/training/argument_utils.py` (hybrid_config_from_args), `megatron/training/training.py`, `megatron/training/datasets/data_samplers.py`, `megatron/training/initialize.py`, `megatron/training/utils/common_utils.py`, `megatron/core/optimizer/layer_wise_optimizer.py` | Skill-list overrides (existing convention) |
| `tests/test_utils/recipes/h100/gpt.yaml` (new gpt3_mcore_te_tp2_pp1_gdn_no_nvrx_* cases alongside dev's mhc case), `tests/test_utils/recipes/moe2.0.yaml` (kept dev's pretrain_gpt.py script path) | Test-recipe refactors on main |
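
To illustrate why a strict line-by-line audit flags intentional rewrites, here is a minimal sketch of such a comparison. The function name `lines_missing_from_merge` and the approach are assumptions for illustration only; the actual dev-feature audit tool is not shown in this PR.

```python
def lines_missing_from_merge(dev_text: str, merged_text: str) -> list[str]:
    """Return non-blank dev lines that appear nowhere in the merged file.

    Hypothetical helper: a strict comparison like this cannot tell an
    accidental drop from an intentional rewrite on main, so every hit
    still has to be traced back to a commit by hand.
    """
    merged_lines = {line.strip() for line in merged_text.splitlines()}
    return [
        line.strip()
        for line in dev_text.splitlines()
        if line.strip() and line.strip() not in merged_lines
    ]


# Toy inputs: main renamed a helper that dev's older version still calls.
dev = "def f(x):\n    return _checkpointed_forward(x)\n"
merged = "def f(x):\n    return checkpointed_forward(x)\n"
flagged = lines_missing_from_merge(dev, merged)
print(flagged)
```

In this toy case the renamed call is flagged even though nothing was lost, which is the kind of hit the table above resolves by pointing at the responsible main commit.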

Dev-only additions preserved across the merge

  • HyperConnectionTransformerLayer + mHC config block (enable_hyper_connections, num_residual_streams, mhc_*, use_fused_mhc, mhc_recompute_layer_num) in transformer_config.py / transformer_layer.py / gpt_layer_specs.py / experimental_attention_variant_module_specs.py.
  • MoEOverloadFactorTracker, record_dispatch_token_counts, and RecordDispatchTokenCountsFunction in megatron/core/transformer/moe/moe_logging.py and moe_utils.py.
  • self._maybe_record_overload_factor(...) invocation in moe_layer.py (paired with main's new InferenceMode.is_active() inference-dispatcher selection).
  • LinearCrossEntropyModule import + usage in gpt_model.py.
  • mhc_multistream, mhc_enabled, e_proj / h_proj paths in multi_token_prediction.py (combined with main's padding_mask plumbing through _get_embeddings).
  • cp_group swap in MTPBlock.forward driven by packed_seq_params.cp_group.
  • input_ids plumbing through MoE forward stack (router.py, moe_layer.py, transformer_layer.py, and the recompute path via a new input_ids kwarg added to megatron/core/recompute.py:checkpointed_forward).
  • THD sequence-packing padding in HybridEPTokenDispatcher.setup_metadata (_original_num_tokens, _padded_num_tokens, group-wide max via all-reduce).
  • Full 6-param init_chunk_handler(pp_rank, vp_size, vp_stage, min_offloaded_tensor_size, delta_offload_bytes_across_pp_ranks, activation_offload_fraction, ...) signature (and matching callers in gpt_model.py / hybrid_model.py).
  • CheckpointManager mHC recompute plan inside HybridStack.forward (_build_mhc_recompute_layer_plan, _finalize_mhc_recompute_layer).
  • delay_offload_until_cuda_graph, delta_offload_bytes_across_pp_ranks, activation_offload_fraction config fields.
  • paged_stash_init_chunk_handler import in gpt_model.py.
  • --dynamic-context-parallel / --min-dynamic-context-parallel-size plumbing across arguments.py, initialize.py, data_samplers.py, plus rename of args.hybrid_context_parallel to args.dynamic_context_parallel in training.py.
  • Dev's hybrid-attention _resolve_cu_seqlens test extended with main's cp_size divisibility check.
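
The THD sequence-packing item above mentions padding to a group-wide maximum obtained via all-reduce. A minimal single-process sketch of that idea follows. The function `pad_to_group_max` and the dictionary keys are hypothetical, and the all-reduce is simulated with `max` over a list; in distributed code the maximum would typically come from an all-reduce with a MAX reduction. How `HybridEPTokenDispatcher.setup_metadata` actually applies the padding is not shown in this PR.

```python
def pad_to_group_max(tokens_per_rank: list[int]) -> list[dict]:
    """Compute per-rank padding so every rank ends up with the same token count.

    tokens_per_rank[i] is the number of real tokens on rank i (assumed input).
    """
    # Stand-in for an all-reduce(MAX) across the group.
    group_max = max(tokens_per_rank)
    return [
        {
            "original_num_tokens": n,
            "padded_num_tokens": group_max,
            "num_pad_tokens": group_max - n,
        }
        for n in tokens_per_rank
    ]


plan = pad_to_group_max([5, 8, 3])
for rank, entry in enumerate(plan):
    print(rank, entry)
```

Keeping the original count next to the padded count lets the padding be stripped again after the dispatch step.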

Files restored / handled specially

  • tests/functional_tests/test_cases/gpt/gpt_dynamic_inference_tp1_pp1_dp8_583m_throughputtest_zmq/ removed (deleted on main; dev had local modifications to its golden values).
  • tests/test_utils/recipes/h100/t5.yaml removed (deleted on dev; main had modifications).
  • tests/unit_tests/transformer/test_te_fused_dense_mlp_spec.py removed and replaced by main's test_te_fused_mlp_with_grouped_linear_spec.py (renamed class).
  • megatron/post_training/generate.py reverted to main's modernized version (replaces dev's simple_generate with simple_speculative_generate from main).
  • Kept dev's pyproject.toml, uv.lock, docker/Dockerfile.ci.dev, and .github/CODEOWNERS verbatim per skill convention.

API mismatches resolved

  • attention.py / multi_latent_attention.py: switched to main's apply_module(self.linear_proj) + off_interface.group_commit(...) pattern; added a static group_commit method to the merged tree's FineGrainedActivationOffloadingInterface so that both dev's instance-method group_offload callers and main's static group_commit callers work.
  • gated_delta_net.py: took main's _resolve_cu_seqlens(..., cp_size=...) signature including the divisibility check, since callers now pass cp_size=self.cp_size.
  • transformer_layer.py: switched to main's as_mlp_submodule / submodules.mlp(...) construction pattern; kept dev's **moe_kwargs (including input_ids) plumbing through _forward_mlp / _forward_mlp_router.
  • gpt_layer_specs.py get_mlp_module_spec_for_backend: combined dev's hyper-connection / MTP plumbing with main's rename + not_none(TEFusedMLPWithGroupedLinear) dispatch.
  • recompute.py:checkpointed_forward: added input_ids kwarg (and popped it for non-TransformerLayer layers) so the new free function still supports dev's hash-based MoE routing through the full-recompute path.
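
The last item, forwarding an input_ids kwarg but popping it for layers that do not accept it, can be sketched as follows. The classes and the `run_layers` function are hypothetical stand-ins, not the real `checkpointed_forward` or `TransformerLayer`; only the pop-for-other-layers pattern is taken from the description above.

```python
class TransformerLayerSketch:
    """Hypothetical layer that accepts input_ids."""

    def __call__(self, hidden_states, input_ids=None):
        # Toy behaviour: add 1 when input_ids was forwarded.
        return hidden_states + (1 if input_ids is not None else 0)


class OtherLayerSketch:
    """Hypothetical layer whose forward does not take input_ids."""

    def __call__(self, hidden_states):
        return hidden_states * 2


def run_layers(layers, hidden_states, **kwargs):
    """Call each layer in order, dropping input_ids where it is unsupported."""
    for layer in layers:
        layer_kwargs = dict(kwargs)  # copy so later layers still see input_ids
        if not isinstance(layer, TransformerLayerSketch):
            layer_kwargs.pop("input_ids", None)
        hidden_states = layer(hidden_states, **layer_kwargs)
    return hidden_states


result = run_layers([TransformerLayerSketch(), OtherLayerSketch()], 1, input_ids=[7, 8])
print(result)
```

Copying the kwargs per layer, rather than popping from the shared dict, keeps input_ids available to any transformer layer that comes after a layer that does not take it.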

How this diff was produced

Conflict resolution followed the nightly-sync skill: started from origin/dev, ran git merge origin/main --no-edit, resolved ~67 conflicting paths surgically (preserving dev-only features unless I could identify a specific main commit removing them), ran black --config pyproject.toml + isort over the 244 changed Python files, kept dev's uv.lock / pyproject.toml / docker/Dockerfile.ci.dev / .github/CODEOWNERS verbatim, and audited the result with the pre-push hook before pushing.
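
As a rough sketch of the sequence described above, the helper below only builds the command lines and runs nothing. The function name, the use of `git checkout -b` to start from origin/dev, and the use of `git checkout origin/dev -- <paths>` to keep dev's files are assumptions; the merge, black, and isort invocations are the ones quoted in the description. Manual conflict resolution and the pre-push audit are not represented.

```python
# Files the description says were kept verbatim from dev.
KEEP_FROM_DEV = [
    "pyproject.toml",
    "uv.lock",
    "docker/Dockerfile.ci.dev",
    ".github/CODEOWNERS",
]


def nightly_sync_commands(branch: str, changed_python_files: list[str]) -> list[list[str]]:
    """Return the argv lists for a sync run (hypothetical helper, nothing is executed)."""
    return [
        # Assumption: start a sync branch from origin/dev.
        ["git", "checkout", "-b", branch, "origin/dev"],
        # From the description: merge main without editing the message.
        ["git", "merge", "origin/main", "--no-edit"],
        # Assumption: restore dev's copies of the protected files.
        ["git", "checkout", "origin/dev", "--", *KEEP_FROM_DEV],
        # From the description: format the changed Python files.
        ["black", "--config", "pyproject.toml", *changed_python_files],
        ["isort", *changed_python_files],
    ]


cmds = nightly_sync_commands("main2dev/27_05_2026", ["megatron/training/training.py"])
for argv in cmds:
    print(" ".join(argv))
```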

CI status / external failures

Latest full CI for 32229e7353b6e74ed80bd939dba902d3f9f9f91d:

🤖 Generated with Claude Code

nschank and others added 30 commits May 10, 2026 20:59
Phlip79 (Member) commented Jun 2, 2026

/ok to test ce959b5

Phlip79 (Member) commented Jun 2, 2026

/ok to test d6aa731

Phlip79 (Member) commented Jun 2, 2026

/ok to test 16fda8d

Phlip79 (Member) commented Jun 3, 2026

/ok to test 68c9447

Phlip79 (Member) commented Jun 3, 2026

/ok to test d1a0bca

Phlip79 (Member) commented Jun 3, 2026

/ok to test 32229e7

Phlip79 (Member) commented Jun 3, 2026

CI update for latest SHA 32229e7353b6e74ed80bd939dba902d3f9f9f91d:

No further Megatron-LM code failure is currently visible in the latest completed CI run.

Phlip79 (Member) commented Jun 3, 2026

/ok to test fb257b4

Phlip79 (Member) commented Jun 3, 2026

/ok to test ec339b425a4f7bfae395cae5393a30ec56004368

copy-pr-bot (Bot) commented Jun 3, 2026

> /ok to test ec339b425a4f7bfae395cae5393a30ec56004368

@Phlip79, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

Phlip79 (Member) commented Jun 3, 2026

/ok to test ec339b4

Phlip79 (Member) commented Jun 3, 2026

/ok to test 7ebdee6

Phlip79 (Member) commented Jun 3, 2026

/ok to test a4a4457

Phlip79 (Member) commented Jun 3, 2026

/ok to test e8356b0

Phlip79 (Member) commented Jun 3, 2026

/ok to test 40f39e5

Phlip79 (Member) commented Jun 3, 2026

/ok to test 8d9d7ac

# Conflicts:
#	megatron/training/training.py
#	tests/unit_tests/transformer/test_multi_token_prediction.py
Phlip79 (Member) commented Jun 4, 2026

/ok to test 9f50973

Phlip79 (Member) commented Jun 4, 2026

/ok to test b87cdc0

Signed-off-by: Deyu Fu <deyuf@nvidia.com>
FDecaYed (Contributor) commented Jun 5, 2026

Added a couple of changes:

FDecaYed (Contributor) commented Jun 5, 2026

/ok to test e4ce64d

FDecaYed (Contributor) commented Jun 5, 2026

/ok to test 9f8c466

Phlip79 (Member) commented Jun 5, 2026

/ok to test 954ab1c

svcnvidia-nemo-ci (Author) commented

Superseded by today's nightly sync.

Labels

complexity: high, Run functional tests, Run MBridge tests (Attach this for testing this PR against MBridge main)
