chore: nightly sync main into dev (27_05_2026) #5029
svcnvidia-nemo-ci wants to merge 126 commits into
Co-authored-by: Antoni-Joan Solergibert <asolergibert@nvidia.com>
…al tests` (#4730) Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Shivanjan Chakravorty <shivanjanc@nvidia.com> Co-authored-by: Cory Ye <44509866+cspades@users.noreply.github.com>
Signed-off-by: Maanu Grover <maanug@nvidia.com> Signed-off-by: oliver könig <okoenig@nvidia.com> Co-authored-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Shivanjan Chakravorty <shivanjanc@nvidia.com>
Signed-off-by: Ajay Balasa <abalasa@nvidia.com> Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Maanu Grover <maanug@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com> Co-authored-by: Tuomas Rintamaki <trintamaki@nvidia.com> Co-authored-by: Tyler Poon <tylerpoon@gmail.com> Co-authored-by: Collin McCarthy <cmccarthy@nvidia.com> Co-authored-by: Matthieu Le <matthieul@nvidia.com> Co-authored-by: Piotr Zelasko <pzelasko@nvidia.com> Co-authored-by: Ehsan Hosseini Asl <ehosseiniasl@nvidia.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Siddharth Singh <sidsingh@nvidia.com>
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
…parallel_size * expert_tensor_parallel_size (#4678) Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
Co-authored-by: Jorge Albericio <jalbericiola@nvidia.com>
#4509) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Maanu Grover <maanug@nvidia.com>
#4492) Signed-off-by: Xiaowei Ren <xren@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: William Dykas <wdykas@oci-hsg-cs-001-vscode-03.cm.cluster>
…parallelism correctly (#4687)
/ok to test ce959b5
/ok to test d6aa731
/ok to test 16fda8d
/ok to test 68c9447
/ok to test d1a0bca
/ok to test 32229e7
CI update for latest SHA
No further Megatron-LM code failure is currently visible in the latest completed CI run.
/ok to test fb257b4
/ok to test ec339b425a4f7bfae395cae5393a30ec56004368
@Phlip79, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/
/ok to test ec339b4
/ok to test 7ebdee6
/ok to test a4a4457
/ok to test e8356b0
/ok to test 40f39e5
/ok to test 8d9d7ac
# Conflicts:
# megatron/training/training.py
# tests/unit_tests/transformer/test_multi_token_prediction.py
/ok to test 9f50973
/ok to test b87cdc0
Signed-off-by: Deyu Fu <deyuf@nvidia.com>
Added a couple of changes:
/ok to test e4ce64d
/ok to test 9f8c466
/ok to test 954ab1c
Superseded by today's nightly sync.
Summary
Nightly sync of `main` into `dev`. Brings in 119 commits from `main` since the last sync.
Python-only line stats: +20029 / -3729 across 244 files.
Files where main's version was taken (in addition to skill list)
The dev-feature audit's strict line-by-line comparison flagged a number of files where `main` intentionally rewrote the code that dev's older versions had. Each was verified by tracing back to a specific main-only commit:
- `examples/post_training/modelopt/*.py`, `*.md`, `*.sh`, `megatron/post_training/*.py`: 14aaa7e0e Modernize post-training modelopt example scripts (#4807)
- `megatron/core/resharding/**`, `tests/unit_tests/resharding/test_planner.py`: 20bf831da refit clean up and refactoring (#4762)
- `megatron/core/transformer/moe/experts.py` (`delay_wgrad_compute` / `_make_fused_ops` refactor), `megatron/core/transformer/moe/moe_layer.py` (`InferenceMode.is_active()`), `megatron/core/transformer/transformer_layer.py` (`as_mlp_submodule` / `MlpBuilder`)
- `megatron/core/extensions/transformer_engine.py` (`TEFusedMLP.as_mlp_submodule`, `_normalize_grouped_parameter_keys` class method, `TEFusedMLPWithGroupedLinear`), `megatron/core/models/gpt/gpt_layer_specs.py` (rename `dense_grouped_gemm` → `use_grouped_gemm_for_dense_mlp` and switch to `as_mlp_submodule`), `megatron/core/transformer/transformer_config.py` rename, `tests/unit_tests/transformer/test_te_fused_mlp_with_grouped_linear_spec.py` (new, replaces deleted `test_te_fused_dense_mlp_spec.py`): fa7a23bad Add TEFusedDenseMLP for Dense+Grouped GEMM fusion on SM100+ (#4318) (#4786)
- `megatron/core/transformer/transformer_block.py` (moved `_checkpointed_forward` to `megatron/core/recompute.py:checkpointed_forward`), addition of `megatron/core/recompute.py`
- `megatron/core/optimizer/optimizer.py` (extracted deferred MXFP8 param-sync into helper methods), `megatron/core/optimizer/distrib_optimizer.py` (`start_param_sync_for_bucket_group_subset`, FP8/FP4 path split)
- `megatron/training/argument_utils.py` (`hybrid_config_from_args`), `megatron/training/training.py`, `megatron/training/datasets/data_samplers.py`, `megatron/training/initialize.py`, `megatron/training/utils/common_utils.py`, `megatron/core/optimizer/layer_wise_optimizer.py`
- `tests/test_utils/recipes/h100/gpt.yaml` (new `gpt3_mcore_te_tp2_pp1_gdn_no_nvrx_*` cases alongside dev's `mhc` case), `tests/test_utils/recipes/moe2.0.yaml` (kept dev's `pretrain_gpt.py` script path)
Dev-only additions preserved across the merge
- `HyperConnectionTransformerLayer` + mHC config block (`enable_hyper_connections`, `num_residual_streams`, `mhc_*`, `use_fused_mhc`, `mhc_recompute_layer_num`) in `transformer_config.py` / `transformer_layer.py` / `gpt_layer_specs.py` / `experimental_attention_variant_module_specs.py`.
- `MoEOverloadFactorTracker`, `record_dispatch_token_counts`, and `RecordDispatchTokenCountsFunction` in `megatron/core/transformer/moe/moe_logging.py` and `moe_utils.py`.
- `self._maybe_record_overload_factor(...)` invocation in `moe_layer.py` (paired with main's new `InferenceMode.is_active()` inference-dispatcher selection).
- `LinearCrossEntropyModule` import + usage in `gpt_model.py`.
- `mhc_multistream`, `mhc_enabled`, `e_proj` / `h_proj` paths in `multi_token_prediction.py` (combined with main's `padding_mask` plumbing through `_get_embeddings`).
- `cp_group` swap in `MTPBlock.forward` driven by `packed_seq_params.cp_group`.
- `input_ids` plumbing through MoE forward stack (`router.py`, `moe_layer.py`, `transformer_layer.py`, and the recompute path via a new `input_ids` kwarg added to `megatron/core/recompute.py:checkpointed_forward`).
- `HybridEPTokenDispatcher.setup_metadata` (`_original_num_tokens`, `_padded_num_tokens`, group-wide max via all-reduce).
- `init_chunk_handler(pp_rank, vp_size, vp_stage, min_offloaded_tensor_size, delta_offload_bytes_across_pp_ranks, activation_offload_fraction, ...)` signature (and matching callers in `gpt_model.py` / `hybrid_model.py`).
- `CheckpointManager` mHC recompute plan inside `HybridStack.forward` (`_build_mhc_recompute_layer_plan`, `_finalize_mhc_recompute_layer`).
- `delay_offload_until_cuda_graph`, `delta_offload_bytes_across_pp_ranks`, `activation_offload_fraction` config fields.
- `paged_stash_init_chunk_handler` import in `gpt_model.py`.
- `--dynamic-context-parallel` / `--min-dynamic-context-parallel-size` plumbing across `arguments.py`, `initialize.py`, `data_samplers.py`, plus rename of `args.hybrid_context_parallel` → `args.dynamic_context_parallel` in `training.py`.
- `_resolve_cu_seqlens` test extended with main's `cp_size` divisibility check.
Files restored / handled specially
- `tests/functional_tests/test_cases/gpt/gpt_dynamic_inference_tp1_pp1_dp8_583m_throughputtest_zmq/` removed (deleted on main; dev had local modifications to its golden values).
- `tests/test_utils/recipes/h100/t5.yaml` removed (deleted on dev; main had modifications).
- `tests/unit_tests/transformer/test_te_fused_dense_mlp_spec.py` removed and replaced by main's `test_te_fused_mlp_with_grouped_linear_spec.py` (renamed class).
- `megatron/post_training/generate.py` reverted to main's modernized version (replaces dev's `simple_generate` with `simple_speculative_generate` from main).
- Kept dev's `pyproject.toml`, `uv.lock`, `docker/Dockerfile.ci.dev`, and `.github/CODEOWNERS` verbatim per skill convention.
API mismatches resolved
- `attention.py` / `multi_latent_attention.py`: switched to main's `apply_module(self.linear_proj)` + `off_interface.group_commit(...)` pattern; added a static `group_commit` method to the merged tree's `FineGrainedActivationOffloadingInterface` so both dev's instance `group_offload` and main's static `group_commit` callers work.
- `gated_delta_net.py`: took main's `_resolve_cu_seqlens(..., cp_size=...)` signature including the divisibility check, since callers now pass `cp_size=self.cp_size`.
- `transformer_layer.py`: switched to main's `as_mlp_submodule` / `submodules.mlp(...)` construction pattern; kept dev's `**moe_kwargs` (including `input_ids`) plumbing through `_forward_mlp` / `_forward_mlp_router`.
- `gpt_layer_specs.py` `get_mlp_module_spec_for_backend`: combined dev's hyper-connection / MTP plumbing with main's rename + `not_none(TEFusedMLPWithGroupedLinear)` dispatch.
- `recompute.py:checkpointed_forward`: added `input_ids` kwarg (and popped it for non-`TransformerLayer` layers) so the new free function still supports dev's hash-based MoE routing through the full-recompute path.
How this diff was produced
Conflict resolution followed the nightly-sync skill: started from `origin/dev`, ran `git merge origin/main --no-edit`, resolved ~67 conflicting paths surgically (preserving dev-only features unless I could identify a specific main commit removing them), ran `black --config pyproject.toml` + `isort` over the 244 changed Python files, kept dev's `uv.lock` / `pyproject.toml` / `docker/Dockerfile.ci.dev` / `.github/CODEOWNERS` verbatim, and audited the result with the pre-push hook before pushing.
CI status / external failures
Latest full CI for `32229e7353b6e74ed80bd939dba902d3f9f9f91d`:
- `Nemo_CICD_Test` passing. Install, docs, wheels, lint, unit, and functional/integration jobs passed, including the previously failing DCP, dynamic-inference, FSDP DTensor, DSv4, and scoped-cudagraph cases.
- `codecov/patch` is red at 70.81% vs the 80.00% patch target. Coverage contexts are exempt in the nightly-sync gate, and `Coverage (unit-test)` passed.
- `cicd-mbridge-testing` failed in downstream Megatron-Bridge run https://github.com/NVIDIA-NeMo/Megatron-Bridge/actions/runs/26861493475. The failure is during MBridge unit-test collection, before Megatron-LM code executes, with `pydantic_core.ValidationError: RunnablePassthrough.name Field required` from `langchain_core`. MBridge `main` at `b236c13319656d7c518ee57a7b87c671e1d069ea` resolves `pydantic 2.14.0a1` / `pydantic-core 2.47.0.dev` […] PRs while the Megatron-LM gate passes, for example:
  - [Dev] Skip identity alltoall chunk sort #5102 (https://github.com/NVIDIA/Megatron-LM/actions/runs/26810627892)
  - chore: Update Docker image version to 26.04-py3 on dev #5051 (https://github.com/NVIDIA/Megatron-LM/actions/runs/26641021478)
  - [dev] [fix] [DeepSeek-v4] fix dense loss and rope type in DSv4 Hybrid Attention #5018 (https://github.com/NVIDIA/Megatron-LM/actions/runs/26575319597)

🤖 Generated with Claude Code
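
For readers unfamiliar with the `input_ids` handling listed under "API mismatches resolved", the idea of forwarding a keyword argument only to layers that accept it can be sketched as follows. This is a minimal illustration, not the code in `megatron/core/recompute.py`: the classes `TransformerLayer` and `IdentityLayer` and the function `run_layers` are hypothetical stand-ins, and the real `checkpointed_forward` signature is not reproduced here.

```python
from typing import Any, List


class TransformerLayer:
    """Hypothetical stand-in for a layer whose forward accepts input_ids."""

    def forward(self, hidden: List[float], input_ids: Any = None) -> List[float]:
        # A real layer might use input_ids for routing; this one only records it.
        self.seen_input_ids = input_ids
        return hidden


class IdentityLayer:
    """Hypothetical stand-in for a layer whose forward does not accept input_ids."""

    def forward(self, hidden: List[float]) -> List[float]:
        return hidden


def run_layers(layers: list, hidden: List[float], **kwargs: Any) -> List[float]:
    """Call each layer in order, dropping input_ids for layers that cannot take it."""
    for layer in layers:
        layer_kwargs = dict(kwargs)  # copy so one layer's pop does not affect the next
        if not isinstance(layer, TransformerLayer):
            layer_kwargs.pop("input_ids", None)
        hidden = layer.forward(hidden, **layer_kwargs)
    return hidden
```

Copying the kwargs per layer keeps `input_ids` available to later `TransformerLayer` instances even when an earlier layer in the stack had it removed.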
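
The sequence described under "How this diff was produced" can also be written down as an ordered command plan. The sketch below only builds the argument lists and does not run anything. The branch name `nightly-sync`, the helper `plan_sync_commands`, and the use of `git checkout origin/dev -- <paths>` to keep dev's files verbatim are assumptions for illustration; the description does not say which commands the nightly-sync skill actually uses for that step, and manual conflict resolution happens between the merge and the formatting commands.

```python
from typing import List

# Files the description says were kept from dev verbatim.
DEV_VERBATIM = [
    "pyproject.toml",
    "uv.lock",
    "docker/Dockerfile.ci.dev",
    ".github/CODEOWNERS",
]


def plan_sync_commands(changed_py_files: List[str]) -> List[List[str]]:
    """Return the sync steps as argv lists, in the order they would be run."""
    commands = [
        # Start from dev (branch name is hypothetical).
        ["git", "checkout", "-B", "nightly-sync", "origin/dev"],
        # Merge main; conflicts are then resolved by hand.
        ["git", "merge", "origin/main", "--no-edit"],
        # Assumption: restore dev's copies of the files kept verbatim.
        ["git", "checkout", "origin/dev", "--", *DEV_VERBATIM],
    ]
    if changed_py_files:
        # Format only the Python files touched by the merge.
        commands.append(["black", "--config", "pyproject.toml", *changed_py_files])
        commands.append(["isort", *changed_py_files])
    return commands
```

Each entry can be passed to `subprocess.run(cmd, check=True)` by a caller that wants to execute the plan; keeping planning separate from execution makes the order easy to review before anything touches the working tree.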