test(ci): re-enable 8experts2parallel_multi_dist_optimizer_instances_1node by ko3n1g · Pull Request #4984 · NVIDIA/Megatron-LM

ko3n1g · 2026-05-26T10:17:42Z

Claude summary

Re-enables the GB200 1-node MoE functional test gpt3_mcore_te_tp2_pp1_resume_torch_dist_te_8experts2parallel_multi_dist_optimizer_instances_1node by flipping its recipe scope from mr-github-broken back to mr-github.

Why it was broken

The 1-node variant was added in #4334. Initially the config inherited --expert-model-parallel-size: 2 from the 2-node original. With world_size 4 (TP=2, PP=1, EP=2):

expert_tensor_model_pipeline_parallel_size = ETP * EP * PP = 2 * 2 * 1 = 4
expert_data_parallel_size                  = world_size / 4         = 1
num_distributed_optimizer_instances        = 2

so the assertion at megatron/core/parallel_state.py:1248 triggered:

AssertionError: Expert data parallel size should be divisible by partial DistOpt shard factor

(1 % 2 != 0). That is the failure captured in #4342.
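The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch, not the code in megatron/core/parallel_state.py: the helper names `expert_data_parallel_size` and `check_dist_opt_instances` are hypothetical, and the divisibility condition is taken from the assertion message and the commit description in this PR.

```python
def expert_data_parallel_size(world_size: int, etp: int, ep: int, pp: int) -> int:
    """Return world_size / (ETP * EP * PP), following the PR description."""
    model_parallel = etp * ep * pp
    if world_size % model_parallel != 0:
        raise ValueError(f"world_size {world_size} is not divisible by {model_parallel}")
    return world_size // model_parallel


def check_dist_opt_instances(edp: int, num_instances: int) -> None:
    # Hypothetical stand-in for the failing assertion: EDP must be a
    # multiple of num_distributed_optimizer_instances.
    assert edp % num_instances == 0, (
        "Expert data parallel size should be divisible by partial DistOpt shard factor"
    )


# Broken configuration: world_size=4, ETP=2, EP=2, PP=1, two optimizer instances.
edp_broken = expert_data_parallel_size(world_size=4, etp=2, ep=2, pp=1)

try:
    check_dist_opt_instances(edp_broken, num_instances=2)
    broken_config_failed = False
except AssertionError:
    broken_config_failed = True
```

Here `edp_broken` evaluates to 1, so the check raises, matching the reported failure.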

Why it works now

The version of #4334 that was finally merged had already reduced --expert-model-parallel-size from 2 to 1 in the test config. With EP=1 the arithmetic becomes:

expert_tensor_model_pipeline_parallel_size = 2 * 1 * 1 = 2
expert_data_parallel_size                  = 4 / 2     = 2
2 % 2                                                 = 0   OK

so the assertion passes. The recipe was left at mr-github-broken defensively until the EP fix was validated; this PR completes the re-enable.
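To see which EP values satisfy the constraint for this layout, a small sweep can help. The sketch below is illustrative only: `valid_ep_sizes` is a hypothetical helper, and it considers only the divisibility rule discussed in this PR, not any other constraint Megatron-LM may enforce (for example on the number of experts).

```python
def valid_ep_sizes(world_size: int, etp: int, pp: int, num_instances: int) -> list[int]:
    """List EP values for which EDP is a multiple of num_instances."""
    valid = []
    for ep in range(1, world_size + 1):
        model_parallel = etp * ep * pp
        if world_size % model_parallel != 0:
            continue  # this EP does not evenly tile the world size
        edp = world_size // model_parallel
        if edp % num_instances == 0:
            valid.append(ep)
    return valid


# 1-node layout from this PR: world_size=4, ETP=2, PP=1, two optimizer instances.
one_node = valid_ep_sizes(world_size=4, etp=2, pp=1, num_instances=2)
print(one_node)  # [1]
```

For the 1-node layout only EP=1 survives, which is consistent with the config change described above. With a world size of 8 (an assumption about the 2-node original, not stated in this PR) the same sweep returns [1, 2].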

Change

- test_case: [gpt3_mcore_te_tp2_pp1_resume_torch_dist_te_8experts2parallel_multi_dist_optimizer_instances_1node]
  products:
    - environment: [dev]
-     scope: [mr-github-broken]
+     scope: [mr-github]
      platforms: [dgx_gb200]
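The recipe entry can be modelled as plain Python data to check which scope a test case ends up in. This sketch assumes the recipe has already been parsed into dictionaries shaped like the fragment above; the helper `cases_in_scope` is hypothetical and is not part of the Megatron-LM test tooling.

```python
def cases_in_scope(entries: list[dict], scope: str) -> list[str]:
    """Collect test case names that have at least one product in the given scope."""
    names = []
    for entry in entries:
        products = entry.get("products", [])
        # Exact match against the scope list, so "mr-github" does not
        # match "mr-github-broken".
        if any(scope in product.get("scope", []) for product in products):
            names.extend(entry["test_case"])
    return names


TEST_NAME = (
    "gpt3_mcore_te_tp2_pp1_resume_torch_dist_te_"
    "8experts2parallel_multi_dist_optimizer_instances_1node"
)

# The entry as it looks after this PR.
entry_after = {
    "test_case": [TEST_NAME],
    "products": [
        {"environment": ["dev"], "scope": ["mr-github"], "platforms": ["dgx_gb200"]}
    ],
}

enabled = cases_in_scope([entry_after], "mr-github")
still_broken = cases_in_scope([entry_after], "mr-github-broken")
```

After the change, the test name appears in `enabled` and `still_broken` is empty.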

Golden values

Populated golden_values_dev_dgx_gb200.json from GitHub Actions workflow run 26446415718 (job 77882977766). The previous file was an empty {} placeholder; this is the first set of recorded metrics (lm loss, num-zeros, mem-allocated-bytes, mem-max-allocated-bytes, iteration-time).
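A quick sanity check that a golden-values file is no longer an empty placeholder might look like the sketch below. It assumes the metric names listed above appear as top-level keys of the JSON object, which this PR does not show; `missing_metrics` is a hypothetical helper.

```python
import json

# Metric names as listed in the PR description.
EXPECTED_METRICS = [
    "lm loss",
    "num-zeros",
    "mem-allocated-bytes",
    "mem-max-allocated-bytes",
    "iteration-time",
]


def missing_metrics(golden_json: str, expected=EXPECTED_METRICS) -> list[str]:
    """Return the expected metric names absent from the top level of the JSON text."""
    data = json.loads(golden_json)
    return [name for name in expected if name not in data]


# The previous placeholder file contained only "{}", so every metric is missing.
placeholder_missing = missing_metrics("{}")
```

Applied to the text of the populated file, the function would return an empty list if the key-layout assumption holds.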

Closes #4342

…1node

The 1-node GB200 variant of this test was added in NVIDIA#4334 with the 2-node config's `--expert-model-parallel-size: 2`. With world_size=4 (TP=2, PP=1, EP=2) that yields `expert_data_parallel_size = 1`, which violates `expert_data_parallel_size % num_distributed_optimizer_instances == 0` at `parallel_state.py:1248` and triggered NVIDIA#4342.

PR NVIDIA#4334 was merged with EP reduced to 1, but the recipe scope was left at `mr-github-broken` defensively. With EP=1, EDP=2 and NDOI=2 the assertion passes. Re-enable the test.

Closes NVIDIA#4342

Signed-off-by: oliver könig <okoenig@nvidia.com>

copy-pr-bot · 2026-05-26T10:17:46Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

ko3n1g · 2026-05-26T10:17:54Z

/ok to test 67b7ec3

Populate golden_values_dev_dgx_gb200.json for the re-enabled 8experts2parallel_multi_dist_optimizer_instances_1node test from GitHub Actions workflow run 26446415718.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>

ko3n1g · 2026-05-26T14:32:55Z

/ok to test

svcnvidia-nemo-ci · 2026-05-26T16:15:15Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26460509529

…1node (NVIDIA#4984)

Signed-off-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* origin/main: (50 commits)
  Drain predecessor reduce-scatter at dispatch time (NVIDIA#4940)
  ci: Add allow_failure flag to gpt and moe recipes that are failing in nightlies (NVIDIA#4905)
  fix(tests): initialize num_microbatches calculator in vision cudagraph tests (NVIDIA#4986)
  test: re-enable test_pp2_create_cudagraphs_first_stage on TE 2.15+ (NVIDIA#4985)
  ci: Add support for MBridge job gating based on PR labels (NVIDIA#4926)
  test(ci): re-enable 8experts2parallel_multi_dist_optimizer_instances_1node (NVIDIA#4984)
  test: re-enable paged stashing MoE tests (NVIDIA#4978)
  Fix elastification unwrap_model import (NVIDIA#4972)
  Avoid offsetting functional test master port (NVIDIA#4973)
  test: enable NVTE_CUTEDSL_FUSED_GROUPED_MLP via pytest fixture (NVIDIA#4931)
  chore(beep boop 🤖): Bump (main) (2026-05-25)
  test(release): add release goldens for deepseekv3/nemotron3 and set tp2pp2 exit-interval (NVIDIA#4932)
  Fix `get_batch` return order to ignore BlendedDataset provenance fields (NVIDIA#4952)
  ci: restore perf test torchrun logs (NVIDIA#4951)
  Various training utils (NVIDIA#4872)
  ci: Update training script paths in BERT and T5 (NVIDIA#4939)
  [MXFP8/FP4-param-gather] Post processing after forced param AG in eval (NVIDIA#4562)
  Fix mxfp8 param gather numerical issue when DP overlap is off (NVIDIA#4800)
  Add TEFusedDenseMLP for Dense+Grouped GEMM fusion on SM100+ (NVIDIA#4318) (NVIDIA#4786)
  Fix paged stashing test submodules lookup (NVIDIA#4925)
  ...

# Conflicts:
#   megatron/training/training.py

ko3n1g added the Run functional tests label May 26, 2026

copy-pr-bot Bot temporarily deployed to public May 26, 2026 10:18 Inactive

copy-pr-bot Bot temporarily deployed to test May 26, 2026 10:18 Inactive

copy-pr-bot Bot temporarily deployed to public May 26, 2026 10:21 Inactive

copy-pr-bot Bot temporarily deployed to public May 26, 2026 10:22 Inactive

copy-pr-bot Bot temporarily deployed to public May 26, 2026 10:29 Inactive

copy-pr-bot Bot temporarily deployed to public May 26, 2026 14:33 Inactive

copy-pr-bot Bot temporarily deployed to test May 26, 2026 14:34 Inactive

copy-pr-bot Bot temporarily deployed to public May 26, 2026 14:36 Inactive

copy-pr-bot Bot temporarily deployed to public May 26, 2026 14:37 Inactive

copy-pr-bot Bot temporarily deployed to public May 26, 2026 14:44 Inactive

ko3n1g marked this pull request as ready for review May 26, 2026 15:02

ko3n1g requested a review from a team as a code owner May 26, 2026 15:02

ko3n1g enabled auto-merge May 26, 2026 15:02

svcnvidia-nemo-ci requested a review from a team May 26, 2026 15:02

svcnvidia-nemo-ci added the complexity: low label May 26, 2026

thomasdhc approved these changes May 26, 2026

svcnvidia-nemo-ci added the Approved label (All necessary approvals have been made) May 26, 2026

ko3n1g added this pull request to the merge queue May 26, 2026

Merged via the queue into NVIDIA:main with commit ff64743 May 26, 2026
370 of 376 checks passed

ko3n1g deleted the ko3n1g/fix/re-enable-4342 branch May 26, 2026 16:55

ko3n1g merged 2 commits into NVIDIA:main from ko3n1g:ko3n1g/fix/re-enable-4342
