
test(ci): re-enable 8experts2parallel_multi_dist_optimizer_instances_1node #4984

Merged
ko3n1g merged 2 commits into NVIDIA:main from ko3n1g:ko3n1g/fix/re-enable-4342
May 26, 2026

Conversation

@ko3n1g

@ko3n1g ko3n1g commented May 26, 2026

Contributor
Claude summary

Re-enables the GB200 1-node MoE functional test gpt3_mcore_te_tp2_pp1_resume_torch_dist_te_8experts2parallel_multi_dist_optimizer_instances_1node by flipping its recipe scope from mr-github-broken back to mr-github.

Why it was broken

The 1-node variant was added in #4334. Initially the config inherited --expert-model-parallel-size: 2 from the 2-node original. With world_size 4 (TP=2, PP=1, EP=2):

expert_tensor_model_pipeline_parallel_size = ETP * EP * PP = 2 * 2 * 1 = 4
expert_data_parallel_size                  = world_size / 4         = 1
num_distributed_optimizer_instances        = 2

so the assertion at megatron/core/parallel_state.py:1248 triggered:

AssertionError: Expert data parallel size should be divisible by partial DistOpt shard factor

(1 % 2 != 0). That is the failure captured in #4342.

Why it works now

#4334's final merge already reduced --expert-model-parallel-size from 2 to 1 in the test config. With EP=1 the arithmetic becomes:

expert_tensor_model_pipeline_parallel_size = 2 * 1 * 1 = 2
expert_data_parallel_size                  = 4 / 2     = 2
2 % 2                                                 = 0   OK

so the assertion passes. The recipe was left at mr-github-broken defensively until the EP fix was validated; this PR completes the re-enable.
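
The arithmetic for both configurations can be reproduced with a minimal Python sketch. The function names below are hypothetical and are not part of Megatron-LM; the sketch only mirrors the divisibility condition quoted above, assuming ETP=2 as in the calculation.

```python
def expert_data_parallel_size(world_size: int, etp: int, ep: int, pp: int) -> int:
    """Return world_size / (ETP * EP * PP), as in the calculation above."""
    expert_model_parallel = etp * ep * pp
    if world_size % expert_model_parallel != 0:
        raise ValueError("world_size must be divisible by ETP * EP * PP")
    return world_size // expert_model_parallel


def dist_opt_compatible(
    world_size: int, etp: int, ep: int, pp: int, num_dist_opt_instances: int
) -> bool:
    """Hypothetical stand-in for the divisibility check behind the assertion."""
    edp = expert_data_parallel_size(world_size, etp, ep, pp)
    return edp % num_dist_opt_instances == 0


# Original 1-node config inherited from the 2-node test: EP=2 -> EDP=1, 1 % 2 != 0
broken = dist_opt_compatible(world_size=4, etp=2, ep=2, pp=1, num_dist_opt_instances=2)

# Config as merged in #4334: EP=1 -> EDP=2, 2 % 2 == 0
fixed = dist_opt_compatible(world_size=4, etp=2, ep=1, pp=1, num_dist_opt_instances=2)

print(broken, fixed)  # False True
```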

Change

- test_case: [gpt3_mcore_te_tp2_pp1_resume_torch_dist_te_8experts2parallel_multi_dist_optimizer_instances_1node]
  products:
    - environment: [dev]
-     scope: [mr-github-broken]
+     scope: [mr-github]
      platforms: [dgx_gb200]

Golden values

Populated golden_values_dev_dgx_gb200.json from GitHub Actions workflow run 26446415718 (job 77882977766). The previous file was an empty {} placeholder; this is the first set of recorded metrics (lm loss, num-zeros, mem-allocated-bytes, mem-max-allocated-bytes, iteration-time).
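
A quick way to confirm that a golden values file is no longer an empty placeholder is sketched below. It assumes the file is a JSON object whose top-level keys are the metric names listed above; that layout is an assumption for illustration, and `missing_metrics` is a hypothetical helper, not part of the test harness.

```python
import json

# Metric names taken from the description above.
EXPECTED_METRICS = {
    "lm loss",
    "num-zeros",
    "mem-allocated-bytes",
    "mem-max-allocated-bytes",
    "iteration-time",
}


def missing_metrics(golden_json_text: str) -> set[str]:
    """Return expected metric names absent from the top level of the JSON object.

    Assumes metrics are top-level keys (an assumption, not confirmed by the PR).
    """
    data = json.loads(golden_json_text)
    return EXPECTED_METRICS - set(data)


# The previous placeholder file contained only "{}", so every metric is missing.
print(sorted(missing_metrics("{}")))
```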

Closes #4342

…1node

The 1-node GB200 variant of this test was added in NVIDIA#4334 with the
2-node config's `--expert-model-parallel-size: 2`. With world_size=4
(TP=2, PP=1, EP=2) that yields `expert_data_parallel_size = 1`, which
violates `expert_data_parallel_size % num_distributed_optimizer_instances == 0`
at `parallel_state.py:1248` and triggered NVIDIA#4342.

PR NVIDIA#4334 was merged with EP reduced to 1, but the recipe scope was left
at `mr-github-broken` defensively. With EP=1, EDP=2 and NDOI=2 the
assertion passes. Re-enable the test.

Closes NVIDIA#4342

Signed-off-by: oliver könig <okoenig@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented May 26, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@ko3n1g

ko3n1g commented May 26, 2026

Contributor Author

/ok to test 67b7ec3

Populate golden_values_dev_dgx_gb200.json for the re-enabled
8experts2parallel_multi_dist_optimizer_instances_1node test from
GitHub Actions workflow run 26446415718.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
@ko3n1g

ko3n1g commented May 26, 2026

Contributor Author

/ok to test

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the Approved label (All necessary approvals have been made) May 26, 2026
@ko3n1g ko3n1g added this pull request to the merge queue May 26, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26460509529

Merged via the queue into NVIDIA:main with commit ff64743 May 26, 2026
370 of 376 checks passed
@ko3n1g ko3n1g deleted the ko3n1g/fix/re-enable-4342 branch May 26, 2026 16:55
santhnm2 pushed a commit to santhnm2/Megatron-LM that referenced this pull request May 26, 2026
…1node (NVIDIA#4984)

Signed-off-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Victarry added a commit to yanring/Megatron-LM that referenced this pull request May 27, 2026
* origin/main: (50 commits)
  Drain predecessor reduce-scatter at dispatch time (NVIDIA#4940)
  ci: Add allow_failure flag to gpt and moe recipes that are failing in nightlies (NVIDIA#4905)
  fix(tests): initialize num_microbatches calculator in vision cudagraph tests (NVIDIA#4986)
  test: re-enable test_pp2_create_cudagraphs_first_stage on TE 2.15+ (NVIDIA#4985)
  ci: Add support for MBridge job gating based on PR labels  (NVIDIA#4926)
  test(ci): re-enable 8experts2parallel_multi_dist_optimizer_instances_1node (NVIDIA#4984)
  test: re-enable paged stashing MoE tests (NVIDIA#4978)
  Fix elastification unwrap_model import (NVIDIA#4972)
  Avoid offsetting functional test master port (NVIDIA#4973)
  test: enable NVTE_CUTEDSL_FUSED_GROUPED_MLP via pytest fixture (NVIDIA#4931)
  chore(beep boop 🤖): Bump  (main) (2026-05-25)
  test(release): add release goldens for deepseekv3/nemotron3 and set tp2pp2 exit-interval (NVIDIA#4932)
  Fix `get_batch` return order to ignore BlendedDataset provenance fields (NVIDIA#4952)
  ci: restore perf test torchrun logs (NVIDIA#4951)
  Various training utils (NVIDIA#4872)
  ci: Update training script paths in BERT and T5 (NVIDIA#4939)
  [MXFP8/FP4-param-gather] Post processing after forced param AG in eval (NVIDIA#4562)
  Fix mxfp8 param gather numerical issue when DP overlap is off (NVIDIA#4800)
  Add TEFusedDenseMLP for Dense+Grouped GEMM fusion on SM100+ (NVIDIA#4318) (NVIDIA#4786)
  Fix paged stashing test submodules lookup (NVIDIA#4925)
  ...

# Conflicts:
#	megatron/training/training.py
janEbert pushed a commit to janEbert/Megatron-LM that referenced this pull request Jun 2, 2026
…1node (NVIDIA#4984)

Signed-off-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Labels

Approved (All necessary approvals have been made), complexity: low, Run functional tests

Development

Successfully merging this pull request may close these issues.

🐛 CI failure: gpt3_mcore_te_tp2_pp1_resume_torch_dist_te_8experts2parallel_multi_dist_optimizer_instances_1node

3 participants