ci: restore perf test torchrun logs #4951

Merged
chtruong814 merged 1 commit into NVIDIA:main from chtruong814:chtruong/fix-perf-tests
May 23, 2026

@chtruong814

Contributor

Summary

  • Restore torchrun per-rank log emission in the perf test harness.
  • Create {assets_dir}/logs/1 beside {assets_dir}/perf_results so launch_jet_workload.py can find std*.log assets.
  • Fixes the gpt_16b_perf retry loop introduced when PR #4917 ("Perf tests") removed the torchrun log arguments.
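
For readers unfamiliar with the layout described above, here is a minimal sketch of how a harness might prepare the per-rank log directory and assemble torchrun logging arguments. It is not the code from run_perf_test.sh: the function name `build_torchrun_log_args`, the `ASSETS_DIR` and `TORCHRUN_LOG_ARGS` variables, and the choice of `--log-dir`, `--redirects`, and `--tee` with the value `3` are all assumptions made for illustration. Those three options do exist in torchrun, but the PR text does not say which arguments were restored.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not from run_perf_test.sh): create the per-rank log
# directory and fill a global array with torchrun logging arguments.
build_torchrun_log_args() {
    local assets_dir="$1"
    local log_dir="${assets_dir}/logs/1"

    # The PR summary places logs/1 beside perf_results under the assets dir.
    mkdir -p "${log_dir}"

    # Assumed flags: --log-dir sets the base log directory, while
    # --redirects 3 and --tee 3 capture both stdout and stderr per worker.
    TORCHRUN_LOG_ARGS=(--log-dir "${log_dir}" --redirects 3 --tee 3)
}

# Usage sketch with a throwaway assets directory.
ASSETS_DIR="$(mktemp -d)"
build_torchrun_log_args "${ASSETS_DIR}"

# Print the command line that would be used; nothing is launched here.
echo "torchrun ${TORCHRUN_LOG_ARGS[*]} <training script and args>"
```

Keeping the arguments in an array means paths containing spaces survive quoting when the array is later expanded as `"${TORCHRUN_LOG_ARGS[@]}"` in a real torchrun invocation.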

Test Plan

  • bash -n tests/performance_tests/shell_test_utils/run_perf_test.sh
  • git diff --check -- tests/performance_tests/shell_test_utils/run_perf_test.sh

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented May 23, 2026


Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@chtruong814 chtruong814 marked this pull request as ready for review May 23, 2026 01:48
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team May 23, 2026 01:48
@chtruong814

Contributor Author

Fast merging to resolve internal testing issue. This script is only used on internal tests.

@chtruong814 chtruong814 merged commit f7f584d into NVIDIA:main May 23, 2026
28 checks passed
santhnm2 pushed a commit to santhnm2/Megatron-LM that referenced this pull request May 26, 2026
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
shanmugamr1992 added a commit to shanmugamr1992/Megatron-LM that referenced this pull request May 26, 2026
These three torchrun args were added by NVIDIA#4951 on main but lost when
perf-fix branched off perf-tests (which predates NVIDIA#4951). The merge of
main into perf-fix did not pick them up cleanly. Restoring so the file
matches main exactly — the PR no longer touches run_perf_test.sh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Victarry added a commit to yanring/Megatron-LM that referenced this pull request May 27, 2026
* origin/main: (50 commits)
  Drain predecessor reduce-scatter at dispatch time (NVIDIA#4940)
  ci: Add allow_failure flag to gpt and moe recipes that are failing in nightlies (NVIDIA#4905)
  fix(tests): initialize num_microbatches calculator in vision cudagraph tests (NVIDIA#4986)
  test: re-enable test_pp2_create_cudagraphs_first_stage on TE 2.15+ (NVIDIA#4985)
  ci: Add support for MBridge job gating based on PR labels  (NVIDIA#4926)
  test(ci): re-enable 8experts2parallel_multi_dist_optimizer_instances_1node (NVIDIA#4984)
  test: re-enable paged stashing MoE tests (NVIDIA#4978)
  Fix elastification unwrap_model import (NVIDIA#4972)
  Avoid offsetting functional test master port (NVIDIA#4973)
  test: enable NVTE_CUTEDSL_FUSED_GROUPED_MLP via pytest fixture (NVIDIA#4931)
  chore(beep boop 🤖): Bump  (main) (2026-05-25)
  test(release): add release goldens for deepseekv3/nemotron3 and set tp2pp2 exit-interval (NVIDIA#4932)
  Fix `get_batch` return order to ignore BlendedDataset provenance fields (NVIDIA#4952)
  ci: restore perf test torchrun logs (NVIDIA#4951)
  Various training utils (NVIDIA#4872)
  ci: Update training script paths in BERT and T5 (NVIDIA#4939)
  [MXFP8/FP4-param-gather] Post processing after forced param AG in eval (NVIDIA#4562)
  Fix mxfp8 param gather numerical issue when DP overlap is off (NVIDIA#4800)
  Add TEFusedDenseMLP for Dense+Grouped GEMM fusion on SM100+ (NVIDIA#4318) (NVIDIA#4786)
  Fix paged stashing test submodules lookup (NVIDIA#4925)
  ...

# Conflicts:
#	megatron/training/training.py
janEbert pushed a commit to janEbert/Megatron-LM that referenced this pull request Jun 2, 2026
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
2 participants