ci: restore perf test torchrun logs by chtruong814 · Pull Request #4951 · NVIDIA/Megatron-LM

chtruong814 · 2026-05-23T01:47:55Z

Summary

Restore torchrun per-rank log emission in the perf test harness.
Create {assets_dir}/logs/1 beside {assets_dir}/perf_results so launch_jet_workload.py can find std*.log assets.
Fixes the gpt_16b_perf retry loop introduced when PR #4917 (Perf tests) removed the torchrun log arguments.
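
For readers without the diff at hand, here is a minimal sketch of the layout described above. It is not the actual change: the PR text does not list the restored arguments, so the sketch assumes torchrun's standard logging options (`--log-dir`, `--redirects`, `--tee`). The function name `prepare_torchrun_log_args` and the array `TORCHRUN_LOG_ARGS` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: prepare a per-rank log directory and the torchrun log arguments.
set -euo pipefail

# Hypothetical helper. Takes the assets directory and fills TORCHRUN_LOG_ARGS.
prepare_torchrun_log_args() {
    local assets_dir="$1"
    # Layout from the summary: logs/1 sits beside perf_results.
    local log_dir="${assets_dir}/logs/1"

    mkdir -p "${log_dir}"

    # Assumption: these are torchrun's standard logging flags, not necessarily
    # the exact ones restored by this PR. A value of 3 selects both stdout
    # and stderr for --redirects and --tee.
    TORCHRUN_LOG_ARGS=(--log-dir "${log_dir}" --redirects 3 --tee 3)
}

# Usage with a throwaway directory standing in for {assets_dir}.
demo_dir="$(mktemp -d)"
prepare_torchrun_log_args "${demo_dir}"
echo "torchrun ${TORCHRUN_LOG_ARGS[*]} <training script and arguments>"
```

Keeping the arguments in an array lets the launch line expand them with `"${TORCHRUN_LOG_ARGS[@]}"` without word-splitting problems if the assets path contains spaces.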

Test Plan

bash -n tests/performance_tests/shell_test_utils/run_perf_test.sh
git diff --check -- tests/performance_tests/shell_test_utils/run_perf_test.sh
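
Both commands are static checks: `bash -n` parses a script without executing it, and `git diff --check` reports whitespace errors and leftover conflict markers. Neither launches torchrun. The sketch below illustrates what `bash -n` does and does not catch, using two throwaway scripts (hypothetical file names, not files from this repository).

```shell
#!/usr/bin/env bash
# Sketch only: show that bash -n flags syntax errors without running anything.
set -u

tmp_dir="$(mktemp -d)"

# A script that parses cleanly.
printf '%s\n' 'if true; then' '  echo ok' 'fi' > "${tmp_dir}/good.sh"
# A script with an unterminated if block.
printf '%s\n' 'if true; then' '  echo ok' > "${tmp_dir}/bad.sh"

# bash -n reads and parses each script but executes no commands from it.
bash -n "${tmp_dir}/good.sh"
good_status=$?
bash -n "${tmp_dir}/bad.sh" 2>/dev/null
bad_status=$?

echo "good.sh exit status: ${good_status}"
echo "bad.sh exit status: ${bad_status}"
rm -rf "${tmp_dir}"
```

A zero status from `bash -n` therefore means the script is well-formed, not that the restored log arguments behave as intended at run time.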

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

copy-pr-bot · 2026-05-23T01:47:59Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

chtruong814 · 2026-05-23T01:50:50Z

Fast-merging to resolve an internal testing issue. This script is used only by internal tests.

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

These three torchrun args were added by NVIDIA#4951 on main but lost when perf-fix branched off perf-tests (which predates NVIDIA#4951). The merge of main into perf-fix did not pick them up cleanly. Restoring so the file matches main exactly; the PR no longer touches run_perf_test.sh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* origin/main: (50 commits)
  Drain predecessor reduce-scatter at dispatch time (NVIDIA#4940)
  ci: Add allow_failure flag to gpt and moe recipes that are failing in nightlies (NVIDIA#4905)
  fix(tests): initialize num_microbatches calculator in vision cudagraph tests (NVIDIA#4986)
  test: re-enable test_pp2_create_cudagraphs_first_stage on TE 2.15+ (NVIDIA#4985)
  ci: Add support for MBridge job gating based on PR labels (NVIDIA#4926)
  test(ci): re-enable 8experts2parallel_multi_dist_optimizer_instances_1node (NVIDIA#4984)
  test: re-enable paged stashing MoE tests (NVIDIA#4978)
  Fix elastification unwrap_model import (NVIDIA#4972)
  Avoid offsetting functional test master port (NVIDIA#4973)
  test: enable NVTE_CUTEDSL_FUSED_GROUPED_MLP via pytest fixture (NVIDIA#4931)
  chore(beep boop 🤖): Bump (main) (2026-05-25)
  test(release): add release goldens for deepseekv3/nemotron3 and set tp2pp2 exit-interval (NVIDIA#4932)
  Fix `get_batch` return order to ignore BlendedDataset provenance fields (NVIDIA#4952)
  ci: restore perf test torchrun logs (NVIDIA#4951)
  Various training utils (NVIDIA#4872)
  ci: Update training script paths in BERT and T5 (NVIDIA#4939)
  [MXFP8/FP4-param-gather] Post processing after forced param AG in eval (NVIDIA#4562)
  Fix mxfp8 param gather numerical issue when DP overlap is off (NVIDIA#4800)
  Add TEFusedDenseMLP for Dense+Grouped GEMM fusion on SM100+ (NVIDIA#4318) (NVIDIA#4786)
  Fix paged stashing test submodules lookup (NVIDIA#4925)
  ...

# Conflicts:
#   megatron/training/training.py

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

ci: restore perf test torchrun logs

cbdce56

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 marked this pull request as ready for review May 23, 2026 01:48

svcnvidia-nemo-ci requested a review from a team May 23, 2026 01:48

copy-pr-bot Bot temporarily deployed to public May 23, 2026 01:49 Inactive

copy-pr-bot Bot temporarily deployed to test May 23, 2026 01:49 Inactive

svcnvidia-nemo-ci added the complexity: low label May 23, 2026

chtruong814 merged commit f7f584d into NVIDIA:main May 23, 2026
28 checks passed

copy-pr-bot Bot temporarily deployed to public May 23, 2026 01:52 Inactive

copy-pr-bot Bot temporarily deployed to public May 23, 2026 02:00 Inactive

santhnm2 pushed a commit to santhnm2/Megatron-LM that referenced this pull request May 26, 2026

ci: restore perf test torchrun logs (NVIDIA#4951)

f6edc4e

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

janEbert pushed a commit to janEbert/Megatron-LM that referenced this pull request Jun 2, 2026

ci: restore perf test torchrun logs (NVIDIA#4951)

b390e81

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 merged 1 commit into NVIDIA:main from chtruong814:chtruong/fix-perf-tests
