Add bump-base-image skill and update golden value comparison #4733
Merged
Document the end-to-end workflow for upgrading the NVIDIA PyTorch base container (`nvcr.io/nvidia/pytorch:<YY.MM>-py3`) used by Megatron-LM CI, distilled from PRs NVIDIA#4611 (the bump itself) and NVIDIA#4688 (the GitLab follow-up that was needed because NVIDIA#4611 only touched the GitHub pin).

The skill encodes:

- The two pin sites that must move together: `docker/.ngc_version.dev` (GitHub CI, via `Dockerfile.ci.dev`) and the `IMAGE_TYPE: dev` rows of `.gitlab/stages/01.build.yml` (GitLab CI). A pre-merge `rg` snippet enforces the sync to prevent recurrence of the NVIDIA#4688 trap.
- The `Run functional tests` label, which routes the PR into SCOPE=L1 / N_REPEAT=5 / cadence-bypass in `.github/workflows/cicd-main.yml` so the first CI run already exercises the full functional matrix.
- The `copy-pr-bot` `/ok to test <sha>` flow for fork PRs: authorization is per-SHA, so every push needs a fresh comment.
- Hand-off to the `update-golden-values` skill for refreshing drifted goldens via `download_golden_values.py --only-failing`, with the KL summary becoming the PR description blurb (78 files in NVIDIA#4611).
- Triage for real regressions: file a tracking issue (e.g. NVIDIA#4654, NVIDIA#4657), flip the recipe's `scope:` from `[mr, mr-github]` to `[mr-broken, mr-github-broken]` with an inline comment, and let the bump merge on its own concern.

The workflow is resumable across CI rounds because state lives in the PR itself, not the skill. A typical bump flows S1 (PR open, both pins moved, label applied) -> CI -> S2 (classify failures) -> S3a (refresh goldens) and/or S3b (flip scope to broken) -> CI -> ... -> S4 (Step 7 sync check) -> merge. Each invocation reads `git status`, the latest CI run, and the recipe scopes to determine which state it's in; nothing is carried in skill memory between sessions. Wall-clock "Day 1/2/3" is just shorthand for "human pings the agent after each ~6h CI round".

The skill deliberately stays out of scope for: bumping LTS in the same PR (separate cadence), hand-editing golden JSONs (use the dedicated skill), and fixing real regressions inline with the bump (use `mr-broken` + an issue instead).

Signed-off-by: Ajay Balasa <abalasa@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
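
The pin-sync check described in this commit message is implemented in the skill as an `rg` snippet, which is not shown on this page. As an illustration only, the same idea can be sketched in Python. The sketch assumes that both pin sites contain the full image reference in the `nvcr.io/nvidia/pytorch:<YY.MM>-py3` form and that the caller has already isolated the `IMAGE_TYPE: dev` rows of the GitLab file. The function names and the sample tags are hypothetical.

```python
import re

# Image reference format quoted in the PR text: nvcr.io/nvidia/pytorch:<YY.MM>-py3
IMAGE_RE = re.compile(r"nvcr\.io/nvidia/pytorch:(\d{2}\.\d{2})-py3")


def pinned_tags(text: str) -> set[str]:
    """Return every YY.MM tag of the NVIDIA PyTorch image found in `text`."""
    return set(IMAGE_RE.findall(text))


def pins_in_sync(github_pin_text: str, gitlab_dev_rows_text: str) -> bool:
    """True when both texts pin exactly one tag and it is the same tag.

    Assumption: `gitlab_dev_rows_text` holds only the `IMAGE_TYPE: dev` rows,
    so rows for other image types cannot mask a stale dev pin.
    """
    github_tags = pinned_tags(github_pin_text)
    gitlab_tags = pinned_tags(gitlab_dev_rows_text)
    return len(github_tags) == 1 and github_tags == gitlab_tags


# Hypothetical file contents; the real files may be laid out differently.
github_pin = "nvcr.io/nvidia/pytorch:25.11-py3\n"
gitlab_ok = "BASE_IMAGE: nvcr.io/nvidia/pytorch:25.11-py3\nIMAGE_TYPE: dev\n"
gitlab_stale = "BASE_IMAGE: nvcr.io/nvidia/pytorch:25.09-py3\nIMAGE_TYPE: dev\n"

in_sync = pins_in_sync(github_pin, gitlab_ok)
out_of_sync = pins_in_sync(github_pin, gitlab_stale)
```

Requiring exactly one tag on the GitHub side makes the check fail closed: an empty or ambiguous pin file is reported as out of sync rather than silently passing.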
- Enhanced the `bump-base-image` skill documentation to clarify the workflow for updating the PyTorch base image, emphasizing the importance of synchronizing GitHub and GitLab CI pins.
- Updated the `update-golden-values` skill to reflect changes in the scoring method, now using average normalized relative differences instead of KL divergence for golden value comparisons.
- Modified the `compare_golden_values_kl.py` script to compute and report average relative differences, improving clarity and usability for users comparing golden values.

These changes aim to streamline the process of updating golden values and ensure accurate reporting of differences, enhancing the overall CI workflow.
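
The PR description gives the new score as `avg_rel_diff = mean((old − new) / old)`, computed per (file, metric). A minimal sketch of that formula follows. It is not the code in `compare_golden_values_kl.py`, which is not shown on this page. Skipping entries whose old value is zero is an assumption made here to avoid division by zero; the function signature is likewise hypothetical.

```python
from statistics import fmean


def avg_rel_diff(old: list[float], new: list[float]) -> float:
    """Signed mean of (old - new) / old over paired golden values.

    Positive means the new values are, on average, lower than the old ones.
    Assumption: pairs whose old value is zero are skipped.
    """
    if len(old) != len(new):
        raise ValueError("old and new must have the same length")
    terms = [(o - n) / o for o, n in zip(old, new) if o != 0]
    if not terms:
        raise ValueError("no pairs with a non-zero old value")
    return fmean(terms)


# (2 - 1) / 2 = 0.5 and (4 - 5) / 4 = -0.25, so the mean is 0.125.
score = avg_rel_diff([2.0, 4.0], [1.0, 5.0])
```

Because the score is signed, deviations in opposite directions cancel out, so a value near zero indicates no systematic drift rather than step-by-step agreement.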
/ok to test 3110a17
chtruong814 approved these changes on May 11, 2026
ko3n1g approved these changes on May 11, 2026
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25701539485
What does this PR do?
Add the `bump-base-image` skill capturing the end-to-end workflow for upgrading the NVIDIA PyTorch base container (both the GitHub `docker/.ngc_version.dev` pin and the GitLab `.gitlab/stages/01.build.yml` `IMAGE_TYPE: dev` rows), including the post-bump CI loop, golden-value refresh hand-off, and `mr-broken` triage path.
Simplify `compare_golden_values_kl.py` to a single per-(file, metric) signed average normalized relative difference (`avg_rel_diff = mean((old − new) / old)`), dropping KL / median / max-rel-diff, and update the `update-golden-values` skill (description, summary template, triage rules, gotchas) to match.
Issue tracking
For PRs from open-source community contributors:
Linked issue:
Contribution process
Pre-checks
Code review
Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
.github/CODEOWNERS.
Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change `megatron/core`, once all expert reviewers have approved, the `Final Review` label is applied automatically and final reviewers are assigned.
For PRs outside `megatron/core`, this step is skipped.
Step 3: Approved
Once all required reviewers have approved, the `Approved` label is applied automatically.
Merge
Any member of mcore-engineers will be able to merge your PR.
For MRs into `dev` branch
The proposed review process for the `dev` branch is under active discussion.
MRs are mergeable after one approval by either `eharper@nvidia.com` or `zijiey@nvidia.com`.