chore: nightly sync main into dev (10_05_2026) #4716

Merged: balasaajay merged 126 commits into dev from main2dev/10_05_2026 on May 11, 2026

@svcnvidia-nemo-ci

Summary

Nightly sync of main into dev for 10_05_2026.

  • Merged 120 commits from main into dev.
  • Python lines: +36212 / -10810 across 269 files
  • Merge strategy: git merge origin/main -X theirs with surgical resolution of squash-merge divergences and per-file overrides documented below.

Files taken from main wholesale

These files have known semantic conflicts where dev's versions reference args/APIs that main removed or renamed. Taken via git checkout origin/main -- <file>:

  • megatron/training/training.py
  • megatron/training/initialize.py
  • megatron/training/utils.py
  • megatron/training/datasets/data_samplers.py
  • megatron/core/optimizer/layer_wise_optimizer.py

Files deleted in dev but restored from main

  • megatron/core/pipeline_parallel/hybrid_cp_schedule.py — restored because main's HybridCPDataLoaderWrapper (appended to data_schedule.py) imports BalancedCPScheduler from it.

Modify/delete conflicts resolved

These files were deleted on main and modified on dev. We accepted main's deletion, since only the deleted files referenced one another and megatron.legacy.fp16_deprecated survives independently:

  • megatron/legacy/model/__init__.py (deleted)
  • megatron/legacy/model/transformer.py (deleted)
  • tools/checkpoint/loader_legacy.py (deleted)
  • tools/checkpoint/loader_llama_mistral.py (deleted)

This file was deleted on dev (PR #3576) and modified on main; we kept dev's deletion, as that PR intended:

  • .github/workflows/multi-approval-bot.yml (deleted)

Special handling

  • megatron/core/datasets/data_schedule.py: main and dev have completely different classes. Kept dev's file (BasePackingScheduler, DpBalancedScheduler, DefaultDynamicCPScheduler, wrap_data_iterator, get_batch_on_this_rank_for_sequence_packing) and appended main's HybridCPDataLoaderWrapper plus the Any, List typing imports and BalancedCPScheduler import.

Dependency triple preserved from dev

pyproject.toml, uv.lock, and docker/Dockerfile.ci.dev were kept at dev's versions per the merge policy. Audit of [tool.uv.sources] diff:

  • flash_mla, transformer-engine, nemo-run, emerging_optimizers: identical revisions, no action needed.
  • fast-hadamard-transform: dev-only, kept.
  • nvidia-resiliency-ext: dev pins 15a85156… (Apr 6 2026), main pins b2bb3d72… (Apr 16 2026). Kept dev's revision after verifying all submodules imported in the merged tree exist there (get_write_results_queue present, all checkpointing.{async,local}.* and shared_utils.inject_fault paths resolved).
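A check like the nvidia-resiliency-ext import verification above could be approximated with a short script. This is a minimal sketch under stated assumptions, not the procedure the sync actually ran: the helper names `imported_modules` and `missing_modules` are hypothetical, and the package prefix is passed in by the caller rather than taken from the real dependency.

```python
import ast
import importlib.util


def imported_modules(source: str, prefix: str) -> set[str]:
    """Collect absolute module names imported in `source` that fall under `prefix`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        found.update(n for n in names if n == prefix or n.startswith(prefix + "."))
    return found


def missing_modules(modules: set[str]) -> list[str]:
    """Return the modules that cannot be located in the current environment."""
    missing = []
    for name in sorted(modules):
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised, for example, when a parent package of `name` is absent.
            spec = None
        if spec is None:
            missing.append(name)
    return missing
```

Running `imported_modules` over each file in the merged tree and passing the union to `missing_modules`, in an environment with the pinned revision installed, would list any submodule path that the pinned revision does not provide. It only checks that modules resolve, not that individual symbols exist.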

API mismatch audit (post-merge)

Verified caller/implementation alignment in hotspots called out by the sync skill:

  • multi_latent_attention.py (from main) → FineGrainedActivationOffloadingInterface ctor (offload, tensor, name), group_offload, group_commit: ✅ match
  • mamba_model.py: no offload calls in current tree.
  • init_chunk_handler callers (hybrid_model.py, gpt_model.py): full 7-keyword-arg form matches definition. ✅
  • gated_delta_net.py: _resolve_cu_seqlens both called and defined inside the file. ✅
  • distrib_optimizer.py: _is_distopt_quantized_param and _expand_quantized_param_shard_for_cast both present and referenced consistently. ✅
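The caller/implementation checks above amount to asking whether a call's arguments bind to the definition's signature. A minimal sketch of that idea follows, using `inspect.signature(...).bind`; the class `OffloadInterface` is a hypothetical stand-in with the three parameter names mentioned above, not the real FineGrainedActivationOffloadingInterface.

```python
import inspect


def call_matches(func, *args, **kwargs) -> bool:
    """Return True if `func` accepts these arguments; `func` is never executed."""
    try:
        inspect.signature(func).bind(*args, **kwargs)
    except TypeError:
        return False
    return True


# Hypothetical stand-in for a constructor taking (offload, tensor, name).
class OffloadInterface:
    def __init__(self, offload, tensor, name):
        self.offload, self.tensor, self.name = offload, tensor, name


# A caller written against the current definition binds cleanly.
ok = call_matches(OffloadInterface, offload=True, tensor=None, name="attn")

# A caller written against an older definition (extra keyword) does not.
stale = call_matches(OffloadInterface, offload=True, tensor=None, name="attn", group=0)
```

This only validates argument shape. It says nothing about whether the arguments carry the meaning the callee expects, which still needs review by hand.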

CODEOWNERS

Unchanged from dev (verified git diff origin/dev -- .github/CODEOWNERS is empty).

Formatting

Ran black==24.10.0 then isort==5.13.2 on all 253 changed Python files in the merged tree.

Test plan

  • Unit tests pass
  • Integration tests pass (or are skipped per maintainer policy)
  • Internal GitLab functional tests pass

minitu and others added 30 commits April 22, 2026 18:02
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: john2 <john2@jrlogin01.jureca>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Co-authored-by: root <root@nvl72098-T17.cm.cluster>
Co-authored-by: William Dykas <wdykas@oci-hsg-cs-001-vscode-03.cm.cluster>
Co-authored-by: root <root@nvl72160-T13.cm.cluster>
…classmethod (#3812)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
#4403)

Co-authored-by: Antoni-Joan Solergibert <asolergibert@nvidia.com>
Co-authored-by: Philip Petrakian <ppetrakian@nvidia.com>
Signed-off-by: Maanu Grover <maanug@nvidia.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
Co-authored-by: Siddharth Singh <sidsingh@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ss curve gaps for latent MoE models (#4433)

Signed-off-by: root <jiemingz@nvidia.com>
…4158)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…4422)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rprenger <rprenger@nvidia.com>
Signed-off-by: qiyuw <qiyuw@nvidia.com>
Co-authored-by: Antoni-Joan Solergibert <asolergibert@nvidia.com>
Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
@Phlip79 force-pushed the main2dev/10_05_2026 branch from dc4ca1f to bfce39d on May 11, 2026 05:25
@Phlip79

Phlip79 commented May 11, 2026

Member

/ok to test bfce39d

@Phlip79 marked this pull request as ready for review on May 11, 2026 05:27
@Phlip79 requested review from a team as code owners on May 11, 2026 05:27
Signed-off-by: Deyu Fu <deyuf@nvidia.com>
@FDecaYed

Contributor

/ok to test fb297f0

@FDecaYed FDecaYed left a comment

Contributor

Fixed the last test failure. LGTM.

…llel

The Phase 1 merge took main's versions of these three files (per the
nightly-sync skill's "Files to Override from Main" list), but main
still references the deprecated arg name `hybrid_context_parallel`.
Dev renamed it to `dynamic_context_parallel` in commit cde56a4
("Fix for rope when enabling THD + Dynamic-CP; use the naming
Dynamic-CP"). model_parallel_config.py keeps both attributes as a
deprecation shim and copies `hybrid_context_parallel=True ->
dynamic_context_parallel=True` in __post_init__, but not the other
direction — so when a test passes `--dynamic-context-parallel: true`,
args.hybrid_context_parallel is False and any code that gates on it
silently takes the wrong branch.
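
The one-directional shim described above can be illustrated with a small sketch. `ShimConfig` and `pick_branch` are hypothetical stand-ins written for this example; they are not the real model_parallel_config.py code, only a model of the behavior the paragraph describes.

```python
from dataclasses import dataclass


@dataclass
class ShimConfig:
    """Hypothetical config that keeps a deprecated alias next to the new name."""

    hybrid_context_parallel: bool = False   # deprecated name
    dynamic_context_parallel: bool = False  # current name

    def __post_init__(self):
        # One-directional copy: the old flag enables the new one, never the reverse.
        if self.hybrid_context_parallel:
            self.dynamic_context_parallel = True


def pick_branch(cfg: ShimConfig, flag_name: str) -> str:
    """Model code that gates on a single flag name."""
    return "pass-through" if getattr(cfg, flag_name) else "default"


# A run that sets only the new flag leaves the deprecated attribute False,
# so code gating on the old name falls through to the default branch.
cfg = ShimConfig(dynamic_context_parallel=True)
old_gate = pick_branch(cfg, "hybrid_context_parallel")
new_gate = pick_branch(cfg, "dynamic_context_parallel")
```

Here `old_gate` is "default" while `new_gate` is "pass-through", which is the silent divergence the rename removes.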

This caused gpt3_mcore_te_tp2_pp1_cp4_dcp to fail in CI:
data_samplers.py:109 used `if args.hybrid_context_parallel:` to
select the pass-through `collate_fn=lambda x: x`. With the test's
--dynamic-context-parallel flag, the condition was False, so torch's
default-collate ran on tensors from the SFT-mock THD path and hit
non-resizable shared-memory storage:
  RuntimeError: Trying to resize storage that is not resizable
  in torch/utils/data/_utils/collate.py:274 collate_tensor_fn

Renames 8 references across 3 files:
- megatron/training/datasets/data_samplers.py: 2 references
- megatron/training/utils.py: 5 references (TP-broadcast batch shape)
- megatron/training/training.py: 1 reference (HybridCPDataLoaderWrapper
  iter wrap)

All 3 files pass ast.parse; zero remaining `args.hybrid_context_parallel`
references in megatron/. CODEOWNERS and the
pyproject.toml/uv.lock/Dockerfile.ci.dev triple unchanged from origin/dev.
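
A check along the lines of "parses cleanly and has zero remaining references" could be scripted as below. This is a sketch, assuming the references are plain `args.<name>` attribute accesses; the helper name `find_attr_refs` is hypothetical and this is not necessarily how the commit's own verification was done.

```python
import ast


def find_attr_refs(source: str, obj: str, attr: str) -> list[int]:
    """Return line numbers where `obj.attr` appears as an attribute access."""
    tree = ast.parse(source)  # raises SyntaxError if the source does not parse
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Attribute)
        and node.attr == attr
        and isinstance(node.value, ast.Name)
        and node.value.id == obj
    )


sample = (
    "if args.hybrid_context_parallel:\n"
    "    collate = None\n"
    "flag = args.dynamic_context_parallel\n"
    "# args.hybrid_context_parallel mentioned in a comment\n"
)
stale_lines = find_attr_refs(sample, "args", "hybrid_context_parallel")
```

Unlike a text search, this ignores comments and string literals, so it would miss a `getattr(args, "hybrid_context_parallel")` call. A plain grep remains a useful complement.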
@Phlip79

Phlip79 commented May 11, 2026

Member

/ok to test c3dbea7

The Phase 1 merge of #4716 took main's version of training.py per the
skill's "Files to Override from Main" rule, which uses the
HybridCPDataLoaderWrapper class wrapped once outside train_step. That
broke the DCP test (gpt3_mcore_te_tp2_pp1_cp4_dcp) in two ways:

1. RuntimeError: Trying to resize storage that is not resizable -
   fixed by c3dbea7 (rename args.hybrid_context_parallel ->
   args.dynamic_context_parallel).
2. AssertionError: data iterator is not wrapped with RerunDataIterator -
   the outside-train_step wrap converted train_data_iterator from a
   RerunDataIterator to a plain iterator, but rerun_state_machine's
   should_run_forward_backward asserts the wrap.

PR #4659 resolved this by keeping dev's wrap_data_iterator pattern
instead of main's HybridCPDataLoaderWrapper, calling wrap_data_iterator
INSIDE train_step (after should_run_forward_backward) and inside the
eval loop. That keeps the original RerunDataIterator visible to the
assertion and only swaps in the packed iterator for the
forward_backward_func call.

Port that pattern verbatim from PR #4659's training.py:
- Replace HybridCPDataLoaderWrapper import with wrap_data_iterator
- Remove the outside-train_step wrap (was at lines 3000-3001)
- Inside train_step: add the if config.sequence_packing_scheduler is
  not None block before forward_backward_func, unpacking (data_iterator,
  num_microbatches, seqlen_sum_this_global_batch,
  seqlen_squared_sum_this_global_batch); pass num_microbatches=
  num_microbatches to forward_backward_func
- Inside eval loop: add the same wrap with try/except StopIteration,
  using packed_data_iterator and scheduled_eval_num_microbatches

Note: this leaves HybridCPDataLoaderWrapper and its imports (Any, List,
BalancedCPScheduler) as dead code in megatron/core/datasets/data_schedule.py.
Cleanup of that file (and of the remaining structural diff in
training.py / data_samplers.py / utils.py vs PR #4659's tree) is left
to follow-up.
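
The difference between wrapping outside and inside train_step can be shown with a toy model. Everything here is a hypothetical stand-in written for this example (`RerunIterator`, `should_run`, `pack`, and this `train_step`); it mirrors the shape of the problem described above, not the real Megatron classes or signatures.

```python
class RerunIterator:
    """Hypothetical stand-in for a rerun-aware iterator wrapper."""

    def __init__(self, it):
        self._it = iter(it)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._it)


def should_run(data_iterator) -> bool:
    # Models the assertion that the iterator is still the rerun-aware wrapper.
    if not isinstance(data_iterator, RerunIterator):
        raise AssertionError("data iterator is not wrapped")
    return True


def pack(data_iterator):
    """Hypothetical packer: returns a plain iterator over one packed batch."""
    return iter([list(data_iterator)])


def train_step(data_iterator, use_packing: bool):
    should_run(data_iterator)  # sees whatever the caller passed in
    # Swap in the packed iterator only after the check has run.
    step_iter = pack(data_iterator) if use_packing else data_iterator
    return next(step_iter)


# Wrapping inside the step keeps the check happy.
inside = train_step(RerunIterator([1, 2, 3]), use_packing=True)

# Wrapping before the step hands the check a plain iterator.
try:
    train_step(pack(RerunIterator([1, 2, 3])), use_packing=False)
    outside_failed = False
except AssertionError:
    outside_failed = True
```

The design point is ordering: the type check runs against the caller's iterator, so any wrapper that changes the type has to be applied after it.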
@Phlip79

Phlip79 commented May 11, 2026

Member

/ok to test e04e1c2

@svcnvidia-nemo-ci

Author

Superseded by today's nightly sync.

@Phlip79 reopened this on May 11, 2026
@balasaajay merged commit d338cc5 into dev on May 11, 2026
178 of 180 checks passed
@balasaajay deleted the main2dev/10_05_2026 branch on May 11, 2026 23:57
svcnvidia-nemo-ci added a commit that referenced this pull request May 12, 2026
Merges 8 commits from main into dev. Dev already contains yesterday's
sync (PR #4716) plus follow-up fixes, so this PR only carries main
commits made after that sync.

Notable changes:
- 434368c build(deps): bump nvidia-modelopt to 0.43 (#4723)
- e42e2fa ci: Major refactor of release-workflows (#4602)
- 33d47e0 [ci] fix: treat cancelled run-main-script step as failure (#4727)
- 5123f6a ci: revert bad uv.lock bump and label future bumps with
  Run functional tests (#4730)
- ad58411 Add Python-side guardrail for DeepEP IB limits (#4719)
- e93755e chore(beep boop): Bump (main) (2026-05-11)
- a2ec5c1 Revert Add Python-side guardrail for HybridEP IB limit (#4718)
- 5e31514 Create a Protocol for the MLP layer of TransformerLayer (#3435)

Kept dev's pyproject.toml, uv.lock, docker/Dockerfile.ci.dev, and
.github/CODEOWNERS (per nightly-sync skill).

Ran black + isort on changed Python files.
Phlip79 added a commit to Phlip79/Megatron-LM that referenced this pull request May 13, 2026
Adds a deterministic gate step (preservation workflow) and a prompt
instruction (main sync workflow) to ensure the sync bot never modifies
any `golden_values*` file. Reference outputs from successful training
runs cannot be regenerated by the bot; PR NVIDIA#4716 modified ~89,652 lines
across dozens of `golden_values_*.json` files.

- Preservation gate: a pre-audit step fails the workflow if any file
  whose basename starts with `golden_values` differs from origin/dev.
  Reports the offending paths inline as annotations and emits the
  exact `git checkout` one-liner to restore them.
- Sync prompt: instructs the agent to run that one-liner after the
  Phase 1 merge and before every push (Phase 1, Phase 3 fixes, amends).
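
The preservation gate's core logic could look roughly like the following. This is a sketch under stated assumptions: the function names are hypothetical, the list of changed paths is assumed to come from something like `git diff --name-only origin/dev`, and the restore command shown is one plausible form of the one-liner, not the exact text the workflow emits.

```python
import posixpath
import shlex


def golden_value_violations(changed_paths):
    """Return changed paths whose basename starts with `golden_values`."""
    return sorted(
        p for p in changed_paths
        if posixpath.basename(p).startswith("golden_values")
    )


def restore_command(paths, ref="origin/dev"):
    """Build a `git checkout` command that restores `paths` from `ref`."""
    quoted = " ".join(shlex.quote(p) for p in paths)
    return f"git checkout {shlex.quote(ref)} -- {quoted}"


changed = [
    "tests/a/golden_values_dev.json",
    "megatron/training/utils.py",
    "golden_values.json",
    "tests/golden_values_dir/other.py",  # directory name matches, basename does not
]
violations = golden_value_violations(changed)
command = restore_command(violations)
```

Matching on the basename rather than the full path keeps the gate from flagging unrelated files that merely live under a directory with a similar name.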
Phlip79 added a commit to Phlip79/Megatron-LM that referenced this pull request May 14, 2026
…y schedule

Two changes to the main-to-dev nightly sync workflow:

1. Add a "Dev-feature preservation audit" to the existing pre-push
   invariant checks in the nightly-sync skill
   (`.claude/skills/nightly-sync/SKILL.md`). The audit is bash that
   the sync bot must run before every push (Phase 1 and every Phase 3
   fix-push):

       for f in $(git diff --name-only "$BASE"...HEAD ...); do
         missing=$(comm -23 \
           <(git show "origin/dev:$f"  | sort -u) \
           <(git show "origin/main:$f" | sort -u) \
           | comm -23 - <(sort -u "$f") \
           | grep -E '[[:alnum:]_]' )
         [ -n "$missing" ] && record_violation
       done

   Lines present on `origin/dev`, absent from `origin/main`, and absent
   from the merged tree are the textbook "main2dev silently reverted a
   dev-only feature" pattern that bit NVIDIA#4659 and NVIDIA#4716. The audit
   exempts the skill's "Files to Override from Main" list (the bot may
   legitimately keep main's version there) and skips the dependency
   triple plus CODEOWNERS (already checked separately).

   Recent regressions the audit would have caught:
   - `transformer_layer.py` `_forward_mlp_router(input_ids=None)`
   - `token_dispatcher.py` `num_sms_preprocessing_api=...` kwarg
   - `moe_layer.py` `self._maybe_record_overload_factor(...)` call
   - `gpt_dynamic_inference_with_coordinator.py`
     `parse_and_validate_args` import
   - `datasets/readme.md` "Packing Scheduler" section
   - `data_samplers.py` / `utils.py` / `training.py`
     `args.dynamic_context_parallel` references

2. Schedule change in `.github/workflows/nightly-sync-main-to-dev.yml`:
   was `cron: '0 21 * * *'` (daily at 21:00 UTC); now
   `cron: '0 15 * * 1,4'` (Mondays and Thursdays at 15:00 UTC, i.e.
   8 AM PDT / 7 AM PST since GitHub Actions cron is UTC-only).
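
The set logic behind the dev-feature preservation audit in item 1 can also be expressed in Python, which may be easier to unit-test than the bash pipeline. This is a sketch, not the skill's implementation: `dropped_dev_lines` is a hypothetical name, the three file versions are assumed to be supplied as strings (for example from `git show`), and the ASCII character class only approximates `[[:alnum:]_]`.

```python
import re


def dropped_dev_lines(dev_text: str, main_text: str, merged_text: str) -> list[str]:
    """Lines present on dev, absent from main, and absent from the merged file."""
    dev_only = set(dev_text.splitlines()) - set(main_text.splitlines())
    dropped = dev_only - set(merged_text.splitlines())
    # Ignore lines with no identifier-like characters (lone brackets, blanks).
    return sorted(ln for ln in dropped if re.search(r"[A-Za-z0-9_]", ln))


dev = "a = 1\nkeep_me(x)\n)\n"
main = "a = 1\n"

# The merge took main's version, so the dev-only call has vanished.
reverted = dropped_dev_lines(dev, main, merged_text="a = 1\n")

# The merge preserved the dev-only call, so nothing is reported.
preserved = dropped_dev_lines(dev, main, merged_text="a = 1\nkeep_me(x)\n")
```

Because lines are compared as exact strings, a dev-only line that was merely reformatted or re-indented in the merged tree would be reported as dropped. That makes the check conservative, at the cost of some false positives.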
Labels

complexity: high, Run functional tests, Run MBridge tests (Attach this for testing this PR against MBridge main)
