
Submission/fastattn mtp dr #1691

Closed

AVINASH0052 wants to merge 5 commits into openai:main from AVINASH0052:submission/fastattn-mtp-dr

Conversation

@AVINASH0052

No description provided.

@AVINASH0052 AVINASH0052 marked this pull request as ready for review April 17, 2026 07:06
Copilot AI review requested due to automatic review settings April 17, 2026 07:06
Copilot AI (Contributor) left a comment

Pull request overview

Adds a new /records/track_10min_16mb submission snapshot (“FastAttn_MTP_DepthRec”) that builds on the baseline trainer to incorporate depth recurrence and a multi-token-prediction auxiliary loss, along with launch scripts and record documentation/metadata.

Changes:

  • Introduces a modified train_gpt.py with depth recurrence (NUM_REPS) and an MTP auxiliary loss (MTP_WEIGHT); a sketch of the MTP term follows this list.
  • Adds leaderboard + smoke-test launch scripts for 8xH100 and 1 GPU.
  • Adds record README and submission.json metadata for the submission folder.
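
For readers unfamiliar with multi-token prediction, the auxiliary loss referenced above is typically just a second cross-entropy on targets further ahead, blended into the main next-token loss. A minimal sketch, assuming a shared trunk emits a second set of logits and using the record's MTP_WEIGHT=0.3 default; the function and tensor names here are illustrative, not the script's actual code:

```python
import torch
import torch.nn.functional as F

def combined_loss(logits: torch.Tensor, logits_mtp: torch.Tensor,
                  tokens: torch.Tensor, mtp_weight: float = 0.3) -> torch.Tensor:
    """Next-token cross-entropy plus a predict-two-ahead auxiliary term.

    logits, logits_mtp: (B, T, V) predictions over the vocabulary; tokens: (B, T).
    """
    vocab = logits.size(-1)
    # Main objective: position t predicts token t+1.
    main = F.cross_entropy(logits[:, :-1].reshape(-1, vocab), tokens[:, 1:].reshape(-1))
    # Auxiliary MTP objective: position t predicts token t+2.
    aux = F.cross_entropy(logits_mtp[:, :-2].reshape(-1, vocab), tokens[:, 2:].reshape(-1))
    return main + mtp_weight * aux
```

At mtp_weight=0 this reduces to the baseline objective, which makes the weight a clean ablation knob.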

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| records/track_10min_16mb/2026-04-17_FastAttn_MTP_DepthRec/train_gpt.py | New training script snapshot with depth recurrence + MTP integrated into the baseline trainer. |
| records/track_10min_16mb/2026-04-17_FastAttn_MTP_DepthRec/submission.json | Record metadata for the submission (currently schema/fields differ from other records). |
| records/track_10min_16mb/2026-04-17_FastAttn_MTP_DepthRec/run_smoke_1gpu.sh | 1-GPU smoke test runner to validate compilation/training. |
| records/track_10min_16mb/2026-04-17_FastAttn_MTP_DepthRec/run_leaderboard_8xh100.sh | 8xH100 leaderboard run launcher with tuned env/config defaults. |
| records/track_10min_16mb/2026-04-17_FastAttn_MTP_DepthRec/README.md | Documentation describing the approach, architecture, and how to reproduce runs. |


```python
self.num_skip_weights = min(self.num_encoder_layers, self.num_decoder_layers)
self.skip_weights = nn.Parameter(torch.ones(self.num_skip_weights, model_dim, dtype=torch.float32))
# Per-extra-rep gate (init 0 so training starts as a pure U-Net, then learns to use recurrence).
self.rep_gates = nn.Parameter(torch.zeros(max(num_reps - 1, 1), model_dim, dtype=torch.float32))
```

Copilot AI Apr 17, 2026


rep_gates is created even when num_reps==1 (max(num_reps - 1, 1)), but it is never used in forward() because the recurrence loop is skipped. With DDP (used whenever running under torchrun), an always-unused parameter typically triggers "Expected to have finished reduction" / unused-parameter errors unless find_unused_parameters=True. Consider not registering rep_gates when num_reps==1 (e.g., register_parameter(..., None)), or make it a 0-element parameter and adjust the indexing/optimizer logic accordingly.

Suggested change:

```diff
-self.rep_gates = nn.Parameter(torch.zeros(max(num_reps - 1, 1), model_dim, dtype=torch.float32))
+# Do not register an unused parameter when there is only a single repetition.
+if num_reps > 1:
+    self.rep_gates = nn.Parameter(torch.zeros(num_reps - 1, model_dim, dtype=torch.float32))
+else:
+    self.register_parameter("rep_gates", None)
```
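
For context, here is a minimal, self-contained sketch of the conditional-registration pattern the suggestion describes; the class name and forward wiring are hypothetical, not the record's actual module:

```python
import torch
import torch.nn as nn

class DepthRecurrentBlock(nn.Module):
    """Hypothetical module illustrating conditional rep_gates registration."""

    def __init__(self, model_dim: int, num_reps: int):
        super().__init__()
        self.num_reps = num_reps
        self.core = nn.Linear(model_dim, model_dim)  # stand-in for the real block stack
        if num_reps > 1:
            # One zero-initialized gate per extra repetition, so training starts
            # as if there were no recurrence and learns to use it gradually.
            self.rep_gates = nn.Parameter(torch.zeros(num_reps - 1, model_dim))
        else:
            # Nothing registered -> no always-unused parameter for DDP to reduce.
            self.register_parameter("rep_gates", None)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.core(x)
        # When num_reps == 1 this loop never runs, and rep_gates is None anyway.
        for i in range(self.num_reps - 1):
            out = out + self.rep_gates[i] * self.core(out)
        return out
```

Because the parameter simply does not exist in the single-repetition case, every registered parameter participates in every backward pass and DDP's default reduction logic stays happy without find_unused_parameters=True.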

Copilot uses AI. Check for mistakes.
Comment on lines +33 to +37
```python
# Default Simple Baseline run:
# - 9 transformer blocks at width 512
# - 8 attention heads with 4 KV heads (GQA) and 2x MLP expansion
# - vocab size 1024, sequence length 1024, tied embeddings
# - 524,288 train tokens per step for 20,000 iterations with a ~10 minute cap
```

Copilot AI Apr 17, 2026


The header comment describes the "Default Simple Baseline" as 9 layers / 512 width / KV heads=4, but this record script’s defaults are now NUM_LAYERS=7, MODEL_DIM=576, NUM_KV_HEADS=2, plus NUM_REPS/MTP_WEIGHT. Updating/removing this block would avoid misleading readers about what this script actually runs by default.

Suggested change:

```diff
-# Default Simple Baseline run:
-# - 9 transformer blocks at width 512
-# - 8 attention heads with 4 KV heads (GQA) and 2x MLP expansion
-# - vocab size 1024, sequence length 1024, tied embeddings
-# - 524,288 train tokens per step for 20,000 iterations with a ~10 minute cap
+# Default run settings for this record are defined by the `Hyperparameters`
+# class below. Keep any high-level summary here aligned with those defaults to
+# avoid stale architecture/training-shape notes as the record evolves.
```

Comment on lines +2 to +12
"track": "track_10min_16mb",
"name": "FastAttn_MTP_DepthRec",
"date": "2026-04-17",
"val_loss": null,
"val_bpb": null,
"bytes_total": null,
"bytes_code": null,
"bytes_model": null,
"training_time_seconds": 600,
"gpus": "8xH100",
"notes": "Fork of proven baseline. Adds (1) depth recurrence NUM_REPS=2, (2) multi-token prediction MTP_WEIGHT=0.3, (3) width 576 (vs 512). 7 physical layers, 14 effective via recurrence."

Copilot AI Apr 17, 2026


submission.json doesn’t match the schema used by other records/track_10min_16mb/*/submission.json files (they include author, github_id, blurb, ISO-8601 date, and typically concrete val_* / bytes_* values). This variant with track/name plus many null fields is likely to break any tooling that expects the standard record metadata format; consider aligning the keys/date format and filling in the measured metrics from the run log.

Suggested change:

```diff
-"track": "track_10min_16mb",
-"name": "FastAttn_MTP_DepthRec",
-"date": "2026-04-17",
-"val_loss": null,
-"val_bpb": null,
-"bytes_total": null,
-"bytes_code": null,
-"bytes_model": null,
-"training_time_seconds": 600,
-"gpus": "8xH100",
-"notes": "Fork of proven baseline. Adds (1) depth recurrence NUM_REPS=2, (2) multi-token prediction MTP_WEIGHT=0.3, (3) width 576 (vs 512). 7 physical layers, 14 effective via recurrence."
+"author": "",
+"github_id": "",
+"date": "2026-04-17",
+"blurb": "Fork of proven baseline. Adds (1) depth recurrence NUM_REPS=2, (2) multi-token prediction MTP_WEIGHT=0.3, (3) width 576 (vs 512). 7 physical layers, 14 effective via recurrence.",
+"val_loss": null,
+"val_bpb": null,
+"bytes_total": null,
+"bytes_code": null,
+"bytes_model": null,
+"training_time_seconds": 600,
+"gpus": "8xH100"
```

Comment on lines +22 to +23
```
Everything else (Muon, SDPA/FlashAttn, U-Net skips, tied embeddings, int8+zlib
GPTQ, logit softcap) is inherited verbatim from the baseline.
```

Copilot AI Apr 17, 2026


The README claims the quantization/export method includes "int8+zlib GPTQ", but train_gpt.py implements a simple per-row/per-tensor int8 quantization with saved scales (no GPTQ optimization step). This wording is misleading; consider renaming it to match the actual implementation (e.g., "int8 per-row + zlib"), or documenting GPTQ only if it’s truly used.

Suggested change:

```diff
-Everything else (Muon, SDPA/FlashAttn, U-Net skips, tied embeddings, int8+zlib
-GPTQ, logit softcap) is inherited verbatim from the baseline.
+Everything else (Muon, SDPA/FlashAttn, U-Net skips, tied embeddings, int8
+per-row + zlib, logit softcap) is inherited verbatim from the baseline.
```
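
For reference, per-row int8 quantization with saved scales plus zlib, as the comment describes it, typically looks something like this minimal sketch; it illustrates the scheme, not the submission's actual export code:

```python
import zlib
import numpy as np
import torch

def quantize_int8_per_row(weight: torch.Tensor) -> bytes:
    """Per-row symmetric int8 quantization + zlib (no GPTQ error-minimization pass)."""
    w = weight.detach().float()
    # One scale per row so each row maps onto the full [-127, 127] int8 range.
    scales = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.round(w / scales).to(torch.int8)
    # Serialize scales followed by quantized values, then compress.
    payload = scales.squeeze(1).numpy().astype(np.float32).tobytes() + q.numpy().tobytes()
    return zlib.compress(payload, level=9)
```

The distinguishing feature of GPTQ, by contrast, is an iterative pass that adjusts remaining weights to compensate for each column's quantization error, which the simple scheme above does not do.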


```bash
# one-time
pip install brotli sentencepiece -q
```

Copilot AI Apr 17, 2026


Repro instructions suggest pip install brotli ..., but this record’s scripts appear to only require sentencepiece (and PyTorch/Numpy); nothing in train_gpt.py imports or uses brotli. Consider dropping it from the install line to avoid confusion about dependencies.

Suggested change:

```diff
-pip install brotli sentencepiece -q
+pip install sentencepiece -q
```

@AVINASH0052 (Author)

Closing: pivoting to a unique approach
