Submission/fastattn mtp dr #1691
Pull request overview
Adds a new /records/track_10min_16mb submission snapshot (“FastAttn_MTP_DepthRec”) that builds on the baseline trainer to incorporate depth recurrence and a multi-token-prediction auxiliary loss, along with launch scripts and record documentation/metadata.
Changes:
- Introduces a modified `train_gpt.py` with depth recurrence (NUM_REPS) and an MTP auxiliary loss (MTP_WEIGHT); a schematic sketch follows this list.
- Adds leaderboard and smoke-test launch scripts for 8xH100 and 1 GPU.
- Adds a record README and `submission.json` metadata for the submission folder.
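For readers new to either technique, here is a schematic of how depth recurrence and an MTP auxiliary loss typically combine in a decoder-only trainer. This is a hedged sketch reusing the NUM_REPS/MTP_WEIGHT names from the bullets above; the helper names, the +2-token MTP offset, and the placeholder blocks are illustrative assumptions, not the record's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def lm_losses(blocks, head, h, targets, num_reps=2, mtp_weight=0.3):
    # Depth recurrence: reuse the same physical blocks NUM_REPS times,
    # so 7 physical layers act like 14 effective layers.
    for _ in range(num_reps):
        for block in blocks:
            h = block(h)
    logits = head(h)  # [batch, seq, vocab]
    # Standard next-token loss: position t predicts token t+1.
    main = F.cross_entropy(logits[:, :-1].flatten(0, 1), targets[:, 1:].flatten())
    # MTP auxiliary loss (one common form): position t also predicts
    # token t+2, mixed in at MTP_WEIGHT.
    mtp = F.cross_entropy(logits[:, :-2].flatten(0, 1), targets[:, 2:].flatten())
    return main + mtp_weight * mtp

# Toy shapes echoing the record (width 576, vocab 1024); Linear stands in
# for real transformer blocks.
blocks = nn.ModuleList(nn.Linear(576, 576) for _ in range(7))
head = nn.Linear(576, 1024)
loss = lm_losses(blocks, head, torch.randn(2, 16, 576),
                 torch.randint(0, 1024, (2, 16)))
```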
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| records/track_10min_16mb/2026-04-17_FastAttn_MTP_DepthRec/train_gpt.py | New training script snapshot with depth recurrence + MTP integrated into the baseline trainer. |
| records/track_10min_16mb/2026-04-17_FastAttn_MTP_DepthRec/submission.json | Record metadata for the submission (currently schema/fields differ from other records). |
| records/track_10min_16mb/2026-04-17_FastAttn_MTP_DepthRec/run_smoke_1gpu.sh | 1-GPU smoke test runner to validate compilation/training. |
| records/track_10min_16mb/2026-04-17_FastAttn_MTP_DepthRec/run_leaderboard_8xh100.sh | 8xH100 leaderboard run launcher with tuned env/config defaults. |
| records/track_10min_16mb/2026-04-17_FastAttn_MTP_DepthRec/README.md | Documentation describing the approach, architecture, and how to reproduce runs. |
```python
self.num_skip_weights = min(self.num_encoder_layers, self.num_decoder_layers)
self.skip_weights = nn.Parameter(torch.ones(self.num_skip_weights, model_dim, dtype=torch.float32))
# Per-extra-rep gate (init 0 so training starts as a pure U-Net, then learns to use recurrence).
self.rep_gates = nn.Parameter(torch.zeros(max(num_reps - 1, 1), model_dim, dtype=torch.float32))
```
`rep_gates` is created even when `num_reps == 1` (via `max(num_reps - 1, 1)`), but it is never used in `forward()` because the recurrence loop is skipped. Under DDP (used whenever running with torchrun), an always-unused parameter typically triggers "Expected to have finished reduction" / unused-parameter errors unless `find_unused_parameters=True`. Consider not registering `rep_gates` when `num_reps == 1` (e.g., `register_parameter(..., None)`), or make it a 0-element parameter and adjust the indexing/optimizer logic accordingly.
Suggested change:

```diff
-self.rep_gates = nn.Parameter(torch.zeros(max(num_reps - 1, 1), model_dim, dtype=torch.float32))
+# Do not register an unused parameter when there is only a single repetition.
+if num_reps > 1:
+    self.rep_gates = nn.Parameter(torch.zeros(num_reps - 1, model_dim, dtype=torch.float32))
+else:
+    self.register_parameter("rep_gates", None)
```
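To make the DDP point concrete, here is a runnable sketch of the suggested pattern together with the matching `forward()` guard. The class name and the trivial `body()` are placeholders, not the record's actual module.

```python
import torch
import torch.nn as nn

class RecurrentStack(nn.Module):
    def __init__(self, model_dim: int, num_reps: int):
        super().__init__()
        self.num_reps = num_reps
        if num_reps > 1:
            # One gate per extra repetition; zero-init so training starts
            # as a pure U-Net and learns to use the recurrence.
            self.rep_gates = nn.Parameter(torch.zeros(num_reps - 1, model_dim))
        else:
            # Keeps the attribute present without handing DDP a parameter
            # that would never receive a gradient.
            self.register_parameter("rep_gates", None)

    def body(self, x: torch.Tensor) -> torch.Tensor:
        return x  # stand-in for the physical transformer layers

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.body(x)
        # Whenever rep_gates exists, every gate row is used on every forward
        # pass, so DDP's gradient reducer always sees it.
        for rep in range(self.num_reps - 1):
            h = h + self.rep_gates[rep] * self.body(h)
        return h
```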
```python
# Default Simple Baseline run:
# - 9 transformer blocks at width 512
# - 8 attention heads with 4 KV heads (GQA) and 2x MLP expansion
# - vocab size 1024, sequence length 1024, tied embeddings
# - 524,288 train tokens per step for 20,000 iterations with a ~10 minute cap
```
The header comment describes the "Default Simple Baseline" as 9 layers / 512 width / 4 KV heads, but this record script's defaults are now `NUM_LAYERS=7`, `MODEL_DIM=576`, `NUM_KV_HEADS=2`, plus `NUM_REPS`/`MTP_WEIGHT`. Updating or removing this block would avoid misleading readers about what the script actually runs by default.
Suggested change:

```diff
-# Default Simple Baseline run:
-# - 9 transformer blocks at width 512
-# - 8 attention heads with 4 KV heads (GQA) and 2x MLP expansion
-# - vocab size 1024, sequence length 1024, tied embeddings
-# - 524,288 train tokens per step for 20,000 iterations with a ~10 minute cap
+# Default run settings for this record are defined by the `Hyperparameters`
+# class below. Keep any high-level summary here aligned with those defaults to
+# avoid stale architecture/training-shape notes as the record evolves.
```
| "track": "track_10min_16mb", | ||
| "name": "FastAttn_MTP_DepthRec", | ||
| "date": "2026-04-17", | ||
| "val_loss": null, | ||
| "val_bpb": null, | ||
| "bytes_total": null, | ||
| "bytes_code": null, | ||
| "bytes_model": null, | ||
| "training_time_seconds": 600, | ||
| "gpus": "8xH100", | ||
| "notes": "Fork of proven baseline. Adds (1) depth recurrence NUM_REPS=2, (2) multi-token prediction MTP_WEIGHT=0.3, (3) width 576 (vs 512). 7 physical layers, 14 effective via recurrence." |
submission.json doesn't match the schema used by the other records/track_10min_16mb/*/submission.json files (they include `author`, `github_id`, `blurb`, an ISO-8601 `date`, and typically concrete `val_*`/`bytes_*` values). This variant with `track`/`name` plus many null fields is likely to break any tooling that expects the standard record metadata format; consider aligning the keys with the standard schema and filling in the measured metrics from the run log.
| "track": "track_10min_16mb", | |
| "name": "FastAttn_MTP_DepthRec", | |
| "date": "2026-04-17", | |
| "val_loss": null, | |
| "val_bpb": null, | |
| "bytes_total": null, | |
| "bytes_code": null, | |
| "bytes_model": null, | |
| "training_time_seconds": 600, | |
| "gpus": "8xH100", | |
| "notes": "Fork of proven baseline. Adds (1) depth recurrence NUM_REPS=2, (2) multi-token prediction MTP_WEIGHT=0.3, (3) width 576 (vs 512). 7 physical layers, 14 effective via recurrence." | |
| "author": "", | |
| "github_id": "", | |
| "date": "2026-04-17", | |
| "blurb": "Fork of proven baseline. Adds (1) depth recurrence NUM_REPS=2, (2) multi-token prediction MTP_WEIGHT=0.3, (3) width 576 (vs 512). 7 physical layers, 14 effective via recurrence.", | |
| "val_loss": null, | |
| "val_bpb": null, | |
| "bytes_total": null, | |
| "bytes_code": null, | |
| "bytes_model": null, | |
| "training_time_seconds": 600, | |
| "gpus": "8xH100" |
> Everything else (Muon, SDPA/FlashAttn, U-Net skips, tied embeddings, int8+zlib
> GPTQ, logit softcap) is inherited verbatim from the baseline.
The README claims the quantization/export method includes "int8+zlib GPTQ", but train_gpt.py implements a simple per-row/per-tensor int8 quantization with saved scales (no GPTQ optimization step). This wording is misleading; consider renaming it to match the actual implementation (e.g., "int8 per-row + zlib"), or documenting GPTQ only if it’s truly used.
Suggested change:

```diff
-Everything else (Muon, SDPA/FlashAttn, U-Net skips, tied embeddings, int8+zlib
-GPTQ, logit softcap) is inherited verbatim from the baseline.
+Everything else (Muon, SDPA/FlashAttn, U-Net skips, tied embeddings, int8
+per-row + zlib, logit softcap) is inherited verbatim from the baseline.
```
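For anyone comparing the two terms: GPTQ quantizes weight columns while solving an error-compensation problem against calibration activations, whereas the export described in this comment is closer to the generic round-to-nearest scheme below (a sketch of per-row int8 + zlib, not the record's exact export code):

```python
import zlib
import torch

def int8_per_row(w: torch.Tensor):
    # Symmetric per-row int8 quantization with saved float scales:
    # no calibration data, no GPTQ-style error feedback.
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 127.0
    q = torch.round(w / scale).clamp_(-127, 127).to(torch.int8)
    return q, scale

w = torch.randn(576, 1152)                 # e.g. an MLP weight at width 576
q, scale = int8_per_row(w)
blob = zlib.compress(q.numpy().tobytes())  # zlib shrinks the int8 payload further
dequant = q.to(torch.float32) * scale      # reconstruction used at load time
print(len(blob), (w - dequant).abs().max().item())
```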
```bash
# one-time
pip install brotli sentencepiece -q
```
The repro instructions suggest `pip install brotli ...`, but this record's scripts appear to require only sentencepiece (plus PyTorch/NumPy); nothing in train_gpt.py imports or uses brotli. Consider dropping it from the install line to avoid confusion about dependencies.
Suggested change:

```diff
-pip install brotli sentencepiece -q
+pip install sentencepiece -q
```
Closing: pivoting to a unique approach |