Submission/fastattn mtp dr #1691
Pull request overview
Adds a new /records/track_10min_16mb submission snapshot (“FastAttn_MTP_DepthRec”) that builds on the baseline trainer to incorporate depth recurrence and a multi-token-prediction auxiliary loss, along with launch scripts and record documentation/metadata.
Changes:
- Introduces a modified `train_gpt.py` with depth recurrence (NUM_REPS) and an MTP auxiliary loss (MTP_WEIGHT); a schematic sketch follows this list.
- Adds leaderboard and smoke-test launch scripts for 8xH100 and 1 GPU.
- Adds a record README and `submission.json` metadata for the submission folder.
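For readers new to either technique, here is a schematic of how depth recurrence and an MTP auxiliary loss typically combine in a decoder-only trainer. This is a hedged sketch reusing the NUM_REPS/MTP_WEIGHT names from the bullets above; the helper names, the +2-token MTP offset, and the placeholder blocks are illustrative assumptions, not the record's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def lm_losses(blocks, head, h, targets, num_reps=2, mtp_weight=0.3):
    # Depth recurrence: reuse the same physical blocks NUM_REPS times,
    # so 7 physical layers act like 14 effective layers.
    for _ in range(num_reps):
        for block in blocks:
            h = block(h)
    logits = head(h)  # [batch, seq, vocab]
    # Standard next-token loss: position t predicts token t+1.
    main = F.cross_entropy(logits[:, :-1].flatten(0, 1), targets[:, 1:].flatten())
    # MTP auxiliary loss (one common form): position t also predicts
    # token t+2, mixed in at MTP_WEIGHT.
    mtp = F.cross_entropy(logits[:, :-2].flatten(0, 1), targets[:, 2:].flatten())
    return main + mtp_weight * mtp

# Toy shapes echoing the record (width 576, vocab 1024); Linear stands in
# for real transformer blocks.
blocks = nn.ModuleList(nn.Linear(576, 576) for _ in range(7))
head = nn.Linear(576, 1024)
loss = lm_losses(blocks, head, torch.randn(2, 16, 576),
                 torch.randint(0, 1024, (2, 16)))
```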
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| records/track_10min_16mb/2026-04-17_FastAttn_MTP_DepthRec/train_gpt.py | New training script snapshot with depth recurrence + MTP integrated into the baseline trainer. |
| records/track_10min_16mb/2026-04-17_FastAttn_MTP_DepthRec/submission.json | Record metadata for the submission (currently schema/fields differ from other records). |
| records/track_10min_16mb/2026-04-17_FastAttn_MTP_DepthRec/run_smoke_1gpu.sh | 1-GPU smoke test runner to validate compilation/training. |
| records/track_10min_16mb/2026-04-17_FastAttn_MTP_DepthRec/run_leaderboard_8xh100.sh | 8xH100 leaderboard run launcher with tuned env/config defaults. |
| records/track_10min_16mb/2026-04-17_FastAttn_MTP_DepthRec/README.md | Documentation describing the approach, architecture, and how to reproduce runs. |
```python
self.num_skip_weights = min(self.num_encoder_layers, self.num_decoder_layers)
self.skip_weights = nn.Parameter(torch.ones(self.num_skip_weights, model_dim, dtype=torch.float32))
# Per-extra-rep gate (init 0 so training starts as a pure U-Net, then learns to use recurrence).
self.rep_gates = nn.Parameter(torch.zeros(max(num_reps - 1, 1), model_dim, dtype=torch.float32))
```
`rep_gates` is created even when `num_reps == 1` (via `max(num_reps - 1, 1)`), but it is never used in `forward()` because the recurrence loop is skipped. Under DDP (used whenever running with torchrun), an always-unused parameter typically triggers "Expected to have finished reduction" / unused-parameter errors unless `find_unused_parameters=True`. Consider not registering `rep_gates` when `num_reps == 1` (e.g., `register_parameter(..., None)`), or make it a 0-element parameter and adjust the indexing/optimizer logic accordingly.
Suggested change:

```diff
-self.rep_gates = nn.Parameter(torch.zeros(max(num_reps - 1, 1), model_dim, dtype=torch.float32))
+# Do not register an unused parameter when there is only a single repetition.
+if num_reps > 1:
+    self.rep_gates = nn.Parameter(torch.zeros(num_reps - 1, model_dim, dtype=torch.float32))
+else:
+    self.register_parameter("rep_gates", None)
```
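To make the DDP point concrete, here is a runnable sketch of the suggested pattern together with the matching `forward()` guard. The class name and the trivial `body()` are placeholders, not the record's actual module.

```python
import torch
import torch.nn as nn

class RecurrentStack(nn.Module):
    def __init__(self, model_dim: int, num_reps: int):
        super().__init__()
        self.num_reps = num_reps
        if num_reps > 1:
            # One gate per extra repetition; zero-init so training starts
            # as a pure U-Net and learns to use the recurrence.
            self.rep_gates = nn.Parameter(torch.zeros(num_reps - 1, model_dim))
        else:
            # Keeps the attribute present without handing DDP a parameter
            # that would never receive a gradient.
            self.register_parameter("rep_gates", None)

    def body(self, x: torch.Tensor) -> torch.Tensor:
        return x  # stand-in for the physical transformer layers

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.body(x)
        # Whenever rep_gates exists, every gate row is used on every forward
        # pass, so DDP's gradient reducer always sees it.
        for rep in range(self.num_reps - 1):
            h = h + self.rep_gates[rep] * self.body(h)
        return h
```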
```python
# Default Simple Baseline run:
# - 9 transformer blocks at width 512
# - 8 attention heads with 4 KV heads (GQA) and 2x MLP expansion
# - vocab size 1024, sequence length 1024, tied embeddings
# - 524,288 train tokens per step for 20,000 iterations with a ~10 minute cap
```
The header comment describes the "Default Simple Baseline" as 9 layers / 512 width / 4 KV heads, but this record script's defaults are now `NUM_LAYERS=7`, `MODEL_DIM=576`, `NUM_KV_HEADS=2`, plus `NUM_REPS`/`MTP_WEIGHT`. Updating or removing this block would avoid misleading readers about what the script actually runs by default.
Suggested change:

```diff
-# Default Simple Baseline run:
-# - 9 transformer blocks at width 512
-# - 8 attention heads with 4 KV heads (GQA) and 2x MLP expansion
-# - vocab size 1024, sequence length 1024, tied embeddings
-# - 524,288 train tokens per step for 20,000 iterations with a ~10 minute cap
+# Default run settings for this record are defined by the `Hyperparameters`
+# class below. Keep any high-level summary here aligned with those defaults to
+# avoid stale architecture/training-shape notes as the record evolves.
```
| "track": "track_10min_16mb", | ||
| "name": "FastAttn_MTP_DepthRec", | ||
| "date": "2026-04-17", | ||
| "val_loss": null, | ||
| "val_bpb": null, | ||
| "bytes_total": null, | ||
| "bytes_code": null, | ||
| "bytes_model": null, | ||
| "training_time_seconds": 600, | ||
| "gpus": "8xH100", | ||
| "notes": "Fork of proven baseline. Adds (1) depth recurrence NUM_REPS=2, (2) multi-token prediction MTP_WEIGHT=0.3, (3) width 576 (vs 512). 7 physical layers, 14 effective via recurrence." |
submission.json doesn't match the schema used by the other records/track_10min_16mb/*/submission.json files (they include `author`, `github_id`, `blurb`, an ISO-8601 `date`, and typically concrete `val_*`/`bytes_*` values). This variant with `track`/`name` plus many null fields is likely to break any tooling that expects the standard record metadata format; consider aligning the keys with the standard schema and filling in the measured metrics from the run log.
| "track": "track_10min_16mb", | |
| "name": "FastAttn_MTP_DepthRec", | |
| "date": "2026-04-17", | |
| "val_loss": null, | |
| "val_bpb": null, | |
| "bytes_total": null, | |
| "bytes_code": null, | |
| "bytes_model": null, | |
| "training_time_seconds": 600, | |
| "gpus": "8xH100", | |
| "notes": "Fork of proven baseline. Adds (1) depth recurrence NUM_REPS=2, (2) multi-token prediction MTP_WEIGHT=0.3, (3) width 576 (vs 512). 7 physical layers, 14 effective via recurrence." | |
| "author": "", | |
| "github_id": "", | |
| "date": "2026-04-17", | |
| "blurb": "Fork of proven baseline. Adds (1) depth recurrence NUM_REPS=2, (2) multi-token prediction MTP_WEIGHT=0.3, (3) width 576 (vs 512). 7 physical layers, 14 effective via recurrence.", | |
| "val_loss": null, | |
| "val_bpb": null, | |
| "bytes_total": null, | |
| "bytes_code": null, | |
| "bytes_model": null, | |
| "training_time_seconds": 600, | |
| "gpus": "8xH100" |
> Everything else (Muon, SDPA/FlashAttn, U-Net skips, tied embeddings, int8+zlib
> GPTQ, logit softcap) is inherited verbatim from the baseline.
The README claims the quantization/export method includes "int8+zlib GPTQ", but train_gpt.py implements a simple per-row/per-tensor int8 quantization with saved scales (no GPTQ optimization step). This wording is misleading; consider renaming it to match the actual implementation (e.g., "int8 per-row + zlib"), or documenting GPTQ only if it’s truly used.
Suggested change:

```diff
-Everything else (Muon, SDPA/FlashAttn, U-Net skips, tied embeddings, int8+zlib
-GPTQ, logit softcap) is inherited verbatim from the baseline.
+Everything else (Muon, SDPA/FlashAttn, U-Net skips, tied embeddings, int8
+per-row + zlib, logit softcap) is inherited verbatim from the baseline.
```
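For anyone comparing the two terms: GPTQ quantizes weight columns while solving an error-compensation problem against calibration activations, whereas the export described in this comment is closer to the generic round-to-nearest scheme below (a sketch of per-row int8 + zlib, not the record's exact export code):

```python
import zlib
import torch

def int8_per_row(w: torch.Tensor):
    # Symmetric per-row int8 quantization with saved float scales:
    # no calibration data, no GPTQ-style error feedback.
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 127.0
    q = torch.round(w / scale).clamp_(-127, 127).to(torch.int8)
    return q, scale

w = torch.randn(576, 1152)                 # e.g. an MLP weight at width 576
q, scale = int8_per_row(w)
blob = zlib.compress(q.numpy().tobytes())  # zlib shrinks the int8 payload further
dequant = q.to(torch.float32) * scale      # reconstruction used at load time
print(len(blob), (w - dequant).abs().max().item())
```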
```bash
# one-time
pip install brotli sentencepiece -q
```
The repro instructions suggest `pip install brotli ...`, but this record's scripts appear to require only sentencepiece (plus PyTorch/NumPy); nothing in train_gpt.py imports or uses brotli. Consider dropping it from the install line to avoid confusion about dependencies.
Suggested change:

```diff
-pip install brotli sentencepiece -q
+pip install sentencepiece -q
```
Closing: pivoting to a unique approach |