Non-record: 11L PR315 Backout + Native FA3 RunPod (val_bpb=1.1247)#394
greqone wants to merge 1 commit into openai:main
Conversation
Pull request overview
Adds a new non-record 10-minute / 16MB artifact-cap submission folder under records/track_non_record_16mb, packaging a self-contained train_gpt.py snapshot plus run artifacts for an 8xH100 SXM (RunPod) run using native FlashAttention (FA3) and torch.compile.
Changes:
- Add a self-contained training script (train_gpt.py) with inlined FlashAttention interface logic, Backout residual, and sliding-window evaluation.
- Include exact run artifacts (train.log) and metadata (submission.json) for the reported val_bpb=1.12467423.
- Add reproducibility notes (README.md) and a minimal dependency list (requirements.txt).
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| records/track_non_record_16mb/2026-03-22_11L_PR315_Backout_FA3_RunPod/train_gpt.py | Self-contained training + export + int6 quant + sliding-window eval script for the submission run. |
| records/track_non_record_16mb/2026-03-22_11L_PR315_Backout_FA3_RunPod/train.log | Captured training/eval log for the submitted run. |
| records/track_non_record_16mb/2026-03-22_11L_PR315_Backout_FA3_RunPod/submission.json | Leaderboard-style metadata for the non-record entry. |
| records/track_non_record_16mb/2026-03-22_11L_PR315_Backout_FA3_RunPod/requirements.txt | Dependencies needed to reproduce locally (per repo guidance). |
| records/track_non_record_16mb/2026-03-22_11L_PR315_Backout_FA3_RunPod/README.md | Run description, artifact accounting, and reproduction command. |
```python
    (int(k.split(".")[1]) for k in state_dict if k.startswith("blocks.")),
    default=0,
) + 1
late_k_layers = set(range(num_layers_total - 2, num_layers_total))
```
late_k_layers is computed but never used, which makes the quantization logic harder to follow and suggests a partially removed feature. Consider deleting it or wiring it into the intended “late-K passthrough” behavior so the code matches the stated design.
```python
late_k_layers = set(range(num_layers_total - 2, num_layers_total))
```
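If the intended design is the "late-K passthrough" named in the docstring, one way to wire the unused set into the quantization loop is sketched below. This is a hypothetical illustration, not the submission's code: `quantize_state_dict`, `quantize_fn`, and the `k_proj` key naming are all assumptions.

```python
def quantize_state_dict(state_dict, quantize_fn, num_layers_total):
    # "Late-K passthrough" (hypothetical wiring): K projections in the last
    # two transformer blocks skip int6 quantization and keep full precision.
    late_k_layers = set(range(num_layers_total - 2, num_layers_total))
    out = {}
    for name, tensor in state_dict.items():
        parts = name.split(".")
        is_late_k = (
            len(parts) > 1
            and parts[0] == "blocks"
            and parts[1].isdigit()
            and int(parts[1]) in late_k_layers
            and "k_proj" in name  # assumed key-projection parameter naming
        )
        # Late-K weights pass through unquantized; everything else is quantized.
        out[name] = tensor if is_late_k else quantize_fn(tensor)
    return out
```

With wiring like this, the computed set has a reader-visible purpose and the code matches the stated design; otherwise deleting the line is the simpler fix.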
```python
train_gpt_submit.py — Submission v2: wider MLP + STE int6 QAT + MTP + seq2048 + NTK RoPE +
fp16 embed + late-K passthrough + sliding window eval.
```
The module docstring refers to train_gpt_submit.py and lists features (e.g., “fp16 embed”, “MTP”) that don’t clearly match the actual defaults/implementation in this file. This can be confusing when someone audits the submission; consider updating the docstring to reflect the actual filename and the concrete features enabled in this snapshot (or remove the feature list).
Suggested change:
```diff
-train_gpt_submit.py — Submission v2: wider MLP + STE int6 QAT + MTP + seq2048 + NTK RoPE +
-fp16 embed + late-K passthrough + sliding window eval.
+Training script for GPT models used in parameter-golf submissions.
+This module's behavior is defined by the hyperparameters and options below; refer to
+the code and configuration flags instead of this docstring for an authoritative list
+of enabled features.
```
```python
def _load_system_flash_attn_interface():
    for entry in sys.path:
        if not entry:
            continue
        try:
            resolved = Path(entry).resolve()
        except OSError:
            continue
        candidate = resolved / "flash_attn_interface.py"
        if not candidate.exists() or candidate.resolve() == here:
            continue
        if repo_root in candidate.resolve().parents:
            continue
        spec = importlib.util.spec_from_file_location("_system_flash_attn_interface", candidate)
        if spec is None or spec.loader is None:
            continue
        module = importlib.util.module_from_spec(spec)
        sys.modules[spec.name] = module
        spec.loader.exec_module(module)
        fn = getattr(module, "flash_attn_func", None)
        if callable(fn):
            return fn
    return None
```
_load_system_flash_attn_interface() dynamically locates and executes an arbitrary flash_attn_interface.py from sys.path. This is a code-execution footgun (and can make runs non-reproducible if sys.path differs). Consider removing this path-walk entirely, or gating it behind an explicit env var that points to a known file and validating it’s in an expected location (e.g., site-packages) before importing.
```python
        except OSError:
            continue
        candidate = resolved / "flash_attn_interface.py"
        if not candidate.exists() or candidate.resolve() == here:
```
In _load_system_flash_attn_interface, the check candidate.resolve() == here will never be true because candidate is flash_attn_interface.py while here is train_gpt.py. If the intent is to avoid importing a repo-local helper, consider removing this condition (the subsequent repo_root parent check already covers it) or comparing against the actual helper path.
Suggested change:
```diff
-        if not candidate.exists() or candidate.resolve() == here:
+        if not candidate.exists():
```
Community Review — Non-record: 11L PR315 Backout + Native FA3 RunPod (val_bpb=1.1247)

BPB: 1.1247 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache.

What I found in the code (head SHA): Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=9, vocab=1024, code=72744 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora. Classification via deterministic AST-based classifier.
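The "standard sliding-window stride-64 pattern" the review refers to can be sketched as follows. This is an illustrative model-agnostic version, not the submission's eval code: the `logprob_fn` interface, window size, and bits-per-token return are assumptions.

```python
import math


def sliding_window_bits_per_token(logprob_fn, tokens, window=1024, stride=64):
    """Sliding-window eval (stride pattern): after the first window, each
    window re-reads up to `window - stride` tokens of context but scores only
    tokens not yet scored, approximating full-context loss at bounded cost.

    logprob_fn(context, token) -> log p(token | context); a stand-in for a
    real model forward pass (hypothetical interface for this sketch)."""
    total_nll = 0.0
    scored = 0
    n = len(tokens)
    next_to_score = 1  # token 0 has no context to condition on
    start = 0
    while next_to_score < n:
        end = min(start + window, n)
        for i in range(max(next_to_score, start + 1), end):
            total_nll -= logprob_fn(tokens[start:i], tokens[i])
            scored += 1
        next_to_score = max(next_to_score, end)
        if end == n:
            break
        start += stride
    # Bits per token; converting to bits per byte would further divide by the
    # dataset's bytes-per-token ratio (omitted here).
    return total_nll / scored / math.log(2)
```

The cost control comes from the stride: a full-context eval would run one forward pass per token, while this pattern amortizes one window pass over `stride` scored tokens.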
Summary
- 8xH100 SXM PR315-style run plus Backout
- train_gpt.py, requirements.txt, submission.json, and README

Result

- val_bpb = 1.12467423
- 1.89896029
- 15,545,662
- 8xH100 SXM on RunPod with native Hopper FlashAttention and torch.compile

Notes

- flash_attn_interface.py; for this submission folder that helper is inlined into train_gpt.py so the package is self-contained and closer to the repo guidance that counted code should live in train_gpt.py
- records/track_non_record_16mb/...