
Non-record: 11L PR315 Backout + Native FA3 RunPod (val_bpb=1.1247)#394

Open
greqone wants to merge 1 commit into openai:main from greqone:codex/pr315-backout-fa3-nonrecord

Conversation


@greqone commented Mar 22, 2026

Summary

  • add a non-record 10-minute-track submission folder for a faithful RunPod 8xH100 SXM PR315-style run plus Backout
  • include the exact training log, self-contained train_gpt.py, requirements.txt, submission.json, and README
  • package this as a non-record entry because the current live public frontier is already slightly below this score and this submission does not include a significance set for a new record claim

Result

  • exact sliding-window metric: val_bpb = 1.12467423
  • exact sliding-window loss: 1.89896029
  • total artifact bytes in this packaged folder: 15,545,662
  • hardware: 8xH100 SXM on RunPod with native Hopper FlashAttention and torch.compile
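As a sanity check, the two reported numbers are linked by bpb = (loss / ln 2) × (tokens / bytes), so together they imply the tokenizer's compression ratio on the eval set. The ratio below is derived from the reported figures, not stated anywhere in the PR:

```python
import math

# Reported figures from the Result section above.
val_loss = 1.89896029   # mean cross-entropy in nats per token
val_bpb = 1.12467423    # bits per byte under the sliding-window eval

# bpb = (loss in bits per token) * (tokens per byte), so the pair of
# numbers jointly implies the token-to-byte ratio of the eval data.
bits_per_token = val_loss / math.log(2)
tokens_per_byte = val_bpb / bits_per_token
print(f"{bits_per_token:.4f} bits/token, {tokens_per_byte:.4f} tokens/byte")
```

That works out to roughly 0.41 tokens per byte (about 2.4 bytes per token), which is a plausible ratio but is an inference, not a documented property of the run.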

Notes

  • the original experiment used a sibling flash_attn_interface.py; for this submission folder that helper is inlined into train_gpt.py so the package is self-contained and closer to the repo guidance that counted code should live in train_gpt.py
  • this is intentionally filed under records/track_non_record_16mb/...

Copilot AI review requested due to automatic review settings March 22, 2026 03:45

Copilot AI left a comment


Pull request overview

Adds a new non-record 10-minute / 16MB artifact-cap submission folder under records/track_non_record_16mb, packaging a self-contained train_gpt.py snapshot plus run artifacts for an 8xH100 SXM (RunPod) run using native FlashAttention (FA3) and torch.compile.

Changes:

  • Add a self-contained training script (train_gpt.py) with inlined FlashAttention interface logic, Backout residual, and sliding-window evaluation.
  • Include exact run artifacts (train.log) and metadata (submission.json) for the reported val_bpb=1.12467423.
  • Add reproducibility notes (README.md) and a minimal dependency list (requirements.txt).

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 4 comments.

File — Description
  • records/track_non_record_16mb/2026-03-22_11L_PR315_Backout_FA3_RunPod/train_gpt.py — Self-contained training + export + int6 quant + sliding-window eval script for the submission run.
  • records/track_non_record_16mb/2026-03-22_11L_PR315_Backout_FA3_RunPod/train.log — Captured training/eval log for the submitted run.
  • records/track_non_record_16mb/2026-03-22_11L_PR315_Backout_FA3_RunPod/submission.json — Leaderboard-style metadata for the non-record entry.
  • records/track_non_record_16mb/2026-03-22_11L_PR315_Backout_FA3_RunPod/requirements.txt — Dependencies needed to reproduce locally (per repo guidance).
  • records/track_non_record_16mb/2026-03-22_11L_PR315_Backout_FA3_RunPod/README.md — Run description, artifact accounting, and reproduction command.
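For readers unfamiliar with the int6 quantization step named in the file summary, here is a generic stdlib-only sketch of symmetric int6 round-trip quantization. It illustrates the idea only and is not the PR's implementation (per-channel scaling, tensor layout, and the QAT straight-through estimator are all omitted):

```python
def quantize_int6(weights):
    """Symmetric per-tensor int6 quantization: map floats to ints in [-31, 31]."""
    scale = max(abs(w) for w in weights) / 31.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale round-trips exactly
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.8, -0.31, 0.002, -0.77]
q, s = quantize_int6(w)
w_hat = dequantize(q, s)
# Round-trip error is bounded by half a quantization step (scale / 2).
assert all(abs(a - b) <= s / 2 + 1e-12 for a, b in zip(w, w_hat))
```

Six bits give 63 usable symmetric levels, so the worst-case per-weight error is max|w| / 62; QAT with a straight-through estimator trains the network to tolerate exactly that rounding.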


num_layers_total = max(
    (int(k.split(".")[1]) for k in state_dict if k.startswith("blocks.")),
    default=0,
) + 1
late_k_layers = set(range(num_layers_total - 2, num_layers_total))

Copilot AI Mar 22, 2026


late_k_layers is computed but never used, which makes the quantization logic harder to follow and suggests a partially removed feature. Consider deleting it or wiring it into the intended “late-K passthrough” behavior so the code matches the stated design.

Suggested change (delete the unused line)
- late_k_layers = set(range(num_layers_total - 2, num_layers_total))
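Alternatively, if the set was meant to drive the docstring's "late-K passthrough", the wiring might look like the sketch below. The `blocks.<i>.attn.k...` parameter-name pattern is an assumption for illustration, not taken from the PR:

```python
def should_quantize(param_name, num_layers_total):
    """Return False for K-projection weights in the last two layers
    ("late-K passthrough"); the parameter naming scheme is hypothetical."""
    late_k_layers = set(range(num_layers_total - 2, num_layers_total))
    if param_name.startswith("blocks."):
        layer = int(param_name.split(".")[1])
        if layer in late_k_layers and ".attn.k" in param_name:
            return False  # keep late-layer K projections unquantized
    return True
```

Gating the quantization loop on a predicate like this would make the computed set load-bearing and resolve the reviewer's dead-code concern.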

Comment on lines +2 to +3
train_gpt_submit.py — Submission v2: wider MLP + STE int6 QAT + MTP + seq2048 + NTK RoPE +
fp16 embed + late-K passthrough + sliding window eval.

Copilot AI Mar 22, 2026


The module docstring refers to train_gpt_submit.py and lists features (e.g., “fp16 embed”, “MTP”) that don’t clearly match the actual defaults/implementation in this file. This can be confusing when someone audits the submission; consider updating the docstring to reflect the actual filename and the concrete features enabled in this snapshot (or remove the feature list).

Suggested change
- train_gpt_submit.py — Submission v2: wider MLP + STE int6 QAT + MTP + seq2048 + NTK RoPE +
- fp16 embed + late-K passthrough + sliding window eval.
+ Training script for GPT models used in parameter-golf submissions.
+ This module's behavior is defined by the hyperparameters and options below; refer to
+ the code and configuration flags instead of this docstring for an authoritative list
+ of enabled features.

Comment on lines +41 to +63
def _load_system_flash_attn_interface():
    for entry in sys.path:
        if not entry:
            continue
        try:
            resolved = Path(entry).resolve()
        except OSError:
            continue
        candidate = resolved / "flash_attn_interface.py"
        if not candidate.exists() or candidate.resolve() == here:
            continue
        if repo_root in candidate.resolve().parents:
            continue
        spec = importlib.util.spec_from_file_location("_system_flash_attn_interface", candidate)
        if spec is None or spec.loader is None:
            continue
        module = importlib.util.module_from_spec(spec)
        sys.modules[spec.name] = module
        spec.loader.exec_module(module)
        fn = getattr(module, "flash_attn_func", None)
        if callable(fn):
            return fn
    return None

Copilot AI Mar 22, 2026


_load_system_flash_attn_interface() dynamically locates and executes an arbitrary flash_attn_interface.py from sys.path. This is a code-execution footgun (and can make runs non-reproducible if sys.path differs). Consider removing this path-walk entirely, or gating it behind an explicit env var that points to a known file and validating it’s in an expected location (e.g., site-packages) before importing.
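One way to implement the reviewer's suggestion is to gate the import behind an explicit environment variable and validate the target before executing it. `FLASH_ATTN_INTERFACE` below is a hypothetical variable name chosen for illustration, not part of the PR:

```python
import importlib.util
import os
from pathlib import Path


def load_flash_attn_func():
    """Import flash_attn_func only from an explicitly named, validated file.

    FLASH_ATTN_INTERFACE (hypothetical) must point at an actual
    flash_attn_interface.py; there is no sys.path walking and no
    silent fallback, so runs stay reproducible.
    """
    path = os.environ.get("FLASH_ATTN_INTERFACE")
    if not path:
        return None  # feature disabled unless explicitly requested
    candidate = Path(path).resolve()
    if candidate.name != "flash_attn_interface.py" or not candidate.is_file():
        raise FileNotFoundError(f"FLASH_ATTN_INTERFACE is not a valid helper file: {candidate}")
    spec = importlib.util.spec_from_file_location("_flash_attn_interface", candidate)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    fn = getattr(module, "flash_attn_func", None)
    return fn if callable(fn) else None
```

Failing loudly on a bad path, instead of continuing the search, also makes misconfiguration visible in the training log rather than silently changing the attention kernel.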

        except OSError:
            continue
        candidate = resolved / "flash_attn_interface.py"
        if not candidate.exists() or candidate.resolve() == here:

Copilot AI Mar 22, 2026


In _load_system_flash_attn_interface, the check candidate.resolve() == here will never be true because candidate is flash_attn_interface.py while here is train_gpt.py. If the intent is to avoid importing a repo-local helper, consider removing this condition (the subsequent repo_root parent check already covers it) or comparing against the actual helper path.

Suggested change
- if not candidate.exists() or candidate.resolve() == here:
+ if not candidate.exists():

@MatoTeziTanka

Community Review — Non-record: 11L PR315 Backout + Native FA3 RunPod (val_bpb=1.1247)

BPB: 1.1247 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA 6b4acf9082ee, file records/track_non_record_16mb/2026-03-22_11L_PR315_Backout_FA3_RunPod/train_gpt.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.
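The stride-64 sliding-window pattern mentioned above can be summarized by its scheduling: each chunk of `stride` target tokens is scored using up to `window` tokens of left context, so every position is evaluated exactly once. A model-free sketch of that schedule (the window size here is an assumption, not read from the submission):

```python
def sliding_window_schedule(n_tokens, window=2048, stride=64):
    """Return (ctx_start, score_start, score_end) triples: each stride-sized
    chunk of targets is predicted using at most `window` tokens of context."""
    chunks = []
    for score_end in range(stride, n_tokens + 1, stride):
        score_start = score_end - stride
        ctx_start = max(0, score_end - window)
        chunks.append((ctx_start, score_start, score_end))
    return chunks

# Every target position is scored once, with a full window of context
# as soon as enough history exists.
sched = sliding_window_schedule(256, window=128, stride=64)
```

Averaging the negative log-likelihood over the scored positions and converting nats to bits per byte then gives the reported val_bpb-style metric.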

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=9, vocab=1024, code=72744 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via the deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

