Record: 0.9076 BPB — 10L + N-gram Backoff + Matrix LR 0.03 #828
bigbag wants to merge 2 commits into openai:main from
Conversation
Single change from PR openai#802: MATRIX_LR=0.03 (was 0.02). Discovered through systematic screening (74 experiments, steps 10-12).
- 10L, 512d, GQA 8/4, LeakyReLU(0.5)², BigramHash 4096
- Multi-order n-gram backoff eval cache (orders 2-7)
- Entropy-adaptive alpha mixing (score-first, legal)
- 8xH100 SXM, 600s training, 138s eval
- Artifact: 15.32 MB

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
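The "entropy-adaptive alpha mixing" is only described at a high level here; a minimal sketch of one plausible reading is below. All function names and the `alpha_max` parameter are assumptions, not the submission's actual code — the idea is just that a flat (uncertain) model distribution gets more n-gram mass than a peaked one.

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def adaptive_alpha(model_probs, alpha_max=0.4):
    """Scale the n-gram mixing weight by the model's uncertainty:
    entropy is normalized by the uniform-distribution maximum, so a
    confident model keeps alpha near 0 and an uncertain one near alpha_max."""
    max_entropy = math.log(len(model_probs))
    h = entropy(model_probs) / max_entropy  # normalized to [0, 1]
    return alpha_max * h

def mix(model_probs, ngram_probs, alpha):
    """Convex mixture of model and n-gram distributions; the final
    renormalization keeps the result a valid probability distribution."""
    mixed = [(1 - alpha) * m + alpha * n for m, n in zip(model_probs, ngram_probs)]
    z = sum(mixed)
    return [p / z for p in mixed]
```

Note this is score-first: alpha depends only on the model's own output distribution, never on the target token.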
Nice result — the systematic hyperparameter screening (74 experiments) is a solid approach, and the MATRIX_LR finding is a clean single-variable improvement. Heads up: the submission currently has 1 seed. The leaderboard requires 3-seed validation with statistical significance for record claims. Totally understand if you're waiting on compute before running the remaining seeds — just flagging so it doesn't get passed over during review.

Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.
… proxy) 10L + Multi-Order N-gram Backoff with entropy-adaptive alpha. Validated on 1xH100 SXM (876 steps, 59% eval coverage). Pending 8xH100 SXM verification for official record submission. Based on PR openai#828 approach with MATRIX_LR=0.03.
- Architecture: 10L, 512d, MLP 3x LeakyReLU(0.5)², XSA-4, VRL, BigramHash, SmearGate
- Artifact: 15.18 MB (under 16 MB limit)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Seeds: 42 (0.9076), 1337 (0.9072), 2024 (0.9074). All artifacts under the 16 MB limit (15.26-15.46 MB).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Thanks for your submission! Unfortunately, it's disallowed due to the use of hashed n-gram caches, which do not correctly renormalize and reweight the LM's token distribution, and which look ahead to the target token when mixing probabilities, thereby leaking eval tokens. Please refer to the long discussion about this under the issues tab for more details, and please submit more runs in the future!
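To make the ruling concrete, here is a toy illustration (entirely hypothetical code, not taken from the submission) of why any mixing scheme that consults the target token leaks evaluation data: boosting the target's probability uses the very label the model is being scored on, so the BPB "improvement" is free and meaningless.

```python
import math

def bpb_of_token(probs, target):
    """Bits contributed by one token under the scored distribution
    (treating one token as one byte for this toy example)."""
    return -math.log2(probs[target])

def illegal_mix(model_probs, target, alpha=0.5):
    """Disallowed: adds probability mass to the TARGET token, i.e. it
    peeks at the answer before the prediction is scored."""
    boosted = list(model_probs)
    boosted[target] = (1 - alpha) * boosted[target] + alpha  # label leakage
    z = sum(boosted)
    return [p / z for p in boosted]

model_probs = [0.25, 0.25, 0.25, 0.25]
target = 2
honest = bpb_of_token(model_probs, target)                    # 2.0 bits
leaked = bpb_of_token(illegal_mix(model_probs, target), target)
assert leaked < honest  # "gain" comes purely from seeing the label
```

A legal mixture may only combine distributions computed before the target is revealed, and must renormalize so the result is a proper distribution over the whole vocabulary.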
…s 2007) THE biggest legal technique gap after LEGAL_TTT. Top 30 legal PRs in COMPETITION_SCOPE.md all use multi-order n-gram backoff (openai#788/openai#802/openai#828/openai#761 = 0.91-0.96 BPB).

Implementation: at each position, use the HIGHEST-CONFIDENCE n-gram order ONLY:
- if peak(4-gram[h]) > T4: use 4-gram with weight 1.0
- elif peak(3-gram[h]) > T3: use 3-gram with weight α=0.4 (Brants 2007)
- else: use bigram with weight α²=0.16

The 'peak' = max log-prob across the vocab — concentrated distributions indicate confident counts. Hash-collision noise in lower orders is stripped by using only the most-confident order.

Marker: NGRAM_BACKOFF_MARKER. Env: USE_NGRAM_BACKOFF=1, NGRAM_BACKOFF_THRESH4=1.0, NGRAM_BACKOFF_THRESH3=1.0, NGRAM_BACKOFF_ALPHA=0.4. Composes with NGRAM_GATE. Smoke test in /tmp passes: marker present in patched file, syntax-valid Python. EXPECTED_MARKERS now 46 (was 45). Queued L09_ngram_backoff_S2_seed42/seed1337 on Pod C for n=2 cheap-pod validation.
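The order-selection rule described above can be sketched as follows. The table layout (`tables[k]` mapping hashed contexts to per-token scores) and the function names are assumptions; note also that thresholds of 1.0 only make sense for unnormalized log-count scores, since true log-probs never exceed 0.

```python
def peak(scores):
    """Max score across the vocab: a concentrated table entry is confident."""
    return max(scores)

def select_backoff(tables, h4, h3, h2, t4=1.0, t3=1.0, alpha=0.4):
    """Use ONLY the highest-confidence n-gram order at this position.

    tables[k]: dict mapping a hashed k-gram context to per-token scores.
    Returns (order, mixing_weight, scores); lower orders get geometrically
    discounted weights (alpha, alpha**2), Brants-2007-style.
    """
    entry4 = tables[4].get(h4)
    if entry4 is not None and peak(entry4) > t4:
        return 4, 1.0, entry4
    entry3 = tables[3].get(h3)
    if entry3 is not None and peak(entry3) > t3:
        return 3, alpha, entry3
    return 2, alpha ** 2, tables[2].get(h2)
```

Because only one order is consulted per position, noisy hash-collision entries in lower-order tables never contribute when a higher order is confident.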
Summary
val_bpb = 0.9074 (3-seed mean, std 0.0002) | 15.26-15.46 MB | 8xH100 SXM, 600s
Single change from PR #802: MATRIX_LR=0.03 (was 0.02). Discovered through systematic hyperparameter screening (74 experiments across steps 10-12).
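The 3-seed headline statistics can be reproduced directly from the per-seed scores reported in this PR (seeds 42, 1337, 2024):

```python
from statistics import mean, stdev

# Per-seed val_bpb from the 3-seed validation run
seed_bpb = {42: 0.9076, 1337: 0.9072, 2024: 0.9074}
scores = list(seed_bpb.values())

print(round(mean(scores), 4))   # → 0.9074
print(round(stdev(scores), 4))  # → 0.0002 (sample standard deviation)
```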
Results
Key Change
MATRIX_LR=0.03 vs PR #802's default 0.02.
Architecture
Eval: Multi-Order N-gram Backoff (from PR #802)
Reproduction
Test plan
Based On
🤖 Generated with Claude Code