Verifily Data Quality: Three-Tier Token Weighting + DCLS Salience by arsenis-cmd · Pull Request #1402 · openai/parameter-golf

arsenis-cmd · 2026-04-06T02:17:17Z

Summary

Three-tier token classification: GPU-resident bigram frequency table classifies tokens as Predictable (top ~5%, weight 0.10), Frontier (1.0), or Noise (0.70). Data-driven thresholds (~p95 bigram prob, median doc quality). All vectorized on GPU — zero overhead.
DCLS salience batch reweighting: EMA-based surprise tracking, per-batch loss multiplier [0.85, 1.15]. High-surprise + high-quality batches amplified.
Quality-conditioned bigram mixer at eval: P_mixed = (1-alpha)*P_neural + alpha*P_bigram where alpha is conditioned on document quality. Novel: no existing submission conditions mixer alpha on data quality.

Zero architecture changes, zero additional parameters. Layered on SOTA XSA+GPTQ+BigramHash+Muon base.

Validation

Validated locally (M4 Max, TinyGPT 500K params, FineWeb sp1024, 1000 steps):

Swept 6 weight configurations from aggressive (0.05/0.30) to gentle (0.50/0.75)
Best config (narrow_pred): -0.63% validation loss vs baseline with complementary eval
Complementary mixer helps Verifily 5x more than baseline (+4.9% vs +0.9%), confirming delegation mechanism
Key insight: soft weights (eff. batch ~81%) outperform aggressive gating (eff. batch ~53%)

val_bpb pending — awaiting 8xH100 compute grant for official run.

Test plan

Run torchrun --nproc_per_node=8 train_gpt.py on 8xH100 (3 seeds)
Run ablation: VERIFILY_ENABLED=0 baseline comparison
Run ablation: VERIFILY_MIXER=0 to isolate training-only effect
Verify training overhead < 2%
Update val_bpb in submission.json with official scores

…LS Salience + Quality-Conditioned Bigram Mixer Layered on SOTA XSA+GPTQ+BigramHash+Muon base. Zero architecture changes, zero extra parameters. Validated locally: -0.63% BPB vs baseline.

…ker identified First tokenizer-side fire (0/24 patches in this category). Subagent found 3 candidates (BPE-Dropout, Complementary Weighting, Three-Tier Classification) but ALL are blocked by our pre-tokenized .bin file pipeline. BPE-Dropout requires live re-tokenization at training time → infeasible. Complementary Weighting subagent incorrectly cited our MLX prototype, not the H100 train_gpt.py. Three-Tier is PR openai#1402 pending validation. Architectural insight: SP1024 may actually be optimal for our 22M architecture (smaller embedding = more params for model body). Top PRs use SP8192 because their depth-recurrence stack benefits from finer tokens. We may not need BPE-8192. Task openai#49 deferred indefinitely. Cross-domain coverage update (16 fires): training: 5, optimizer: 2, eval: 3, compression: 1, data: 2, tokenizer: 1, hardware: 0. Hardware still uncovered. Per user instruction: queued, not shipped. No code patches. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

MatoTeziTanka · 2026-04-11T20:14:51Z

Community Review — Verifily Data Quality: Three-Tier Token Weighting + DCLS Salience

BPB: (not parsed — see PR title) | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA 168bfe53b236, file records/track_10min_16mb/2026-04-05_Verifily_DataQuality_ThreeTier_DCLS/train_gpt.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.06s, dim=512, layers=11, vocab=1024, code=112932 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.06s, dim=512, layers=11, vocab=1024, code=112932 B, SMOKE_TEST_PASS. Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

Add Verifily Data Quality submission: Three-Tier Token Weighting + DC…

168bfe5

…LS Salience + Quality-Conditioned Bigram Mixer Layered on SOTA XSA+GPTQ+BigramHash+Muon base. Zero architecture changes, zero extra parameters. Validated locally: -0.63% BPB vs baseline.

arsenis-cmd force-pushed the verifily-data-quality branch from 48ba36e to 168bfe5 Compare April 6, 2026 02:23

PiyushDatta mentioned this pull request May 1, 2026

Record: SP8192+DepthRec+Half batch SWA+Polar NS+Phased LoRa TTT - val_bpb 1.089 (best), val_bpb 1.090 (3-seed mean) - PiyushDatta #2106

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Verifily Data Quality: Three-Tier Token Weighting + DCLS Salience#1402

Verifily Data Quality: Three-Tier Token Weighting + DCLS Salience#1402
arsenis-cmd wants to merge 1 commit intoopenai:mainfrom
arsenis-cmd:verifily-data-quality

arsenis-cmd commented Apr 6, 2026 •

edited

Loading

Uh oh!

MatoTeziTanka commented Apr 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

arsenis-cmd commented Apr 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Validation

Test plan

Uh oh!

MatoTeziTanka commented Apr 11, 2026

Community Review — Verifily Data Quality: Three-Tier Token Weighting + DCLS Salience

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

arsenis-cmd commented Apr 6, 2026 •

edited

Loading