Verifily Data Quality: Three-Tier Token Weighting + DCLS Salience#1402
arsenis-cmd wants to merge 1 commit into openai:main from
Conversation
…LS Salience + Quality-Conditioned Bigram Mixer Layered on SOTA XSA+GPTQ+BigramHash+Muon base. Zero architecture changes, zero extra parameters. Validated locally: -0.63% BPB vs baseline.
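The three-tier token weighting named in the title is not shown in this excerpt; as a hedged sketch of the general idea, per-token cross-entropy can be scaled by a weight derived from the source document's quality tier. The tier weights and the `tiered_loss` helper below are illustrative assumptions, not the PR's actual values or code.

```python
# Sketch of three-tier token weighting: per-token cross-entropy is scaled
# by a weight tied to the source document's quality tier.
# TIER_WEIGHTS values are hypothetical, not taken from the PR.
import torch
import torch.nn.functional as F

TIER_WEIGHTS = {0: 0.5, 1: 1.0, 2: 1.5}  # low / mid / high quality (illustrative)

def tiered_loss(logits, targets, doc_quality):
    """logits: (B, T, V); targets: (B, T); doc_quality: (B,) tier ids in {0,1,2}."""
    B, T, V = logits.shape
    per_tok = F.cross_entropy(
        logits.reshape(B * T, V), targets.reshape(B * T), reduction="none"
    ).reshape(B, T)
    # One weight per document, broadcast across its T tokens.
    w = torch.tensor([TIER_WEIGHTS[int(q)] for q in doc_quality],
                     device=logits.device).unsqueeze(1)
    return (per_tok * w).mean()
```

With all documents in the middle tier (weight 1.0) this reduces exactly to the plain mean cross-entropy, which makes the weighting easy to ablate.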
Force-pushed from 48ba36e to 168bfe5.
…ker identified. First tokenizer-side fire (0/24 patches in this category). Subagent found 3 candidates (BPE-Dropout, Complementary Weighting, Three-Tier Classification), but all are blocked by our pre-tokenized .bin file pipeline:
- BPE-Dropout requires live re-tokenization at training time → infeasible.
- Complementary Weighting: subagent incorrectly cited our MLX prototype, not the H100 train_gpt.py.
- Three-Tier is PR openai#1402, pending validation.

Architectural insight: SP1024 may actually be optimal for our 22M architecture (smaller embedding = more params for the model body). Top PRs use SP8192 because their depth-recurrence stack benefits from finer tokens. We may not need BPE-8192. Task openai#49 deferred indefinitely.

Cross-domain coverage update (16 fires): training: 5, optimizer: 2, eval: 3, compression: 1, data: 2, tokenizer: 1, hardware: 0. Hardware still uncovered.

Per user instruction: queued, not shipped. No code patches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Community Review — Verifily Data Quality: Three-Tier Token Weighting + DCLS Salience

BPB: (not parsed — see PR title) | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache.

What I found in the code (head SHA): static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.06s, dim=512, layers=11, vocab=1024, code=112932 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass.

Auto-classification caveat: this review was drafted by the deterministic AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora.
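For context on the "sliding-window stride-64" eval pattern the review refers to: each window scores only its last `stride` new tokens, so every token is predicted with close to a full context. The sketch below is an assumption about that pattern, not the repo's actual eval code; `model` is any callable returning `(batch, T, vocab)` logits, and it reports bits per token (true BPB would divide total NLL by the byte count instead).

```python
# Hedged sketch of sliding-window eval with stride-64: slide a context
# window over the token stream and only count loss on tokens not yet scored.
import math
import torch

@torch.no_grad()
def sliding_window_bits(model, tokens, ctx=1024, stride=64):
    """tokens: 1-D LongTensor. Returns mean NLL in bits per token."""
    total, count, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens) - 1, stride):
        end = min(begin + ctx, len(tokens) - 1)
        ids = tokens[begin:end + 1]                 # inputs ids[:-1], targets ids[1:]
        logits = model(ids[:-1].unsqueeze(0))[0]    # (T, V)
        logp = torch.log_softmax(logits.float(), dim=-1)
        new = end - max(prev_end, begin)            # targets not yet scored
        tgt = ids[1:][-new:]
        total += -logp[-new:].gather(1, tgt.unsqueeze(1)).sum().item()
        count += new
        prev_end = end
        if end == len(tokens) - 1:
            break
    return total / count / math.log(2)
```

A uniform model over V tokens should score exactly log2(V) bits per token, which is a cheap sanity check on the windowing arithmetic.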
Summary
P_mixed = (1 - alpha) * P_neural + alpha * P_bigram

where alpha is conditioned on document quality. Novel: no existing submission conditions mixer alpha on data quality. Zero architecture changes, zero additional parameters. Layered on the SOTA XSA+GPTQ+BigramHash+Muon base.
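A minimal sketch of the mixing rule above, assuming a scalar per-document quality score in [0, 1]; the `quality_alpha` schedule and its `lo`/`hi` bounds are hypothetical, since the PR does not spell out the conditioning function here.

```python
# Quality-conditioned bigram mixing: P_mixed = (1 - alpha) * P_neural + alpha * P_bigram,
# with alpha a function of document quality. The linear schedule is an
# illustrative assumption, not the PR's actual parameterization.
import torch

def quality_alpha(quality, lo=0.02, hi=0.15):
    """Lower-quality docs lean harder on the bigram prior (hypothetical mapping)."""
    return hi - (hi - lo) * quality        # quality in [0, 1] -> alpha in [lo, hi]

def mix_probs(p_neural, p_bigram, quality):
    """Convex combination of two next-token distributions; stays normalized."""
    alpha = quality_alpha(quality)
    return (1 - alpha) * p_neural + alpha * p_bigram
```

Because the mix is convex, the output is a valid distribution whenever both inputs are, and setting lo = hi = 0 recovers the pure-neural baseline for ablation.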
Validation
Validated locally (M4 Max, TinyGPT 500K params, FineWeb sp1024, 1000 steps):
val_bpb: pending — awaiting 8xH100 compute grant for official run.

Test plan
- torchrun --nproc_per_node=8 train_gpt.py on 8xH100 (3 seeds)
- VERIFILY_ENABLED=0 baseline comparison
- VERIFILY_MIXER=0 to isolate training-only effect
- Report val_bpb in submission.json with official scores