
Non-record: Verifily Three-Tier Token Weighting + DCLS Salience (SP1024, 1.1335 BPB)#1634

Open
arsenis-cmd wants to merge 1 commit into openai:main from arsenis-cmd:verifily-non-record

Conversation


@arsenis-cmd arsenis-cmd commented Apr 15, 2026

Summary

Pure data-quality approach — zero architectural changes. First submission to apply token-level quality signals to training loss weighting without modifying the model architecture.

Three components layered on an SP1024 11L 512d baseline:

  1. Three-tier token weighting: Classify tokens as Predictable (w=0.10), Frontier (w=1.0), or Noise (w=0.70) using GPU-resident bigram statistics + document quality scoring
  2. DCLS salience batch reweighting: Per-batch loss multiplier [0.85, 1.15] based on surprise signal and document quality
  3. Quality-conditioned bigram mixer at eval: Alpha conditioned on document quality (0.15 for high quality, 0.30 for low quality)
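The three components can be sketched as follows. This is a minimal illustration, not the submitted implementation: the function names, classification thresholds, and the exact form of the salience formula are assumptions; only the tier weights (0.10 / 1.0 / 0.70), the multiplier range [0.85, 1.15], and the alpha values (0.15 / 0.30) come from the PR.

```python
# Hypothetical sketch of the three Verifily components. Thresholds and
# helper names are illustrative; the real logic lives in train_gpt.py.
W_PREDICTABLE, W_FRONTIER, W_NOISE = 0.10, 1.0, 0.70

def tier_weight(bigram_logprob, predictable_thresh=-1.0, noise_thresh=-8.0):
    """Per-token loss weight from bigram surprisal: tokens the bigram
    model already predicts well are Predictable, extremely surprising
    tokens are treated as Noise, the rest are full-weight Frontier."""
    if bigram_logprob > predictable_thresh:
        return W_PREDICTABLE
    if bigram_logprob < noise_thresh:
        return W_NOISE
    return W_FRONTIER

def salience_multiplier(batch_surprise, doc_quality):
    """DCLS-style per-batch loss multiplier, clamped to [0.85, 1.15].
    The linear form is an assumption; only the clamp range is from the PR."""
    raw = 1.0 + 0.15 * (batch_surprise - 0.5) * doc_quality
    return max(0.85, min(1.15, raw))

def mixer_alpha(doc_quality, high_quality_thresh=0.5):
    """Quality-conditioned bigram mixing weight at eval time."""
    return 0.15 if doc_quality >= high_quality_thresh else 0.30
```

The key design choice is that all three signals are cheap to compute (bigram statistics and a scalar document-quality score), so they add essentially no training overhead.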

Results

2-seed mean: 1.13350264 BPB on 8×H100 SXM (~#16 on leaderboard).

| Seed | BPB        | Loss       | Steps | Artifact |
|------|------------|------------|-------|----------|
| 314  | 1.13414677 | 1.91495424 | 6524  | 15.8 MB  |
| 42   | 1.13285851 | 1.91277908 | 6732  | 15.9 MB  |
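The reported headline number follows directly from the two per-seed BPBs:

```python
# Arithmetic check of the 2-seed mean reported above.
seed_bpb = {314: 1.13414677, 42: 1.13285851}
mean_bpb = sum(seed_bpb.values()) / len(seed_bpb)
print(round(mean_bpb, 8))  # 1.13350264
```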

Seed 999 was not completed due to pod termination. Submitting as non-record with 2 seeds.

Key Takeaway

Data-quality signals provide a measurable training improvement but cannot close the ~0.05 BPB gap driven by architectural advances (SP8192, depth recurrence, parallel residuals, TTT). A competitive submission integrating these components onto the current SOTA stack is in progress.

Ablation

All components independently controllable via env vars: VERIFILY_ENABLED=0, VERIFILY_SALIENCE=0, VERIFILY_MIXER=0.
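One plausible way the flags could gate each component is sketched below. The variable names match the PR; the default-on, `"0"`-disables semantics are an assumption.

```python
import os

def flag(name, default="1"):
    """Read an ablation flag: any value other than "0" enables the
    component; unset falls back to the default (on)."""
    return os.environ.get(name, default) != "0"

verifily_enabled = flag("VERIFILY_ENABLED")   # three-tier token weighting
use_salience     = flag("VERIFILY_SALIENCE")  # DCLS batch reweighting
use_mixer        = flag("VERIFILY_MIXER")     # quality-conditioned eval mixer
```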

Test plan

  • Verify submission.json schema matches competition spec
  • Verify train_gpt.py passes python3 -c "import ast; ast.parse(open('train_gpt.py').read())"
  • Verify all env var ablation flags are documented

Pure data-quality approach — zero architectural changes. Three components:
1. Three-tier token weighting (Predictable=0.10, Frontier=1.0, Noise=0.70)
2. DCLS salience batch reweighting [0.85, 1.15]
3. Quality-conditioned bigram mixer at eval

2-seed mean: 1.13350264 BPB on 8×H100 SXM (~#16 on leaderboard).
Demonstrates data-quality signals help but can't close architecture gap.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
