Skip to content

Crawler — 8.8MB -1.1874 BPB (3-seed mean, 8xH100, 600s)#1140

Open
newjordan wants to merge 154 commits intoopenai:mainfrom
newjordan:submission/crawler-leg3
Open

Crawler — 8.8MB -1.1874 BPB (3-seed mean, 8xH100, 600s)#1140
newjordan wants to merge 154 commits intoopenai:mainfrom
newjordan:submission/crawler-leg3

Conversation

@newjordan
Copy link
Copy Markdown

@newjordan newjordan commented Mar 30, 2026

Submission: Micro Crawler

3-seed mean: 1.18742567 BPB | Size: 9.36MB | Hardware: 8×H100 SXM

Architecture Philosophy

The "Micro-Crawler" stack is a causal coordination engine operating at three temporal resolutions simultaneously through shared weights.

Each loop iteration coordinates the same fuzzy input representation against the same learned shape space, but at a different causal horizon. Loop 0 attends to immediate causes (adjacent tokens). Loop 1 attends to medium-range causal structure. Loop 2 integrates distant causes at the sentence and paragraph level. The shared weights are the learned geometric attractor — the distributed representation of known truth that the input is being pulled toward through each pass. Weight sharing is not a parameter-budget compromise; it is the mechanism. The same causal law applied at three temporal resolutions, each loop leaving the representation less fuzzy than it found it.

Results

Seed int6 SW BPB Steps Size
1337 1.18720375 8087 8,842,981 bytes
42 1.18761637 8119 9,362,069 bytes
300 1.18745690 8103 9,332,848 bytes
mean 1.18742567 9,362,069 bytes (max)

Std: <0.0002 across seeds.

Architecture

  • 4 flat XSA layers + 1 shared crawler block × 3 loops
  • CRAWLER_MLP_MULT=6.0, CRAWLER_QUANT_INT8=1 (QAT)
  • 14,462,508 parameters
  • SKIP_GPTQ=1 — naive int6 + zstd
  • NGRAM_EVAL_ORDER=0 (no ngram)
  • 8 heads / 4 KV heads, bigram=2048, RoPE=16

Update (2026-03-31): Additional runs since submission. (per-loop RoPE scaling across three causal horizons) combined with fullgraph compilation dropped both BPB and file size simultaneously. -0.00070 BPB, -337KB. New bpb val_bpb:1.18672385, Total submission size: 9024399 bytes - Still unstable but in a good way

Reproduce

git clone https://github.com/newjordan/parameter-golf-1.git
cd parameter-golf-1 && git checkout submission/crawler-leg3
python3 data/cached_challenge_fineweb.py
SEED=1337 NPROC_PER_NODE=8 bash experiments/Crawler_Leg_3/run.sh
NPROC_PER_NODE=8 bash experiments/Crawler_Leg_3/run_multi_seed.sh

Data visualization of crawler:
image
image

Logs and train_gpt.py in records/track_10min_16mb/2026-03-30_Crawler_Leg3_8xH100/.

Octavian and others added 30 commits March 26, 2026 00:23
3D cubric pattern recognizer (54 warm-started adaptive multipliers)
+ complementary training. Seeds: 1337=0.4818, 300=0.4821, 58=0.4821.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three variants targeting the 0.187 BPB gap to #1:
- bwing_alpha: clip 0.95, alpha 0.05-0.60 (isolate alpha curve)
- bwing_entropy_shift: per-order entropy center shift (isolate)
- bwing_full_port: all openai#809 techniques + fixed order mults (fire first)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Cubric 3D back online (CADENCE=32, warm-start)
- Per-order entropy center shift from openai#809
- Alpha 0.05-0.60, clip 0.95
- Our sliding-window TTT spliced in (1 epoch, SGD, freeze 2 blocks)
- TTT runs BEFORE n-gram eval → adapted model feeds n-gram

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Port openai#809 LoRA TTT: rank-8 adapters on Q/V/LM head, AdamW, Polyak
- Add LoRA injection to CausalSelfAttention, Block, GPT forward paths
- 53s vs our old 410s TTT, 6x better BPB gain
- Cubric 3D ON + entropy shift + alpha 0.05-0.60 clip 0.95

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixed mults + entropy shift + alpha 0.05-0.60 clip 0.95 (no cubric).
Base sliding: 1.1194, n-gram9: 0.4512. Delta from X-WING: -0.031.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Deleted LoRA TTT abomination. bwing_III is now a clean copy of our
best scoring variant for further iteration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
bwing_IV: Prime fix only — adds primes 283721, 347237 to eliminate
XOR hash collisions for orders 8-9 (the 2.0x multiplier orders).
With 7 primes, prime[7] wrapped to prime[0], causing context tokens
at positions j-8 and j-1 to cancel when equal.

bwing_V: Prime fix + cubric 3D stacked on top of fixed mults.
Cubric warm-starts at 1.0 (neutral) and refines per (order × entropy
× count) on top of the fixed order multiplier scaling.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adapted from old setup.sh. Fixes FA3 detection (old one skipped FA3
when FA2 was present), uses sp1024 dataset, adds zstandard install.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Standalone eval script loads final_model.int6.ptz once, then sweeps:
- alpha_max: [0.50, 0.60, 0.70, 0.80]
- entropy_center: [2.0, 2.5, 3.0]
- high_order_mult: [1.5, 2.0, 2.5, 3.0]
- min_count: [1, 2]
- cubric: [on, off]
= 192 configs, ~3 min each, sorted by aggressiveness (best-first).
Results to sweep_results.csv.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
openai#809 uses INT5 — more aggressive quantization creates more entropy in
the post-quant model, letting n-gram eval rescue harder. Their quant
loss is 0.019 vs our 0.006 (INT6), but n-gram extracts 0.869 vs 0.668.

Changes from bwing_IV:
- clip_range: 31 → 15 in gptq_quantize_weight, quantize_int6_per_row,
  and _find_best_row_scales
- No cubric (it hurt in bwing_V)
- 9 hash primes (from bwing_IV)
- All openai#809 n-gram params (fixed mults, entropy shift, alpha curve)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Clean submission-ready code. 2140 → 1936 lines (-204).
Removed all dead code paths that aren't used in our config.
INT5 GPTQ + 9-prime hash fix remain as the key changes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A-Wing Green (INT5 GPTQ + 9-prime):
  - Post-quant sliding: 1.1410 (vs 1.1194 INT6)
  - N-gram reduction: 0.683 (vs 0.668 INT6 — +0.015 more)
  - Final: 0.4576 BPB — worse than SOTA by 0.006
  - Conclusion: INT5 quant noise hurts more than n-gram gains

bwing_V (9-prime + cubric stacked on fixed mults):
  - Final: 0.4601 BPB — cubric on top of fixed mults HURTS by 0.009
  - Cubric over-corrected (orders 2-3 suppressed to 0.62x on top of 0.3x)

SOTA remains bwing_full_port at 0.4512 BPB (INT6, fixed mults, no cubric).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Instead of entropy-adaptive alpha (blind proxy), compare actual model_p
vs ngram_p per token. Soft sigmoid on log-ratio:
  alpha = 0.95 * sigmoid(8 * log(ngram_p / model_p))

When ngram_p > model_p: alpha → 0.95 (trust n-gram)
When ngram_p < model_p: alpha → 0.0 (trust model)
No wasted mixing on tokens where n-gram is worse.

Base: SOTA bwing_full_port + 9-prime hash fix. INT6, no cubric.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
openai#809 trains for 525s, leaving 75s for GPTQ. We were using the full
600s default. 570s leaves 30s for GPTQ calibrate (3.4s) + quantize
(~25s) with headroom.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- run.sh now checks zstandard + flash_attn BEFORE training starts
- Fails fast if zstandard missing (prevents 17MB zlib artifacts)
- Shows FA version for debugging
- train_gpt.py warns loudly if falling back to zlib

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Green_1 scored 0.3200 BPB with oracle alpha alone. Green_2 adds LoRA TTT
to close the remaining 0.025 gap to openai#809 (0.2952).

TTT flow (score-first legal):
1. Sliding window eval scores all val tokens (frozen model)
2. LoRA rank-8 adapters injected on Q, V projections
3. Single pass over val tokens: score then adapt (AdamW, lr=3e-4)
4. Polyak averaging (decay=0.998) for stability
5. N-gram eval with oracle alpha on adapted model

Coarse stride (16x) keeps TTT under 60s. Total eval budget: ~290s.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Rewrote setup_runpod.sh to install FA3 + zstandard directly into the
default system env instead of creating a separate conda environment
that conflicts with torchrun and per-test scripts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A-Wing Green_1 seed 1337 = 0.3200 BPB (was 0.4512).
Oracle alpha = sigmoid(8 * log(ngram_p/model_p)) * 0.95.
Copies: red, purple for parallel experimentation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds Linear(512→12) alpha_head trained jointly with model to predict
per-token expert weights (neural + 11 n-gram orders 2-12).
Training oracle prefilled from training data, eval uses backward-looking
val-data cache. Targets sub-0.15 BPB on our 1.1195 neural baseline.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Usage on fresh pod:
  bash experiments/pod_launch.sh experiments/A_wing/purple/run.sh

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add pod_setup.sh: one file, zero args, sets up pod environment
- Move stale root dirs to experiments/archive/ organized by type
- Update pod_launch.sh default branch to test
- Gitignore checkpoints (too large for GitHub)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New experiment: test whether weight-shared Frugendorff architecture
compresses model artifact while maintaining BPB when paired with the
full X-WING N-gram eval stack (3D cubric, shared tables, CT, orders 2-9).

- train_gpt.py: adds CrawlerGPT class alongside existing GPT; USE_CRAWLER=1
  switches to 4 flat + 1 shared×2 architecture; build_model() factory handles
  both; all N-gram/GPTQ/CT machinery unchanged and legal
- Green/run.sh: 0.25 scale validator (1 GPU, 150s, dim=384)
- Red/run.sh: full scale production (8×H100, 600s, USE_CRAWLER=1)
- Purple/run.sh: U-Net control (8×H100, 600s, USE_CRAWLER=0) for clean A/B

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Octavian and others added 20 commits March 29, 2026 22:52
- SKIP_GPTQ=1: no 30s reserve, full wallclock restored (~1.1091 target)
- int6_cats adds "embed": tok_emb quantized int6 not int8, ZSTD saves ~1.5-2MB
- Expected artifact: ~14.5-15MB (vs 16.73MB on Rascal I)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SKIP_GPTQ=1 + embed int6 → full 600s training + legal compression.
DO NOT MODIFY this entry.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Safe copy created after the original was overwritten by an agent run.
MD5-verified identical to the run that produced 0.2233 BPB ngram9.
Use this for re-runs — do not modify.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ChopShop: stripped 548 lines of dead code from Rascal submission
(TrainNgramTracker, ngram eval mixer, DTG, gated attn, value residual,
MTP heads, LAWA, complement training). 103KB → 75KB (-27.6%).

Rascal_Stripper: 4-way A/B workspace — safe/turbomuon/engramlite/combo
+ smoke_test.sh (1500 steps × 4 variants = 4500 steps total, val BPB
every 300 steps, final s64 comparison table).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Includes: Bandit_Wagon, ClownCar_III, Cobra, Crawler_Ablations_v1,
Crawler_Leg_1 (full results), GreenRod_X_1 lab protocol, H6/H8/H9/H10
hypotheses, Junkyard_Rat_MLP/Shroud_Mini, Medusa_III/VRed, Shroud,
BWING + Rascal_8xH100 records, scripts, octavian notes, Nitrust blueprints.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Arms:
  CL2-00: baseline reference (matches Leg 1 CL1-00)
  CL2-01: loops=3 + mlp=5.0 combined — primary hypothesis
  CL2-02: full stack (loops=3 + mlp=5.0 + LOOP_AWARE_GPTQ + COMPILE)
  CL2-03: loops=2 + mlp=5.0 — push loops further
  CL2-04: loops=3 + mlp=6.0 — push MLP further

Expected: CL2-01 ≈ 1.56–1.62 BPB if wins are additive.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove all ngram/mixer/oracle code from Bandit_Wagon/train_gpt.py and
  Bandit/train_gpt.py (~1160 lines each, files now identical at 2378 lines)
- Update Bandit_Wagon/run.sh with post-CL1 optimal settings:
    CRAWLER_LOOPS 4→3   (CL1-01: −0.088 BPB)
    CRAWLER_MLP_MULT=5.0 added (CL1-07: −0.098 BPB)
    COMPILE_FULLGRAPH 0→1 (Ablations_v1-E: −0.026; safe now NGRAM removed)
    LOOP_AWARE_GPTQ=1 retained (Ablations_v1-B: −0.040)
- Remove dead NGRAM_EVAL_*, COMPLEMENT_ALPHA, CUBRIC_CADENCE env vars from run.sh

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TurboMuon: AOL left-Gram preconditioning, Polar Express NS4 coefficients, row_col post-NS normalize
EngramLite: 2-head 8192-bucket bigram+trigram hash embedding (4× n-gram capacity)
TTT: score-first legal protocol, freeze last-2 blocks, Polyak avg, 3 epochs/chunk
CROWN-Q: QAT penalty during warmdown to sharpen quantized weights for TTT

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Target: ≤1.15 BPB at ~10MB, no ngram oracle. Updated arms to reflect
post-CL1 locked config (loops=3, mlp=5.0, loop_aware_gptq=1).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
`scale` inside the CROWN-Q loop was shadowing the outer LR-schedule `scale`
variable, corrupting the learning rate for all subsequent optimizer steps.
Renamed to `q_scale` in all three variants.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ns on 8×H100

5-arm Crawler_Leg_2 sweep (350s/arm, seed 1337):
  CL2-00 baseline (loops=4 mlp=4.0): 1.20285
  CL2-01 loops=3+mlp=5.0 SKIP_GPTQ:  1.20211 (−0.0007)
  CL2-02 full stack LOOP_AWARE_GPTQ:  1.19593 (−0.0069) ✅ BEST
  CL2-03 loops=2+mlp=5.0:             1.20667 (+0.0038) ❌
  CL2-04 loops=3+mlp=6.0 SKIP_GPTQ:  1.19828 (−0.0046)

Production config locked: loops=3, mlp=5.0, COMPILE=1, LOOP_AWARE_GPTQ=1, QUANT_INT8=1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Best SKIP_GPTQ=1 arch from Leg 2 (CL2-04) at full wallclock.
600s vs 350s → ~2400 extra steps on 8×H100.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… sweep tool

Load any Rascal/combo checkpoint, run baseline sliding window eval, then run one
TTT config. Auto-detects BigramHashEmbedding vs EngramLite from checkpoint keys.
Sweep TTT_LR / TTT_EPOCHS / TTT_FREEZE_BLOCKS / TTT_CHUNK_TOKENS via env vars.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Seeds 42+300 runner. Submission dir pre-filled with seed_1337 result
(1.18720375 BPB). PLACEHOLDERs to be filled after seeds complete.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Runs conservative / balanced / aggressive TTT configs in sequence against
a trained checkpoint. Prints comparison table at the end.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Smoke test at 3200 steps showed combo -0.00492 BPB vs baseline.
Expected ~1.105 BPB at full 600s run on 8xH100.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Seeds 1337/42/300: 1.18720 / 1.18762 / 1.18746. Std <0.0002.
train_gpt.py + all logs included.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@newjordan newjordan changed the title Crawler — 1.1874 BPB (3-seed mean, 8xH100, 600s) Crawler — 8MB -1.1874 BPB (3-seed mean, 8xH100, 600s) Mar 30, 2026
@newjordan newjordan changed the title Crawler — 8MB -1.1874 BPB (3-seed mean, 8xH100, 600s) Crawler — 8.8MB -1.1874 BPB (3-seed mean, 8xH100, 600s) Mar 30, 2026
@newjordan
Copy link
Copy Markdown
Author

Update: BW5 — COMPILE_FULLGRAPH=1 drops BPB and file size simultaneously

Config int6_sw_bpb Bytes Steps
Leg 3 (submission, mean) 1.18742567 9,362,069 ~8100
BW4 (battery only, no choke) 1.18730643 8,968,xxx 8021
BW5 (BW4 + COMPILE_FULLGRAPH=1) 1.18672385 9,024,399 8035

BPB: -0.00070 vs submission mean
Bytes: 9,362,069 → 9,024,399 (-337KB)

Both drop simultaneously. The fullgraph compile fuses the 3-loop crawler dispatch into tighter kernels — fewer intermediate tensor materializations, cleaner quantization surface. Zero new parameters.

Single seed (444), 8×H100 SXM, 600s wallclock. Seed=300 confirmation pending.

@newjordan
Copy link
Copy Markdown
Author

1.82 and found a big lever

@newjordan
Copy link
Copy Markdown
Author

1.76

@newjordan
Copy link
Copy Markdown
Author

Moved it to a 9f now the Crawler is stabilized. - will switch back and work on compression now I am starting to break into quality levers. will do it step by step and keep it stable.

Serialized model int6+zstd: 15117899 bytes
final_int6_roundtrip val_loss:2.0838 val_bpb:1.1612 eval_time:20795ms
final_int6_roundtrip_exact val_loss:2.08381255 val_bpb:1.16124154
final_int6_sliding_window val_loss:2.0433 val_bpb:1.1387 stride:64 eval_time:102711ms
final_int6_sliding_window_exact val_loss:2.04332268 val_bpb:1.13867894
final_int8_zlib_roundtrip_exact val_loss:2.04332268 val_bpb:1.13867894

tmancino added a commit to tmancino/parameter-golf that referenced this pull request Apr 6, 2026
3-seed mean val_bpb: 1.1035 (seeds 271, 503, 999)
Improvement over SOTA PR openai#1019 (1.1147): -0.0112 BPB / -0.0189 nats
Welch t = -40.37, p << 0.001

Key techniques:
- MR-GPTQ Hadamard rotation before int6 GPTQ (68x lower quant MSE)
- Discriminative TTT with per-block LR scaling (from PR openai#1351)
- 2-layer depth recurrence (from PR openai#1140)

Built on PR openai#1019 (abaybektursun) base architecture.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant