diff --git a/.gitignore b/.gitignore
index 3423c416a7..9b804d2069 100644
--- a/.gitignore
+++ b/.gitignore
@@ -8,4 +8,6 @@ data/manifest.json
 data/docs_selected.jsonl
 .mypy_cache/
 .venv
-logs/
\ No newline at end of file
+logs/*
+!logs/daily_research.md
+!logs/experiments.md
\ No newline at end of file
diff --git a/logs/daily_research.md b/logs/daily_research.md
new file mode 100644
index 0000000000..9584032497
--- /dev/null
+++ b/logs/daily_research.md
@@ -0,0 +1,168 @@
+# Daily Parameter Golf Research — 2026-03-25
+
+## Alerts
+
+- **COMPETITION HAS EXPLODED.** Open PRs now reach 0.5466 bpb (PR #798). Our 1.0705 is no longer competitive among open submissions.
+- **N-gram eval cache is CONFIRMED LEGAL** — but hindsight selection (comparing n-gram vs model on ground truth) is BANNED. Fixed-weight or entropy-adaptive alpha only.
+- **Eval-time GPTQ BANNED.** Quantization calibration must happen within the training window, not at eval. PRs #606, #615, #626, #639, #656 were closed for this.
+- **Multi-epoch TTT with min-loss selection BANNED.** Token adaptation before evaluation = training on the val set.
+- **PR #771 still OPEN, no reviews yet.** No action required, but it will likely not merge given the score gap to newer submissions.
+
+## Leaderboard
+
+**Merged SOTA**: 1.1194 bpb (PR #549, abaybektursun, 2026-03-23)
+
+**Our PR #771**: 1.0705 bpb — Open, awaiting review. Beats merged SOTA by 0.049 but sits far behind the open PR frontier.
+
+### Top Open PRs (the real competition)
+
+| PR# | val_bpb | Technique | Seeds | Status |
+|-----|---------|-----------|-------|--------|
+| #798 | **0.5466** | Order-adaptive entropy gating + BackoffNgramMixer (per-order ent_centers) | 3 | Open |
+| #796 | **0.6567** | Prefill cache + 7-gram entropy-adaptive + EBLS | ? | Open |
+| #770 | **0.6672** | 11L + eval-time multi-order n-gram cache (2-7), entropy-adaptive alpha | 1 | Open |
+| #795 | **0.8881** | 11L + order-adaptive 11-gram | ? | Open |
+| #797 | **0.8960** | 7-gram n-gram cache | ? | Open |
+| #727 | **0.9674** | Multi-order n-gram backoff (2-7) + entropy-adaptive alpha | 3 | Open |
+| #741 | **0.9850** | Cosine TTT + multi-order n-gram cache | ? | Open |
+| #792 | **1.0340** | 11L LeakyReLU² + XSA-all + Full GPTQ + 5-gram | ? | Open |
+| #758 | **1.0465** | N-gram no TTT | ? | Open |
+| #771 | **1.0705** | AdamW TTT 30ep cosine + per-layer LR (ours) | 3 | Open |
+
+**Key pattern**: Every sub-1.0 submission uses an n-gram eval cache. The top submissions use ORDER-ADAPTIVE entropy gating with per-order thresholds. Pure TTT without n-gram is no longer competitive.
+
+## New Techniques Found
+
+### 1. Order-Adaptive Entropy Gating (PR #798 — 0.5466 bpb)
+- **Source**: Open PR, 3-seed validated
+- **Delta estimate**: -0.52 bpb vs our base (!)
+- **How it works**: Per-order entropy centers: `{7: 3.0, 6: 3.2, 5: 3.5, 4: 3.8, 3: 4.2, 2: 4.5}`. Higher-order n-grams activate at lower entropy (high confidence), lower-order at higher entropy. Builds on BackoffNgramMixer from PR #779.
+- **Evidence quality**: STRONG (3-seed, 15.99MB artifact, compliant)
+- **Legality**: Legal — score-first caching, no hindsight selection
+- **Implementation cost**: Medium — need backoff n-gram mixer + per-order entropy gating
+
+### 2. BackoffNgramMixer (PR #779 foundation)
+- **Source**: Referenced by PR #798
+- **How it works**: Multi-order n-gram cache with highest-order-first cascading fallback on miss. Orders 2-7 with backoff. Both pieces are sketched below.
+- **Delta estimate**: -0.10 to -0.16 bpb (base technique before entropy gating)
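+A minimal sketch of how the two pieces compose (class and function names are mine; the actual PR #779/#798 internals are unverified, and the gate's functional form is an assumption by analogy with PR #727's sigmoid):
+
+```python
+import math
+from collections import defaultdict
+
+class BackoffNgramMixer:
+    """Multi-order n-gram cache (orders 2-7) with highest-order-first backoff."""
+    def __init__(self, orders=range(2, 8)):
+        self.orders = sorted(orders, reverse=True)  # try order 7 first, cascade down
+        # counts[n][context][token] -> count, filled only from already-scored tokens
+        self.counts = {n: defaultdict(lambda: defaultdict(int)) for n in self.orders}
+
+    def observe(self, tokens, t):
+        """Record the n-grams ending at position t (call only AFTER token t is scored)."""
+        for n in self.orders:
+            if t + 1 >= n:
+                ctx = tuple(tokens[t - n + 1:t])
+                self.counts[n][ctx][tokens[t]] += 1
+
+    def predict(self, context):
+        """Return (order, distribution) from the highest order whose context was seen."""
+        for n in self.orders:
+            hits = self.counts[n].get(tuple(context[-(n - 1):]))
+            if hits:
+                total = sum(hits.values())
+                return n, {tok: c / total for tok, c in hits.items()}
+        return None, None  # full miss: fall back to the model alone
+
+# Order-adaptive entropy gating: the per-order centers quoted from PR #798.
+ENT_CENTERS = {7: 3.0, 6: 3.2, 5: 3.5, 4: 3.8, 3: 4.2, 2: 4.5}
+
+def order_adaptive_alpha(model_entropy, order, lo=0.05, hi=0.60, slope=2.0):
+    """Mixing weight for the n-gram distribution. The gate opens as model entropy
+    crosses the order-specific center, so order-7 hits engage at lower entropy
+    than order-2 hits (lo/hi/slope borrowed from PR #727's formula)."""
+    gate = 1.0 / (1.0 + math.exp(-slope * (model_entropy - ENT_CENTERS[order])))
+    return lo + (hi - lo) * gate
+```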
+### 3. Entropy-Adaptive Alpha Mixing (PR #727 — 0.9674 bpb)
+- **Source**: Open PR, 3-seed validated
+- **Formula**: `alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0))`
+- **Evidence**: Ablation shows a 0.0151 bpb improvement over fixed alpha=0.40
+- **This is the simpler version** — PR #798's per-order centers are the upgrade
+
+### 4. Prefill Cache + EBLS (PR #796 — 0.6567 bpb)
+- **Source**: Open PR
+- **Unclear what EBLS is** — needs investigation. Could be a major technique.
+
+## Technique Legality Updates (Issue #140, 2026-03-25)
+
+1. **N-gram eval cache**: LEGAL (score-first, backward-looking only)
+2. **Hindsight selection** (comparing n-gram vs model on ground truth): **BANNED**
+3. **Eval-time GPTQ calibration**: **BANNED** (must fit in the training window)
+4. **Multi-epoch TTT with min-loss selection**: **BANNED** (= training on the val set)
+5. **Fixed-weight blending or entropy-adaptive alpha (model uncertainty, not labels)**: LEGAL
+
+## Recommended Action Plan
+
+### Priority 1: Implement Order-Adaptive Entropy Gating N-gram Cache on our base
+
+**Theory of victory**: PR #798 achieves 0.5466 on a standard 11L base. Our base (1.0705 with AdamW TTT) is stronger than average. Adding an order-adaptive n-gram cache should land in the 0.50-0.60 bpb range. Even a conservative implementation (just multi-order backoff + basic entropy-adaptive alpha like PR #727) should get us to 0.85-0.95 bpb.
+
+**Implementation plan**:
+1. Start with PR #727's approach (simpler): multi-order backoff (2-7) + `alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0))`
+2. Then upgrade to PR #798's per-order entropy centers
+3. Ensure score-first compliance (cache only from already-evaluated tokens) — see the eval-loop sketch after Priority 2
+
+**We already have v9a/v9b code locally** that targets this. Key files:
+- `records/track_10min_16mb/2026-03-25_sunnypatneedi_v2/train_gpt_v9a_11gram_no_ttt.py`
+- `records/track_10min_16mb/2026-03-25_sunnypatneedi_v2/train_gpt_v9b_11gram_mini_ttt.py`
+
+**RunPod commands** (after pod creation on 8xH100 SXM):
+```bash
+# Setup
+pip install zstandard --break-system-packages
+python3 -c "import zstandard; print('zstd OK')"
+
+# Clone and checkout
+cd /workspace
+git clone https://github.com/sunnypatneedi/parameter-golf.git
+cd parameter-golf
+
+# Copy the n-gram version to test
+cp records/track_10min_16mb/2026-03-25_sunnypatneedi_v2/train_gpt_v9a_11gram_no_ttt.py train_gpt.py
+
+# 1-seed smoke test
+torchrun --standalone --nproc_per_node=8 train_gpt.py
+
+# Check artifact size
+python3 -c "import os; s=os.path.getsize('artifact.tar.gz'); print(f'{s:,} bytes ({s/1e6:.2f} MB) — {\"PASS\" if s < 16_000_000 else \"FAIL\"}')"
+
+# If seed 42 bpb < 1.0 AND artifact < 16MB, run 3-seed validation
+# (use the run_3seeds.sh pattern from PR #771 submission)
+```
+
+**Expected result**: 0.85-1.00 bpb (conservative), 0.55-0.75 (if per-order gating works well)
+**Abort criteria**: If seed 42 bpb > 1.05, the n-gram implementation has a bug — debug before spending more.
+**Estimated cost**: $8 (1-seed) to $33 (3-seed) on 8xH100
+
+### Priority 2: Study PR #798 implementation in detail
+
+Before any GPU spend, WebFetch the PR #798 diff to understand the exact BackoffNgramMixer + entropy gating code. Port the key logic into our v9a/v9b scripts. The per-order entropy centers are the key innovation:
+```python
+ent_centers = {7: 3.0, 6: 3.2, 5: 3.5, 4: 3.8, 3: 4.2, 2: 4.5}
+```
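+To make the compliance requirement concrete, here is a minimal score-first eval-loop sketch (hypothetical harness names — `model_probs` stands in for whatever the eval harness actually exposes; it returns average bits per token, eliding the harness's exact bpb normalization):
+
+```python
+import math
+
+def eval_bits_score_first(tokens, model_probs, mixer):
+    """tokens: eval sequence; model_probs(ctx) -> {token: prob};
+    mixer: a BackoffNgramMixer as sketched under Technique 2 above."""
+    total_nats = 0.0
+    for t in range(1, len(tokens)):
+        ctx, target = tokens[:t], tokens[t]
+        p_model = model_probs(ctx)
+        # Model uncertainty only (never the label): entropy of the model's distribution.
+        H = -sum(p * math.log(p) for p in p_model.values() if p > 0)
+        order, p_ngram = mixer.predict(ctx)
+        if p_ngram is not None:
+            # PR #727's fixed-center alpha; swap in order_adaptive_alpha(H, order)
+            # for the PR #798 per-order upgrade.
+            alpha = 0.05 + 0.55 / (1.0 + math.exp(-2.0 * (H - 4.0)))
+            p = {k: (1 - alpha) * p_model.get(k, 0.0) + alpha * p_ngram.get(k, 0.0)
+                 for k in set(p_model) | set(p_ngram)}
+        else:
+            p = p_model
+        total_nats += -math.log(max(p.get(target, 0.0), 1e-12))
+        # Score-first: the cache only ever sees tokens that have already been scored.
+        mixer.observe(tokens, t)
+    return total_nats / math.log(2) / (len(tokens) - 1)
+```
+
+The key legality property: `mixer.observe` runs strictly after the target is scored, and `alpha` depends only on `H` (model uncertainty), never on whether the n-gram beat the model on the ground truth — that would be the banned hindsight selection.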
+### Priority 3: Investigate "EBLS" technique from PR #796
+
+PR #796 achieves 0.6567 with "Prefill Cache + EBLS". EBLS is unknown — it could be a significant technique worth understanding.
+
+## Code Changes Made
+
+No new code was written in this report cycle. The existing local versions v9a (11gram no TTT) and v9b (11gram mini TTT) were already prepared in the previous session.
+
+## Papers & Community
+
+### Relevant Papers
+
+**Directly applicable to our competition work:**
+
+- **SLOT: Sample-specific LM Optimization at Test-time (arXiv:2505.12392)**: Adds a lightweight parameter vector δ to the final hidden layer, optimized on the input prompt via cross-entropy. Few optimization steps, caches last-layer features. **Potential unlock**: This is a legal TTT variant — it adapts per-sample at test time by minimizing loss on the prompt itself (backward-looking). Could complement the n-gram cache. The lightweight δ on the final hidden layer is architecturally different from our current full-model TTT. Worth investigating whether SLOT-style adaptation + n-gram outperforms AdamW TTT + n-gram; a sketch follows this list.
+
+- **N-gram Residual Learning (arXiv:2210.14431)**: Trains a neural LM to fit the *residual* between an n-gram LM and the true distribution, rather than the full distribution. The neural model only needs to learn what the n-gram can't predict. **Potential unlock**: If we trained our base model with n-gram residual awareness, the neural + n-gram combination at eval time would be tighter. This is a training-time change, not just eval-time — could be worth 0.01-0.03 bpb over naive interpolation. Medium implementation cost.
+
+- **LaCT / TTT Done Right (arXiv:2505.23884)**: ICLR 2026 Oral. Already in our technique reference — the cosine + per-layer LR recipe. Our PR #771 implements this.
+
+- **E2E TTT (arXiv:2512.23675)**: Meta-learns the TTT initialization at train time. Interesting, but the meta-learning phase likely exceeds our 10-min training budget. Not directly applicable.
+
+**Quantization-specific (for squeezing more model into 16MB):**
+
+- **LieQ (arXiv:2508.03332)**: Layer-wise mixed-precision PTQ for small LMs. Keeps uniform bit-width within each layer but mixes precision across layers based on an information-effectiveness metric. **Potential unlock**: Instead of uniform int6 everywhere, use int7 on critical layers and int5 on redundant ones. Could recover 0.002-0.005 bpb at the same artifact size, or free ~200KB for a larger n-gram cache. Low implementation cost.
+
+- **pQuant (arXiv:2602.22592)**: Decoupled linear QAT for sub-2-bit. Aggressive, but our int6 regime is higher — less applicable. Noted for the future if we need to go lower.
+
+- **SLMQuant (arXiv:2511.13023)**: Systematic benchmark showing SLMs are uniquely sensitive to quantization. Confirms our Lesson #8 — small models need careful quant. Validates our choice of QAT over PTQ.
+
+**Expert mixing / ensemble theory:**
+
+- **Lossless Compression via Next-Token Prediction (arXiv:2505.06297)**: Uses LLM predictions + arithmetic coding for lossless compression. The ensemble approach (multiple predictors) is conceptually what n-gram + neural model does. Confirms the competition meta is sound.
+
+- **PEER: Parameter Efficient Expert Retrieval**: Uses product keys to route through millions of single-neuron experts. Interesting architecture but likely too expensive for our 10-min budget. Filed away for the future.
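+A back-of-envelope PyTorch sketch of the SLOT recipe as summarized above (function name, shapes, and hyperparameters are my assumptions, not the paper's code):
+
+```python
+import torch
+import torch.nn.functional as F
+
+def slot_adapt(last_hidden, lm_head, prompt_ids, steps=3, lr=1e-2):
+    """last_hidden: [T, d] cached final-layer features for the prompt;
+    lm_head: module mapping [*, d] -> [*, V] logits; prompt_ids: [T] token ids.
+    Learns one delta vector, added to every final hidden state, by next-token
+    cross-entropy on the prompt itself — backward-looking, hence TTT-legal."""
+    for p in lm_head.parameters():  # freeze the head; only delta is adapted
+        p.requires_grad_(False)
+    delta = torch.zeros(last_hidden.size(-1), requires_grad=True)
+    opt = torch.optim.AdamW([delta], lr=lr)
+    for _ in range(steps):
+        logits = lm_head(last_hidden[:-1] + delta)  # state t predicts token t+1
+        loss = F.cross_entropy(logits, prompt_ids[1:])
+        opt.zero_grad()
+        loss.backward()
+        opt.step()
+    return delta.detach()  # add to hidden states when scoring the continuation
+```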
+### Community
+
+- **HuggingFace**: No parameter-golf-specific community posts. TTT and test-time scaling blog posts reference the E2E TTT paper but offer no novel techniques.
+- **DeepWiki (openai/parameter-golf)**: Has a wiki-style breakdown of the competition but is blocked for direct fetch. Could contain a technique catalog.
+- **Reddit**: No significant parameter golf threads found. GitHub Issue #140 remains the primary community discussion hub.
+- **Competition coverage**: algo-mania.com and aitoolsclub.com have general-awareness articles, no new techniques.
+
+## Strategic Assessment
+
+**Our position**: Our 1.0705 beats the merged SOTA (1.1194) but ranks ~10th among open PRs. The competition has moved to n-gram territory. Without an n-gram cache, we cannot compete.
+
+**The meta**: Order-adaptive entropy gating + multi-order n-gram backoff (2-7 or 2-11) is the dominant technique. TTT is a secondary boost. The winning formula appears to be: strong 11L base + n-gram cache + entropy-adaptive mixing.
+
+**Next session priority**: Study the PR #798 diff → port order-adaptive entropy gating to our base → test on RunPod → submit.
+
+**Budget note**: At $33/attempt, we have ~10 attempts left before the deadline. Each attempt should now include an n-gram cache. Pure architecture or TTT experiments without n-gram are no longer worth GPU time.
diff --git a/logs/experiments.md b/logs/experiments.md
new file mode 100644
index 0000000000..e2b0ff4600
--- /dev/null
+++ b/logs/experiments.md
@@ -0,0 +1,17 @@
+# Experiment Log — Parameter Golf
+
+| Date | Exp ID | Change | val_bpb (sliding eval) | Artifact bytes | Steps | ms/step | Hypothesis → Verdict |
+|------|--------|--------|------------------------|----------------|-------|---------|---------------------|
+| 2026-03-22 | mlx_smoke | Baseline MLX 200 iters | 2.4081 | 13,169,730 | 200 | ~155 | Crash test → Works |
+| 2026-03-22 | ex1_baseline | Baseline MLX 500 iters | 2.1850 | — | 500 | ~155 | Baseline reference → Recorded |
+| 2026-03-22 | ex1_fewer_layers | 5 layers MLX 500 iters | 2.1845 | — | 500 | ~100 | Fewer layers worse? → No difference at 500 iters |
+| 2026-03-22 | ex1_wider_mlp | 3x MLP MLX 500 iters | 2.1868 | — | 500 | ~155 | Wider MLP better? → Slightly worse (undertrained) |
+| 2026-03-22 | baseline_1gpu | Baseline 1xH100 10min | 1.3412 | 13,169,730 | ~1,235 | ~485 | 1xH100 reference → 1.34 (fewer steps than 8xH100) |
+| 2026-03-22 | sota_1gpu | SOTA on 1xH100 10min | 1.5223 | 15,668,352 | ~670 | ~900 | SOTA on 1xH100 → WORSE than baseline (too few steps, eval >10min). 1xH100 cannot run SOTA. |
+| 2026-03-22 | sota_8gpu_verify | SOTA on 8xH100 (Phase 0) | 1.1463 | — | ~6,700 | ~89 | Reproduce SOTA → 1.1463 (within 0.004 of reference 1.1428). Phase 0 PASSED. |
+| 2026-03-22 | stride32_8gpu | SOTA + EVAL_STRIDE=32, 8xH100 | **1.1430** | — | ~6,587 | ~91 | Stride=32 better than 64? → **YES, -0.0033 bpb, -0.0054 nats.** Eval 341s (within budget). |
+| 2026-03-23 | pr486_baseline | PR #486 UNMODIFIED, 8xH100, seed 42 | **1.1249** | 13,327,625 | 5,148 | ~116 | Establish unmodified baseline → **1.1249 bpb. Pre-TTT: 1.1911. TTT gain: -0.066. Artifact 13.3MB (2.67MB headroom). Eval 524s total (within 600s).** |
+| 2026-03-23 | v6_fixed | v6.0 full stack (GPTQ off, TTT 5ep, In-Place fixed), 8xH100 | **1.3174** | 16,034,682 | 5,075 | ~118 | v6.0 better than baseline? → **NO. +0.193 bpb WORSE. In-Place TTT DESTROYED the model (avg_loss increasing). Artifact 34KB over 16MB. Eval >1000s (over budget). ABANDON In-Place TTT.** |
+| 2026-03-25 | v8_seed42 | v8.0 AdamW TTT on PR #549, seed 42, 8xH100 | **1.0706** | 17,121,847 | ~6,000 | ~86 | AdamW TTT on SOTA? → **YES, -0.075 bpb from TTT. But the artifact used zlib (17.1MB, OVER). Seed 42 only.** |
+| 2026-03-25 | v8_seed1337_v2 | v8.0 retry after cache clear, seed 1337, 8xH100 | **1.0699** | 15,757,968 | ~6,000 | ~86 | Retry with zstd + cleared cache → **PASS. 15.76MB artifact, no crash, 1.0699 bpb.** |
+| 2026-03-25 | v8_seed2024_v2 | v8.0 retry, seed 2024, 8xH100 | **1.0702** | 16,105,594 | ~6,000 | ~86 | Seed 2024 → **1.0702 bpb but artifact 16.11MB (+106KB over). Needs fix.** |