nishant-resolve-ai
diff --git a/autoresearch/best_ideas.md b/autoresearch/best_ideas.md
Lines changed: 128 additions & 0 deletions
diff --git a/autoresearch/generate_next.py b/autoresearch/generate_next.py
Lines changed: 236 additions & 0 deletions
@@ -0,0 +1,128 @@
+# Best Ideas for Parameter Golf
+
+Ranked by expected impact, based on analysis of top leaderboard submissions (as of 2026-03-23).
+
+## Current Merged SOTA: 1.1233 BPB (signalrush, PR #414)
+Architecture: 11L, d=512, 8 heads (4 KV, GQA), MLP 3x (1536), relu², U-Net skips,
+Efficient Partial XSA on last 4 layers, Partial RoPE (16/64 dims), LN Scale 1/sqrt(layer+1),
+SmearGate + BigramHash (2048 buckets), tied embeddings, logit softcap=30.
+Training: Muon (lr=0.025, WD=0.04), EMA (decay=0.997), SWA (every 50 steps when scale<0.2),
+Late QAT (STE int6 @ lr_scale<0.15), warmdown=3500 iters. GPTQ-lite clip search post-training.
+
+## Open PR SOTA: 1.0672 BPB (JoeProAI, PR #462)
+Same base + SwiGLU/StarReLU MLP (hidden=1792), XSA on ALL layers, U-Net + AdamW TTT at eval.
+TTT is the biggest single win (~0.05 BPB). Architecture changes that improve TTT are highest priority.
+
+## Tier 1: High Impact (proven on leaderboard)
+
+### 1. AdamW TTT (Test-Time Training)
+- Fine-tune ALL model weights on validation data at eval time
+- AdamW optimizer, lr=0.0005, 10-30 epochs, cosine schedule
+- Grad clip 1.0, all layers unfrozen
+- **Impact**: -0.053 to -0.061 BPB (massive)
+- **Key insight**: Architecture matters for TTT effectiveness. U-Net + gated skips create
+  smoother loss geometry that TTT can exploit more effectively.
+- **Note**: TTT happens at eval time on H100s. For MLX autoresearch, focus on the
+  architecture that maximizes TTT effectiveness, not TTT itself.
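The schedule and clipping settings listed for AdamW TTT can be sketched in plain Python. This is a minimal illustration, not the submission's code: `cosine_lr` and `clip_by_global_norm` are hypothetical helper names, and decaying all the way to zero with no warmup is an assumption, since the notes only say "cosine schedule".

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float = 0.0005, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at step == total_steps."""
    progress = min(max(step, 0), total_steps) / max(total_steps, 1)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads: list[float], max_norm: float = 1.0) -> list[float]:
    """Rescale a flat gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]
```

In a real TTT loop the clipped gradients would feed an AdamW step over all unfrozen parameters, with `cosine_lr` supplying the learning rate for each step.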
+
+### 2. U-Net Skip Connections
+- Split layers into encoder (first N) and decoder (last M)
+- Encoder layers push outputs onto a stack; decoder layers pop + combine via learned sigmoid gates
+- `gate * x + (1-gate) * (skip_weight * skip)` where gate and skip_weight are per-dim learnable
+- **Impact**: Enables 2.8x more TTT gain vs standard architecture
+- **Synergy**: Critical for TTT effectiveness
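The stack-based skip wiring and the gate formula can be sketched with plain Python lists standing in for activation vectors. The names `gated_skip` and `unet_forward` are hypothetical, and two details are assumptions not stated in the notes: the gate is the sigmoid of a learnable per-dimension logit, and the skip is blended in before each decoder layer runs.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gated_skip(x, skip, gate_logits, skip_weight):
    """Per-dimension blend: gate * x + (1 - gate) * (skip_weight * skip)."""
    out = []
    for xi, si, logit, sw in zip(x, skip, gate_logits, skip_weight):
        gate = sigmoid(logit)
        out.append(gate * xi + (1.0 - gate) * (sw * si))
    return out


def unet_forward(x, encoder_layers, decoder_layers, gate_logits, skip_weights):
    """Encoder layers push their outputs; decoder layers pop them in reverse order."""
    stack = []
    for layer in encoder_layers:
        x = layer(x)
        stack.append(x)
    for i, layer in enumerate(decoder_layers):
        if stack:  # assumption: decoder layers beyond the stack depth get no skip
            x = gated_skip(x, stack.pop(), gate_logits[i], skip_weights[i])
        x = layer(x)
    return x
```

Because the stack is last-in, first-out, the first decoder layer is paired with the last encoder layer, which gives the symmetric U-Net pattern.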
+
+### 3. XSA (Exclusive Self-Attention)
+- After standard attention output y = softmax(QK^T)V, subtract self-value projection:
+  `y_out = y - proj(y, normalize(v))`
+- Forces attention to encode novel cross-token information, not repeat values
+- Apply to ALL layers (not just last 4) for best results
+- **Impact**: -0.002 BPB standalone, but compounds with other techniques
+- **Cost**: ~3ms/step extra on H100
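For a single token, the correction `y - proj(y, normalize(v))` removes the component of the attention output that points along that token's own value vector. The sketch below is illustrative only: `xsa_correct` is a hypothetical name, it handles one token and one head, and the `eps` guard against a zero-norm value vector is an assumption.

```python
import math


def xsa_correct(y: list[float], v: list[float], eps: float = 1e-8) -> list[float]:
    """Subtract from y its projection onto the unit vector in the direction of v."""
    norm = math.sqrt(sum(c * c for c in v))
    v_hat = [c / (norm + eps) for c in v]
    coeff = sum(a * b for a, b in zip(y, v_hat))  # length of y along v_hat
    return [a - coeff * b for a, b in zip(y, v_hat)]
```

After the correction the output is (numerically) orthogonal to the token's own value, so whatever remains has to come from other positions.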
+
+### 4. SwiGLU / StarReLU MLP
+- Replace relu² with StarReLU: `relu(x)^2 * scale + bias` (per-channel learnable)
+- Or SwiGLU gating: `silu(W_gate * x) * (W_up * x)`
+- Top submission (#462) uses StarReLU with hidden_dim=1792
+- **Impact**: Improves both base model and TTT effectiveness
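The two activation formulas can be written out directly. This is a small sketch on Python lists, with hypothetical function names; the down projection that follows either activation in a full MLP block is omitted, and the per-channel `scale` and `bias` follow the description in these notes rather than any particular library.

```python
import math


def star_relu(x, scale, bias):
    """relu(x)^2 * scale + bias, with one learnable scale and bias per channel."""
    return [max(xi, 0.0) ** 2 * s + b for xi, s, b in zip(x, scale, bias)]


def silu(z: float) -> float:
    return z / (1.0 + math.exp(-z))


def matvec(w, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def swiglu(x, w_gate, w_up):
    """silu(W_gate x) * (W_up x), elementwise over the hidden dimension."""
    gate = matvec(w_gate, x)
    up = matvec(w_up, x)
    return [silu(g) * u for g, u in zip(gate, up)]
```

SwiGLU needs two input projections where StarReLU needs one, so at a fixed parameter budget the two variants do not share the same hidden width.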
+
+### 5. EMA (Exponential Moving Average)
+- Maintain shadow copy of all weights: `ema = decay * ema + (1-decay) * weights`
+- decay=0.997, applied every step, stored in fp32
+- Use EMA weights as base for quantization (smoother → less quantization damage)
+- **Impact**: -0.001 to -0.002 BPB, improves quantization quality
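The update rule is a one-liner per parameter. The sketch below uses a flat Python list as a stand-in for the fp32 shadow weights; `ema_update` is a hypothetical name, and how the shadow copy is initialized (typically a copy of the weights at the first step) is not specified in the notes.

```python
def ema_update(ema: list[float], weights: list[float], decay: float = 0.997) -> list[float]:
    """One EMA step: ema = decay * ema + (1 - decay) * weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


# Usage: call once per optimizer step, then quantize the shadow copy at the end.
shadow = [0.0, 0.0]
for _ in range(100):
    shadow = ema_update(shadow, [1.0, -1.0])
```

With decay=0.997 the average has an effective horizon of roughly 1 / (1 - 0.997) ≈ 333 steps.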
+
+### 6. Partial RoPE + LN Scale (NEW — merged SOTA uses both)
+- Partial RoPE: Apply RoPE to only 16/64 dims (25%). Rest are position-free.
+  Helps generalization and reduces positional overfitting.
+- LN Scale Factor: Scale LayerNorm output by `1/sqrt(layer_idx+1)`. Deeper layers get smaller
+  residual contributions. Stabilizes training, especially with more layers.
+- **Impact**: Part of every recent SOTA. Easy to implement, no artifact size cost.
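Both ideas fit in a few lines. In this sketch, `partial_rope` rotates only the first 16 of 64 head dimensions and passes the rest through unchanged, and `ln_scale` returns the per-layer factor. The function names are hypothetical, and the half-split pairing of rotated dimensions (dim `i` with dim `i + 8`) is an assumption, since implementations differ on pairing. The base of 10000 matches the default quoted in the wild-card list of `generate_next.py`.

```python
import math


def partial_rope(vec: list[float], pos: int, rot_dims: int = 16, base: float = 10000.0) -> list[float]:
    """Apply rotary position embedding to the first rot_dims dims of one head vector."""
    out = list(vec)
    half = rot_dims // 2
    for i in range(half):
        theta = pos * base ** (-2.0 * i / rot_dims)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * cos_t - b * sin_t
        out[i + half] = a * sin_t + b * cos_t
    return out  # dims rot_dims.. are untouched, so they carry no position signal


def ln_scale(layer_idx: int) -> float:
    """Scale factor applied to the LayerNorm output of layer layer_idx (0-based)."""
    return 1.0 / math.sqrt(layer_idx + 1)
```

The rotation preserves the vector norm, so partial RoPE changes only how the first 16 dimensions of queries and keys align across positions.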
+
+### 7. 11 Layers (not 9 or 10)
+- All merged SOTAs since 1.1307 use 11 layers with int6 quantization.
+- Fits under 16MB with MLP 3x and int6.
+- **Impact**: More depth = better features, especially with U-Net skips.
+
+## Tier 2: Medium Impact (proven but smaller gains)
+
+### 8. Per-Layer TTT Learning Rates
+- Measure quantization error per layer type
+- Give 3x LR to MLP output projections (most damaged by quantization)
+- Give 0.5x LR to MLP input projections (least damaged)
+- **Impact**: +23.5% TTT improvement for free
+
+### 9. GPTQ-lite Clip Search
+- Per-row optimal clipping for int quantization
+- Try 5 clip percentiles [0.999, 0.9995, 0.9999, 0.99999, 1.0], pick best MSE
+- **Impact**: -0.0006 BPB, zero training cost
+- **Cost**: Post-training only, simple to implement
+
+### 10. BigramHash Embeddings
+- Hash-based bigram lookup table (4096-12288 entries)
+- Adds local context signal to token embeddings
+- **Impact**: -0.001 to -0.002 BPB
+
+### 11. Mixed-Precision Quantization
+- Int5 for MLP weights, Int6 for attention weights
+- Bitpacking for sub-byte storage (critical for int5)
+- **Impact**: Frees ~20% bytes vs uniform int6, which can fund a wider model or more layers
+
+### 12. Stochastic Weight Averaging (SWA)
+- Snapshot model weights periodically during warmdown (every 50 steps when LR scale < 0.2)
+- Average snapshots for final model
+- Combined with EMA for dual averaging
+- **Impact**: -0.001 BPB, smoother weight distributions
+
+## Tier 3: Speculative / Lower Priority
+
+### 13. Train Larger, Quantize Harder
+- d=576 (27M params) at int5 instead of d=512 (22M) at int6
+- Lower pre-quant loss offsets coarser quantization
+- Needs extended QAT (start at lr_scale < 0.50, not 0.10)
+- **Impact**: Competitive but not yet proven better than optimized d=512
+
+### 14. Value Embeddings (ResFormer)
+- Per-layer value embedding with input-dependent gating
+- Already in Karpathy's autoresearch baseline
+- **Impact**: Small but consistent improvement
+
+### 15. Custom Compression
+- ANS/arithmetic coder tuned to weight distributions
+- Could beat zstd by 5-10%
+- **Impact**: ~1MB saved, funds more parameters
+
+### 16. Structured Sparsity (2:4)
+- Halves MLP compute, enables wider model
+- **Impact**: Unclear under 16MB constraint
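For reference, a 2:4 pattern means that every group of four consecutive weights contains at most two non-zeros. The sketch below builds such a pattern on a flat list; `prune_2_of_4` is a hypothetical name, and keeping the two largest magnitudes per group is one common selection rule, assumed here because the notes do not specify one.

```python
def prune_2_of_4(weights: list[float]) -> list[float]:
    """Zero the two smallest-magnitude entries in each consecutive group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    out = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude weights in this group
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        out.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return out
```

Whether the zeros save artifact bytes depends on how the compressed format encodes them, which is why the impact is listed as unclear.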
+
+## Strategy Notes
+
+- **For MLX autoresearch**: Focus on architecture (U-Net, XSA, MLP type, EMA) since
+  TTT runs at eval time on H100. The goal is to build the architecture that responds
+  best to TTT.
+- **Stack incrementally**: Test each technique in isolation, then combine winners.
+- **Relative signal**: MLX train loss at 500 iters is a reliable relative signal.
+  Lower train loss locally → lower val_bpb on H100.
+- **Artifact size**: Always monitor. Some techniques (wider model, more layers, bigram
+  hash) increase artifact size. Must stay under 16 MB.
@@ -0,0 +1,236 @@
+"""
+Adaptive experiment generator for Parameter Golf autoresearch.
+
+Reads completed results, analyzes what worked/didn't, and generates
+the next batch of experiments. Called by loop.sh after each batch.
+
+Strategy:
+1. Rank completed experiments by BPB
+2. Identify which technique changes improved vs hurt
+3. Generate new experiments that:
+   a. Combine top-2 individual winners
+   b. Push winning techniques further (e.g., if XSA-6 beat XSA-4, try XSA-8)
+   c. Sweep around the best hyperparameters
+   d. Try removing the worst-performing changes (simplify; not yet implemented)
+4. Fill any remaining batch slots with "wild card" experiments for exploration
+
+All generated experiments must be:
+- Legal TTT (TTT_PASSES ≤ 1)
+- Within 16MB artifact budget (estimated)
+- Not duplicates of already-run experiments
+"""
+
+import json
+import sys
+import os
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from experiments import (
+    Experiment, ExperimentResult, estimate_artifact_bytes,
+    MAX_ARTIFACT_BYTES, EXPERIMENTS,
+)
+
+STATE_FILE = Path(__file__).resolve().parent / "state.json"
+EXPERIMENTS_FILE = Path(__file__).resolve().parent.parent / "experiments.py"
+
+# Default env for legal experiments
+BASE_ENV = {"SEED": "1337", "TTT_PASSES": "1"}
+
+
+def load_state() -> dict:
+    if STATE_FILE.exists():
+        return json.loads(STATE_FILE.read_text())
+    return {"results": [], "completed_experiments": []}
+
+
+def analyze_results(results: list[dict]) -> dict:
+    """Analyze what worked and what didn't."""
+    if not results:
+        return {"winners": [], "losers": [], "baseline_bpb": None}
+
+    # Find baseline (no_ttt or legal_ttt_baseline)
+    baseline = None
+    for r in results:
+        if r.get("experiment") in ("no_ttt_baseline", "legal_ttt_baseline"):
+            if r.get("val_bpb"):
+                baseline = r
+                break
+
+    if not baseline:
+        # Use the first result with a BPB as reference
+        for r in results:
+            if r.get("val_bpb"):
+                baseline = r
+                break
+
+    if not baseline:
+        return {"winners": [], "losers": [], "baseline_bpb": None}
+
+    base_bpb = baseline["val_bpb"]
+
+    # Classify experiments
+    winners = []  # Lower BPB = better
+    losers = []
+    for r in results:
+        if not r.get("val_bpb") or r["experiment"] == baseline["experiment"]:
+            continue
+        delta = r["val_bpb"] - base_bpb
+        entry = {"experiment": r["experiment"], "val_bpb": r["val_bpb"], "delta": delta}
+        if delta < -0.0005:  # Improved by at least 0.0005
+            winners.append(entry)
+        elif delta > 0.001:  # Hurt by more than 0.001
+            losers.append(entry)
+
+    winners.sort(key=lambda x: x["delta"])
+    losers.sort(key=lambda x: x["delta"], reverse=True)
+
+    return {
+        "winners": winners,
+        "losers": losers,
+        "baseline_bpb": base_bpb,
+        "baseline_experiment": baseline["experiment"],
+    }
+
+
+def extract_env_diff(experiment_name: str) -> dict:
+    """Get the env overrides for a named experiment from EXPERIMENTS list."""
+    for exp in EXPERIMENTS:
+        if exp.name == experiment_name:
+            # Return the overrides, excluding the keys every experiment shares via BASE_ENV
+            return {k: v for k, v in exp.env.items() if k not in BASE_ENV}
+    return {}
+
+
+def generate_next_batch(analysis: dict, completed: list[str], batch_size: int = 5) -> list[Experiment]:
+    """Generate the next batch of experiments based on results analysis."""
+    new_experiments = []
+    used_names = set(completed)
+
+    def _add(name, desc, env_extra=None, patches=None):
+        if name in used_names or len(new_experiments) >= batch_size:
+            return
+        env = {**BASE_ENV}
+        if env_extra:
+            env.update(env_extra)
+        exp = Experiment(name=name, description=desc, env=env, patches=patches or [])
+        est = estimate_artifact_bytes(env)
+        if est <= MAX_ARTIFACT_BYTES:
+            new_experiments.append(exp)
+            used_names.add(name)
+
+    winners = analysis.get("winners", [])
+    losers = analysis.get("losers", [])
+
+    # Strategy 1: Combine top-2 winners
+    if len(winners) >= 2:
+        w1_env = extract_env_diff(winners[0]["experiment"])
+        w2_env = extract_env_diff(winners[1]["experiment"])
+        combined_env = {**w1_env, **w2_env}
+        name = f"combo_{winners[0]['experiment']}_plus_{winners[1]['experiment']}"[:60]
+        desc = (
+            f"Combine #1 {winners[0]['experiment']} ({winners[0]['delta']:+.4f}) "
+            f"+ #2 {winners[1]['experiment']} ({winners[1]['delta']:+.4f})"
+        )
+        _add(name, desc, combined_env)
+
+    # Strategy 2: Combine top-3 winners
+    if len(winners) >= 3:
+        combined_env = {}
+        for w in winners[:3]:
+            combined_env.update(extract_env_diff(w["experiment"]))
+        name = "combo_top3_winners"
+        desc = f"Combine top 3: {', '.join(w['experiment'] for w in winners[:3])}"
+        _add(name, desc, combined_env)
+
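+    # Strategy 3: Push winning hyperparameters further, in both directions
+    for w in winners[:3]:
+        env_diff = extract_env_diff(w["experiment"])
+        for key, val in env_diff.items():
+            try:
+                fval = float(val)
+            except (ValueError, TypeError):
+                continue  # Non-numeric override; nothing to scale
+            is_int = str(val).strip().lstrip("+-").isdigit()
+            # The baseline default is unknown here, so try both a larger and a smaller value
+            for mult, suffix in [(1.5, "more"), (0.5, "less")]:
+                new_val = fval * mult
+                # Keep integer-valued settings (layer counts, dims) integral
+                new_str = str(int(round(new_val))) if is_int else str(new_val)
+                if new_str == str(val):
+                    continue
+                # Include the key in the name so a winner with several overrides
+                # does not produce colliding names that _add would skip
+                name = f"{w['experiment']}_{key.lower()}_{suffix}"
+                desc = f"Push {key}={new_str} ({suffix} than {val})"
+                # Keep the winner's other overrides and change only this key
+                _add(name, desc, {**env_diff, key: new_str})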
+
+    # Strategy 4 (TODO): Interpolate between winner and baseline.
+    # Needs the baseline default for each overridden key, which is not available here,
+    # so no experiments are generated for this strategy yet.
+
+    # Strategy 5: Wild card — try something not yet tested
+    wild_cards = [
+        ("seq_len_4096", "Longer sequence length (4096 vs 2048)", {"TRAIN_SEQ_LEN": "4096", "EVAL_SEQ_LEN": "4096"}),
+        ("rope_base_50k", "Higher RoPE base (50000 vs 10000)", {"ROPE_BASE": "50000"}),
+        ("softcap_50", "Higher logit softcap (50 vs 30)", {"LOGIT_SOFTCAP": "50.0"}),
+        ("softcap_20", "Lower logit softcap (20 vs 30)", {"LOGIT_SOFTCAP": "20.0"}),
+        ("qk_gain_2", "Higher QK gain init (2.0 vs 1.5)", {"QK_GAIN_INIT": "2.0"}),
+        ("muon_momentum_095", "Lower Muon momentum (0.95 vs 0.99)", {"MUON_MOMENTUM": "0.95"}),
+        ("embed_lr_08", "Higher embed LR (0.8 vs 0.6)", {"EMBED_LR": "0.8"}),
+    ]
+    for name, desc, env_extra in wild_cards:
+        _add(name, desc, env_extra)
+        if len(new_experiments) >= batch_size:
+            break
+
+    return new_experiments
+
+
+def main():
+    state = load_state()
+    results = state.get("results", [])
+    completed = state.get("completed_experiments", [])
+
+    print(f"Completed experiments: {len(completed)}")
+    print(f"Results with BPB: {sum(1 for r in results if r.get('val_bpb'))}")
+
+    analysis = analyze_results(results)
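+    baseline_bpb = analysis.get("baseline_bpb")  # Key is always present but may be None
+    print(f"\nBaseline: {baseline_bpb if baseline_bpb is not None else 'N/A'}")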
+    print(f"Winners ({len(analysis['winners'])}):")
+    for w in analysis["winners"]:
+        print(f"  {w['delta']:+.4f} — {w['experiment']} ({w['val_bpb']:.4f})")
+    print(f"Losers ({len(analysis['losers'])}):")
+    for loser in analysis["losers"]:
+        print(f"  {loser['delta']:+.4f} — {loser['experiment']} ({loser['val_bpb']:.4f})")
+
+    new_batch = generate_next_batch(analysis, completed)
+    print(f"\nGenerated {len(new_batch)} new experiments:")
+    for exp in new_batch:
+        est = estimate_artifact_bytes(exp.env)
+        print(f"  {exp.name}: {exp.description} (~{est/1e6:.1f} MB)")
+
+    if new_batch:
+        # Write the batch to a separate JSON file that the provider can pick up
+        # (experiments.py itself is not modified)
+        out = Path(__file__).resolve().parent / "next_batch.json"
+        batch_data = []
+        for exp in new_batch:
+            batch_data.append({
+                "name": exp.name,
+                "description": exp.description,
+                "env": exp.env,
+                "patches": exp.patches,
+            })
+        out.write_text(json.dumps(batch_data, indent=2))
+        print(f"\nWrote {len(new_batch)} experiments to {out}")
+
+    return new_batch
+
+
+if __name__ == "__main__":
+    main()
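Strategy 4 is left unimplemented because the baseline defaults are not available inside `generate_next_batch`. The wild-card descriptions do quote several defaults ("50 vs 30", "2.0 vs 1.5", and so on), so one possible sketch reads them from a small table. Everything below is hypothetical: `BASELINE_DEFAULTS` and `interpolate_toward_baseline` are illustrative names, the values are copied from those descriptions, and whether they match the training script's real defaults is an assumption.

```python
# Defaults as quoted in the wild-card descriptions (assumed, not read from the training script)
BASELINE_DEFAULTS = {
    "TRAIN_SEQ_LEN": "2048",
    "ROPE_BASE": "10000",
    "LOGIT_SOFTCAP": "30.0",
    "QK_GAIN_INIT": "1.5",
    "MUON_MOMENTUM": "0.99",
    "EMBED_LR": "0.6",
}


def interpolate_toward_baseline(env_diff: dict[str, str], fraction: float = 0.5) -> dict[str, str]:
    """Move each numeric override `fraction` of the way from the winning value to its default."""
    out = {}
    for key, val in env_diff.items():
        default = BASELINE_DEFAULTS.get(key)
        if default is None:
            continue  # No known default for this key; cannot interpolate
        try:
            win, base = float(val), float(default)
        except ValueError:
            continue  # Non-numeric setting
        mid = win + fraction * (base - win)
        # Keep integer-valued settings integral
        is_int = "." not in val and "." not in default
        out[key] = str(int(round(mid))) if is_int else str(mid)
    return out
```

For example, a winner that set `LOGIT_SOFTCAP` to `"50.0"` would yield a follow-up experiment at `"40.0"`, halfway back to the quoted default of 30.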