Commits

36 commits
8500dd2
Add minimal experiment tracking scaffold
nprime06 Apr 14, 2026
3a4b9e2
Reorganize repo by axis of experimentation
nprime06 Apr 14, 2026
8978b21
Populate quantization axis with PR-1019 analysis + experiment plan
nprime06 Apr 14, 2026
c41a2f2
Rewrite quantization report against PR-1493 (actual merged SOTA at 1.…
nprime06 Apr 14, 2026
26357c7
Add PR-1493 reproduction script that saves bundle for offline quant i…
nprime06 Apr 16, 2026
b0dff24
Add Modal launcher for PR-1493 bundle reproduction
nprime06 Apr 16, 2026
45faac5
Guard REPO_ROOT against short parents on Modal container
nprime06 Apr 16, 2026
bc2dcfb
Log pr1493_bundle_seed42 reproduction run
nprime06 Apr 16, 2026
2ed23ad
Add quantize_bundle.py + modal quantize mode
nprime06 Apr 16, 2026
e0d6985
Create logs/ dir before log() writes in quantize_bundle.py
nprime06 Apr 16, 2026
4e90428
Log pr1493_quantize_reference_v2 reference quantization run
nprime06 Apr 16, 2026
3db2cc4
Add flash_attn_3 SDPA fallback + disable sliding by default
nprime06 Apr 17, 2026
7b9ae43
Add FORCE_SDPA_FALLBACK flag + manual attention fallback
nprime06 Apr 17, 2026
de8a303
Restore torch.compile — model requires it for correct eval
nprime06 Apr 17, 2026
e727eac
Add NF (NormalFloat) quantization support to GPTQ pipeline
nprime06 Apr 18, 2026
99132dc
Optimize NF quantizer: searchsorted instead of argmin
nprime06 Apr 18, 2026
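The searchsorted-for-argmin swap named in the commit above can be sketched as a minimal numpy illustration of nearest-level lookup against a sorted codebook. This is a hypothetical standalone function, not the repo's actual NF quantizer; the function name and the toy 4-entry grid are assumptions:

```python
import numpy as np

def quantize_to_levels(w, levels):
    """Map each weight to the nearest entry of a sorted 1-D level grid.

    np.searchsorted does a binary search (O(n log k)) instead of a full
    pairwise argmin over |w - level| (O(n * k)). `levels` must be sorted.
    """
    w = np.asarray(w, dtype=np.float64).ravel()
    # Index of the first level >= w, clipped so idx-1 and idx are both valid.
    idx = np.searchsorted(levels, w).clip(1, len(levels) - 1)
    left, right = levels[idx - 1], levels[idx]
    # Snap to whichever neighbor is closer.
    return np.where((w - left) > (right - w), right, left)

# Toy 4-level grid standing in for a NormalFloat codebook.
levels = np.array([-1.0, -0.3, 0.3, 1.0])
print(quantize_to_levels([0.9, -0.4, -2.0], levels))  # → [ 1.  -0.3 -1. ]
```

The result is identical to the brute-force argmin, which makes the swap a safe drop-in optimization when the codebook is sorted.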
58901ed
Add tiered quantization: NF3 for T5, int5 k=6 for T4, baseline for T1-T3
nprime06 Apr 19, 2026
fe3ec23
Add tiered_k: all int5, adaptive k per sensitivity tier
nprime06 Apr 19, 2026
56f921a
Add post-hoc magnitude pruning before GPTQ
nprime06 Apr 19, 2026
d7520f8
Add Hessian-aware pruning: importance = |w| * sqrt(H_diag_col)
nprime06 Apr 19, 2026
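The importance rule `|w| * sqrt(H_diag_col)` from the commit above can be illustrated with a small numpy sketch. This is a hypothetical standalone mask builder under the usual GPTQ convention (Hessian diagonal indexed by input column), not the repo's pruning code:

```python
import numpy as np

def hessian_prune_mask(W, H_diag, sparsity):
    """Keep-mask for unstructured pruning with importance = |w| * sqrt(H_diag_col).

    W:       (out, in) weight matrix.
    H_diag:  (in,) diagonal of the GPTQ Hessian ~ X X^T; one entry per
             input column, so each column's |w| is scaled by its
             activation energy before ranking.
    """
    importance = np.abs(W) * np.sqrt(H_diag)[None, :]
    k = int(sparsity * W.size)                        # weights to zero out
    mask = np.ones(W.size, dtype=bool)
    mask[np.argsort(importance.ravel())[:k]] = False  # prune the k least important
    return mask.reshape(W.shape)

W = np.array([[1.0, 0.1],
              [0.2, 2.0]])
H_diag = np.array([4.0, 1.0])
print(hessian_prune_mask(W, H_diag, sparsity=0.5))
```

With these toy values, scaling by `sqrt(H_diag)` rescues the small weight in the high-energy column, so magnitude alone and Hessian-aware ranking pick different survivors.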
fb1a928
Fix device mismatch in Hessian pruning + add train_with_pruning.py
nprime06 Apr 19, 2026
c3dbe6b
Add 'large' and 'random' pruning methods for comparison
nprime06 Apr 19, 2026
ca71c3f
Fix pruning schedule: use frac-based progress + enforce sparsity on EMA
nprime06 Apr 19, 2026
139a3b7
Add SparseGPT: joint sparsification + quantization in GPTQ sweep
nprime06 Apr 20, 2026
359e9fa
Pass SPARSITY_THRESHOLD through modal launcher
nprime06 Apr 20, 2026
b43e049
Add WD taper support (PR-1729 style)
nprime06 Apr 20, 2026
0cafff9
Add WD taper to train_save_bundle.py + modal launcher
nprime06 Apr 20, 2026
25d5abc
Add Cautious WD to Muon optimizer
nprime06 Apr 21, 2026
1ef79c1
Pass CAUTIOUS_WD through modal launcher
nprime06 Apr 21, 2026
bc67869
Comprehensive experiment log for quantization axis
nprime06 Apr 21, 2026
93425cf
Add wide-zero grid quantizer (reverse-NF)
nprime06 Apr 21, 2026
32ed59d
Extended NF: configurable tail coverage + composable SparseGPT
nprime06 Apr 21, 2026
6407f3f
Pass NUM_LAYERS and PARALLEL_RESIDUAL_START through modal launcher
nprime06 Apr 21, 2026
075aa9b
Pass XSA_LAST_N=num_layers to ensure all layers get XSA
nprime06 Apr 21, 2026
760d91b
Pass NUM_LAYERS/PARALLEL_RESIDUAL_START/XSA_LAST_N through quantize path
nprime06 Apr 21, 2026
62b0473
Non-record: Negative Results Compendium — 14 failed/marginal experime…
nprime06 Apr 30, 2026
4 changes: 3 additions & 1 deletion .gitignore
@@ -8,4 +8,6 @@ data/manifest.json
 data/docs_selected.jsonl
 .mypy_cache/
 .venv
-logs/
+logs/
+*.pt
+*.ptz
16 changes: 16 additions & 0 deletions axes/README.md
@@ -0,0 +1,16 @@
# Axes

Experimentation is organized by axis. Taxonomy is derived from [`../research/AXES_ANALYSIS.md`](../research/AXES_ANALYSIS.md), which catalogs every merged record by axis.

| Axis | Dir | Current status |
|------|-----|---------------|
| Architecture | [`architecture/`](architecture/) | — |
| Attention | [`attention/`](attention/) | — |
| Quantization | [`quantization/`](quantization/) | — |
| Optimizer | [`optimizer/`](optimizer/) | — |
| Training dynamics | [`training/`](training/) | — |
| Data & tokenizer | [`data/`](data/) | — |
| Eval-time adaptation | [`eval-time/`](eval-time/) | — |
| Compression | [`compression/`](compression/) | — |

Each axis dir has a `README.md` that tracks hypotheses, experiments run, findings, and next steps for that axis.
22 changes: 22 additions & 0 deletions axes/architecture/README.md
@@ -0,0 +1,22 @@
# Architecture

Reference: [`research/AXES_ANALYSIS.md#axis-1-architecture`](../../research/AXES_ANALYSIS.md)

*Layer structure, residual patterns, skip connections, recurrence, auxiliary input features, MLP design.*

## Hypothesis

_What we think is unexploited._

## Experiments

| ID | Date | Branch | Config | val_bpb | Base | Notes |
|----|------|--------|--------|---------|------|-------|

## Findings

-

## Next

-
22 changes: 22 additions & 0 deletions axes/attention/README.md
@@ -0,0 +1,22 @@
# Attention

Reference: [`research/AXES_ANALYSIS.md#axis-2-attention`](../../research/AXES_ANALYSIS.md)

*Attention variants (XSA, SWA, MLA, sliding windows), RoPE tweaks, QK gain, attention sparsity.*

## Hypothesis

_What we think is unexploited._

## Experiments

| ID | Date | Branch | Config | val_bpb | Base | Notes |
|----|------|--------|--------|---------|------|-------|

## Findings

-

## Next

-
22 changes: 22 additions & 0 deletions axes/compression/README.md
@@ -0,0 +1,22 @@
# Compression

Reference: [`research/AXES_ANALYSIS.md#axis-8-compression`](../../research/AXES_ANALYSIS.md)

*Artifact compression: zstd / brotli / LZMA / ANS, bit-packing, code-string compression, per-group grids.*

## Hypothesis

_What we think is unexploited._

## Experiments

| ID | Date | Branch | Config | val_bpb | Base | Notes |
|----|------|--------|--------|---------|------|-------|

## Findings

-

## Next

-
28 changes: 28 additions & 0 deletions axes/data/README.md
@@ -0,0 +1,28 @@
# Data & tokenizer

Reference: [`research/AXES_ANALYSIS.md#axis-6-data--tokenizer`](../../research/AXES_ANALYSIS.md)

*Tokenizer choice (SP1024 / SP4096 / SP8192 / BPE variants), shard ordering, data filtering, FineWeb-Edu substitution, Rho-1-style selective LM, curriculum.*

## Hypothesis

The challenge train stream appears to be a frozen shuffled snapshot rather than a deliberately education-filtered corpus, so a cleaner train-only substitution may help under the 600s token budget even if it introduces some validation-distribution mismatch.

## Experiments

| ID | Date | Branch | Config | val_bpb | Base | Notes |
|----|------|--------|--------|---------|------|-------|
| `fwedu100-sp8192-pr1493` | 2026-04-14 | `main` | Official SP8192 tokenizer + official val + FineWeb-Edu-only train shards | pending | PR1493 | First directional probe: pure Edu tests the distribution-shift ceiling before mixed-train follow-ups. |

## Findings

- The published challenge docs are a frozen shuffled export, not obviously a FineWeb-Edu-style cleaned subset.
- With 10-minute training, the model only sees an early slice of the train stream, so train data order and mixing strategy matter.

## Next

- Build the `100% FineWeb-Edu` train-only variant with the original SP8192 tokenizer and unchanged val split.
- Run the merged PR1493 command against the alternate `DATA_DIR`.
- If pure Edu loses, try an interleaved original/Edu mix before touching the tokenizer.

Reference runbook: [fineweb_edu_sp8192.md](fineweb_edu_sp8192.md)
22 changes: 22 additions & 0 deletions axes/eval-time/README.md
@@ -0,0 +1,22 @@
# Eval-time adaptation

Reference: [`research/AXES_ANALYSIS.md#axis-7-eval-time-adaptation`](../../research/AXES_ANALYSIS.md)

*TTT (LoRA / legal / dTTT / score-first), sliding window eval, stride, causality-compliant methods. Note: SLOT was shown illegal (PR-1240).*

## Hypothesis

_What we think is unexploited._

## Experiments

| ID | Date | Branch | Config | val_bpb | Base | Notes |
|----|------|--------|--------|---------|------|-------|

## Findings

-

## Next

-
22 changes: 22 additions & 0 deletions axes/optimizer/README.md
@@ -0,0 +1,22 @@
# Optimizer

Reference: [`research/AXES_ANALYSIS.md#axis-4-optimizer`](../../research/AXES_ANALYSIS.md)

*Muon / AdamW / MuonEq variants, parameter sharding, momentum precision, parallel-muon, weight decay.*

## Hypothesis

_What we think is unexploited._

## Experiments

| ID | Date | Branch | Config | val_bpb | Base | Notes |
|----|------|--------|--------|---------|------|-------|

## Findings

-

## Next

-