Commits

36 commits
8500dd2
Add minimal experiment tracking scaffold
nprime06 Apr 14, 2026
3a4b9e2
Reorganize repo by axis of experimentation
nprime06 Apr 14, 2026
8978b21
Populate quantization axis with PR-1019 analysis + experiment plan
nprime06 Apr 14, 2026
c41a2f2
Rewrite quantization report against PR-1493 (actual merged SOTA at 1.…
nprime06 Apr 14, 2026
26357c7
Add PR-1493 reproduction script that saves bundle for offline quant i…
nprime06 Apr 16, 2026
b0dff24
Add Modal launcher for PR-1493 bundle reproduction
nprime06 Apr 16, 2026
45faac5
Guard REPO_ROOT against short parents on Modal container
nprime06 Apr 16, 2026
bc2dcfb
Log pr1493_bundle_seed42 reproduction run
nprime06 Apr 16, 2026
2ed23ad
Add quantize_bundle.py + modal quantize mode
nprime06 Apr 16, 2026
e0d6985
Create logs/ dir before log() writes in quantize_bundle.py
nprime06 Apr 16, 2026
4e90428
Log pr1493_quantize_reference_v2 reference quantization run
nprime06 Apr 16, 2026
3db2cc4
Add flash_attn_3 SDPA fallback + disable sliding by default
nprime06 Apr 17, 2026
7b9ae43
Add FORCE_SDPA_FALLBACK flag + manual attention fallback
nprime06 Apr 17, 2026
de8a303
Restore torch.compile — model requires it for correct eval
nprime06 Apr 17, 2026
e727eac
Add NF (NormalFloat) quantization support to GPTQ pipeline
nprime06 Apr 18, 2026
99132dc
Optimize NF quantizer: searchsorted instead of argmin
nprime06 Apr 18, 2026
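The searchsorted-for-argmin swap named in the commit above can be sketched as a minimal numpy illustration of nearest-level lookup against a sorted codebook. This is a hypothetical standalone function, not the repo's actual NF quantizer; the function name and the toy 4-entry grid are assumptions:

```python
import numpy as np

def quantize_to_levels(w, levels):
    """Map each weight to the nearest entry of a sorted 1-D level grid.

    np.searchsorted does a binary search (O(n log k)) instead of a full
    pairwise argmin over |w - level| (O(n * k)). `levels` must be sorted.
    """
    w = np.asarray(w, dtype=np.float64).ravel()
    # Index of the first level >= w, clipped so idx-1 and idx are both valid.
    idx = np.searchsorted(levels, w).clip(1, len(levels) - 1)
    left, right = levels[idx - 1], levels[idx]
    # Snap to whichever neighbor is closer.
    return np.where((w - left) > (right - w), right, left)

# Toy 4-level grid standing in for a NormalFloat codebook.
levels = np.array([-1.0, -0.3, 0.3, 1.0])
print(quantize_to_levels([0.9, -0.4, -2.0], levels))  # → [ 1.  -0.3 -1. ]
```

The result is identical to the brute-force argmin, which makes the swap a safe drop-in optimization when the codebook is sorted.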
58901ed
Add tiered quantization: NF3 for T5, int5 k=6 for T4, baseline for T1-T3
nprime06 Apr 19, 2026
fe3ec23
Add tiered_k: all int5, adaptive k per sensitivity tier
nprime06 Apr 19, 2026
56f921a
Add post-hoc magnitude pruning before GPTQ
nprime06 Apr 19, 2026
d7520f8
Add Hessian-aware pruning: importance = |w| * sqrt(H_diag_col)
nprime06 Apr 19, 2026
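The importance rule `|w| * sqrt(H_diag_col)` from the commit above can be illustrated with a small numpy sketch. This is a hypothetical standalone mask builder under the usual GPTQ convention (Hessian diagonal indexed by input column), not the repo's pruning code:

```python
import numpy as np

def hessian_prune_mask(W, H_diag, sparsity):
    """Keep-mask for unstructured pruning with importance = |w| * sqrt(H_diag_col).

    W:       (out, in) weight matrix.
    H_diag:  (in,) diagonal of the GPTQ Hessian ~ X X^T; one entry per
             input column, so each column's |w| is scaled by its
             activation energy before ranking.
    """
    importance = np.abs(W) * np.sqrt(H_diag)[None, :]
    k = int(sparsity * W.size)                        # weights to zero out
    mask = np.ones(W.size, dtype=bool)
    mask[np.argsort(importance.ravel())[:k]] = False  # prune the k least important
    return mask.reshape(W.shape)

W = np.array([[1.0, 0.1],
              [0.2, 2.0]])
H_diag = np.array([4.0, 1.0])
print(hessian_prune_mask(W, H_diag, sparsity=0.5))
```

With these toy values, scaling by `sqrt(H_diag)` rescues the small weight in the high-energy column, so magnitude alone and Hessian-aware ranking pick different survivors.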
fb1a928
Fix device mismatch in Hessian pruning + add train_with_pruning.py
nprime06 Apr 19, 2026
c3dbe6b
Add 'large' and 'random' pruning methods for comparison
nprime06 Apr 19, 2026
ca71c3f
Fix pruning schedule: use frac-based progress + enforce sparsity on EMA
nprime06 Apr 19, 2026
139a3b7
Add SparseGPT: joint sparsification + quantization in GPTQ sweep
nprime06 Apr 20, 2026
359e9fa
Pass SPARSITY_THRESHOLD through modal launcher
nprime06 Apr 20, 2026
b43e049
Add WD taper support (PR-1729 style)
nprime06 Apr 20, 2026
0cafff9
Add WD taper to train_save_bundle.py + modal launcher
nprime06 Apr 20, 2026
25d5abc
Add Cautious WD to Muon optimizer
nprime06 Apr 21, 2026
1ef79c1
Pass CAUTIOUS_WD through modal launcher
nprime06 Apr 21, 2026
bc67869
Comprehensive experiment log for quantization axis
nprime06 Apr 21, 2026
93425cf
Add wide-zero grid quantizer (reverse-NF)
nprime06 Apr 21, 2026
32ed59d
Extended NF: configurable tail coverage + composable SparseGPT
nprime06 Apr 21, 2026
6407f3f
Pass NUM_LAYERS and PARALLEL_RESIDUAL_START through modal launcher
nprime06 Apr 21, 2026
075aa9b
Pass XSA_LAST_N=num_layers to ensure all layers get XSA
nprime06 Apr 21, 2026
760d91b
Pass NUM_LAYERS/PARALLEL_RESIDUAL_START/XSA_LAST_N through quantize path
nprime06 Apr 21, 2026
62b0473
Non-record: Negative Results Compendium — 14 failed/marginal experime…
nprime06 Apr 30, 2026
4 changes: 3 additions & 1 deletion .gitignore
@@ -8,4 +8,6 @@ data/manifest.json
 data/docs_selected.jsonl
 .mypy_cache/
 .venv
-logs/
+logs/
+*.pt
+*.ptz
16 changes: 16 additions & 0 deletions axes/README.md
@@ -0,0 +1,16 @@
# Axes

Experimentation is organized by axis. Taxonomy is derived from [`../research/AXES_ANALYSIS.md`](../research/AXES_ANALYSIS.md), which catalogs every merged record by axis.

| Axis | Dir | Current status |
|------|-----|---------------|
| Architecture | [`architecture/`](architecture/) | — |
| Attention | [`attention/`](attention/) | — |
| Quantization | [`quantization/`](quantization/) | — |
| Optimizer | [`optimizer/`](optimizer/) | — |
| Training dynamics | [`training/`](training/) | — |
| Data & tokenizer | [`data/`](data/) | — |
| Eval-time adaptation | [`eval-time/`](eval-time/) | — |
| Compression | [`compression/`](compression/) | — |

Each axis dir has a `README.md` that tracks hypotheses, experiments run, findings, and next steps for that axis.
22 changes: 22 additions & 0 deletions axes/architecture/README.md
@@ -0,0 +1,22 @@
# Architecture

Reference: [`research/AXES_ANALYSIS.md#axis-1-architecture`](../../research/AXES_ANALYSIS.md)

*Layer structure, residual patterns, skip connections, recurrence, auxiliary input features, MLP design.*

## Hypothesis

_What we think is unexploited._

## Experiments

| ID | Date | Branch | Config | val_bpb | Base | Notes |
|----|------|--------|--------|---------|------|-------|

## Findings

-

## Next

-
22 changes: 22 additions & 0 deletions axes/attention/README.md
@@ -0,0 +1,22 @@
# Attention

Reference: [`research/AXES_ANALYSIS.md#axis-2-attention`](../../research/AXES_ANALYSIS.md)

*Attention variants (XSA, SWA, MLA, sliding windows), RoPE tweaks, QK gain, attention sparsity.*

## Hypothesis

_What we think is unexploited._

## Experiments

| ID | Date | Branch | Config | val_bpb | Base | Notes |
|----|------|--------|--------|---------|------|-------|

## Findings

-

## Next

-
22 changes: 22 additions & 0 deletions axes/compression/README.md
@@ -0,0 +1,22 @@
# Compression

Reference: [`research/AXES_ANALYSIS.md#axis-8-compression`](../../research/AXES_ANALYSIS.md)

*Artifact compression: zstd / brotli / LZMA / ANS, bit-packing, code-string compression, per-group grids.*

## Hypothesis

_What we think is unexploited._

## Experiments

| ID | Date | Branch | Config | val_bpb | Base | Notes |
|----|------|--------|--------|---------|------|-------|

## Findings

-

## Next

-
28 changes: 28 additions & 0 deletions axes/data/README.md
@@ -0,0 +1,28 @@
# Data & tokenizer

Reference: [`research/AXES_ANALYSIS.md#axis-6-data--tokenizer`](../../research/AXES_ANALYSIS.md)

*Tokenizer choice (SP1024 / SP4096 / SP8192 / BPE variants), shard ordering, data filtering, FineWeb-Edu substitution, Rho-1-style selective LM, curriculum.*

## Hypothesis

The challenge train stream appears to be a frozen shuffled snapshot rather than a deliberately education-filtered corpus, so a cleaner train-only substitution may help under the 600s token budget even if it introduces some validation-distribution mismatch.

## Experiments

| ID | Date | Branch | Config | val_bpb | Base | Notes |
|----|------|--------|--------|---------|------|-------|
| `fwedu100-sp8192-pr1493` | 2026-04-14 | `main` | Official SP8192 tokenizer + official val + FineWeb-Edu-only train shards | pending | PR1493 | First directional probe: pure Edu tests the distribution-shift ceiling before mixed-train follow-ups. |

## Findings

- The published challenge docs are a frozen shuffled export, not obviously a FineWeb-Edu-style cleaned subset.
- With 10-minute training, the model only sees an early slice of the train stream, so train data order and mixing strategy matter.

## Next

- Build the `100% FineWeb-Edu` train-only variant with the original SP8192 tokenizer and unchanged val split.
- Run the merged PR1493 command against the alternate `DATA_DIR`.
- If pure Edu loses, try an interleaved original/Edu mix before touching the tokenizer.

Reference runbook: [fineweb_edu_sp8192.md](fineweb_edu_sp8192.md)
22 changes: 22 additions & 0 deletions axes/eval-time/README.md
@@ -0,0 +1,22 @@
# Eval-time adaptation

Reference: [`research/AXES_ANALYSIS.md#axis-7-eval-time-adaptation`](../../research/AXES_ANALYSIS.md)

*TTT (LoRA / legal / dTTT / score-first), sliding window eval, stride, causality-compliant methods. Note: SLOT was shown illegal (PR-1240).*

## Hypothesis

_What we think is unexploited._

## Experiments

| ID | Date | Branch | Config | val_bpb | Base | Notes |
|----|------|--------|--------|---------|------|-------|

## Findings

-

## Next

-
22 changes: 22 additions & 0 deletions axes/optimizer/README.md
@@ -0,0 +1,22 @@
# Optimizer

Reference: [`research/AXES_ANALYSIS.md#axis-4-optimizer`](../../research/AXES_ANALYSIS.md)

*Muon / AdamW / MuonEq variants, parameter sharding, momentum precision, parallel-muon, weight decay.*

## Hypothesis

_What we think is unexploited._

## Experiments

| ID | Date | Branch | Config | val_bpb | Base | Notes |
|----|------|--------|--------|---------|------|-------|

## Findings

-

## Next

-