fix(data): normalize cached FineWeb paths #7
Closed
RolanH wants to merge 1 commit into openai:main from
Conversation
Handle MATCHED_FINEWEB_REMOTE_ROOT_PREFIX consistently for manifest, docs, datasets, and tokenizer artifacts. Add regression tests for nested prefixes and empty-prefix manifest resolution.
dhruvjatkar
pushed a commit
to dhruvjatkar/parameter-golf
that referenced
this pull request
Mar 25, 2026
PR openai#672 maxes TTT at 30 epochs (590s/600s eval budget), so all future improvements must be orthogonal to TTT. This update:
- Sets 1.0781 BPB (PR openai#672) as the new target to beat
- Reorders Top 8 directions: XSA-all confirmed at #1, Full GPTQ #2, SwiGLU #3, Muon-VS #4, aggressive quant #5, MASA openai#6, depth recurrence openai#7 with int6 risk warning, AdEMAMix openai#8
- Deprioritizes TTT-related directions already exploited by PR openai#672
- Collapses ~1000 lines of stale Round 0-3.9 session logs into a concise historical summary
- Removes resolved blockers (flash_attn, SSH hangs, local runtime)
- Adds fresh Round 1 section with 5 submitted experiments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jzmyres
pushed a commit
to jzmyres/parameter-golf
that referenced
this pull request
Mar 31, 2026
…mization goal

Key clarifications:
1. Warm-start x is the soft embedding: initially one-hot, iteratively updated by CTP/NTP predictions across DEQ iterations
2. Soft Dense Routing: sparsity encouraged (L1 per-token) not required, balance enforced globally — applies to ALL expert groups (MLP + MoS)
3. Optimization goal: throughput × convergence rate within time budget
4. Include paper/repo references when proposing improvements
5. Fixed constraint numbering (DDP is now openai#7)
6. Added paper refs: DeepSeek-V2, GatedAttn repo, FSQ paper

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
deborahnelson8788726
pushed a commit
to deborahnelson8788726/parameter-golf
that referenced
this pull request
Apr 2, 2026
- Fix openai#1: ternary roundtrip eval on ALL ranks with dist.broadcast (was: only rank 0 loaded weights → invalid eval results)
- Fix openai#2: pass pre-computed scales to export (avoids double-quantization)
- Fix openai#3: keep scales as float32 (was: lossy float16 cast)
- Fix openai#4: import returns float32 (was: lossy bfloat16 cast)
- Fix openai#5: lower z_loss from 1e-4 to 1e-5 (prevents loss explosion)
- Fix openai#6: add dist.broadcast after int8 roundtrip load too
- Fix openai#7: add weights_only=False to suppress FutureWarning

Ternary roundtrip is now LOSSLESS (max error = 0.0). The previous val_bpb=0.9650 was an artifact of bug openai#1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
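The lossless-roundtrip claim above rests on fixes openai#2/openai#3: reuse the pre-computed scales on import and keep them in float32. A minimal numpy sketch (function name, threshold rule, and shapes are illustrative assumptions, not the repo's code) showing that re-quantizing the dequantized weights with the same float32 scales reproduces the ternary codes exactly:

```python
import numpy as np

def ternary_quantize(w, scale):
    # Map to codes in {-1, 0, +1}: threshold at half the per-row scale.
    # (Hypothetical rule for illustration; the repo's scheme may differ.)
    q = np.zeros(w.shape, dtype=np.int8)
    q[w > 0.5 * scale] = 1
    q[w < -0.5 * scale] = -1
    return q

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
# Per-row scale = mean |w|, kept in float32 (the float16 cast was the bug).
scale = np.abs(w).mean(axis=1, keepdims=True).astype(np.float32)

q = ternary_quantize(w, scale)
w_hat = q * scale                         # dequantized "export"
q_back = ternary_quantize(w_hat, scale)   # "import" with the SAME scales
print(int(np.abs(q_back - q).max()))      # → 0, i.e. lossless roundtrip
```

Because the dequantized entries are exactly 0 or ±scale, comparing them against 0.5·scale recovers every code; cast the scales to float16 anywhere in this path and the comparison can land on the wrong side of the threshold.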
taka6745
pushed a commit
to taka6745/parameter-golf
that referenced
this pull request
Apr 7, 2026
EL2 cycle-2 = 3.2742 (only +0.0008 above champion 3.2734) reversed the audit fire openai#1 verdict that EngramLite was falsified. Adding 4 new EL multi-seed experiments to confirm:
- EL3 (seed 1337), EL4 (seed 999), EL5 (seed 7)
- EL6 with L5 weights (0.15/0.20/0.15) — new combination

Removed 15 dead/falsified configs that wasted cycle 2 compute: EA*, BG*, NG*, TH*, MEGA, MTP0/2/3, MTP1_seed999, PR2/3, EL0.

Also captured EMA(0.997) canonical spec from 6 merged records (openai#287, openai#315, openai#414, openai#1019, openai#1099) — DEFERRED actual Patch 17 ship because EMA only affects final val_bpb (not loop train_loss) and training-loop anchoring is risky without reading train_gpt.py.

Queue now cycles in ~100 min (vs 185 min) leaving more compute for the EL family expansion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745
pushed a commit
to taka6745/parameter-golf
that referenced
this pull request
Apr 7, 2026
…ntified as top missing technique

Patches 15/16/21 + NEW Patch 20 USE_COPRIME_STRIDE all uncontested in 150+ open + 20 closed PRs (7 consecutive audits for the original 3, first confirmation for Patch 20 just shipped 3h ago).

CRITICAL FINDING: XSA (Cross-Sequence Attention) is in 4+ MERGED records (PR openai#1019, openai#287, openai#315, openai#265, latest openai#1099) and we have ZERO attention-mask variants. Most-validated missing technique. ~200 LOC moderate port — too big for a single research fire but worth a focused 30-45 min investigation if we can find a minimal variant.

SLOT (Score-First TTT) is the openai#2 missing (PR openai#549, ~100 LOC) but it's eval-time, joins the H100 escalation bundle category.

H100 escalation candidate updated:
NEW: CHAMP_L4 + COPRIME_STRIDE + EL + (EMA + Tilt + INT6 GPTQ)
OLD: CHAMP_L4 + EL + (EMA + Tilt + INT6 GPTQ)
Need CS2 cycle 2+3 for n=3 mean confirmation before escalating.

PR openai#1430 still OPEN, 0 comments, no comp owner activity for 16h+. Spend ~$4.00/$36 (11.1%). Pod healthy at 7h 50min uptime.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jzmyres
pushed a commit
to jzmyres/parameter-golf
that referenced
this pull request
May 3, 2026
… coef sweeps

Per user directive 2026-04-30 (feedback_throughput_priority.md): throughput-bearing iters (Triton kernels, sparse dispatch, sparse attention) take queue priority over coef-sweep follow-ups for iter 112's Gram penalty. Throughput compounds research velocity — faster step rate = more iters per unit time.

Tier 1 reordered:
openai#1 (DROPPED) iter 113
openai#2 iter 112 — IN FLIGHT
openai#3 iter 117b-2 — Triton entmax (THROUGHPUT)
openai#4 iter 117b-3 — Sparse MoE dispatch (THROUGHPUT, biggest win)
openai#5 iter 117b-3b — Sparse-Q attention (THROUGHPUT, promoted from Tier 2)
openai#6 iter 120 — RRAttention (THROUGHPUT, promoted from Tier 2)
openai#7 iter 108 — k_eval=10 throughput
openai#8 iter 110 — refinement re-enable (last)

Deferred coef sweeps (post-throughput): iter 112b/c/d. These remain conditional on iter 112 promotion AND will only run after the throughput iters are exhausted.

Anti-pattern explicitly avoided: chasing diminishing val_bpb gains via hyperparameter tuning while a 1.5-4x wallclock improvement sits unmerged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jzmyres
pushed a commit
to jzmyres/parameter-golf
that referenced
this pull request
May 3, 2026
…2048

Per user challenge 2026-04-30: "Would RRAttention hurt throughput as the optimized SDPA is replaced?" — answered yes. RRAttention is the same SDPA-replacement class as iter 106 NSA which was DROPPED 2026-04-29 because NSA = 0.42x FlashAttention at T=2048 (official fla-org Triton benchmark). At T=2048 we're in the FA-fusion + tensorcore-saturation regime; manual sparse attention loses on memory traffic, kernel launch overhead, and tensorcore utilization simultaneously.

The component file's "8/8 PASS, tau=1.0 bit-identical" claim is a correctness check, NOT a throughput check. Pure-PyTorch component cannot compete with F.scaled_dot_product_attention at T=2048.

Re-queue paths:
- flex_attention (PyTorch 2.5+) with score_mod/block_mask
- Custom Triton kernel with selection inside FA tile
- Defer until T-scaling phase (T=4096+)

Tier 1 reordered:
openai#1 (DROPPED) iter 113
openai#2 iter 112 — IN FLIGHT
openai#3 iter 117b-2 — Triton entmax (kernel-only, doesn't replace SDPA)
openai#4 iter 117b-3 — Sparse MoE dispatch (replaces MLP path, not SDPA)
openai#5 iter 117b-3b — Sparse-Q attention (smaller-Q gather; SDPA call preserved)
openai#6 iter 108 — k_eval=10 (one-line config)
openai#7 iter 110 — refinement re-enable

DEMOTED to Deferred: iter 120 (RRAttention).

New durable rule: feedback_sdpa_replacement_at_T2048.md — never queue sparse-attention iters that REPLACE F.scaled_dot_product_attention at T=2048 without a fused implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
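The "correctness check, NOT a throughput check" point can be made concrete. A pure-PyTorch attention that materializes the full T×T score matrix matches F.scaled_dot_product_attention numerically — which is exactly the kind of bit-close test that says nothing about speed, since the fused kernel never materializes that matrix. A minimal sketch (shapes illustrative, not from the repo):

```python
import math
import torch
import torch.nn.functional as F

def manual_attention(q, k, v):
    # Pure-PyTorch path: builds the full (T x T) score matrix, paying
    # memory traffic and extra kernel launches that fused SDPA avoids.
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.shape[-1])
    return torch.softmax(scores, dim=-1) @ v

# (batch, heads, T, head_dim) — small illustrative sizes
q = torch.randn(1, 4, 128, 32)
k = torch.randn(1, 4, 128, 32)
v = torch.randn(1, 4, 128, 32)

fused = F.scaled_dot_product_attention(q, k, v)
print(torch.allclose(manual_attention(q, k, v), fused, atol=1e-4))  # True
```

Both paths produce near-identical outputs, so any output-comparison test passes; the wallclock gap only shows up under profiling at realistic T, which is why the durable rule keys on having a fused implementation rather than on passing correctness checks.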
Summary
MATCHED_FINEWEB_REMOTE_ROOT_PREFIX handling for manifest, docs, dataset, and tokenizer downloads in data/cached_challenge_fineweb.py, data/datasets, and data/tokenizers

Test Plan
python3 -m unittest discover -s tests -v
python3 -m py_compile $(rg --files -g '*.py')
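The prefix normalization this PR describes (consistent joining across manifest, docs, dataset, and tokenizer paths, with nested and empty prefixes both resolving sensibly) might look like the following minimal sketch. normalize_remote_path and its strip/join behaviour are hypothetical illustrations, not the actual code in data/cached_challenge_fineweb.py:

```python
from pathlib import PurePosixPath

def normalize_remote_path(prefix: str, relative: str) -> str:
    """Join a remote root prefix with a relative artifact path.

    Trims stray slashes so nested prefixes don't produce '//' segments,
    and an empty prefix resolves to the bare relative path — the two
    cases the PR's regression tests cover.
    """
    prefix = prefix.strip("/")
    relative = relative.lstrip("/")
    if not prefix:
        return relative
    return str(PurePosixPath(prefix) / relative)

print(normalize_remote_path("", "manifest.json"))                 # manifest.json
print(normalize_remote_path("fineweb/", "manifest.json"))         # fineweb/manifest.json
print(normalize_remote_path("/a/b/", "tokenizers/tok.bin"))       # a/b/tokenizers/tok.bin
```

Applying one such helper uniformly to all four artifact kinds is what makes the behaviour "consistent": each download path differs only in the relative part, never in how the prefix is joined.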