fix(data): normalize cached FineWeb paths #7

Closed

RolanH wants to merge 1 commit into openai:main from RolanH:codex/fix-cached-fineweb-paths

Conversation


RolanH commented Mar 18, 2026

Summary

  • normalize MATCHED_FINEWEB_REMOTE_ROOT_PREFIX handling for manifest, docs, dataset, and tokenizer downloads in data/cached_challenge_fineweb.py
  • strip full multi-segment remote prefixes before mapping files into local data/datasets and data/tokenizers (see the sketch after this list)
  • add regression coverage for nested remote prefixes and empty-prefix manifest resolution
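
A minimal sketch of the intended normalization, assuming a helper along these lines; `MATCHED_FINEWEB_REMOTE_ROOT_PREFIX` is named in the diff, but its value here and the `_strip_remote_prefix` helper are illustrative, not the actual implementation:

```python
from pathlib import PurePosixPath

# Name taken from the diff; the value and helper below are illustrative only
MATCHED_FINEWEB_REMOTE_ROOT_PREFIX = "bucket/fineweb/v1"

def _strip_remote_prefix(remote_path: str,
                         prefix: str = MATCHED_FINEWEB_REMOTE_ROOT_PREFIX) -> str:
    """Strip the full multi-segment prefix, not just its first segment."""
    if not prefix:
        # Empty prefix: manifest paths resolve unchanged
        return remote_path
    parts = PurePosixPath(remote_path).parts
    head = PurePosixPath(prefix).parts
    if parts[:len(head)] == head:
        parts = parts[len(head):]
    return "/".join(parts)

# "bucket/fineweb/v1/datasets/shard-00.bin" -> "datasets/shard-00.bin",
# which then lands under local data/datasets (tokenizer files under data/tokenizers)
```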

Test Plan

  • python3 -m unittest discover -s tests -v
  • python3 -m py_compile $(rg --files -g '*.py')
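
For illustration, the regression coverage described in the summary could look like the test below; the import path and helper name are assumptions matching the sketch above, not the repo's actual test code:

```python
import unittest

# Hypothetical import; the real helper lives in data/cached_challenge_fineweb.py
from data.cached_challenge_fineweb import _strip_remote_prefix

class TestRemotePrefixNormalization(unittest.TestCase):
    def test_nested_prefix_stripped_whole(self):
        # Multi-segment prefixes must be removed in one piece
        self.assertEqual(
            _strip_remote_prefix("bucket/fineweb/v1/datasets/shard-00.bin",
                                 prefix="bucket/fineweb/v1"),
            "datasets/shard-00.bin")

    def test_empty_prefix_resolves_unchanged(self):
        self.assertEqual(_strip_remote_prefix("manifest.json", prefix=""),
                         "manifest.json")
```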

Handle MATCHED_FINEWEB_REMOTE_ROOT_PREFIX consistently for
manifest, docs, datasets, and tokenizer artifacts.

Add regression tests for nested prefixes and empty-prefix
manifest resolution.
RolanH closed this Mar 18, 2026
RolanH deleted the codex/fix-cached-fineweb-paths branch March 18, 2026 18:56
dhruvjatkar pushed a commit to dhruvjatkar/parameter-golf that referenced this pull request Mar 25, 2026
PR openai#672 maxes TTT at 30 epochs (590s/600s eval budget), so all future
improvements must be orthogonal to TTT. This update:
- Sets 1.0781 BPB (PR openai#672) as the new target to beat
- Reorders Top 8 directions: XSA-all confirmed at #1, Full GPTQ #2,
  SwiGLU #3, Muon-VS #4, aggressive quant #5, MASA #6,
  depth recurrence #7 with int6 risk warning, AdEMAMix #8
- Deprioritizes TTT-related directions already exploited by PR openai#672
- Collapses ~1000 lines of stale Round 0-3.9 session logs into a
  concise historical summary
- Removes resolved blockers (flash_attn, SSH hangs, local runtime)
- Adds fresh Round 1 section with 5 submitted experiments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jzmyres pushed a commit to jzmyres/parameter-golf that referenced this pull request Mar 31, 2026
…mization goal

Key clarifications:
1. Warm-start x is the soft embedding: initially one-hot, iteratively
   updated by CTP/NTP predictions across DEQ iterations
2. Soft Dense Routing: sparsity encouraged (L1 per-token) not required,
   balance enforced globally — applies to ALL expert groups (MLP + MoS);
   see the sketch after this list
3. Optimization goal: throughput × convergence rate within time budget
4. Include paper/repo references when proposing improvements
5. Fixed constraint numbering (DDP is now #7)
6. Added paper refs: DeepSeek-V2, GatedAttn repo, FSQ paper
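
As a rough illustration of point 2, assuming nonnegative, unnormalized gate activations; names and shapes are assumptions, not the repo's code:

```python
import torch

def soft_dense_routing_losses(gates: torch.Tensor):
    """gates: [num_tokens, num_experts], nonnegative soft routing weights."""
    # Per-token L1 penalty: encourages sparse routing but never forces it
    l1_per_token = gates.sum(dim=-1).mean()
    # Global balance: average load per expert should be uniform over the batch
    load = gates.mean(dim=0)
    balance = ((load - load.mean()) ** 2).sum()
    return l1_per_token, balance
```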

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
deborahnelson8788726 pushed a commit to deborahnelson8788726/parameter-golf that referenced this pull request Apr 2, 2026
- Fix #1: ternary roundtrip eval on ALL ranks with dist.broadcast
  (was: only rank 0 loaded weights → invalid eval results)
- Fix #2: pass pre-computed scales to export (avoids double-quantization)
- Fix #3: keep scales as float32 (was: lossy float16 cast)
- Fix #4: import returns float32 (was: lossy bfloat16 cast)
- Fix #5: lower z_loss from 1e-4 to 1e-5 (prevents loss explosion)
- Fix #6: add dist.broadcast after int8 roundtrip load too
- Fix #7: add weights_only=False to suppress FutureWarning

Ternary roundtrip is now LOSSLESS (max error = 0.0).
The previous val_bpb=0.9650 was an artifact of bug #1.
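
A minimal sketch of the Fix #1/#6 pattern, assuming an initialized torch.distributed process group; the function and checkpoint names are illustrative:

```python
import torch
import torch.distributed as dist

def load_roundtrip_weights_on_all_ranks(model, ckpt_path):
    # Rank 0 loads the quantized-roundtrip checkpoint...
    if dist.get_rank() == 0:
        state = torch.load(ckpt_path, map_location="cpu", weights_only=False)  # Fix #7
        model.load_state_dict(state)
    # ...then every parameter is broadcast so ALL ranks evaluate identical
    # weights (previously only rank 0 had them, invalidating the eval)
    for p in model.parameters():
        dist.broadcast(p.data, src=0)
```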

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
EL2 cycle-2 = 3.2742 (only +0.0008 above champion 3.2734) reversed
the audit fire #1 verdict that EngramLite was falsified. Adding 4 new
EL multi-seed experiments to confirm:
  - EL3 (seed 1337), EL4 (seed 999), EL5 (seed 7)
  - EL6 with L5 weights (0.15/0.20/0.15) — new combination

Removed 15 dead/falsified configs that wasted cycle 2 compute:
EA*, BG*, NG*, TH*, MEGA, MTP0/2/3, MTP1_seed999, PR2/3, EL0.

Also captured EMA(0.997) canonical spec from 6 merged records
(openai#287, openai#315, openai#414, openai#1019, openai#1099) — DEFERRED actual Patch 17 ship
because EMA only affects final val_bpb (not loop train_loss) and
training-loop anchoring is risky without reading train_gpt.py.
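
For reference, the deferred EMA(0.997) would amount to a shadow-weight update along these lines (a hedged sketch, not the canonical spec from the merged records):

```python
import torch

@torch.no_grad()
def ema_step(ema_params, model_params, decay=0.997):
    # Shadow weights drift toward the live weights each step; only the final
    # val_bpb (not the in-loop train_loss) would see the averaged parameters
    for ema_p, p in zip(ema_params, model_params):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)
```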

Queue now cycles in ~100 min (vs 185 min) leaving more compute
for the EL family expansion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
…ntified as top missing technique

Patches 15/16/21 + NEW Patch 20 USE_COPRIME_STRIDE all uncontested
in 150+ open + 20 closed PRs (7 consecutive audits for the original
3, first confirmation for Patch 20 just shipped 3h ago).

CRITICAL FINDING: XSA (Cross-Sequence Attention) is in 4+ MERGED
records (PR openai#1019, openai#287, openai#315, openai#265, latest openai#1099) and we have ZERO
attention-mask variants. Most-validated missing technique. ~200 LOC
moderate port — too big for a single research fire but worth a focused
30-45 min investigation if we can find a minimal variant.

SLOT (Score-First TTT) is the #2 missing technique (PR openai#549, ~100 LOC),
but it's eval-time, so it joins the H100 escalation bundle category.

H100 escalation candidate updated:
  NEW: CHAMP_L4 + COPRIME_STRIDE + EL + (EMA + Tilt + INT6 GPTQ)
  OLD: CHAMP_L4 + EL + (EMA + Tilt + INT6 GPTQ)

Need CS2 cycle 2+3 for n=3 mean confirmation before escalating.

PR openai#1430 still OPEN, 0 comments, no comp owner activity for 16h+.

Spend ~$4.00/$36 (11.1%). Pod healthy at 7h 50min uptime.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jzmyres pushed a commit to jzmyres/parameter-golf that referenced this pull request May 3, 2026
… coef sweeps

Per user directive 2026-04-30 (feedback_throughput_priority.md):
throughput-bearing iters (Triton kernels, sparse dispatch, sparse
attention) take queue priority over coef-sweep follow-ups for
iter 112's Gram penalty. Throughput compounds research velocity —
faster step rate = more iters per unit time.

Tier 1 reordered:
  #1 (DROPPED) iter 113
  #2 iter 112 — IN FLIGHT
  #3 iter 117b-2 — Triton entmax (THROUGHPUT)
  #4 iter 117b-3 — Sparse MoE dispatch (THROUGHPUT, biggest win)
  #5 iter 117b-3b — Sparse-Q attention (THROUGHPUT, promoted from Tier 2)
  #6 iter 120 — RRAttention (THROUGHPUT, promoted from Tier 2)
  #7 iter 108 — k_eval=10 throughput
  #8 iter 110 — refinement re-enable (last)

Deferred coef sweeps (post-throughput): iter 112b/c/d. These remain
conditional on iter 112 promotion AND will only run after the
throughput iters are exhausted. Anti-pattern explicitly avoided:
chasing diminishing val_bpb gains via hyperparameter tuning while a
1.5-4x wallclock improvement sits unmerged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jzmyres pushed a commit to jzmyres/parameter-golf that referenced this pull request May 3, 2026
…2048

Per user challenge 2026-04-30: "Would RRAttention hurt throughput as
the optimized SDPA is replaced?" — answered yes.

RRAttention is the same SDPA-replacement class as iter 106 NSA which
was DROPPED 2026-04-29 because NSA = 0.42x FlashAttention at T=2048
(official fla-org Triton benchmark). At T=2048 we're in the
FA-fusion + tensorcore-saturation regime; manual sparse attention
loses on memory traffic, kernel launch overhead, and tensorcore
utilization simultaneously.

The component file's "8/8 PASS, tau=1.0 bit-identical" claim is a
correctness check, NOT a throughput check. Pure-PyTorch component
cannot compete with F.scaled_dot_product_attention at T=2048.

Re-queue paths:
- flex_attention (PyTorch 2.5+) with score_mod/block_mask (sketched below)
- Custom Triton kernel with selection inside FA tile
- Defer until T-scaling phase (T=4096+)
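
A hedged sketch of the first re-queue path, using PyTorch 2.5+'s flex_attention; the local-window rule below is a stand-in for whatever selection RRAttention actually needs:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

def causal_local(b, h, q_idx, kv_idx):
    # Illustrative sparsity pattern: causal attention within a 256-token window
    return (q_idx >= kv_idx) & (q_idx - kv_idx < 256)

q = k = v = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.bfloat16)
# The mask compiles into a fused kernel path, unlike a pure-PyTorch
# gather/scatter that loses to F.scaled_dot_product_attention at T=2048
mask = create_block_mask(causal_local, B=None, H=None, Q_LEN=2048, KV_LEN=2048)
out = flex_attention(q, k, v, block_mask=mask)
```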

Tier 1 reordered:
  #1 (DROPPED) iter 113
  #2 iter 112 — IN FLIGHT
  #3 iter 117b-2 — Triton entmax (kernel-only, doesn't replace SDPA)
  #4 iter 117b-3 — Sparse MoE dispatch (replaces MLP path, not SDPA)
  #5 iter 117b-3b — Sparse-Q attention (smaller-Q gather; SDPA call preserved)
  #6 iter 108 — k_eval=10 (one-line config)
  #7 iter 110 — refinement re-enable

DEMOTED to Deferred: iter 120 (RRAttention).

New durable rule: feedback_sdpa_replacement_at_T2048.md — never queue
sparse-attention iters that REPLACE F.scaled_dot_product_attention
at T=2048 without a fused implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>