From 1ce4645d84ec72f5a5279560bc07cad37a056aca Mon Sep 17 00:00:00 2001 From: anonymous Date: Wed, 25 Mar 2026 21:42:13 +0100 Subject: [PATCH 01/23] quantization-denoising --- ..._code_recurrence_plan_from_current_best.md | 600 +++++++++++ claude plans/parameter-golf-plan.md | 183 ++++ .../README.md | 121 +++ .../feedback.py | 139 +++ .../full_run_1gpu.sh | 116 +++ .../model_recurrent_bestbase.py | 660 ++++++++++++ .../quant.py | 81 ++ .../smoke_test.sh | 59 ++ .../stability.py | 108 ++ .../submission.json | 9 + ...train_bestbase_recurrent_feedback_fixed.py | 377 +++++++ ...ain_bestbase_recurrent_feedback_learned.py | 459 +++++++++ .../train_bestbase_recurrent_qat.py | 334 ++++++ .../train_utils_recurrent.py | 958 ++++++++++++++++++ .../ttt_recurrent.py | 259 +++++ sky.yaml | 78 ++ sky_recurrent.yaml | 57 ++ 17 files changed, 4598 insertions(+) create mode 100644 claude plans/claude_code_recurrence_plan_from_current_best.md create mode 100644 claude plans/parameter-golf-plan.md create mode 100644 records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/README.md create mode 100644 records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/feedback.py create mode 100644 records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/full_run_1gpu.sh create mode 100644 records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/model_recurrent_bestbase.py create mode 100644 records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/quant.py create mode 100644 records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/smoke_test.sh create mode 100644 records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/stability.py create mode 100644 records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/submission.json create mode 100644 records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_bestbase_recurrent_feedback_fixed.py create mode 100644 records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_bestbase_recurrent_feedback_learned.py create mode 100644 records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_bestbase_recurrent_qat.py create mode 100644 records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_utils_recurrent.py create mode 100644 records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/ttt_recurrent.py create mode 100644 sky.yaml create mode 100644 sky_recurrent.yaml diff --git a/claude plans/claude_code_recurrence_plan_from_current_best.md b/claude plans/claude_code_recurrence_plan_from_current_best.md new file mode 100644 index 0000000000..6f22ff7b43 --- /dev/null +++ b/claude plans/claude_code_recurrence_plan_from_current_best.md @@ -0,0 +1,600 @@ +# Claude Code implementation brief: start from the current best record, then add a recurrent core with error correction + +Implement a **minimal-diff branch** on top of the current best 10-minute / 16 MB Parameter Golf record: + +- `records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/README.md` + +Use that record as the starting point for architecture, optimizer, quantization/export path, and legal test-time training (TTT). Do **not** start from PR #363 as the code base. Instead, use PR #363 only as the motivating failure case for recurrent quantization instability. 
+ +## Ground truth to preserve from the current best + +Mirror the current best setup as closely as possible before adding recurrence: + +- 11 layers, width 512, 8 heads, 4 KV heads +- MLP expansion 3× with `LeakyReLU(0.5)^2` +- `BigramHash=1536` +- XSA in the last 4 layers +- partial RoPE on 16 of 64 dims +- LayerNorm scaling of `1 / sqrt(layer + 1)` +- VE128 in layers 9–10 +- EMA + tight SWA +- `GPTQ-lite int6 + lzma` +- Parameter Banking + Parallel Muon +- legal score-first TTT with 32K-token chunks, SGD with momentum, 3 epochs per chunk, all blocks unfrozen in the current record, and total eval under the time budget + +The current best README reports a **3-seed mean of 1.1194 BPB**, roughly **15.95 MB** artifacts, about **83.4 ms/step**, and about **409 s** of TTT time inside a total evaluation time around **530 s**. Preserve that spirit: recurrence must be added without breaking the training/eval budget or the export path. + +## Why recurrence needs extra care + +PR #363 is the reference failure mode, not the base implementation. Its summary describes a looped architecture where shared blocks were reused across recurrence cycles. The PR reports that a `4 unique blocks × 3 cycles` setup went from **2.0711 BPB pre-quant** to **2.4402 BPB post-quant**, with the writeup attributing the collapse to roughly **900× amplification** of quantization error through recurrence. It also reports that a separate “noisy QAT” experiment largely removed the recurrence quantization gap. Treat that as the problem this branch is solving. + +In the recurrent setting, the report’s dynamical-systems framing is the right mental model. With shared quantized weights + +$$ +W_q = W + \varepsilon, +$$ + +a first-order perturbation grows roughly like + +$$ +\lVert J \rVert^k \cdot \lVert \varepsilon h_0 \rVert, +$$ + +where $J$ is the Jacobian of the shared update and $k$ is the number of recurrence passes. This means recurrence is **not** “free extra depth”; it is a noisy iterative system. + +## Design objective + +Build a recurrent variant that **inherits as much as possible from the current best record** while changing only the transformer body where necessary. + +Concretely: + +- keep the current best tokenizer, data path, optimizer stack, EMA/SWA, export flow, and legal TTT protocol, +- keep the current best non-recurrent baseline runnable, +- replace only a contiguous middle portion of the 11-layer stack with a shared recurrent core, +- and add explicit training-time + test-time correction for quantization error. + +## Recommended architecture migration + +Do **not** convert the entire 11-layer stack into one shared loop immediately. + +Instead, start from the current best 11-layer stack and partition it into: + +1. **stem**: early unique layers +2. **recurrent core**: a shared block or shared block-group repeated for `K` passes +3. **tail**: late unique layers + +### Safe first migration + +A good first version is: + +- keep layers 0–2 unique +- replace layers 3–7 with a recurrent core derived from 1–2 shared layers repeated `K` times +- keep layers 8–10 unique + +This preserves the current best model’s input processing and output refinement while localizing recurrence to the middle, where it is least likely to break the full recipe. + +## Recurrent update equations + +Let the current-best-derived stem produce + +$$ +h_0 = \text{Stem}(x). 
+$$ + +Let the recurrent core update be + +$$ +h_{k+1} = f_{W_q}(h_k + c_k), \qquad k = 0, \dots, K-1, +$$ + +and let the final tail and LM head produce + +$$ +\text{logits} = \text{LMHead}(\text{Tail}(h_K)). +$$ + +The correction term $c_k$ depends on the script variant. + +## Research constraints to encode + +### 1. Full-rollout QAT is mandatory + +All recurrence passes must be present during training. Compute the LM loss only after the final recurrent pass: + +$$ +\mathcal{L} = \operatorname{CE}(\text{LMHead}(\text{Tail}(h_K)), y). +$$ + +Use STE fake quantization inside the shared recurrent core during the rollout. + +### 2. Heterogeneous precision should remain available + +Per the report’s cited recurrent-model quantization literature, the recurrently reused matrices are the sensitive ones. Therefore: + +- prioritize fake quant and export quant support for the shared attention and MLP matrices, +- keep LayerNorm and tiny scalars in higher precision, +- and make it easy to leave embeddings and the LM head less aggressively quantized. + +### 3. Error feedback is the main experimental lever + +Use the report’s delta-sigma / error-feedback idea as the center of the implementation. + +Approximate the quantization residual action by + +$$ +e_k \approx (W - W_q) h_k, +$$ + +then inject a compensation term into the next pass: + +$$ +c_k = D_k e_k, +$$ + +$$ +h_{k+1} = f_{W_q}(h_k + c_k). +$$ + +Do **not** store the full dense residual matrix. + +### 4. Low-rank residual approximation is the default practical path + +Implement + +$$ +e_k \approx U(V^\top h_k), +$$ + +where $U, V \in \mathbb{R}^{d \times r}$ and $r \in \{1,2,4\}$. + +This is the small-parameter correction branch that fits the Parameter Golf budget. + +### 5. Jacobian control, clipping, and residual scaling are secondary stabilizers + +Add optional flags for: + +- hidden-state clipping between passes, +- per-pass residual scaling, +- a light Jacobian proxy regularizer. + +Use them as ablations and safety rails. + +### 6. TTT and recurrence interact, so default conservatively + +The current best record relies heavily on legal TTT. In a recurrent model, updating shared weights during TTT affects **all** recurrence passes at once. Therefore the recurrent branch should support safer TTT defaults, including: + +- freezing the recurrent core during TTT, +- adapting only tail layers during TTT, +- or using a smaller TTT LR for the recurrent core than for the unique layers. + +Keep the original “all blocks unfrozen” TTT option available, but do not make it the only path. + +## Files to create + +Create these top-level scripts: + +1. `train_bestbase_recurrent_qat.py` +2. `train_bestbase_recurrent_feedback_fixed.py` +3. `train_bestbase_recurrent_feedback_learned.py` + +Create these shared modules: + +- `model_recurrent_bestbase.py` +- `quant.py` +- `feedback.py` +- `stability.py` +- `ttt_recurrent.py` +- `train_utils_recurrent.py` + +## Script 1: `train_bestbase_recurrent_qat.py` + +### Goal + +Take the current best architecture and replace the chosen middle stack with a shared recurrent core trained with **full-rollout QAT**, but without explicit error feedback. + +### Forward pass + +Run: + +$$ +h_0 = \text{Stem}(x), +$$ + +$$ +h_{k+1} = f_{W_q}(h_k), +$$ + +$$ +\text{logits} = \text{LMHead}(\text{Tail}(h_K)). 
+$$ + +### Requirements + +- preserve the current best defaults outside the recurrent core, +- apply fake quant only to the shared recurrent core by default, +- compute loss only after the final pass, +- support `num_passes`, `shared_core_layers`, and `recurrent_layer_range` flags, +- preserve the current-best export path as closely as possible. + +### Purpose + +This script answers: how much of the recurrence problem disappears if we simply start from the current best recipe and train the real quantized recurrent rollout? + +## Script 2: `train_bestbase_recurrent_feedback_fixed.py` + +### Goal + +Add a small fixed-form error-feedback path on top of Script 1. + +### Residual approximation + +Use + +$$ +e_k = U(V^\top h_k) +$$ + +with tiny rank by default. + +### Correction options + +Support at least: + +1. identity correction + +$$ +c_k = e_k +$$ + +2. shared diagonal correction + +$$ +c_k = d \odot e_k +$$ + +where $d \in \mathbb{R}^d$ is learned or initialized at ones. + +### Recurrent update + +Use + +$$ +h_{k+1} = f_{W_q}(h_k + c_k). +$$ + +### Requirements + +- correction is inactive on pass 0, +- full-rollout QAT remains enabled, +- keep parameter overhead tiny, +- log correction norms and per-pass activation growth. + +### Purpose + +This script tests whether a tiny correction path can recover recurrence while leaving the current best recipe mostly untouched. + +## Script 3: `train_bestbase_recurrent_feedback_learned.py` + +### Goal + +Make the correction operator explicitly learnable. + +### Residual approximation + +Use + +$$ +e_k = U(V^\top h_k). +$$ + +### Learned correction operator + +Support: + +1. **shared diagonal** + +$$ +c_k = D e_k, \qquad D = \operatorname{diag}(d) +$$ + +2. **per-pass diagonal** + +$$ +c_k = D_k e_k +$$ + +3. **shared low-rank** + +$$ +c_k = U_D(V_D^\top e_k) +$$ + +4. **per-pass low-rank** + +$$ +c_k = U_{D,k}(V_{D,k}^\top e_k) +$$ + +### Recurrent update + +Use + +$$ +h_{k+1} = f_{W_q}(h_k + c_k). +$$ + +### Requirements + +- full-rollout QAT stays on, +- learned correction trains jointly with the recurrent core, +- optional affine junction correction is available, +- optional Jacobian proxy regularization is available, +- support warm-start phases that freeze the recurrent core while fitting correction modules. + +### Purpose + +This is the strongest version and the main target for experiments. + +## Base-model implementation requirements + +## `model_recurrent_bestbase.py` + +This module should be a minimal diff on top of the current best training stack. + +### Preserve from the current best + +- LeakyReLU(0.5)^2 MLP +- BigramHash path +- XSA last-4-layer support where still applicable +- partial RoPE behavior +- VE128 in layers 9–10 if those layers remain unique +- Parameter Banking compatibility where practical +- EMA/SWA hooks + +### Recurrent-core insertion + +Allow the user to specify which contiguous layers are replaced by the recurrent core. The recurrent core can be either: + +- a single shared block, +- or a small shared block group repeated `K` times. + +### Correction injection API + +The recurrent block/group must accept an optional correction tensor: + +```python +def forward(self, x, correction=None, ...): + if correction is not None: + x = x + correction + ... + return x +``` + +## `quant.py` + +Implement fake quantization and export helpers suitable for the shared recurrent core. 
+ +### Required features + +- symmetric quantization, +- configurable bits (support at least 5, 6, 8), +- per-tensor and per-row modes, +- selective application to the recurrent core, +- export helper that matches training fake quant closely. + +For a weight tensor $W$ with scale $s$: + +$$ +q = \operatorname{clip}\left(\operatorname{round}(W / s), q_{\min}, q_{\max}\right), +$$ + +$$ +W_q = s q. +$$ + +Use STE so the forward uses $W_q$ while gradients flow through $W$. + +## `feedback.py` + +Implement: + +### `LowRankResidual` + +$$ +e_k = U(V^\top h_k) +$$ + +### `DiagonalFeedback` + +$$ +c_k = d \odot e_k +$$ + +### `LowRankFeedback` + +$$ +c_k = U_D(V_D^\top e_k) +$$ + +### Optional `AffineJunction` + +$$ +c_k^{\text{aff}} = \gamma_k \odot h_k + \beta_k +$$ + +Keep all of these lightweight and sequence-shape aware. + +## `stability.py` + +Implement: + +### Per-pass diagnostics + +Track: + +- $\lVert h_k \rVert$ +- $\lVert h_{k+1} - h_k \rVert$ +- $\lVert e_k \rVert$ +- $\lVert c_k \rVert$ + +### Growth proxy + +$$ +\rho_k^{\text{emp}} = \frac{\lVert h_{k+1} \rVert}{\lVert h_k \rVert + \epsilon} +$$ + +### Optional clipping + +$$ +h_k \leftarrow \operatorname{clip}(h_k, -\alpha, \alpha) +$$ + +or norm clipping. + +### Optional residual scaling + +$$ +h_{k+1} = h_k + \alpha_k F(h_k + c_k) +$$ + +### Optional Jacobian proxy penalty + +Add a cheap finite-difference sensitivity penalty under a flag. + +## `ttt_recurrent.py` + +Implement a recurrent-aware TTT wrapper around the current best legal TTT protocol. + +### Scoring phase + +Preserve the current record’s score-first requirement: + +- score each chunk under `torch.inference_mode()` +- do not mutate weights during scoring + +### Adaptation phase + +Support TTT regimes: + +1. `tail_only` +2. `tail_plus_stem` +3. `all_unique_layers` +4. `all_layers` +5. `all_layers_with_recurrent_lr_scale` + +Also support: + +- separate LR scale for recurrent core, +- separate freeze mask for correction modules, +- momentum SGD as in the current best record. + +## CLI flags to add + +Support at least: + +- `--recurrent-layer-range` +- `--shared-core-layers` +- `--num-passes` +- `--quant-bits` +- `--quant-mode` +- `--feedback-rank` +- `--feedback-mode` +- `--per-pass-feedback` +- `--affine-junction` +- `--clip-hidden` +- `--clip-value` +- `--residual-scale-init` +- `--jacobian-proxy-weight` +- `--ttt-regime` +- `--ttt-recurrent-lr-scale` +- `--leave-embeddings-fp16` +- `--leave-head-fp16` + +## Experimental plan + +Run the experiments in this order. + +### Experiment A: preserve the current best, add only recurrence + QAT + +- start from the current best defaults +- replace a middle layer range with a recurrent core +- use full-rollout QAT +- no error feedback +- no TTT changes yet + +### Experiment B: fixed feedback + +- same as A +- add low-rank residual branch +- identity or shared diagonal feedback + +### Experiment C: learned feedback + +- same as B +- learned diagonal or low-rank correction operator + +### Experiment D: recurrent-aware TTT ablation + +Using the best model from C, compare: + +- `tail_only` +- `all_unique_layers` +- `all_layers_with_recurrent_lr_scale` + +This directly checks whether the current best TTT recipe survives the shared-core setting. + +### Experiment E: stabilizer ablation + +Test: + +- hidden clipping, +- residual scaling, +- affine junction correction, +- Jacobian proxy penalty. 
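
Of these, the Jacobian proxy is the least specified above. One cheap finite-difference form it could take (a sketch only; `core_fn` is an assumed callable wrapping one pass of the shared core):

```python
import torch

def jacobian_proxy_penalty(core_fn, h: torch.Tensor,
                           sigma: float = 1e-2) -> torch.Tensor:
    """Estimate ||J u|| along a random unit direction u via finite differences."""
    u = torch.randn_like(h)
    u = u / (u.norm() + 1e-6)                 # random unit perturbation direction
    return (core_fn(h + sigma * u) - core_fn(h)).norm() / sigma
```

Scaled by `--jacobian-proxy-weight` and added to the training loss, this discourages per-pass gains above 1 without ever forming $J$ explicitly.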
+ +## Logging requirements + +At each debug or validation interval, log: + +- train loss, +- val loss, +- val BPB, +- per-pass activation norms, +- per-pass empirical growth ratios, +- correction norms, +- gradient norm, +- step time, +- pre-TTT vs post-TTT BPB, +- pre-quant vs fake-quant gap if cheap to compute. + +## Success criteria + +The branch should make it easy to answer: + +1. Can the current best recipe remain competitive after replacing part of the 11-layer stack with a recurrent core? +2. How much does full-rollout QAT repair the recurrent quantization failure by itself? +3. How much more does fixed feedback recover? +4. Does learned feedback beat fixed feedback at the same tiny parameter budget? +5. Which TTT regime is safest for shared recurrent weights? + +## Final deliverables + +Return: + +1. the three training scripts, +2. the shared modules, +3. a short `README_recurrent_from_bestbase.md`, +4. and a concise note describing: + - what was preserved from the current best record, + - what was borrowed conceptually from PR #363, + - and what was changed to make recurrence quantization-stable. + +## Final implementation principle + +Treat the current best record as the **production-grade scaffold** and PR #363 as the **failure case to solve**. + +That means: + +- preserve the winning optimizer / quantization / TTT / architecture defaults wherever possible, +- localize recurrence to the smallest part of the stack that can still save parameters, +- and treat quantized recurrence as a controlled dynamical system with explicit error correction. diff --git a/claude plans/parameter-golf-plan.md b/claude plans/parameter-golf-plan.md new file mode 100644 index 0000000000..eb8d805537 --- /dev/null +++ b/claude plans/parameter-golf-plan.md @@ -0,0 +1,183 @@ +# Parameter Golf: RYS Layer Duplication at Eval Time + +## Context & Goal + +Apply RYS (Repeat Your Self) — duplicating mid-stack transformer layers at eval time only — to the current Parameter Golf SOTA. Inspired by David Noel Ng's work showing that repeating "reasoning" layers in trained transformers improves performance with zero retraining. + +- **Primary base**: `records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/` — current SOTA, 11L, LeakyReLU(0.5)², Legal score-first TTT, Parallel Muon optimizer +- **Control base**: PR #505 (`SwiGLU+VE128+NoTTT`, val_bpb=1.1181) — best non-TTT submission, 11L, GEPA architecture. We test BOTH because the SOTA uses Test-Time Training (TTT), which continuously adapts weights during eval. RYS would repeat TTT-adapted layers, making it hard to isolate the RYS effect. PR #505 has frozen weights at eval time, giving a clean control experiment. +- **Storage cost of RYS**: ~0 bytes (a few extra lines in eval code) + +### Key Risks +- The model is only 11 layers deep and quantized to int5/int6. The three-phase encode/reason/decode structure that makes RYS work on 64-layer models may not cleanly separate here. Quantization error may also compound through repeated passes. +- TTT interaction is unknown: TTT adapts weights per-document, so repeated layers use adapted weights. Could be synergistic (adapted reasoning layers benefit more from a second pass) or catastrophic (TTT already pushed weights to a fragile optimum). 
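
To size the compounding risk, a back-of-envelope from PR #363's reported numbers (purely illustrative; it assumes a uniform first-order gain per shared-weight application, which real models need not have):

```python
# PR #363: 4 unique blocks x 3 cycles = 12 shared-weight applications,
# with ~900x reported quantization-error amplification.
per_app_gain = 900 ** (1 / 12)    # ~1.76x per application
# An RYS block of 3 layers repeated x2 adds 3 extra applications:
extra_growth = per_app_gain ** 3  # ~5.5x extra noise growth at comparable gain
print(per_app_gain, extra_growth)
```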
+ +--- + +## Phase 1: Setup & Baseline + +### 1.1 Clone and reproduce +```bash +git clone https://github.com/openai/parameter-golf.git +cd parameter-golf +bash prepare.sh # downloads dataset + tokenizer +``` + +### 1.2 Train BOTH base models and record baselines +```bash +# Primary: current SOTA (with TTT) +cd records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/ +SEED=42 bash eval/eval.sh +# Record the exact val_bpb number + +# Control: PR #505 (no TTT) — checkout the PR branch or copy its train_gpt.py +# into a working directory and run it the same way +``` + +Train from scratch then evaluate — takes ~10min on 8xH100 SXM per run. Save the trained checkpoints so you don't have to retrain for every RYS config. Check `train_gpt.py` for how the model is saved after training and loaded for eval. + +We run the full RYS sweep on BOTH models. Comparing the two tells us: +- Whether RYS works at all at 11 layers +- Whether TTT helps or hurts the RYS effect +- If only the non-TTT model benefits, that's still a useful finding (RYS as a TTT alternative) + +--- + +## Phase 2: Implement RYS + +### 2.1 Find the eval forward pass +In `train_gpt.py`, locate the model's forward method used during evaluation. It will look something like: + +```python +for layer in self.layers: + x = layer(x) +logits = self.output(x) +``` + +### 2.2 Add RYS layer duplication (eval-only) +Modify the forward pass to accept RYS parameters. This should ONLY activate during eval, not during training. + +```python +def forward(self, x, rys_start=None, rys_end=None, rys_repeats=2): + for i, layer in enumerate(self.layers): + x = layer(x) + # At the end of the RYS block, loop back + if rys_start is not None and i == rys_end - 1: + for _ in range(rys_repeats - 1): # -1 because we already did one pass + for j in range(rys_start, rys_end): + x = self.layers[j](x) + logits = self.output(x) + return logits +``` + +Be careful with: +- **KV cache**: If eval uses a KV cache for sliding window, the repeated layers will generate additional KV entries. Make sure the cache handling is correct. +- **TTT interaction**: The SOTA submission uses Legal TTT (test-time training). RYS should be applied AFTER TTT has finished adapting the weights, during the final eval forward pass. Don't apply RYS during TTT's adaptation steps. (This is why we also test PR #505 which has no TTT.) + +### 2.3 Verify correctness +Before sweeping, test with rys_start=None (disabled) and confirm val_bpb matches baseline exactly. Any discrepancy means you introduced a bug. + +--- + +## Phase 3: Exhaustive Sweep (run on BOTH models) + +### 3.1 Sweep all (start, end) block duplications +With 11 layers, there are C(11,2) + 11 = 66 possible contiguous blocks. This is small enough to sweep exhaustively. 

```python
num_layers = 11
baseline_bpb = ...  # TODO: fill in the exact val_bpb recorded in Phase 1.2
results = {}

for start in range(num_layers):
    for end in range(start + 1, num_layers + 1):
        # Skip blocks that include layer 0 or layer 10 (encode/decode boundaries)
        # Actually, don't skip — sweep everything and let the data tell you

        # evaluate_with_rys: thin wrapper around the eval loop that passes the
        # RYS parameters from section 2.2 through to the model's forward()
        val_bpb = evaluate_with_rys(model, val_data, rys_start=start, rys_end=end, rys_repeats=2)
        delta = val_bpb - baseline_bpb
        results[(start, end)] = {'bpb': val_bpb, 'delta': delta, 'extra_layers': end - start}
        print(f"RYS({start},{end}) +{end-start}L: {val_bpb:.4f} delta={delta:+.4f}")

# Sort by delta (lower is better for BPB)
for k, v in sorted(results.items(), key=lambda x: x[1]['delta']):
    print(f"  {k}: {v['delta']:+.4f} bpb ({v['extra_layers']} extra layers)")
```

### 3.2 Single-layer repeat sweep
Also test repeating individual layers multiple times (Ng's repeat-x8 experiment):

```python
for layer_idx in range(num_layers):
    for num_repeats in [2, 3, 4, 5]:
        val_bpb = evaluate_with_rys(model, val_data,
                                    rys_start=layer_idx, rys_end=layer_idx+1,
                                    rys_repeats=num_repeats)
        delta = val_bpb - baseline_bpb
        print(f"Layer {layer_idx} x{num_repeats}: {val_bpb:.4f} delta={delta:+.4f}")
```

### 3.3 Visualize as heatmap (one per model)
Plot the (start, end) → delta results as a heatmap (upper triangular matrix, like Ng's brain scans). If there's a red zone in the middle, RYS works and you can see the reasoning circuit. Generate one heatmap for the TTT model and one for the non-TTT model — differences between them reveal how TTT affects the internal structure.

```python
import numpy as np
import matplotlib.pyplot as plt

heatmap = np.full((num_layers, num_layers), np.nan)
for (s, e), v in results.items():
    heatmap[s, e-1] = v['delta']

plt.imshow(heatmap, cmap='RdBu', vmin=-0.01, vmax=0.01)
plt.colorbar(label='BPB delta (negative = better)')
plt.xlabel('End layer')
plt.ylabel('Start layer')
plt.title('RYS Heatmap: 11L LeakyReLU LegalTTT ParallelMuon')
plt.savefig('rys_heatmap.png', dpi=150)
```

---

## Phase 4: Interpret & Submit

### 4.1 If RYS improves BPB
- Identify the Pareto-optimal configs (best delta per extra-layer-count)
- Check if the improvement exceeds 0.005 nats (required for SOTA record)
- Run 3 seeds to confirm statistical significance (p < 0.01)
- Submit as a record if it beats SOTA, or non-record if interesting but not SOTA

### 4.2 If RYS hurts or is neutral everywhere
This is still a valuable negative result. Write up:
- The heatmap showing no clear reasoning region
- Comparison to Ng's 64L results — what's different at 11L?
+- Hypothesis: 11 layers is below the threshold for clean phase separation +- Suggestion: try RYS on a deeper (13-15L) model if tokenizer pruning frees bytes +- Submit as a non-record submission + +### 4.3 Submission format +``` +records/track_non_record_16mb/YYYY-MM-DD_RYS_LayerDuplication/ +├── README.md # Full writeup with heatmap, methodology, results +├── submission.json # {"val_bpb": X.XXXX, "artifact_bytes": XXXXXX} +├── train.log # Training log from base model +├── train_gpt.py # Modified script with RYS eval code +└── rys_heatmap.png # Visualization +``` + +In the README, cite: +- Ng's RYS Part 1 & 2 (https://dnhkng.github.io/posts/rys/, https://dnhkng.github.io/posts/rys-ii/) +- The base submission you forked from +- The quantization error amplification finding from PR #363 + +--- + +## Quick Reference + +- **Base submission**: `records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/` +- **Repo**: https://github.com/openai/parameter-golf +- **RYS blog**: https://dnhkng.github.io/posts/rys-ii/ +- **RYS code**: https://github.com/dnhkng/RYS +- **Live commentary**: https://github.com/openai/parameter-golf/issues/140 +- **Deadline**: April 30, 2026 diff --git a/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/README.md b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/README.md new file mode 100644 index 0000000000..41ba6691ea --- /dev/null +++ b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/README.md @@ -0,0 +1,121 @@ +# Recurrent Core from Current Best Record + +Adds a shared recurrent core with quantization-aware training and error-feedback correction on top of the current best 10-min / 16 MB Parameter Golf record. + +## Preserved from the current best record + +| Component | Detail | +|-----------|--------| +| Activation | LeakyReLU(0.5)^2 | +| BigramHash | 1536 | +| XSA | Last 4 unique layers | +| Partial RoPE | 16 / 64 dims | +| LayerNorm scaling | 1 / sqrt(layer + 1) | +| VE128 | Configurable unique-layer indices | +| Weight averaging | EMA(0.997) + tight SWA(every 50) | +| Export path | GPTQ-lite int6 + lzma | +| Optimizer | Parameter Banking + Parallel Muon | +| TTT | Legal score-first, SGD+momentum, 32K chunks | + +## Borrowed conceptually from PR #363 + +PR #363 demonstrated that a 4-block × 3-cycle looped architecture suffered ~900× quantization error amplification (2.0711 → 2.4402 BPB post-quant). Its "noisy QAT" experiment showed the gap could be largely removed. + +From PR #363 we take: + +- The dynamical-systems framing: with shared quantized weights W_q = W + ε, perturbation grows as ‖J‖^k · ‖εh_0‖. +- The delta-sigma / error-feedback idea: approximate the quantization residual and inject a compensation term. +- The conclusion that full-rollout QAT is necessary but may not be sufficient alone. + +## What was changed to make recurrence quantization-stable + +### Architecture: stem / recurrent core / tail + +Instead of converting the full 11-layer stack, the model partitions into: + +- **Stem** (default 3 unique layers) — early processing, collects U-Net skips +- **Recurrent core** (default 2 shared layers × K passes) — middle depth via weight reuse +- **Tail** (default 3 unique layers) — late refinement, consumes skips + +Banks store weights for `num_unique = stem + core + tail` layers. Core bank entries are reused across K recurrence passes. 
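
Schematically (a simplified sketch; the real forward pass in `model_recurrent_bestbase.py` additionally threads U-Net skips, value embeddings, and feedback corrections through this loop):

```python
def rollout(x, stem, core, tail, num_passes=3):
    for blk in stem:             # 3 unique layers, applied once
        x = blk(x)
    for _ in range(num_passes):  # shared core weights, reused K times
        for blk in core:         # 2 shared layers, fake-quantized in training
            x = blk(x)
    for blk in tail:             # 3 unique layers, applied once
        x = blk(x)
    return x
```

At the defaults this gives 3 + 2×3 + 3 = 12 block applications from only 8 unique layers of parameters.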
+ +### Full-rollout QAT with STE + +Core bank weights are fake-quantized (symmetric int6, per-row, STE) on every forward pass during training. Loss is computed only after the final recurrence pass. Stem and tail weights are not fake-quantized by default. + +### Error feedback + +Quantization residual is approximated as a low-rank branch: + + e_k = U (V^T h_k), U, V in R^{d × r}, r in {1, 2, 4} + +Three correction variants: + +| Script | Correction | Parameters added | +|--------|-----------|-----------------| +| `train_bestbase_recurrent_qat.py` | None (QAT only) | 0 | +| `train_bestbase_recurrent_feedback_fixed.py` | Identity or shared diagonal | Very small | +| `train_bestbase_recurrent_feedback_learned.py` | Learned diagonal/low-rank, per-pass option, optional affine junction | Small | + +### Stabilizers + +- Optional hidden-state clipping (value or norm) +- Optional learnable per-pass residual scaling +- Optional Jacobian spectral-norm proxy penalty + +### Recurrence-safe TTT + +Five regimes controlling which parameters adapt at test time: + +| Regime | Adapts | +|--------|--------| +| `tail_only` | Tail blocks only (safest) | +| `tail_plus_stem` | Stem + tail, core frozen | +| `all_unique_layers` | All blocks at full LR | +| `all_layers` | Alias for all_unique | +| `all_layers_with_recurrent_lr_scale` | Core at reduced LR (e.g. 0.1×) | + +## File structure + +``` +records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/ +├── model_recurrent_bestbase.py # RecurrentGPT model +├── quant.py # Fake quantization (STE) + export +├── feedback.py # Error feedback modules +├── stability.py # Diagnostics, clipping, Jacobian proxy +├── ttt_recurrent.py # Recurrence-aware TTT +├── train_utils_recurrent.py # Hyperparameters, Muon, data, eval, export +├── train_bestbase_recurrent_qat.py # Script 1: QAT only +├── train_bestbase_recurrent_feedback_fixed.py # Script 2: fixed feedback +├── train_bestbase_recurrent_feedback_learned.py # Script 3: learned feedback +├── smoke_test.sh # 1-GPU correctness check +├── submission.json +└── README.md +``` + +## Quick start + +```bash +# Script 3 (learned feedback) — the main experimental target +NUM_STEM_LAYERS=3 NUM_CORE_LAYERS=2 NUM_TAIL_LAYERS=3 NUM_PASSES=3 \ +CORE_QUANT_BITS=6 CORE_QUANT_ENABLED=1 \ +BIGRAM_VOCAB_SIZE=1536 XSA_LAST_N=4 ROPE_DIMS=16 LN_SCALE=1 \ +VE_ENABLED=1 VE_DIM=128 VE_LAYERS=6,7 \ +EMA_ENABLED=1 SWA_ENABLED=1 SWA_EVERY=50 \ +TTT_ENABLED=1 TTT_LR=0.002 TTT_EPOCHS=3 TTT_CHUNK_TOKENS=32768 \ +MUON_WD=0.04 ADAM_WD=0.04 MATRIX_LR=0.025 SCALAR_LR=0.025 \ +ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 SEED=1337 \ +torchrun --standalone --nproc_per_node=8 \ + train_bestbase_recurrent_feedback_learned.py \ + --feedback-mode diagonal --feedback-rank 2 --ttt-regime tail_only +``` + +## Experimental plan + +| Experiment | Script | Key question | +|-----------|--------|-------------| +| A | qat | Does QAT alone fix recurrence? | +| B | fixed | Does a tiny correction path help further? | +| C | learned | Does learned feedback beat fixed at same budget? | +| D | learned + TTT | Which TTT regime is safest for shared weights? | +| E | learned + stabilizers | Do clipping / scaling / Jacobian penalty help? 
diff --git a/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/feedback.py b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/feedback.py
new file mode 100644
index 0000000000..f7ecd3f4c3
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/feedback.py
@@ -0,0 +1,139 @@
"""Error feedback modules for recurrent quantization correction.

Implements low-rank residual approximation and correction operators
to compensate for quantization error amplification in recurrent passes.

    e_k = U (V^T h_k)            -- low-rank residual approx.
    c_k = D_k(e_k)               -- correction operator
    h_{k+1} = f_{W_q}(h_k + c_k) -- corrected recurrent update
"""
from __future__ import annotations
import math
import torch
import torch.nn as nn
from torch import Tensor


class LowRankResidual(nn.Module):
    """e_k = U (V^T h_k) with U, V in R^{d x r}."""

    def __init__(self, dim: int, rank: int = 2):
        super().__init__()
        self.V = nn.Parameter(torch.randn(dim, rank) * (1.0 / math.sqrt(dim)))
        self.U = nn.Parameter(torch.randn(dim, rank) * (1.0 / math.sqrt(rank)))

    def forward(self, h: Tensor) -> Tensor:
        return (h @ self.V) @ self.U.T


class DiagonalFeedback(nn.Module):
    """c_k = d odot e_k."""

    def __init__(self, dim: int, init_ones: bool = True):
        super().__init__()
        init_val = torch.ones(dim) if init_ones else torch.zeros(dim)
        self.d = nn.Parameter(init_val)

    def forward(self, e: Tensor) -> Tensor:
        return self.d.to(dtype=e.dtype) * e


class LowRankFeedback(nn.Module):
    """c_k = U_D (V_D^T e_k) with U_D, V_D in R^{d x r}."""

    def __init__(self, dim: int, rank: int = 2):
        super().__init__()
        self.V_D = nn.Parameter(torch.randn(dim, rank) * (1.0 / math.sqrt(dim)))
        self.U_D = nn.Parameter(torch.randn(dim, rank) * (1.0 / math.sqrt(rank)))

    def forward(self, e: Tensor) -> Tensor:
        return (e @ self.V_D) @ self.U_D.T


class AffineJunction(nn.Module):
    """c_k^{aff} = gamma_k odot h_k + beta_k."""

    def __init__(self, dim: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, h: Tensor) -> Tensor:
        return self.gamma.to(dtype=h.dtype) * h + self.beta.to(dtype=h.dtype)


class ErrorFeedbackModule(nn.Module):
    """Combined error-feedback path: residual -> correction -> (optional junction).

    Supports shared or per-pass correction operators. Correction is inactive
    on pass 0 (the first recurrence pass sees no prior quantization residual).
+ + Args: + dim: model hidden dimension + rank: rank for low-rank components + feedback_mode: 'identity' | 'diagonal' | 'low_rank' + per_pass: separate correction per pass if True + num_passes: number of recurrence passes (K) + affine_junction: add an affine junction path + """ + + def __init__( + self, + dim: int, + rank: int = 2, + feedback_mode: str = "diagonal", + per_pass: bool = False, + num_passes: int = 3, + affine_junction: bool = False, + ): + super().__init__() + self.feedback_mode = feedback_mode + self.per_pass = per_pass + self.num_passes = num_passes + + self.residual = LowRankResidual(dim, rank) + + if feedback_mode == "identity": + self.correction: nn.Module | nn.ModuleList | None = None + elif feedback_mode == "diagonal": + if per_pass: + self.correction = nn.ModuleList( + [DiagonalFeedback(dim) for _ in range(num_passes)] + ) + else: + self.correction = DiagonalFeedback(dim) + elif feedback_mode == "low_rank": + if per_pass: + self.correction = nn.ModuleList( + [LowRankFeedback(dim, rank) for _ in range(num_passes)] + ) + else: + self.correction = LowRankFeedback(dim, rank) + else: + raise ValueError(f"Unknown feedback_mode: {feedback_mode}") + + self.junction: AffineJunction | None = ( + AffineJunction(dim) if affine_junction else None + ) + + def forward(self, h: Tensor, pass_idx: int) -> Tensor | None: + """Return correction tensor, or None for pass 0.""" + if pass_idx == 0: + return None + e = self.residual(h) + if self.correction is None: + c = e + elif self.per_pass: + c = self.correction[pass_idx](e) + else: + c = self.correction(e) + if self.junction is not None: + c = c + self.junction(h) + return c + + def extra_repr(self) -> str: + return (f"mode={self.feedback_mode}, per_pass={self.per_pass}, " + f"passes={self.num_passes}") + + def param_count(self) -> int: + return sum(p.numel() for p in self.parameters()) diff --git a/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/full_run_1gpu.sh b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/full_run_1gpu.sh new file mode 100644 index 0000000000..28b36be9e5 --- /dev/null +++ b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/full_run_1gpu.sh @@ -0,0 +1,116 @@ +#!/bin/bash +# Full 1-GPU run for 80 minutes to guesstimate final loss. +# +# On 1 GPU, grad_accum_steps=8 so each step is ~8x slower than 8-GPU. +# 80 minutes on 1 GPU ≈ 10 minutes on 8 GPUs in terms of training steps, +# giving a realistic estimate of the final BPB. 
+# +# Usage: +# cd records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT +# bash full_run_1gpu.sh # default: learned feedback +# bash full_run_1gpu.sh qat # QAT-only baseline +# bash full_run_1gpu.sh fixed # fixed feedback +# bash full_run_1gpu.sh learned # learned feedback +# SEED=42 bash full_run_1gpu.sh learned # custom seed +# MINUTES=120 bash full_run_1gpu.sh learned # longer run +# +# Prerequisites: +# - 1 CUDA GPU (H100/H200/A100 recommended) +# - Data downloaded: python data/cached_challenge_fineweb.py +# - Dependencies: pip install sentencepiece numpy flash-attn + +set -euo pipefail + +VARIANT="${1:-learned}" +MINUTES="${MINUTES:-80}" +WALLCLOCK=$((MINUTES * 60)) + +export DATA_PATH="${DATA_PATH:-../../../data/datasets/fineweb10B_sp1024}" +export TOKENIZER_PATH="${TOKENIZER_PATH:-../../../data/tokenizers/fineweb_1024_bpe.model}" +export SEED="${SEED:-1337}" + +# Full training config — matches the 8-GPU regime +export ITERATIONS=20000 +export MAX_WALLCLOCK_SECONDS="${WALLCLOCK}" +export VAL_LOSS_EVERY=2000 +export TRAIN_LOG_EVERY=200 +export WARMUP_STEPS=20 +export WARMDOWN_ITERS=3500 +export TRAIN_BATCH_TOKENS=786432 +export TRAIN_SEQ_LEN=2048 +export EVAL_SEQ_LEN=2048 +export EVAL_STRIDE=64 + +# Model config +export NUM_STEM_LAYERS=3 +export NUM_CORE_LAYERS=2 +export NUM_TAIL_LAYERS=3 +export NUM_PASSES=3 +export CORE_QUANT_BITS=6 +export CORE_QUANT_ENABLED=1 +export BIGRAM_VOCAB_SIZE=1536 +export XSA_LAST_N=4 +export ROPE_DIMS=16 +export LN_SCALE=1 +export VE_ENABLED=1 +export VE_DIM=128 +export VE_LAYERS="6,7" + +# Optimizer config +export MATRIX_LR=0.025 +export SCALAR_LR=0.025 +export TIED_EMBED_LR=0.035 +export MUON_MOMENTUM=0.99 +export MUON_MOMENTUM_WARMUP_START=0.92 +export MUON_MOMENTUM_WARMUP_STEPS=1500 +export MUON_WD=0.04 +export ADAM_WD=0.04 +export GRAD_CLIP_NORM=0.3 + +# Weight averaging +export SWA_ENABLED=1 +export SWA_EVERY=50 + +# Late QAT +export LATE_QAT=1 +export LATE_QAT_THRESHOLD=0.15 + +# TTT off for training run (enable separately for eval) +export TTT_ENABLED=0 + +echo "============================================================" +echo " Full 1-GPU run: ${VARIANT} variant" +echo " Wall clock: ${MINUTES} minutes (${WALLCLOCK}s)" +echo " Seed: ${SEED}" +echo " Data: ${DATA_PATH}" +echo "============================================================" + +case "${VARIANT}" in + qat) + echo "Running: train_bestbase_recurrent_qat.py (QAT only, no feedback)" + python train_bestbase_recurrent_qat.py \ + --ttt-regime tail_only + ;; + fixed) + echo "Running: train_bestbase_recurrent_feedback_fixed.py (fixed diagonal feedback)" + python train_bestbase_recurrent_feedback_fixed.py \ + --feedback-mode diagonal \ + --feedback-rank 2 \ + --ttt-regime tail_only + ;; + learned) + echo "Running: train_bestbase_recurrent_feedback_learned.py (learned feedback)" + python train_bestbase_recurrent_feedback_learned.py \ + --feedback-mode diagonal \ + --feedback-rank 2 \ + --ttt-regime tail_only + ;; + *) + echo "Unknown variant: ${VARIANT}" + echo "Usage: bash full_run_1gpu.sh [qat|fixed|learned]" + exit 1 + ;; +esac + +echo "" +echo "Run complete. Check logs/ for detailed output." 
diff --git a/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/model_recurrent_bestbase.py b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/model_recurrent_bestbase.py new file mode 100644 index 0000000000..7463290cd0 --- /dev/null +++ b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/model_recurrent_bestbase.py @@ -0,0 +1,660 @@ +"""Recurrent GPT model derived from the current best record. + +Preserves: LeakyReLU(0.5)^2, BigramHash, XSA, partial RoPE, VE128, +LayerNorm scaling, Parameter Banking layout, EMA/SWA hooks. + +Adds: stem / recurrent-core / tail partitioning with optional +fake-quantization inside the shared core and correction injection. +""" +from __future__ import annotations +import math +import torch +import torch.nn.functional as F +from torch import Tensor, nn + +try: + from flash_attn_interface import flash_attn_func as flash_attn_3_func +except ImportError: + flash_attn_3_func = None # allow import on CPU for tests + +from quant import fake_quantize_weight + +# --------------------------------------------------------------------------- +# Base components (mirrored from the current-best record) +# --------------------------------------------------------------------------- + +class RMSNorm(nn.Module): + def __init__(self, eps: float | None = None): + super().__init__() + self.eps = eps + + def forward(self, x: Tensor) -> Tensor: + return F.rms_norm(x, (x.size(-1),), eps=self.eps) + + +class Rotary(nn.Module): + def __init__(self, dim: int, base: float = 10000.0, + train_seq_len: int = 1024, rope_dims: int = 0): + super().__init__() + self.dim = dim + self.base = base + self.train_seq_len = train_seq_len + self.rope_dims = rope_dims if rope_dims > 0 else dim + inv_freq = 1.0 / (base ** (torch.arange(0, self.rope_dims, 2, + dtype=torch.float32) / self.rope_dims)) + self.register_buffer("inv_freq", inv_freq, persistent=False) + self._seq_len_cached = 0 + self._cos_cached: Tensor | None = None + self._sin_cached: Tensor | None = None + + def forward(self, seq_len: int, device: torch.device, + dtype: torch.dtype) -> tuple[Tensor, Tensor]: + if (self._cos_cached is None or self._sin_cached is None + or self._seq_len_cached != seq_len + or self._cos_cached.device != device): + rd = self.rope_dims + if seq_len > self.train_seq_len: + scale = seq_len / self.train_seq_len + new_base = self.base * (scale ** (rd / (rd - 2))) + inv_freq = 1.0 / (new_base ** (torch.arange(0, rd, 2, + dtype=torch.float32, device=device) / rd)) + else: + inv_freq = self.inv_freq.to(device) + t = torch.arange(seq_len, device=device, dtype=inv_freq.dtype) + freqs = torch.outer(t, inv_freq) + self._cos_cached = freqs.cos()[None, :, None, :] + self._sin_cached = freqs.sin()[None, :, None, :] + self._seq_len_cached = seq_len + return self._cos_cached.to(dtype=dtype), self._sin_cached.to(dtype=dtype) + + +def apply_rotary_emb(x: Tensor, cos: Tensor, sin: Tensor, + rope_dims: int = 0) -> Tensor: + if rope_dims > 0 and rope_dims < x.size(-1): + x_rope, x_pass = x[..., :rope_dims], x[..., rope_dims:] + half = rope_dims // 2 + x1, x2 = x_rope[..., :half], x_rope[..., half:] + x_rope = torch.cat((x1 * cos + x2 * sin, + x1 * (-sin) + x2 * cos), dim=-1) + return torch.cat((x_rope, x_pass), dim=-1) + half = x.size(-1) // 2 + x1, x2 = x[..., :half], x[..., half:] + return torch.cat((x1 * cos + x2 * sin, + x1 * (-sin) + x2 * cos), dim=-1) + + +class CastedLinear(nn.Linear): + _qat_enabled: bool = False + + def forward(self, x: Tensor) -> Tensor: + w = 
self.weight.to(x.dtype) + if CastedLinear._qat_enabled and self.training and w.ndim == 2: + with torch.no_grad(): + w32 = self.weight.float() + row_max = w32.abs().amax(dim=1) + scale = (row_max / 31.0).clamp_min(1.0 / 31.0) + w_q = (torch.clamp(torch.round(w32 / scale[:, None]), + -32, 31) * scale[:, None]).to(x.dtype) + w = w + (w_q - w).detach() + bias = self.bias.to(x.dtype) if self.bias is not None else None + return F.linear(x, w, bias) + + +class CausalSelfAttention(nn.Module): + def __init__(self, dim: int, num_heads: int, num_kv_heads: int, + rope_base: float, qk_gain_init: float, + gated_attention: bool = False, value_residual: bool = False): + super().__init__() + if dim % num_heads != 0: + raise ValueError("model_dim must be divisible by num_heads") + if num_heads % num_kv_heads != 0: + raise ValueError("num_heads must be divisible by num_kv_heads") + self.num_heads = num_heads + self.num_kv_heads = num_kv_heads + self.head_dim = dim // num_heads + self.q_gain = nn.Parameter(torch.full((num_heads,), qk_gain_init, + dtype=torch.float32)) + self.rope_dims = 0 + self.rotary = Rotary(self.head_dim, base=rope_base, train_seq_len=1024) + self.use_xsa = False + self.gated_attention = gated_attention + if gated_attention: + self.attn_gate = nn.Linear(dim, num_heads, bias=True) + nn.init.zeros_(self.attn_gate.weight) + nn.init.constant_(self.attn_gate.bias, 4.0) + self.value_residual = value_residual + if value_residual: + self.vr_lambda = nn.Parameter(torch.tensor([0.5, 0.5], + dtype=torch.float32)) + + def _xsa_efficient(self, y: Tensor, v: Tensor) -> Tensor: + B, T, H, D = y.shape + Hkv = v.size(-2) + group = H // Hkv + y_g = y.reshape(B, T, Hkv, group, D) + vn = F.normalize(v, dim=-1).unsqueeze(-2) + proj = (y_g * vn).sum(dim=-1, keepdim=True) * vn + return (y_g - proj).reshape(B, T, H, D) + + def forward(self, x: Tensor, q_w: Tensor, k_w: Tensor, + v_w: Tensor, out_w: Tensor, + v_embed: Tensor | None = None, + v0: Tensor | None = None) -> tuple[Tensor, Tensor | None]: + bsz, seqlen, dim = x.shape + q = F.linear(x, q_w.to(x.dtype)).reshape( + bsz, seqlen, self.num_heads, self.head_dim) + k = F.linear(x, k_w.to(x.dtype)).reshape( + bsz, seqlen, self.num_kv_heads, self.head_dim) + v = F.linear(x, v_w.to(x.dtype)) + if v_embed is not None: + v = v + v_embed + v = v.reshape(bsz, seqlen, self.num_kv_heads, self.head_dim) + raw_v = v if self.value_residual else None + if self.value_residual and v0 is not None: + lam = self.vr_lambda.to(dtype=v.dtype) + v = lam[0] * v0 + lam[1] * v + q = F.rms_norm(q, (q.size(-1),)) + k = F.rms_norm(k, (k.size(-1),)) + cos, sin = self.rotary(seqlen, x.device, q.dtype) + q = apply_rotary_emb(q, cos, sin, self.rope_dims) + k = apply_rotary_emb(k, cos, sin, self.rope_dims) + q = q * self.q_gain.to(dtype=q.dtype)[None, None, :, None] + y = flash_attn_3_func(q, k, v, causal=True) + if self.use_xsa: + y = self._xsa_efficient(y, v) + if self.gated_attention: + gate = torch.sigmoid(self.attn_gate(x)).unsqueeze(-1) + y = y * gate + y = y.reshape(bsz, seqlen, dim) + return F.linear(y, out_w.to(x.dtype)), raw_v + + +class SmearGate(nn.Module): + def __init__(self, dim: int): + super().__init__() + self.gate = nn.Parameter(torch.zeros(dim, dtype=torch.float32)) + + def forward(self, x: Tensor) -> Tensor: + g = torch.sigmoid(self.gate.to(dtype=x.dtype))[None, None, :] + x_prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1) + return (1 - g) * x + g * x_prev + + +class BigramHashEmbedding(nn.Module): + def __init__(self, bigram_vocab_size: int, bigram_dim: 
int, + model_dim: int): + super().__init__() + self.bigram_vocab_size = bigram_vocab_size + self.embed = nn.Embedding(bigram_vocab_size, bigram_dim) + nn.init.zeros_(self.embed.weight) + self.proj = (CastedLinear(bigram_dim, model_dim, bias=False) + if bigram_dim != model_dim else None) + if self.proj is not None: + nn.init.zeros_(self.proj.weight) + self.scale = nn.Parameter(torch.tensor(0.05, dtype=torch.float32)) + + def bigram_hash(self, tokens: Tensor) -> Tensor: + t = tokens.to(torch.int32) + mod = self.bigram_vocab_size - 1 + out = torch.empty_like(t) + out[..., 0] = mod + out[..., 1:] = (torch.bitwise_xor( + 36313 * t[..., 1:], 27191 * t[..., :-1]) % mod) + return out.long() + + def forward(self, token_ids: Tensor) -> Tensor: + h = self.embed(self.bigram_hash(token_ids)) + if self.proj is not None: + h = self.proj(h) + return h * self.scale.to(dtype=h.dtype) + + +class ValueEmbedding(nn.Module): + def __init__(self, vocab_size: int, ve_dim: int, model_dim: int): + super().__init__() + self.embed = nn.Embedding(vocab_size, ve_dim) + nn.init.normal_(self.embed.weight, std=0.01) + self.proj = (CastedLinear(ve_dim, model_dim, bias=False) + if ve_dim != model_dim else None) + if self.proj is not None: + nn.init.zeros_(self.proj.weight) + self.scale = nn.Parameter(torch.tensor(0.1, dtype=torch.float32)) + + def forward(self, token_ids: Tensor) -> Tensor: + h = self.embed(token_ids) + if self.proj is not None: + h = self.proj(h) + return h * self.scale.to(dtype=h.dtype) + + +class MLP(nn.Module): + def __init__(self, dim: int, mlp_mult: int): + super().__init__() + + def forward(self, x: Tensor, up_w: Tensor, down_w: Tensor) -> Tensor: + x = F.leaky_relu(F.linear(x, up_w.to(x.dtype)), + negative_slope=0.5) + return F.linear(x.square(), down_w.to(x.dtype)) + + +class Block(nn.Module): + def __init__(self, dim: int, num_heads: int, num_kv_heads: int, + mlp_mult: int, rope_base: float, qk_gain_init: float, + layer_idx: int = 0, ln_scale: bool = False, + dtg: bool = False, gated_attention: bool = False, + value_residual: bool = False): + super().__init__() + self.attn_norm = RMSNorm() + self.mlp_norm = RMSNorm() + self.attn = CausalSelfAttention( + dim, num_heads, num_kv_heads, rope_base, qk_gain_init, + gated_attention=gated_attention, value_residual=value_residual) + self.mlp = MLP(dim, mlp_mult) + self.attn_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32)) + self.mlp_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32)) + self.resid_mix = nn.Parameter( + torch.stack((torch.ones(dim), torch.zeros(dim))).float()) + self.ln_scale_factor = (1.0 / math.sqrt(layer_idx + 1) + if ln_scale else 1.0) + if dtg: + self.dtg_gate = nn.Linear(dim, 1, bias=True) + nn.init.zeros_(self.dtg_gate.weight) + nn.init.constant_(self.dtg_gate.bias, 2.0) + else: + self.dtg_gate = None + + def forward(self, x: Tensor, x0: Tensor, + q_w: Tensor, k_w: Tensor, v_w: Tensor, + out_w: Tensor, up_w: Tensor, down_w: Tensor, + v_embed: Tensor | None = None, + v0: Tensor | None = None) -> tuple[Tensor, Tensor | None]: + mix = self.resid_mix.to(dtype=x.dtype) + x_in = mix[0][None, None, :] * x + mix[1][None, None, :] * x0 + attn_out, raw_v = self.attn( + self.attn_norm(x_in) * self.ln_scale_factor, + q_w, k_w, v_w, out_w, v_embed=v_embed, v0=v0) + x_out = x_in + (self.attn_scale.to(dtype=x_in.dtype)[None, None, :] + * attn_out) + x_out = x_out + (self.mlp_scale.to(dtype=x_out.dtype)[None, None, :] + * self.mlp(self.mlp_norm(x_out) * self.ln_scale_factor, + up_w, down_w)) + if self.dtg_gate is not None: + gate = 
torch.sigmoid(self.dtg_gate(x_in.detach())) + x_out = x_in + gate * (x_out - x_in) + return x_out, raw_v + + +# --------------------------------------------------------------------------- +# Recurrent GPT +# --------------------------------------------------------------------------- + +CONTROL_TENSOR_NAME_PATTERNS = ( + "attn_scale", "attn_scales", "mlp_scale", "mlp_scales", "resid_mix", + "resid_mixes", "q_gain", "skip_weight", "skip_weights", "smear", + "dtg_gate", "ve_layer_scales", "ve_shared.scale", "attn_gate", + "vr_lambda", +) + + +def restore_low_dim_params_to_fp32(module: nn.Module) -> None: + with torch.no_grad(): + for name, param in module.named_parameters(): + if ((param.ndim < 2 + or any(p in name for p in CONTROL_TENSOR_NAME_PATTERNS)) + and param.dtype != torch.float32): + param.data = param.data.float() + + +class RecurrentGPT(nn.Module): + """Recurrent GPT: stem → shared core (×K) → tail. + + All architectural defaults match the current best record. The recurrent + core replaces a contiguous range of layers with a small set of shared + blocks repeated *num_passes* times. + + The bank layout stores weights for num_unique layers (stem + core + tail), + and the core bank entries are reused on each recurrence pass — optionally + with fake quantization applied via STE. + """ + + def __init__( + self, + vocab_size: int = 1024, + model_dim: int = 512, + num_heads: int = 8, + num_kv_heads: int = 4, + mlp_mult: float = 3.0, + tie_embeddings: bool = True, + tied_embed_init_std: float = 0.005, + logit_softcap: float = 30.0, + rope_base: float = 10000.0, + qk_gain_init: float = 1.5, + bigram_vocab_size: int = 1536, + bigram_dim: int = 128, + rope_dims: int = 16, + ln_scale: bool = True, + ve_enabled: bool = True, + ve_dim: int = 128, + ve_layers: str = "", + xsa_last_n: int = 4, + gated_attention: bool = False, + value_residual: bool = False, + # recurrence + num_stem_layers: int = 3, + num_core_layers: int = 2, + num_tail_layers: int = 3, + num_passes: int = 3, + # fake quant for core + core_quant_bits: int = 6, + core_quant_enabled: bool = True, + ): + super().__init__() + if logit_softcap <= 0.0: + raise ValueError(f"logit_softcap must be positive, got {logit_softcap}") + + self.tie_embeddings = tie_embeddings + self.tied_embed_init_std = tied_embed_init_std + self.logit_softcap = logit_softcap + self.value_residual = value_residual + + self.num_stem = num_stem_layers + self.num_core = num_core_layers + self.num_tail = num_tail_layers + self.num_unique = num_stem_layers + num_core_layers + num_tail_layers + self.num_passes = num_passes + self.core_quant_bits = core_quant_bits + self.core_quant_enabled = core_quant_enabled + + head_dim = model_dim // num_heads + kv_dim = num_kv_heads * head_dim + mlp_dim = int(mlp_mult * model_dim) + self._kv_dim = kv_dim + + # --- embeddings --- + self.tok_emb = nn.Embedding(vocab_size, model_dim) + self.bigram = (BigramHashEmbedding(bigram_vocab_size, bigram_dim, + model_dim) + if bigram_vocab_size > 0 else None) + self.smear = SmearGate(model_dim) + + # --- skip connections (stem ↔ tail) --- + self.num_skip_weights = min(num_stem_layers, num_tail_layers) + self.skip_weights = nn.Parameter( + torch.ones(self.num_skip_weights, model_dim, dtype=torch.float32)) + + # --- parameter banks (sized for unique layers only) --- + n = self.num_unique + self.qo_bank = nn.Parameter( + torch.empty(2 * n, model_dim, model_dim)) + self.kv_bank = nn.Parameter( + torch.empty(2 * n, kv_dim, model_dim)) + self.mlp_up_bank = nn.Parameter( + torch.empty(n, mlp_dim, 
model_dim)) + self.mlp_down_bank = nn.Parameter( + torch.empty(n, model_dim, mlp_dim)) + + # core bank indices for quick lookup + self._core_bank_start = num_stem_layers + self._core_bank_end = num_stem_layers + num_core_layers + + # --- blocks --- + def _make_block(layer_idx: int) -> Block: + return Block( + model_dim, num_heads, num_kv_heads, int(mlp_mult), + rope_base, qk_gain_init, layer_idx=layer_idx, + ln_scale=ln_scale, gated_attention=gated_attention, + value_residual=value_residual) + + self.stem_blocks = nn.ModuleList( + [_make_block(i) for i in range(num_stem_layers)]) + self.core_blocks = nn.ModuleList( + [_make_block(num_stem_layers + j) + for j in range(num_core_layers)]) + self.tail_blocks = nn.ModuleList( + [_make_block(num_stem_layers + num_core_layers + i) + for i in range(num_tail_layers)]) + + # partial RoPE + if rope_dims > 0: + for blk in list(self.stem_blocks) + list(self.core_blocks) + list(self.tail_blocks): + blk.attn.rope_dims = rope_dims + blk.attn.rotary = Rotary(head_dim, base=rope_base, + train_seq_len=1024, rope_dims=rope_dims) + + # XSA on last N unique layers + all_blocks = (list(self.stem_blocks) + list(self.core_blocks) + + list(self.tail_blocks)) + if xsa_last_n > 0: + for blk in all_blocks[max(0, len(all_blocks) - xsa_last_n):]: + blk.attn.use_xsa = True + + # Value Embedding + self.ve_layer_indices = ([int(x) for x in ve_layers.split(",") if x.strip()] + if ve_enabled and ve_layers else []) + if self.ve_layer_indices: + self.ve_shared = ValueEmbedding(vocab_size, ve_dim, kv_dim) + self.ve_layer_scales = nn.ParameterList( + [nn.Parameter(torch.ones(1, dtype=torch.float32)) + for _ in self.ve_layer_indices]) + else: + self.ve_shared = None + self.ve_layer_scales = nn.ParameterList() + + # --- output --- + self.final_norm = RMSNorm() + self.lm_head = (None if tie_embeddings + else CastedLinear(model_dim, vocab_size, bias=False)) + if self.lm_head is not None: + self.lm_head._zero_init = True + + self._init_weights() + + # ---- weight init (mirrors base) ---- + + def _init_weights(self) -> None: + if self.tie_embeddings: + nn.init.normal_(self.tok_emb.weight, mean=0.0, + std=self.tied_embed_init_std) + n = self.num_unique + proj_scale = 1.0 / math.sqrt(2 * n) + for i in range(n): + nn.init.orthogonal_(self.qo_bank.data[i], gain=1.0) + nn.init.zeros_(self.qo_bank.data[n + i]) + nn.init.orthogonal_(self.kv_bank.data[i], gain=1.0) + nn.init.orthogonal_(self.kv_bank.data[n + i], gain=1.0) + nn.init.orthogonal_(self.mlp_up_bank.data[i], gain=1.0) + nn.init.zeros_(self.mlp_down_bank.data[i]) + self.qo_bank.data[n + i].mul_(proj_scale) + self.mlp_down_bank.data[i].mul_(proj_scale) + for name, module in self.named_modules(): + if isinstance(module, nn.Linear): + if getattr(module, "_zero_init", False): + nn.init.zeros_(module.weight) + elif (module.weight.ndim == 2 and module.weight.shape[0] >= 64 + and module.weight.shape[1] >= 64): + nn.init.orthogonal_(module.weight, gain=1.0) + + # ---- helpers ---- + + def _get_ve(self, unique_layer_idx: int, input_ids: Tensor, + ve_cache: dict) -> Tensor | None: + if self.ve_shared is None or unique_layer_idx not in self.ve_layer_indices: + return None + if "ve" not in ve_cache: + ve_cache["ve"] = self.ve_shared(input_ids) + ve_base = ve_cache["ve"] + ve_idx = self.ve_layer_indices.index(unique_layer_idx) + return ve_base * self.ve_layer_scales[ve_idx].to(dtype=ve_base.dtype) + + def _maybe_fq(self, w: Tensor, bank_idx: int) -> Tensor: + """Apply fake quantization if bank_idx belongs to the core.""" + if 
(self.core_quant_enabled and self.training + and self._core_bank_start <= bank_idx < self._core_bank_end): + return fake_quantize_weight(w, self.core_quant_bits, per_row=True) + return w + + def _bank_weights(self, bank_idx: int) -> tuple[ + Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]: + """Return (q_w, k_w, v_w, out_w, up_w, down_w) for a unique layer.""" + n = self.num_unique + q_w = self._maybe_fq(self.qo_bank[bank_idx], bank_idx) + out_w = self._maybe_fq(self.qo_bank[n + bank_idx], bank_idx) + k_w = self._maybe_fq(self.kv_bank[bank_idx], bank_idx) + v_w = self._maybe_fq(self.kv_bank[n + bank_idx], bank_idx) + up_w = self._maybe_fq(self.mlp_up_bank[bank_idx], bank_idx) + down_w = self._maybe_fq(self.mlp_down_bank[bank_idx], bank_idx) + return q_w, k_w, v_w, out_w, up_w, down_w + + # ---- forward (training) ---- + + def forward(self, input_ids: Tensor, target_ids: Tensor, + feedback_fn=None, + stabilizer=None) -> Tensor: + """Full forward with loss. + + Args: + feedback_fn: callable(h, pass_idx) -> correction | None + stabilizer: RecurrentStabilizer instance (or None) + """ + x = self.tok_emb(input_ids) + if self.bigram is not None: + x = x + self.bigram(input_ids) + x = F.rms_norm(x, (x.size(-1),)) + x = self.smear(x) + x0 = x + v0 = None + ve_cache: dict = {} + skips: list[Tensor] = [] + + # --- STEM --- + for i, blk in enumerate(self.stem_blocks): + bi = i + ve = self._get_ve(bi, input_ids, ve_cache) + q_w, k_w, v_w, out_w, up_w, down_w = self._bank_weights(bi) + x, raw_v = blk(x, x0, q_w, k_w, v_w, out_w, up_w, down_w, + v_embed=ve, v0=v0) + if v0 is None and raw_v is not None: + v0 = raw_v + skips.append(x) + + # --- RECURRENT CORE --- + for k in range(self.num_passes): + for j, blk in enumerate(self.core_blocks): + bi = self.num_stem + j + h_prev = x + + # correction injection (inactive on pass 0) + correction = None + if feedback_fn is not None: + correction = feedback_fn(x, k) + if correction is not None: + x = x + correction + + if stabilizer is not None: + x = stabilizer.clip(x) + + ve = self._get_ve(bi, input_ids, ve_cache) + q_w, k_w, v_w, out_w, up_w, down_w = self._bank_weights(bi) + x, _ = blk(x, x0, q_w, k_w, v_w, out_w, up_w, down_w, + v_embed=ve, v0=v0) + + if stabilizer is not None: + stabilizer.record_pass(h_prev, x, correction=correction) + + # --- TAIL --- + for i, blk in enumerate(self.tail_blocks): + bi = self.num_stem + self.num_core + i + if skips: + x = x + (self.skip_weights[i].to(dtype=x.dtype)[None, None, :] + * skips.pop()) + ve = self._get_ve(bi, input_ids, ve_cache) + q_w, k_w, v_w, out_w, up_w, down_w = self._bank_weights(bi) + x, _ = blk(x, x0, q_w, k_w, v_w, out_w, up_w, down_w, + v_embed=ve, v0=v0) + + # --- OUTPUT --- + x = self.final_norm(x) + x_flat = x.reshape(-1, x.size(-1)) + targets = target_ids.reshape(-1) + if self.tie_embeddings: + logits_proj = F.linear(x_flat, self.tok_emb.weight) + else: + logits_proj = self.lm_head(x_flat) + logits = self.logit_softcap * torch.tanh( + logits_proj / self.logit_softcap) + return F.cross_entropy(logits.float(), targets, reduction="mean") + + # ---- forward_logits (eval) ---- + + def forward_logits(self, input_ids: Tensor, + feedback_fn=None, + stabilizer=None) -> Tensor: + """Return logits (bsz, seq_len, vocab) without loss.""" + x = self.tok_emb(input_ids) + if self.bigram is not None: + x = x + self.bigram(input_ids) + x = F.rms_norm(x, (x.size(-1),)) + x = self.smear(x) + x0 = x + v0 = None + ve_cache: dict = {} + skips: list[Tensor] = [] + + for i, blk in enumerate(self.stem_blocks): + bi = i + ve = 
self._get_ve(bi, input_ids, ve_cache) + q_w, k_w, v_w, out_w, up_w, down_w = self._bank_weights(bi) + x, raw_v = blk(x, x0, q_w, k_w, v_w, out_w, up_w, down_w, + v_embed=ve, v0=v0) + if v0 is None and raw_v is not None: + v0 = raw_v + skips.append(x) + + for k in range(self.num_passes): + for j, blk in enumerate(self.core_blocks): + bi = self.num_stem + j + correction = None + if feedback_fn is not None: + correction = feedback_fn(x, k) + if correction is not None: + x = x + correction + if stabilizer is not None: + x = stabilizer.clip(x) + ve = self._get_ve(bi, input_ids, ve_cache) + q_w, k_w, v_w, out_w, up_w, down_w = self._bank_weights(bi) + x, _ = blk(x, x0, q_w, k_w, v_w, out_w, up_w, down_w, + v_embed=ve, v0=v0) + + for i, blk in enumerate(self.tail_blocks): + bi = self.num_stem + self.num_core + i + if skips: + x = x + (self.skip_weights[i].to(dtype=x.dtype)[None, None, :] + * skips.pop()) + ve = self._get_ve(bi, input_ids, ve_cache) + q_w, k_w, v_w, out_w, up_w, down_w = self._bank_weights(bi) + x, _ = blk(x, x0, q_w, k_w, v_w, out_w, up_w, down_w, + v_embed=ve, v0=v0) + + x = self.final_norm(x) + if self.tie_embeddings: + logits_proj = F.linear(x, self.tok_emb.weight) + else: + logits_proj = self.lm_head(x) + return self.logit_softcap * torch.tanh( + logits_proj / self.logit_softcap) + + # ---- bank utility (for export) ---- + + def core_bank_indices(self) -> list[int]: + return list(range(self._core_bank_start, self._core_bank_end)) + + def stem_bank_indices(self) -> list[int]: + return list(range(self.num_stem)) + + def tail_bank_indices(self) -> list[int]: + start = self.num_stem + self.num_core + return list(range(start, start + self.num_tail)) + + def all_blocks(self) -> list[Block]: + return (list(self.stem_blocks) + list(self.core_blocks) + + list(self.tail_blocks)) diff --git a/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/quant.py b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/quant.py new file mode 100644 index 0000000000..b475839dc8 --- /dev/null +++ b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/quant.py @@ -0,0 +1,81 @@ +"""Fake quantization and export helpers for recurrent core. + +Provides STE-based fake quantization for training and matching export +quantization. Supports symmetric quantization with configurable bits (5, 6, 8), +per-tensor and per-row modes, and selective application to recurrent core. 
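
A minimal usage sketch (shapes are illustrative; all four helpers are
defined below):

    w = torch.randn(512, 512)               # one core weight slice
    w_fq = fake_quantize_weight(w, bits=6)  # STE fake quant, per-row scales
    q, scale = quantize_for_export(w, 6)    # int8 codes + fp16 scales
    w_rt = dequantize_exported(q, scale)    # ~= w_fq up to fp16 scale rounding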
+""" +from __future__ import annotations +import torch +import torch.nn as nn +from torch import Tensor + + +class _FakeQuantSTE(torch.autograd.Function): + @staticmethod + def forward(ctx, w: Tensor, scale: Tensor, qmin: int, qmax: int) -> Tensor: + return torch.clamp(torch.round(w / scale), qmin, qmax) * scale + + @staticmethod + def backward(ctx, grad_output: Tensor): + return grad_output, None, None, None + + +def _compute_scale(w: Tensor, bits: int, per_row: bool) -> Tensor: + qmax = (1 << (bits - 1)) - 1 + if per_row and w.ndim == 2: + amax = w.detach().abs().amax(dim=1, keepdim=True) + else: + amax = w.detach().abs().amax() + return (amax / qmax).clamp_min(1.0 / qmax) + + +def fake_quantize_weight(w: Tensor, bits: int = 6, per_row: bool = True) -> Tensor: + """Apply symmetric fake quantization with STE.""" + qmax = (1 << (bits - 1)) - 1 + qmin = -qmax - 1 + scale = _compute_scale(w, bits, per_row) + return _FakeQuantSTE.apply(w, scale, qmin, qmax) + + +def fake_quantize_bank(bank: Tensor, bits: int = 6, per_row: bool = True, + indices: list[int] | None = None) -> Tensor: + """Apply fake quantization to selected slices of a 3-D bank tensor. + + If *indices* is None every slice is quantized; otherwise only the + listed indices are touched and the rest pass through unchanged. + """ + if indices is None: + indices = list(range(bank.shape[0])) + out = bank + for i in indices: + q_slice = fake_quantize_weight(bank[i], bits, per_row) + if out is bank: + out = bank.clone() + out[i] = q_slice + return out + + +# --------------- export helpers ------------------------------------------------ + +def quantize_for_export(w: Tensor, bits: int = 6) -> tuple[Tensor, Tensor]: + """True integer quantization for model export.""" + qmax = (1 << (bits - 1)) - 1 + w32 = w.float() + if w32.ndim == 2: + amax = w32.abs().amax(dim=1) + scale = (amax / qmax).clamp_min(1.0 / qmax).to(torch.float16) + q = torch.clamp(torch.round(w32 / scale.float()[:, None]), + -qmax - 1, qmax).to(torch.int8) + return q, scale + amax = w32.abs().max().item() + scale = torch.tensor(amax / qmax if amax > 0 else 1.0, dtype=torch.float16) + q = torch.clamp(torch.round(w32 / scale.float()), + -qmax - 1, qmax).to(torch.int8) + return q, scale + + +def dequantize_exported(q: Tensor, scale: Tensor, dtype: torch.dtype = torch.bfloat16) -> Tensor: + if scale.ndim > 0: + return (q.float() * scale.float().view(q.shape[0], + *([1] * (q.ndim - 1)))).to(dtype) + return (q.float() * float(scale.item())).to(dtype) diff --git a/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/smoke_test.sh b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/smoke_test.sh new file mode 100644 index 0000000000..6d27421dce --- /dev/null +++ b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/smoke_test.sh @@ -0,0 +1,59 @@ +#!/bin/bash +# Quick 1-GPU smoke test to validate code correctness. +# Runs ~100 steps with small settings — NOT for competitive BPB. 
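#
# To exercise one variant in isolation while debugging (a sketch, assuming
# you run from this directory), the env block below can be sourced by hand:
#   source <(grep '^export' smoke_test.sh)
#   python train_bestbase_recurrent_qat.py --ttt-regime tail_only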
+# +# Usage: +# cd records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT +# bash smoke_test.sh +# +# Prerequisites: +# - 1+ CUDA GPU +# - Data downloaded: python data/cached_challenge_fineweb.py +# - Dependencies: pip install sentencepiece numpy flash-attn + +set -euo pipefail + +export DATA_PATH="../../data/datasets/fineweb10B_sp1024" +export TOKENIZER_PATH="../../data/tokenizers/fineweb_1024_bpe.model" + +# Minimal settings for a quick correctness check +export ITERATIONS=100 +export MAX_WALLCLOCK_SECONDS=120 +export VAL_LOSS_EVERY=50 +export TRAIN_LOG_EVERY=10 +export WARMUP_STEPS=5 +export WARMDOWN_ITERS=30 +export TRAIN_BATCH_TOKENS=131072 +export TTT_ENABLED=0 + +# Recurrence config +export NUM_STEM_LAYERS=3 +export NUM_CORE_LAYERS=2 +export NUM_TAIL_LAYERS=3 +export NUM_PASSES=3 +export CORE_QUANT_BITS=6 +export CORE_QUANT_ENABLED=1 +export BIGRAM_VOCAB_SIZE=1536 +export XSA_LAST_N=4 +export ROPE_DIMS=16 +export LN_SCALE=1 +export VE_ENABLED=1 +export VE_DIM=128 +export VE_LAYERS="6,7" +export SWA_ENABLED=0 + +echo "=== Smoke test: train_bestbase_recurrent_qat.py (QAT only) ===" +python train_bestbase_recurrent_qat.py --ttt-regime tail_only 2>&1 | tail -20 + +echo "" +echo "=== Smoke test: train_bestbase_recurrent_feedback_fixed.py (fixed feedback) ===" +python train_bestbase_recurrent_feedback_fixed.py \ + --feedback-mode diagonal --feedback-rank 2 --ttt-regime tail_only 2>&1 | tail -20 + +echo "" +echo "=== Smoke test: train_bestbase_recurrent_feedback_learned.py (learned feedback) ===" +python train_bestbase_recurrent_feedback_learned.py \ + --feedback-mode diagonal --feedback-rank 2 --ttt-regime tail_only 2>&1 | tail -20 + +echo "" +echo "All smoke tests passed!" diff --git a/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/stability.py b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/stability.py new file mode 100644 index 0000000000..a02c831638 --- /dev/null +++ b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/stability.py @@ -0,0 +1,108 @@ +"""Stability monitoring and control for recurrent passes. + +Provides per-pass diagnostics, hidden-state clipping, learnable residual +scaling, and a cheap Jacobian proxy regulariser. 
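
Typical wiring (a sketch; the training scripts pass the stabilizer into the
model forward, which invokes these hooks once per recurrence pass):

    stab = RecurrentStabilizer(clip_hidden=True, clip_value=10.0)
    h = stab.clip(h)             # bound the hidden state inside the rollout
    stab.record_pass(h_prev, h)  # record norms and growth ratios
    print(stab.diagnostics.summary()["growth_ratios"])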
+""" +from __future__ import annotations +import torch +import torch.nn as nn +from torch import Tensor +from dataclasses import dataclass, field + + +@dataclass +class PassDiagnostics: + h_norms: list[float] = field(default_factory=list) + delta_norms: list[float] = field(default_factory=list) + error_norms: list[float] = field(default_factory=list) + correction_norms: list[float] = field(default_factory=list) + growth_ratios: list[float] = field(default_factory=list) + + def reset(self): + for lst in (self.h_norms, self.delta_norms, self.error_norms, + self.correction_norms, self.growth_ratios): + lst.clear() + + def summary(self) -> dict[str, list[float]]: + return { + "h_norms": list(self.h_norms), + "delta_norms": list(self.delta_norms), + "error_norms": list(self.error_norms), + "correction_norms": list(self.correction_norms), + "growth_ratios": list(self.growth_ratios), + } + + +class RecurrentStabilizer: + """Manages stability diagnostics and optional controls for recurrence.""" + + def __init__( + self, + clip_hidden: bool = False, + clip_value: float = 10.0, + clip_mode: str = "value", + jacobian_proxy_weight: float = 0.0, + eps: float = 1e-6, + ): + self.clip_hidden = clip_hidden + self.clip_value = clip_value + self.clip_mode = clip_mode + self.jacobian_proxy_weight = jacobian_proxy_weight + self.eps = eps + self.diagnostics = PassDiagnostics() + + def clip(self, h: Tensor) -> Tensor: + if not self.clip_hidden: + return h + if self.clip_mode == "value": + return torch.clamp(h, -self.clip_value, self.clip_value) + norm = h.norm(dim=-1, keepdim=True) + scale = torch.clamp(self.clip_value / (norm + self.eps), max=1.0) + return h * scale + + def record_pass( + self, + h_prev: Tensor, + h_next: Tensor, + error: Tensor | None = None, + correction: Tensor | None = None, + ): + with torch.no_grad(): + h_pn = h_prev.float().norm().item() + h_nn = h_next.float().norm().item() + self.diagnostics.h_norms.append(h_nn) + self.diagnostics.delta_norms.append( + (h_next - h_prev).float().norm().item() + ) + self.diagnostics.growth_ratios.append(h_nn / (h_pn + self.eps)) + if error is not None: + self.diagnostics.error_norms.append(error.float().norm().item()) + if correction is not None: + self.diagnostics.correction_norms.append( + correction.float().norm().item() + ) + + def jacobian_proxy_loss(self, h_in: Tensor, h_out: Tensor) -> Tensor: + """Finite-difference proxy for Jacobian spectral norm.""" + if self.jacobian_proxy_weight <= 0: + return h_in.new_zeros(()) + delta = h_out - h_in + ratio = delta.norm() / (h_in.norm() + self.eps) + return self.jacobian_proxy_weight * torch.relu(ratio - 1.0).square() + + def reset(self): + self.diagnostics.reset() + + +class ResidualScale(nn.Module): + """Learnable per-pass residual scaling: + h_{k+1} = h_k + alpha_k * F(h_k + c_k)""" + + def __init__(self, num_passes: int, init_value: float = 1.0): + super().__init__() + self.scales = nn.Parameter( + torch.full((num_passes,), init_value, dtype=torch.float32) + ) + + def forward(self, residual: Tensor, pass_idx: int) -> Tensor: + return self.scales[pass_idx].to(dtype=residual.dtype) * residual diff --git a/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/submission.json b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/submission.json new file mode 100644 index 0000000000..a80172c4ff --- /dev/null +++ b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/submission.json @@ -0,0 +1,9 @@ +{ + "name": "Recurrent Core + Learned Feedback + 
Full-Rollout QAT", + "val_bpb": null, + "bytes_total": null, + "blurb": "Shared recurrent core (stem/core/tail partitioning) with STE fake-quantized core weights, low-rank error-feedback correction, and recurrence-safe TTT. Built on the LeakyReLU² + Legal TTT + Parallel Muon record (1.1194 BPB). Three script variants: QAT-only baseline, fixed feedback, learned feedback.", + "author": "nesta.midavaine", + "github_id": "nesta.midavaine", + "date": "2026-03-25" +} diff --git a/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_bestbase_recurrent_feedback_fixed.py b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_bestbase_recurrent_feedback_fixed.py new file mode 100644 index 0000000000..ff18bceb2c --- /dev/null +++ b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_bestbase_recurrent_feedback_fixed.py @@ -0,0 +1,377 @@ +"""Script 2: Recurrent core with *fixed-form* error-feedback correction. + +Adds a small fixed correction path on top of the QAT baseline (Script 1). + +Residual approximation: + e_k = U(V^T h_k) -- low-rank, tiny rank (1-4) + +Correction options: + 1. identity: c_k = e_k + 2. diagonal: c_k = d ⊙ e_k (d learned or init to ones) + +Forward: + h_{k+1} = f_{W_q}(h_k + c_k) (correction inactive on pass 0) +""" +from __future__ import annotations + +import argparse +import copy +import math +import os +import random +import subprocess +import sys +import time +import uuid +from pathlib import Path + +import numpy as np +import sentencepiece as spm +import torch +import torch.distributed as dist +import torch.nn.functional as F + +from model_recurrent_bestbase import ( + CastedLinear, RecurrentGPT, restore_low_dim_params_to_fp32, +) +from feedback import ErrorFeedbackModule +from stability import RecurrentStabilizer +from train_utils_recurrent import ( + Hyperparameters, Muon, DistributedTokenLoader, + build_sentencepiece_luts, load_validation_tokens, + eval_val, eval_val_sliding, export_and_roundtrip, + build_model, build_optimizers, + add_common_args, apply_cli_overrides, +) +from ttt_recurrent import eval_val_sliding_ttt + + +def parse_args() -> argparse.Namespace: + parser = argparse.ArgumentParser( + description="Recurrent bestbase with fixed error-feedback") + add_common_args(parser) + return parser.parse_args() + + +def main() -> None: + cli = parse_args() + code = Path(__file__).read_text(encoding="utf-8") + args = Hyperparameters() + args = apply_cli_overrides(args, cli) + + # Force fixed-form: no per-pass, only identity or diagonal + if cli.feedback_mode == "low_rank": + cli.feedback_mode = "diagonal" + cli.per_pass_feedback = False + + distributed = "RANK" in os.environ and "WORLD_SIZE" in os.environ + rank = int(os.environ.get("RANK", "0")) + world_size = int(os.environ.get("WORLD_SIZE", "1")) + local_rank = int(os.environ.get("LOCAL_RANK", "0")) + grad_accum_steps = 8 // world_size + grad_scale = 1.0 / grad_accum_steps + device = torch.device("cuda", local_rank) + torch.cuda.set_device(device) + if distributed: + dist.init_process_group(backend="nccl", device_id=device) + dist.barrier() + master_process = rank == 0 + torch.backends.cuda.matmul.allow_tf32 = True + torch.backends.cudnn.allow_tf32 = True + + logfile = None + if master_process: + os.makedirs("logs", exist_ok=True) + logfile = f"logs/{args.run_id}.txt" + + def log0(msg: str, console: bool = True) -> None: + if not master_process: + return + if console: + print(msg) + if logfile is not None: + with open(logfile, "a", 
encoding="utf-8") as f: + print(msg, file=f) + + log0(code, console=False) + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + torch.cuda.manual_seed_all(args.seed) + + sp = spm.SentencePieceProcessor(model_file=args.tokenizer_path) + effective_eval_seq_len = (args.eval_seq_len + if args.eval_seq_len > 0 else args.train_seq_len) + val_seq_len = max(args.train_seq_len, effective_eval_seq_len) + val_tokens = load_validation_tokens(args.val_files, val_seq_len) + base_bytes_lut, has_leading_space_lut, is_boundary_token_lut = \ + build_sentencepiece_luts(sp, args.vocab_size, device) + + base_model = build_model(args, device) + n_params = sum(p.numel() for p in base_model.parameters()) + log0(f"model_params:{n_params} stem:{base_model.num_stem} " + f"core:{base_model.num_core} tail:{base_model.num_tail} " + f"passes:{base_model.num_passes}") + + # Fixed feedback (shared, no per-pass) + feedback = ErrorFeedbackModule( + dim=args.model_dim, + rank=cli.feedback_rank, + feedback_mode=cli.feedback_mode, + per_pass=False, + num_passes=args.num_passes, + affine_junction=False, + ).to(device).bfloat16() + restore_low_dim_params_to_fp32(feedback) + fb_params = sum(p.numel() for p in feedback.parameters()) + log0(f"feedback: mode={cli.feedback_mode} rank={cli.feedback_rank} " + f"fixed=True params={fb_params}") + + def feedback_fn(h, pass_idx): + return feedback(h, pass_idx) + + stabilizer = RecurrentStabilizer( + clip_hidden=cli.clip_hidden, clip_value=cli.clip_value, + jacobian_proxy_weight=cli.jacobian_proxy_weight) + + compiled_model = torch.compile(base_model, dynamic=False, fullgraph=True) + model = compiled_model + + extra_scalar = list(feedback.parameters()) + optimizers, replicated_params = build_optimizers( + base_model, args, extra_scalar_params=extra_scalar) + optimizer_muon = optimizers[1] + + train_loader = DistributedTokenLoader( + args.train_files, rank, world_size, device) + + def zero_grad_all(): + for opt in optimizers: + opt.zero_grad(set_to_none=True) + + max_wallclock_ms = (1000.0 * args.max_wallclock_seconds + if args.max_wallclock_seconds > 0 else None) + + def lr_mul(step: int, elapsed_ms: float) -> float: + if args.warmdown_iters <= 0: + return 1.0 + if max_wallclock_ms is None: + wd_start = max(args.iterations - args.warmdown_iters, 0) + if wd_start <= step < args.iterations: + return max((args.iterations - step) + / max(args.warmdown_iters, 1), 0.0) + return 1.0 + step_ms = elapsed_ms / max(step, 1) + warmdown_ms = args.warmdown_iters * step_ms + remaining_ms = max(max_wallclock_ms - elapsed_ms, 0.0) + if remaining_ms <= warmdown_ms: + return remaining_ms / max(warmdown_ms, 1e-9) + return 1.0 + + # warmup + if args.warmup_steps > 0: + init_model = {n: t.detach().cpu().clone() + for n, t in base_model.state_dict().items()} + init_fb = {n: t.detach().cpu().clone() + for n, t in feedback.state_dict().items()} + init_opts = [copy.deepcopy(o.state_dict()) for o in optimizers] + model.train() + feedback.train() + for ws in range(args.warmup_steps): + zero_grad_all() + for _ in range(grad_accum_steps): + x, y = train_loader.next_batch( + args.train_batch_tokens, args.train_seq_len, + grad_accum_steps) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + loss = model(x, y, feedback_fn=feedback_fn, + stabilizer=stabilizer) + (loss * grad_scale).backward() + if distributed: + for p in list(base_model.parameters()) + list(feedback.parameters()): + if p.grad is not None: + dist.all_reduce(p.grad, op=dist.ReduceOp.AVG) + for opt in optimizers: 
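                # warmup takes real optimizer steps only to populate Muon /
                # Adam state; the pre-warmup model, feedback, and optimizer
                # snapshots are restored right after this loop.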
+ opt.step() + zero_grad_all() + base_model.load_state_dict(init_model, strict=True) + feedback.load_state_dict(init_fb, strict=True) + for opt, st in zip(optimizers, init_opts, strict=True): + opt.load_state_dict(st) + zero_grad_all() + train_loader = DistributedTokenLoader( + args.train_files, rank, world_size, device) + + # EMA for both model and feedback + all_state = {**{f"m.{k}": v for k, v in base_model.state_dict().items()}, + **{f"fb.{k}": v for k, v in feedback.state_dict().items()}} + ema_state = {n: t.detach().float().clone() for n, t in all_state.items()} + ema_decay = 0.997 + swa_state = None + swa_count = 0 + training_time_ms = 0.0 + stop_after_step = None + torch.cuda.synchronize() + t0 = time.perf_counter() + step = 0 + + while True: + last_step = (step == args.iterations + or (stop_after_step is not None + and step >= stop_after_step)) + should_validate = (last_step + or (args.val_loss_every > 0 + and step % args.val_loss_every == 0)) + + if should_validate: + torch.cuda.synchronize() + training_time_ms += 1000.0 * (time.perf_counter() - t0) + val_loss, val_bpb = eval_val( + args, model, rank, world_size, device, grad_accum_steps, + val_tokens, base_bytes_lut, has_leading_space_lut, + is_boundary_token_lut) + diag = stabilizer.diagnostics.summary() + diag_str = "" + if diag["correction_norms"]: + diag_str = (f" c_norms={[f'{v:.2f}' for v in diag['correction_norms'][-4:]]}") + log0(f"step:{step}/{args.iterations} val_loss:{val_loss:.4f} " + f"val_bpb:{val_bpb:.4f} " + f"train_time:{training_time_ms:.0f}ms " + f"step_avg:{training_time_ms/max(step,1):.2f}ms" + f"{diag_str}") + stabilizer.reset() + torch.cuda.synchronize() + t0 = time.perf_counter() + + if last_step: + break + + elapsed_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0) + scale = lr_mul(step, elapsed_ms) + + if (args.late_qat_threshold > 0 and scale < args.late_qat_threshold + and not CastedLinear._qat_enabled): + CastedLinear._qat_enabled = True + + zero_grad_all() + train_loss = torch.zeros((), device=device) + + for _ in range(grad_accum_steps): + x, y = train_loader.next_batch( + args.train_batch_tokens, args.train_seq_len, grad_accum_steps) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + loss = model(x, y, feedback_fn=feedback_fn, + stabilizer=stabilizer) + train_loss += loss.detach() + (loss * grad_scale).backward() + + train_loss /= grad_accum_steps + + frac = (min(step / args.muon_momentum_warmup_steps, 1.0) + if args.muon_momentum_warmup_steps > 0 else 1.0) + for group in optimizer_muon.param_groups: + group["momentum"] = ((1 - frac) * args.muon_momentum_warmup_start + + frac * args.muon_momentum) + for opt in optimizers: + for group in opt.param_groups: + group["lr"] = group.get("base_lr", group["lr"]) * scale + + if args.grad_clip_norm > 0: + all_params = list(base_model.parameters()) + list(feedback.parameters()) + torch.nn.utils.clip_grad_norm_(all_params, args.grad_clip_norm) + + optimizer_muon.launch_reduce_scatters() + if distributed: + for p in replicated_params: + if p.grad is not None: + dist.all_reduce(p.grad, op=dist.ReduceOp.AVG) + for opt in optimizers: + if opt is not optimizer_muon: + opt.step() + optimizer_muon.step() + zero_grad_all() + + with torch.no_grad(): + curr = {**{f"m.{k}": v for k, v in base_model.state_dict().items()}, + **{f"fb.{k}": v for k, v in feedback.state_dict().items()}} + for n, t in curr.items(): + ema_state[n].mul_(ema_decay).add_( + t.detach().float(), alpha=1.0 - ema_decay) + + step += 1 + approx_ms = training_time_ms + 1000.0 * 
(time.perf_counter() - t0) + + if args.swa_enabled and scale < 0.2 and step % args.swa_every == 0: + if swa_state is None: + swa_state = {n: t.detach().cpu().clone() for n, t in curr.items()} + swa_count = 1 + else: + for n, t in curr.items(): + swa_state[n] += t.detach().cpu() + swa_count += 1 + + should_log = (args.train_log_every > 0 + and (step <= 10 or step % args.train_log_every == 0)) + if should_log: + log0(f"step:{step}/{args.iterations} " + f"train_loss:{train_loss.item():.4f} " + f"train_time:{approx_ms:.0f}ms " + f"step_avg:{approx_ms/step:.2f}ms") + + reached_cap = (max_wallclock_ms is not None + and approx_ms >= max_wallclock_ms) + if distributed and max_wallclock_ms is not None: + cap_t = torch.tensor(int(reached_cap), device=device) + dist.all_reduce(cap_t, op=dist.ReduceOp.MAX) + reached_cap = bool(cap_t.item()) + if stop_after_step is None and reached_cap: + stop_after_step = step + + # apply EMA + log0("ema:applying EMA weights") + m_sd = base_model.state_dict() + fb_sd = feedback.state_dict() + for n, t in m_sd.items(): + m_sd[n] = ema_state[f"m.{n}"].to(dtype=t.dtype) + for n, t in fb_sd.items(): + fb_sd[n] = ema_state[f"fb.{n}"].to(dtype=t.dtype) + base_model.load_state_dict(m_sd, strict=True) + feedback.load_state_dict(fb_sd, strict=True) + + log0(f"peak memory: {torch.cuda.max_memory_allocated()//1024//1024} MiB") + + # diagnostic + diag_loss, diag_bpb = eval_val( + args, compiled_model, rank, world_size, device, grad_accum_steps, + val_tokens, base_bytes_lut, has_leading_space_lut, + is_boundary_token_lut) + log0(f"DIAGNOSTIC post_ema val_loss:{diag_loss:.4f} val_bpb:{diag_bpb:.4f}") + + # export + eval_model = export_and_roundtrip( + base_model, args, log0, device, rank, world_size, + grad_accum_steps, val_tokens, base_bytes_lut, + has_leading_space_lut, is_boundary_token_lut, + feedback_module=feedback, stabilizer=stabilizer) + + # TTT + if args.ttt_enabled: + torch.cuda.synchronize() + t_ttt = time.perf_counter() + ttt_loss, ttt_bpb = eval_val_sliding_ttt( + args, eval_model, rank, world_size, device, + val_tokens, base_bytes_lut, has_leading_space_lut, + is_boundary_token_lut, stride=args.eval_stride, log0=log0, + feedback_fn=feedback_fn, stabilizer=stabilizer, + ttt_regime=cli.ttt_regime, + ttt_recurrent_lr_scale=cli.ttt_recurrent_lr_scale) + torch.cuda.synchronize() + log0(f"legal_ttt val_loss:{ttt_loss:.4f} val_bpb:{ttt_bpb:.4f} " + f"eval_time:{1000*(time.perf_counter()-t_ttt):.0f}ms") + + if distributed: + dist.destroy_process_group() + + +if __name__ == "__main__": + main() diff --git a/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_bestbase_recurrent_feedback_learned.py b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_bestbase_recurrent_feedback_learned.py new file mode 100644 index 0000000000..29deeaf61a --- /dev/null +++ b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_bestbase_recurrent_feedback_learned.py @@ -0,0 +1,459 @@ +"""Script 3: Recurrent core with *learned* error-feedback correction. + +This is the strongest variant and the main experimental target. + +Forward pass: + h_0 = Stem(x) + e_k = U(V^T h_k) -- low-rank residual approx. + c_k = D_k(e_k) -- learned correction (diagonal or low-rank) + h_{k+1} = f_{W_q}(h_k + c_k) -- corrected recurrent update + logits = LMHead(Tail(h_K)) + +Learned correction operators: + 1. shared diagonal : c_k = diag(d) e_k + 2. per-pass diagonal : c_k = diag(d_k) e_k + 3. shared low-rank : c_k = U_D (V_D^T e_k) + 4. 
per-pass low-rank : c_k = U_{D,k} (V_{D,k}^T e_k) + +Supports optional affine junction and Jacobian proxy regularization. +""" +from __future__ import annotations + +import argparse +import copy +import math +import os +import random +import subprocess +import sys +import time +import uuid +from collections import deque +from pathlib import Path + +import numpy as np +import sentencepiece as spm +import torch +import torch.distributed as dist +import torch.nn.functional as F + +from model_recurrent_bestbase import ( + CastedLinear, RecurrentGPT, restore_low_dim_params_to_fp32, +) +from feedback import ErrorFeedbackModule +from stability import RecurrentStabilizer, ResidualScale +from train_utils_recurrent import ( + Hyperparameters, Muon, DistributedTokenLoader, + build_sentencepiece_luts, load_validation_tokens, + eval_val, eval_val_sliding, export_and_roundtrip, + build_model, build_optimizers, + add_common_args, apply_cli_overrides, +) +from ttt_recurrent import eval_val_sliding_ttt + + +def parse_args() -> argparse.Namespace: + parser = argparse.ArgumentParser( + description="Recurrent bestbase with learned error-feedback") + add_common_args(parser) + parser.add_argument("--warm-start-steps", type=int, default=0, + help="Steps to freeze core and train only correction") + return parser.parse_args() + + +def main() -> None: + cli = parse_args() + code = Path(__file__).read_text(encoding="utf-8") + args = Hyperparameters() + args = apply_cli_overrides(args, cli) + + # ---- distributed setup ---- + distributed = "RANK" in os.environ and "WORLD_SIZE" in os.environ + rank = int(os.environ.get("RANK", "0")) + world_size = int(os.environ.get("WORLD_SIZE", "1")) + local_rank = int(os.environ.get("LOCAL_RANK", "0")) + grad_accum_steps = 8 // world_size + grad_scale = 1.0 / grad_accum_steps + device = torch.device("cuda", local_rank) + torch.cuda.set_device(device) + if distributed: + dist.init_process_group(backend="nccl", device_id=device) + dist.barrier() + master_process = rank == 0 + torch.backends.cuda.matmul.allow_tf32 = True + torch.backends.cudnn.allow_tf32 = True + + logfile = None + if master_process: + os.makedirs("logs", exist_ok=True) + logfile = f"logs/{args.run_id}.txt" + print(logfile) + + def log0(msg: str, console: bool = True) -> None: + if not master_process: + return + if console: + print(msg) + if logfile is not None: + with open(logfile, "a", encoding="utf-8") as f: + print(msg, file=f) + + log0(code, console=False) + log0("=" * 100, console=False) + log0(f"Python {sys.version}", console=False) + log0(f"PyTorch {torch.__version__}", console=False) + log0(subprocess.run(["nvidia-smi"], capture_output=True, + text=True, check=False).stdout, console=False) + log0("=" * 100, console=False) + + # ---- seeding ---- + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + torch.cuda.manual_seed_all(args.seed) + + # ---- tokenizer ---- + sp = spm.SentencePieceProcessor(model_file=args.tokenizer_path) + effective_eval_seq_len = (args.eval_seq_len + if args.eval_seq_len > 0 else args.train_seq_len) + val_seq_len = max(args.train_seq_len, effective_eval_seq_len) + val_tokens = load_validation_tokens(args.val_files, val_seq_len) + base_bytes_lut, has_leading_space_lut, is_boundary_token_lut = \ + build_sentencepiece_luts(sp, args.vocab_size, device) + + # ---- model ---- + base_model = build_model(args, device) + n_params = sum(p.numel() for p in base_model.parameters()) + log0(f"model_params:{n_params} unique_layers:{base_model.num_unique} " + 
f"stem:{base_model.num_stem} core:{base_model.num_core} " + f"tail:{base_model.num_tail} passes:{base_model.num_passes}") + + # ---- feedback module ---- + feedback = ErrorFeedbackModule( + dim=args.model_dim, + rank=cli.feedback_rank, + feedback_mode=cli.feedback_mode, + per_pass=cli.per_pass_feedback, + num_passes=args.num_passes, + affine_junction=cli.affine_junction, + ).to(device).bfloat16() + restore_low_dim_params_to_fp32(feedback) + fb_params = sum(p.numel() for p in feedback.parameters()) + log0(f"feedback: mode={cli.feedback_mode} rank={cli.feedback_rank} " + f"per_pass={cli.per_pass_feedback} affine={cli.affine_junction} " + f"params={fb_params}") + + def feedback_fn(h, pass_idx): + return feedback(h, pass_idx) + + # ---- stabilizer ---- + stabilizer = RecurrentStabilizer( + clip_hidden=cli.clip_hidden, + clip_value=cli.clip_value, + jacobian_proxy_weight=cli.jacobian_proxy_weight, + ) + residual_scale = None + if cli.residual_scale_init != 1.0: + residual_scale = ResidualScale( + args.num_passes, cli.residual_scale_init + ).to(device) + + # ---- compile ---- + compiled_model = torch.compile(base_model, dynamic=False, fullgraph=True) + model = compiled_model + + # ---- optimizers ---- + extra_scalar = list(feedback.parameters()) + if residual_scale is not None: + extra_scalar.extend(residual_scale.parameters()) + optimizers, replicated_params = build_optimizers( + base_model, args, extra_scalar_params=extra_scalar) + optimizer_muon = optimizers[1] + + log0(f"world_size:{world_size} grad_accum_steps:{grad_accum_steps}") + log0(f"train_batch_tokens:{args.train_batch_tokens} " + f"train_seq_len:{args.train_seq_len} " + f"iterations:{args.iterations} seed:{args.seed}") + + # ---- data ---- + train_loader = DistributedTokenLoader( + args.train_files, rank, world_size, device) + + def zero_grad_all(): + for opt in optimizers: + opt.zero_grad(set_to_none=True) + + max_wallclock_ms = (1000.0 * args.max_wallclock_seconds + if args.max_wallclock_seconds > 0 else None) + + def lr_mul(step: int, elapsed_ms: float) -> float: + if args.warmdown_iters <= 0: + return 1.0 + if max_wallclock_ms is None: + wd_start = max(args.iterations - args.warmdown_iters, 0) + if wd_start <= step < args.iterations: + return max((args.iterations - step) + / max(args.warmdown_iters, 1), 0.0) + return 1.0 + step_ms = elapsed_ms / max(step, 1) + warmdown_ms = args.warmdown_iters * step_ms + remaining_ms = max(max_wallclock_ms - elapsed_ms, 0.0) + if remaining_ms <= warmdown_ms: + return remaining_ms / max(warmdown_ms, 1e-9) + return 1.0 + + # ---- warmup ---- + if args.warmup_steps > 0: + init_model_state = {n: t.detach().cpu().clone() + for n, t in base_model.state_dict().items()} + init_fb_state = {n: t.detach().cpu().clone() + for n, t in feedback.state_dict().items()} + init_opt_states = [copy.deepcopy(o.state_dict()) for o in optimizers] + model.train() + feedback.train() + for ws in range(args.warmup_steps): + zero_grad_all() + for _ in range(grad_accum_steps): + x, y = train_loader.next_batch( + args.train_batch_tokens, args.train_seq_len, + grad_accum_steps) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + loss = model(x, y, feedback_fn=feedback_fn, + stabilizer=stabilizer) + (loss * grad_scale).backward() + if distributed: + for p in base_model.parameters(): + if p.grad is not None: + dist.all_reduce(p.grad, op=dist.ReduceOp.AVG) + for p in feedback.parameters(): + if p.grad is not None: + dist.all_reduce(p.grad, op=dist.ReduceOp.AVG) + for opt in optimizers: + opt.step() + 
            zero_grad_all()
            if ws + 1 == args.warmup_steps or (ws + 1) % 10 == 0:
                log0(f"warmup_step:{ws+1}/{args.warmup_steps}")
        base_model.load_state_dict(init_model_state, strict=True)
        feedback.load_state_dict(init_fb_state, strict=True)
        for opt, st in zip(optimizers, init_opt_states, strict=True):
            opt.load_state_dict(st)
        zero_grad_all()
        train_loader = DistributedTokenLoader(
            args.train_files, rank, world_size, device)

    # ---- EMA / SWA ----
    all_state = {**{f"model.{k}": v for k, v in base_model.state_dict().items()},
                 **{f"fb.{k}": v for k, v in feedback.state_dict().items()}}
    ema_state = {n: t.detach().float().clone() for n, t in all_state.items()}
    ema_decay = 0.997
    swa_state: dict[str, torch.Tensor] | None = None
    swa_count = 0

    # ---- training loop ----
    training_time_ms = 0.0
    stop_after_step: int | None = None
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    step = 0

    while True:
        last_step = (step == args.iterations
                     or (stop_after_step is not None
                         and step >= stop_after_step))
        should_validate = (last_step
                           or (args.val_loss_every > 0
                               and step % args.val_loss_every == 0))

        if should_validate:
            torch.cuda.synchronize()
            training_time_ms += 1000.0 * (time.perf_counter() - t0)
            val_loss, val_bpb = eval_val(
                args, model, rank, world_size, device, grad_accum_steps,
                val_tokens, base_bytes_lut, has_leading_space_lut,
                is_boundary_token_lut)
            diag = stabilizer.diagnostics.summary()
            diag_str = ""
            if diag["h_norms"]:
                diag_str = (f" h_norms={[f'{v:.1f}' for v in diag['h_norms'][-4:]]} "
                            f"growth={[f'{v:.3f}' for v in diag['growth_ratios'][-4:]]}")
            log0(f"step:{step}/{args.iterations} val_loss:{val_loss:.4f} "
                 f"val_bpb:{val_bpb:.4f} "
                 f"train_time:{training_time_ms:.0f}ms "
                 f"step_avg:{training_time_ms/max(step,1):.2f}ms"
                 f"{diag_str}")
            stabilizer.reset()
            torch.cuda.synchronize()
            t0 = time.perf_counter()

        if last_step:
            if stop_after_step is not None and step < args.iterations:
                log0(f"stopping_early train_time:{training_time_ms:.0f}ms "
                     f"step:{step}/{args.iterations}")
            break

        # ---- warm-start phase: freeze core, train only correction ----
        warm_start_active = (cli.warm_start_steps > 0
                             and step < cli.warm_start_steps)
        if warm_start_active and step == 0:
            log0(f"warm_start:freezing core for {cli.warm_start_steps} steps")
            # NB: this freezes only the core blocks' local parameters (norm
            # scales, gates, qk gains); the core weight matrices live in the
            # shared banks and keep training throughout the warm start.
            for p in list(base_model.core_blocks.parameters()):
                p.requires_grad_(False)
        if cli.warm_start_steps > 0 and step == cli.warm_start_steps:
            log0("warm_start:unfreezing core")
            for p in base_model.core_blocks.parameters():
                p.requires_grad_(True)

        elapsed_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0)
        scale = lr_mul(step, elapsed_ms)

        if (args.late_qat_threshold > 0 and scale < args.late_qat_threshold
                and not CastedLinear._qat_enabled):
            CastedLinear._qat_enabled = True
            log0(f"late_qat:enabled step:{step} scale:{scale:.4f}")

        zero_grad_all()
        train_loss = torch.zeros((), device=device)

        for micro_step in range(grad_accum_steps):
            x, y = train_loader.next_batch(
                args.train_batch_tokens, args.train_seq_len, grad_accum_steps)
            with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
                loss = model(x, y, feedback_fn=feedback_fn,
                             stabilizer=stabilizer)
                # NB: no Jacobian proxy term is added here; jacobian_proxy_loss
                # needs the real (h_prev, h_next) pair from inside the rollout,
                # and evaluating it on placeholder zero tensors always returns 0.
            train_loss += loss.detach()
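            # grad_scale = 1 / grad_accum_steps, so the accumulated gradient
            # over all micro-batches matches one full-batch backward pass.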
(loss * grad_scale).backward() + + train_loss /= grad_accum_steps + + # momentum warmup for Muon + frac = (min(step / args.muon_momentum_warmup_steps, 1.0) + if args.muon_momentum_warmup_steps > 0 else 1.0) + muon_mom = ((1 - frac) * args.muon_momentum_warmup_start + + frac * args.muon_momentum) + for group in optimizer_muon.param_groups: + group["momentum"] = muon_mom + + for opt in optimizers: + for group in opt.param_groups: + group["lr"] = group.get("base_lr", group["lr"]) * scale + + if args.grad_clip_norm > 0: + all_params = (list(base_model.parameters()) + + list(feedback.parameters())) + if residual_scale is not None: + all_params.extend(residual_scale.parameters()) + torch.nn.utils.clip_grad_norm_(all_params, args.grad_clip_norm) + + # 3-phase overlapped optimizer step + optimizer_muon.launch_reduce_scatters() + if distributed: + for p in replicated_params: + if p.grad is not None: + dist.all_reduce(p.grad, op=dist.ReduceOp.AVG) + for opt in optimizers: + if opt is not optimizer_muon: + opt.step() + optimizer_muon.step() + zero_grad_all() + + # EMA update + with torch.no_grad(): + current_state = { + **{f"model.{k}": v for k, v in base_model.state_dict().items()}, + **{f"fb.{k}": v for k, v in feedback.state_dict().items()}, + } + for n, t in current_state.items(): + ema_state[n].mul_(ema_decay).add_( + t.detach().float(), alpha=1.0 - ema_decay) + + step += 1 + approx_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0) + + # SWA + if args.swa_enabled and scale < 0.2 and step % args.swa_every == 0: + if swa_state is None: + swa_state = {n: t.detach().cpu().clone() + for n, t in current_state.items()} + swa_count = 1 + log0(f"swa:start step:{step}") + else: + for n, t in current_state.items(): + swa_state[n] += t.detach().cpu() + swa_count += 1 + + should_log = (args.train_log_every > 0 + and (step <= 10 or step % args.train_log_every == 0)) + if should_log: + log0(f"step:{step}/{args.iterations} " + f"train_loss:{train_loss.item():.4f} " + f"train_time:{approx_ms:.0f}ms " + f"step_avg:{approx_ms/step:.2f}ms") + + reached_cap = (max_wallclock_ms is not None + and approx_ms >= max_wallclock_ms) + if distributed and max_wallclock_ms is not None: + cap_t = torch.tensor(int(reached_cap), device=device) + dist.all_reduce(cap_t, op=dist.ReduceOp.MAX) + reached_cap = bool(cap_t.item()) + if stop_after_step is None and reached_cap: + stop_after_step = step + + # ---- apply EMA ---- + log0("ema:applying EMA weights") + model_sd = base_model.state_dict() + fb_sd = feedback.state_dict() + for n, t in model_sd.items(): + model_sd[n] = ema_state[f"model.{n}"].to(dtype=t.dtype) + for n, t in fb_sd.items(): + fb_sd[n] = ema_state[f"fb.{n}"].to(dtype=t.dtype) + base_model.load_state_dict(model_sd, strict=True) + feedback.load_state_dict(fb_sd, strict=True) + + log0(f"peak memory: {torch.cuda.max_memory_allocated()//1024//1024} MiB") + + # ---- diagnostic eval ---- + torch.cuda.synchronize() + t_diag = time.perf_counter() + diag_loss, diag_bpb = eval_val( + args, compiled_model, rank, world_size, device, grad_accum_steps, + val_tokens, base_bytes_lut, has_leading_space_lut, + is_boundary_token_lut) + torch.cuda.synchronize() + log0(f"DIAGNOSTIC post_ema val_loss:{diag_loss:.4f} " + f"val_bpb:{diag_bpb:.4f} " + f"eval_time:{1000*(time.perf_counter()-t_diag):.0f}ms") + + # ---- export + roundtrip eval ---- + eval_model = export_and_roundtrip( + base_model, args, log0, device, rank, world_size, + grad_accum_steps, val_tokens, base_bytes_lut, + has_leading_space_lut, is_boundary_token_lut, + 
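        # thread the correction path through so the round-trip eval runs
        # the same corrected rollout as training: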
feedback_module=feedback, stabilizer=stabilizer) + + # ---- TTT ---- + if args.ttt_enabled: + torch.cuda.synchronize() + t_ttt = time.perf_counter() + ttt_loss, ttt_bpb = eval_val_sliding_ttt( + args, eval_model, rank, world_size, device, + val_tokens, base_bytes_lut, has_leading_space_lut, + is_boundary_token_lut, stride=args.eval_stride, log0=log0, + feedback_fn=feedback_fn, stabilizer=stabilizer, + ttt_regime=cli.ttt_regime, + ttt_recurrent_lr_scale=cli.ttt_recurrent_lr_scale) + torch.cuda.synchronize() + log0(f"legal_ttt val_loss:{ttt_loss:.4f} val_bpb:{ttt_bpb:.4f} " + f"eval_time:{1000*(time.perf_counter()-t_ttt):.0f}ms") + + if distributed: + dist.destroy_process_group() + + +if __name__ == "__main__": + main() diff --git a/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_bestbase_recurrent_qat.py b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_bestbase_recurrent_qat.py new file mode 100644 index 0000000000..8156aff0db --- /dev/null +++ b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_bestbase_recurrent_qat.py @@ -0,0 +1,334 @@ +"""Script 1: Recurrent core with full-rollout QAT, **no** error feedback. + +This is the baseline recurrence experiment. It answers: how much of the +recurrence-vs-quantization problem disappears if we simply train the real +quantised recurrent rollout starting from the current best recipe? + +Forward pass: + h_0 = Stem(x) + h_{k+1} = f_{W_q}(h_k) (no correction term) + logits = LMHead(Tail(h_K)) +""" +from __future__ import annotations + +import argparse +import copy +import math +import os +import random +import subprocess +import sys +import time +import uuid +from pathlib import Path + +import numpy as np +import sentencepiece as spm +import torch +import torch.distributed as dist +import torch.nn.functional as F + +from model_recurrent_bestbase import ( + CastedLinear, RecurrentGPT, restore_low_dim_params_to_fp32, +) +from stability import RecurrentStabilizer +from train_utils_recurrent import ( + Hyperparameters, Muon, DistributedTokenLoader, + build_sentencepiece_luts, load_validation_tokens, + eval_val, eval_val_sliding, export_and_roundtrip, + build_model, build_optimizers, + add_common_args, apply_cli_overrides, +) +from ttt_recurrent import eval_val_sliding_ttt + + +def parse_args() -> argparse.Namespace: + parser = argparse.ArgumentParser( + description="Recurrent bestbase with QAT only (no feedback)") + add_common_args(parser) + return parser.parse_args() + + +def main() -> None: + cli = parse_args() + code = Path(__file__).read_text(encoding="utf-8") + args = Hyperparameters() + args = apply_cli_overrides(args, cli) + + distributed = "RANK" in os.environ and "WORLD_SIZE" in os.environ + rank = int(os.environ.get("RANK", "0")) + world_size = int(os.environ.get("WORLD_SIZE", "1")) + local_rank = int(os.environ.get("LOCAL_RANK", "0")) + grad_accum_steps = 8 // world_size + grad_scale = 1.0 / grad_accum_steps + device = torch.device("cuda", local_rank) + torch.cuda.set_device(device) + if distributed: + dist.init_process_group(backend="nccl", device_id=device) + dist.barrier() + master_process = rank == 0 + torch.backends.cuda.matmul.allow_tf32 = True + torch.backends.cudnn.allow_tf32 = True + + logfile = None + if master_process: + os.makedirs("logs", exist_ok=True) + logfile = f"logs/{args.run_id}.txt" + + def log0(msg: str, console: bool = True) -> None: + if not master_process: + return + if console: + print(msg) + if logfile is not None: + with open(logfile, 
"a", encoding="utf-8") as f: + print(msg, file=f) + + log0(code, console=False) + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + torch.cuda.manual_seed_all(args.seed) + + sp = spm.SentencePieceProcessor(model_file=args.tokenizer_path) + effective_eval_seq_len = (args.eval_seq_len + if args.eval_seq_len > 0 else args.train_seq_len) + val_seq_len = max(args.train_seq_len, effective_eval_seq_len) + val_tokens = load_validation_tokens(args.val_files, val_seq_len) + base_bytes_lut, has_leading_space_lut, is_boundary_token_lut = \ + build_sentencepiece_luts(sp, args.vocab_size, device) + + base_model = build_model(args, device) + n_params = sum(p.numel() for p in base_model.parameters()) + log0(f"model_params:{n_params} stem:{base_model.num_stem} " + f"core:{base_model.num_core} tail:{base_model.num_tail} " + f"passes:{base_model.num_passes}") + + stabilizer = RecurrentStabilizer( + clip_hidden=cli.clip_hidden, clip_value=cli.clip_value, + jacobian_proxy_weight=cli.jacobian_proxy_weight) + + compiled_model = torch.compile(base_model, dynamic=False, fullgraph=True) + model = compiled_model + + optimizers, replicated_params = build_optimizers(base_model, args) + optimizer_muon = optimizers[1] + + train_loader = DistributedTokenLoader( + args.train_files, rank, world_size, device) + + def zero_grad_all(): + for opt in optimizers: + opt.zero_grad(set_to_none=True) + + max_wallclock_ms = (1000.0 * args.max_wallclock_seconds + if args.max_wallclock_seconds > 0 else None) + + def lr_mul(step: int, elapsed_ms: float) -> float: + if args.warmdown_iters <= 0: + return 1.0 + if max_wallclock_ms is None: + wd_start = max(args.iterations - args.warmdown_iters, 0) + if wd_start <= step < args.iterations: + return max((args.iterations - step) + / max(args.warmdown_iters, 1), 0.0) + return 1.0 + step_ms = elapsed_ms / max(step, 1) + warmdown_ms = args.warmdown_iters * step_ms + remaining_ms = max(max_wallclock_ms - elapsed_ms, 0.0) + if remaining_ms <= warmdown_ms: + return remaining_ms / max(warmdown_ms, 1e-9) + return 1.0 + + # warmup + if args.warmup_steps > 0: + init_state = {n: t.detach().cpu().clone() + for n, t in base_model.state_dict().items()} + init_opts = [copy.deepcopy(o.state_dict()) for o in optimizers] + model.train() + for ws in range(args.warmup_steps): + zero_grad_all() + for _ in range(grad_accum_steps): + x, y = train_loader.next_batch( + args.train_batch_tokens, args.train_seq_len, + grad_accum_steps) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + loss = model(x, y, stabilizer=stabilizer) + (loss * grad_scale).backward() + if distributed: + for p in base_model.parameters(): + if p.grad is not None: + dist.all_reduce(p.grad, op=dist.ReduceOp.AVG) + for opt in optimizers: + opt.step() + zero_grad_all() + base_model.load_state_dict(init_state, strict=True) + for opt, st in zip(optimizers, init_opts, strict=True): + opt.load_state_dict(st) + zero_grad_all() + train_loader = DistributedTokenLoader( + args.train_files, rank, world_size, device) + + ema_state = {n: t.detach().float().clone() + for n, t in base_model.state_dict().items()} + ema_decay = 0.997 + swa_state = None + swa_count = 0 + training_time_ms = 0.0 + stop_after_step = None + torch.cuda.synchronize() + t0 = time.perf_counter() + step = 0 + + while True: + last_step = (step == args.iterations + or (stop_after_step is not None + and step >= stop_after_step)) + should_validate = (last_step + or (args.val_loss_every > 0 + and step % args.val_loss_every == 0)) + + if 
should_validate: + torch.cuda.synchronize() + training_time_ms += 1000.0 * (time.perf_counter() - t0) + val_loss, val_bpb = eval_val( + args, model, rank, world_size, device, grad_accum_steps, + val_tokens, base_bytes_lut, has_leading_space_lut, + is_boundary_token_lut) + log0(f"step:{step}/{args.iterations} val_loss:{val_loss:.4f} " + f"val_bpb:{val_bpb:.4f} " + f"train_time:{training_time_ms:.0f}ms " + f"step_avg:{training_time_ms/max(step,1):.2f}ms") + stabilizer.reset() + torch.cuda.synchronize() + t0 = time.perf_counter() + + if last_step: + break + + elapsed_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0) + scale = lr_mul(step, elapsed_ms) + + if (args.late_qat_threshold > 0 and scale < args.late_qat_threshold + and not CastedLinear._qat_enabled): + CastedLinear._qat_enabled = True + log0(f"late_qat:enabled step:{step}") + + zero_grad_all() + train_loss = torch.zeros((), device=device) + + for _ in range(grad_accum_steps): + x, y = train_loader.next_batch( + args.train_batch_tokens, args.train_seq_len, grad_accum_steps) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + loss = model(x, y, stabilizer=stabilizer) + train_loss += loss.detach() + (loss * grad_scale).backward() + + train_loss /= grad_accum_steps + + frac = (min(step / args.muon_momentum_warmup_steps, 1.0) + if args.muon_momentum_warmup_steps > 0 else 1.0) + for group in optimizer_muon.param_groups: + group["momentum"] = ((1 - frac) * args.muon_momentum_warmup_start + + frac * args.muon_momentum) + for opt in optimizers: + for group in opt.param_groups: + group["lr"] = group.get("base_lr", group["lr"]) * scale + + if args.grad_clip_norm > 0: + torch.nn.utils.clip_grad_norm_( + base_model.parameters(), args.grad_clip_norm) + + optimizer_muon.launch_reduce_scatters() + if distributed: + for p in replicated_params: + if p.grad is not None: + dist.all_reduce(p.grad, op=dist.ReduceOp.AVG) + for opt in optimizers: + if opt is not optimizer_muon: + opt.step() + optimizer_muon.step() + zero_grad_all() + + with torch.no_grad(): + for n, t in base_model.state_dict().items(): + ema_state[n].mul_(ema_decay).add_( + t.detach().float(), alpha=1.0 - ema_decay) + + step += 1 + approx_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0) + + if args.swa_enabled and scale < 0.2 and step % args.swa_every == 0: + if swa_state is None: + swa_state = {n: t.detach().cpu().clone() + for n, t in base_model.state_dict().items()} + swa_count = 1 + else: + for n, t in base_model.state_dict().items(): + swa_state[n] += t.detach().cpu() + swa_count += 1 + + should_log = (args.train_log_every > 0 + and (step <= 10 or step % args.train_log_every == 0)) + if should_log: + log0(f"step:{step}/{args.iterations} " + f"train_loss:{train_loss.item():.4f} " + f"train_time:{approx_ms:.0f}ms " + f"step_avg:{approx_ms/step:.2f}ms") + + reached_cap = (max_wallclock_ms is not None + and approx_ms >= max_wallclock_ms) + if distributed and max_wallclock_ms is not None: + cap_t = torch.tensor(int(reached_cap), device=device) + dist.all_reduce(cap_t, op=dist.ReduceOp.MAX) + reached_cap = bool(cap_t.item()) + if stop_after_step is None and reached_cap: + stop_after_step = step + + # apply EMA + log0("ema:applying EMA weights") + sd = base_model.state_dict() + for n, t in sd.items(): + sd[n] = ema_state[n].to(dtype=t.dtype) + base_model.load_state_dict(sd, strict=True) + + log0(f"peak memory: {torch.cuda.max_memory_allocated()//1024//1024} MiB") + + # diagnostic + torch.cuda.synchronize() + t_diag = time.perf_counter() + diag_loss, 
diag_bpb = eval_val( + args, compiled_model, rank, world_size, device, grad_accum_steps, + val_tokens, base_bytes_lut, has_leading_space_lut, + is_boundary_token_lut) + torch.cuda.synchronize() + log0(f"DIAGNOSTIC post_ema val_loss:{diag_loss:.4f} " + f"val_bpb:{diag_bpb:.4f}") + + # export + eval_model = export_and_roundtrip( + base_model, args, log0, device, rank, world_size, + grad_accum_steps, val_tokens, base_bytes_lut, + has_leading_space_lut, is_boundary_token_lut) + + # TTT + if args.ttt_enabled: + torch.cuda.synchronize() + t_ttt = time.perf_counter() + ttt_loss, ttt_bpb = eval_val_sliding_ttt( + args, eval_model, rank, world_size, device, + val_tokens, base_bytes_lut, has_leading_space_lut, + is_boundary_token_lut, stride=args.eval_stride, log0=log0, + ttt_regime=cli.ttt_regime, + ttt_recurrent_lr_scale=cli.ttt_recurrent_lr_scale) + torch.cuda.synchronize() + log0(f"legal_ttt val_loss:{ttt_loss:.4f} val_bpb:{ttt_bpb:.4f} " + f"eval_time:{1000*(time.perf_counter()-t_ttt):.0f}ms") + + if distributed: + dist.destroy_process_group() + + +if __name__ == "__main__": + main() diff --git a/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_utils_recurrent.py b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_utils_recurrent.py new file mode 100644 index 0000000000..45fa72f3a9 --- /dev/null +++ b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_utils_recurrent.py @@ -0,0 +1,958 @@ +"""Training infrastructure for recurrent bestbase experiments. + +Contains: Hyperparameters, Muon optimizer, data loading, evaluation +helpers, tokenizer utilities, and quantization export. Mirrors the +current-best record utilities with additions for recurrence. +""" +from __future__ import annotations + +import argparse +import copy +import glob +import io +import lzma +import math +import os +import random +import subprocess +import sys +import time +import uuid +from pathlib import Path + +import numpy as np +import sentencepiece as spm +import torch +import torch.distributed as dist +import torch.nn.functional as F +from torch import Tensor, nn + +from model_recurrent_bestbase import ( + CastedLinear, RecurrentGPT, CONTROL_TENSOR_NAME_PATTERNS, + restore_low_dim_params_to_fp32, +) + +# --------------------------------------------------------------------------- +# Hyperparameters +# --------------------------------------------------------------------------- + +class Hyperparameters: + data_path = os.environ.get("DATA_PATH", "./data/datasets/fineweb10B_sp1024") + train_files = os.path.join(data_path, "fineweb_train_*.bin") + val_files = os.path.join(data_path, "fineweb_val_*.bin") + tokenizer_path = os.environ.get("TOKENIZER_PATH", + "./data/tokenizers/fineweb_1024_bpe.model") + run_id = os.environ.get("RUN_ID", str(uuid.uuid4())) + seed = int(os.environ.get("SEED", 1337)) + val_batch_size = int(os.environ.get("VAL_BATCH_SIZE", 524_288)) + val_loss_every = int(os.environ.get("VAL_LOSS_EVERY", 4000)) + train_log_every = int(os.environ.get("TRAIN_LOG_EVERY", 500)) + iterations = int(os.environ.get("ITERATIONS", 20000)) + warmdown_iters = int(os.environ.get("WARMDOWN_ITERS", 3500)) + warmup_steps = int(os.environ.get("WARMUP_STEPS", 20)) + train_batch_tokens = int(os.environ.get("TRAIN_BATCH_TOKENS", 786_432)) + train_seq_len = int(os.environ.get("TRAIN_SEQ_LEN", 2048)) + eval_seq_len = int(os.environ.get("EVAL_SEQ_LEN", 2048)) + max_wallclock_seconds = float(os.environ.get("MAX_WALLCLOCK_SECONDS", 600.0)) + qk_gain_init = 
float(os.environ.get("QK_GAIN_INIT", 1.5)) + vocab_size = int(os.environ.get("VOCAB_SIZE", 1024)) + model_dim = int(os.environ.get("MODEL_DIM", 512)) + num_heads = int(os.environ.get("NUM_HEADS", 8)) + num_kv_heads = int(os.environ.get("NUM_KV_HEADS", 4)) + mlp_mult = float(os.environ.get("MLP_MULT", 3.0)) + tie_embeddings = bool(int(os.environ.get("TIE_EMBEDDINGS", "1"))) + rope_base = float(os.environ.get("ROPE_BASE", 10000.0)) + logit_softcap = float(os.environ.get("LOGIT_SOFTCAP", 30.0)) + embed_lr = float(os.environ.get("EMBED_LR", 0.6)) + head_lr = float(os.environ.get("HEAD_LR", 0.008)) + tied_embed_lr = float(os.environ.get("TIED_EMBED_LR", 0.035)) + tied_embed_init_std = float(os.environ.get("TIED_EMBED_INIT_STD", 0.005)) + matrix_lr = float(os.environ.get("MATRIX_LR", 0.025)) + scalar_lr = float(os.environ.get("SCALAR_LR", 0.025)) + muon_momentum = float(os.environ.get("MUON_MOMENTUM", 0.99)) + muon_backend_steps = int(os.environ.get("MUON_BACKEND_STEPS", 5)) + muon_momentum_warmup_start = float(os.environ.get( + "MUON_MOMENTUM_WARMUP_START", 0.92)) + muon_momentum_warmup_steps = int(os.environ.get( + "MUON_MOMENTUM_WARMUP_STEPS", 1500)) + beta1 = float(os.environ.get("BETA1", 0.9)) + beta2 = float(os.environ.get("BETA2", 0.95)) + adam_eps = float(os.environ.get("ADAM_EPS", 1e-8)) + grad_clip_norm = float(os.environ.get("GRAD_CLIP_NORM", 0.3)) + eval_stride = int(os.environ.get("EVAL_STRIDE", 64)) + swa_enabled = bool(int(os.environ.get("SWA_ENABLED", "1"))) + swa_every = int(os.environ.get("SWA_EVERY", 50)) + muon_wd = float(os.environ.get("MUON_WD", 0.04)) + adam_wd = float(os.environ.get("ADAM_WD", 0.04)) + bigram_vocab_size = int(os.environ.get("BIGRAM_VOCAB_SIZE", 1536)) + bigram_dim = int(os.environ.get("BIGRAM_DIM", 128)) + xsa_last_n = int(os.environ.get("XSA_LAST_N", 4)) + rope_dims = int(os.environ.get("ROPE_DIMS", 16)) + ln_scale = bool(int(os.environ.get("LN_SCALE", "1"))) + late_qat_threshold = float(os.environ.get("LATE_QAT_THRESHOLD", 0.15)) + ve_enabled = bool(int(os.environ.get("VE_ENABLED", "1"))) + ve_dim = int(os.environ.get("VE_DIM", 128)) + ve_layers = os.environ.get("VE_LAYERS", "6,7") + gated_attention = bool(int(os.environ.get("GATED_ATTENTION", "0"))) + value_residual = bool(int(os.environ.get("VALUE_RESIDUAL", "0"))) + ttt_enabled = bool(int(os.environ.get("TTT_ENABLED", "0"))) + ttt_lr = float(os.environ.get("TTT_LR", 0.002)) + ttt_epochs = int(os.environ.get("TTT_EPOCHS", 3)) + ttt_chunk_tokens = int(os.environ.get("TTT_CHUNK_TOKENS", 32768)) + ttt_freeze_blocks = int(os.environ.get("TTT_FREEZE_BLOCKS", 2)) + ttt_momentum = float(os.environ.get("TTT_MOMENTUM", 0.9)) + ttt_batch_seqs = int(os.environ.get("TTT_BATCH_SEQS", 32)) + ttt_grad_clip = float(os.environ.get("TTT_GRAD_CLIP", 1.0)) + # recurrence-specific + num_stem_layers = int(os.environ.get("NUM_STEM_LAYERS", 3)) + num_core_layers = int(os.environ.get("NUM_CORE_LAYERS", 2)) + num_tail_layers = int(os.environ.get("NUM_TAIL_LAYERS", 3)) + num_passes = int(os.environ.get("NUM_PASSES", 3)) + core_quant_bits = int(os.environ.get("CORE_QUANT_BITS", 6)) + core_quant_enabled = bool(int(os.environ.get("CORE_QUANT_ENABLED", "1"))) + + +# --------------------------------------------------------------------------- +# CLI argument parser (overlaid on top of env-based Hyperparameters) +# --------------------------------------------------------------------------- + +def add_common_args(parser: argparse.ArgumentParser) -> None: + g = parser.add_argument_group("recurrence") + 
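For orientation, the recurrence defaults above (`NUM_STEM_LAYERS=3`, `NUM_CORE_LAYERS=2`, `NUM_TAIL_LAYERS=3`, `NUM_PASSES=3`) trade evaluated depth against unique parameters as follows; a minimal sketch, not part of the training code:

```python
# Depth/parameter arithmetic for the stem/core/tail split (env defaults).
stem, core, tail, passes = 3, 2, 3, 3

unique_layers = stem + core + tail               # 8 layers hold parameters
evaluated_layers = stem + core * passes + tail   # 12 layers run per forward

print(f"unique={unique_layers} evaluated={evaluated_layers} "
      f"reuse={evaluated_layers / unique_layers:.2f}x")
# unique=8 evaluated=12 reuse=1.50x
```

The non-recurrent baseline keeps one parameter set per evaluated layer; here the core's weights are exported once but executed `num_passes` times, which is why core quantization error needs the extra care this record is built around.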
g.add_argument("--recurrent-layer-range", type=str, default=None, + help="start,end (e.g. 3,8)") + g.add_argument("--shared-core-layers", type=int, default=None) + g.add_argument("--num-passes", type=int, default=None) + g.add_argument("--quant-bits", type=int, default=None) + g.add_argument("--quant-mode", type=str, default="per_row", + choices=["per_row", "per_tensor"]) + + g = parser.add_argument_group("feedback") + g.add_argument("--feedback-rank", type=int, default=2) + g.add_argument("--feedback-mode", type=str, default="diagonal", + choices=["identity", "diagonal", "low_rank"]) + g.add_argument("--per-pass-feedback", action="store_true") + g.add_argument("--affine-junction", action="store_true") + + g = parser.add_argument_group("stability") + g.add_argument("--clip-hidden", action="store_true") + g.add_argument("--clip-value", type=float, default=10.0) + g.add_argument("--residual-scale-init", type=float, default=1.0) + g.add_argument("--jacobian-proxy-weight", type=float, default=0.0) + + g = parser.add_argument_group("ttt") + g.add_argument("--ttt-regime", type=str, default="tail_only", + choices=["tail_only", "tail_plus_stem", + "all_unique_layers", "all_layers", + "all_layers_with_recurrent_lr_scale"]) + g.add_argument("--ttt-recurrent-lr-scale", type=float, default=0.1) + + g = parser.add_argument_group("precision") + g.add_argument("--leave-embeddings-fp16", action="store_true") + g.add_argument("--leave-head-fp16", action="store_true") + + +def apply_cli_overrides(args: Hyperparameters, + cli: argparse.Namespace) -> Hyperparameters: + if cli.recurrent_layer_range is not None: + s, e = cli.recurrent_layer_range.split(",") + args.num_stem_layers = int(s) + total = args.num_stem_layers + args.num_core_layers + args.num_tail_layers + args.num_tail_layers = total - int(e) + if cli.shared_core_layers is not None: + args.num_core_layers = cli.shared_core_layers + if cli.num_passes is not None: + args.num_passes = cli.num_passes + if cli.quant_bits is not None: + args.core_quant_bits = cli.quant_bits + return args + + +# --------------------------------------------------------------------------- +# Newton-Schulz + Muon +# --------------------------------------------------------------------------- + +def zeropower_via_newtonschulz5(G: Tensor, steps: int = 5, + eps: float = 1e-7) -> Tensor: + a, b, c = (3.4445, -4.7750, 2.0315) + was_2d = G.ndim == 2 + if was_2d: + G = G.unsqueeze(0) + X = G.bfloat16() + transposed = X.size(-2) > X.size(-1) + if transposed: + X = X.mT + X = X / (X.norm(dim=(-2, -1), keepdim=True) + eps) + for _ in range(steps): + A = X @ X.mT + B = b * A + c * (A @ A) + X = a * X + B @ X + if transposed: + X = X.mT + if was_2d: + X = X.squeeze(0) + return X + + +class Muon(torch.optim.Optimizer): + """Parallel Muon: reduce-scatter → local NS5 → all-gather.""" + + def __init__(self, params, lr: float, momentum: float, + backend_steps: int, nesterov: bool = True, + weight_decay: float = 0.0): + super().__init__( + params, + dict(lr=lr, momentum=momentum, backend_steps=backend_steps, + nesterov=nesterov, weight_decay=weight_decay)) + self._built = False + + def _build(self): + self._distributed = dist.is_available() and dist.is_initialized() + self._world_size = dist.get_world_size() if self._distributed else 1 + self._rank = dist.get_rank() if self._distributed else 0 + ws = self._world_size + self._bank_meta = [] + for group in self.param_groups: + for p in group["params"]: + B = p.shape[0] + padded_B = ((B + ws - 1) // ws) * ws + shard_B = padded_B // ws + tail = 
p.shape[1:]
+                dev = p.device
+                self._bank_meta.append({
+                    "p": p, "B": B,
+                    "padded_grad": torch.zeros(padded_B, *tail,
+                                               device=dev, dtype=torch.bfloat16),
+                    "shard": torch.zeros(shard_B, *tail,
+                                         device=dev, dtype=torch.bfloat16),
+                    "shard_mom": torch.zeros(shard_B, *tail,
+                                             device=dev, dtype=torch.bfloat16),
+                    "full_update": torch.zeros(padded_B, *tail,
+                                               device=dev, dtype=torch.bfloat16),
+                    "scale": max(1, p.shape[-2] / p.shape[-1]) ** 0.5,
+                })
+        self._bank_meta.sort(key=lambda m: -m["p"].numel())
+        self._built = True
+
+    def launch_reduce_scatters(self):
+        if not self._built:
+            self._build()
+        if not self._distributed:
+            return
+        self._rs_futures = []
+        for m in self._bank_meta:
+            p = m["p"]
+            if p.grad is None:
+                self._rs_futures.append(None)
+                continue
+            pg = m["padded_grad"]
+            pg[:m["B"]].copy_(p.grad.bfloat16())
+            if pg.shape[0] > m["B"]:
+                pg[m["B"]:].zero_()
+            fut = dist.reduce_scatter_tensor(
+                m["shard"], pg, op=dist.ReduceOp.AVG, async_op=True)
+            self._rs_futures.append(fut)
+
+    @torch.no_grad()
+    def step(self, closure=None):
+        loss = None
+        if closure is not None:
+            with torch.enable_grad():
+                loss = closure()
+        if not self._built:
+            self._build()
+        for group in self.param_groups:
+            lr = group["lr"]
+            momentum = group["momentum"]
+            backend_steps = group["backend_steps"]
+            nesterov = group["nesterov"]
+            wd = group.get("weight_decay", 0.0)
+            prev_ag_handle = None
+            prev_m = None
+            sharded = self._distributed and hasattr(self, "_rs_futures")
+            for i, m in enumerate(self._bank_meta):
+                p = m["p"]
+                if p.grad is None:
+                    continue
+                if prev_ag_handle is not None:
+                    prev_ag_handle.wait()
+                    pp = prev_m["p"]
+                    upd = prev_m["full_update"][:prev_m["B"]]
+                    if wd > 0.0:
+                        pp.data.mul_(1.0 - lr * wd)
+                    pp.add_(upd.to(dtype=pp.dtype), alpha=-lr * prev_m["scale"])
+                if sharded and self._rs_futures[i] is not None:
+                    self._rs_futures[i].wait()
+                    g = m["shard"]
+                    buf = m["shard_mom"]
+                else:
+                    g = p.grad.bfloat16()
+                    state = self.state[p]
+                    if "momentum_buffer" not in state:
+                        state["momentum_buffer"] = torch.zeros_like(g)
+                    buf = state["momentum_buffer"]
+                buf.mul_(momentum).add_(g)
+                update = g.add(buf, alpha=momentum) if nesterov else buf
+                update = zeropower_via_newtonschulz5(
+                    update, steps=backend_steps)
+                if sharded:
+                    prev_ag_handle = dist.all_gather_into_tensor(
+                        m["full_update"], update, async_op=True)
+                    prev_m = m
+                else:
+                    if wd > 0.0:
+                        p.data.mul_(1.0 - lr * wd)
+                    p.add_(update.to(dtype=p.dtype), alpha=-lr * m["scale"])
+            if prev_ag_handle is not None:
+                prev_ag_handle.wait()
+                pp = prev_m["p"]
+                upd = prev_m["full_update"][:prev_m["B"]]
+                if wd > 0.0:
+                    pp.data.mul_(1.0 - lr * wd)
+                pp.add_(upd.to(dtype=pp.dtype), alpha=-lr * prev_m["scale"])
+        if hasattr(self, "_rs_futures"):
+            del self._rs_futures
+        return loss
+
+
+# ---------------------------------------------------------------------------
+# Data loading
+# ---------------------------------------------------------------------------
+
+def load_data_shard(file: Path) -> Tensor:
+    # Shard layout: 256 little-endian int32 header words, then uint16 ids;
+    # header[2] holds the token count.
+    header_bytes = 256 * np.dtype("<i4").itemsize
+    header = np.fromfile(file, dtype="<i4", count=256)
+    num_tokens = int(header[2])
+    tokens = np.fromfile(file, dtype="<u2", count=num_tokens,
+                         offset=header_bytes)
+    return torch.from_numpy(tokens.astype(np.int32))
+
+
+class TokenStream:
+    def __init__(self, pattern: str):
+        self.files = [Path(p) for p in sorted(glob.glob(pattern))]
+        if not self.files:
+            raise FileNotFoundError(f"No files for {pattern}")
+        self.file_idx = 0
+        self.tokens = load_data_shard(self.files[0])
+        self.pos = 0
+
+    def _advance_file(self) -> None:
+        self.file_idx = (self.file_idx + 1) % len(self.files)
+        self.tokens = load_data_shard(self.files[self.file_idx])
+        self.pos = 0
+
+    def take(self, n: int) -> Tensor:
+        chunks: list[Tensor] = []
+        remaining = n
+        while remaining > 0:
+            avail = self.tokens.numel() - self.pos
+            if avail <= 0:
+                self._advance_file()
+                continue
+            k = min(remaining, avail)
+            chunks.append(self.tokens[self.pos:self.pos + k])
+            self.pos += k
+            remaining -= k
+        return chunks[0] if len(chunks) == 1 else torch.cat(chunks)
+
+
+class 
DistributedTokenLoader: + def __init__(self, pattern: str, rank: int, world_size: int, + device: torch.device): + self.rank = rank + self.world_size = world_size + self.device = device + self.stream = TokenStream(pattern) + + def next_batch(self, global_tokens: int, seq_len: int, + grad_accum_steps: int) -> tuple[Tensor, Tensor]: + local_tokens = global_tokens // (self.world_size * grad_accum_steps) + per_rank_span = local_tokens + 1 + chunk = self.stream.take(per_rank_span * self.world_size) + start = self.rank * per_rank_span + local = chunk[start:start + per_rank_span].to(dtype=torch.int64) + x = local[:-1].reshape(-1, seq_len) + y = local[1:].reshape(-1, seq_len) + return (x.to(self.device, non_blocking=True), + y.to(self.device, non_blocking=True)) + + +# --------------------------------------------------------------------------- +# Tokenizer helpers +# --------------------------------------------------------------------------- + +def build_sentencepiece_luts( + sp: spm.SentencePieceProcessor, vocab_size: int, device: torch.device, +) -> tuple[Tensor, Tensor, Tensor]: + sp_vocab_size = int(sp.vocab_size()) + table_size = max(sp_vocab_size, vocab_size) + base_bytes_np = np.zeros((table_size,), dtype=np.int16) + has_leading_space_np = np.zeros((table_size,), dtype=np.bool_) + is_boundary_token_np = np.ones((table_size,), dtype=np.bool_) + for token_id in range(sp_vocab_size): + if (sp.is_control(token_id) or sp.is_unknown(token_id) + or sp.is_unused(token_id)): + continue + is_boundary_token_np[token_id] = False + if sp.is_byte(token_id): + base_bytes_np[token_id] = 1 + continue + piece = sp.id_to_piece(token_id) + if piece.startswith("\u2581"): + has_leading_space_np[token_id] = True + piece = piece[1:] + base_bytes_np[token_id] = len(piece.encode("utf-8")) + return ( + torch.tensor(base_bytes_np, dtype=torch.int16, device=device), + torch.tensor(has_leading_space_np, dtype=torch.bool, device=device), + torch.tensor(is_boundary_token_np, dtype=torch.bool, device=device), + ) + + +def load_validation_tokens(pattern: str, seq_len: int) -> Tensor: + files = [Path(p) for p in sorted(glob.glob(pattern))] + if not files: + raise FileNotFoundError(f"No files for {pattern}") + tokens = torch.cat([load_data_shard(f) for f in files]).contiguous() + usable = ((tokens.numel() - 1) // seq_len) * seq_len + if usable <= 0: + raise ValueError(f"Validation split too short for seq_len={seq_len}") + return tokens[:usable + 1] + + +# --------------------------------------------------------------------------- +# Evaluation +# --------------------------------------------------------------------------- + +def eval_val( + args: Hyperparameters, + model: nn.Module, + rank: int, world_size: int, device: torch.device, + grad_accum_steps: int, + val_tokens: Tensor, + base_bytes_lut: Tensor, has_leading_space_lut: Tensor, + is_boundary_token_lut: Tensor, + eval_seq_len: int | None = None, +) -> tuple[float, float]: + seq_len = eval_seq_len or args.train_seq_len + local_batch_tokens = args.val_batch_size // (world_size * grad_accum_steps) + local_batch_seqs = local_batch_tokens // seq_len + total_seqs = (val_tokens.numel() - 1) // seq_len + seq_start = (total_seqs * rank) // world_size + seq_end = (total_seqs * (rank + 1)) // world_size + val_loss_sum = torch.zeros((), device=device, dtype=torch.float64) + val_token_count = torch.zeros((), device=device, dtype=torch.float64) + val_byte_count = torch.zeros((), device=device, dtype=torch.float64) + model.eval() + with torch.inference_mode(): + for batch_seq_start in 
range(seq_start, seq_end, local_batch_seqs): + batch_seq_end = min(batch_seq_start + local_batch_seqs, seq_end) + raw_start = batch_seq_start * seq_len + raw_end = batch_seq_end * seq_len + 1 + local = val_tokens[raw_start:raw_end].to( + device=device, dtype=torch.int64, non_blocking=True) + x = local[:-1].reshape(-1, seq_len) + y = local[1:].reshape(-1, seq_len) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + batch_loss = model(x, y).detach() + btc = float(y.numel()) + val_loss_sum += batch_loss.to(torch.float64) * btc + val_token_count += btc + prev_ids, tgt_ids = x.reshape(-1), y.reshape(-1) + tb = base_bytes_lut[tgt_ids].to(dtype=torch.int16) + tb += (has_leading_space_lut[tgt_ids] + & ~is_boundary_token_lut[prev_ids]).to(dtype=torch.int16) + val_byte_count += tb.to(torch.float64).sum() + if dist.is_available() and dist.is_initialized(): + for t in (val_loss_sum, val_token_count, val_byte_count): + dist.all_reduce(t, op=dist.ReduceOp.SUM) + val_loss = val_loss_sum / val_token_count + bpt = val_loss.item() / math.log(2.0) + tpb = val_token_count.item() / val_byte_count.item() + model.train() + return float(val_loss.item()), float(bpt * tpb) + + +def eval_val_sliding( + args: Hyperparameters, + base_model: nn.Module, + rank: int, world_size: int, device: torch.device, + val_tokens: Tensor, + base_bytes_lut: Tensor, has_leading_space_lut: Tensor, + is_boundary_token_lut: Tensor, + stride: int, batch_seqs: int = 32, + eval_seq_len: int | None = None, +) -> tuple[float, float]: + seq_len = eval_seq_len or args.train_seq_len + total_tokens = val_tokens.numel() - 1 + window_starts = [ws for ws in range(0, total_tokens, stride) + if min(ws + seq_len, total_tokens) - ws >= 1] + total_windows = len(window_starts) + my_s = (total_windows * rank) // world_size + my_e = (total_windows * (rank + 1)) // world_size + my_windows = window_starts[my_s:my_e] + loss_sum = torch.zeros((), device=device, dtype=torch.float64) + token_count = torch.zeros((), device=device, dtype=torch.float64) + byte_count = torch.zeros((), device=device, dtype=torch.float64) + base_model.eval() + compiled_logits = torch.compile(base_model.forward_logits, + dynamic=False, fullgraph=True) + with torch.inference_mode(): + for bi in range(0, len(my_windows), batch_seqs): + batch_ws = my_windows[bi:bi + batch_seqs] + bsz = len(batch_ws) + x_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, + device=device) + y_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, + device=device) + wlens: list[int] = [] + for i, ws in enumerate(batch_ws): + end = min(ws + seq_len, total_tokens) + wlen = end - ws + wlens.append(wlen) + chunk = val_tokens[ws:end + 1].to(dtype=torch.int64, + device=device) + x_batch[i, :wlen] = chunk[:-1] + y_batch[i, :wlen] = chunk[1:] + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + logits = compiled_logits(x_batch) + nll = F.cross_entropy( + logits.reshape(-1, logits.size(-1)).float(), + y_batch.reshape(-1), reduction="none", + ).reshape(bsz, seq_len) + for i, ws in enumerate(batch_ws): + wlen = wlens[i] + s = 0 if ws == 0 else max(wlen - stride, 0) + scored_nll = nll[i, s:wlen].to(torch.float64) + loss_sum += scored_nll.sum() + token_count += float(wlen - s) + tgt = y_batch[i, s:wlen] + prev = x_batch[i, s:wlen] + tb = base_bytes_lut[tgt].to(torch.float64) + tb += (has_leading_space_lut[tgt] + & ~is_boundary_token_lut[prev]).to(torch.float64) + byte_count += tb.sum() + if dist.is_available() and dist.is_initialized(): + for t in (loss_sum, token_count, byte_count): + 
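`eval_val` reports bits-per-byte by converting the mean token NLL from nats to bits and multiplying by the tokens-per-byte ratio; the byte LUTs add one byte whenever a leading-space token follows a non-boundary token. A worked example of the conversion, with made-up totals:

```python
import math

# Hypothetical validation totals (illustrative numbers only):
loss_nats = 0.80           # mean next-token cross-entropy, in nats
token_count = 1_000_000.0  # scored tokens
byte_count = 4_200_000.0   # UTF-8 bytes those tokens decode to

bits_per_token = loss_nats / math.log(2.0)   # ~1.1542 bits
tokens_per_byte = token_count / byte_count   # ~0.238
bpb = bits_per_token * tokens_per_byte
print(f"bpb={bpb:.4f}")                      # bpb=0.2748
```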
dist.all_reduce(t, op=dist.ReduceOp.SUM) + val_loss = (loss_sum / token_count).item() + bpt = val_loss / math.log(2.0) + tpb = token_count.item() / byte_count.item() + base_model.train() + return val_loss, bpt * tpb + + +# --------------------------------------------------------------------------- +# Quantization export (int6 + lzma) +# --------------------------------------------------------------------------- + +INT8_KEEP_FLOAT_FP32_NAME_PATTERNS = CONTROL_TENSOR_NAME_PATTERNS +INT8_KEEP_FLOAT_MAX_NUMEL = 65_536 +INT8_KEEP_FLOAT_STORE_DTYPE = torch.float16 +INT8_PER_ROW_SCALE_DTYPE = torch.float16 +INT8_CLIP_PERCENTILE = 99.99984 +INT8_CLIP_Q = INT8_CLIP_PERCENTILE / 100.0 + + +def quantize_float_tensor(t: Tensor) -> tuple[Tensor, Tensor]: + t32 = t.float() + if t32.ndim == 2: + clip_abs = (torch.quantile(t32.abs(), INT8_CLIP_Q, dim=1) + if t32.numel() else + torch.empty((t32.shape[0],), dtype=torch.float32)) + clipped = torch.clamp(t32, -clip_abs[:, None], clip_abs[:, None]) + scale = (clip_abs / 127.0).clamp_min(1.0 / 127.0) + q = torch.clamp(torch.round(clipped / scale[:, None]), + -127, 127).to(torch.int8).contiguous() + return q, scale.to(dtype=INT8_PER_ROW_SCALE_DTYPE).contiguous() + clip_abs = (float(torch.quantile(t32.abs().flatten(), + INT8_CLIP_Q).item()) if t32.numel() else 0.0) + scale = torch.tensor(clip_abs / 127.0 if clip_abs > 0 else 1.0, + dtype=torch.float32) + q = torch.clamp( + torch.round(torch.clamp(t32, -clip_abs, clip_abs) / scale), + -127, 127).to(torch.int8).contiguous() + return q, scale + + +def quantize_int6_per_row(t: Tensor, clip_range: int = 31 + ) -> tuple[Tensor, Tensor]: + t32 = t.float() + if t32.ndim == 2: + best_q, best_s, best_err = None, None, float("inf") + for pct in [0.9990, 0.9995, 0.9999, 0.99999, 1.0]: + row_clip = (torch.quantile(t32.abs(), pct, dim=1) if pct < 1.0 + else t32.abs().amax(dim=1)) + s = (row_clip / clip_range).clamp_min( + 1.0 / clip_range).to(torch.float16) + q = torch.clamp( + torch.round(t32 / s.float()[:, None]), + -clip_range, clip_range).to(torch.int8) + recon = q.float() * s.float()[:, None] + err = (t32 - recon).pow(2).mean().item() + if err < best_err: + best_q, best_s, best_err = q, s, err + return best_q, best_s + amax = t32.abs().max().item() + scale = torch.tensor(amax / clip_range if amax > 0 else 1.0, + dtype=torch.float16) + q = torch.clamp(torch.round(t32 / scale.float()), + -clip_range, clip_range).to(torch.int8) + return q, scale + + +def _classify_param(name: str) -> str: + if "tok_emb" in name or "lm_head" in name: + return "embed" + if ".mlp." in name: + return "mlp" + if ".attn." 
in name: + return "attn" + return "other" + + +def _unbank_state_dict(sd: dict[str, Tensor], + num_unique: int) -> dict[str, Tensor]: + out: dict[str, Tensor] = {} + n = num_unique + for name, tensor in sd.items(): + if name == "qo_bank": + for i in range(n): + out[f"layer.{i}.attn.c_q.weight"] = tensor[i] + out[f"layer.{i}.attn.proj.weight"] = tensor[n + i] + elif name == "kv_bank": + for i in range(n): + out[f"layer.{i}.attn.c_k.weight"] = tensor[i] + out[f"layer.{i}.attn.c_v.weight"] = tensor[n + i] + elif name == "mlp_up_bank": + for i in range(n): + out[f"layer.{i}.mlp.fc.weight"] = tensor[i] + elif name == "mlp_down_bank": + for i in range(n): + out[f"layer.{i}.mlp.proj.weight"] = tensor[i] + else: + out[name] = tensor + return out + + +def _rebank_state_dict(sd: dict[str, Tensor], num_unique: int, + template_sd: dict[str, Tensor]) -> dict[str, Tensor]: + out: dict[str, Tensor] = {} + n = num_unique + qo = [None] * (2 * n) + kv = [None] * (2 * n) + up = [None] * n + down = [None] * n + consumed: set[str] = set() + for i in range(n): + for key, arr, idx in [ + (f"layer.{i}.attn.c_q.weight", qo, i), + (f"layer.{i}.attn.proj.weight", qo, n + i), + (f"layer.{i}.attn.c_k.weight", kv, i), + (f"layer.{i}.attn.c_v.weight", kv, n + i), + (f"layer.{i}.mlp.fc.weight", up, i), + (f"layer.{i}.mlp.proj.weight", down, i), + ]: + if key in sd: + arr[idx] = sd[key] + consumed.add(key) + out["qo_bank"] = torch.stack(qo).to(dtype=template_sd["qo_bank"].dtype) + out["kv_bank"] = torch.stack(kv).to(dtype=template_sd["kv_bank"].dtype) + out["mlp_up_bank"] = torch.stack(up).to( + dtype=template_sd["mlp_up_bank"].dtype) + out["mlp_down_bank"] = torch.stack(down).to( + dtype=template_sd["mlp_down_bank"].dtype) + for name, tensor in sd.items(): + if name not in consumed: + out[name] = tensor + return out + + +def mixed_quantize_int6(state_dict: dict[str, Tensor], + int6_cats: set[str]): + result: dict[str, Tensor] = {} + meta: dict[str, object] = {} + for name, tensor in state_dict.items(): + t = tensor.detach().cpu().contiguous() + cat = _classify_param(name) + if not t.is_floating_point() or t.numel() <= 65536: + result[name] = (t.to(torch.float16) + if t.is_floating_point() else t) + meta[name] = "passthrough" + continue + if any(p in name for p in CONTROL_TENSOR_NAME_PATTERNS): + result[name] = t.float() + meta[name] = "passthrough_ctrl" + continue + if cat in int6_cats and t.ndim >= 1: + q, s = quantize_int6_per_row(t) + result[name + ".q"] = q + result[name + ".scale"] = s + meta[name] = {"type": "int6"} + else: + q, s = quantize_float_tensor(t) + result[name + ".q"] = q + result[name + ".scale"] = s + meta[name] = {"type": "int8"} + return result, meta + + +def dequantize_mixed_int6(result: dict[str, Tensor], + meta: dict[str, object], + template_sd: dict[str, Tensor]) -> dict[str, Tensor]: + out: dict[str, Tensor] = {} + for name, orig in template_sd.items(): + info = meta.get(name) + if info is None: + continue + od = orig.dtype + if info in ("passthrough", "passthrough_ctrl", "passthrough_fp16"): + t = result[name] + if t.dtype == torch.float16 and od in (torch.float32, torch.bfloat16): + t = t.to(od) + out[name] = t + continue + q, s = result[name + ".q"], result[name + ".scale"] + if s.ndim > 0: + out[name] = (q.float() + * s.float().view(q.shape[0], *([1] * (q.ndim - 1))) + ).to(od) + else: + out[name] = (q.float() * float(s.item())).to(od) + return out + + +def export_and_roundtrip( + base_model: RecurrentGPT, + args: Hyperparameters, + log0, + device: torch.device, + rank: int, + world_size: 
int, + grad_accum_steps: int, + val_tokens: Tensor, + base_bytes_lut: Tensor, + has_leading_space_lut: Tensor, + is_boundary_token_lut: Tensor, + feedback_module=None, + stabilizer=None, +): + """Quantize, export, reload, and evaluate roundtrip quality.""" + master_process = rank == 0 + sd_cpu = {k: v.detach().cpu() for k, v in base_model.state_dict().items()} + # remove feedback / stability from export state_dict + export_sd = {k: v for k, v in sd_cpu.items() + if not k.startswith("_feedback") and not k.startswith("_stab")} + + unbanked = _unbank_state_dict(export_sd, base_model.num_unique) + quant_result, quant_meta = mixed_quantize_int6(unbanked, {"mlp", "attn"}) + + quant_buf = io.BytesIO() + torch.save({"w": quant_result, "m": quant_meta}, quant_buf) + quant_blob = lzma.compress(quant_buf.getvalue(), preset=6) + + if master_process: + with open("final_model.int6.ptz", "wb") as f: + f.write(quant_blob) + code_bytes = len(Path(__file__).read_text("utf-8").encode("utf-8")) + log0(f"Serialized model int6+lzma: {len(quant_blob)} bytes") + log0(f"Total submission size int6+lzma: {len(quant_blob) + code_bytes} bytes") + + if dist.is_available() and dist.is_initialized(): + dist.barrier() + + with open("final_model.int6.ptz", "rb") as f: + quant_blob_disk = f.read() + quant_state = torch.load( + io.BytesIO(lzma.decompress(quant_blob_disk)), map_location="cpu") + deq_unbanked = dequantize_mixed_int6( + quant_state["w"], quant_state["m"], unbanked) + deq_sd = _rebank_state_dict(deq_unbanked, base_model.num_unique, export_sd) + + eval_model = RecurrentGPT( + vocab_size=args.vocab_size, model_dim=args.model_dim, + num_heads=args.num_heads, num_kv_heads=args.num_kv_heads, + mlp_mult=args.mlp_mult, tie_embeddings=args.tie_embeddings, + tied_embed_init_std=args.tied_embed_init_std, + logit_softcap=args.logit_softcap, rope_base=args.rope_base, + qk_gain_init=args.qk_gain_init, + bigram_vocab_size=args.bigram_vocab_size, + bigram_dim=args.bigram_dim, rope_dims=args.rope_dims, + ln_scale=args.ln_scale, + ve_enabled=args.ve_enabled, ve_dim=args.ve_dim, + ve_layers=args.ve_layers, xsa_last_n=args.xsa_last_n, + gated_attention=args.gated_attention, + value_residual=args.value_residual, + num_stem_layers=args.num_stem_layers, + num_core_layers=args.num_core_layers, + num_tail_layers=args.num_tail_layers, + num_passes=args.num_passes, + core_quant_bits=args.core_quant_bits, + core_quant_enabled=False, + ).to(device).bfloat16() + eval_model.qo_bank.data = eval_model.qo_bank.data.float() + eval_model.kv_bank.data = eval_model.kv_bank.data.float() + eval_model.mlp_up_bank.data = eval_model.mlp_up_bank.data.float() + eval_model.mlp_down_bank.data = eval_model.mlp_down_bank.data.float() + for m in eval_model.modules(): + if isinstance(m, CastedLinear): + m.float() + restore_low_dim_params_to_fp32(eval_model) + eval_model.load_state_dict(deq_sd, strict=False) + + effective_eval_seq_len = (args.eval_seq_len + if args.eval_seq_len > 0 else args.train_seq_len) + + torch.cuda.synchronize() + t_qeval = time.perf_counter() + q_loss, q_bpb = eval_val( + args, eval_model, rank, world_size, device, grad_accum_steps, + val_tokens, base_bytes_lut, has_leading_space_lut, + is_boundary_token_lut, eval_seq_len=effective_eval_seq_len) + torch.cuda.synchronize() + log0(f"final_int6_roundtrip val_loss:{q_loss:.4f} val_bpb:{q_bpb:.4f} " + f"eval_time:{1000*(time.perf_counter()-t_qeval):.0f}ms") + + if args.eval_stride > 0 and args.eval_stride < effective_eval_seq_len: + torch.cuda.synchronize() + t_slide = time.perf_counter() + 
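As a sanity check on this export path, the per-row int6 scheme from `quantize_int6_per_row` can be exercised in isolation. A self-contained sketch of its `pct=1.0` (pure `amax`) branch — the production code additionally searches clip percentiles per row — with illustrative numbers:

```python
import torch

torch.manual_seed(0)
W = torch.randn(512, 1536)               # dummy unit-variance weight matrix

clip = 31                                # symmetric int6 range [-31, 31]
row_max = W.abs().amax(dim=1)
scale = (row_max / clip).clamp_min(1.0 / clip).to(torch.float16)
q = torch.clamp(torch.round(W / scale.float()[:, None]),
                -clip, clip).to(torch.int8)

W_hat = q.float() * scale.float()[:, None]
rel_err = ((W - W_hat).norm() / W.norm()).item()
print(f"relative round-trip error: {rel_err:.3f}")  # ~0.03-0.04 here
```

Note the `clamp_min(1.0 / clip)` floor mirrors the production code: rows whose max magnitude falls below 1/31 are quantized with the floored scale rather than a vanishing one.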
sw_loss, sw_bpb = eval_val_sliding( + args, eval_model, rank, world_size, device, + val_tokens, base_bytes_lut, has_leading_space_lut, + is_boundary_token_lut, stride=args.eval_stride, + eval_seq_len=effective_eval_seq_len) + torch.cuda.synchronize() + log0(f"final_int6_sliding val_loss:{sw_loss:.4f} " + f"val_bpb:{sw_bpb:.4f} stride:{args.eval_stride} " + f"eval_time:{1000*(time.perf_counter()-t_slide):.0f}ms") + + return eval_model + + +# --------------------------------------------------------------------------- +# Model builder +# --------------------------------------------------------------------------- + +def build_model(args: Hyperparameters, + device: torch.device) -> RecurrentGPT: + model = RecurrentGPT( + vocab_size=args.vocab_size, + model_dim=args.model_dim, + num_heads=args.num_heads, + num_kv_heads=args.num_kv_heads, + mlp_mult=args.mlp_mult, + tie_embeddings=args.tie_embeddings, + tied_embed_init_std=args.tied_embed_init_std, + logit_softcap=args.logit_softcap, + rope_base=args.rope_base, + qk_gain_init=args.qk_gain_init, + bigram_vocab_size=args.bigram_vocab_size, + bigram_dim=args.bigram_dim, + rope_dims=args.rope_dims, + ln_scale=args.ln_scale, + ve_enabled=args.ve_enabled, + ve_dim=args.ve_dim, + ve_layers=args.ve_layers, + xsa_last_n=args.xsa_last_n, + gated_attention=args.gated_attention, + value_residual=args.value_residual, + num_stem_layers=args.num_stem_layers, + num_core_layers=args.num_core_layers, + num_tail_layers=args.num_tail_layers, + num_passes=args.num_passes, + core_quant_bits=args.core_quant_bits, + core_quant_enabled=args.core_quant_enabled, + ).to(device).bfloat16() + model.qo_bank.data = model.qo_bank.data.float() + model.kv_bank.data = model.kv_bank.data.float() + model.mlp_up_bank.data = model.mlp_up_bank.data.float() + model.mlp_down_bank.data = model.mlp_down_bank.data.float() + for m in model.modules(): + if isinstance(m, CastedLinear): + m.float() + restore_low_dim_params_to_fp32(model) + return model + + +def build_optimizers( + base_model: RecurrentGPT, + args: Hyperparameters, + extra_scalar_params: list[nn.Parameter] | None = None, +) -> tuple[list[torch.optim.Optimizer], list[nn.Parameter]]: + """Return (list_of_optimizers, replicated_params).""" + matrix_params = [ + base_model.qo_bank, base_model.kv_bank, + base_model.mlp_up_bank, base_model.mlp_down_bank, + ] + all_blocks = base_model.all_blocks() + block_named_params = [] + for blk in all_blocks: + block_named_params.extend(blk.named_parameters()) + + scalar_params = [ + p for name, p in block_named_params + if p.ndim < 2 or any(pat in name for pat in CONTROL_TENSOR_NAME_PATTERNS) + ] + if base_model.skip_weights.numel() > 0: + scalar_params.append(base_model.skip_weights) + scalar_params.append(base_model.smear.gate) + if base_model.bigram is not None: + scalar_params.append(base_model.bigram.scale) + + token_lr = (args.tied_embed_lr if args.tie_embeddings else args.embed_lr) + tok_groups = [{"params": [base_model.tok_emb.weight], + "lr": token_lr, "base_lr": token_lr}] + if base_model.bigram is not None: + tok_groups.append({"params": [base_model.bigram.embed.weight], + "lr": token_lr, "base_lr": token_lr}) + if base_model.bigram.proj is not None: + scalar_params.append(base_model.bigram.proj.weight) + if base_model.ve_shared is not None: + tok_groups.append({"params": [base_model.ve_shared.embed.weight], + "lr": token_lr, "base_lr": token_lr}) + if base_model.ve_shared.proj is not None: + scalar_params.append(base_model.ve_shared.proj.weight) + 
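Every group created in this builder carries a `base_lr` next to its live `lr`; the training loop rescales from `base_lr` each step (`group["lr"] = group.get("base_lr", group["lr"]) * scale`), so the schedule is a pure function of the step rather than a compounding decay of the live value. A toy sketch of the pattern, with a hypothetical stand-in for the schedule:

```python
import torch

p = torch.nn.Parameter(torch.zeros(4, 4))
# Extra keys like "base_lr" are preserved by torch optimizers.
opt = torch.optim.SGD([{"params": [p], "lr": 0.025, "base_lr": 0.025}])

def lr_mul(step: int, total: int = 100) -> float:
    # Stand-in for the warmup/warmdown multiplier used in training.
    return min(1.0, step / 20) * max(0.0, 1.0 - step / total)

for step in range(100):
    for g in opt.param_groups:
        g["lr"] = g["base_lr"] * lr_mul(step)  # stateless: never compounds
```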
scalar_params.append(base_model.ve_shared.scale) + for s in base_model.ve_layer_scales: + scalar_params.append(s) + + if extra_scalar_params: + scalar_params.extend(extra_scalar_params) + + optimizer_tok = torch.optim.AdamW( + tok_groups, betas=(args.beta1, args.beta2), + eps=args.adam_eps, weight_decay=args.adam_wd, fused=True) + optimizer_muon = Muon( + matrix_params, lr=args.matrix_lr, momentum=args.muon_momentum, + backend_steps=args.muon_backend_steps, weight_decay=args.muon_wd) + for group in optimizer_muon.param_groups: + group["base_lr"] = args.matrix_lr + optimizer_scalar = torch.optim.AdamW( + [{"params": scalar_params, "lr": args.scalar_lr, + "base_lr": args.scalar_lr}], + betas=(args.beta1, args.beta2), + eps=args.adam_eps, weight_decay=args.adam_wd, fused=True) + + replicated_params: list[nn.Parameter] = [] + for pg in optimizer_tok.param_groups: + replicated_params.extend(pg["params"]) + replicated_params.extend(scalar_params) + + optimizers = [optimizer_tok, optimizer_muon, optimizer_scalar] + + optimizer_head = None + if base_model.lm_head is not None: + optimizer_head = torch.optim.Adam( + [{"params": [base_model.lm_head.weight], + "lr": args.head_lr, "base_lr": args.head_lr}], + betas=(args.beta1, args.beta2), eps=args.adam_eps, fused=True) + replicated_params.append(base_model.lm_head.weight) + optimizers.append(optimizer_head) + + return optimizers, replicated_params diff --git a/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/ttt_recurrent.py b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/ttt_recurrent.py new file mode 100644 index 0000000000..d850929bfb --- /dev/null +++ b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/ttt_recurrent.py @@ -0,0 +1,259 @@ +"""Recurrent-aware TTT (test-time training) wrapper. + +Preserves the legal score-first protocol from the current best record +while adding recurrence-safe adaptation regimes: + + 1. tail_only – only tail blocks adapt + 2. tail_plus_stem – stem + tail, core frozen + 3. all_unique_layers – stem + core + tail (core at full LR) + 4. all_layers – same as all_unique (alias) + 5. 
all_layers_with_recurrent_lr_scale – core at reduced LR +""" +from __future__ import annotations + +import math +import time +import torch +import torch.distributed as dist +import torch.nn.functional as F +from torch import Tensor, nn + +from model_recurrent_bestbase import RecurrentGPT +from train_utils_recurrent import Hyperparameters + + +def _select_ttt_params( + model: RecurrentGPT, + regime: str, + recurrent_lr_scale: float = 0.1, + base_lr: float = 0.002, +) -> list[dict]: + """Return param groups for TTT, honouring the chosen regime.""" + stem_params = list(model.stem_blocks.parameters()) + core_params = list(model.core_blocks.parameters()) + tail_params = list(model.tail_blocks.parameters()) + + # Non-block params: embeddings, norms, skip weights, bigram, VE, LM head + other_params = [p for n, p in model.named_parameters() + if not any(tag in n for tag in + ("stem_blocks.", "core_blocks.", "tail_blocks.", + "qo_bank", "kv_bank", "mlp_up_bank", + "mlp_down_bank"))] + bank_params = [model.qo_bank, model.kv_bank, + model.mlp_up_bank, model.mlp_down_bank] + + if regime == "tail_only": + return [{"params": tail_params + other_params, "lr": base_lr}] + + if regime == "tail_plus_stem": + return [{"params": stem_params + tail_params + other_params, + "lr": base_lr}] + + if regime in ("all_unique_layers", "all_layers"): + return [{"params": (stem_params + core_params + tail_params + + other_params + bank_params), + "lr": base_lr}] + + if regime == "all_layers_with_recurrent_lr_scale": + groups = [ + {"params": stem_params + tail_params + other_params, + "lr": base_lr}, + {"params": core_params, "lr": base_lr * recurrent_lr_scale}, + {"params": bank_params, "lr": base_lr * recurrent_lr_scale}, + ] + return groups + + raise ValueError(f"Unknown TTT regime: {regime}") + + +def eval_val_sliding_ttt( + args: Hyperparameters, + base_model: RecurrentGPT, + rank: int, + world_size: int, + device: torch.device, + val_tokens: Tensor, + base_bytes_lut: Tensor, + has_leading_space_lut: Tensor, + is_boundary_token_lut: Tensor, + stride: int, + batch_seqs: int = 32, + log0=print, + feedback_fn=None, + stabilizer=None, + ttt_regime: str = "tail_only", + ttt_recurrent_lr_scale: float = 0.1, +) -> tuple[float, float]: + """Legal score-first TTT with recurrence-aware param selection.""" + seq_len = args.train_seq_len + total_tokens = val_tokens.numel() - 1 + ttt_chunk = args.ttt_chunk_tokens + + window_starts = [ws for ws in range(0, total_tokens, stride) + if min(ws + seq_len, total_tokens) - ws >= stride + or ws == 0] + + num_chunks = (total_tokens + ttt_chunk - 1) // ttt_chunk + chunk_windows: list[list[int]] = [[] for _ in range(num_chunks)] + for ws in window_starts: + end = min(ws + seq_len, total_tokens) + wlen = end - ws + s = 0 if ws == 0 else max(wlen - stride, 0) + scored_start = ws + s + ci = min(scored_start // ttt_chunk, num_chunks - 1) + chunk_windows[ci].append(ws) + + log0(f"ttt_sliding:start chunks={num_chunks} chunk_tokens={ttt_chunk} " + f"total_windows={len(window_starts)} stride={stride} " + f"regime={ttt_regime} ttt_lr={args.ttt_lr} " + f"ttt_epochs={args.ttt_epochs}") + + loss_sum = torch.zeros((), device=device, dtype=torch.float64) + token_count = torch.zeros((), device=device, dtype=torch.float64) + byte_count = torch.zeros((), device=device, dtype=torch.float64) + + # Build TTT param groups + param_groups = _select_ttt_params( + base_model, ttt_regime, + recurrent_lr_scale=ttt_recurrent_lr_scale, + base_lr=args.ttt_lr) + ttt_params = [] + for pg in param_groups: + 
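A usage sketch for the regime this record is named after: the shared core is executed `num_passes` times per forward/backward, so its gradient signal is amplified and its TTT learning rate is scaled down (`--ttt-recurrent-lr-scale`, default 0.1). Assuming `model` is a `RecurrentGPT` instance:

```python
# Hypothetical usage of the selector above.
groups = _select_ttt_params(
    model, "all_layers_with_recurrent_lr_scale",
    recurrent_lr_scale=0.1, base_lr=2e-3)
for g in groups:
    n = sum(p.numel() for p in g["params"])
    print(f"lr={g['lr']:.1e}  n_params={n}")
# Expected: one group at lr=2.0e-03 (stem/tail/other params) and two
# groups at lr=2.0e-04 (core blocks; shared weight banks).
```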
ttt_params.extend(pg["params"]) + + # Freeze everything first, then unfreeze TTT params + for p in base_model.parameters(): + p.requires_grad_(False) + for p in ttt_params: + p.requires_grad_(True) + + log0(f"ttt_sliding:params unfrozen={sum(p.numel() for p in ttt_params)} " + f"frozen={sum(p.numel() for p in base_model.parameters() if not p.requires_grad)}") + + optimizer = torch.optim.SGD(param_groups, lr=args.ttt_lr, + momentum=args.ttt_momentum) + t0 = time.perf_counter() + + for ci in range(num_chunks): + windows = chunk_windows[ci] + if not windows: + continue + chunk_start = ci * ttt_chunk + chunk_end = min((ci + 1) * ttt_chunk, total_tokens) + + # --- Phase 1: SCORE (inference_mode) --- + my_s = (len(windows) * rank) // world_size + my_e = (len(windows) * (rank + 1)) // world_size + my_windows = windows[my_s:my_e] + + base_model.eval() + with torch.inference_mode(): + for bi in range(0, len(my_windows), batch_seqs): + batch_ws = my_windows[bi:bi + batch_seqs] + bsz = len(batch_ws) + x_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, + device=device) + y_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, + device=device) + wlens: list[int] = [] + for i, ws in enumerate(batch_ws): + end = min(ws + seq_len, total_tokens) + wlen = end - ws + wlens.append(wlen) + chunk_tok = val_tokens[ws:end + 1].to( + dtype=torch.int64, device=device) + x_batch[i, :wlen] = chunk_tok[:-1] + y_batch[i, :wlen] = chunk_tok[1:] + + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + logits = base_model.forward_logits( + x_batch, feedback_fn=feedback_fn, + stabilizer=stabilizer) + nll = F.cross_entropy( + logits.reshape(-1, logits.size(-1)).float(), + y_batch.reshape(-1), reduction="none", + ).reshape(bsz, seq_len) + + for i, ws in enumerate(batch_ws): + wlen = wlens[i] + s = 0 if ws == 0 else max(wlen - stride, 0) + scored_nll = nll[i, s:wlen].to(torch.float64) + loss_sum += scored_nll.sum() + token_count += float(wlen - s) + tgt = y_batch[i, s:wlen] + prev = x_batch[i, s:wlen] + tb = base_bytes_lut[tgt].to(torch.float64) + tb += (has_leading_space_lut[tgt] + & ~is_boundary_token_lut[prev]).to(torch.float64) + byte_count += tb.sum() + + # --- Phase 2: TRAIN --- + is_last_chunk = (ci == num_chunks - 1) + if not is_last_chunk and args.ttt_epochs > 0: + base_model.train() + chunk_seqs = (chunk_end - chunk_start) // seq_len + if chunk_seqs > 0: + cos_lr = (args.ttt_lr * 0.5 + * (1.0 + math.cos( + math.pi * ci / max(num_chunks - 1, 1)))) + for pg in optimizer.param_groups: + scale = pg.get("lr", args.ttt_lr) / args.ttt_lr + pg["lr"] = cos_lr * scale + + my_seq_s = (chunk_seqs * rank) // world_size + my_seq_e = (chunk_seqs * (rank + 1)) // world_size + my_chunk_seqs = my_seq_e - my_seq_s + + for _ep in range(args.ttt_epochs): + for bs in range(0, my_chunk_seqs, args.ttt_batch_seqs): + be = min(bs + args.ttt_batch_seqs, my_chunk_seqs) + actual_bs = my_seq_s + bs + start_tok = chunk_start + actual_bs * seq_len + end_tok = (chunk_start + + (my_seq_s + be) * seq_len + 1) + if end_tok > val_tokens.numel(): + continue + local = val_tokens[start_tok:end_tok].to( + device=device, dtype=torch.int64) + x = local[:-1].reshape(-1, seq_len) + y = local[1:].reshape(-1, seq_len) + optimizer.zero_grad(set_to_none=True) + with torch.autocast(device_type="cuda", + dtype=torch.bfloat16): + loss = base_model( + x, y, feedback_fn=feedback_fn, + stabilizer=stabilizer) + loss.backward() + if world_size > 1: + for p in ttt_params: + if p.grad is not None: + dist.all_reduce( + p.grad, op=dist.ReduceOp.AVG) + 
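One caveat in the per-chunk cosine schedule above: `scale` is recovered from the live `pg["lr"]`, which was itself set to `cos_lr * scale` on the previous chunk, so the cosine factors multiply across chunks instead of being applied fresh each time. That compounding may be acceptable as extra decay, but if a plain per-chunk cosine is intended, pinning a per-group base LR avoids it; a minimal sketch:

```python
import math

# Assumes `optimizer` and `num_chunks` as above; "base_lr" is an extra
# key stashed once, before the chunk loop.
for pg in optimizer.param_groups:
    pg.setdefault("base_lr", pg["lr"])

for ci in range(num_chunks):
    cos_factor = 0.5 * (1.0 + math.cos(math.pi * ci / max(num_chunks - 1, 1)))
    for pg in optimizer.param_groups:
        pg["lr"] = pg["base_lr"] * cos_factor  # absolute, not compounding
```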
torch.nn.utils.clip_grad_norm_( + ttt_params, args.ttt_grad_clip) + optimizer.step() + + if rank == 0 and (ci % 10 == 0 or ci == num_chunks - 1): + elapsed = time.perf_counter() - t0 + rl = loss_sum.item() / max(token_count.item(), 1) + rbpb = (rl / math.log(2.0) + * (token_count.item() / max(byte_count.item(), 1)) + if token_count.item() > 0 else 0.0) + log0(f" ttt_chunk [{ci+1}/{num_chunks}] bpb={rbpb:.6f} " + f"time={elapsed:.1f}s") + + if dist.is_available() and dist.is_initialized(): + for t in (loss_sum, token_count, byte_count): + dist.all_reduce(t, op=dist.ReduceOp.SUM) + + val_loss = (loss_sum / token_count).item() + val_bpb = (val_loss / math.log(2.0) + * (token_count.item() / byte_count.item())) + + for p in base_model.parameters(): + p.requires_grad_(True) + base_model.eval() + + log0(f"ttt_sliding:done val_loss={val_loss:.6f} val_bpb={val_bpb:.6f} " + f"elapsed={time.perf_counter() - t0:.1f}s") + return val_loss, val_bpb diff --git a/sky.yaml b/sky.yaml new file mode 100644 index 0000000000..6c8753456b --- /dev/null +++ b/sky.yaml @@ -0,0 +1,78 @@ +name: nesta-propensity-rl + +envs: + N_GPUS: 8 + CONFIG_NAME: propensity_ppo_8gpu + PREPROCESS: 0 + MAX_SAMPLES: 0 + WANDB_PROJECT_NAME: prosus-propensity-ppo + WANDB_EXPERIMENT_NAME: qwen3-4b-ppo-8gpu-v1 + TOTAL_EPOCHS: 1 + CONFIG_FOLDER: threshold_prediction + SOURCE_PARQUET: /my_data/propensity/promotions/input_data/propensity_cpo_prediction_v3_equal_sampled + DATA_DIR: /tmp/propensity_cpo + +file_mounts: + /my_data: + source: s3://lcm-ifood-data + mode: MOUNT + +resources: + cloud: nebius + region: eu-north1 + accelerators: H200:8 + disk_size: 1000 + ports: + - 8265 + image_id: docker:verlai/verl:vllm011.latest + +num_nodes: 1 + +workdir: . + +secrets: + WANDB_API_KEY: null + +setup: | + set -euo pipefail + + rm -rf verl + pip cache purge + + git clone https://github.com/volcengine/verl.git + cd verl + pip3 install -v -e .[vllm] + cd .. + + if [[ -n "${WANDB_API_KEY:-}" ]]; then + python3 -c "import wandb; wandb.login(relogin=True, key='${WANDB_API_KEY}')" + fi + + export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True + +run: | + set -euo pipefail + + # Preprocess data if requested and not already done + if [[ "${PREPROCESS}" == "1" ]]; then + echo "Preprocessing dataset..." + MAX_SAMPLES_ARG="" + if [[ "${MAX_SAMPLES}" -gt 0 ]]; then + MAX_SAMPLES_ARG="--max_samples ${MAX_SAMPLES}" + fi + python3 custom/propensity/datasets/prepare_data.py \ + --input_path "${SOURCE_PARQUET}" \ + --local_save_dir "${DATA_DIR}" \ + ${MAX_SAMPLES_ARG} + echo "Preprocessing complete." + else + echo "Skipping preprocessing (PREPROCESS=${PREPROCESS}, data exists=$(test -f ${DATA_DIR}/train.parquet && echo yes || echo no))." + fi + + export VLLM_USE_V1=1 + + python3 -m verl.trainer.main_ppo \ + --config-path="${PWD}/configs/experiments/${CONFIG_FOLDER}" \ + --config-name="${CONFIG_NAME}" + + echo "Training completed!" \ No newline at end of file diff --git a/sky_recurrent.yaml b/sky_recurrent.yaml new file mode 100644 index 0000000000..f4cfab36b4 --- /dev/null +++ b/sky_recurrent.yaml @@ -0,0 +1,57 @@ +name: param-golf-recurrent + +envs: + SEED: 1337 + SCRIPT: train_bestbase_recurrent_feedback_learned.py + FEEDBACK_MODE: diagonal + FEEDBACK_RANK: 2 + TTT_REGIME: tail_only + TTT_ENABLED: 0 + +resources: + cloud: nebius + region: eu-north1 + accelerators: H200:8 + disk_size: 200 + image_id: docker:nvcr.io/nvidia/pytorch:24.12-py3 + +num_nodes: 1 + +workdir: . 
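The `envs:` block above is the entire interface between SkyPilot and the trainer: the run section re-exports the values, and `Hyperparameters` in `train_utils_recurrent.py` reads them back through `os.environ.get` defaults. A quick local check of that plumbing (values here are illustrative):

```python
import os

# Simulate what the run section exports.
os.environ["SEED"] = "1337"
os.environ["TTT_ENABLED"] = "0"

# Same pattern as Hyperparameters: env var wins, literal is the fallback.
seed = int(os.environ.get("SEED", 1337))
ttt_enabled = bool(int(os.environ.get("TTT_ENABLED", "0")))
print(seed, ttt_enabled)  # 1337 False
```

One pitfall of the `bool(int(...))` idiom: the flags must be numeric strings ("0"/"1"); passing "true" or "false" raises `ValueError` in `int()`.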
+
+setup: |
+  set -euo pipefail
+
+  pip install sentencepiece numpy huggingface-hub 2>/dev/null
+  pip install flash-attn --no-build-isolation 2>/dev/null || true
+
+  # Download FineWeb dataset + tokenizer (~2 min)
+  if [ ! -d "data/datasets/fineweb10B_sp1024" ]; then
+    echo "Downloading FineWeb dataset..."
+    python data/cached_challenge_fineweb.py
+  else
+    echo "Dataset already present."
+  fi
+
+run: |
+  set -euo pipefail
+
+  RECORD_DIR="records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT"
+  cd "$RECORD_DIR"
+
+  export DATA_PATH="../../../data/datasets/fineweb10B_sp1024"
+  export TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model"
+  export SEED="${SEED}"
+  export MAX_WALLCLOCK_SECONDS=600
+  export TTT_ENABLED="${TTT_ENABLED}"
+
+  echo "Running ${SCRIPT} with SEED=${SEED}..."
+  echo "Data: ${DATA_PATH}"
+  echo "Feedback: mode=${FEEDBACK_MODE} rank=${FEEDBACK_RANK}"
+
+  torchrun --standalone --nproc_per_node=8 "${SCRIPT}" \
+    --feedback-mode "${FEEDBACK_MODE}" \
+    --feedback-rank "${FEEDBACK_RANK}" \
+    --ttt-regime "${TTT_REGIME}"
+
+  echo "Training completed! Logs in logs/"

From 09ea3cf396cd2e60f52f4a43d5d5a5a9e930c909 Mon Sep 17 00:00:00 2001
From: nesta
Date: Thu, 26 Mar 2026 13:02:23 +0000
Subject: [PATCH 02/23] Attempt one, 4 passes and successfully achieve
 contractive layers

---
 .env | 1 +
 ablation_3p_noRMS_j0.0.log | 78 +
 ablation_3p_noRMS_j0.1.log | 80 +
 ablation_stdout.log | 5 +
 baseline_50step.log | 44 +
 baseline_stdout.log | 75 +
 debug.md | 106 +
 full_run_stdout.log | 12 +
 full_run_v2_stdout.log | 52 +
 full_run_v3_stdout.log | 52 +
 full_run_v4.log | 52 +
 full_run_v5.log | 52 +
 grid_p2_j0.0.log | 79 +
 grid_p2_j0.001.log | 79 +
 grid_p2_j0.01.log | 79 +
 grid_p2_j0.1.log | 78 +
 grid_p3_j0.0.log | 78 +
 grid_p3_j0.001.log | 79 +
 grid_p3_j0.01.log | 78 +
 grid_p3_j0.1.log | 78 +
 grid_p4_j0.0.log | 47 +
 grid_search_results.csv | 9 +
 grid_search_stdout.log | 25 +
 .../wandb/debug-internal.log | 1 +
 .../wandb/debug.log | 1 +
 .../wandb/latest-run | 1 +
 .../files/config.yaml | 66 +
 .../files/output.log | 44 +
 .../files/requirements.txt | 101 +
 .../files/wandb-metadata.json | 38 +
 .../files/wandb-summary.json | 1 +
 .../run_full.sh | 63 +
 .../run_smoke.sh | 48 +
 ...train_bestbase_recurrent_feedback_fixed.py | 2 +-
 ...ain_bestbase_recurrent_feedback_learned.py | 6 +-
 .../train_bestbase_recurrent_qat.py | 2 +-
 .../train_utils_recurrent.py | 3 +-
 .../ablation_no_rmsnorm.sh | 77 +
 .../feedback.py | 138 ++
 .../grid_search.sh | 96 +
 .../recurrence-fixes.md | 376 +++
 .../run_3pass.sh | 20 +
 .../run_4pass_qat.sh | 79 +
 .../run_4pass_test.sh | 78 +
 .../run_4pass_ttt.sh | 86 +
 .../run_full_1gpu.sh | 67 +
 .../smoke_passes.sh | 63 +
 .../smoke_test.sh | 58 +
 .../stability.py | 108 +
 .../train_gpt_recurrent.py | 2084 +++++++++++++++++
 .../wandb/debug-internal.log | 1 +
 .../wandb/debug.log | 1 +
 .../wandb/latest-run | 1 +
 .../files/output.log | 32 +
 .../files/requirements.txt | 101 +
 .../files/wandb-metadata.json | 50 +
 .../files/output.log | 3 +
 .../files/requirements.txt | 101 +
 .../files/wandb-metadata.json | 50 +
 .../files/output.log | 32 +
 .../files/requirements.txt | 101 +
 .../files/wandb-metadata.json | 50 +
 .../files/config.yaml | 95 +
 .../files/output.log | 32 +
 .../files/requirements.txt | 101 +
 .../files/wandb-metadata.json | 50 +
 .../files/wandb-summary.json | 1 +
 .../files/config.yaml | 95 +
 .../files/output.log | 32 +
 .../files/requirements.txt | 101 +
 .../files/wandb-metadata.json | 50 +
 .../files/wandb-summary.json | 1 +
 .../files/config.yaml | 95 + 
.../files/output.log | 32 + .../files/requirements.txt | 101 + .../files/wandb-metadata.json | 50 + .../files/wandb-summary.json | 1 + .../files/config.yaml | 95 + .../files/output.log | 32 + .../files/requirements.txt | 101 + .../files/wandb-metadata.json | 50 + .../files/wandb-summary.json | 1 + .../files/config.yaml | 95 + .../files/output.log | 32 + .../files/requirements.txt | 101 + .../files/wandb-metadata.json | 50 + .../files/wandb-summary.json | 1 + .../files/config.yaml | 95 + .../files/output.log | 32 + .../files/requirements.txt | 101 + .../files/wandb-metadata.json | 50 + .../files/wandb-summary.json | 1 + .../files/config.yaml | 95 + .../files/output.log | 32 + .../files/requirements.txt | 101 + .../files/wandb-metadata.json | 50 + .../files/wandb-summary.json | 1 + .../files/config.yaml | 95 + .../files/output.log | 32 + .../files/requirements.txt | 101 + .../files/wandb-metadata.json | 50 + .../files/wandb-summary.json | 1 + .../files/output.log | 22 + .../files/requirements.txt | 101 + .../files/wandb-metadata.json | 50 + .../files/output.log | 18 + .../files/requirements.txt | 101 + .../files/wandb-metadata.json | 51 + .../files/config.yaml | 96 + .../files/output.log | 50 + .../files/requirements.txt | 101 + .../files/wandb-metadata.json | 51 + .../files/wandb-summary.json | 1 + .../files/config.yaml | 96 + .../files/output.log | 50 + .../files/requirements.txt | 101 + .../files/wandb-metadata.json | 51 + .../files/wandb-summary.json | 1 + .../files/config.yaml | 98 + .../files/output.log | 32 + .../files/requirements.txt | 101 + .../files/wandb-metadata.json | 51 + .../files/wandb-summary.json | 1 + .../files/config.yaml | 98 + .../files/output.log | 32 + .../files/requirements.txt | 101 + .../files/wandb-metadata.json | 51 + .../files/wandb-summary.json | 1 + .../files/config.yaml | 98 + .../files/output.log | 32 + .../files/requirements.txt | 101 + .../files/wandb-metadata.json | 51 + .../files/wandb-summary.json | 1 + .../files/output.log | 32 + .../files/requirements.txt | 101 + .../files/wandb-metadata.json | 51 + .../files/output.log | 18 + .../files/requirements.txt | 101 + .../files/wandb-metadata.json | 51 + report.md | 118 + run_baseline_50step.sh | 98 + sky.yaml | 78 - sky_recurrent.yaml | 57 - smoke_3pass.log | 45 + smoke_noclip.log | 48 + smoke_passes.log | 49 + smoke_passes2.log | 26 + test_4pass_noRMS_j0.1.log | 79 + test_4pass_qat.log | 43 + test_4pass_qat_stdout.log | 1 + test_4pass_stdout.log | 3 + test_4pass_ttt.log | 57 + test_4pass_ttt_stdout.log | 1 + 153 files changed, 10485 insertions(+), 142 deletions(-) create mode 100644 .env create mode 100644 ablation_3p_noRMS_j0.0.log create mode 100644 ablation_3p_noRMS_j0.1.log create mode 100644 ablation_stdout.log create mode 100644 baseline_50step.log create mode 100644 baseline_stdout.log create mode 100644 debug.md create mode 100644 full_run_stdout.log create mode 100644 full_run_v2_stdout.log create mode 100644 full_run_v3_stdout.log create mode 100644 full_run_v4.log create mode 100644 full_run_v5.log create mode 100644 grid_p2_j0.0.log create mode 100644 grid_p2_j0.001.log create mode 100644 grid_p2_j0.01.log create mode 100644 grid_p2_j0.1.log create mode 100644 grid_p3_j0.0.log create mode 100644 grid_p3_j0.001.log create mode 100644 grid_p3_j0.01.log create mode 100644 grid_p3_j0.1.log create mode 100644 grid_p4_j0.0.log create mode 100644 grid_search_results.csv create mode 100644 grid_search_stdout.log create mode 120000 
records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/debug-internal.log create mode 120000 records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/debug.log create mode 120000 records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/latest-run create mode 100644 records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/run-20260326_122037-nx8viusx/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/run-20260326_122037-nx8viusx/files/output.log create mode 100644 records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/run-20260326_122037-nx8viusx/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/run-20260326_122037-nx8viusx/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/run-20260326_122037-nx8viusx/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/run_full.sh create mode 100644 records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/run_smoke.sh create mode 100755 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/ablation_no_rmsnorm.sh create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/feedback.py create mode 100755 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/grid_search.sh create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/recurrence-fixes.md create mode 100755 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_3pass.sh create mode 100755 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_4pass_qat.sh create mode 100755 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_4pass_test.sh create mode 100755 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_4pass_ttt.sh create mode 100755 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_full_1gpu.sh create mode 100755 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/smoke_passes.sh create mode 100755 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/smoke_test.sh create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/stability.py create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py create mode 120000 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/debug-internal.log create mode 120000 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/debug.log create mode 120000 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/latest-run create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_102057-7kghfexn/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_102057-7kghfexn/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_102057-7kghfexn/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103645-sx3ojo04/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103645-sx3ojo04/files/requirements.txt create mode 100644 
records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103645-sx3ojo04/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103737-eswa6xue/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103737-eswa6xue/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103737-eswa6xue/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_104722-1bh0d9xu/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_104722-1bh0d9xu/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_104722-1bh0d9xu/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_104722-1bh0d9xu/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_104722-1bh0d9xu/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_105405-dd5xlg1l/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_105405-dd5xlg1l/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_105405-dd5xlg1l/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_105405-dd5xlg1l/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_105405-dd5xlg1l/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110050-pmxdy841/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110050-pmxdy841/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110050-pmxdy841/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110050-pmxdy841/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110050-pmxdy841/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110735-wq77le9z/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110735-wq77le9z/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110735-wq77le9z/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110735-wq77le9z/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110735-wq77le9z/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_111419-rr366tug/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_111419-rr366tug/files/output.log create mode 100644 
records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_111419-rr366tug/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_111419-rr366tug/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_111419-rr366tug/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_112256-w5b84094/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_112256-w5b84094/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_112256-w5b84094/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_112256-w5b84094/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_112256-w5b84094/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_113135-l86ibk0l/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_113135-l86ibk0l/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_113135-l86ibk0l/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_113135-l86ibk0l/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_113135-l86ibk0l/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114011-n43r2rb3/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114011-n43r2rb3/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114011-n43r2rb3/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114011-n43r2rb3/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114011-n43r2rb3/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114848-v63k38ck/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114848-v63k38ck/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114848-v63k38ck/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115141-bq7ercgn/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115141-bq7ercgn/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115141-bq7ercgn/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115741-648c04nz/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115741-648c04nz/files/output.log create mode 100644 
records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115741-648c04nz/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115741-648c04nz/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115741-648c04nz/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115749-qjzpp5d7/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115749-qjzpp5d7/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115749-qjzpp5d7/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115749-qjzpp5d7/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115749-qjzpp5d7/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115924-xtlv4t52/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115924-xtlv4t52/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115924-xtlv4t52/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115924-xtlv4t52/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115924-xtlv4t52/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_120745-6rfmco93/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_120745-6rfmco93/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_120745-6rfmco93/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_120745-6rfmco93/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_120745-6rfmco93/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_123015-qgrbnv6t/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_123015-qgrbnv6t/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_123015-qgrbnv6t/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_123015-qgrbnv6t/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_123015-qgrbnv6t/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_124119-cf7n2jes/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_124119-cf7n2jes/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_124119-cf7n2jes/files/wandb-metadata.json create mode 100644 
records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/wandb-metadata.json create mode 100644 report.md create mode 100755 run_baseline_50step.sh delete mode 100644 sky.yaml delete mode 100644 sky_recurrent.yaml create mode 100644 smoke_3pass.log create mode 100644 smoke_noclip.log create mode 100644 smoke_passes.log create mode 100644 smoke_passes2.log create mode 100644 test_4pass_noRMS_j0.1.log create mode 100644 test_4pass_qat.log create mode 100644 test_4pass_qat_stdout.log create mode 100644 test_4pass_stdout.log create mode 100644 test_4pass_ttt.log create mode 100644 test_4pass_ttt_stdout.log diff --git a/.env b/.env new file mode 100644 index 0000000000..3bdd2a393b --- /dev/null +++ b/.env @@ -0,0 +1 @@ +WANDB_API_KEY=<redacted> \ No newline at end of file diff --git a/ablation_3p_noRMS_j0.0.log b/ablation_3p_noRMS_j0.0.log new file mode 100644 index 0000000000..3710701231 --- /dev/null +++ b/ablation_3p_noRMS_j0.0.log @@ -0,0 +1,78 @@ +logs/46d7a1dd-24ad-4ebb-bfa3-b13f0ce61391.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=3 core_end=8 num_passes=3 stem=3 core=5 tail=3 +model_params:26927199 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:50 warmup_steps:5 max_wallclock_seconds:900.000 +seed:1337 +wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY. +wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin +wandb: WARNING Using a boolean value for 'reinit' is deprecated. Use 'return_previous' or 'finish_previous' instead. +wandb: setting up run xtlv4t52 +wandb: Tracking run with wandb version 0.25.1 +wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115924-xtlv4t52 +wandb: Run `wandb offline` to turn off syncing.
+wandb: Syncing run ablation_3p_noRMS_j0.0 +wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf +wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/xtlv4t52 +wandb:initialized +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms h_norms=['10160.2', '11311.6', '12555.0', '13896.3', '15332.5', '16950.8', '18657.9', '20459.0', '22380.8', '24439.6', '22273.3', '24354.9', '26573.0', '28946.4', '31482.6'] growth=['1.116', '1.113', '1.110', '1.107', '1.103', '1.106', '1.101', '1.097', '1.094', '1.092', '1.098', '1.093', '1.091', '1.089', '1.088'] +step:1/50 train_loss:6.9310 train_time:2457ms step_avg:2456.70ms +step:2/50 train_loss:8.4480 train_time:4895ms step_avg:2447.55ms +step:3/50 train_loss:7.5656 train_time:7366ms step_avg:2455.22ms +step:4/50 train_loss:7.3715 train_time:9835ms step_avg:2458.84ms +step:5/50 train_loss:7.1882 train_time:12305ms step_avg:2460.94ms +step:6/50 train_loss:7.1200 train_time:14774ms step_avg:2462.35ms +step:7/50 train_loss:7.1275 train_time:17244ms step_avg:2463.46ms +step:8/50 train_loss:7.0234 train_time:19715ms step_avg:2464.42ms +step:9/50 train_loss:6.6287 train_time:22185ms step_avg:2465.05ms +step:10/50 train_loss:6.2775 train_time:24656ms step_avg:2465.57ms +step:20/50 train_loss:5.2073 train_time:49354ms step_avg:2467.72ms +step:25/50 val_loss:4.6001 val_bpb:2.7244 train_time:61743ms step_avg:2469.70ms h_norms=['18012.8', '20576.1', '23734.4', '27658.0', '32476.6', '39019.5', '47115.3', '57297.5', '70090.2', '85988.1', '66911.8', '82425.7', '101585.5', '125680.3', '156003.0'] growth=['1.133', '1.142', '1.153', '1.165', '1.174', '1.201', '1.207', '1.216', '1.223', '1.227', '1.228', '1.232', '1.232', '1.237', '1.241'] +step:30/50 train_loss:4.3938 train_time:74062ms step_avg:2468.73ms +step:40/50 train_loss:4.0561 train_time:98772ms step_avg:2469.31ms +step:50/50 train_loss:3.8233 train_time:123613ms step_avg:2472.25ms +step:50/50 val_loss:3.7814 val_bpb:2.2396 train_time:123647ms step_avg:2472.93ms h_norms=['31577.0', '34240.9', '37755.1', '42362.9', '48395.2', '56432.3', '66325.1', '79064.0', '95012.0', '114485.4', '84419.3', '101309.9', '122596.9', '148869.6', '181094.1'] growth=['1.068', '1.084', '1.103', '1.122', '1.142', '1.166', '1.175', '1.192', '1.202', '1.205', '1.193', '1.200', '1.210', '1.214', '1.216'] +peak memory allocated: 54207 MiB reserved: 55384 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:5.9416 val_bpb:3.5190 eval_time:67959ms +Serialized model: 106023671 bytes +Code size: 98931 bytes +Serialized model int6+lzma: 4809652 bytes +Total submission size int6+lzma: 4908583 bytes +final_int6_roundtrip val_loss:6.1789 val_bpb:3.6595 eval_time:67564ms +final_int6_roundtrip_exact val_loss:6.17886280 val_bpb:3.65947059 +wandb: updating run metadata +wandb: uploading history steps 15-15, summary, console lines 30-31 +wandb: +wandb: Run history: +wandb: lr_scale ▁▁▁▁▁▁▁▁▁▁▁▁▁▁ +wandb: step_avg_ms ▄▁▃▄▅▅▆▆▆▆▇▇▇█ +wandb: train_loss ▆█▇▆▆▆▆▆▅▅▃▂▁▁ +wandb: val_bpb █▃▁ +wandb: val_loss █▃▁ +wandb: +wandb: Run summary: +wandb: lr_scale 1 +wandb: step_avg_ms 2472.25004 +wandb: train_loss 3.82329 +wandb: val_bpb 2.23957 +wandb: val_loss 3.78142 +wandb: +wandb: 🚀 View run ablation_3p_noRMS_j0.0 at: https://wandb.ai/propensity/parameter-golf/runs/xtlv4t52 +wandb: ⭐️ View project at: https://wandb.ai/propensity/parameter-golf +wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other 
file(s) +wandb: Find logs at: ./wandb/run-20260326_115924-xtlv4t52/logs diff --git a/ablation_3p_noRMS_j0.1.log b/ablation_3p_noRMS_j0.1.log new file mode 100644 index 0000000000..4d67f14be8 --- /dev/null +++ b/ablation_3p_noRMS_j0.1.log @@ -0,0 +1,80 @@ +logs/4bd1dcea-262c-45fd-b47c-6c1070e31866.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=3 core_end=8 num_passes=3 stem=3 core=5 tail=3 +model_params:26927199 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:50 warmup_steps:5 max_wallclock_seconds:900.000 +seed:1337 +wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY. +wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin +wandb: WARNING Using a boolean value for 'reinit' is deprecated. Use 'return_previous' or 'finish_previous' instead. +wandb: setting up run 6rfmco93 +wandb: Tracking run with wandb version 0.25.1 +wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_120745-6rfmco93 +wandb: Run `wandb offline` to turn off syncing. +wandb: Syncing run ablation_3p_noRMS_j0.1 +wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf +wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/6rfmco93 +wandb:initialized +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['10143.6', '11247.2', '12436.8', '13721.2', '15093.2', '16638.6', '18260.9', '19974.1', '21802.2', '23760.8', '21655.4', '23628.3', '25729.0', '27978.2', '30382.6'] growth=['1.110', '1.109', '1.106', '1.103', '1.100', '1.102', '1.098', '1.094', '1.092', '1.090', '1.095', '1.091', '1.089', '1.087', '1.086'] +step:1/50 train_loss:6.9310 train_time:2474ms step_avg:2473.98ms +step:2/50 train_loss:8.4480 train_time:4928ms step_avg:2463.85ms +step:3/50 train_loss:7.5657 train_time:7414ms step_avg:2471.28ms +step:4/50 train_loss:7.4125 train_time:9901ms step_avg:2475.24ms +step:5/50 train_loss:7.2581 train_time:12387ms step_avg:2477.37ms +step:6/50 train_loss:7.1563 train_time:14873ms step_avg:2478.80ms +step:7/50 train_loss:7.1205 train_time:17358ms step_avg:2479.79ms +step:8/50 train_loss:7.0021 train_time:19845ms step_avg:2480.59ms +step:9/50 train_loss:6.6191 train_time:22332ms step_avg:2481.32ms +step:10/50 train_loss:6.2241 train_time:24818ms step_avg:2481.82ms +step:20/50 train_loss:4.8854 train_time:49674ms step_avg:2483.72ms +step:25/50 val_loss:4.4102 val_bpb:2.6119 train_time:62144ms step_avg:2485.74ms h_norms=['12925.3', '12168.2', '11607.0', '11186.7', '10890.7', '10691.3', '10518.0', '10429.2', '10384.9', '10395.0', '10468.4', '10350.9', '10317.8', '10323.0', '10377.5'] growth=['0.930', '0.941', '0.954', '0.964', '0.974', '0.982', '0.984', '0.992', '0.996', '1.001', 
'0.987', '0.989', '0.997', '1.001', '1.005'] +step:30/50 train_loss:4.2124 train_time:74549ms step_avg:2484.96ms +step:40/50 train_loss:3.9336 train_time:99426ms step_avg:2485.66ms +step:50/50 train_loss:3.7638 train_time:124432ms step_avg:2488.64ms +step:50/50 val_loss:3.7456 val_bpb:2.2184 train_time:124466ms step_avg:2489.33ms h_norms=['20394.8', '18235.1', '16671.4', '15574.6', '14825.8', '14555.8', '14297.6', '14121.5', '14031.0', '13984.3', '14335.7', '14174.8', '14069.5', '14035.7', '14026.9'] growth=['0.871', '0.894', '0.914', '0.934', '0.952', '0.982', '0.982', '0.988', '0.994', '0.997', '0.991', '0.989', '0.993', '0.998', '0.999'] +peak memory allocated: 54399 MiB reserved: 55768 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:5.9394 val_bpb:3.5176 eval_time:68125ms +Serialized model: 106023671 bytes +Code size: 98931 bytes +Serialized model int6+lzma: 4804840 bytes +Total submission size int6+lzma: 4903771 bytes +final_int6_roundtrip val_loss:6.1350 val_bpb:3.6335 eval_time:67734ms +final_int6_roundtrip_exact val_loss:6.13503683 val_bpb:3.63351438 +wandb: updating run metadata +wandb: uploading history steps 15-15, summary, console lines 30-31 +wandb: uploading output.log; uploading wandb-summary.json +wandb: uploading data +wandb: +wandb: Run history: +wandb: lr_scale ▁▁▁▁▁▁▁▁▁▁▁▁▁▁ +wandb: step_avg_ms ▄▁▃▄▅▅▅▆▆▆▇▇▇█ +wandb: train_loss ▆█▇▆▆▆▆▆▅▅▃▂▁▁ +wandb: val_bpb █▂▁ +wandb: val_loss █▂▁ +wandb: +wandb: Run summary: +wandb: lr_scale 1 +wandb: step_avg_ms 2488.6422 +wandb: train_loss 3.76377 +wandb: val_bpb 2.21838 +wandb: val_loss 3.74564 +wandb: +wandb: 🚀 View run ablation_3p_noRMS_j0.1 at: https://wandb.ai/propensity/parameter-golf/runs/6rfmco93 +wandb: ⭐️ View project at: https://wandb.ai/propensity/parameter-golf +wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s) +wandb: Find logs at: ./wandb/run-20260326_120745-6rfmco93/logs diff --git a/ablation_stdout.log b/ablation_stdout.log new file mode 100644 index 0000000000..25ebe5996d --- /dev/null +++ b/ablation_stdout.log @@ -0,0 +1,5 @@ +START 3-pass no-RMSnorm jac=0.0 (11:59:20) +DONE jac=0.0 => bpb@50=2.2396 int6=3.65947059 step=2472.25ms mem=54207MiB +START 3-pass no-RMSnorm jac=0.1 (12:07:41) +DONE jac=0.1 => bpb@50=2.2184 int6=3.63351438 step=2488.64ms mem=54399MiB +=== ABLATION COMPLETE (Thu Mar 26 12:16:04 UTC 2026) === diff --git a/baseline_50step.log b/baseline_50step.log new file mode 100644 index 0000000000..55d4ab72ac --- /dev/null +++ b/baseline_50step.log @@ -0,0 +1,44 @@ +logs/bff51f18-fbb9-43cc-9903-c84284e4e76d.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +model_params:26928220 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[7, 8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:50 warmup_steps:5 max_wallclock_seconds:900.000 +seed:1337 +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms +step:1/50 train_loss:6.9310 train_time:1335ms step_avg:1334.89ms +step:2/50 
train_loss:8.6894 train_time:2639ms step_avg:1319.33ms +step:3/50 train_loss:7.7641 train_time:3975ms step_avg:1325.02ms +step:4/50 train_loss:7.2309 train_time:5311ms step_avg:1327.85ms +step:5/50 train_loss:7.1292 train_time:6648ms step_avg:1329.55ms +step:6/50 train_loss:7.1698 train_time:7983ms step_avg:1330.57ms +step:7/50 train_loss:7.1045 train_time:9320ms step_avg:1331.38ms +step:8/50 train_loss:6.9776 train_time:10656ms step_avg:1331.99ms +step:9/50 train_loss:6.6169 train_time:11993ms step_avg:1332.53ms +step:10/50 train_loss:6.2604 train_time:13330ms step_avg:1332.96ms +step:20/50 train_loss:5.1681 train_time:26695ms step_avg:1334.74ms +step:25/50 val_loss:4.6120 val_bpb:2.7315 train_time:33413ms step_avg:1336.54ms +step:30/50 train_loss:4.3901 train_time:40068ms step_avg:1335.60ms +step:40/50 train_loss:4.0167 train_time:53443ms step_avg:1336.07ms +step:50/50 train_loss:3.8262 train_time:66820ms step_avg:1336.40ms +step:50/50 val_loss:3.7856 val_bpb:2.2421 train_time:66853ms step_avg:1337.06ms +peak memory allocated: 30083 MiB reserved: 31168 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:5.8987 val_bpb:3.4935 eval_time:38419ms +Serialized model: 106027446 bytes +Code size: 89458 bytes +Serialized model int6+lzma: 4809376 bytes +Total submission size int6+lzma: 4898834 bytes +final_int6_roundtrip val_loss:6.0576 val_bpb:3.5876 eval_time:38209ms +final_int6_roundtrip_exact val_loss:6.05759208 val_bpb:3.58764724 diff --git a/baseline_stdout.log b/baseline_stdout.log new file mode 100644 index 0000000000..5a653b4e0b --- /dev/null +++ b/baseline_stdout.log @@ -0,0 +1,75 @@ +START baseline SOTA 50-step (12:20:35) +wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY. +wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin +wandb: setting up run nx8viusx +wandb: Tracking run with wandb version 0.25.1 +wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/run-20260326_122037-nx8viusx +wandb: Run `wandb offline` to turn off syncing. 
+wandb: Syncing run baseline_SOTA_50step +wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf +wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/nx8viusx +logs/bff51f18-fbb9-43cc-9903-c84284e4e76d.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +model_params:26928220 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[7, 8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:50 warmup_steps:5 max_wallclock_seconds:900.000 +seed:1337 +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms +step:1/50 train_loss:6.9310 train_time:1335ms step_avg:1334.89ms +step:2/50 train_loss:8.6894 train_time:2639ms step_avg:1319.33ms +step:3/50 train_loss:7.7641 train_time:3975ms step_avg:1325.02ms +step:4/50 train_loss:7.2309 train_time:5311ms step_avg:1327.85ms +step:5/50 train_loss:7.1292 train_time:6648ms step_avg:1329.55ms +step:6/50 train_loss:7.1698 train_time:7983ms step_avg:1330.57ms +step:7/50 train_loss:7.1045 train_time:9320ms step_avg:1331.38ms +step:8/50 train_loss:6.9776 train_time:10656ms step_avg:1331.99ms +step:9/50 train_loss:6.6169 train_time:11993ms step_avg:1332.53ms +step:10/50 train_loss:6.2604 train_time:13330ms step_avg:1332.96ms +step:20/50 train_loss:5.1681 train_time:26695ms step_avg:1334.74ms +step:25/50 val_loss:4.6120 val_bpb:2.7315 train_time:33413ms step_avg:1336.54ms +step:30/50 train_loss:4.3901 train_time:40068ms step_avg:1335.60ms +step:40/50 train_loss:4.0167 train_time:53443ms step_avg:1336.07ms +step:50/50 train_loss:3.8262 train_time:66820ms step_avg:1336.40ms +step:50/50 val_loss:3.7856 val_bpb:2.2421 train_time:66853ms step_avg:1337.06ms +peak memory allocated: 30083 MiB reserved: 31168 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:5.8987 val_bpb:3.4935 eval_time:38419ms +Serialized model: 106027446 bytes +Code size: 89458 bytes +Serialized model int6+lzma: 4809376 bytes +Total submission size int6+lzma: 4898834 bytes +final_int6_roundtrip val_loss:6.0576 val_bpb:3.5876 eval_time:38209ms +final_int6_roundtrip_exact val_loss:6.05759208 val_bpb:3.58764724 +wandb: updating run metadata +wandb: uploading history steps 15-15, summary +wandb: +wandb: Run history: +wandb: step_avg_ms ▁███████████████ +wandb: train_loss ▅█▇▆▆▆▆▆▅▅▃▂▁▁ +wandb: val_bpb █▃▁ +wandb: val_loss █▃▁ +wandb: +wandb: Run summary: +wandb: step_avg_ms 1337.06 +wandb: train_loss 3.8262 +wandb: val_bpb 2.2421 +wandb: val_loss 3.7856 +wandb: +wandb: 🚀 View run baseline_SOTA_50step at: https://wandb.ai/propensity/parameter-golf/runs/nx8viusx +wandb: ⭐️ View project at: https://wandb.ai/propensity/parameter-golf +wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s) +wandb: Find logs at: ./wandb/run-20260326_122037-nx8viusx/logs +EXIT CODE: -9 +DONE baseline (12:28:16)
diff --git a/debug.md b/debug.md
new file mode 100644
index 0000000000..8d38520dc1
--- /dev/null
+++ b/debug.md
@@ -0,0 +1,106 @@
+# Debug Log — Recurrent Core + Learned Feedback + QAT
+
+## Environment Setup
+
+| Component | Value |
+|-----------|-------|
+| GPU | NVIDIA H200 (143GB HBM) |
+| CUDA driver | 13.0 |
+| PyTorch | 2.11.0+cu130 |
+| Python | 3.12.3 |
+| Flash Attention | FA3 Hopper (pre-built wheel from `varunneal/flash-attention-hopper`) |
+| OS | Ubuntu 24.04 (noble), kernel 6.11.0-1016-nvidia |
+
+## Setup Steps
+
+### 1. Virtual environment
+```bash
+python3 -m venv .venv
+source .venv/bin/activate
+pip install --upgrade pip
+```
+
+### 2. PyTorch installation
+The initial attempt with `cu124` failed (CUDA version mismatch: the system runs CUDA 13.0, while that PyTorch build is compiled for 12.4).
+Reinstalled with nightly `cu128` first, then switched to `flash-attn`, which pulled in `cu130`:
+```bash
+pip install flash-attn --no-build-isolation
+# This auto-installed torch 2.11.0+cu130 with CUDA 13.0 bindings
+```
+
+### 3. Flash Attention 3 (Hopper)
+The model code imports `from flash_attn_interface import flash_attn_func`, which is FA3 (Hopper-specific).
+
+**Attempt 1 — Build from source:** Cloned `Dao-AILab/flash-attention` and ran `setup.py install` from the `hopper/` directory. Extremely slow: of 451 CUDA kernel files, only 42 had completed after ~34 minutes with `MAX_JOBS=4`. Killed.
+
+**Attempt 2 — Pre-built wheels from HuggingFace:** Downloaded from `varunneal/flash-attention-hopper` (`build/torch210-cxx11-cu130-x86_64-linux/flash_attention_hopper/`). Installed the package into site-packages and created a shim module:
+```python
+# .venv/lib/python3.12/site-packages/flash_attn_interface.py
+from flash_attention_hopper.flash_attn_interface import *
+```
+**Result:** FA3 import and forward pass verified working.
+
+### 4. Data download
+```bash
+python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 10
+```
+Downloaded 10 training shards (~2GB) + 1 validation shard (~124MB) + the tokenizer.
+
+## Issues Encountered
+
+### Issue 1: CUDA version mismatch
+- **Symptom:** PyTorch cu124 couldn't compile extensions against CUDA 13.0
+- **Fix:** Installed PyTorch with cu130 support via the flash-attn dependency chain
+
+### Issue 2: `flash_attn_interface` not available as a pip package
+- **Symptom:** `ModuleNotFoundError: No module named 'flash_attn_interface'`
+- **Root cause:** FA3 (Hopper) is a separate build from the `hopper/` directory, not published to PyPI
+- **Fix:** Pre-built wheel from HuggingFace + shim module
+
+### Issue 3: `torch.compile(fullgraph=True)` crashes on PyTorch 2.11 nightly
+- **Symptom:** `FailOnRecompileLimitHit: Hard failure due to fullgraph=True`
+- **Root cause:** The `RecurrentStabilizer.record_pass()` method uses `.item()` and `list.append()`, which cause graph breaks; the PyTorch nightly (2.11) is stricter about recompilation limits.
+- **Fix:** Disabled `torch.compile` entirely (`compiled_model = base_model`) and set `TORCH_COMPILE_DISABLE=1` in the run scripts. A compile-friendly alternative for the recording path is sketched below.
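+
+For reference, here is a minimal compile-friendly sketch of that recording path. The class name and signatures below are assumptions for illustration, not the actual code in `stability.py`: the point is only that per-pass statistics live in a preallocated tensor buffer, so the traced region contains pure tensor ops, and host-side `.item()`/`.tolist()` reads happen later in the logging path, outside the compiled region.
+```python
+import torch
+
+class RecurrentStabilizer(torch.nn.Module):
+    """Hypothetical sketch; the real class in stability.py may differ."""
+
+    def __init__(self, max_passes: int = 8):
+        super().__init__()
+        # Diagnostics-only buffer; persistent=False keeps it out of checkpoints.
+        self.register_buffer("h_norms", torch.zeros(max_passes), persistent=False)
+
+    def record_pass(self, pass_idx: int, h: torch.Tensor) -> None:
+        # Pure tensor write: no .item(), no list.append(), hence no graph break.
+        self.h_norms[pass_idx] = h.detach().float().norm()
+
+    def growth(self) -> torch.Tensor:
+        # Ratios of consecutive pass norms (the "growth" numbers in the logs);
+        # call .tolist() on this only at log time, outside torch.compile.
+        return self.h_norms[1:] / self.h_norms[:-1].clamp_min(1e-8)
+```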
+
+### Issue 4: Shell sandbox failure (Cursor IDE)
+- **Symptom:** All shell commands returned empty output, 0ms, and exit code 0 (even `false`)
+- **Impact:** Could not run the smoke test or the full experiment from the IDE
+- **Workaround:** Created `run_smoke.sh` and `run_full.sh` scripts for manual execution
+
+## Files Modified
+
+| File | Change |
+|------|--------|
+| `train_bestbase_recurrent_feedback_learned.py` | `torch.compile` → `base_model` (eager mode) |
+| `train_bestbase_recurrent_feedback_fixed.py` | Same |
+| `train_bestbase_recurrent_qat.py` | Same |
+| `train_utils_recurrent.py` | `torch.compile(forward_logits)` → `base_model.forward_logits` |
+
+## Model Verification
+
+Quick forward-pass test (successful):
+```
+PyTorch 2.11.0+cu130, CUDA 13.0, GPU: NVIDIA H200
+FA3: OK
+Model created: 19,679,297 params
+Feedback module: 2,560 params
+Stabilizer: OK
+Forward pass loss: 6.9405 - ALL OK!
+```
+
+## How to Run
+
+### Smoke test (50 steps, ~5 min)
+```bash
+bash records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/run_smoke.sh
+```
+
+### Full experiment (80 min on 1 GPU ≈ 10 min on 8 GPUs)
+```bash
+bash records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/run_full.sh
+```
+
+### Custom duration
+```bash
+MINUTES=120 SEED=42 bash records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/run_full.sh
+```
diff --git a/full_run_stdout.log b/full_run_stdout.log
new file mode 100644
index 0000000000..78203cbc33
--- /dev/null
+++ b/full_run_stdout.log
@@ -0,0 +1,12 @@
+============================================================
+ Full 1-GPU run: learned feedback variant
+ Wall clock: 80 minutes (4800s)
+ Seed: 1337
+============================================================
+logs/62a5dffa-8e68-4c33-b497-058fd9aacb29.txt
+model_params:19843140 unique_layers:8 stem:3 core:2 tail:3 passes:3
+feedback: mode=diagonal rank=2 per_pass=False affine=False params=2560
+world_size:1 grad_accum_steps:8
+train_batch_tokens:786432 train_seq_len:2048 iterations:20000 seed:1337
+warmup_step:10/20
+/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/run_full.sh: line 60: 175459 Killed PYTHONUNBUFFERED=1 TORCH_COMPILE_DISABLE=1 DATA_PATH="../../../data/datasets/fineweb10B_sp1024" TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model" SEED="${SEED}" ITERATIONS=20000 MAX_WALLCLOCK_SECONDS="${WALLCLOCK}" VAL_LOSS_EVERY=2000 TRAIN_LOG_EVERY=200 WARMUP_STEPS=20 WARMDOWN_ITERS=3500 TRAIN_BATCH_TOKENS=786432 TRAIN_SEQ_LEN=2048 EVAL_SEQ_LEN=2048 EVAL_STRIDE=64 NUM_STEM_LAYERS=3 NUM_CORE_LAYERS=2 NUM_TAIL_LAYERS=3 NUM_PASSES=3 CORE_QUANT_BITS=6 CORE_QUANT_ENABLED=1 BIGRAM_VOCAB_SIZE=1536 XSA_LAST_N=4 ROPE_DIMS=16 LN_SCALE=1 VE_ENABLED=1 VE_DIM=128 VE_LAYERS="6,7" MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 MUON_MOMENTUM_WARMUP_STEPS=1500 MUON_WD=0.04 ADAM_WD=0.04 GRAD_CLIP_NORM=0.3 SWA_ENABLED=1 SWA_EVERY=50 LATE_QAT=1 LATE_QAT_THRESHOLD=0.15 TTT_ENABLED=0 $PYTHON train_bestbase_recurrent_feedback_learned.py --feedback-mode diagonal --feedback-rank 2 --ttt-regime tail_only
diff --git a/full_run_v2_stdout.log b/full_run_v2_stdout.log
new file mode 100644
index 0000000000..611c2acd0e
--- /dev/null
+++ b/full_run_v2_stdout.log
@@ -0,0 +1,52 @@
+============================================================
+ Full 1-GPU run: RecurrentSOTA + Learned Feedback
+ Wall clock: 80 minutes (4800s)
+ Seed: 1337
+============================================================
+logs/269c0ea3-cfa1-485c-94fa-c52ab3a87114.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=3 core_end=8 num_passes=2 stem=3 core=5 tail=3 +model_params:26927196 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[7, 8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:4800.000 +seed:1337 +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/20000 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['20398.7', '24334.0', '28640.7', '32899.3', '36909.1', '1293622.0', '59990772.0', '2812088832.0', '132758822912.0', '6375751548928.0'] growth=['1.236', '1.193', '1.177', '1.149', '1.122', '35.049', '46.374', '46.875', '47.210', '48.025'] +step:1/20000 train_loss:6.9341 train_time:2075ms step_avg:2075.31ms +step:2/20000 train_loss:18.4141 train_time:4023ms step_avg:2011.40ms +step:3/20000 train_loss:22.0334 train_time:6002ms step_avg:2000.77ms +step:4/20000 train_loss:23.7313 train_time:7983ms step_avg:1995.64ms +step:5/20000 train_loss:23.2545 train_time:9963ms step_avg:1992.62ms +step:6/20000 train_loss:22.9113 train_time:11943ms step_avg:1990.51ms +step:7/20000 train_loss:22.5152 train_time:13923ms step_avg:1989.02ms +step:8/20000 train_loss:22.1715 train_time:15903ms step_avg:1987.89ms +step:9/20000 train_loss:20.2900 train_time:17883ms step_avg:1987.02ms +step:10/20000 train_loss:17.9381 train_time:19864ms step_avg:1986.38ms +run_full_1gpu.sh: line 63: 183058 Killed PYTHONUNBUFFERED=1 TORCH_COMPILE_DISABLE=1 DATA_PATH="../../../data/datasets/fineweb10B_sp1024" TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model" SEED="${SEED}" ITERATIONS=20000 MAX_WALLCLOCK_SECONDS="${WALLCLOCK}" VAL_LOSS_EVERY=2000 TRAIN_LOG_EVERY=200 WARMUP_STEPS=20 WARMDOWN_ITERS=3500 TRAIN_BATCH_TOKENS=786432 TRAIN_SEQ_LEN=2048 EVAL_SEQ_LEN=2048 EVAL_STRIDE=64 NUM_LAYERS=11 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4 BIGRAM_VOCAB_SIZE=1536 XSA_LAST_N=4 ROPE_DIMS=16 LN_SCALE=1 VE_ENABLED=1 VE_DIM=128 VE_LAYERS="9,10" MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 MUON_MOMENTUM_WARMUP_STEPS=1500 MUON_WD=0.04 ADAM_WD=0.04 GRAD_CLIP_NORM=0.3 SWA_ENABLED=1 SWA_EVERY=50 LATE_QAT=1 LATE_QAT_THRESHOLD=0.15 TTT_ENABLED=0 CORE_START=3 CORE_END=8 NUM_PASSES=2 CORE_QUANT_ENABLED=0 $PYTHON train_gpt_recurrent.py --feedback-mode diagonal --feedback-rank 2 diff --git a/full_run_v3_stdout.log b/full_run_v3_stdout.log new file mode 100644 index 0000000000..dffcefbeda --- /dev/null +++ b/full_run_v3_stdout.log @@ -0,0 +1,52 @@ +============================================================ + Full 1-GPU run: RecurrentSOTA + Learned 
Feedback + Wall clock: 80 minutes (4800s) + Seed: 1337 +============================================================ +logs/7d04be46-b9bd-40bc-b2c1-a12d16a74ae9.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=4 core_end=7 num_passes=2 stem=4 core=3 tail=4 +model_params:26927708 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[7, 8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:4800.000 +seed:1337 +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/20000 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms h_norms=['41381.7', '41230.5', '42830.4', '101672.3', '106011.9', '100487.2'] growth=['0.815', '0.996', '1.039', '2.374', '1.043', '0.948'] +step:1/20000 train_loss:6.9317 train_time:1705ms step_avg:1705.16ms +step:2/20000 train_loss:17.3581 train_time:3384ms step_avg:1692.12ms +step:3/20000 train_loss:18.3336 train_time:5096ms step_avg:1698.57ms +step:4/20000 train_loss:16.9947 train_time:6807ms step_avg:1701.79ms +step:5/20000 train_loss:14.3763 train_time:8518ms step_avg:1703.64ms +step:6/20000 train_loss:12.6703 train_time:10229ms step_avg:1704.78ms +step:7/20000 train_loss:11.3676 train_time:11939ms step_avg:1705.60ms +step:8/20000 train_loss:11.3329 train_time:13648ms step_avg:1706.02ms +step:9/20000 train_loss:10.9895 train_time:15358ms step_avg:1706.43ms +step:10/20000 train_loss:10.4378 train_time:17067ms step_avg:1706.72ms +run_full_1gpu.sh: line 64: 184945 Killed PYTHONUNBUFFERED=1 TORCH_COMPILE_DISABLE=1 DATA_PATH="../../../data/datasets/fineweb10B_sp1024" TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model" SEED="${SEED}" ITERATIONS=20000 MAX_WALLCLOCK_SECONDS="${WALLCLOCK}" VAL_LOSS_EVERY=2000 TRAIN_LOG_EVERY=200 WARMUP_STEPS=20 WARMDOWN_ITERS=3500 TRAIN_BATCH_TOKENS=786432 TRAIN_SEQ_LEN=2048 EVAL_SEQ_LEN=2048 EVAL_STRIDE=64 NUM_LAYERS=11 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4 BIGRAM_VOCAB_SIZE=1536 XSA_LAST_N=4 ROPE_DIMS=16 LN_SCALE=1 VE_ENABLED=1 VE_DIM=128 VE_LAYERS="9,10" MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 MUON_MOMENTUM_WARMUP_STEPS=1500 MUON_WD=0.04 ADAM_WD=0.04 GRAD_CLIP_NORM=0.3 SWA_ENABLED=1 SWA_EVERY=50 LATE_QAT=1 LATE_QAT_THRESHOLD=0.15 TTT_ENABLED=0 CORE_START=4 CORE_END=7 NUM_PASSES=2 CORE_QUANT_ENABLED=0 $PYTHON train_gpt_recurrent.py --feedback-mode diagonal --feedback-rank 2 --clip-hidden --clip-value 15 diff --git a/full_run_v4.log b/full_run_v4.log new file mode 100644 index 0000000000..a2f96c2f17 --- /dev/null +++ b/full_run_v4.log @@ -0,0 +1,52 @@ +============================================================ + Full 1-GPU run: 
RecurrentSOTA + Learned Feedback + Wall clock: 80 minutes (4800s) + Seed: 1337 +============================================================ +logs/790a87ca-f5de-4379-9d91-dd10629858d5.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=3 core_end=8 num_passes=2 stem=3 core=5 tail=3 +model_params:26927198 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[7, 8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:4800.000 +seed:1337 +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/20000 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['19246.8', '24046.2', '29984.3', '37383.8', '46269.2', '10175.1', '13915.1', '18401.4', '23902.2', '30248.5'] growth=['1.252', '1.249', '1.247', '1.247', '1.238', '1.434', '1.368', '1.322', '1.299', '1.266'] +step:1/20000 train_loss:6.9310 train_time:2054ms step_avg:2053.89ms +step:2/20000 train_loss:8.5267 train_time:3988ms step_avg:1993.89ms +step:3/20000 train_loss:7.7478 train_time:5954ms step_avg:1984.59ms +step:4/20000 train_loss:7.0035 train_time:7920ms step_avg:1980.04ms +step:5/20000 train_loss:6.6920 train_time:9886ms step_avg:1977.24ms +step:6/20000 train_loss:6.5508 train_time:11852ms step_avg:1975.39ms +step:7/20000 train_loss:6.5037 train_time:13819ms step_avg:1974.09ms +step:8/20000 train_loss:6.4925 train_time:15785ms step_avg:1973.07ms +step:9/20000 train_loss:6.3444 train_time:17751ms step_avg:1972.30ms +step:10/20000 train_loss:6.1914 train_time:19717ms step_avg:1971.65ms +run_full_1gpu.sh: line 66: 194117 Killed PYTHONUNBUFFERED=1 TORCH_COMPILE_DISABLE=1 DATA_PATH="../../../data/datasets/fineweb10B_sp1024" TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model" SEED="${SEED}" ITERATIONS=20000 MAX_WALLCLOCK_SECONDS="${WALLCLOCK}" VAL_LOSS_EVERY=2000 TRAIN_LOG_EVERY=200 WARMUP_STEPS=20 WARMDOWN_ITERS=3500 TRAIN_BATCH_TOKENS=786432 TRAIN_SEQ_LEN=2048 EVAL_SEQ_LEN=2048 EVAL_STRIDE=64 NUM_LAYERS=11 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4 BIGRAM_VOCAB_SIZE=1536 XSA_LAST_N=4 ROPE_DIMS=16 LN_SCALE=1 VE_ENABLED=1 VE_DIM=128 VE_LAYERS="9,10" MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 MUON_MOMENTUM_WARMUP_STEPS=1500 MUON_WD=0.04 ADAM_WD=0.04 GRAD_CLIP_NORM=0.3 SWA_ENABLED=1 SWA_EVERY=50 LATE_QAT=1 LATE_QAT_THRESHOLD=0.15 TTT_ENABLED=0 CORE_START=3 CORE_END=8 NUM_PASSES=2 CORE_QUANT_ENABLED=0 $PYTHON train_gpt_recurrent.py --feedback-mode diagonal --feedback-rank 2 --clip-hidden --clip-value 15 --residual-scale-init 0.5 --jacobian-proxy-weight 0.01 diff --git a/full_run_v5.log b/full_run_v5.log new file mode 100644 index 
0000000000..a453ab417f --- /dev/null +++ b/full_run_v5.log @@ -0,0 +1,52 @@ +============================================================ + Full 1-GPU run: RecurrentSOTA + Learned Feedback + Wall clock: 80 minutes (4800s) + Seed: 1337 +============================================================ +logs/31addae0-d139-4e69-8091-9da2628c777f.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=3 core_end=8 num_passes=2 stem=3 core=5 tail=3 +model_params:26927198 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[7, 8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:4800.000 +seed:1337 +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/20000 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms +step:1/20000 train_loss:6.9310 train_time:777ms step_avg:776.55ms +step:2/20000 train_loss:8.5366 train_time:1538ms step_avg:768.76ms +step:3/20000 train_loss:7.6282 train_time:2352ms step_avg:783.93ms +step:4/20000 train_loss:7.3353 train_time:3168ms step_avg:791.92ms +step:5/20000 train_loss:7.1440 train_time:3978ms step_avg:795.69ms +step:6/20000 train_loss:7.0948 train_time:4793ms step_avg:798.81ms +step:7/20000 train_loss:7.0655 train_time:5600ms step_avg:800.02ms +step:8/20000 train_loss:6.9483 train_time:6411ms step_avg:801.39ms +step:9/20000 train_loss:6.5977 train_time:7230ms step_avg:803.35ms +step:10/20000 train_loss:6.2379 train_time:8049ms step_avg:804.87ms +run_full_1gpu.sh: line 65: 200321 Killed PYTHONUNBUFFERED=1 DATA_PATH="../../../data/datasets/fineweb10B_sp1024" TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model" SEED="${SEED}" ITERATIONS=20000 MAX_WALLCLOCK_SECONDS="${WALLCLOCK}" VAL_LOSS_EVERY=2000 TRAIN_LOG_EVERY=200 WARMUP_STEPS=20 WARMDOWN_ITERS=3500 TRAIN_BATCH_TOKENS=786432 TRAIN_SEQ_LEN=2048 EVAL_SEQ_LEN=2048 EVAL_STRIDE=64 NUM_LAYERS=11 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4 BIGRAM_VOCAB_SIZE=1536 XSA_LAST_N=4 ROPE_DIMS=16 LN_SCALE=1 VE_ENABLED=1 VE_DIM=128 VE_LAYERS="9,10" MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 MUON_MOMENTUM_WARMUP_STEPS=1500 MUON_WD=0.04 ADAM_WD=0.04 GRAD_CLIP_NORM=0.3 SWA_ENABLED=1 SWA_EVERY=50 LATE_QAT=1 LATE_QAT_THRESHOLD=0.15 TTT_ENABLED=0 CORE_START=3 CORE_END=8 NUM_PASSES=2 CORE_QUANT_ENABLED=0 $PYTHON train_gpt_recurrent.py --feedback-mode diagonal --feedback-rank 2 --clip-hidden --clip-value 15 --residual-scale-init 0.5 --jacobian-proxy-weight 0.01 diff --git a/grid_p2_j0.0.log b/grid_p2_j0.0.log new file mode 100644 index 0000000000..886cab282f --- /dev/null +++ b/grid_p2_j0.0.log @@ -0,0 +1,79 @@ 
+logs/7240d6aa-7907-488e-86cf-e35d13f2b540.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=3 core_end=8 num_passes=2 stem=3 core=5 tail=3 +model_params:26927198 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[7, 8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:50 warmup_steps:5 max_wallclock_seconds:900.000 +seed:1337 +wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY. +wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin +wandb: WARNING Using a boolean value for 'reinit' is deprecated. Use 'return_previous' or 'finish_previous' instead. +wandb: setting up run 1bh0d9xu +wandb: Tracking run with wandb version 0.25.1 +wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_104722-1bh0d9xu +wandb: Run `wandb offline` to turn off syncing. +wandb: Syncing run grid_p2_j0.0 +wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf +wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/1bh0d9xu +wandb:initialized +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['10183.4', '11332.7', '12572.5', '13910.0', '15222.2', '8143.5', '9246.5', '10422.9', '11669.4', '12875.7'] growth=['1.115', '1.113', '1.109', '1.106', '1.094', '1.148', '1.135', '1.127', '1.120', '1.103'] +step:1/50 train_loss:6.9310 train_time:1951ms step_avg:1951.04ms +step:2/50 train_loss:8.5267 train_time:3880ms step_avg:1940.13ms +step:3/50 train_loss:7.6283 train_time:5841ms step_avg:1947.13ms +step:4/50 train_loss:7.3204 train_time:7802ms step_avg:1950.55ms +step:5/50 train_loss:7.1281 train_time:9763ms step_avg:1952.51ms +step:6/50 train_loss:7.0824 train_time:11723ms step_avg:1953.92ms +step:7/50 train_loss:7.0689 train_time:13684ms step_avg:1954.91ms +step:8/50 train_loss:6.9485 train_time:15646ms step_avg:1955.69ms +step:9/50 train_loss:6.6014 train_time:17607ms step_avg:1956.32ms +step:10/50 train_loss:6.2455 train_time:19569ms step_avg:1956.88ms +step:20/50 train_loss:4.9608 train_time:39179ms step_avg:1958.94ms +step:25/50 val_loss:4.4416 val_bpb:2.6306 train_time:49019ms step_avg:1960.76ms h_norms=['14908.8', '17039.5', '19688.6', '22908.3', '26782.7', '9197.9', '11626.4', '14426.1', '17733.4', '21464.3'] growth=['1.123', '1.143', '1.155', '1.164', '1.169', '1.296', '1.264', '1.241', '1.229', '1.210'] +step:30/50 train_loss:4.2882 train_time:58795ms step_avg:1959.83ms +step:40/50 train_loss:3.9513 train_time:78416ms step_avg:1960.41ms +step:50/50 train_loss:3.7675 train_time:98151ms step_avg:1963.03ms +step:50/50 val_loss:3.7271 val_bpb:2.2074 train_time:98185ms step_avg:1963.71ms h_norms=['24528.6', '27553.7', '31725.9', '37229.2', '44851.6', '10551.0', '14775.8', '19896.1', '26232.7', '33500.8'] growth=['1.086', 
'1.123', '1.151', '1.173', '1.205', '1.487', '1.400', '1.347', '1.318', '1.277'] +peak memory allocated: 42589 MiB reserved: 43756 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:5.9153 val_bpb:3.5034 eval_time:53981ms +Serialized model: 106023671 bytes +Code size: 98507 bytes +Serialized model int6+lzma: 4812040 bytes +Total submission size int6+lzma: 4910547 bytes +final_int6_roundtrip val_loss:6.1184 val_bpb:3.6236 eval_time:53666ms +final_int6_roundtrip_exact val_loss:6.11836447 val_bpb:3.62364007 +wandb: updating run metadata +wandb: uploading config.yaml +wandb: uploading history steps 15-15, summary, console lines 30-31 +wandb: +wandb: Run history: +wandb: lr_scale ▁▁▁▁▁▁▁▁▁▁▁▁▁▁ +wandb: step_avg_ms ▄▁▃▄▅▅▆▆▆▆▇▇▇█ +wandb: train_loss ▆█▇▆▆▆▆▆▅▅▃▂▁▁ +wandb: val_bpb █▃▁ +wandb: val_loss █▃▁ +wandb: +wandb: Run summary: +wandb: lr_scale 1 +wandb: step_avg_ms 1963.02704 +wandb: train_loss 3.76754 +wandb: val_bpb 2.20741 +wandb: val_loss 3.72711 +wandb: +wandb: 🚀 View run grid_p2_j0.0 at: https://wandb.ai/propensity/parameter-golf/runs/1bh0d9xu +wandb: ⭐️ View project at: https://wandb.ai/propensity/parameter-golf +wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s) +wandb: Find logs at: ./wandb/run-20260326_104722-1bh0d9xu/logs diff --git a/grid_p2_j0.001.log b/grid_p2_j0.001.log new file mode 100644 index 0000000000..2d2de843cc --- /dev/null +++ b/grid_p2_j0.001.log @@ -0,0 +1,79 @@ +logs/a191e574-1183-450d-b8ae-d69ae7e83370.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=3 core_end=8 num_passes=2 stem=3 core=5 tail=3 +model_params:26927198 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[7, 8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:50 warmup_steps:5 max_wallclock_seconds:900.000 +seed:1337 +wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY. +wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin +wandb: WARNING Using a boolean value for 'reinit' is deprecated. Use 'return_previous' or 'finish_previous' instead. +wandb: setting up run dd5xlg1l +wandb: Tracking run with wandb version 0.25.1 +wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_105405-dd5xlg1l +wandb: Run `wandb offline` to turn off syncing. 
+wandb: Syncing run grid_p2_j0.001 +wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf +wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/dd5xlg1l +wandb:initialized +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['10183.3', '11332.9', '12572.4', '13909.9', '15221.9', '8143.5', '9246.2', '10422.5', '11669.1', '12875.2'] growth=['1.115', '1.113', '1.109', '1.106', '1.094', '1.148', '1.135', '1.127', '1.120', '1.103'] +step:1/50 train_loss:6.9310 train_time:1969ms step_avg:1968.99ms +step:2/50 train_loss:8.5267 train_time:3913ms step_avg:1956.75ms +step:3/50 train_loss:7.6283 train_time:5891ms step_avg:1963.51ms +step:4/50 train_loss:7.3204 train_time:7867ms step_avg:1966.84ms +step:5/50 train_loss:7.1282 train_time:9845ms step_avg:1969.03ms +step:6/50 train_loss:7.0824 train_time:11823ms step_avg:1970.45ms +step:7/50 train_loss:7.0693 train_time:13800ms step_avg:1971.43ms +step:8/50 train_loss:6.9483 train_time:15777ms step_avg:1972.14ms +step:9/50 train_loss:6.6021 train_time:17755ms step_avg:1972.76ms +step:10/50 train_loss:6.2458 train_time:19732ms step_avg:1973.24ms +step:20/50 train_loss:4.9604 train_time:39495ms step_avg:1974.76ms +step:25/50 val_loss:4.4376 val_bpb:2.6282 train_time:49413ms step_avg:1976.52ms h_norms=['14875.2', '16993.9', '19634.3', '22841.7', '26706.4', '9197.5', '11608.3', '14391.8', '17687.0', '21405.1'] growth=['1.123', '1.142', '1.155', '1.163', '1.169', '1.296', '1.262', '1.240', '1.229', '1.210'] +step:30/50 train_loss:4.2700 train_time:59266ms step_avg:1975.53ms +step:40/50 train_loss:3.9376 train_time:79041ms step_avg:1976.01ms +step:50/50 train_loss:3.7755 train_time:98947ms step_avg:1978.94ms +step:50/50 val_loss:3.7384 val_bpb:2.2141 train_time:98981ms step_avg:1979.62ms h_norms=['24517.2', '27557.3', '31658.9', '36931.0', '44051.8', '10467.3', '14516.5', '19325.2', '25155.9', '31522.6'] growth=['1.087', '1.124', '1.149', '1.167', '1.193', '1.475', '1.387', '1.331', '1.302', '1.253'] +peak memory allocated: 42782 MiB reserved: 44140 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:5.9157 val_bpb:3.5036 eval_time:54130ms +Serialized model: 106023671 bytes +Code size: 98507 bytes +Serialized model int6+lzma: 4808692 bytes +Total submission size int6+lzma: 4907199 bytes +final_int6_roundtrip val_loss:6.1186 val_bpb:3.6238 eval_time:53816ms +final_int6_roundtrip_exact val_loss:6.11859679 val_bpb:3.62377766 +wandb: updating run metadata +wandb: uploading output.log +wandb: uploading history steps 15-15, summary, console lines 30-31 +wandb: +wandb: Run history: +wandb: lr_scale ▁▁▁▁▁▁▁▁▁▁▁▁▁▁ +wandb: step_avg_ms ▅▁▃▄▅▅▆▆▆▆▇▇▇█ +wandb: train_loss ▆█▇▆▆▆▆▆▅▅▃▂▁▁ +wandb: val_bpb █▃▁ +wandb: val_loss █▃▁ +wandb: +wandb: Run summary: +wandb: lr_scale 1 +wandb: step_avg_ms 1978.93598 +wandb: train_loss 3.77553 +wandb: val_bpb 2.2141 +wandb: val_loss 3.73842 +wandb: +wandb: 🚀 View run grid_p2_j0.001 at: https://wandb.ai/propensity/parameter-golf/runs/dd5xlg1l +wandb: ⭐️ View project at: https://wandb.ai/propensity/parameter-golf +wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s) +wandb: Find logs at: ./wandb/run-20260326_105405-dd5xlg1l/logs diff --git a/grid_p2_j0.01.log b/grid_p2_j0.01.log new file mode 100644 index 0000000000..3f3ce1f447 --- /dev/null +++ b/grid_p2_j0.01.log @@ -0,0 +1,79 @@ +logs/68fe46e3-3ad3-47cc-9e37-31cc9f108790.txt +val_bpb:enabled 
tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=3 core_end=8 num_passes=2 stem=3 core=5 tail=3 +model_params:26927198 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[7, 8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:50 warmup_steps:5 max_wallclock_seconds:900.000 +seed:1337 +wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY. +wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin +wandb: WARNING Using a boolean value for 'reinit' is deprecated. Use 'return_previous' or 'finish_previous' instead. +wandb: setting up run pmxdy841 +wandb: Tracking run with wandb version 0.25.1 +wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110050-pmxdy841 +wandb: Run `wandb offline` to turn off syncing. +wandb: Syncing run grid_p2_j0.01 +wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf +wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/pmxdy841 +wandb:initialized +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['10183.7', '11333.2', '12572.7', '13910.3', '15222.4', '8143.9', '9246.6', '10423.1', '11669.9', '12876.1'] growth=['1.115', '1.113', '1.109', '1.106', '1.094', '1.148', '1.135', '1.127', '1.120', '1.103'] +step:1/50 train_loss:6.9310 train_time:1965ms step_avg:1964.66ms +step:2/50 train_loss:8.5267 train_time:3905ms step_avg:1952.55ms +step:3/50 train_loss:7.6283 train_time:5878ms step_avg:1959.29ms +step:4/50 train_loss:7.3204 train_time:7850ms step_avg:1962.54ms +step:5/50 train_loss:7.1282 train_time:9822ms step_avg:1964.43ms +step:6/50 train_loss:7.0825 train_time:11795ms step_avg:1965.78ms +step:7/50 train_loss:7.0698 train_time:13768ms step_avg:1966.81ms +step:8/50 train_loss:6.9490 train_time:15741ms step_avg:1967.65ms +step:9/50 train_loss:6.6024 train_time:17715ms step_avg:1968.31ms +step:10/50 train_loss:6.2457 train_time:19688ms step_avg:1968.83ms +step:20/50 train_loss:4.9609 train_time:39413ms step_avg:1970.66ms +step:25/50 val_loss:4.4376 val_bpb:2.6282 train_time:49313ms step_avg:1972.51ms h_norms=['14907.9', '17014.7', '19640.8', '22830.5', '26674.6', '9180.5', '11569.7', '14334.2', '17600.8', '21284.5'] growth=['1.122', '1.141', '1.154', '1.162', '1.168', '1.294', '1.260', '1.239', '1.228', '1.209'] +step:30/50 train_loss:4.2769 train_time:59149ms step_avg:1971.62ms +step:40/50 train_loss:3.9519 train_time:78889ms step_avg:1972.21ms +step:50/50 train_loss:3.7667 train_time:98755ms step_avg:1975.11ms +step:50/50 val_loss:3.7317 val_bpb:2.2101 train_time:98790ms step_avg:1975.79ms h_norms=['24105.4', '27189.2', '31353.5', '36701.3', '43829.7', '10500.5', '14546.0', '19432.5', '25342.5', '31770.4'] growth=['1.093', '1.128', '1.153', '1.171', '1.194', '1.480', '1.385', '1.336', 
'1.304', '1.254'] +peak memory allocated: 42782 MiB reserved: 44140 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:5.9158 val_bpb:3.5037 eval_time:54005ms +Serialized model: 106023671 bytes +Code size: 98507 bytes +Serialized model int6+lzma: 4808548 bytes +Total submission size int6+lzma: 4907055 bytes +final_int6_roundtrip val_loss:6.1173 val_bpb:3.6230 eval_time:53697ms +final_int6_roundtrip_exact val_loss:6.11731613 val_bpb:3.62301918 +wandb: updating run metadata +wandb: uploading wandb-summary.json; uploading config.yaml; uploading output.log +wandb: uploading wandb-summary.json +wandb: +wandb: Run history: +wandb: lr_scale ▁▁▁▁▁▁▁▁▁▁▁▁▁▁ +wandb: step_avg_ms ▅▁▃▄▅▅▅▆▆▆▇▇▇█ +wandb: train_loss ▆█▇▆▆▆▆▆▅▅▃▂▁▁ +wandb: val_bpb █▃▁ +wandb: val_loss █▃▁ +wandb: +wandb: Run summary: +wandb: lr_scale 1 +wandb: step_avg_ms 1975.10694 +wandb: train_loss 3.76669 +wandb: val_bpb 2.21015 +wandb: val_loss 3.73174 +wandb: +wandb: 🚀 View run grid_p2_j0.01 at: https://wandb.ai/propensity/parameter-golf/runs/pmxdy841 +wandb: ⭐️ View project at: https://wandb.ai/propensity/parameter-golf +wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s) +wandb: Find logs at: ./wandb/run-20260326_110050-pmxdy841/logs diff --git a/grid_p2_j0.1.log b/grid_p2_j0.1.log new file mode 100644 index 0000000000..8e98f9ee24 --- /dev/null +++ b/grid_p2_j0.1.log @@ -0,0 +1,78 @@ +logs/350c3fce-978c-404e-bce5-7560d5270ab8.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=3 core_end=8 num_passes=2 stem=3 core=5 tail=3 +model_params:26927198 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[7, 8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:50 warmup_steps:5 max_wallclock_seconds:900.000 +seed:1337 +wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY. +wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin +wandb: WARNING Using a boolean value for 'reinit' is deprecated. Use 'return_previous' or 'finish_previous' instead. +wandb: setting up run wq77le9z +wandb: Tracking run with wandb version 0.25.1 +wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110735-wq77le9z +wandb: Run `wandb offline` to turn off syncing. 
+wandb: Syncing run grid_p2_j0.1 +wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf +wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/wq77le9z +wandb:initialized +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['10183.5', '11332.8', '12572.3', '13909.9', '15221.9', '8143.4', '9246.3', '10422.6', '11669.3', '12875.4'] growth=['1.115', '1.113', '1.109', '1.106', '1.094', '1.148', '1.135', '1.127', '1.120', '1.103'] +step:1/50 train_loss:6.9310 train_time:1964ms step_avg:1963.98ms +step:2/50 train_loss:8.5267 train_time:3904ms step_avg:1952.08ms +step:3/50 train_loss:7.6283 train_time:5877ms step_avg:1958.95ms +step:4/50 train_loss:7.3203 train_time:7849ms step_avg:1962.28ms +step:5/50 train_loss:7.1281 train_time:9821ms step_avg:1964.23ms +step:6/50 train_loss:7.0823 train_time:11793ms step_avg:1965.52ms +step:7/50 train_loss:7.0690 train_time:13766ms step_avg:1966.50ms +step:8/50 train_loss:6.9488 train_time:15738ms step_avg:1967.24ms +step:9/50 train_loss:6.6003 train_time:17711ms step_avg:1967.83ms +step:10/50 train_loss:6.2452 train_time:19683ms step_avg:1968.32ms +step:20/50 train_loss:4.9631 train_time:39401ms step_avg:1970.07ms +step:25/50 val_loss:4.4339 val_bpb:2.6260 train_time:49298ms step_avg:1971.91ms h_norms=['14892.2', '17012.7', '19653.8', '22861.1', '26720.5', '9181.8', '11585.2', '14367.3', '17660.0', '21364.8'] growth=['1.122', '1.142', '1.155', '1.163', '1.169', '1.294', '1.262', '1.240', '1.229', '1.210'] +step:30/50 train_loss:4.2683 train_time:59129ms step_avg:1970.98ms +step:40/50 train_loss:3.9478 train_time:78860ms step_avg:1971.49ms +step:50/50 train_loss:3.7715 train_time:98733ms step_avg:1974.66ms +step:50/50 val_loss:3.7726 val_bpb:2.2343 train_time:98767ms step_avg:1975.34ms h_norms=['23953.3', '26694.0', '30396.2', '35111.3', '41668.7', '10222.5', '13936.6', '18323.8', '23608.5', '29594.6'] growth=['1.079', '1.114', '1.139', '1.155', '1.187', '1.441', '1.363', '1.315', '1.288', '1.254'] +peak memory allocated: 42782 MiB reserved: 44140 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:5.9155 val_bpb:3.5035 eval_time:53980ms +Serialized model: 106023671 bytes +Code size: 98507 bytes +Serialized model int6+lzma: 4811364 bytes +Total submission size int6+lzma: 4909871 bytes +final_int6_roundtrip val_loss:6.1183 val_bpb:3.6236 eval_time:53671ms +final_int6_roundtrip_exact val_loss:6.11825506 val_bpb:3.62357527 +wandb: updating run metadata +wandb: uploading history steps 15-15, summary, console lines 30-31 +wandb: +wandb: Run history: +wandb: lr_scale ▁▁▁▁▁▁▁▁▁▁▁▁▁▁ +wandb: step_avg_ms ▅▁▃▄▅▅▅▆▆▆▇▇▇█ +wandb: train_loss ▆█▇▆▆▆▆▆▅▅▃▂▁▁ +wandb: val_bpb █▂▁ +wandb: val_loss █▂▁ +wandb: +wandb: Run summary: +wandb: lr_scale 1 +wandb: step_avg_ms 1974.65617 +wandb: train_loss 3.7715 +wandb: val_bpb 2.23433 +wandb: val_loss 3.77257 +wandb: +wandb: 🚀 View run grid_p2_j0.1 at: https://wandb.ai/propensity/parameter-golf/runs/wq77le9z +wandb: ⭐️ View project at: https://wandb.ai/propensity/parameter-golf +wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s) +wandb: Find logs at: ./wandb/run-20260326_110735-wq77le9z/logs diff --git a/grid_p3_j0.0.log b/grid_p3_j0.0.log new file mode 100644 index 0000000000..65b432394f --- /dev/null +++ b/grid_p3_j0.0.log @@ -0,0 +1,78 @@ +logs/f2f1b5c3-e1b0-4040-9fa5-3121c4ca0927.txt +val_bpb:enabled tokenizer_kind=sentencepiece 
tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=3 core_end=8 num_passes=3 stem=3 core=5 tail=3 +model_params:26927199 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[7, 8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:50 warmup_steps:5 max_wallclock_seconds:900.000 +seed:1337 +wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY. +wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin +wandb: WARNING Using a boolean value for 'reinit' is deprecated. Use 'return_previous' or 'finish_previous' instead. +wandb: setting up run rr366tug +wandb: Tracking run with wandb version 0.25.1 +wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_111419-rr366tug +wandb: Run `wandb offline` to turn off syncing. +wandb: Syncing run grid_p3_j0.0 +wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf +wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/rr366tug +wandb:initialized +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms h_norms=['10017.6', '11136.0', '12344.2', '13648.8', '14930.5', '8131.3', '9221.1', '10380.0', '11610.4', '12803.6', '8145.6', '9247.8', '10417.9', '11656.5', '12858.6'] growth=['1.114', '1.112', '1.108', '1.106', '1.094', '1.146', '1.134', '1.126', '1.119', '1.103', '1.148', '1.135', '1.127', '1.119', '1.103'] +step:1/50 train_loss:6.9310 train_time:2550ms step_avg:2549.53ms +step:2/50 train_loss:8.4366 train_time:5080ms step_avg:2540.20ms +step:3/50 train_loss:7.5697 train_time:7644ms step_avg:2547.96ms +step:4/50 train_loss:7.3743 train_time:10207ms step_avg:2551.71ms +step:5/50 train_loss:7.1786 train_time:12769ms step_avg:2553.85ms +step:6/50 train_loss:7.0957 train_time:15332ms step_avg:2555.30ms +step:7/50 train_loss:7.1042 train_time:17895ms step_avg:2556.47ms +step:8/50 train_loss:6.9979 train_time:20459ms step_avg:2557.34ms +step:9/50 train_loss:6.6202 train_time:23024ms step_avg:2558.19ms +step:10/50 train_loss:6.2350 train_time:25588ms step_avg:2558.77ms +step:20/50 train_loss:4.9204 train_time:51213ms step_avg:2560.66ms +step:25/50 val_loss:4.4026 val_bpb:2.6074 train_time:64061ms step_avg:2562.45ms h_norms=['15315.5', '18476.1', '22903.1', '28959.2', '36915.4', '10165.2', '14051.6', '19152.7', '25903.6', '34591.3', '10305.9', '14457.4', '19935.1', '27281.8', '36795.6'] growth=['1.162', '1.206', '1.240', '1.264', '1.275', '1.433', '1.382', '1.363', '1.352', '1.335', '1.453', '1.403', '1.379', '1.369', '1.349'] +step:30/50 train_loss:4.2536 train_time:76845ms step_avg:2561.51ms +step:40/50 train_loss:3.9338 train_time:102476ms step_avg:2561.91ms +step:50/50 train_loss:3.7603 train_time:128239ms step_avg:2564.77ms +step:50/50 val_loss:3.7159 val_bpb:2.2007 train_time:128273ms step_avg:2565.45ms h_norms=['25099.4', 
'31959.1', '42336.1', '57728.8', '79426.7', '13463.9', '21452.0', '32001.9', '46436.8', '66208.5', '13616.6', '21790.7', '32681.5', '47611.1', '68100.6'] growth=['1.181', '1.273', '1.325', '1.364', '1.376', '1.898', '1.593', '1.492', '1.451', '1.426', '1.919', '1.600', '1.500', '1.457', '1.430'] +peak memory allocated: 54994 MiB reserved: 56152 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:5.9270 val_bpb:3.5103 eval_time:70209ms +Serialized model: 106023671 bytes +Code size: 98507 bytes +Serialized model int6+lzma: 4807120 bytes +Total submission size int6+lzma: 4905627 bytes +final_int6_roundtrip val_loss:6.1461 val_bpb:3.6401 eval_time:69816ms +final_int6_roundtrip_exact val_loss:6.14613800 val_bpb:3.64008912 +wandb: uploading data; updating run metadata +wandb: uploading history steps 15-15, summary, console lines 30-31 +wandb: +wandb: Run history: +wandb: lr_scale ▁▁▁▁▁▁▁▁▁▁▁▁▁▁ +wandb: step_avg_ms ▄▁▃▄▅▅▆▆▆▆▇▇▇█ +wandb: train_loss ▆█▇▆▆▆▆▆▅▅▃▂▁▁ +wandb: val_bpb █▂▁ +wandb: val_loss █▂▁ +wandb: +wandb: Run summary: +wandb: lr_scale 1 +wandb: step_avg_ms 2564.77223 +wandb: train_loss 3.76027 +wandb: val_bpb 2.20074 +wandb: val_loss 3.71586 +wandb: +wandb: 🚀 View run grid_p3_j0.0 at: https://wandb.ai/propensity/parameter-golf/runs/rr366tug +wandb: ⭐️ View project at: https://wandb.ai/propensity/parameter-golf +wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s) +wandb: Find logs at: ./wandb/run-20260326_111419-rr366tug/logs diff --git a/grid_p3_j0.001.log b/grid_p3_j0.001.log new file mode 100644 index 0000000000..ffdae827c1 --- /dev/null +++ b/grid_p3_j0.001.log @@ -0,0 +1,79 @@ +logs/673c9200-3963-482d-b261-67c7bf418c86.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=3 core_end=8 num_passes=3 stem=3 core=5 tail=3 +model_params:26927199 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[7, 8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:50 warmup_steps:5 max_wallclock_seconds:900.000 +seed:1337 +wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY. +wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin +wandb: WARNING Using a boolean value for 'reinit' is deprecated. Use 'return_previous' or 'finish_previous' instead. +wandb: setting up run w5b84094 +wandb: Tracking run with wandb version 0.25.1 +wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_112256-w5b84094 +wandb: Run `wandb offline` to turn off syncing. 
+wandb: Syncing run grid_p3_j0.001 +wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf +wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/w5b84094 +wandb:initialized +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms h_norms=['10017.3', '11135.8', '12344.1', '13648.4', '14930.2', '8131.2', '9221.3', '10379.9', '11610.4', '12803.6', '8145.3', '9247.4', '10417.4', '11656.1', '12858.2'] growth=['1.114', '1.112', '1.109', '1.106', '1.094', '1.146', '1.134', '1.126', '1.119', '1.103', '1.148', '1.135', '1.127', '1.119', '1.103'] +step:1/50 train_loss:6.9310 train_time:2562ms step_avg:2562.16ms +step:2/50 train_loss:8.4366 train_time:5105ms step_avg:2552.70ms +step:3/50 train_loss:7.5697 train_time:7680ms step_avg:2560.17ms +step:4/50 train_loss:7.3744 train_time:10255ms step_avg:2563.83ms +step:5/50 train_loss:7.1786 train_time:12830ms step_avg:2566.10ms +step:6/50 train_loss:7.0956 train_time:15406ms step_avg:2567.62ms +step:7/50 train_loss:7.1032 train_time:17983ms step_avg:2569.03ms +step:8/50 train_loss:6.9995 train_time:20559ms step_avg:2569.91ms +step:9/50 train_loss:6.6208 train_time:23136ms step_avg:2570.65ms +step:10/50 train_loss:6.2351 train_time:25713ms step_avg:2571.25ms +step:20/50 train_loss:4.9225 train_time:51465ms step_avg:2573.26ms +step:25/50 val_loss:4.4045 val_bpb:2.6086 train_time:64376ms step_avg:2575.05ms h_norms=['15389.1', '18568.1', '22944.3', '28900.3', '36680.3', '10100.2', '13942.7', '18871.1', '25382.8', '33689.8', '10227.6', '14320.0', '19633.2', '26714.6', '35810.5'] growth=['1.162', '1.207', '1.236', '1.260', '1.269', '1.424', '1.380', '1.353', '1.345', '1.327', '1.442', '1.400', '1.371', '1.361', '1.340'] +step:30/50 train_loss:4.2545 train_time:77216ms step_avg:2573.88ms +step:40/50 train_loss:3.9681 train_time:102972ms step_avg:2574.30ms +step:50/50 train_loss:3.7646 train_time:128885ms step_avg:2577.70ms +step:50/50 val_loss:3.7386 val_bpb:2.2142 train_time:128919ms step_avg:2578.39ms h_norms=['25102.1', '31613.3', '41339.7', '55618.3', '75871.6', '12896.3', '20216.5', '29704.2', '42853.9', '60665.7', '13066.2', '20603.3', '30423.4', '44103.2', '62670.9'] growth=['1.172', '1.259', '1.308', '1.345', '1.364', '1.818', '1.568', '1.469', '1.443', '1.416', '1.842', '1.577', '1.477', '1.450', '1.421'] +peak memory allocated: 55186 MiB reserved: 56536 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:5.9267 val_bpb:3.5101 eval_time:70241ms +Serialized model: 106023671 bytes +Code size: 98507 bytes +Serialized model int6+lzma: 4808880 bytes +Total submission size int6+lzma: 4907387 bytes +final_int6_roundtrip val_loss:6.1461 val_bpb:3.6400 eval_time:69823ms +final_int6_roundtrip_exact val_loss:6.14605869 val_bpb:3.64004215 +wandb: updating run metadata +wandb: uploading output.log; uploading wandb-summary.json; uploading config.yaml +wandb: uploading history steps 15-15, summary, console lines 30-31 +wandb: +wandb: Run history: +wandb: lr_scale ▁▁▁▁▁▁▁▁▁▁▁▁▁▁ +wandb: step_avg_ms ▄▁▃▄▅▅▆▆▆▆▇▇▇█ +wandb: train_loss ▆█▇▆▆▆▆▆▅▅▃▂▁▁ +wandb: val_bpb █▂▁ +wandb: val_loss █▂▁ +wandb: +wandb: Run summary: +wandb: lr_scale 1 +wandb: step_avg_ms 2577.70402 +wandb: train_loss 3.76461 +wandb: val_bpb 2.21422 +wandb: val_loss 3.73861 +wandb: +wandb: 🚀 View run grid_p3_j0.001 at: https://wandb.ai/propensity/parameter-golf/runs/w5b84094 +wandb: ⭐️ View project at: https://wandb.ai/propensity/parameter-golf +wandb: Synced 5 W&B 
file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s) +wandb: Find logs at: ./wandb/run-20260326_112256-w5b84094/logs diff --git a/grid_p3_j0.01.log b/grid_p3_j0.01.log new file mode 100644 index 0000000000..bd452ac00e --- /dev/null +++ b/grid_p3_j0.01.log @@ -0,0 +1,78 @@ +logs/828fff46-1179-4ed4-9ca9-bad2f1998032.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=3 core_end=8 num_passes=3 stem=3 core=5 tail=3 +model_params:26927199 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[7, 8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:50 warmup_steps:5 max_wallclock_seconds:900.000 +seed:1337 +wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY. +wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin +wandb: WARNING Using a boolean value for 'reinit' is deprecated. Use 'return_previous' or 'finish_previous' instead. +wandb: setting up run l86ibk0l +wandb: Tracking run with wandb version 0.25.1 +wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_113135-l86ibk0l +wandb: Run `wandb offline` to turn off syncing. 
+wandb: Syncing run grid_p3_j0.01 +wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf +wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/l86ibk0l +wandb:initialized +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms h_norms=['10017.8', '11136.4', '12344.8', '13649.3', '14931.1', '8131.1', '9221.2', '10380.0', '11610.7', '12803.9', '8145.5', '9247.7', '10417.7', '11656.5', '12858.6'] growth=['1.114', '1.112', '1.109', '1.106', '1.094', '1.146', '1.134', '1.126', '1.119', '1.103', '1.148', '1.135', '1.127', '1.119', '1.103'] +step:1/50 train_loss:6.9310 train_time:2558ms step_avg:2558.48ms +step:2/50 train_loss:8.4366 train_time:5098ms step_avg:2548.90ms +step:3/50 train_loss:7.5697 train_time:7670ms step_avg:2556.78ms +step:4/50 train_loss:7.3743 train_time:10241ms step_avg:2560.27ms +step:5/50 train_loss:7.1786 train_time:12811ms step_avg:2562.30ms +step:6/50 train_loss:7.0957 train_time:15383ms step_avg:2563.78ms +step:7/50 train_loss:7.1044 train_time:17953ms step_avg:2564.76ms +step:8/50 train_loss:6.9980 train_time:20525ms step_avg:2565.64ms +step:9/50 train_loss:6.6201 train_time:23097ms step_avg:2566.38ms +step:10/50 train_loss:6.2350 train_time:25670ms step_avg:2567.05ms +step:20/50 train_loss:4.9193 train_time:51381ms step_avg:2569.05ms +step:25/50 val_loss:4.4049 val_bpb:2.6088 train_time:64272ms step_avg:2570.88ms h_norms=['15720.0', '18809.2', '23004.6', '28594.8', '35769.8', '9791.7', '13137.0', '17413.1', '22947.7', '29845.4', '9846.4', '13323.4', '17863.5', '23724.8', '31147.9'] growth=['1.160', '1.197', '1.223', '1.243', '1.251', '1.380', '1.342', '1.325', '1.318', '1.301', '1.388', '1.353', '1.341', '1.328', '1.313'] +step:30/50 train_loss:4.2516 train_time:77099ms step_avg:2569.96ms +step:40/50 train_loss:3.9318 train_time:102821ms step_avg:2570.53ms +step:50/50 train_loss:3.7709 train_time:128698ms step_avg:2573.96ms +step:50/50 val_loss:3.7547 val_bpb:2.2238 train_time:128732ms step_avg:2574.64ms h_norms=['26472.4', '32406.0', '41016.4', '53232.8', '69910.2', '11501.1', '17001.4', '24006.2', '33299.6', '45175.2', '11702.2', '17433.6', '24741.2', '34474.7', '46973.8'] growth=['1.156', '1.224', '1.266', '1.298', '1.313', '1.621', '1.478', '1.412', '1.387', '1.357', '1.649', '1.490', '1.419', '1.393', '1.363'] +peak memory allocated: 55186 MiB reserved: 56536 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:5.9290 val_bpb:3.5115 eval_time:70088ms +Serialized model: 106023671 bytes +Code size: 98507 bytes +Serialized model int6+lzma: 4808852 bytes +Total submission size int6+lzma: 4907359 bytes +final_int6_roundtrip val_loss:6.1489 val_bpb:3.6417 eval_time:69675ms +final_int6_roundtrip_exact val_loss:6.14887885 val_bpb:3.64171241 +wandb: updating run metadata +wandb: uploading history steps 15-15, summary, console lines 30-31 +wandb: +wandb: Run history: +wandb: lr_scale ▁▁▁▁▁▁▁▁▁▁▁▁▁▁ +wandb: step_avg_ms ▄▁▃▄▅▅▅▆▆▆▇▇▇█ +wandb: train_loss ▆█▇▆▆▆▆▆▅▅▃▂▁▁ +wandb: val_bpb █▂▁ +wandb: val_loss █▂▁ +wandb: +wandb: Run summary: +wandb: lr_scale 1 +wandb: step_avg_ms 2573.96071 +wandb: train_loss 3.7709 +wandb: val_bpb 2.22377 +wandb: val_loss 3.75475 +wandb: +wandb: 🚀 View run grid_p3_j0.01 at: https://wandb.ai/propensity/parameter-golf/runs/l86ibk0l +wandb: ⭐️ View project at: https://wandb.ai/propensity/parameter-golf +wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s) +wandb: Find logs at: 
./wandb/run-20260326_113135-l86ibk0l/logs diff --git a/grid_p3_j0.1.log b/grid_p3_j0.1.log new file mode 100644 index 0000000000..1e94757485 --- /dev/null +++ b/grid_p3_j0.1.log @@ -0,0 +1,78 @@ +logs/25035571-2602-4fdf-8873-7b085e526dcb.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=3 core_end=8 num_passes=3 stem=3 core=5 tail=3 +model_params:26927199 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[7, 8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:50 warmup_steps:5 max_wallclock_seconds:900.000 +seed:1337 +wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY. +wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin +wandb: WARNING Using a boolean value for 'reinit' is deprecated. Use 'return_previous' or 'finish_previous' instead. +wandb: setting up run n43r2rb3 +wandb: Tracking run with wandb version 0.25.1 +wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114011-n43r2rb3 +wandb: Run `wandb offline` to turn off syncing. +wandb: Syncing run grid_p3_j0.1 +wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf +wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/n43r2rb3 +wandb:initialized +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['10016.7', '11135.2', '12343.3', '13647.7', '14929.4', '8130.4', '9220.7', '10379.4', '11609.9', '12803.1', '8144.8', '9247.1', '10417.0', '11655.6', '12857.6'] growth=['1.114', '1.112', '1.108', '1.106', '1.094', '1.146', '1.134', '1.126', '1.119', '1.103', '1.148', '1.135', '1.127', '1.119', '1.103'] +step:1/50 train_loss:6.9310 train_time:2558ms step_avg:2558.36ms +step:2/50 train_loss:8.4366 train_time:5098ms step_avg:2548.82ms +step:3/50 train_loss:7.5697 train_time:7669ms step_avg:2556.49ms +step:4/50 train_loss:7.3744 train_time:10241ms step_avg:2560.13ms +step:5/50 train_loss:7.1785 train_time:12812ms step_avg:2562.33ms +step:6/50 train_loss:7.0958 train_time:15382ms step_avg:2563.72ms +step:7/50 train_loss:7.1043 train_time:17954ms step_avg:2564.81ms +step:8/50 train_loss:6.9978 train_time:20526ms step_avg:2565.71ms +step:9/50 train_loss:6.6203 train_time:23098ms step_avg:2566.44ms +step:10/50 train_loss:6.2351 train_time:25670ms step_avg:2566.98ms +step:20/50 train_loss:4.9173 train_time:51378ms step_avg:2568.92ms +step:25/50 val_loss:4.4081 val_bpb:2.6107 train_time:64271ms step_avg:2570.86ms h_norms=['15784.7', '18418.3', '21808.6', '26014.4', '31082.9', '9350.4', '11977.4', '15124.3', '18884.1', '23160.1', '9355.0', '12040.7', '15261.1', '19113.8', '23523.2'] growth=['1.142', '1.167', '1.184', '1.193', '1.195', '1.318', '1.281', '1.263', '1.249', '1.226', '1.319', '1.287', '1.267', '1.252', '1.231'] +step:30/50 train_loss:4.2496 
train_time:77100ms step_avg:2569.99ms +step:40/50 train_loss:3.9324 train_time:102820ms step_avg:2570.49ms +step:50/50 train_loss:3.7771 train_time:128694ms step_avg:2573.88ms +step:50/50 val_loss:3.7294 val_bpb:2.2088 train_time:128728ms step_avg:2574.56ms h_norms=['25140.9', '28962.0', '34040.5', '40736.8', '49246.7', '9966.3', '13653.2', '17956.7', '23197.5', '28845.5', '9949.3', '13730.0', '18206.3', '23701.7', '29420.4'] growth=['1.103', '1.152', '1.175', '1.197', '1.209', '1.405', '1.370', '1.315', '1.292', '1.243', '1.402', '1.380', '1.326', '1.302', '1.241'] +peak memory allocated: 55186 MiB reserved: 56536 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:5.9285 val_bpb:3.5112 eval_time:70073ms +Serialized model: 106023671 bytes +Code size: 98507 bytes +Serialized model int6+lzma: 4808068 bytes +Total submission size int6+lzma: 4906575 bytes +final_int6_roundtrip val_loss:6.1465 val_bpb:3.6403 eval_time:69664ms +final_int6_roundtrip_exact val_loss:6.14650992 val_bpb:3.64030939 +wandb: uploading data; updating run metadata +wandb: uploading output.log; uploading wandb-summary.json; uploading config.yaml +wandb: +wandb: Run history: +wandb: lr_scale ▁▁▁▁▁▁▁▁▁▁▁▁▁▁ +wandb: step_avg_ms ▄▁▃▄▅▅▅▆▆▆▇▇▇█ +wandb: train_loss ▆█▇▆▆▆▆▆▅▅▃▂▁▁ +wandb: val_bpb █▂▁ +wandb: val_loss █▂▁ +wandb: +wandb: Run summary: +wandb: lr_scale 1 +wandb: step_avg_ms 2573.87861 +wandb: train_loss 3.77715 +wandb: val_bpb 2.20876 +wandb: val_loss 3.7294 +wandb: +wandb: 🚀 View run grid_p3_j0.1 at: https://wandb.ai/propensity/parameter-golf/runs/n43r2rb3 +wandb: ⭐️ View project at: https://wandb.ai/propensity/parameter-golf +wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s) +wandb: Find logs at: ./wandb/run-20260326_114011-n43r2rb3/logs diff --git a/grid_p4_j0.0.log b/grid_p4_j0.0.log new file mode 100644 index 0000000000..c6a0cb016a --- /dev/null +++ b/grid_p4_j0.0.log @@ -0,0 +1,47 @@ +logs/e4fcc17e-e3ef-4ddf-812a-580b6bc0d825.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=3 core_end=8 num_passes=4 stem=3 core=5 tail=3 +model_params:26927200 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[7, 8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:50 warmup_steps:5 max_wallclock_seconds:900.000 +seed:1337 +wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY. +wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin +wandb: WARNING Using a boolean value for 'reinit' is deprecated. Use 'return_previous' or 'finish_previous' instead. +wandb: setting up run v63k38ck +wandb: Tracking run with wandb version 0.25.1 +wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114848-v63k38ck +wandb: Run `wandb offline` to turn off syncing. 
+wandb: Syncing run grid_p4_j0.0 +wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf +wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/v63k38ck +wandb:initialized +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['9829.5', '10906.5', '12074.5', '13339.5', '14586.6', '8110.7', '9182.3', '10321.5', '11532.3', '12712.6', '8125.6', '9210.3', '10362.9', '11582.1', '12773.0', '8128.7', '9216.6', '10372.5', '11592.1', '12787.0'] growth=['1.111', '1.110', '1.107', '1.105', '1.093', '1.143', '1.132', '1.124', '1.117', '1.102', '1.145', '1.133', '1.125', '1.118', '1.103', '1.146', '1.134', '1.125', '1.118', '1.103'] +step:1/50 train_loss:6.9310 train_time:3144ms step_avg:3143.65ms +step:2/50 train_loss:8.3561 train_time:6272ms step_avg:3135.84ms +step:3/50 train_loss:7.5194 train_time:9433ms step_avg:3144.22ms +step:4/50 train_loss:7.4222 train_time:12595ms step_avg:3148.75ms +step:5/50 train_loss:7.2305 train_time:15755ms step_avg:3151.08ms +step:6/50 train_loss:7.1013 train_time:18919ms step_avg:3153.09ms +step:7/50 train_loss:7.0637 train_time:22082ms step_avg:3154.61ms +step:8/50 train_loss:6.9894 train_time:25245ms step_avg:3155.59ms +step:9/50 train_loss:6.6143 train_time:28406ms step_avg:3156.21ms +step:10/50 train_loss:6.2371 train_time:31569ms step_avg:3156.90ms +step:20/50 train_loss:4.8567 train_time:63180ms step_avg:3158.98ms +step:25/50 val_loss:4.3859 val_bpb:2.5976 train_time:89760ms step_avg:3590.41ms h_norms=['15124.4', '18985.0', '24623.8', '32563.5', '43308.4', '10881.2', '15824.5', '22366.4', '31313.8', '43085.3', '10968.9', '16075.0', '22846.2', '32046.9', '44227.1', '10951.1', '16019.2', '22824.3', '32056.4', '44286.6'] growth=['1.193', '1.255', '1.297', '1.322', '1.330', '1.534', '1.454', '1.413', '1.400', '1.376', '1.546', '1.466', '1.421', '1.403', '1.380', '1.544', '1.463', '1.425', '1.404', '1.382'] +step:30/50 train_loss:4.2191 train_time:125306ms step_avg:4176.87ms +step:40/50 train_loss:3.9125 train_time:196653ms step_avg:4916.32ms +step:50/50 train_loss:3.7387 train_time:273924ms step_avg:5478.49ms diff --git a/grid_search_results.csv b/grid_search_results.csv new file mode 100644 index 0000000000..a0f0b2ac5c --- /dev/null +++ b/grid_search_results.csv @@ -0,0 +1,9 @@ +passes,jacobian,bpb_0,bpb_25,bpb_50,int6_bpb,step_avg_ms,mem_mib,growth_pass2_step50 +2,0.0,4.1046,2.6306,2.2074,3.62364007,1963.03,42589,1.086, 1.123, 1.151, 1.173, 1.205, 1.487, 1.400, 1.347, 1.318, 1.277 +2,0.001,4.1046,2.6282,2.2141,3.62377766,1978.94,42782,1.087, 1.124, 1.149, 1.167, 1.193, 1.475, 1.387, 1.331, 1.302, 1.253 +2,0.01,4.1046,2.6282,2.2101,3.62301918,1975.11,42782,1.093, 1.128, 1.153, 1.171, 1.194, 1.480, 1.385, 1.336, 1.304, 1.254 +2,0.1,4.1046,2.6260,2.2343,3.62357527,1974.66,42782,1.079, 1.114, 1.139, 1.155, 1.187, 1.441, 1.363, 1.315, 1.288, 1.254 +3,0.0,4.1046,2.6074,2.2007,3.64008912,2564.77,54994,1.181, 1.273, 1.325, 1.364, 1.376, 1.898, 1.593, 1.492, 1.451, 1.426, 1.919, 1.600, 1.500, 1.457, 1.430 +3,0.001,4.1046,2.6086,2.2142,3.64004215,2577.70,55186,1.172, 1.259, 1.308, 1.345, 1.364, 1.818, 1.568, 1.469, 1.443, 1.416, 1.842, 1.577, 1.477, 1.450, 1.421 +3,0.01,4.1046,2.6088,2.2238,3.64171241,2573.96,55186,1.156, 1.224, 1.266, 1.298, 1.313, 1.621, 1.478, 1.412, 1.387, 1.357, 1.649, 1.490, 1.419, 1.393, 1.363 +3,0.1,4.1046,2.6107,2.2088,3.64030939,2573.88,55186,1.103, 1.152, 1.175, 1.197, 1.209, 1.405, 1.370, 1.315, 1.292, 
1.243, 1.402, 1.380, 1.326, 1.302, 1.241 diff --git a/grid_search_stdout.log b/grid_search_stdout.log new file mode 100644 index 0000000000..6768cfdb22 --- /dev/null +++ b/grid_search_stdout.log @@ -0,0 +1,25 @@ +[1/12] START passes=2 jac=0.0 (10:47:18) +[1/12] DONE passes=2 jac=0.0 (10:54:01) + => bpb@50=2.2074 int6_bpb=3.62364007 step_avg=1963.03ms mem=42589MiB +[2/12] START passes=2 jac=0.001 (10:54:01) +[2/12] DONE passes=2 jac=0.001 (11:00:46) + => bpb@50=2.2141 int6_bpb=3.62377766 step_avg=1978.94ms mem=42782MiB +[3/12] START passes=2 jac=0.01 (11:00:46) +[3/12] DONE passes=2 jac=0.01 (11:07:31) + => bpb@50=2.2101 int6_bpb=3.62301918 step_avg=1975.11ms mem=42782MiB +[4/12] START passes=2 jac=0.1 (11:07:31) +[4/12] DONE passes=2 jac=0.1 (11:14:15) + => bpb@50=2.2343 int6_bpb=3.62357527 step_avg=1974.66ms mem=42782MiB +[5/12] START passes=3 jac=0.0 (11:14:15) +[5/12] DONE passes=3 jac=0.0 (11:22:52) + => bpb@50=2.2007 int6_bpb=3.64008912 step_avg=2564.77ms mem=54994MiB +[6/12] START passes=3 jac=0.001 (11:22:52) +[6/12] DONE passes=3 jac=0.001 (11:31:31) + => bpb@50=2.2142 int6_bpb=3.64004215 step_avg=2577.70ms mem=55186MiB +[7/12] START passes=3 jac=0.01 (11:31:31) +[7/12] DONE passes=3 jac=0.01 (11:40:08) + => bpb@50=2.2238 int6_bpb=3.64171241 step_avg=2573.96ms mem=55186MiB +[8/12] START passes=3 jac=0.1 (11:40:08) +[8/12] DONE passes=3 jac=0.1 (11:48:45) + => bpb@50=2.2088 int6_bpb=3.64030939 step_avg=2573.88ms mem=55186MiB +[9/12] START passes=4 jac=0.0 (11:48:45) diff --git a/records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/debug-internal.log b/records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/debug-internal.log new file mode 120000 index 0000000000..db9ecf8381 --- /dev/null +++ b/records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/debug-internal.log @@ -0,0 +1 @@ +run-20260326_122037-nx8viusx/logs/debug-internal.log \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/debug.log b/records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/debug.log new file mode 120000 index 0000000000..a7a10f1afb --- /dev/null +++ b/records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/debug.log @@ -0,0 +1 @@ +run-20260326_122037-nx8viusx/logs/debug.log \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/latest-run b/records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/latest-run new file mode 120000 index 0000000000..684334bbfe --- /dev/null +++ b/records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/latest-run @@ -0,0 +1 @@ +run-20260326_122037-nx8viusx \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/run-20260326_122037-nx8viusx/files/config.yaml b/records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/run-20260326_122037-nx8viusx/files/config.yaml new file mode 100644 index 0000000000..64485169c9 --- /dev/null +++ b/records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/run-20260326_122037-nx8viusx/files/config.yaml @@ -0,0 +1,66 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + 4wnw16h9ndm7pn30ui902o2a0dmpw2g6: + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "39173939200" + email: nesta.midavaine@prosus.com + executable: 
/home/nesta/parameter-golf/.venv/bin/python3 + git: + commit: b38248cf6d4a1387d06b2906628c717e59747b11 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: + python: CPython 3.12.3 + root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon + startedAt: "2026-03-26T12:20:37.042131Z" + writerId: 4wnw16h9ndm7pn30ui902o2a0dmpw2g6 + m: [] + python_version: 3.12.3 + t: + "3": + - 2 + - 13 + - 16 + - 61 + "4": 3.12.3 + "5": 0.25.1 + "12": 0.25.1 + "13": linux-x86_64 +iterations: + value: 50 +method: + value: baseline_SOTA +model_dim: + value: 512 +num_heads: + value: 8 +num_layers: + value: 11 +num_passes: + value: 1 +recurrence: + value: false +seed: + value: 1337 +train_batch_tokens: + value: 786432 +train_seq_len: + value: 2048 diff --git a/records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/run-20260326_122037-nx8viusx/files/output.log b/records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/run-20260326_122037-nx8viusx/files/output.log new file mode 100644 index 0000000000..55d4ab72ac --- /dev/null +++ b/records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/run-20260326_122037-nx8viusx/files/output.log @@ -0,0 +1,44 @@ +logs/bff51f18-fbb9-43cc-9903-c84284e4e76d.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +model_params:26928220 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[7, 8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:50 warmup_steps:5 max_wallclock_seconds:900.000 +seed:1337 +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms +step:1/50 train_loss:6.9310 train_time:1335ms step_avg:1334.89ms +step:2/50 train_loss:8.6894 train_time:2639ms step_avg:1319.33ms +step:3/50 train_loss:7.7641 train_time:3975ms step_avg:1325.02ms +step:4/50 train_loss:7.2309 train_time:5311ms step_avg:1327.85ms +step:5/50 train_loss:7.1292 train_time:6648ms step_avg:1329.55ms +step:6/50 train_loss:7.1698 train_time:7983ms step_avg:1330.57ms +step:7/50 train_loss:7.1045 train_time:9320ms step_avg:1331.38ms +step:8/50 train_loss:6.9776 train_time:10656ms step_avg:1331.99ms +step:9/50 train_loss:6.6169 train_time:11993ms step_avg:1332.53ms +step:10/50 train_loss:6.2604 train_time:13330ms step_avg:1332.96ms +step:20/50 train_loss:5.1681 train_time:26695ms step_avg:1334.74ms +step:25/50 val_loss:4.6120 val_bpb:2.7315 train_time:33413ms step_avg:1336.54ms +step:30/50 train_loss:4.3901 train_time:40068ms step_avg:1335.60ms +step:40/50 train_loss:4.0167 train_time:53443ms step_avg:1336.07ms +step:50/50 train_loss:3.8262 train_time:66820ms step_avg:1336.40ms 
+step:50/50 val_loss:3.7856 val_bpb:2.2421 train_time:66853ms step_avg:1337.06ms +peak memory allocated: 30083 MiB reserved: 31168 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:5.8987 val_bpb:3.4935 eval_time:38419ms +Serialized model: 106027446 bytes +Code size: 89458 bytes +Serialized model int6+lzma: 4809376 bytes +Total submission size int6+lzma: 4898834 bytes +final_int6_roundtrip val_loss:6.0576 val_bpb:3.5876 eval_time:38209ms +final_int6_roundtrip_exact val_loss:6.05759208 val_bpb:3.58764724 diff --git a/records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/run-20260326_122037-nx8viusx/files/requirements.txt b/records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/run-20260326_122037-nx8viusx/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/run-20260326_122037-nx8viusx/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/run-20260326_122037-nx8viusx/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/run-20260326_122037-nx8viusx/files/wandb-metadata.json new file mode 100644 index 0000000000..7a7da9eab4 --- /dev/null +++ 
b/records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/run-20260326_122037-nx8viusx/files/wandb-metadata.json @@ -0,0 +1,38 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T12:20:37.042131Z", + "program": "", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "b38248cf6d4a1387d06b2906628c717e59747b11" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "39173939200" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "4wnw16h9ndm7pn30ui902o2a0dmpw2g6" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/run-20260326_122037-nx8viusx/files/wandb-summary.json b/records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/run-20260326_122037-nx8viusx/files/wandb-summary.json new file mode 100644 index 0000000000..7af3bac55f --- /dev/null +++ b/records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/run-20260326_122037-nx8viusx/files/wandb-summary.json @@ -0,0 +1 @@ +{"step_avg_ms":1337.06,"_timestamp":1.7745278305633738e+09,"train_loss":3.8262,"_wandb":{"runtime":454},"_runtime":454.938437552,"_step":50,"val_loss":3.7856,"val_bpb":2.2421} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/run_full.sh b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/run_full.sh new file mode 100644 index 0000000000..e4b315250f --- /dev/null +++ b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/run_full.sh @@ -0,0 +1,63 @@ +#!/bin/bash +set -euo pipefail + +PYTHON="/home/nesta/parameter-golf/.venv/bin/python3" +MINUTES="${MINUTES:-80}" +WALLCLOCK=$((MINUTES * 60)) +SEED="${SEED:-1337}" + +echo "============================================================" +echo " Full 1-GPU run: learned feedback variant" +echo " Wall clock: ${MINUTES} minutes (${WALLCLOCK}s)" +echo " Seed: ${SEED}" +echo "============================================================" + +cd /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT + +PYTHONUNBUFFERED=1 \ +TORCH_COMPILE_DISABLE=1 \ +DATA_PATH="../../../data/datasets/fineweb10B_sp1024" \ +TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model" \ +SEED="${SEED}" \ +ITERATIONS=20000 \ +MAX_WALLCLOCK_SECONDS="${WALLCLOCK}" \ +VAL_LOSS_EVERY=2000 \ +TRAIN_LOG_EVERY=200 \ +WARMUP_STEPS=20 \ +WARMDOWN_ITERS=3500 \ +TRAIN_BATCH_TOKENS=786432 \ +TRAIN_SEQ_LEN=2048 \ +EVAL_SEQ_LEN=2048 \ +EVAL_STRIDE=64 \ +NUM_STEM_LAYERS=3 \ +NUM_CORE_LAYERS=2 \ +NUM_TAIL_LAYERS=3 \ +NUM_PASSES=3 \ +CORE_QUANT_BITS=6 \ +CORE_QUANT_ENABLED=1 \ +BIGRAM_VOCAB_SIZE=1536 \ +XSA_LAST_N=4 \ +ROPE_DIMS=16 \ +LN_SCALE=1 \ +VE_ENABLED=1 \ +VE_DIM=128 \ +VE_LAYERS="6,7" \ +MATRIX_LR=0.025 \ +SCALAR_LR=0.025 \ +TIED_EMBED_LR=0.035 \ 
+MUON_MOMENTUM=0.99 \ +MUON_MOMENTUM_WARMUP_START=0.92 \ +MUON_MOMENTUM_WARMUP_STEPS=1500 \ +MUON_WD=0.04 \ +ADAM_WD=0.04 \ +GRAD_CLIP_NORM=0.3 \ +SWA_ENABLED=1 \ +SWA_EVERY=50 \ +LATE_QAT=1 \ +LATE_QAT_THRESHOLD=0.15 \ +TTT_ENABLED=0 \ +$PYTHON train_bestbase_recurrent_feedback_learned.py \ + --feedback-mode diagonal --feedback-rank 2 --ttt-regime tail_only + +echo "" +echo "Run complete." diff --git a/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/run_smoke.sh b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/run_smoke.sh new file mode 100644 index 0000000000..43fdd5ac41 --- /dev/null +++ b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/run_smoke.sh @@ -0,0 +1,48 @@ +#!/bin/bash +set -euo pipefail + +PYTHON="/home/nesta/parameter-golf/.venv/bin/python3" + +echo "=== Environment check ===" +$PYTHON -c " +import torch +print(f'PyTorch {torch.__version__}, CUDA {torch.version.cuda}') +print(f'GPU: {torch.cuda.get_device_name(0)}') +from flash_attn_interface import flash_attn_func +print('Flash Attention 3: OK') +" + +echo "" +echo "=== Smoke test: Script 3 (learned feedback, 50 steps) ===" +cd /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT + +TORCH_COMPILE_DISABLE=1 \ +DATA_PATH="../../../data/datasets/fineweb10B_sp1024" \ +TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model" \ +ITERATIONS=50 \ +MAX_WALLCLOCK_SECONDS=300 \ +VAL_LOSS_EVERY=25 \ +TRAIN_LOG_EVERY=10 \ +WARMUP_STEPS=5 \ +WARMDOWN_ITERS=10 \ +TRAIN_BATCH_TOKENS=131072 \ +TTT_ENABLED=0 \ +NUM_STEM_LAYERS=3 \ +NUM_CORE_LAYERS=2 \ +NUM_TAIL_LAYERS=3 \ +NUM_PASSES=3 \ +CORE_QUANT_BITS=6 \ +CORE_QUANT_ENABLED=1 \ +BIGRAM_VOCAB_SIZE=1536 \ +XSA_LAST_N=4 \ +ROPE_DIMS=16 \ +LN_SCALE=1 \ +VE_ENABLED=1 \ +VE_DIM=128 \ +VE_LAYERS="6,7" \ +SWA_ENABLED=0 \ +$PYTHON train_bestbase_recurrent_feedback_learned.py \ + --feedback-mode diagonal --feedback-rank 2 --ttt-regime tail_only + +echo "" +echo "=== Smoke test complete ===" diff --git a/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_bestbase_recurrent_feedback_fixed.py b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_bestbase_recurrent_feedback_fixed.py index ff18bceb2c..85931d3691 100644 --- a/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_bestbase_recurrent_feedback_fixed.py +++ b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_bestbase_recurrent_feedback_fixed.py @@ -134,7 +134,7 @@ def feedback_fn(h, pass_idx): clip_hidden=cli.clip_hidden, clip_value=cli.clip_value, jacobian_proxy_weight=cli.jacobian_proxy_weight) - compiled_model = torch.compile(base_model, dynamic=False, fullgraph=True) + compiled_model = base_model model = compiled_model extra_scalar = list(feedback.parameters()) diff --git a/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_bestbase_recurrent_feedback_learned.py b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_bestbase_recurrent_feedback_learned.py index 29deeaf61a..94df9e02be 100644 --- a/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_bestbase_recurrent_feedback_learned.py +++ b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_bestbase_recurrent_feedback_learned.py @@ -158,9 +158,9 @@ def feedback_fn(h, pass_idx): args.num_passes, cli.residual_scale_init ).to(device) - # ---- compile ---- - compiled_model = 
torch.compile(base_model, dynamic=False, fullgraph=True) - model = compiled_model + # ---- compile (disabled: stabilizer .item() causes recompilation storms) ---- + compiled_model = base_model + model = base_model # ---- optimizers ---- extra_scalar = list(feedback.parameters()) diff --git a/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_bestbase_recurrent_qat.py b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_bestbase_recurrent_qat.py index 8156aff0db..36f287949d 100644 --- a/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_bestbase_recurrent_qat.py +++ b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_bestbase_recurrent_qat.py @@ -108,7 +108,7 @@ def log0(msg: str, console: bool = True) -> None: clip_hidden=cli.clip_hidden, clip_value=cli.clip_value, jacobian_proxy_weight=cli.jacobian_proxy_weight) - compiled_model = torch.compile(base_model, dynamic=False, fullgraph=True) + compiled_model = base_model model = compiled_model optimizers, replicated_params = build_optimizers(base_model, args) diff --git a/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_utils_recurrent.py b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_utils_recurrent.py index 45fa72f3a9..2b33ff5ecf 100644 --- a/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_utils_recurrent.py +++ b/records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/train_utils_recurrent.py @@ -501,8 +501,7 @@ def eval_val_sliding( token_count = torch.zeros((), device=device, dtype=torch.float64) byte_count = torch.zeros((), device=device, dtype=torch.float64) base_model.eval() - compiled_logits = torch.compile(base_model.forward_logits, - dynamic=False, fullgraph=True) + compiled_logits = base_model.forward_logits with torch.inference_mode(): for bi in range(0, len(my_windows), batch_seqs): batch_ws = my_windows[bi:bi + batch_seqs] diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/ablation_no_rmsnorm.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/ablation_no_rmsnorm.sh new file mode 100755 index 0000000000..fe3a255e7a --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/ablation_no_rmsnorm.sh @@ -0,0 +1,77 @@ +#!/bin/bash +set -uo pipefail + +PYTHON="/home/nesta/parameter-golf/.venv/bin/python3" +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +cd "$SCRIPT_DIR" + +set -a; source /home/nesta/parameter-golf/.env; set +a + +export PYTHONUNBUFFERED=1 +export TORCH_COMPILE_DISABLE=1 +export DATA_PATH="../../../data/datasets/fineweb10B_sp1024" +export TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model" +export SEED=1337 +export ITERATIONS=50 +export MAX_WALLCLOCK_SECONDS=900 +export VAL_LOSS_EVERY=25 +export TRAIN_LOG_EVERY=10 +export WARMUP_STEPS=5 +export WARMDOWN_ITERS=10 +export TRAIN_BATCH_TOKENS=786432 +export TRAIN_SEQ_LEN=2048 +export EVAL_SEQ_LEN=2048 +export EVAL_STRIDE=0 +export NUM_LAYERS=11 +export MODEL_DIM=512 +export NUM_HEADS=8 +export NUM_KV_HEADS=4 +export BIGRAM_VOCAB_SIZE=1536 +export XSA_LAST_N=4 +export ROPE_DIMS=16 +export LN_SCALE=1 +export VE_ENABLED=1 +export VE_DIM=128 +export VE_LAYERS="9,10" +export MATRIX_LR=0.025 +export SCALAR_LR=0.025 +export TIED_EMBED_LR=0.035 +export MUON_MOMENTUM=0.99 +export MUON_MOMENTUM_WARMUP_START=0.92 +export MUON_MOMENTUM_WARMUP_STEPS=5 +export MUON_WD=0.04 +export ADAM_WD=0.04 +export GRAD_CLIP_NORM=0.3 +export 
SWA_ENABLED=0 +export LATE_QAT=0 +export TTT_ENABLED=0 +export CORE_START=3 +export CORE_END=8 +export CORE_QUANT_ENABLED=0 +export NUM_PASSES=3 +export WANDB_PROJECT="parameter-golf" + +for JAC in 0.0 0.1; do + export WANDB_NAME="ablation_3p_noRMS_j${JAC}" + LOG="/home/nesta/parameter-golf/ablation_3p_noRMS_j${JAC}.log" + echo "START 3-pass no-RMSnorm jac=$JAC ($(date +%H:%M:%S))" + + $PYTHON train_gpt_recurrent.py \ + --feedback-mode diagonal --feedback-rank 2 \ + --residual-scale-init 0.5 \ + --jacobian-proxy-weight "$JAC" \ + --no-interpass-rmsnorm \ + > "$LOG" 2>&1 || { + echo "FAILED jac=$JAC (exit=$?)" + continue + } + + BPB_50=$(grep 'step:50/.*val_bpb:' "$LOG" | head -1 | sed 's/.*val_bpb:\([0-9.]*\).*/\1/' || echo "N/A") + INT6_BPB=$(grep 'final_int6_roundtrip_exact.*val_bpb:' "$LOG" | head -1 | sed 's/.*val_bpb:\([0-9.]*\).*/\1/' || echo "N/A") + MEM=$(grep 'peak memory' "$LOG" | head -1 | sed 's/.*allocated: \([0-9]*\) MiB.*/\1/' || echo "N/A") + STEP_AVG=$(grep 'step:50/.*step_avg:' "$LOG" | head -1 | sed 's/.*step_avg:\([0-9.]*\)ms.*/\1/' || echo "N/A") + + echo "DONE jac=$JAC => bpb@50=$BPB_50 int6=$INT6_BPB step=${STEP_AVG}ms mem=${MEM}MiB" +done + +echo "=== ABLATION COMPLETE ($(date)) ===" diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/feedback.py b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/feedback.py new file mode 100644 index 0000000000..dd34a36c62 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/feedback.py @@ -0,0 +1,138 @@ +"""Error feedback modules for recurrent quantization correction. + +Implements low-rank residual approximation and correction operators +to compensate for quantization error amplification in recurrent passes. + + e_k = U (V^T h_k) -- low-rank residual approx. + c_k = D_k(e_k) -- correction operator + h_{k+1} = f_{W_q}(h_k + c_k) -- corrected recurrent update +""" +from __future__ import annotations +import math +import torch +import torch.nn as nn +from torch import Tensor + + +class LowRankResidual(nn.Module): + """e_k = U (V^T h_k) with U, V in R^{d x r}.""" + + def __init__(self, dim: int, rank: int = 2): + super().__init__() + self.V = nn.Parameter(torch.zeros(dim, rank)) + self.U = nn.Parameter(torch.zeros(dim, rank)) + + def forward(self, h: Tensor) -> Tensor: + return (h @ self.V) @ self.U.T + + +class DiagonalFeedback(nn.Module): + """c_k = d odot e_k.""" + + def __init__(self, dim: int, init_ones: bool = False): + super().__init__() + init_val = torch.ones(dim) if init_ones else torch.zeros(dim) + self.d = nn.Parameter(init_val) + + def forward(self, e: Tensor) -> Tensor: + return self.d.to(dtype=e.dtype) * e + + +class LowRankFeedback(nn.Module): + """c_k = U_D (V_D^T e_k) with U_D, V_D in R^{d x r}.""" + + def __init__(self, dim: int, rank: int = 2): + super().__init__() + self.V_D = nn.Parameter(torch.zeros(dim, rank)) + self.U_D = nn.Parameter(torch.zeros(dim, rank)) + + def forward(self, e: Tensor) -> Tensor: + return (e @ self.V_D) @ self.U_D.T + + +class AffineJunction(nn.Module): + """c_k^{aff} = gamma_k odot h_k + beta_k.""" + + def __init__(self, dim: int): + super().__init__() + self.gamma = nn.Parameter(torch.ones(dim)) + self.beta = nn.Parameter(torch.zeros(dim)) + + def forward(self, h: Tensor) -> Tensor: + return self.gamma.to(dtype=h.dtype) * h + self.beta.to(dtype=h.dtype) + + +class ErrorFeedbackModule(nn.Module): + """Combined error-feedback path: residual -> correction -> (optional junction). + + Supports shared or per-pass correction operators. 
Correction is inactive + on pass 0 (the first recurrence pass sees no prior quantization residual). + + Args: + dim: model hidden dimension + rank: rank for low-rank components + feedback_mode: 'identity' | 'diagonal' | 'low_rank' + per_pass: separate correction per pass if True + num_passes: number of recurrence passes (K) + affine_junction: add an affine junction path + """ + + def __init__( + self, + dim: int, + rank: int = 2, + feedback_mode: str = "diagonal", + per_pass: bool = False, + num_passes: int = 3, + affine_junction: bool = False, + ): + super().__init__() + self.feedback_mode = feedback_mode + self.per_pass = per_pass + self.num_passes = num_passes + + self.residual = LowRankResidual(dim, rank) + + if feedback_mode == "identity": + self.correction: nn.Module | nn.ModuleList | None = None + elif feedback_mode == "diagonal": + if per_pass: + self.correction = nn.ModuleList( + [DiagonalFeedback(dim) for _ in range(num_passes)] + ) + else: + self.correction = DiagonalFeedback(dim) + elif feedback_mode == "low_rank": + if per_pass: + self.correction = nn.ModuleList( + [LowRankFeedback(dim, rank) for _ in range(num_passes)] + ) + else: + self.correction = LowRankFeedback(dim, rank) + else: + raise ValueError(f"Unknown feedback_mode: {feedback_mode}") + + self.junction: AffineJunction | None = ( + AffineJunction(dim) if affine_junction else None + ) + + def forward(self, h: Tensor, pass_idx: int) -> Tensor: + """Return correction tensor (zeros on pass 0).""" + e = self.residual(h) + if self.correction is None: + c = e + elif self.per_pass: + c = self.correction[pass_idx](e) + else: + c = self.correction(e) + if self.junction is not None: + c = c + self.junction(h) + mask = torch.tensor(1.0 if pass_idx > 0 else 0.0, device=h.device, dtype=h.dtype) + return c * mask + + def extra_repr(self) -> str: + return (f"mode={self.feedback_mode}, per_pass={self.per_pass}, " + f"passes={self.num_passes}") + + def param_count(self) -> int: + return sum(p.numel() for p in self.parameters()) diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/grid_search.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/grid_search.sh new file mode 100755 index 0000000000..4ff88e237d --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/grid_search.sh @@ -0,0 +1,96 @@ +#!/bin/bash +set -uo pipefail + +PYTHON="/home/nesta/parameter-golf/.venv/bin/python3" +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +cd "$SCRIPT_DIR" + +set -a; source /home/nesta/parameter-golf/.env; set +a + +RESULTS_FILE="/home/nesta/parameter-golf/grid_search_results.csv" +echo "passes,jacobian,bpb_0,bpb_25,bpb_50,int6_bpb,step_avg_ms,mem_mib,growth_pass2_step50" > "$RESULTS_FILE" + +export PYTHONUNBUFFERED=1 +export TORCH_COMPILE_DISABLE=1 +export DATA_PATH="../../../data/datasets/fineweb10B_sp1024" +export TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model" +export SEED=1337 +export ITERATIONS=50 +export MAX_WALLCLOCK_SECONDS=900 +export VAL_LOSS_EVERY=25 +export TRAIN_LOG_EVERY=10 +export WARMUP_STEPS=5 +export WARMDOWN_ITERS=10 +export TRAIN_BATCH_TOKENS=786432 +export TRAIN_SEQ_LEN=2048 +export EVAL_SEQ_LEN=2048 +export EVAL_STRIDE=0 +export NUM_LAYERS=11 +export MODEL_DIM=512 +export NUM_HEADS=8 +export NUM_KV_HEADS=4 +export BIGRAM_VOCAB_SIZE=1536 +export XSA_LAST_N=4 +export ROPE_DIMS=16 +export LN_SCALE=1 +export VE_ENABLED=1 +export VE_DIM=128 +export VE_LAYERS="9,10" +export MATRIX_LR=0.025 +export SCALAR_LR=0.025 +export TIED_EMBED_LR=0.035 +export 
MUON_MOMENTUM=0.99 +export MUON_MOMENTUM_WARMUP_START=0.92 +export MUON_MOMENTUM_WARMUP_STEPS=5 +export MUON_WD=0.04 +export ADAM_WD=0.04 +export GRAD_CLIP_NORM=0.3 +export SWA_ENABLED=0 +export LATE_QAT=0 +export TTT_ENABLED=0 +export CORE_START=3 +export CORE_END=8 +export CORE_QUANT_ENABLED=0 +export WANDB_PROJECT="parameter-golf" + +RUN_IDX=0 +for PASSES in 2 3 4; do + for JAC in 0.0 0.001 0.01 0.1; do + RUN_IDX=$((RUN_IDX + 1)) + export NUM_PASSES=$PASSES + export WANDB_NAME="grid_p${PASSES}_j${JAC}" + LOG="/home/nesta/parameter-golf/grid_p${PASSES}_j${JAC}.log" + + if [ -f "$LOG" ] && grep -q "final_int6_roundtrip_exact" "$LOG"; then + echo "[$RUN_IDX/12] SKIP (already done): passes=$PASSES jac=$JAC" + else + echo "[$RUN_IDX/12] START passes=$PASSES jac=$JAC ($(date +%H:%M:%S))" + $PYTHON train_gpt_recurrent.py \ + --feedback-mode diagonal --feedback-rank 2 \ + --residual-scale-init 0.5 \ + --jacobian-proxy-weight "$JAC" \ + > "$LOG" 2>&1 || { + echo "[$RUN_IDX/12] FAILED passes=$PASSES jac=$JAC (exit=$?)" + echo "$PASSES,$JAC,FAIL,FAIL,FAIL,FAIL,FAIL,FAIL,FAIL" >> "$RESULTS_FILE" + continue + } + echo "[$RUN_IDX/12] DONE passes=$PASSES jac=$JAC ($(date +%H:%M:%S))" + fi + + BPB_0=$(grep 'step:0/.*val_bpb:' "$LOG" | head -1 | sed 's/.*val_bpb:\([0-9.]*\).*/\1/' || echo "N/A") + BPB_25=$(grep 'step:25/.*val_bpb:' "$LOG" | head -1 | sed 's/.*val_bpb:\([0-9.]*\).*/\1/' || echo "N/A") + BPB_50=$(grep 'step:50/.*val_bpb:' "$LOG" | head -1 | sed 's/.*val_bpb:\([0-9.]*\).*/\1/' || echo "N/A") + INT6_BPB=$(grep 'final_int6_roundtrip_exact.*val_bpb:' "$LOG" | head -1 | sed 's/.*val_bpb:\([0-9.]*\).*/\1/' || echo "N/A") + STEP_AVG=$(grep 'step:50/.*step_avg:' "$LOG" | head -1 | sed 's/.*step_avg:\([0-9.]*\)ms.*/\1/' || echo "N/A") + MEM=$(grep 'peak memory' "$LOG" | head -1 | sed 's/.*allocated: \([0-9]*\) MiB.*/\1/' || echo "N/A") + GROWTH_50=$(grep 'step:50/.*growth=' "$LOG" | head -1 | sed "s/.*growth=\[//;s/\].*//;s/'//g" || echo "N/A") + + echo "$PASSES,$JAC,$BPB_0,$BPB_25,$BPB_50,$INT6_BPB,$STEP_AVG,$MEM,$GROWTH_50" >> "$RESULTS_FILE" + echo " => bpb@50=$BPB_50 int6_bpb=$INT6_BPB step_avg=${STEP_AVG}ms mem=${MEM}MiB" + done +done + +echo "" +echo "=== ALL 12 RUNS COMPLETE ($(date)) ===" +echo "Results CSV: $RESULTS_FILE" +cat "$RESULTS_FILE" diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/recurrence-fixes.md b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/recurrence-fixes.md new file mode 100644 index 0000000000..449700d6d4 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/recurrence-fixes.md @@ -0,0 +1,376 @@ +# Recurrent SOTA: Complete Fix Plan + +All fixes listed below are required. Apply them in order. + +--- + +## Fix 1: Add inter-pass RMSNorm (CRITICAL) + +The core layers were trained to expect inputs at a specific scale. On pass 2+, the output of the core has a different magnitude than the input. Without renormalization, pass 2 feeds out-of-distribution activations into the same weights, causing 35-48× growth. + +**In `GPT.forward()` and `GPT.forward_logits()`, in the core loop:** + +```python +# --- RECURRENT CORE --- +for k in range(self.num_passes): + if k > 0: + x = F.rms_norm(x, (x.size(-1),)) + for j in range(self.core_start, self.core_end): + # ... layer execution +``` + +Zero extra parameters. This is what Universal Transformers and Huginn both do. 
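+
+To see the failure mode numerically, here is a minimal sketch (a toy shared linear core, not the real block) showing how renormalizing between passes keeps hidden norms bounded even when the shared update is slightly expansive:
+
+```python
+import torch
+import torch.nn.functional as F
+
+torch.manual_seed(0)
+dim, passes = 512, 4
+W = torch.randn(dim, dim) / dim**0.5 * 1.1  # shared update, spectral norm > 1
+
+h_plain = h_norm = F.rms_norm(torch.randn(1, dim), (dim,))
+for k in range(passes):
+    h_plain = h_plain + h_plain @ W.T              # no renorm: norm compounds per pass
+    h_in = F.rms_norm(h_norm, (dim,)) if k > 0 else h_norm
+    h_norm = h_in + h_in @ W.T                     # inter-pass RMSNorm resets the scale
+    print(f"pass {k}: plain={h_plain.norm():8.1f}  rmsnorm={h_norm.norm():8.1f}")
+```
+
+The exact ratios depend on the spectrum of the real core; the compounding-versus-reset behavior is the point.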
+ +--- + +## Fix 2: Move feedback call outside the inner layer loop + +The feedback module is designed to correct at **junction points between passes**, not at every layer within a pass. Currently `feedback_fn(x, k)` is called inside the `for j` loop, meaning it fires 5 times per pass (once per core layer) instead of once per pass. + +**Before (wrong):** +```python +for k in range(self.num_passes): + for j in range(self.core_start, self.core_end): + correction = feedback_fn(x, k) if feedback_fn else None + if correction is not None: + x = x + correction + # ... layer execution +``` + +**After (correct):** +```python +for k in range(self.num_passes): + if k > 0: + x = F.rms_norm(x, (x.size(-1),)) + # Junction correction: once per pass, before re-entering core + if feedback_fn is not None: + correction = feedback_fn(x, k) + if correction is not None: + x = x + correction + if stabilizer is not None: + x = stabilizer.clip(x) + h_core_in = x # save for Jacobian proxy loss + for j in range(self.core_start, self.core_end): + h_prev = x + ve = self._get_ve(j, input_ids, ve_cache) + q_w, k_w, v_w, out_w, up_w, down_w = self._get_bank_weights(j) + x, raw_v = self.blocks[j](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, + v_embed=ve, v0=v0) + if v0 is None and raw_v is not None: + v0 = raw_v + if stabilizer is not None and self.training and not torch.compiler.is_compiling(): + stabilizer.record_pass(h_prev, x) + h_core_out = x # save for Jacobian proxy loss +``` + +--- + +## Fix 3: Zero-initialize the feedback module + +The `LowRankResidual` U/V matrices are initialized with random values, and `DiagonalFeedback.d` is initialized to ones. At step 1, this injects random noise at the same magnitude as the hidden state. The feedback module must be a no-op at initialization so it can't hurt before it's learned anything. + +**In `feedback.py`, change `LowRankResidual.__init__`:** +```python +class LowRankResidual(nn.Module): + def __init__(self, dim: int, rank: int = 2): + super().__init__() + self.V = nn.Parameter(torch.zeros(dim, rank)) + self.U = nn.Parameter(torch.zeros(dim, rank)) +``` + +**In `feedback.py`, change `DiagonalFeedback.__init__` default:** +```python +class DiagonalFeedback(nn.Module): + def __init__(self, dim: int, init_ones: bool = False): # was True + super().__init__() + init_val = torch.ones(dim) if init_ones else torch.zeros(dim) + self.d = nn.Parameter(init_val) +``` + +**In `feedback.py`, change `LowRankFeedback.__init__`:** +```python +class LowRankFeedback(nn.Module): + def __init__(self, dim: int, rank: int = 2): + super().__init__() + self.V_D = nn.Parameter(torch.zeros(dim, rank)) + self.U_D = nn.Parameter(torch.zeros(dim, rank)) +``` + +--- + +## Fix 4: Wire up Jacobian proxy loss in the training loop + +The Jacobian proxy loss penalizes the spectral norm of the core block's Jacobian exceeding 1.0. This is the training-time mechanism that ensures the recurrent core is **contractive** — meaning quantization errors shrink rather than grow across passes. Without it, the model has no incentive to learn a stable recurrence. + +**Step A: Have forward() return Jacobian proxy inputs.** + +Change the forward signature to optionally return core boundary activations: + +```python +def forward(self, input_ids, target_ids, feedback_fn=None, stabilizer=None, + return_jacobian_pair=False): + # ... stem ... 
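+    # h_core_in / h_core_out bracket the entire K-pass core: the
+    # finite-difference proxy computed from this pair penalizes net
+    # expansion across all passes combined, not a single layer's Jacobian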
+ + h_core_in = None + h_core_out = None + + # --- RECURRENT CORE --- + for k in range(self.num_passes): + if k > 0: + x = F.rms_norm(x, (x.size(-1),)) + if feedback_fn is not None: + correction = feedback_fn(x, k) + if correction is not None: + x = x + correction + if stabilizer is not None: + x = stabilizer.clip(x) + if k == 0: + h_core_in = x + for j in range(self.core_start, self.core_end): + # ... layer execution ... + pass + if k == self.num_passes - 1: + h_core_out = x + + # ... tail + loss computation ... + + if return_jacobian_pair and h_core_in is not None and h_core_out is not None: + return main_loss, h_core_in, h_core_out + return main_loss +``` + +**Step B: Add Jacobian loss in the training loop in `main()`:** + +```python +for micro_step in range(grad_accum_steps): + x, y = train_loader.next_batch(...) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + if stabilizer is not None and stabilizer.jacobian_proxy_weight > 0: + loss, h_in, h_out = model(x, y, feedback_fn=feedback_fn, + stabilizer=stabilizer, + return_jacobian_pair=True) + loss = loss + stabilizer.jacobian_proxy_loss(h_in, h_out) + else: + loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer) + train_loss += loss.detach() + (loss * grad_scale).backward() +``` + +**Step C: Set a non-zero default weight.** Change the CLI default: +```python +g.add_argument("--jacobian-proxy-weight", type=float, default=0.01) +``` + +Start with 0.01. If training is too slow to converge (the regularizer fights the language modeling loss), reduce to 0.001. If growth ratios are still above 1.5, increase to 0.1. + +--- + +## Fix 5: Wire up ResidualScale in the forward pass + +`ResidualScale` is instantiated but never called. It dampens the residual update on each pass, giving the model a learnable per-pass attenuation factor. + +**In `GPT.__init__`, add `residual_scale` as a constructor arg:** +```python +def __init__(self, ..., residual_scale: nn.Module | None = None): + # ... + self.residual_scale = residual_scale +``` + +**In the core loop, apply it to the block's residual output. This requires changing how Block output is used:** + +The cleanest approach — apply ResidualScale at the pass level, scaling the entire pass's contribution to x: + +```python +for k in range(self.num_passes): + if k > 0: + x = F.rms_norm(x, (x.size(-1),)) + if feedback_fn is not None: + correction = feedback_fn(x, k) + if correction is not None: + x = x + correction + if stabilizer is not None: + x = stabilizer.clip(x) + + x_before_pass = x + for j in range(self.core_start, self.core_end): + # ... layer execution, x gets updated ... + pass + + # Scale the residual delta of this pass + if self.residual_scale is not None and k > 0: + delta = x - x_before_pass + x = x_before_pass + self.residual_scale(delta, k) +``` + +Pass `residual_scale` through from `main()`: +```python +base_model = GPT(..., residual_scale=residual_scale) +``` + +Initialize `residual_scale` with init_value=0.5 (start conservative, let it learn up): +```python +g.add_argument("--residual-scale-init", type=float, default=0.5) +``` + +--- + +## Fix 6: Factor out `_forward_hidden` to eliminate duplication + +`forward()` and `forward_logits()` duplicate the entire stem/core/tail logic. Every fix above must be applied to both, which is a maintenance nightmare and a guaranteed source of bugs. 
+ +**Create a shared method:** +```python +def _forward_hidden(self, input_ids, feedback_fn=None, stabilizer=None, + return_jacobian_pair=False): + """Run stem/core/tail, return (hidden_states, v0, jacobian_pair_or_None).""" + n = self.num_layers + x = self.tok_emb(input_ids) + if self.bigram is not None: + x = x + self.bigram(input_ids) + x = F.rms_norm(x, (x.size(-1),)) + x = self.smear(x) + x0 = x + v0 = None + skips = [] + ve_cache = {} + + # --- STEM --- + for i in range(self.core_start): + ve = self._get_ve(i, input_ids, ve_cache) + q_w, k_w, v_w, out_w, up_w, down_w = self._get_bank_weights(i) + x, raw_v = self.blocks[i](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, + v_embed=ve, v0=v0) + if v0 is None and raw_v is not None: + v0 = raw_v + skips.append(x) + + # --- RECURRENT CORE --- + h_core_in = None + h_core_out = None + for k in range(self.num_passes): + if k > 0: + x = F.rms_norm(x, (x.size(-1),)) + if feedback_fn is not None: + correction = feedback_fn(x, k) + if correction is not None: + x = x + correction + if stabilizer is not None: + x = stabilizer.clip(x) + if k == 0: + h_core_in = x + x_before_pass = x + for j in range(self.core_start, self.core_end): + h_prev = x + ve = self._get_ve(j, input_ids, ve_cache) + q_w, k_w, v_w, out_w, up_w, down_w = self._get_bank_weights(j) + x, raw_v = self.blocks[j](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, + v_embed=ve, v0=v0) + if v0 is None and raw_v is not None: + v0 = raw_v + if stabilizer is not None and self.training and not torch.compiler.is_compiling(): + stabilizer.record_pass(h_prev, x) + if self.residual_scale is not None and k > 0: + delta = x - x_before_pass + x = x_before_pass + self.residual_scale(delta, k) + if k == self.num_passes - 1: + h_core_out = x + + # --- TAIL --- + for i in range(self.core_end, n): + ti = i - self.core_end + if ti < len(skips): + x = x + self.skip_weights[ti].to(dtype=x.dtype)[None, None, :] * skips.pop() + ve = self._get_ve(i, input_ids, ve_cache) + q_w, k_w, v_w, out_w, up_w, down_w = self._get_bank_weights(i) + x, _ = self.blocks[i](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, + v_embed=ve, v0=v0) + + x = self.final_norm(x) + + jac_pair = (h_core_in, h_core_out) if return_jacobian_pair and h_core_in is not None else None + return x, jac_pair +``` + +**Then forward() and forward_logits() become thin wrappers:** +```python +def forward(self, input_ids, target_ids, feedback_fn=None, stabilizer=None, + return_jacobian_pair=False): + x, jac_pair = self._forward_hidden(input_ids, feedback_fn, stabilizer, + return_jacobian_pair) + x_flat = x.reshape(-1, x.size(-1)) + targets = target_ids.reshape(-1) + if self.tie_embeddings: + logits_proj = F.linear(x_flat, self.tok_emb.weight) + else: + logits_proj = self.lm_head(x_flat) + logits = self.logit_softcap * torch.tanh(logits_proj / self.logit_softcap) + main_loss = F.cross_entropy(logits.float(), targets, reduction="mean") + # ... MTP loss computation stays here ... + if jac_pair is not None: + return main_loss, jac_pair[0], jac_pair[1] + return main_loss + +def forward_logits(self, input_ids, feedback_fn=None, stabilizer=None): + x, _ = self._forward_hidden(input_ids, feedback_fn, stabilizer, False) + if self.tie_embeddings: + logits_proj = F.linear(x, self.tok_emb.weight) + else: + logits_proj = self.lm_head(x) + return self.logit_softcap * torch.tanh(logits_proj / self.logit_softcap) +``` + +--- + +## Fix 7: torch.compile compatibility for Jacobian loss + +The `return_jacobian_pair=True` path returns a tuple instead of a single tensor. 
`torch.compile(fullgraph=True)` will break if the return type changes dynamically. Two options: + +**Option A (simpler):** Always return the tuple, ignore jac_pair when not needed: +```python +# Always return 3 values +def forward(self, ...): + # ... + return main_loss, h_core_in, h_core_out # h_core_in/out can be None +``` + +**Option B (safer for compile):** Compute Jacobian loss inside forward() so the compiled function always returns a scalar: +```python +def forward(self, input_ids, target_ids, feedback_fn=None, stabilizer=None): + x, jac_pair = self._forward_hidden(input_ids, feedback_fn, stabilizer, True) + # ... compute main_loss ... + if jac_pair is not None and stabilizer is not None and stabilizer.jacobian_proxy_weight > 0: + main_loss = main_loss + stabilizer.jacobian_proxy_loss(jac_pair[0], jac_pair[1]) + return main_loss +``` + +**Go with Option B.** It keeps the compiled function signature stable and avoids any compile issues. Pass the stabilizer through — the Jacobian loss computation is pure tensor ops and compile-friendly. + +--- + +## Summary: Apply in this order + +| # | Fix | Files changed | Risk | +|---|-----|--------------|------| +| 1 | Inter-pass RMSNorm | train_gpt_recurrent.py | None — proven technique | +| 2 | Move feedback outside inner loop | train_gpt_recurrent.py | None — bug fix | +| 3 | Zero-initialize feedback | feedback.py | None — strictly safer init | +| 4 | Wire Jacobian proxy loss | train_gpt_recurrent.py, stability.py | Low — use Option B for compile safety | +| 5 | Wire ResidualScale | train_gpt_recurrent.py | Low — init at 0.5 is conservative | +| 6 | Factor out _forward_hidden | train_gpt_recurrent.py | None — refactor only | +| 7 | torch.compile compatibility | train_gpt_recurrent.py | None — use Option B | + +After applying all fixes, the recommended first run config: +```bash +NUM_PASSES=2 \ +CORE_START=3 \ +CORE_END=8 \ +torchrun --standalone --nproc_per_node=8 train_gpt_recurrent.py \ + --feedback-mode diagonal \ + --feedback-rank 2 \ + --clip-hidden \ + --clip-value 15 \ + --residual-scale-init 0.5 \ + --jacobian-proxy-weight 0.01 +``` + +Expected behavior: growth ratios should be 0.8-1.2 per layer (stable), val_bpb should converge to competitive range within the first 1000 steps, and the inter-pass RMSNorm alone should prevent the 35-48× explosions seen earlier. 
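+
+As a quick post-run sanity check, the growth ratios accumulated by `RecurrentStabilizer.record_pass` can be inspected directly; a minimal sketch, assuming the `stabilizer` object built in `main()`:
+
+```python
+# after a few training steps with the stabilizer attached
+ratios = stabilizer.diagnostics.growth_ratios  # one entry per core layer per pass
+if ratios:
+    mean = sum(ratios) / len(ratios)
+    worst = max(ratios)
+    print(f"growth ratios: mean={mean:.3f} max={worst:.3f} over {len(ratios)} layer-passes")
+    assert worst < 1.5, "core is expansive; raise --jacobian-proxy-weight"
+stabilizer.reset()
+```
+
+A mean near 1.0 with a max below ~1.2 matches the stable regime described above; sustained values above 1.5 are the signature of the runaway recurrence this plan is meant to eliminate.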
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_3pass.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_3pass.sh new file mode 100755 index 0000000000..0d0add01c1 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_3pass.sh @@ -0,0 +1,20 @@ +#!/bin/bash +set -euo pipefail +cd "$(dirname "$0")" +PYTHONUNBUFFERED=1 TORCH_COMPILE_DISABLE=1 \ +DATA_PATH="../../../data/datasets/fineweb10B_sp1024" \ +TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model" \ +SEED=1337 ITERATIONS=30 MAX_WALLCLOCK_SECONDS=600 \ +VAL_LOSS_EVERY=15 TRAIN_LOG_EVERY=10 WARMUP_STEPS=5 WARMDOWN_ITERS=5 \ +TRAIN_BATCH_TOKENS=786432 TRAIN_SEQ_LEN=2048 EVAL_SEQ_LEN=2048 EVAL_STRIDE=0 \ +NUM_LAYERS=11 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4 \ +BIGRAM_VOCAB_SIZE=1536 XSA_LAST_N=4 ROPE_DIMS=16 LN_SCALE=1 \ +VE_ENABLED=1 VE_DIM=128 VE_LAYERS="9,10" \ +MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \ +MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 MUON_MOMENTUM_WARMUP_STEPS=5 \ +MUON_WD=0.04 ADAM_WD=0.04 GRAD_CLIP_NORM=0.3 \ +SWA_ENABLED=0 LATE_QAT=0 TTT_ENABLED=0 \ +CORE_START=3 CORE_END=8 NUM_PASSES=3 CORE_QUANT_ENABLED=0 \ +/home/nesta/parameter-golf/.venv/bin/python3 train_gpt_recurrent.py \ + --feedback-mode diagonal --feedback-rank 2 \ + --residual-scale-init 0.5 --jacobian-proxy-weight 0.01 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_4pass_qat.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_4pass_qat.sh new file mode 100755 index 0000000000..c107b491d1 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_4pass_qat.sh @@ -0,0 +1,79 @@ +#!/bin/bash +set -uo pipefail + +PYTHON="/home/nesta/parameter-golf/.venv/bin/python3" +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +cd "$SCRIPT_DIR" + +set -a; source /home/nesta/parameter-golf/.env; set +a + +export PYTHONUNBUFFERED=1 +export TORCH_COMPILE_DISABLE=1 +export CUDA_MEM_FRACTION=0.572 +export DATA_PATH="../../../data/datasets/fineweb10B_sp1024" +export TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model" +export SEED=1337 +export ITERATIONS=50 +export MAX_WALLCLOCK_SECONDS=900 +export VAL_LOSS_EVERY=25 +export TRAIN_LOG_EVERY=10 +export WARMUP_STEPS=5 +export WARMDOWN_ITERS=10 +export TRAIN_BATCH_TOKENS=786432 +export TRAIN_SEQ_LEN=2048 +export EVAL_SEQ_LEN=2048 +export EVAL_STRIDE=0 +export NUM_LAYERS=11 +export MODEL_DIM=512 +export NUM_HEADS=8 +export NUM_KV_HEADS=4 +export BIGRAM_VOCAB_SIZE=1536 +export XSA_LAST_N=4 +export ROPE_DIMS=16 +export LN_SCALE=1 +export VE_ENABLED=1 +export VE_DIM=128 +export VE_LAYERS="9,10" +export MATRIX_LR=0.025 +export SCALAR_LR=0.025 +export TIED_EMBED_LR=0.035 +export MUON_MOMENTUM=0.99 +export MUON_MOMENTUM_WARMUP_START=0.92 +export MUON_MOMENTUM_WARMUP_STEPS=5 +export MUON_WD=0.04 +export ADAM_WD=0.04 +export GRAD_CLIP_NORM=0.3 +export SWA_ENABLED=0 +export LATE_QAT=0 +export TTT_ENABLED=0 +export CORE_START=3 +export CORE_END=8 +export CORE_QUANT_ENABLED=1 +export CORE_QUANT_BITS=6 +export NUM_PASSES=4 +export WANDB_PROJECT="parameter-golf" +export WANDB_NAME="test_4pass_noRMS_j0.1_QAT" + +LOG="/home/nesta/parameter-golf/test_4pass_qat.log" +echo "START 4-pass no-RMSnorm jac=0.1 QAT, 80GB cap ($(date +%H:%M:%S))" + +$PYTHON train_gpt_recurrent.py \ + --feedback-mode diagonal --feedback-rank 2 \ + --residual-scale-init 0.5 \ + --jacobian-proxy-weight 0.1 \ + --no-interpass-rmsnorm \ + > "$LOG" 2>&1 + +EXIT=$? 
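+# Summarize headline metrics from the log. The grep patterns assume the
+# trainer's "step:NN/... val_bpb:X.XXXX" line format; with `set -o pipefail`
+# above, a missing pattern fails the pipeline and the || fallbacks yield N/A.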
+if [ $EXIT -ne 0 ]; then + echo "FAILED (exit=$EXIT)" + tail -20 "$LOG" +else + BPB_50=$(grep 'step:50/.*val_bpb:' "$LOG" | head -1 | sed 's/.*val_bpb:\([0-9.]*\).*/\1/' || echo "N/A") + INT6_BPB=$(grep 'final_int6_roundtrip_exact.*val_bpb:' "$LOG" | head -1 | sed 's/.*val_bpb:\([0-9.]*\).*/\1/' || echo "N/A") + MEM=$(grep 'peak memory' "$LOG" | head -1 | sed 's/.*allocated: \([0-9]*\) MiB.*/\1/' || echo "N/A") + STEP_AVG=$(grep 'step:50/.*step_avg:' "$LOG" | head -1 | sed 's/.*step_avg:\([0-9.]*\)ms.*/\1/' || echo "N/A") + echo "DONE => bpb@50=$BPB_50 int6=$INT6_BPB step=${STEP_AVG}ms mem=${MEM}MiB" +fi + +echo "FINISHED ($(date +%H:%M:%S))" diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_4pass_test.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_4pass_test.sh new file mode 100755 index 0000000000..18f4bf284d --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_4pass_test.sh @@ -0,0 +1,78 @@ +#!/bin/bash +set -uo pipefail + +PYTHON="/home/nesta/parameter-golf/.venv/bin/python3" +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +cd "$SCRIPT_DIR" + +set -a; source /home/nesta/parameter-golf/.env; set +a + +export PYTHONUNBUFFERED=1 +export TORCH_COMPILE_DISABLE=1 +export CUDA_MEM_FRACTION=0.572 +export DATA_PATH="../../../data/datasets/fineweb10B_sp1024" +export TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model" +export SEED=1337 +export ITERATIONS=50 +export MAX_WALLCLOCK_SECONDS=900 +export VAL_LOSS_EVERY=25 +export TRAIN_LOG_EVERY=10 +export WARMUP_STEPS=5 +export WARMDOWN_ITERS=10 +export TRAIN_BATCH_TOKENS=786432 +export TRAIN_SEQ_LEN=2048 +export EVAL_SEQ_LEN=2048 +export EVAL_STRIDE=0 +export NUM_LAYERS=11 +export MODEL_DIM=512 +export NUM_HEADS=8 +export NUM_KV_HEADS=4 +export BIGRAM_VOCAB_SIZE=1536 +export XSA_LAST_N=4 +export ROPE_DIMS=16 +export LN_SCALE=1 +export VE_ENABLED=1 +export VE_DIM=128 +export VE_LAYERS="9,10" +export MATRIX_LR=0.025 +export SCALAR_LR=0.025 +export TIED_EMBED_LR=0.035 +export MUON_MOMENTUM=0.99 +export MUON_MOMENTUM_WARMUP_START=0.92 +export MUON_MOMENTUM_WARMUP_STEPS=5 +export MUON_WD=0.04 +export ADAM_WD=0.04 +export GRAD_CLIP_NORM=0.3 +export SWA_ENABLED=0 +export LATE_QAT=0 +export TTT_ENABLED=0 +export CORE_START=3 +export CORE_END=8 +export CORE_QUANT_ENABLED=0 +export NUM_PASSES=4 +export WANDB_PROJECT="parameter-golf" +export WANDB_NAME="test_4pass_noRMS_j0.1_80GBcap" + +LOG="/home/nesta/parameter-golf/test_4pass_noRMS_j0.1.log" +echo "START 4-pass no-RMSnorm jac=0.1, 80GB memory cap ($(date +%H:%M:%S))" + +$PYTHON train_gpt_recurrent.py \ + --feedback-mode diagonal --feedback-rank 2 \ + --residual-scale-init 0.5 \ + --jacobian-proxy-weight 0.1 \ + --no-interpass-rmsnorm \ + > "$LOG" 2>&1 + +EXIT=$? 
+if [ $EXIT -ne 0 ]; then + echo "FAILED (exit=$EXIT) — check $LOG for details" + tail -20 "$LOG" +else + BPB_50=$(grep 'step:50/.*val_bpb:' "$LOG" | head -1 | sed 's/.*val_bpb:\([0-9.]*\).*/\1/' || echo "N/A") + INT6_BPB=$(grep 'final_int6_roundtrip_exact.*val_bpb:' "$LOG" | head -1 | sed 's/.*val_bpb:\([0-9.]*\).*/\1/' || echo "N/A") + MEM=$(grep 'peak memory' "$LOG" | head -1 | sed 's/.*allocated: \([0-9]*\) MiB.*/\1/' || echo "N/A") + STEP_AVG=$(grep 'step:50/.*step_avg:' "$LOG" | head -1 | sed 's/.*step_avg:\([0-9.]*\)ms.*/\1/' || echo "N/A") + echo "DONE => bpb@50=$BPB_50 int6=$INT6_BPB step=${STEP_AVG}ms mem=${MEM}MiB" +fi + +echo "FINISHED ($(date +%H:%M:%S))" diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_4pass_ttt.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_4pass_ttt.sh new file mode 100755 index 0000000000..bcc873a48a --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_4pass_ttt.sh @@ -0,0 +1,86 @@ +#!/bin/bash +set -uo pipefail + +PYTHON="/home/nesta/parameter-golf/.venv/bin/python3" +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +cd "$SCRIPT_DIR" + +set -a; source /home/nesta/parameter-golf/.env; set +a + +export PYTHONUNBUFFERED=1 +export TORCH_COMPILE_DISABLE=1 +export CUDA_MEM_FRACTION=0.572 +export DATA_PATH="../../../data/datasets/fineweb10B_sp1024" +export TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model" +export SEED=1337 +export ITERATIONS=50 +export MAX_WALLCLOCK_SECONDS=900 +export VAL_LOSS_EVERY=25 +export TRAIN_LOG_EVERY=10 +export WARMUP_STEPS=5 +export WARMDOWN_ITERS=10 +export TRAIN_BATCH_TOKENS=786432 +export TRAIN_SEQ_LEN=2048 +export EVAL_SEQ_LEN=2048 +export EVAL_STRIDE=64 +export NUM_LAYERS=11 +export MODEL_DIM=512 +export NUM_HEADS=8 +export NUM_KV_HEADS=4 +export BIGRAM_VOCAB_SIZE=1536 +export XSA_LAST_N=4 +export ROPE_DIMS=16 +export LN_SCALE=1 +export VE_ENABLED=1 +export VE_DIM=128 +export VE_LAYERS="9,10" +export MATRIX_LR=0.025 +export SCALAR_LR=0.025 +export TIED_EMBED_LR=0.035 +export MUON_MOMENTUM=0.99 +export MUON_MOMENTUM_WARMUP_START=0.92 +export MUON_MOMENTUM_WARMUP_STEPS=5 +export MUON_WD=0.04 +export ADAM_WD=0.04 +export GRAD_CLIP_NORM=0.3 +export SWA_ENABLED=0 +export LATE_QAT=0 +export CORE_START=3 +export CORE_END=8 +export CORE_QUANT_ENABLED=0 +export NUM_PASSES=4 +export TTT_ENABLED=1 +export TTT_LR=0.002 +export TTT_EPOCHS=3 +export TTT_CHUNK_TOKENS=32768 +export TTT_FREEZE_BLOCKS=2 +export TTT_MOMENTUM=0.9 +export TTT_BATCH_SEQS=32 +export TTT_GRAD_CLIP=1.0 +export WANDB_PROJECT="parameter-golf" +export WANDB_NAME="test_4pass_noRMS_j0.1_TTT" + +LOG="/home/nesta/parameter-golf/test_4pass_ttt.log" +echo "START 4-pass no-RMSnorm jac=0.1 + TTT, 80GB cap ($(date +%H:%M:%S))" + +$PYTHON train_gpt_recurrent.py \ + --feedback-mode diagonal --feedback-rank 2 \ + --residual-scale-init 0.5 \ + --jacobian-proxy-weight 0.1 \ + --no-interpass-rmsnorm \ + > "$LOG" 2>&1 + +EXIT=$? 
+if [ $EXIT -ne 0 ]; then + echo "FAILED (exit=$EXIT)" + tail -20 "$LOG" +else + BPB_50=$(grep 'step:50/.*val_bpb:' "$LOG" | head -1 | sed 's/.*val_bpb:\([0-9.]*\).*/\1/' || echo "N/A") + INT6_BPB=$(grep 'final_int6_roundtrip_exact.*val_bpb:' "$LOG" | head -1 | sed 's/.*val_bpb:\([0-9.]*\).*/\1/' || echo "N/A") + TTT_BPB=$(grep 'legal_ttt_exact.*val_bpb:' "$LOG" | head -1 | sed 's/.*val_bpb:\([0-9.]*\).*/\1/' || echo "N/A") + SW_BPB=$(grep 'final_int6_sliding_window_exact.*val_bpb:' "$LOG" | head -1 | sed 's/.*val_bpb:\([0-9.]*\).*/\1/' || echo "N/A") + MEM=$(grep 'peak memory' "$LOG" | head -1 | sed 's/.*allocated: \([0-9]*\) MiB.*/\1/' || echo "N/A") + echo "DONE => bpb@50=$BPB_50 int6=$INT6_BPB sw=$SW_BPB ttt=$TTT_BPB mem=${MEM}MiB" +fi + +echo "FINISHED ($(date +%H:%M:%S))" diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_full_1gpu.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_full_1gpu.sh new file mode 100755 index 0000000000..5e1d43b315 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_full_1gpu.sh @@ -0,0 +1,67 @@ +#!/bin/bash +set -euo pipefail + +PYTHON="/home/nesta/parameter-golf/.venv/bin/python3" +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +cd "$SCRIPT_DIR" + +MINUTES="${MINUTES:-80}" +WALLCLOCK=$((MINUTES * 60)) +SEED="${SEED:-1337}" + +echo "============================================================" +echo " Full 1-GPU run: RecurrentSOTA + Learned Feedback" +echo " Wall clock: ${MINUTES} minutes (${WALLCLOCK}s)" +echo " Seed: ${SEED}" +echo "============================================================" + +PYTHONUNBUFFERED=1 \ +DATA_PATH="../../../data/datasets/fineweb10B_sp1024" \ +TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model" \ +SEED="${SEED}" \ +ITERATIONS=20000 \ +MAX_WALLCLOCK_SECONDS="${WALLCLOCK}" \ +VAL_LOSS_EVERY=2000 \ +TRAIN_LOG_EVERY=200 \ +WARMUP_STEPS=20 \ +WARMDOWN_ITERS=3500 \ +TRAIN_BATCH_TOKENS=786432 \ +TRAIN_SEQ_LEN=2048 \ +EVAL_SEQ_LEN=2048 \ +EVAL_STRIDE=64 \ +NUM_LAYERS=11 \ +MODEL_DIM=512 \ +NUM_HEADS=8 \ +NUM_KV_HEADS=4 \ +BIGRAM_VOCAB_SIZE=1536 \ +XSA_LAST_N=4 \ +ROPE_DIMS=16 \ +LN_SCALE=1 \ +VE_ENABLED=1 \ +VE_DIM=128 \ +VE_LAYERS="9,10" \ +MATRIX_LR=0.025 \ +SCALAR_LR=0.025 \ +TIED_EMBED_LR=0.035 \ +MUON_MOMENTUM=0.99 \ +MUON_MOMENTUM_WARMUP_START=0.92 \ +MUON_MOMENTUM_WARMUP_STEPS=1500 \ +MUON_WD=0.04 \ +ADAM_WD=0.04 \ +GRAD_CLIP_NORM=0.3 \ +SWA_ENABLED=1 \ +SWA_EVERY=50 \ +LATE_QAT=1 \ +LATE_QAT_THRESHOLD=0.15 \ +TTT_ENABLED=0 \ +CORE_START=3 \ +CORE_END=8 \ +NUM_PASSES=2 \ +CORE_QUANT_ENABLED=0 \ +$PYTHON train_gpt_recurrent.py \ + --feedback-mode diagonal --feedback-rank 2 \ + --residual-scale-init 0.5 \ + --jacobian-proxy-weight 0.01 + +echo "" +echo "Run complete." 
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/smoke_passes.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/smoke_passes.sh new file mode 100755 index 0000000000..afb1c9997b --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/smoke_passes.sh @@ -0,0 +1,63 @@ +#!/bin/bash +set -euo pipefail + +PYTHON="/home/nesta/parameter-golf/.venv/bin/python3" +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +cd "$SCRIPT_DIR" + +export PYTHONUNBUFFERED=1 +export TORCH_COMPILE_DISABLE=1 +export DATA_PATH="../../../data/datasets/fineweb10B_sp1024" +export TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model" +export SEED=1337 +export ITERATIONS=30 +export MAX_WALLCLOCK_SECONDS=600 +export VAL_LOSS_EVERY=15 +export TRAIN_LOG_EVERY=10 +export WARMUP_STEPS=5 +export WARMDOWN_ITERS=5 +export TRAIN_BATCH_TOKENS=786432 +export TRAIN_SEQ_LEN=2048 +export EVAL_SEQ_LEN=2048 +export EVAL_STRIDE=0 +export NUM_LAYERS=11 +export MODEL_DIM=512 +export NUM_HEADS=8 +export NUM_KV_HEADS=4 +export BIGRAM_VOCAB_SIZE=1536 +export XSA_LAST_N=4 +export ROPE_DIMS=16 +export LN_SCALE=1 +export VE_ENABLED=1 +export VE_DIM=128 +export VE_LAYERS="9,10" +export MATRIX_LR=0.025 +export SCALAR_LR=0.025 +export TIED_EMBED_LR=0.035 +export MUON_MOMENTUM=0.99 +export MUON_MOMENTUM_WARMUP_START=0.92 +export MUON_MOMENTUM_WARMUP_STEPS=5 +export MUON_WD=0.04 +export ADAM_WD=0.04 +export GRAD_CLIP_NORM=0.3 +export SWA_ENABLED=0 +export LATE_QAT=0 +export TTT_ENABLED=0 +export CORE_START=3 +export CORE_END=8 +export CORE_QUANT_ENABLED=0 + +for PASSES in 2 3 4; do + echo "" + echo "========================================" + echo " NUM_PASSES=$PASSES (30 steps)" + echo "========================================" + export NUM_PASSES=$PASSES + $PYTHON train_gpt_recurrent.py \ + --feedback-mode diagonal --feedback-rank 2 \ + --residual-scale-init 0.5 \ + --jacobian-proxy-weight 0.01 +done + +echo "" +echo "=== All pass tests complete ===" diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/smoke_test.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/smoke_test.sh new file mode 100755 index 0000000000..cac6682283 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/smoke_test.sh @@ -0,0 +1,58 @@ +#!/bin/bash +set -euo pipefail + +PYTHON="/home/nesta/parameter-golf/.venv/bin/python3" +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +cd "$SCRIPT_DIR" + +echo "=== Smoke test: same params as full run, 50 iterations, torch.compile ENABLED ===" + +PYTHONUNBUFFERED=1 \ +DATA_PATH="../../../data/datasets/fineweb10B_sp1024" \ +TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model" \ +SEED=1337 \ +ITERATIONS=50 \ +MAX_WALLCLOCK_SECONDS=600 \ +VAL_LOSS_EVERY=25 \ +TRAIN_LOG_EVERY=10 \ +WARMUP_STEPS=5 \ +WARMDOWN_ITERS=10 \ +TRAIN_BATCH_TOKENS=786432 \ +TRAIN_SEQ_LEN=2048 \ +EVAL_SEQ_LEN=2048 \ +EVAL_STRIDE=0 \ +NUM_LAYERS=11 \ +MODEL_DIM=512 \ +NUM_HEADS=8 \ +NUM_KV_HEADS=4 \ +BIGRAM_VOCAB_SIZE=1536 \ +XSA_LAST_N=4 \ +ROPE_DIMS=16 \ +LN_SCALE=1 \ +VE_ENABLED=1 \ +VE_DIM=128 \ +VE_LAYERS="9,10" \ +MATRIX_LR=0.025 \ +SCALAR_LR=0.025 \ +TIED_EMBED_LR=0.035 \ +MUON_MOMENTUM=0.99 \ +MUON_MOMENTUM_WARMUP_START=0.92 \ +MUON_MOMENTUM_WARMUP_STEPS=5 \ +MUON_WD=0.04 \ +ADAM_WD=0.04 \ +GRAD_CLIP_NORM=0.3 \ +SWA_ENABLED=1 \ +SWA_EVERY=10 \ +LATE_QAT=0 \ +TTT_ENABLED=0 \ +CORE_START=3 \ +CORE_END=8 \ +NUM_PASSES=2 \ +CORE_QUANT_ENABLED=0 \ +$PYTHON train_gpt_recurrent.py \ + --feedback-mode diagonal --feedback-rank 2 \ + 
--residual-scale-init 0.5 \ + --jacobian-proxy-weight 0.01 + +echo "" +echo "=== Smoke test complete ===" diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/stability.py b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/stability.py new file mode 100644 index 0000000000..a02c831638 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/stability.py @@ -0,0 +1,108 @@ +"""Stability monitoring and control for recurrent passes. + +Provides per-pass diagnostics, hidden-state clipping, learnable residual +scaling, and a cheap Jacobian proxy regulariser. +""" +from __future__ import annotations +import torch +import torch.nn as nn +from torch import Tensor +from dataclasses import dataclass, field + + +@dataclass +class PassDiagnostics: + h_norms: list[float] = field(default_factory=list) + delta_norms: list[float] = field(default_factory=list) + error_norms: list[float] = field(default_factory=list) + correction_norms: list[float] = field(default_factory=list) + growth_ratios: list[float] = field(default_factory=list) + + def reset(self): + for lst in (self.h_norms, self.delta_norms, self.error_norms, + self.correction_norms, self.growth_ratios): + lst.clear() + + def summary(self) -> dict[str, list[float]]: + return { + "h_norms": list(self.h_norms), + "delta_norms": list(self.delta_norms), + "error_norms": list(self.error_norms), + "correction_norms": list(self.correction_norms), + "growth_ratios": list(self.growth_ratios), + } + + +class RecurrentStabilizer: + """Manages stability diagnostics and optional controls for recurrence.""" + + def __init__( + self, + clip_hidden: bool = False, + clip_value: float = 10.0, + clip_mode: str = "value", + jacobian_proxy_weight: float = 0.0, + eps: float = 1e-6, + ): + self.clip_hidden = clip_hidden + self.clip_value = clip_value + self.clip_mode = clip_mode + self.jacobian_proxy_weight = jacobian_proxy_weight + self.eps = eps + self.diagnostics = PassDiagnostics() + + def clip(self, h: Tensor) -> Tensor: + if not self.clip_hidden: + return h + if self.clip_mode == "value": + return torch.clamp(h, -self.clip_value, self.clip_value) + norm = h.norm(dim=-1, keepdim=True) + scale = torch.clamp(self.clip_value / (norm + self.eps), max=1.0) + return h * scale + + def record_pass( + self, + h_prev: Tensor, + h_next: Tensor, + error: Tensor | None = None, + correction: Tensor | None = None, + ): + with torch.no_grad(): + h_pn = h_prev.float().norm().item() + h_nn = h_next.float().norm().item() + self.diagnostics.h_norms.append(h_nn) + self.diagnostics.delta_norms.append( + (h_next - h_prev).float().norm().item() + ) + self.diagnostics.growth_ratios.append(h_nn / (h_pn + self.eps)) + if error is not None: + self.diagnostics.error_norms.append(error.float().norm().item()) + if correction is not None: + self.diagnostics.correction_norms.append( + correction.float().norm().item() + ) + + def jacobian_proxy_loss(self, h_in: Tensor, h_out: Tensor) -> Tensor: + """Finite-difference proxy for Jacobian spectral norm.""" + if self.jacobian_proxy_weight <= 0: + return h_in.new_zeros(()) + delta = h_out - h_in + ratio = delta.norm() / (h_in.norm() + self.eps) + return self.jacobian_proxy_weight * torch.relu(ratio - 1.0).square() + + def reset(self): + self.diagnostics.reset() + + +class ResidualScale(nn.Module): + """Learnable per-pass residual scaling: + h_{k+1} = h_k + alpha_k * F(h_k + c_k)""" + + def __init__(self, num_passes: int, init_value: float = 1.0): + super().__init__() + self.scales = nn.Parameter( + 
torch.full((num_passes,), init_value, dtype=torch.float32) + ) + + def forward(self, residual: Tensor, pass_idx: int) -> Tensor: + return self.scales[pass_idx].to(dtype=residual.dtype) * residual diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py new file mode 100644 index 0000000000..1eebea808a --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py @@ -0,0 +1,2084 @@ +from __future__ import annotations +import copy +import glob +import io +import lzma +import math +import os +import random +import subprocess +import sys +import time +import uuid +import zlib +from pathlib import Path +try: + import zstandard + _COMPRESSOR = "zstd" +except ImportError: + _COMPRESSOR = "zlib" +import numpy as np +import sentencepiece as spm +import torch +import torch.distributed as dist +import torch.nn.functional as F +from torch import Tensor, nn +from torch.nn.parallel import DistributedDataParallel as DDP +_gpu_mem_frac = float(os.environ.get("CUDA_MEM_FRACTION", "0")) +if _gpu_mem_frac > 0: + torch.cuda.set_per_process_memory_fraction(_gpu_mem_frac, 0) +from flash_attn_interface import flash_attn_func as flash_attn_3_func +import argparse +from feedback import ErrorFeedbackModule +from stability import RecurrentStabilizer, ResidualScale +try: + import wandb as _wandb +except ImportError: + _wandb = None +class Hyperparameters: + data_path = os.environ.get("DATA_PATH", "./data/datasets/fineweb10B_sp1024") + train_files = os.path.join(data_path, "fineweb_train_*.bin") + val_files = os.path.join(data_path, "fineweb_val_*.bin") + tokenizer_path = os.environ.get("TOKENIZER_PATH", "./data/tokenizers/fineweb_1024_bpe.model") + run_id = os.environ.get("RUN_ID", str(uuid.uuid4())) + seed = int(os.environ.get("SEED", 1337)) + val_batch_size = int(os.environ.get("VAL_BATCH_SIZE", 524_288)) + val_loss_every = int(os.environ.get("VAL_LOSS_EVERY", 4000)) + train_log_every = int(os.environ.get("TRAIN_LOG_EVERY", 500)) + iterations = int(os.environ.get("ITERATIONS", 20000)) + warmdown_iters = int(os.environ.get("WARMDOWN_ITERS", 3500)) + warmup_steps = int(os.environ.get("WARMUP_STEPS", 20)) + train_batch_tokens = int(os.environ.get("TRAIN_BATCH_TOKENS", 786_432)) + train_seq_len = int(os.environ.get("TRAIN_SEQ_LEN", 2048)) + eval_seq_len = int(os.environ.get("EVAL_SEQ_LEN", 2048)) + max_wallclock_seconds = float(os.environ.get("MAX_WALLCLOCK_SECONDS", 600.0)) + qk_gain_init = float(os.environ.get("QK_GAIN_INIT", 1.5)) + vocab_size = int(os.environ.get("VOCAB_SIZE", 1024)) + num_layers = int(os.environ.get("NUM_LAYERS", 11)) + num_kv_heads = int(os.environ.get("NUM_KV_HEADS", 4)) + model_dim = int(os.environ.get("MODEL_DIM", 512)) + num_heads = int(os.environ.get("NUM_HEADS", 8)) + mlp_mult = float(os.environ.get("MLP_MULT", 3.0)) + tie_embeddings = bool(int(os.environ.get("TIE_EMBEDDINGS", "1"))) + rope_base = float(os.environ.get("ROPE_BASE", 10000.0)) + logit_softcap = float(os.environ.get("LOGIT_SOFTCAP", 30.0)) + embed_lr = float(os.environ.get("EMBED_LR", 0.6)) + head_lr = float(os.environ.get("HEAD_LR", 0.008)) + tied_embed_lr = float(os.environ.get("TIED_EMBED_LR", 0.035)) + tied_embed_init_std = float(os.environ.get("TIED_EMBED_INIT_STD", 0.005)) + matrix_lr = float(os.environ.get("MATRIX_LR", 0.025)) + scalar_lr = float(os.environ.get("SCALAR_LR", 0.025)) + muon_momentum = float(os.environ.get("MUON_MOMENTUM", 0.99)) + 
muon_backend_steps = int(os.environ.get("MUON_BACKEND_STEPS", 5)) + muon_momentum_warmup_start = float(os.environ.get("MUON_MOMENTUM_WARMUP_START", 0.92)) + muon_momentum_warmup_steps = int(os.environ.get("MUON_MOMENTUM_WARMUP_STEPS", 1500)) + beta1 = float(os.environ.get("BETA1", 0.9)) + beta2 = float(os.environ.get("BETA2", 0.95)) + adam_eps = float(os.environ.get("ADAM_EPS", 1e-8)) + grad_clip_norm = float(os.environ.get("GRAD_CLIP_NORM", 0.3)) + eval_stride = int(os.environ.get("EVAL_STRIDE", 64)) + mtp_num_heads = int(os.environ.get("MTP_NUM_HEADS", 0)) + mtp_loss_weight = float(os.environ.get("MTP_LOSS_WEIGHT", 0.2)) + muon_beta2 = float(os.environ.get("MUON_BETA2", 0.95)) + swa_enabled = bool(int(os.environ.get("SWA_ENABLED", "1"))) + swa_every = int(os.environ.get("SWA_EVERY", 50)) + lawa_enabled = bool(int(os.environ.get("LAWA_ENABLED", "0"))) + lawa_k = int(os.environ.get("LAWA_K", 10)) + lawa_freq = int(os.environ.get("LAWA_FREQ", 100)) + muon_wd = float(os.environ.get("MUON_WD", 0.04)) + adam_wd = float(os.environ.get("ADAM_WD", 0.04)) + qat_enabled = bool(int(os.environ.get("QAT_ENABLED", "0"))) + bigram_vocab_size = int(os.environ.get("BIGRAM_VOCAB_SIZE", 2048)) + bigram_dim = int(os.environ.get("BIGRAM_DIM", 128)) + xsa_last_n = int(os.environ.get("XSA_LAST_N", 4)) + rope_dims = int(os.environ.get("ROPE_DIMS", 16)) + ln_scale = bool(int(os.environ.get("LN_SCALE", "1"))) + dtg_enabled = bool(int(os.environ.get("DTG_ENABLED", "0"))) + late_qat_threshold = float(os.environ.get("LATE_QAT_THRESHOLD", 0.15)) + ve_enabled = bool(int(os.environ.get("VE_ENABLED", "1"))) + ve_dim = int(os.environ.get("VE_DIM", 128)) + ve_layers = os.environ.get("VE_LAYERS", "9,10") + gated_attention = bool(int(os.environ.get("GATED_ATTENTION", "0"))) + value_residual = bool(int(os.environ.get("VALUE_RESIDUAL", "0"))) + ttt_enabled = bool(int(os.environ.get("TTT_ENABLED", "0"))) + ttt_lr = float(os.environ.get("TTT_LR", 0.002)) + ttt_epochs = int(os.environ.get("TTT_EPOCHS", 3)) + ttt_chunk_tokens = int(os.environ.get("TTT_CHUNK_TOKENS", 32768)) + ttt_freeze_blocks = int(os.environ.get("TTT_FREEZE_BLOCKS", 2)) + ttt_momentum = float(os.environ.get("TTT_MOMENTUM", 0.9)) + ttt_batch_seqs = int(os.environ.get("TTT_BATCH_SEQS", 32)) + ttt_grad_clip = float(os.environ.get("TTT_GRAD_CLIP", 1.0)) + # recurrence + core_start = int(os.environ.get("CORE_START", 3)) + core_end = int(os.environ.get("CORE_END", 8)) + num_passes = int(os.environ.get("NUM_PASSES", 1)) + core_quant_bits = int(os.environ.get("CORE_QUANT_BITS", 6)) + core_quant_enabled = bool(int(os.environ.get("CORE_QUANT_ENABLED", "0"))) + +# --- Batched Newton-Schulz orthogonalization --- + +def zeropower_via_newtonschulz5(G: Tensor, steps: int = 5, eps: float = 1e-7) -> Tensor: + """Batched Newton-Schulz orthogonalization. G: (B,M,N) or (M,N).""" + a, b, c = (3.4445, -4.7750, 2.0315) + was_2d = G.ndim == 2 + if was_2d: + G = G.unsqueeze(0) + X = G.bfloat16() + transposed = X.size(-2) > X.size(-1) + if transposed: + X = X.mT + X = X / (X.norm(dim=(-2, -1), keepdim=True) + eps) + for _ in range(steps): + A = X @ X.mT + B = b * A + c * (A @ A) + X = a * X + B @ X + if transposed: + X = X.mT + if was_2d: + X = X.squeeze(0) + return X + +# --- Parallel Muon optimizer --- + +class Muon(torch.optim.Optimizer): + """Parallel Muon: post-backward reduce-scatter -> local NS5 -> all-gather. + + No DDP for bank params. After backward, this optimizer: + 1. Launches async reduce-scatter for all banks (biggest first) + 2. 
Returns control so Adam can step on small params while RS is in-flight + 3. Waits for each RS, runs local NS5 on the shard, launches async all-gather + 4. Each all-gather overlaps with next bank's NS5 + """ + def __init__(self, params, lr: float, momentum: float, backend_steps: int, + nesterov: bool = True, weight_decay: float = 0.0): + super().__init__( + params, + dict(lr=lr, momentum=momentum, backend_steps=backend_steps, + nesterov=nesterov, weight_decay=weight_decay), + ) + self._built = False + + def _build(self): + self._distributed = dist.is_available() and dist.is_initialized() + self._world_size = dist.get_world_size() if self._distributed else 1 + self._rank = dist.get_rank() if self._distributed else 0 + ws = self._world_size + + self._bank_meta = [] + for group in self.param_groups: + for p in group["params"]: + B = p.shape[0] + padded_B = ((B + ws - 1) // ws) * ws + shard_B = padded_B // ws + tail = p.shape[1:] + dev = p.device + self._bank_meta.append({ + 'p': p, + 'B': B, + 'padded_grad': torch.zeros(padded_B, *tail, device=dev, dtype=torch.bfloat16), + 'shard': torch.zeros(shard_B, *tail, device=dev, dtype=torch.bfloat16), + 'shard_mom': torch.zeros(shard_B, *tail, device=dev, dtype=torch.bfloat16), + 'full_update': torch.zeros(padded_B, *tail, device=dev, dtype=torch.bfloat16), + 'scale': max(1, p.shape[-2] / p.shape[-1]) ** 0.5, + }) + # Sort by size descending -- launch biggest reduce-scatters first + self._bank_meta.sort(key=lambda m: -m['p'].numel()) + self._built = True + + def launch_reduce_scatters(self): + """Phase 1: launch async reduce-scatter for all banks. Call right after backward.""" + if not self._built: + self._build() + if not self._distributed: + return + self._rs_futures = [] + for m in self._bank_meta: + p = m['p'] + if p.grad is None: + self._rs_futures.append(None) + continue + pg = m['padded_grad'] + pg[:m['B']].copy_(p.grad.bfloat16()) + if pg.shape[0] > m['B']: + pg[m['B']:].zero_() + fut = dist.reduce_scatter_tensor(m['shard'], pg, op=dist.ReduceOp.AVG, async_op=True) + self._rs_futures.append(fut) + + @torch.no_grad() + def step(self, closure=None): + """Phase 3: wait for RS, local NS5, all-gather. 
Call AFTER Adam steps.""" + loss = None + if closure is not None: + with torch.enable_grad(): + loss = closure() + + if not self._built: + self._build() + + for group in self.param_groups: + lr = group["lr"] + momentum = group["momentum"] + backend_steps = group["backend_steps"] + nesterov = group["nesterov"] + wd = group.get("weight_decay", 0.0) + + prev_ag_handle = None + prev_m = None + + sharded = self._distributed and hasattr(self, '_rs_futures') + + for i, m in enumerate(self._bank_meta): + p = m['p'] + if p.grad is None: + continue + + if prev_ag_handle is not None: + prev_ag_handle.wait() + pp = prev_m['p'] + upd = prev_m['full_update'][:prev_m['B']] + if wd > 0.0: + pp.data.mul_(1.0 - lr * wd) + pp.add_(upd.to(dtype=pp.dtype), alpha=-lr * prev_m['scale']) + + if sharded and self._rs_futures[i] is not None: + self._rs_futures[i].wait() + g = m['shard'] + buf = m['shard_mom'] + else: + g = p.grad.bfloat16() + state = self.state[p] + if "momentum_buffer" not in state: + state["momentum_buffer"] = torch.zeros_like(g) + buf = state["momentum_buffer"] + + buf.mul_(momentum).add_(g) + if nesterov: + update = g.add(buf, alpha=momentum) + else: + update = buf + + update = zeropower_via_newtonschulz5(update, steps=backend_steps) + + if sharded: + prev_ag_handle = dist.all_gather_into_tensor( + m['full_update'], update, async_op=True) + prev_m = m + else: + if wd > 0.0: + p.data.mul_(1.0 - lr * wd) + p.add_(update.to(dtype=p.dtype), alpha=-lr * m['scale']) + + if prev_ag_handle is not None: + prev_ag_handle.wait() + pp = prev_m['p'] + upd = prev_m['full_update'][:prev_m['B']] + if wd > 0.0: + pp.data.mul_(1.0 - lr * wd) + pp.add_(upd.to(dtype=pp.dtype), alpha=-lr * prev_m['scale']) + + if hasattr(self, '_rs_futures'): + del self._rs_futures + + return loss + +# --- Tokenizer evaluation helpers --- + +def build_sentencepiece_luts( + sp: spm.SentencePieceProcessor, vocab_size: int, device: torch.device +) -> tuple[Tensor, Tensor, Tensor]: + sp_vocab_size = int(sp.vocab_size()) + table_size = max(sp_vocab_size, vocab_size) + base_bytes_np = np.zeros((table_size,), dtype=np.int16) + has_leading_space_np = np.zeros((table_size,), dtype=np.bool_) + is_boundary_token_np = np.ones((table_size,), dtype=np.bool_) + for token_id in range(sp_vocab_size): + if sp.is_control(token_id) or sp.is_unknown(token_id) or sp.is_unused(token_id): + continue + is_boundary_token_np[token_id] = False + if sp.is_byte(token_id): + base_bytes_np[token_id] = 1 + continue + piece = sp.id_to_piece(token_id) + if piece.startswith("\u2581"): + has_leading_space_np[token_id] = True + piece = piece[1:] + base_bytes_np[token_id] = len(piece.encode("utf-8")) + return ( + torch.tensor(base_bytes_np, dtype=torch.int16, device=device), + torch.tensor(has_leading_space_np, dtype=torch.bool, device=device), + torch.tensor(is_boundary_token_np, dtype=torch.bool, device=device), + ) +def load_validation_tokens(pattern: str, seq_len: int) -> Tensor: + files = [Path(p) for p in sorted(glob.glob(pattern))] + if not files: + raise FileNotFoundError(f"No files found for pattern: {pattern}") + tokens = torch.cat([load_data_shard(file) for file in files]).contiguous() + usable = ((tokens.numel() - 1) // seq_len) * seq_len + if usable <= 0: + raise ValueError(f"Validation split is too short for TRAIN_SEQ_LEN={seq_len}") + return tokens[: usable + 1] +def eval_val( + args: Hyperparameters, + model: nn.Module, + rank: int, + world_size: int, + device: torch.device, + grad_accum_steps: int, + val_tokens: Tensor, + base_bytes_lut: Tensor, + 
has_leading_space_lut: Tensor, + is_boundary_token_lut: Tensor, + eval_seq_len: int | None = None, +) -> tuple[float, float]: + seq_len = eval_seq_len or args.train_seq_len + local_batch_tokens = args.val_batch_size // (world_size * grad_accum_steps) + if local_batch_tokens < seq_len: + raise ValueError( + "VAL_BATCH_SIZE must provide at least one sequence per rank; " + f"got VAL_BATCH_SIZE={args.val_batch_size}, WORLD_SIZE={world_size}, " + f"GRAD_ACCUM_STEPS={grad_accum_steps}, seq_len={seq_len}" + ) + local_batch_seqs = local_batch_tokens // seq_len + total_seqs = (val_tokens.numel() - 1) // seq_len + seq_start = (total_seqs * rank) // world_size + seq_end = (total_seqs * (rank + 1)) // world_size + val_loss_sum = torch.zeros((), device=device, dtype=torch.float64) + val_token_count = torch.zeros((), device=device, dtype=torch.float64) + val_byte_count = torch.zeros((), device=device, dtype=torch.float64) + model.eval() + with torch.inference_mode(): + for batch_seq_start in range(seq_start, seq_end, local_batch_seqs): + batch_seq_end = min(batch_seq_start + local_batch_seqs, seq_end) + raw_start = batch_seq_start * seq_len + raw_end = batch_seq_end * seq_len + 1 + local = val_tokens[raw_start:raw_end].to(device=device, dtype=torch.int64, non_blocking=True) + x = local[:-1].reshape(-1, seq_len) + y = local[1:].reshape(-1, seq_len) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + batch_loss = model(x, y).detach() + batch_token_count = float(y.numel()) + val_loss_sum += batch_loss.to(torch.float64) * batch_token_count + val_token_count += batch_token_count + prev_ids = x.reshape(-1) + tgt_ids = y.reshape(-1) + token_bytes = base_bytes_lut[tgt_ids].to(dtype=torch.int16) + token_bytes += (has_leading_space_lut[tgt_ids] & ~is_boundary_token_lut[prev_ids]).to(dtype=torch.int16) + val_byte_count += token_bytes.to(torch.float64).sum() + if dist.is_available() and dist.is_initialized(): + dist.all_reduce(val_loss_sum, op=dist.ReduceOp.SUM) + dist.all_reduce(val_token_count, op=dist.ReduceOp.SUM) + dist.all_reduce(val_byte_count, op=dist.ReduceOp.SUM) + val_loss = val_loss_sum / val_token_count + bits_per_token = val_loss.item() / math.log(2.0) + tokens_per_byte = val_token_count.item() / val_byte_count.item() + model.train() + return float(val_loss.item()), float(bits_per_token * tokens_per_byte) + +# --- Quantization helpers --- + +CONTROL_TENSOR_NAME_PATTERNS = tuple( + pattern + for pattern in os.environ.get( + "CONTROL_TENSOR_NAME_PATTERNS", + "attn_scale,attn_scales,mlp_scale,mlp_scales,resid_mix,resid_mixes,q_gain,skip_weight,skip_weights,smear,dtg_gate,ve_layer_scales,ve_shared.scale,attn_gate,vr_lambda", + ).split(",") + if pattern +) +INT8_KEEP_FLOAT_FP32_NAME_PATTERNS = tuple( + pattern + for pattern in os.environ.get( + "INT8_KEEP_FLOAT_FP32_NAME_PATTERNS", + ",".join(CONTROL_TENSOR_NAME_PATTERNS), + ).split(",") + if pattern +) +INT8_KEEP_FLOAT_MAX_NUMEL = 65_536 +INT8_KEEP_FLOAT_STORE_DTYPE = torch.float16 +INT8_PER_ROW_SCALE_DTYPE = torch.float16 +INT8_CLIP_PERCENTILE = 99.99984 +INT8_CLIP_Q = INT8_CLIP_PERCENTILE / 100.0 +def tensor_nbytes(t: Tensor) -> int: + return int(t.numel()) * int(t.element_size()) +def keep_float_tensor(name: str, t: Tensor, passthrough_orig_dtypes: dict[str, str]) -> Tensor: + if any(pattern in name for pattern in INT8_KEEP_FLOAT_FP32_NAME_PATTERNS): + return t.float().contiguous() + if t.dtype in {torch.float32, torch.bfloat16}: + passthrough_orig_dtypes[name] = str(t.dtype).removeprefix("torch.") + return 
t.to(dtype=INT8_KEEP_FLOAT_STORE_DTYPE).contiguous()
+    return t
+def quantize_float_tensor(t: Tensor) -> tuple[Tensor, Tensor]:
+    t32 = t.float()
+    if t32.ndim == 2:
+        clip_abs = (
+            torch.quantile(t32.abs(), INT8_CLIP_Q, dim=1)
+            if t32.numel()
+            else torch.empty((t32.shape[0],), dtype=torch.float32)
+        )
+        clipped = torch.maximum(torch.minimum(t32, clip_abs[:, None]), -clip_abs[:, None])
+        scale = (clip_abs / 127.0).clamp_min(1.0 / 127.0)
+        q = torch.clamp(torch.round(clipped / scale[:, None]), -127, 127).to(torch.int8).contiguous()
+        return q, scale.to(dtype=INT8_PER_ROW_SCALE_DTYPE).contiguous()
+    clip_abs = float(torch.quantile(t32.abs().flatten(), INT8_CLIP_Q).item()) if t32.numel() else 0.0
+    scale = torch.tensor(clip_abs / 127.0 if clip_abs > 0 else 1.0, dtype=torch.float32)
+    q = torch.clamp(torch.round(torch.clamp(t32, -clip_abs, clip_abs) / scale), -127, 127).to(torch.int8).contiguous()
+    return q, scale
+def quantize_state_dict_int8(state_dict: dict[str, Tensor]):
+    quantized: dict[str, Tensor] = {}
+    scales: dict[str, Tensor] = {}
+    dtypes: dict[str, str] = {}
+    passthrough: dict[str, Tensor] = {}
+    passthrough_orig_dtypes: dict[str, str] = {}
+    qmeta: dict[str, dict[str, object]] = {}
+    stats = dict.fromkeys(
+        ("param_count", "num_tensors", "num_float_tensors", "num_nonfloat_tensors", "baseline_tensor_bytes", "int8_payload_bytes"),
+        0,
+    )
+    for name, tensor in state_dict.items():
+        t = tensor.detach().to("cpu").contiguous()
+        stats["param_count"] += int(t.numel())
+        stats["num_tensors"] += 1
+        stats["baseline_tensor_bytes"] += tensor_nbytes(t)
+        if not t.is_floating_point():
+            stats["num_nonfloat_tensors"] += 1
+            passthrough[name] = t
+            stats["int8_payload_bytes"] += tensor_nbytes(t)
+            continue
+        if t.numel() <= INT8_KEEP_FLOAT_MAX_NUMEL:
+            kept = keep_float_tensor(name, t, passthrough_orig_dtypes)
+            passthrough[name] = kept
+            stats["int8_payload_bytes"] += tensor_nbytes(kept)
+            continue
+        stats["num_float_tensors"] += 1
+        q, s = quantize_float_tensor(t)
+        if s.ndim > 0:
+            qmeta[name] = {"scheme": "per_row", "axis": 0}
+        quantized[name] = q
+        scales[name] = s
+        dtypes[name] = str(t.dtype).removeprefix("torch.")
+        stats["int8_payload_bytes"] += tensor_nbytes(q) + tensor_nbytes(s)
+    obj: dict[str, object] = {
+        "__quant_format__": "int8_clean_per_row_v1",
+        "quantized": quantized,
+        "scales": scales,
+        "dtypes": dtypes,
+        "passthrough": passthrough,
+    }
+    if qmeta:
+        obj["qmeta"] = qmeta
+    if passthrough_orig_dtypes:
+        obj["passthrough_orig_dtypes"] = passthrough_orig_dtypes
+    return obj, stats
+def dequantize_state_dict_int8(obj: dict[str, object]) -> dict[str, Tensor]:
+    out: dict[str, Tensor] = {}
+    qmeta = obj.get("qmeta", {})
+    passthrough_orig_dtypes = obj.get("passthrough_orig_dtypes", {})
+    for name, q in obj["quantized"].items():
+        dtype = getattr(torch, obj["dtypes"][name])
+        s = obj["scales"][name]
+        if qmeta.get(name, {}).get("scheme") == "per_row" or s.ndim > 0:
+            s = s.to(dtype=torch.float32)
+            out[name] = (q.float() * s.view(q.shape[0], *([1] * (q.ndim - 1)))).to(dtype=dtype).contiguous()
+        else:
+            scale = float(s.item())
+            out[name] = (q.float() * scale).to(dtype=dtype).contiguous()
+    for name, t in obj["passthrough"].items():
+        out_t = t.detach().to("cpu").contiguous()
+        orig_dtype = passthrough_orig_dtypes.get(name)
+        if isinstance(orig_dtype, str):
+            out_t = out_t.to(dtype=getattr(torch, orig_dtype)).contiguous()
+        out[name] = out_t
+    return out
+
+# --- Data loading ---
+
+def load_data_shard(file: Path) -> Tensor:
+    header_bytes = 256 * np.dtype("<i4").itemsize
+    # assumed shard layout: 256 little-endian int32 header words with the
+    # token count at header[2], followed by a uint16 token payload
+    with file.open("rb") as f:
+        header = np.frombuffer(f.read(header_bytes), dtype="<i4")
+        num_tokens = int(header[2])
+        tokens = np.frombuffer(f.read(num_tokens * np.dtype("<u2").itemsize), dtype="<u2")
+    return torch.from_numpy(tokens.astype(np.int32))
+class TokenStream:
+    def __init__(self, pattern: str):
+        self.files = [Path(p) for p in sorted(glob.glob(pattern))]
+        if not self.files:
+            raise FileNotFoundError(f"No files found for pattern: {pattern}")
+        self.file_idx = 0
+        self.tokens = load_data_shard(self.files[self.file_idx])
+        self.pos = 0
+    def _advance_file(self) -> None:
self.file_idx = (self.file_idx + 1) % len(self.files) + self.tokens = load_data_shard(self.files[self.file_idx]) + self.pos = 0 + def take(self, n: int) -> Tensor: + chunks: list[Tensor] = [] + remaining = n + while remaining > 0: + avail = self.tokens.numel() - self.pos + if avail <= 0: + self._advance_file() + continue + k = min(remaining, avail) + chunks.append(self.tokens[self.pos : self.pos + k]) + self.pos += k + remaining -= k + return chunks[0] if len(chunks) == 1 else torch.cat(chunks) +class DistributedTokenLoader: + def __init__(self, pattern: str, rank: int, world_size: int, device: torch.device): + self.rank = rank + self.world_size = world_size + self.device = device + self.stream = TokenStream(pattern) + def next_batch(self, global_tokens: int, seq_len: int, grad_accum_steps: int) -> tuple[Tensor, Tensor]: + local_tokens = global_tokens // (self.world_size * grad_accum_steps) + per_rank_span = local_tokens + 1 + chunk = self.stream.take(per_rank_span * self.world_size) + start = self.rank * per_rank_span + local = chunk[start : start + per_rank_span].to(dtype=torch.int64) + x = local[:-1].reshape(-1, seq_len) + y = local[1:].reshape(-1, seq_len) + return x.to(self.device, non_blocking=True), y.to(self.device, non_blocking=True) + +# --- Transformer modules --- + +class RMSNorm(nn.Module): + def __init__(self, eps: float | None = None): + super().__init__() + self.eps = eps + def forward(self, x: Tensor) -> Tensor: + return F.rms_norm(x, (x.size(-1),), eps=self.eps) +class CastedLinear(nn.Linear): + _qat_enabled: bool = False + def forward(self, x: Tensor) -> Tensor: + w = self.weight.to(x.dtype) + if CastedLinear._qat_enabled and self.training and w.ndim == 2: + with torch.no_grad(): + w32 = self.weight.float() + row_max = w32.abs().amax(dim=1) + scale = (row_max / 31.0).clamp_min(1.0 / 31.0) + w_q = (torch.clamp(torch.round(w32 / scale[:, None]), -32, 31) * scale[:, None]).to(x.dtype) + w = w + (w_q - w).detach() + bias = self.bias.to(x.dtype) if self.bias is not None else None + return F.linear(x, w, bias) +def restore_low_dim_params_to_fp32(module: nn.Module) -> None: + with torch.no_grad(): + for name, param in module.named_parameters(): + if (param.ndim < 2 or any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS)) and param.dtype != torch.float32: + param.data = param.data.float() +class Rotary(nn.Module): + def __init__(self, dim: int, base: float = 10000.0, train_seq_len: int = 1024, rope_dims: int = 0): + super().__init__() + self.dim = dim + self.base = base + self.train_seq_len = train_seq_len + self.rope_dims = rope_dims if rope_dims > 0 else dim + inv_freq = 1.0 / (base ** (torch.arange(0, self.rope_dims, 2, dtype=torch.float32) / self.rope_dims)) + self.register_buffer("inv_freq", inv_freq, persistent=False) + self._seq_len_cached = 0 + self._cos_cached: Tensor | None = None + self._sin_cached: Tensor | None = None + def forward(self, seq_len: int, device: torch.device, dtype: torch.dtype) -> tuple[Tensor, Tensor]: + if ( + self._cos_cached is None + or self._sin_cached is None + or self._seq_len_cached != seq_len + or self._cos_cached.device != device + ): + rd = self.rope_dims + if seq_len > self.train_seq_len: + scale = seq_len / self.train_seq_len + new_base = self.base * (scale ** (rd / (rd - 2))) + inv_freq = 1.0 / (new_base ** (torch.arange(0, rd, 2, dtype=torch.float32, device=device) / rd)) + else: + inv_freq = self.inv_freq.to(device) + t = torch.arange(seq_len, device=device, dtype=inv_freq.dtype) + freqs = torch.outer(t, inv_freq) + 
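+            # NTK-aware extension above train_seq_len: inflating the base by
+            # scale**(rd/(rd-2)) stretches the lowest RoPE frequencies to cover
+            # the longer window. freqs is [seq_len, rd/2]; caching it as
+            # [1, T, 1, rd/2] lets cos/sin broadcast over the batch and head axes.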
self._cos_cached = freqs.cos()[None, :, None, :] + self._sin_cached = freqs.sin()[None, :, None, :] + self._seq_len_cached = seq_len + return self._cos_cached.to(dtype=dtype), self._sin_cached.to(dtype=dtype) +def apply_rotary_emb(x: Tensor, cos: Tensor, sin: Tensor, rope_dims: int = 0) -> Tensor: + if rope_dims > 0 and rope_dims < x.size(-1): + x_rope, x_pass = x[..., :rope_dims], x[..., rope_dims:] + half = rope_dims // 2 + x1, x2 = x_rope[..., :half], x_rope[..., half:] + x_rope = torch.cat((x1 * cos + x2 * sin, x1 * (-sin) + x2 * cos), dim=-1) + return torch.cat((x_rope, x_pass), dim=-1) + half = x.size(-1) // 2 + x1, x2 = x[..., :half], x[..., half:] + return torch.cat((x1 * cos + x2 * sin, x1 * (-sin) + x2 * cos), dim=-1) + +class CausalSelfAttention(nn.Module): + def __init__( + self, + dim: int, + num_heads: int, + num_kv_heads: int, + rope_base: float, + qk_gain_init: float, + gated_attention: bool = False, + value_residual: bool = False, + ): + super().__init__() + if dim % num_heads != 0: + raise ValueError("model_dim must be divisible by num_heads") + if num_heads % num_kv_heads != 0: + raise ValueError("num_heads must be divisible by num_kv_heads") + self.num_heads = num_heads + self.num_kv_heads = num_kv_heads + self.head_dim = dim // num_heads + if self.head_dim % 2 != 0: + raise ValueError("head_dim must be even for RoPE") + # No CastedLinear -- weights come from banks + self.q_gain = nn.Parameter(torch.full((num_heads,), qk_gain_init, dtype=torch.float32)) + self.rope_dims = 0 # set by GPT.__init__ for partial RoPE + self.rotary = Rotary(self.head_dim, base=rope_base, train_seq_len=1024) + self.use_xsa = False # set by GPT.__init__ for deep layers only + # Gated attention and value residual (non-banked small params) + self.gated_attention = gated_attention + if gated_attention: + self.attn_gate = nn.Linear(dim, num_heads, bias=True) + nn.init.zeros_(self.attn_gate.weight) + nn.init.constant_(self.attn_gate.bias, 4.0) + self.value_residual = value_residual + if value_residual: + self.vr_lambda = nn.Parameter(torch.tensor([0.5, 0.5], dtype=torch.float32)) + def _xsa_efficient(self, y: Tensor, v: Tensor) -> Tensor: + """Efficient XSA: subtract self-value projection via GQA-aware reshape (no repeat_interleave). + y: [B, T, H, D], v: [B, T, Hkv, D]. 
H must be divisible by Hkv.""" + B, T, H, D = y.shape + Hkv = v.size(-2) + group = H // Hkv + y_g = y.reshape(B, T, Hkv, group, D) # [B, T, Hkv, group, D] + vn = F.normalize(v, dim=-1).unsqueeze(-2) # [B, T, Hkv, 1, D] -- broadcast ready + proj = (y_g * vn).sum(dim=-1, keepdim=True) * vn + return (y_g - proj).reshape(B, T, H, D) + def forward(self, x: Tensor, q_w: Tensor, k_w: Tensor, v_w: Tensor, out_w: Tensor, v_embed: Tensor | None = None, v0: Tensor | None = None) -> tuple[Tensor, Tensor | None]: + bsz, seqlen, dim = x.shape + q = F.linear(x, q_w.to(x.dtype)).reshape(bsz, seqlen, self.num_heads, self.head_dim) + k = F.linear(x, k_w.to(x.dtype)).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim) + v = F.linear(x, v_w.to(x.dtype)) + if v_embed is not None: + v = v + v_embed + v = v.reshape(bsz, seqlen, self.num_kv_heads, self.head_dim) + raw_v = v if self.value_residual else None + if self.value_residual and v0 is not None: + lam = self.vr_lambda.to(dtype=v.dtype) + v = lam[0] * v0 + lam[1] * v + q = F.rms_norm(q, (q.size(-1),)) + k = F.rms_norm(k, (k.size(-1),)) + cos, sin = self.rotary(seqlen, x.device, q.dtype) + q = apply_rotary_emb(q, cos, sin, self.rope_dims) + k = apply_rotary_emb(k, cos, sin, self.rope_dims) + q = q * self.q_gain.to(dtype=q.dtype)[None, None, :, None] + y = flash_attn_3_func(q, k, v, causal=True) + if self.use_xsa: + y = self._xsa_efficient(y, v) + if self.gated_attention: + # gate shape: (bsz, seqlen, num_heads) -> (bsz, seqlen, num_heads, 1) for B,T,H,D layout + gate = torch.sigmoid(self.attn_gate(x)).unsqueeze(-1) + y = y * gate + y = y.reshape(bsz, seqlen, dim) + return F.linear(y, out_w.to(x.dtype)), raw_v + +class SmearGate(nn.Module): + def __init__(self, dim: int): + super().__init__() + self.gate = nn.Parameter(torch.zeros(dim, dtype=torch.float32)) + def forward(self, x: Tensor) -> Tensor: + g = torch.sigmoid(self.gate.to(dtype=x.dtype))[None, None, :] + x_prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1) + return (1 - g) * x + g * x_prev + +class BigramHashEmbedding(nn.Module): + def __init__(self, bigram_vocab_size: int, bigram_dim: int, model_dim: int): + super().__init__() + self.bigram_vocab_size = bigram_vocab_size + self.embed = nn.Embedding(bigram_vocab_size, bigram_dim) + nn.init.zeros_(self.embed.weight) + self.proj = CastedLinear(bigram_dim, model_dim, bias=False) if bigram_dim != model_dim else None + if self.proj is not None: + nn.init.zeros_(self.proj.weight) + self.scale = nn.Parameter(torch.tensor(0.05, dtype=torch.float32)) + def bigram_hash(self, tokens: Tensor) -> Tensor: + t = tokens.to(torch.int32) + mod = self.bigram_vocab_size - 1 + out = torch.empty_like(t) + out[..., 0] = mod + out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod + return out.long() + def forward(self, token_ids: Tensor) -> Tensor: + h = self.embed(self.bigram_hash(token_ids)) + if self.proj is not None: + h = self.proj(h) + return h * self.scale.to(dtype=h.dtype) + +class ValueEmbedding(nn.Module): + """Reinject token identity into attention values at specific layers. 
+ Each table maps vocab tokens to a low-dim embedding, projected to model_dim.""" + def __init__(self, vocab_size: int, ve_dim: int, model_dim: int): + super().__init__() + self.embed = nn.Embedding(vocab_size, ve_dim) + nn.init.normal_(self.embed.weight, std=0.01) + self.proj = CastedLinear(ve_dim, model_dim, bias=False) if ve_dim != model_dim else None + if self.proj is not None: + nn.init.zeros_(self.proj.weight) + self.scale = nn.Parameter(torch.tensor(0.1, dtype=torch.float32)) + def forward(self, token_ids: Tensor) -> Tensor: + h = self.embed(token_ids) + if self.proj is not None: + h = self.proj(h) + return h * self.scale.to(dtype=h.dtype) + +class MLP(nn.Module): + def __init__(self, dim: int, mlp_mult: int): + super().__init__() + # No CastedLinear -- weights come from banks + def forward(self, x: Tensor, up_w: Tensor, down_w: Tensor) -> Tensor: + x = F.leaky_relu(F.linear(x, up_w.to(x.dtype)), negative_slope=0.5) + return F.linear(x.square(), down_w.to(x.dtype)) + +class Block(nn.Module): + def __init__( + self, + dim: int, + num_heads: int, + num_kv_heads: int, + mlp_mult: int, + rope_base: float, + qk_gain_init: float, + layer_idx: int = 0, + ln_scale: bool = False, + dtg: bool = False, + gated_attention: bool = False, + value_residual: bool = False, + ): + super().__init__() + self.attn_norm = RMSNorm() + self.mlp_norm = RMSNorm() + self.attn = CausalSelfAttention(dim, num_heads, num_kv_heads, rope_base, qk_gain_init, + gated_attention=gated_attention, value_residual=value_residual) + self.mlp = MLP(dim, mlp_mult) + self.attn_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32)) + self.mlp_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32)) + self.resid_mix = nn.Parameter(torch.stack((torch.ones(dim), torch.zeros(dim))).float()) + self.ln_scale_factor = 1.0 / math.sqrt(layer_idx + 1) if ln_scale else 1.0 + if dtg: + self.dtg_gate = nn.Linear(dim, 1, bias=True) + nn.init.zeros_(self.dtg_gate.weight) + nn.init.constant_(self.dtg_gate.bias, 2.0) + else: + self.dtg_gate = None + def forward(self, x: Tensor, x0: Tensor, q_w: Tensor, k_w: Tensor, v_w: Tensor, out_w: Tensor, up_w: Tensor, down_w: Tensor, v_embed: Tensor | None = None, v0: Tensor | None = None) -> tuple[Tensor, Tensor | None]: + mix = self.resid_mix.to(dtype=x.dtype) + x_in = mix[0][None, None, :] * x + mix[1][None, None, :] * x0 + attn_out, raw_v = self.attn(self.attn_norm(x_in) * self.ln_scale_factor, q_w, k_w, v_w, out_w, v_embed=v_embed, v0=v0) + x_out = x_in + self.attn_scale.to(dtype=x_in.dtype)[None, None, :] * attn_out + x_out = x_out + self.mlp_scale.to(dtype=x_out.dtype)[None, None, :] * self.mlp(self.mlp_norm(x_out) * self.ln_scale_factor, up_w, down_w) + if self.dtg_gate is not None: + gate = torch.sigmoid(self.dtg_gate(x_in.detach())) + x_out = x_in + gate * (x_out - x_in) + return x_out, raw_v + +def _fake_quantize(w: Tensor, bits: int = 6) -> Tensor: + clip_range = (1 << (bits - 1)) - 1 + w32 = w.float() + if w32.ndim >= 2: + row_max = w32.abs().amax(dim=-1) + scale = (row_max / clip_range).clamp_min(1.0 / clip_range) + dims = (slice(None),) * (w32.ndim - 1) + (None,) + w_q = (torch.clamp(torch.round(w32 / scale[dims]), -clip_range, clip_range) * scale[dims]).to(w.dtype) + else: + amax = w32.abs().max() + scale = (amax / clip_range).clamp_min(1.0 / clip_range) + w_q = (torch.clamp(torch.round(w32 / scale), -clip_range, clip_range) * scale).to(w.dtype) + return w + (w_q - w).detach() + +class GPT(nn.Module): + def __init__( + self, + vocab_size: int, + num_layers: int, + model_dim: int, + 
num_heads: int, + num_kv_heads: int, + mlp_mult: int, + tie_embeddings: bool, + tied_embed_init_std: float, + logit_softcap: float, + rope_base: float, + qk_gain_init: float, + mtp_num_heads: int = 0, + mtp_loss_weight: float = 0.1, + bigram_vocab_size: int = 0, + bigram_dim: int = 128, + xsa_last_n: int = 0, + rope_dims: int = 0, + ln_scale: bool = False, + dtg: bool = False, + ve_enabled: bool = False, + ve_dim: int = 128, + ve_layers: str = "9,10", + gated_attention: bool = False, + value_residual: bool = False, + core_start: int = 3, + core_end: int = 8, + num_passes: int = 1, + core_quant_bits: int = 6, + core_quant_enabled: bool = False, + residual_scale: nn.Module | None = None, + interpass_rmsnorm: bool = True, + ): + super().__init__() + self._ve_target_dim = num_kv_heads * (model_dim // num_heads) + if logit_softcap <= 0.0: + raise ValueError(f"logit_softcap must be positive, got {logit_softcap}") + self.tie_embeddings = tie_embeddings + self.tied_embed_init_std = tied_embed_init_std + self.logit_softcap = logit_softcap + self.value_residual = value_residual + self.mtp_num_heads = mtp_num_heads + self.mtp_loss_weight = mtp_loss_weight + self.core_start = core_start + self.core_end = min(core_end, num_layers) + self.interpass_rmsnorm = interpass_rmsnorm + self.num_passes = num_passes + self.core_quant_bits = core_quant_bits + self.core_quant_enabled = core_quant_enabled + self.num_stem = core_start + self.num_core = self.core_end - core_start + self.num_tail = num_layers - self.core_end + self.residual_scale = residual_scale + self.tok_emb = nn.Embedding(vocab_size, model_dim) + self.bigram = BigramHashEmbedding(bigram_vocab_size, bigram_dim, model_dim) if bigram_vocab_size > 0 else None + self.smear = SmearGate(model_dim) + self.num_skip_weights = min(self.num_stem, self.num_tail) + self.skip_weights = nn.Parameter(torch.ones(self.num_skip_weights, model_dim, dtype=torch.float32)) + # Parameter banks: contiguous 3D tensors for batched optimizer + head_dim = model_dim // num_heads + kv_dim = num_kv_heads * head_dim + mlp_dim = int(mlp_mult * model_dim) + self.num_layers = num_layers + self.qo_bank = nn.Parameter(torch.empty(2 * num_layers, model_dim, model_dim)) + self.kv_bank = nn.Parameter(torch.empty(2 * num_layers, kv_dim, model_dim)) + self.mlp_up_bank = nn.Parameter(torch.empty(num_layers, mlp_dim, model_dim)) + self.mlp_down_bank = nn.Parameter(torch.empty(num_layers, model_dim, mlp_dim)) + self.blocks = nn.ModuleList( + [ + Block( + model_dim, + num_heads, + num_kv_heads, + mlp_mult, + rope_base, + qk_gain_init, + layer_idx=i, + ln_scale=ln_scale, + dtg=dtg, + gated_attention=gated_attention, + value_residual=value_residual, + ) + for i in range(num_layers) + ] + ) + if rope_dims > 0: + head_dim = model_dim // num_heads + for block in self.blocks: + block.attn.rope_dims = rope_dims + block.attn.rotary = Rotary(head_dim, base=rope_base, train_seq_len=1024, rope_dims=rope_dims) + self.ve_layer_indices = [int(x) for x in ve_layers.split(",") if x.strip()] if ve_enabled else [] + kv_dim_ve = self._ve_target_dim + if self.ve_layer_indices: + self.ve_shared = ValueEmbedding(vocab_size, ve_dim, kv_dim_ve) + self.ve_layer_scales = nn.ParameterList( + [nn.Parameter(torch.ones(1, dtype=torch.float32)) for _ in self.ve_layer_indices] + ) + else: + self.ve_shared = None + self.ve_layer_scales = nn.ParameterList() + self.value_embeds = nn.ModuleList() # keep empty for compat + self.final_norm = RMSNorm() + self.lm_head = None if tie_embeddings else CastedLinear(model_dim, vocab_size, 
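+                                                      # untied LM head; zero-initialized below via the _zero_init flag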
bias=False) + if self.lm_head is not None: + self.lm_head._zero_init = True + self.mtp_heads = nn.ModuleList( + [CastedLinear(model_dim, vocab_size, bias=False) for _ in range(mtp_num_heads)] + ) + for head in self.mtp_heads: + head._zero_init = True + if xsa_last_n > 0: + for i in range(max(0, num_layers - xsa_last_n), num_layers): + if i < core_start or i >= self.core_end: + self.blocks[i].attn.use_xsa = True + self._init_weights() + def _init_weights(self) -> None: + if self.tie_embeddings: + nn.init.normal_(self.tok_emb.weight, mean=0.0, std=self.tied_embed_init_std) + n = self.num_layers + proj_scale = 1.0 / math.sqrt(2 * n) + # Init banks: orthogonal, with proj layers scaled down and out/down zero-init + for i in range(n): + nn.init.orthogonal_(self.qo_bank.data[i], gain=1.0) # Q + nn.init.zeros_(self.qo_bank.data[n + i]) # Out (zero init) + nn.init.orthogonal_(self.kv_bank.data[i], gain=1.0) # K + nn.init.orthogonal_(self.kv_bank.data[n + i], gain=1.0) # V + nn.init.orthogonal_(self.mlp_up_bank.data[i], gain=1.0) # MLP up + nn.init.zeros_(self.mlp_down_bank.data[i]) # MLP down (zero init) + # Scale proj layers (out_proj and mlp_down are "proj" layers) + self.qo_bank.data[n + i].mul_(proj_scale) + self.mlp_down_bank.data[i].mul_(proj_scale) + # Init remaining nn.Linear modules (bigram proj, mtp heads, lm_head) + for name, module in self.named_modules(): + if isinstance(module, nn.Linear): + if getattr(module, "_zero_init", False): + nn.init.zeros_(module.weight) + elif module.weight.ndim == 2 and module.weight.shape[0] >= 64 and module.weight.shape[1] >= 64: + nn.init.orthogonal_(module.weight, gain=1.0) + def _get_ve(self, layer_idx: int, input_ids: Tensor, ve_cache: dict | None = None) -> Tensor | None: + """Get value embedding for a specific layer using shared table + per-layer scale.""" + if self.ve_shared is None or layer_idx not in self.ve_layer_indices: + return None + if ve_cache is not None and 've' not in ve_cache: + ve_cache['ve'] = self.ve_shared(input_ids) + ve_base = ve_cache['ve'] if ve_cache is not None else self.ve_shared(input_ids) + ve_idx = self.ve_layer_indices.index(layer_idx) + return ve_base * self.ve_layer_scales[ve_idx].to(dtype=ve_base.dtype) + def _get_bank_weights(self, bi: int) -> tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]: + n = self.num_layers + q_w = self.qo_bank[bi] + out_w = self.qo_bank[n + bi] + k_w = self.kv_bank[bi] + v_w = self.kv_bank[n + bi] + up_w = self.mlp_up_bank[bi] + down_w = self.mlp_down_bank[bi] + if self.core_quant_enabled and self.training and self.core_start <= bi < self.core_end: + q_w = _fake_quantize(q_w, self.core_quant_bits) + out_w = _fake_quantize(out_w, self.core_quant_bits) + k_w = _fake_quantize(k_w, self.core_quant_bits) + v_w = _fake_quantize(v_w, self.core_quant_bits) + up_w = _fake_quantize(up_w, self.core_quant_bits) + down_w = _fake_quantize(down_w, self.core_quant_bits) + return q_w, k_w, v_w, out_w, up_w, down_w + + def _forward_hidden(self, input_ids: Tensor, feedback_fn=None, + stabilizer=None) -> tuple[Tensor, Tensor, Tensor]: + n = self.num_layers + x = self.tok_emb(input_ids) + if self.bigram is not None: + x = x + self.bigram(input_ids) + x = F.rms_norm(x, (x.size(-1),)) + x = self.smear(x) + x0 = x + v0 = None + skips: list[Tensor] = [] + ve_cache: dict = {} + # --- STEM --- + for i in range(self.core_start): + ve = self._get_ve(i, input_ids, ve_cache) + q_w, k_w, v_w, out_w, up_w, down_w = self._get_bank_weights(i) + x, raw_v = self.blocks[i](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, + 
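+                                             # ve: optional per-layer value embedding; v0: raw
+                                             # first-layer values captured for value-residual mixing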
v_embed=ve, v0=v0) + if v0 is None and raw_v is not None: + v0 = raw_v + skips.append(x) + # --- RECURRENT CORE (Fixes 1, 2, 5) --- + h_core_in = x + for k in range(self.num_passes): + if k > 0 and self.interpass_rmsnorm: + x = F.rms_norm(x, (x.size(-1),)) + if feedback_fn is not None: + x = x + feedback_fn(x, k) + if stabilizer is not None: + x = stabilizer.clip(x) + x_before_pass = x + for j in range(self.core_start, self.core_end): + h_prev = x + ve = self._get_ve(j, input_ids, ve_cache) + q_w, k_w, v_w, out_w, up_w, down_w = self._get_bank_weights(j) + x, raw_v = self.blocks[j](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, + v_embed=ve, v0=v0) + if v0 is None and raw_v is not None: + v0 = raw_v + if stabilizer is not None and self.training and not torch.compiler.is_compiling(): + stabilizer.record_pass(h_prev, x) + if self.residual_scale is not None and k > 0: + delta = x - x_before_pass + x = x_before_pass + self.residual_scale(delta, k) + h_core_out = x + # --- TAIL --- + for i in range(self.core_end, n): + ti = i - self.core_end + if ti < len(skips): + x = x + self.skip_weights[ti].to(dtype=x.dtype)[None, None, :] * skips.pop() + ve = self._get_ve(i, input_ids, ve_cache) + q_w, k_w, v_w, out_w, up_w, down_w = self._get_bank_weights(i) + x, _ = self.blocks[i](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, + v_embed=ve, v0=v0) + x = self.final_norm(x) + return x, h_core_in, h_core_out + + def forward(self, input_ids: Tensor, target_ids: Tensor, + feedback_fn=None, stabilizer=None) -> Tensor: + x, h_core_in, h_core_out = self._forward_hidden(input_ids, feedback_fn, stabilizer) + x_flat = x.reshape(-1, x.size(-1)) + targets = target_ids.reshape(-1) + if self.tie_embeddings: + logits_proj = F.linear(x_flat, self.tok_emb.weight) + else: + if self.lm_head is None: + raise RuntimeError("lm_head is required when tie_embeddings=False") + logits_proj = self.lm_head(x_flat) + logits = self.logit_softcap * torch.tanh(logits_proj / self.logit_softcap) + main_loss = F.cross_entropy(logits.float(), targets, reduction="mean") + if self.training and self.mtp_num_heads > 0 and self.mtp_loss_weight > 0.0: + _, seqlen, dim = x.shape + mtp_loss_sum = x.new_zeros(()) + mtp_loss_count = 0 + for k, mtp_head in enumerate(self.mtp_heads): + valid_t = seqlen - (k + 1) + if valid_t <= 0: + continue + mtp_hidden = x[:, :valid_t, :].reshape(-1, dim) + mtp_targets = target_ids[:, k + 1 :].reshape(-1) + mtp_logits_proj = mtp_head(mtp_hidden) + mtp_logits = self.logit_softcap * torch.tanh(mtp_logits_proj / self.logit_softcap) + mtp_loss_sum = mtp_loss_sum + F.cross_entropy(mtp_logits.float(), mtp_targets, reduction="mean") + mtp_loss_count += 1 + if mtp_loss_count > 0: + main_loss = main_loss + self.mtp_loss_weight * (mtp_loss_sum / mtp_loss_count) + if stabilizer is not None and stabilizer.jacobian_proxy_weight > 0: + main_loss = main_loss + stabilizer.jacobian_proxy_loss(h_core_in, h_core_out) + return main_loss + + def forward_logits(self, input_ids: Tensor, + feedback_fn=None, stabilizer=None) -> Tensor: + """Return logits (bsz, seq_len, vocab) without computing loss.""" + x, _, _ = self._forward_hidden(input_ids, feedback_fn, stabilizer) + if self.tie_embeddings: + logits_proj = F.linear(x, self.tok_emb.weight) + else: + logits_proj = self.lm_head(x) + return self.logit_softcap * torch.tanh(logits_proj / self.logit_softcap) + +# --- Sliding window evaluation --- + +def eval_val_sliding( + args: Hyperparameters, + base_model: nn.Module, + rank: int, + world_size: int, + device: torch.device, + val_tokens: Tensor, + 
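+    # The three LUTs convert token-level NLL into bits-per-byte: base byte
+    # length per target token, plus one byte when the target keeps its leading
+    # space after a non-boundary previous token.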
base_bytes_lut: Tensor, + has_leading_space_lut: Tensor, + is_boundary_token_lut: Tensor, + stride: int, + batch_seqs: int = 32, + eval_seq_len: int | None = None, +) -> tuple[float, float]: + """Sliding window evaluation: each token scored with maximum context.""" + seq_len = eval_seq_len or args.train_seq_len + total_tokens = val_tokens.numel() - 1 + window_starts = [ws for ws in range(0, total_tokens, stride) + if min(ws + seq_len, total_tokens) - ws >= 1] + total_windows = len(window_starts) + my_s = (total_windows * rank) // world_size + my_e = (total_windows * (rank + 1)) // world_size + my_windows = window_starts[my_s:my_e] + loss_sum = torch.zeros((), device=device, dtype=torch.float64) + token_count = torch.zeros((), device=device, dtype=torch.float64) + byte_count = torch.zeros((), device=device, dtype=torch.float64) + base_model.eval() + compiled_logits = torch.compile(base_model.forward_logits, dynamic=False, fullgraph=True) + with torch.inference_mode(): + for bi in range(0, len(my_windows), batch_seqs): + batch_ws = my_windows[bi:bi + batch_seqs] + bsz = len(batch_ws) + x_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device) + y_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device) + wlens: list[int] = [] + for i, ws in enumerate(batch_ws): + end = min(ws + seq_len, total_tokens) + wlen = end - ws + wlens.append(wlen) + chunk = val_tokens[ws:end + 1].to(dtype=torch.int64, device=device) + x_batch[i, :wlen] = chunk[:-1] + y_batch[i, :wlen] = chunk[1:] + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + logits = compiled_logits(x_batch) + nll = F.cross_entropy( + logits.reshape(-1, logits.size(-1)).float(), + y_batch.reshape(-1), + reduction="none", + ).reshape(bsz, seq_len) + for i, ws in enumerate(batch_ws): + wlen = wlens[i] + s = 0 if ws == 0 else max(wlen - stride, 0) + scored_nll = nll[i, s:wlen].to(torch.float64) + loss_sum += scored_nll.sum() + token_count += float(wlen - s) + tgt = y_batch[i, s:wlen] + prev = x_batch[i, s:wlen] + tb = base_bytes_lut[tgt].to(torch.float64) + tb += (has_leading_space_lut[tgt] & ~is_boundary_token_lut[prev]).to(torch.float64) + byte_count += tb.sum() + if dist.is_available() and dist.is_initialized(): + dist.all_reduce(loss_sum, op=dist.ReduceOp.SUM) + dist.all_reduce(token_count, op=dist.ReduceOp.SUM) + dist.all_reduce(byte_count, op=dist.ReduceOp.SUM) + val_loss = (loss_sum / token_count).item() + bits_per_token = val_loss / math.log(2.0) + tokens_per_byte = token_count.item() / byte_count.item() + base_model.train() + return val_loss, bits_per_token * tokens_per_byte + + +def eval_val_sliding_ttt( + args: Hyperparameters, base_model: nn.Module, rank: int, world_size: int, + device: torch.device, val_tokens: Tensor, base_bytes_lut: Tensor, + has_leading_space_lut: Tensor, is_boundary_token_lut: Tensor, + stride: int, batch_seqs: int = 32, log0=print, +) -> tuple[float, float]: + """Legal score-first TTT (PR #461 recipe): score each chunk with sliding windows, + then train on it. 
Every token scored BEFORE any update that could use it.""" + seq_len = args.train_seq_len + total_tokens = val_tokens.numel() - 1 + ttt_chunk = args.ttt_chunk_tokens + + # Pre-compute all window starts + window_starts = [ws for ws in range(0, total_tokens, stride) + if min(ws + seq_len, total_tokens) - ws >= stride or ws == 0] + + # Assign each window to a chunk based on the first token it scores + num_chunks = (total_tokens + ttt_chunk - 1) // ttt_chunk + chunk_windows: list[list[int]] = [[] for _ in range(num_chunks)] + for ws in window_starts: + end = min(ws + seq_len, total_tokens) + wlen = end - ws + s = 0 if ws == 0 else max(wlen - stride, 0) + scored_start = ws + s + ci = min(scored_start // ttt_chunk, num_chunks - 1) + chunk_windows[ci].append(ws) + + log0(f"ttt_sliding:start chunks={num_chunks} chunk_tokens={ttt_chunk} " + f"total_windows={len(window_starts)} stride={stride} " + f"ttt_lr={args.ttt_lr} ttt_epochs={args.ttt_epochs} " + f"freeze_blocks={args.ttt_freeze_blocks}") + + loss_sum = torch.zeros((), device=device, dtype=torch.float64) + token_count = torch.zeros((), device=device, dtype=torch.float64) + byte_count = torch.zeros((), device=device, dtype=torch.float64) + + # Freeze first N blocks + frozen_block_ids = set(range(min(args.ttt_freeze_blocks, len(base_model.blocks)))) + ttt_params = [] + for name, p in base_model.named_parameters(): + freeze = False + for bi in frozen_block_ids: + if f"blocks.{bi}." in name: + freeze = True + break + if freeze: + p.requires_grad_(False) + else: + p.requires_grad_(True) + ttt_params.append(p) + + log0(f"ttt_sliding:params unfrozen={sum(p.numel() for p in ttt_params)} " + f"frozen={sum(p.numel() for p in base_model.parameters() if not p.requires_grad)}") + + optimizer = torch.optim.SGD(ttt_params, lr=args.ttt_lr, momentum=args.ttt_momentum) + t0 = time.perf_counter() + + for ci in range(num_chunks): + windows = chunk_windows[ci] + if not windows: + continue + chunk_start = ci * ttt_chunk + chunk_end = min((ci + 1) * ttt_chunk, total_tokens) + + # --- Phase 1: SCORE this chunk's windows (inference_mode) --- + my_s = (len(windows) * rank) // world_size + my_e = (len(windows) * (rank + 1)) // world_size + my_windows = windows[my_s:my_e] + + base_model.eval() + with torch.inference_mode(): + for bi in range(0, len(my_windows), batch_seqs): + batch_ws = my_windows[bi:bi + batch_seqs] + bsz = len(batch_ws) + x_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device) + y_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device) + wlens: list[int] = [] + for i, ws in enumerate(batch_ws): + end = min(ws + seq_len, total_tokens) + wlen = end - ws + wlens.append(wlen) + chunk_tok = val_tokens[ws:end + 1].to(dtype=torch.int64, device=device) + x_batch[i, :wlen] = chunk_tok[:-1] + y_batch[i, :wlen] = chunk_tok[1:] + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + logits = base_model.forward_logits(x_batch) + nll = F.cross_entropy( + logits.reshape(-1, logits.size(-1)).float(), + y_batch.reshape(-1), reduction="none", + ).reshape(bsz, seq_len) + for i, ws in enumerate(batch_ws): + wlen = wlens[i] + s = 0 if ws == 0 else max(wlen - stride, 0) + scored_nll = nll[i, s:wlen].to(torch.float64) + loss_sum += scored_nll.sum() + token_count += float(wlen - s) + tgt, prev = y_batch[i, s:wlen], x_batch[i, s:wlen] + tb = base_bytes_lut[tgt].to(torch.float64) + tb += (has_leading_space_lut[tgt] & ~is_boundary_token_lut[prev]).to(torch.float64) + byte_count += tb.sum() + + # --- Phase 2: TRAIN on this chunk (already 
scored = legal) --- + is_last_chunk = (ci == num_chunks - 1) + if not is_last_chunk and args.ttt_epochs > 0: + base_model.train() + chunk_seqs = (chunk_end - chunk_start) // seq_len + if chunk_seqs > 0: + cos_lr = args.ttt_lr * 0.5 * (1.0 + math.cos(math.pi * ci / max(num_chunks - 1, 1))) + for pg in optimizer.param_groups: + pg['lr'] = cos_lr + my_seq_s = (chunk_seqs * rank) // world_size + my_seq_e = (chunk_seqs * (rank + 1)) // world_size + my_chunk_seqs = my_seq_e - my_seq_s + for _ep in range(args.ttt_epochs): + for bs in range(0, my_chunk_seqs, args.ttt_batch_seqs): + be = min(bs + args.ttt_batch_seqs, my_chunk_seqs) + actual_bs = my_seq_s + bs + start_tok = chunk_start + actual_bs * seq_len + end_tok = chunk_start + (my_seq_s + be) * seq_len + 1 + if end_tok > val_tokens.numel(): + continue + local = val_tokens[start_tok:end_tok].to(device=device, dtype=torch.int64) + x = local[:-1].reshape(-1, seq_len) + y = local[1:].reshape(-1, seq_len) + optimizer.zero_grad(set_to_none=True) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + loss = base_model(x, y) + loss.backward() + if world_size > 1: + for p in ttt_params: + if p.grad is not None: + dist.all_reduce(p.grad, op=dist.ReduceOp.AVG) + torch.nn.utils.clip_grad_norm_(ttt_params, args.ttt_grad_clip) + optimizer.step() + + if rank == 0 and (ci % 10 == 0 or ci == num_chunks - 1): + elapsed = time.perf_counter() - t0 + rl = loss_sum.item() / max(token_count.item(), 1) + rbpb = rl / math.log(2.0) * (token_count.item() / max(byte_count.item(), 1)) if token_count.item() > 0 else 0.0 + log0(f" ttt_chunk [{ci+1}/{num_chunks}] bpb={rbpb:.6f} time={elapsed:.1f}s") + + if dist.is_available() and dist.is_initialized(): + dist.all_reduce(loss_sum, op=dist.ReduceOp.SUM) + dist.all_reduce(token_count, op=dist.ReduceOp.SUM) + dist.all_reduce(byte_count, op=dist.ReduceOp.SUM) + + val_loss = (loss_sum / token_count).item() + val_bpb = val_loss / math.log(2.0) * (token_count.item() / byte_count.item()) + + for p in base_model.parameters(): + p.requires_grad_(True) + base_model.eval() + + log0(f"ttt_sliding:done val_loss={val_loss:.6f} val_bpb={val_bpb:.6f} " + f"elapsed={time.perf_counter() - t0:.1f}s") + return val_loss, val_bpb + + +# --- GPTQ-lite int6 quantization --- + +def _classify_param(name: str) -> str: + if "tok_emb" in name or "lm_head" in name: + return "embed" + if ".mlp." in name: + return "mlp" + if ".attn." in name or (".proj." in name and ".mlp." 
not in name): + return "attn" + return "other" +def quantize_int6_per_row(t: Tensor, clip_range: int = 31) -> tuple[Tensor, Tensor]: + t32 = t.float() + if t32.ndim == 2: + best_q, best_s, best_err = None, None, float('inf') + for pct in [0.9990, 0.9995, 0.9999, 0.99999, 1.0]: + if pct < 1.0: + row_clip = torch.quantile(t32.abs(), pct, dim=1) + else: + row_clip = t32.abs().amax(dim=1) + s = (row_clip / clip_range).clamp_min(1.0 / clip_range).to(torch.float16) + q = torch.clamp(torch.round(t32 / s.float()[:, None]), -clip_range, clip_range).to(torch.int8) + recon = q.float() * s.float()[:, None] + err = (t32 - recon).pow(2).mean().item() + if err < best_err: + best_q, best_s, best_err = q, s, err + return best_q, best_s + amax = t32.abs().max().item() + scale = torch.tensor(amax / clip_range if amax > 0 else 1.0, dtype=torch.float16) + q = torch.clamp(torch.round(t32 / scale.float()), -clip_range, clip_range).to(torch.int8) + return q, scale + +def _unbank_state_dict(sd: dict[str, Tensor], num_layers: int) -> dict[str, Tensor]: + """Convert 3D bank tensors into individual 2D tensors with standard names.""" + out: dict[str, Tensor] = {} + n = num_layers + for name, tensor in sd.items(): + if name == "qo_bank": + for i in range(n): + out[f"blocks.{i}.attn.c_q.weight"] = tensor[i] + out[f"blocks.{i}.attn.proj.weight"] = tensor[n + i] + elif name == "kv_bank": + for i in range(n): + out[f"blocks.{i}.attn.c_k.weight"] = tensor[i] + out[f"blocks.{i}.attn.c_v.weight"] = tensor[n + i] + elif name == "mlp_up_bank": + for i in range(n): + out[f"blocks.{i}.mlp.fc.weight"] = tensor[i] + elif name == "mlp_down_bank": + for i in range(n): + out[f"blocks.{i}.mlp.proj.weight"] = tensor[i] + else: + out[name] = tensor + return out + +def _rebank_state_dict(sd: dict[str, Tensor], num_layers: int, template_sd: dict[str, Tensor]) -> dict[str, Tensor]: + """Convert individual 2D tensors back into 3D bank tensors.""" + out: dict[str, Tensor] = {} + n = num_layers + # Reconstruct banks from individual weight keys + qo_slices = [None] * (2 * n) + kv_slices = [None] * (2 * n) + up_slices = [None] * n + down_slices = [None] * n + consumed = set() + for i in range(n): + qk = f"blocks.{i}.attn.c_q.weight" + if qk in sd: + qo_slices[i] = sd[qk] + consumed.add(qk) + ok = f"blocks.{i}.attn.proj.weight" + if ok in sd: + qo_slices[n + i] = sd[ok] + consumed.add(ok) + kk = f"blocks.{i}.attn.c_k.weight" + if kk in sd: + kv_slices[i] = sd[kk] + consumed.add(kk) + vk = f"blocks.{i}.attn.c_v.weight" + if vk in sd: + kv_slices[n + i] = sd[vk] + consumed.add(vk) + fk = f"blocks.{i}.mlp.fc.weight" + if fk in sd: + up_slices[i] = sd[fk] + consumed.add(fk) + dk = f"blocks.{i}.mlp.proj.weight" + if dk in sd: + down_slices[i] = sd[dk] + consumed.add(dk) + out["qo_bank"] = torch.stack(qo_slices).to(dtype=template_sd["qo_bank"].dtype) + out["kv_bank"] = torch.stack(kv_slices).to(dtype=template_sd["kv_bank"].dtype) + out["mlp_up_bank"] = torch.stack(up_slices).to(dtype=template_sd["mlp_up_bank"].dtype) + out["mlp_down_bank"] = torch.stack(down_slices).to(dtype=template_sd["mlp_down_bank"].dtype) + for name, tensor in sd.items(): + if name not in consumed: + out[name] = tensor + return out + +def mixed_quantize_int6(state_dict: dict[str, Tensor], int6_cats: set[str]): + num_layers_total = max( + (int(k.split(".")[1]) for k in state_dict if k.startswith("blocks.")), + default=0, + ) + 1 + late_k_layers = set(range(num_layers_total - 2, num_layers_total)) + result: dict[str, Tensor] = {} + meta: dict[str, object] = {} + for name, tensor in 
state_dict.items(): + t = tensor.detach().cpu().contiguous() + cat = _classify_param(name) + if not t.is_floating_point() or t.numel() <= 65536: + result[name] = t.to(torch.float16) if t.is_floating_point() else t + meta[name] = "passthrough" + continue + if any(p in name for p in CONTROL_TENSOR_NAME_PATTERNS): + result[name] = t.float() + meta[name] = "passthrough_ctrl" + continue + if cat in int6_cats and t.ndim >= 1: + q, s = quantize_int6_per_row(t) + result[name + ".q"] = q + result[name + ".scale"] = s + meta[name] = {"type": "int6"} + else: + q, s = quantize_float_tensor(t) + result[name + ".q"] = q + result[name + ".scale"] = s + meta[name] = {"type": "int8"} + return result, meta +def dequantize_mixed_int6(result: dict[str, Tensor], meta: dict[str, object], + template_sd: dict[str, Tensor]) -> dict[str, Tensor]: + out: dict[str, Tensor] = {} + for name, orig in template_sd.items(): + info = meta.get(name) + if info is None: + continue + orig_dtype = orig.dtype + if info in ("passthrough", "passthrough_ctrl", "passthrough_fp16"): + t = result[name] + if t.dtype == torch.float16 and orig_dtype in (torch.float32, torch.bfloat16): + t = t.to(orig_dtype) + out[name] = t + continue + q, s = result[name + ".q"], result[name + ".scale"] + if s.ndim > 0: + out[name] = (q.float() * s.float().view(q.shape[0], *([1] * (q.ndim - 1)))).to(orig_dtype) + else: + out[name] = (q.float() * float(s.item())).to(orig_dtype) + return out + +# --- Training --- + +def parse_args() -> argparse.Namespace: + parser = argparse.ArgumentParser(description="Recurrent SOTA with stabilization") + g = parser.add_argument_group("feedback") + g.add_argument("--feedback-rank", type=int, default=2) + g.add_argument("--feedback-mode", type=str, default="diagonal", + choices=["identity", "diagonal", "low_rank", "none"]) + g.add_argument("--per-pass-feedback", action="store_true") + g.add_argument("--affine-junction", action="store_true") + g = parser.add_argument_group("stability") + g.add_argument("--clip-hidden", action="store_true") + g.add_argument("--clip-value", type=float, default=10.0) + g.add_argument("--residual-scale-init", type=float, default=0.5) + g.add_argument("--jacobian-proxy-weight", type=float, default=0.01) + g.add_argument("--no-interpass-rmsnorm", action="store_true") + return parser.parse_args() + +def main() -> None: + cli = parse_args() + code = Path(__file__).read_text(encoding="utf-8") + args = Hyperparameters() + distributed = "RANK" in os.environ and "WORLD_SIZE" in os.environ + rank = int(os.environ.get("RANK", "0")) + world_size = int(os.environ.get("WORLD_SIZE", "1")) + local_rank = int(os.environ.get("LOCAL_RANK", "0")) + if world_size <= 0: + raise ValueError(f"WORLD_SIZE must be positive, got {world_size}") + if 8 % world_size != 0: + raise ValueError(f"WORLD_SIZE={world_size} must divide 8 so grad_accum_steps stays integral") + grad_accum_steps = 8 // world_size + grad_scale = 1.0 / grad_accum_steps + if not torch.cuda.is_available(): + raise RuntimeError("CUDA is required") + device = torch.device("cuda", local_rank) + torch.cuda.set_device(device) + if distributed: + dist.init_process_group(backend="nccl", device_id=device) + dist.barrier() + master_process = rank == 0 + torch.backends.cuda.matmul.allow_tf32 = True + torch.backends.cudnn.allow_tf32 = True + from torch.backends.cuda import enable_cudnn_sdp, enable_flash_sdp, enable_math_sdp, enable_mem_efficient_sdp + enable_cudnn_sdp(False) + enable_flash_sdp(True) + enable_mem_efficient_sdp(False) + enable_math_sdp(False) + logfile 
= None + if master_process: + os.makedirs("logs", exist_ok=True) + logfile = f"logs/{args.run_id}.txt" + print(logfile) + def log0(msg: str, console: bool = True) -> None: + if not master_process: + return + if console: + print(msg) + if logfile is not None: + with open(logfile, "a", encoding="utf-8") as f: + print(msg, file=f) + log0(code, console=False) + log0("=" * 100, console=False) + log0(f"Running Python {sys.version}", console=False) + log0(f"Running PyTorch {torch.__version__}", console=False) + log0( + subprocess.run(["nvidia-smi"], stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, check=False).stdout, + console=False, + ) + log0("=" * 100, console=False) + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + torch.cuda.manual_seed_all(args.seed) + if not args.tokenizer_path.endswith(".model"): + raise ValueError(f"Script only setup for SentencePiece .model file: {args.tokenizer_path}") + sp = spm.SentencePieceProcessor(model_file=args.tokenizer_path) + if int(sp.vocab_size()) != args.vocab_size: + raise ValueError( + f"VOCAB_SIZE={args.vocab_size} does not match tokenizer vocab_size={int(sp.vocab_size())}" + ) + dataset_dir = Path(args.data_path).resolve() + actual_train_files = len(list(dataset_dir.glob("fineweb_train_*.bin"))) + effective_eval_seq_len = args.eval_seq_len if args.eval_seq_len > 0 else args.train_seq_len + val_seq_len = max(args.train_seq_len, effective_eval_seq_len) + val_tokens = load_validation_tokens(args.val_files, val_seq_len) + base_bytes_lut, has_leading_space_lut, is_boundary_token_lut = build_sentencepiece_luts( + sp, args.vocab_size, device + ) + log0(f"val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path={args.tokenizer_path}") + log0(f"train_loader:dataset:{dataset_dir.name} train_shards:{actual_train_files}") + log0(f"val_loader:shards pattern={args.val_files} tokens:{val_tokens.numel() - 1}") + CastedLinear._qat_enabled = args.qat_enabled + base_model = GPT( + vocab_size=args.vocab_size, + num_layers=args.num_layers, + model_dim=args.model_dim, + num_heads=args.num_heads, + num_kv_heads=args.num_kv_heads, + mlp_mult=args.mlp_mult, + tie_embeddings=args.tie_embeddings, + tied_embed_init_std=args.tied_embed_init_std, + logit_softcap=args.logit_softcap, + rope_base=args.rope_base, + qk_gain_init=args.qk_gain_init, + mtp_num_heads=args.mtp_num_heads, + mtp_loss_weight=args.mtp_loss_weight, + bigram_vocab_size=args.bigram_vocab_size, + bigram_dim=args.bigram_dim, + xsa_last_n=args.xsa_last_n, + rope_dims=args.rope_dims, + ln_scale=args.ln_scale, + dtg=args.dtg_enabled, + ve_enabled=args.ve_enabled, + ve_dim=args.ve_dim, + ve_layers=args.ve_layers, + gated_attention=args.gated_attention, + value_residual=args.value_residual, + core_start=args.core_start, + core_end=args.core_end, + num_passes=args.num_passes, + core_quant_bits=args.core_quant_bits, + core_quant_enabled=args.core_quant_enabled, + residual_scale=None, + interpass_rmsnorm=not cli.no_interpass_rmsnorm, + ).to(device).bfloat16() + # Banks stay FP32 (like CastedLinear weights), cast to BF16 in forward + base_model.qo_bank.data = base_model.qo_bank.data.float() + base_model.kv_bank.data = base_model.kv_bank.data.float() + base_model.mlp_up_bank.data = base_model.mlp_up_bank.data.float() + base_model.mlp_down_bank.data = base_model.mlp_down_bank.data.float() + for module in base_model.modules(): + if isinstance(module, CastedLinear): + module.float() + restore_low_dim_params_to_fp32(base_model) + # --- feedback / stabilizer --- + feedback = None + 
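+    # Learned error feedback for the recurrent core: when enabled below,
+    # _forward_hidden adds feedback_fn(x, k) at the start of recurrence pass k,
+    # giving the model an explicit channel to cancel error accumulated across passes.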
feedback_fn = None + stabilizer = None + residual_scale = None + extra_scalar_params: list[nn.Parameter] = [] + if cli.feedback_mode != "none" and args.num_passes > 1: + feedback = ErrorFeedbackModule( + dim=args.model_dim, rank=cli.feedback_rank, + feedback_mode=cli.feedback_mode, + per_pass=cli.per_pass_feedback, + num_passes=args.num_passes, + affine_junction=cli.affine_junction, + ).to(device).bfloat16() + restore_low_dim_params_to_fp32(feedback) + extra_scalar_params.extend(feedback.parameters()) + def feedback_fn(h, pass_idx): + return feedback(h, pass_idx) + log0(f"feedback: mode={cli.feedback_mode} rank={cli.feedback_rank} " + f"per_pass={cli.per_pass_feedback} params={sum(p.numel() for p in feedback.parameters())}") + if args.num_passes > 1: + stabilizer = RecurrentStabilizer( + clip_hidden=cli.clip_hidden, clip_value=cli.clip_value, + jacobian_proxy_weight=cli.jacobian_proxy_weight) + if cli.residual_scale_init != 1.0: + residual_scale = ResidualScale(args.num_passes, cli.residual_scale_init).to(device) + base_model.residual_scale = residual_scale + extra_scalar_params.extend(residual_scale.parameters()) + log0(f"recurrence: core_start={args.core_start} core_end={args.core_end} " + f"num_passes={args.num_passes} stem={base_model.num_stem} " + f"core={base_model.num_core} tail={base_model.num_tail}") + # No DDP -- Parallel Muon handles bank grad communication via reduce-scatter, + # and non-bank grads are manually all-reduced before Adam steps. + compiled_model = torch.compile(base_model, dynamic=False, fullgraph=True) + model = compiled_model + + # Optimizer split: + # - 4 parameter banks -> Muon (batched Newton-Schulz) + # - token embedding -> Adam + # - scalars/control tensors -> Adam + # - bigram proj, mtp heads, VE proj -> Adam (small matrix params not worth banking) + matrix_params = [ + base_model.qo_bank, base_model.kv_bank, + base_model.mlp_up_bank, base_model.mlp_down_bank, + ] + block_named_params = list(base_model.blocks.named_parameters()) + scalar_params = [ + p + for name, p in block_named_params + if p.ndim < 2 or any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS) + ] + if base_model.skip_weights.numel() > 0: + scalar_params.append(base_model.skip_weights) + scalar_params.append(base_model.smear.gate) + if base_model.bigram is not None: + scalar_params.append(base_model.bigram.scale) + token_lr = args.tied_embed_lr if args.tie_embeddings else args.embed_lr + tok_params = [{"params": [base_model.tok_emb.weight], "lr": token_lr, "base_lr": token_lr}] + if base_model.bigram is not None: + tok_params.append({"params": [base_model.bigram.embed.weight], "lr": token_lr, "base_lr": token_lr}) + if base_model.bigram.proj is not None: + scalar_params.append(base_model.bigram.proj.weight) + if base_model.ve_shared is not None: + tok_params.append({"params": [base_model.ve_shared.embed.weight], "lr": token_lr, "base_lr": token_lr}) + if base_model.ve_shared.proj is not None: + scalar_params.append(base_model.ve_shared.proj.weight) + scalar_params.append(base_model.ve_shared.scale) + for s in base_model.ve_layer_scales: + scalar_params.append(s) + optimizer_tok = torch.optim.AdamW( + tok_params, + betas=(args.beta1, args.beta2), + eps=args.adam_eps, + weight_decay=args.adam_wd, + fused=True, + ) + optimizer_muon = Muon( + matrix_params, + lr=args.matrix_lr, + momentum=args.muon_momentum, + backend_steps=args.muon_backend_steps, + weight_decay=args.muon_wd, + ) + for group in optimizer_muon.param_groups: + group["base_lr"] = args.matrix_lr + 
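+    # Feedback and residual-scale parameters are tiny, so they ride in the
+    # scalar AdamW group (below) rather than getting a dedicated optimizer.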
scalar_params.extend(extra_scalar_params) + optimizer_scalar = torch.optim.AdamW( + [{"params": scalar_params, "lr": args.scalar_lr, "base_lr": args.scalar_lr}], + betas=(args.beta1, args.beta2), + eps=args.adam_eps, + weight_decay=args.adam_wd, + fused=True, + ) + # Non-bank params that need manual all-reduce (replicated across GPUs) + replicated_params = list(optimizer_tok.param_groups[0]["params"]) + for pg in optimizer_tok.param_groups[1:]: + replicated_params.extend(pg["params"]) + replicated_params.extend(scalar_params) + + optimizer_head = None + if base_model.lm_head is not None: + optimizer_head = torch.optim.Adam( + [{"params": [base_model.lm_head.weight], "lr": args.head_lr, "base_lr": args.head_lr}], + betas=(args.beta1, args.beta2), + eps=args.adam_eps, + fused=True, + ) + replicated_params.append(base_model.lm_head.weight) + optimizers: list[torch.optim.Optimizer] = [optimizer_tok, optimizer_muon, optimizer_scalar] + if optimizer_head is not None: + optimizers.append(optimizer_head) + n_params = sum(p.numel() for p in base_model.parameters()) + mtp_params = sum(p.numel() for p in base_model.mtp_heads.parameters()) + log0(f"model_params:{n_params}") + log0(f"mtp_num_heads:{args.mtp_num_heads} mtp_loss_weight:{args.mtp_loss_weight} mtp_params:{mtp_params}") + xsa_layers = [i for i, b in enumerate(base_model.blocks) if b.attn.use_xsa] + log0(f"XSA:last_{args.xsa_last_n} active_layers:{xsa_layers}") + log0(f"world_size:{world_size} grad_accum_steps:{grad_accum_steps}") + log0("sdp_backends:cudnn=False flash=True mem_efficient=False math=False") + log0(f"attention_mode:gqa num_heads:{args.num_heads} num_kv_heads:{args.num_kv_heads}") + log0( + f"tie_embeddings:{args.tie_embeddings} embed_lr:{token_lr} " + f"head_lr:{args.head_lr if base_model.lm_head is not None else 0.0} " + f"matrix_lr:{args.matrix_lr} scalar_lr:{args.scalar_lr}" + ) + log0( + f"train_batch_tokens:{args.train_batch_tokens} train_seq_len:{args.train_seq_len} " + f"iterations:{args.iterations} warmup_steps:{args.warmup_steps} " + f"max_wallclock_seconds:{args.max_wallclock_seconds:.3f}" + ) + log0(f"seed:{args.seed}") + use_wandb = _wandb is not None and rank == 0 and os.environ.get("WANDB_DISABLED", "0") != "1" + if use_wandb: + _wandb.init( + project=os.environ.get("WANDB_PROJECT", "parameter-golf"), + name=os.environ.get("WANDB_NAME", f"recurrent_p{args.num_passes}_s{args.seed}"), + config={ + "num_layers": args.num_layers, "model_dim": args.model_dim, + "num_passes": args.num_passes, "core_start": args.core_start, + "core_end": args.core_end, "seed": args.seed, + "train_batch_tokens": args.train_batch_tokens, + "train_seq_len": args.train_seq_len, "iterations": args.iterations, + "matrix_lr": args.matrix_lr, "scalar_lr": args.scalar_lr, + "feedback_mode": cli.feedback_mode, "feedback_rank": cli.feedback_rank, + "jacobian_proxy_weight": cli.jacobian_proxy_weight, + "residual_scale_init": cli.residual_scale_init, + "interpass_rmsnorm": not cli.no_interpass_rmsnorm, + "n_params": sum(p.numel() for p in base_model.parameters()), + }, + reinit=True, + ) + log0("wandb:initialized") + train_loader = DistributedTokenLoader(args.train_files, rank, world_size, device) + def zero_grad_all() -> None: + for opt in optimizers: + opt.zero_grad(set_to_none=True) + max_wallclock_ms = 1000.0 * args.max_wallclock_seconds if args.max_wallclock_seconds > 0 else None + def lr_mul(step: int, elapsed_ms: float) -> float: + if args.warmdown_iters <= 0: + return 1.0 + if max_wallclock_ms is None: + warmdown_start = max(args.iterations 
- args.warmdown_iters, 0) + return max((args.iterations - step) / max(args.warmdown_iters, 1), 0.0) if warmdown_start <= step < args.iterations else 1.0 + step_ms = elapsed_ms / max(step, 1) + warmdown_ms = args.warmdown_iters * step_ms + remaining_ms = max(max_wallclock_ms - elapsed_ms, 0.0) + return remaining_ms / max(warmdown_ms, 1e-9) if remaining_ms <= warmdown_ms else 1.0 + if args.warmup_steps > 0: + initial_model_state = {name: tensor.detach().cpu().clone() for name, tensor in base_model.state_dict().items()} + initial_optimizer_states = [copy.deepcopy(opt.state_dict()) for opt in optimizers] + model.train() + for warmup_step in range(args.warmup_steps): + zero_grad_all() + for micro_step in range(grad_accum_steps): + x, y = train_loader.next_batch(args.train_batch_tokens, args.train_seq_len, grad_accum_steps) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer) + (warmup_loss * grad_scale).backward() + if distributed: + for p in base_model.parameters(): + if p.grad is not None: + dist.all_reduce(p.grad, op=dist.ReduceOp.AVG) + if feedback is not None: + for p in feedback.parameters(): + if p.grad is not None: + dist.all_reduce(p.grad, op=dist.ReduceOp.AVG) + for opt in optimizers: + opt.step() + zero_grad_all() + if args.warmup_steps <= 20 or (warmup_step + 1) % 10 == 0 or warmup_step + 1 == args.warmup_steps: + log0(f"warmup_step:{warmup_step + 1}/{args.warmup_steps}") + base_model.load_state_dict(initial_model_state, strict=True) + for opt, state in zip(optimizers, initial_optimizer_states, strict=True): + opt.load_state_dict(state) + zero_grad_all() + train_loader = DistributedTokenLoader(args.train_files, rank, world_size, device) + swa_state: dict[str, Tensor] | None = None + swa_count = 0 + from collections import deque + lawa_queue: deque[dict[str, Tensor]] = deque(maxlen=args.lawa_k) + _all_state = dict(base_model.state_dict()) + if feedback is not None: + for k, v in feedback.state_dict().items(): + _all_state[f"_fb.{k}"] = v + ema_state = {name: t.detach().float().clone() for name, t in _all_state.items()} + ema_decay = 0.997 + training_time_ms = 0.0 + stop_after_step: int | None = None + torch.cuda.synchronize() + t0 = time.perf_counter() + step = 0 + while True: + last_step = step == args.iterations or (stop_after_step is not None and step >= stop_after_step) + should_validate = last_step or (args.val_loss_every > 0 and step % args.val_loss_every == 0) + if should_validate: + torch.cuda.synchronize() + training_time_ms += 1000.0 * (time.perf_counter() - t0) + val_loss, val_bpb = eval_val( + args, + model, + rank, + world_size, + device, + grad_accum_steps, + val_tokens, + base_bytes_lut, + has_leading_space_lut, + is_boundary_token_lut, + ) + diag_str = "" + if stabilizer is not None and stabilizer.diagnostics.h_norms: + hn = [f"{v:.1f}" for v in stabilizer.diagnostics.h_norms[-args.num_passes*base_model.num_core:]] + gr = [f"{v:.3f}" for v in stabilizer.diagnostics.growth_ratios[-args.num_passes*base_model.num_core:]] + diag_str = f" h_norms={hn} growth={gr}" + stabilizer.reset() + log0( + f"step:{step}/{args.iterations} val_loss:{val_loss:.4f} val_bpb:{val_bpb:.4f} " + f"train_time:{training_time_ms:.0f}ms step_avg:{training_time_ms / max(step, 1):.2f}ms" + f"{diag_str}" + ) + if use_wandb: + wb_data = {"val_loss": val_loss, "val_bpb": val_bpb} + if stabilizer is not None and stabilizer.diagnostics.growth_ratios: + wb_data["max_growth"] = 
max(stabilizer.diagnostics.growth_ratios) + wb_data["mean_growth"] = sum(stabilizer.diagnostics.growth_ratios) / len(stabilizer.diagnostics.growth_ratios) + _wandb.log(wb_data, step=step) + torch.cuda.synchronize() + t0 = time.perf_counter() + if last_step: + if stop_after_step is not None and step < args.iterations: + log0( + f"stopping_early: wallclock_cap train_time:{training_time_ms:.0f}ms " + f"step:{step}/{args.iterations}" + ) + break + elapsed_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0) + scale = lr_mul(step, elapsed_ms) + if args.late_qat_threshold > 0 and scale < args.late_qat_threshold and not CastedLinear._qat_enabled: + CastedLinear._qat_enabled = True + log0(f"late_qat:enabled step:{step} scale:{scale:.4f}") + zero_grad_all() + train_loss = torch.zeros((), device=device) + for micro_step in range(grad_accum_steps): + x, y = train_loader.next_batch(args.train_batch_tokens, args.train_seq_len, grad_accum_steps) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer) + train_loss += loss.detach() + (loss * grad_scale).backward() + train_loss /= grad_accum_steps + frac = min(step / args.muon_momentum_warmup_steps, 1.0) if args.muon_momentum_warmup_steps > 0 else 1.0 + muon_momentum = (1 - frac) * args.muon_momentum_warmup_start + frac * args.muon_momentum + for group in optimizer_muon.param_groups: + group["momentum"] = muon_momentum + for opt in optimizers: + for group in opt.param_groups: + group["lr"] = group["base_lr"] * scale + if args.grad_clip_norm > 0: + torch.nn.utils.clip_grad_norm_(base_model.parameters(), args.grad_clip_norm) + # === 3-phase overlapped optimizer step === + # Phase 1: Launch async reduce-scatter for banks (biggest first) + optimizer_muon.launch_reduce_scatters() + # Phase 2: All-reduce non-bank grads + step Adam (while bank RS is in-flight) + if distributed: + for p in replicated_params: + if p.grad is not None: + dist.all_reduce(p.grad, op=dist.ReduceOp.AVG) + optimizer_tok.step() + optimizer_scalar.step() + if optimizer_head is not None: + optimizer_head.step() + # Phase 3: Wait for RS, local NS5, all-gather (banks processed last) + optimizer_muon.step() + zero_grad_all() + # EMA update + with torch.no_grad(): + _cur = dict(base_model.state_dict()) + if feedback is not None: + for k, v in feedback.state_dict().items(): + _cur[f"_fb.{k}"] = v + for name, t in _cur.items(): + ema_state[name].mul_(ema_decay).add_(t.detach().float(), alpha=1.0 - ema_decay) + step += 1 + approx_training_time_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0) + if args.swa_enabled and scale < 0.2 and step % args.swa_every == 0: + if swa_state is None: + swa_state = {name: t.detach().cpu().clone() for name, t in base_model.state_dict().items()} + swa_count = 1 + log0(f"swa:start step:{step}") + else: + for name, t in base_model.state_dict().items(): + swa_state[name] += t.detach().cpu() + swa_count += 1 + if args.lawa_enabled and step % args.lawa_freq == 0: + lawa_queue.append({name: t.detach().cpu().clone() for name, t in base_model.state_dict().items()}) + should_log_train = ( + args.train_log_every > 0 + and (step <= 10 or step % args.train_log_every == 0 or stop_after_step is not None) + ) + if should_log_train: + tl = train_loss.item() + log0( + f"step:{step}/{args.iterations} train_loss:{tl:.4f} " + f"train_time:{approx_training_time_ms:.0f}ms step_avg:{approx_training_time_ms / step:.2f}ms" + ) + if use_wandb: + _wandb.log({"train_loss": tl, 
"step_avg_ms": approx_training_time_ms / step, "lr_scale": scale}, step=step) + reached_cap = max_wallclock_ms is not None and approx_training_time_ms >= max_wallclock_ms + if distributed and max_wallclock_ms is not None: + reached_cap_tensor = torch.tensor(int(reached_cap), device=device) + dist.all_reduce(reached_cap_tensor, op=dist.ReduceOp.MAX) + reached_cap = bool(reached_cap_tensor.item()) + if stop_after_step is None and reached_cap: + stop_after_step = step + log0( + f"peak memory allocated: {torch.cuda.max_memory_allocated() // 1024 // 1024} MiB " + f"reserved: {torch.cuda.max_memory_reserved() // 1024 // 1024} MiB" + ) + # Apply weight averaging + if args.lawa_enabled and len(lawa_queue) > 1: + log0(f"lawa:applying LAWA averaging k={len(lawa_queue)}") + current_state = base_model.state_dict() + avg_state = {name: torch.zeros(t.shape, dtype=torch.float32, device='cpu') for name, t in current_state.items()} + for snap in lawa_queue: + for name in avg_state: + avg_state[name] += snap[name].float() + for name in avg_state: + avg_state[name] /= len(lawa_queue) + avg_state[name] = avg_state[name].to(dtype=current_state[name].dtype) + base_model.load_state_dict(avg_state, strict=True) + else: + log0("ema:applying EMA weights") + current_state = base_model.state_dict() + model_ema = {k: v for k, v in ema_state.items() if not k.startswith("_fb.")} + avg_state = {name: model_ema[name].to(dtype=current_state[name].dtype) for name in current_state} + base_model.load_state_dict(avg_state, strict=True) + if feedback is not None: + fb_ema = {k.removeprefix("_fb."): v for k, v in ema_state.items() if k.startswith("_fb.")} + fb_state = feedback.state_dict() + fb_avg = {k: fb_ema[k].to(dtype=fb_state[k].dtype) for k in fb_state} + feedback.load_state_dict(fb_avg, strict=True) + torch.cuda.synchronize() + t_diag = time.perf_counter() + diag_val_loss, diag_val_bpb = eval_val( + args, compiled_model, rank, world_size, device, grad_accum_steps, + val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut, + ) + torch.cuda.synchronize() + log0( + f"DIAGNOSTIC post_ema val_loss:{diag_val_loss:.4f} val_bpb:{diag_val_bpb:.4f} " + f"eval_time:{1000.0 * (time.perf_counter() - t_diag):.0f}ms" + ) + full_state_dict = base_model.state_dict() + export_sd = {k: v for k, v in full_state_dict.items() if "mtp_heads" not in k} + excluded_mtp = sum(int(t.numel()) for k, t in full_state_dict.items() if "mtp_heads" in k) + if excluded_mtp > 0: + log0(f"export_excluding_mtp_params:{excluded_mtp}") + if master_process: + torch.save(export_sd, "final_model.pt") + model_bytes = os.path.getsize("final_model.pt") + code_bytes = len(code.encode("utf-8")) + log0(f"Serialized model: {model_bytes} bytes") + log0(f"Code size: {code_bytes} bytes") + # Unbank 3D tensors into individual 2D tensors for quantization + sd_cpu = {k: v.detach().cpu() for k, v in export_sd.items()} + unbanked_sd = _unbank_state_dict(sd_cpu, args.num_layers) + quant_result, quant_meta = mixed_quantize_int6(unbanked_sd, {"mlp", "attn"}) + quant_buf = io.BytesIO() + torch.save({"w": quant_result, "m": quant_meta}, quant_buf) + quant_raw = quant_buf.getvalue() + quant_blob = lzma.compress(quant_raw, preset=6) + if master_process: + with open("final_model.int6.ptz", "wb") as f: + f.write(quant_blob) + quant_file_bytes = len(quant_blob) + code_bytes = len(code.encode("utf-8")) + log0(f"Serialized model int6+lzma: {quant_file_bytes} bytes") + log0(f"Total submission size int6+lzma: {quant_file_bytes + code_bytes} bytes") + if distributed: + 
+    # Unbank 3D tensors into individual 2D tensors for quantization
+    sd_cpu = {k: v.detach().cpu() for k, v in export_sd.items()}
+    unbanked_sd = _unbank_state_dict(sd_cpu, args.num_layers)
+    quant_result, quant_meta = mixed_quantize_int6(unbanked_sd, {"mlp", "attn"})
+    quant_buf = io.BytesIO()
+    torch.save({"w": quant_result, "m": quant_meta}, quant_buf)
+    quant_raw = quant_buf.getvalue()
+    quant_blob = lzma.compress(quant_raw, preset=6)
+    if master_process:
+        with open("final_model.int6.ptz", "wb") as f:
+            f.write(quant_blob)
+        quant_file_bytes = len(quant_blob)
+        code_bytes = len(code.encode("utf-8"))
+        log0(f"Serialized model int6+lzma: {quant_file_bytes} bytes")
+        log0(f"Total submission size int6+lzma: {quant_file_bytes + code_bytes} bytes")
+    if distributed:
+        dist.barrier()
+    with open("final_model.int6.ptz", "rb") as f:
+        quant_blob_disk = f.read()
+    quant_state = torch.load(
+        io.BytesIO(lzma.decompress(quant_blob_disk)),
+        map_location="cpu",
+    )
+    deq_unbanked = dequantize_mixed_int6(quant_state["w"], quant_state["m"], unbanked_sd)
+    # Re-bank the dequantized tensors
+    deq_state = _rebank_state_dict(deq_unbanked, args.num_layers, sd_cpu)
+    eval_model = GPT(
+        vocab_size=args.vocab_size, num_layers=args.num_layers, model_dim=args.model_dim,
+        num_heads=args.num_heads, num_kv_heads=args.num_kv_heads, mlp_mult=args.mlp_mult,
+        tie_embeddings=args.tie_embeddings, tied_embed_init_std=args.tied_embed_init_std,
+        logit_softcap=args.logit_softcap, rope_base=args.rope_base, qk_gain_init=args.qk_gain_init,
+        mtp_num_heads=0, mtp_loss_weight=0.0,
+        bigram_vocab_size=args.bigram_vocab_size, bigram_dim=args.bigram_dim,
+        xsa_last_n=args.xsa_last_n,
+        rope_dims=args.rope_dims, ln_scale=args.ln_scale, dtg=args.dtg_enabled,
+        ve_enabled=args.ve_enabled, ve_dim=args.ve_dim, ve_layers=args.ve_layers,
+        gated_attention=args.gated_attention, value_residual=args.value_residual,
+        core_start=args.core_start, core_end=args.core_end,
+        num_passes=args.num_passes,
+        interpass_rmsnorm=not cli.no_interpass_rmsnorm,
+    ).to(device).bfloat16()
+    if residual_scale is not None:
+        eval_rs = ResidualScale(args.num_passes, cli.residual_scale_init).to(device)
+        eval_model.residual_scale = eval_rs
+    eval_model.qo_bank.data = eval_model.qo_bank.data.float()
+    eval_model.kv_bank.data = eval_model.kv_bank.data.float()
+    eval_model.mlp_up_bank.data = eval_model.mlp_up_bank.data.float()
+    eval_model.mlp_down_bank.data = eval_model.mlp_down_bank.data.float()
+    for m in eval_model.modules():
+        if isinstance(m, CastedLinear):
+            m.float()
+    restore_low_dim_params_to_fp32(eval_model)
+    eval_model.load_state_dict(deq_state, strict=True)
+    compiled_eval = torch.compile(eval_model, dynamic=False, fullgraph=True)
+    torch.cuda.synchronize()
+    t_qeval = time.perf_counter()
+    q_val_loss, q_val_bpb = eval_val(
+        args, compiled_eval, rank, world_size, device, grad_accum_steps,
+        val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut,
+        eval_seq_len=effective_eval_seq_len,
+    )
+    torch.cuda.synchronize()
+    log0(
+        f"final_int6_roundtrip val_loss:{q_val_loss:.4f} val_bpb:{q_val_bpb:.4f} "
+        f"eval_time:{1000.0 * (time.perf_counter() - t_qeval):.0f}ms"
+    )
+    log0(f"final_int6_roundtrip_exact val_loss:{q_val_loss:.8f} val_bpb:{q_val_bpb:.8f}")
+    sw_seq_len = effective_eval_seq_len
+    if args.eval_stride > 0 and args.eval_stride < sw_seq_len:
+        torch.cuda.synchronize()
+        t_slide = time.perf_counter()
+        sw_val_loss, sw_val_bpb = eval_val_sliding(
+            args, eval_model, rank, world_size, device,
+            val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut,
+            stride=args.eval_stride,
+            eval_seq_len=sw_seq_len,
+        )
+        torch.cuda.synchronize()
+        log0(
+            f"final_int6_sliding_window val_loss:{sw_val_loss:.4f} val_bpb:{sw_val_bpb:.4f} "
+            f"stride:{args.eval_stride} eval_time:{1000.0 * (time.perf_counter() - t_slide):.0f}ms"
+        )
+        log0(f"final_int6_sliding_window_exact val_loss:{sw_val_loss:.8f} val_bpb:{sw_val_bpb:.8f}")
+        # NB: the sliding-window result is re-emitted under the historical
+        # "final_int8_zlib" key below, presumably so older log parsers still
+        # find a value at that name.
+        log0(f"final_int8_zlib_roundtrip_exact val_loss:{sw_val_loss:.8f} val_bpb:{sw_val_bpb:.8f}")
+    if args.eval_stride > 0 and args.eval_stride != 64 and 64 < sw_seq_len:
+        torch.cuda.synchronize()
+        t_slide64 = time.perf_counter()
+        sw64_val_loss, sw64_val_bpb = eval_val_sliding(
+            args, eval_model, rank, world_size, device,
+            val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut,
+            stride=64,
+            eval_seq_len=sw_seq_len,
+        )
+        torch.cuda.synchronize()
+        log0(
+            f"final_int6_sliding_window_s64 val_loss:{sw64_val_loss:.4f} val_bpb:{sw64_val_bpb:.4f} "
+            f"stride:64 eval_time:{1000.0 * (time.perf_counter() - t_slide64):.0f}ms"
+        )
+        log0(f"final_int6_sliding_window_s64_exact val_loss:{sw64_val_loss:.8f} val_bpb:{sw64_val_bpb:.8f}")
+        # Same legacy-key re-emit as above, now with the stride-64 numbers.
+        log0(f"final_int8_zlib_roundtrip_exact val_loss:{sw64_val_loss:.8f} val_bpb:{sw64_val_bpb:.8f}")
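+    # Score-first TTT: as the name suggests, each chunk is first scored with
+    # the current weights and only then used for adaptation, so the reported
+    # bpb never benefits from weights updated on text it has not yet paid for.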
+    # Legal score-first TTT (PR #461 recipe)
+    if args.ttt_enabled:
+        torch.cuda.synchronize()
+        t_ttt = time.perf_counter()
+        ttt_loss, ttt_bpb = eval_val_sliding_ttt(
+            args, eval_model, rank, world_size, device,
+            val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut,
+            stride=args.eval_stride, log0=log0,
+        )
+        torch.cuda.synchronize()
+        log0(f"legal_ttt val_loss:{ttt_loss:.4f} val_bpb:{ttt_bpb:.4f} "
+             f"eval_time:{1000.0 * (time.perf_counter() - t_ttt):.0f}ms")
+        log0(f"legal_ttt_exact val_loss:{ttt_loss:.8f} val_bpb:{ttt_bpb:.8f}")
+    if use_wandb:
+        _wandb.finish()
+    if distributed:
+        dist.destroy_process_group()
+if __name__ == "__main__":
+    main()
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/debug-internal.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/debug-internal.log new file mode 120000 index 0000000000..8e9b7f0d45 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/debug-internal.log @@ -0,0 +1 @@ +run-20260326_125242-meaoom9b/logs/debug-internal.log \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/debug.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/debug.log new file mode 120000 index 0000000000..733e002e6a --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/debug.log @@ -0,0 +1 @@ +run-20260326_125242-meaoom9b/logs/debug.log \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/latest-run b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/latest-run new file mode 120000 index 0000000000..699dd5e8ca --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/latest-run @@ -0,0 +1 @@ +run-20260326_125242-meaoom9b \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_102057-7kghfexn/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_102057-7kghfexn/files/output.log new file mode 100644 index 0000000000..c553d12bd9 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_102057-7kghfexn/files/output.log @@ -0,0 +1,32 @@ +wandb:initialized +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['10184.1', '11333.6', '12573.3', '13910.9', '15223.2', '8143.8', '9246.6', '10423.0', '11669.9', '12876.4'] growth=['1.115', '1.113', '1.109', '1.106', '1.094', '1.148', '1.135', '1.127', '1.120', '1.103'] +step:1/50 train_loss:6.9310 train_time:1952ms step_avg:1952.03ms +step:2/50 train_loss:8.5267 train_time:3880ms step_avg:1940.18ms +step:3/50 train_loss:7.6283 train_time:5841ms step_avg:1947.02ms +step:4/50 train_loss:7.3205 train_time:7802ms step_avg:1950.38ms +step:5/50 train_loss:7.1281 train_time:9762ms step_avg:1952.40ms
+step:6/50 train_loss:7.0824 train_time:11723ms step_avg:1953.81ms +step:7/50 train_loss:7.0693 train_time:13683ms step_avg:1954.74ms +step:8/50 train_loss:6.9484 train_time:15645ms step_avg:1955.59ms +step:9/50 train_loss:6.6018 train_time:17606ms step_avg:1956.19ms +step:10/50 train_loss:6.2455 train_time:19568ms step_avg:1956.75ms +step:20/50 train_loss:4.9604 train_time:39172ms step_avg:1958.61ms +step:25/50 val_loss:4.4390 val_bpb:2.6290 train_time:49012ms step_avg:1960.49ms h_norms=['14873.4', '16984.5', '19609.5', '22769.3', '26596.0', '9176.2', '11567.0', '14326.1', '17549.3', '21209.4'] growth=['1.122', '1.142', '1.155', '1.161', '1.168', '1.293', '1.261', '1.239', '1.225', '1.209'] +step:30/50 train_loss:4.2813 train_time:58788ms step_avg:1959.60ms +step:40/50 train_loss:3.9445 train_time:78404ms step_avg:1960.11ms +step:50/50 train_loss:3.7524 train_time:98153ms step_avg:1963.05ms +step:50/50 val_loss:3.7228 val_bpb:2.2048 train_time:98186ms step_avg:1963.73ms h_norms=['24291.9', '27271.2', '31301.3', '36517.1', '43661.8', '10419.3', '14412.6', '19169.9', '24972.2', '31506.7'] growth=['1.087', '1.123', '1.148', '1.167', '1.196', '1.469', '1.383', '1.330', '1.303', '1.262'] +peak memory allocated: 42589 MiB reserved: 43756 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:5.9163 val_bpb:3.5040 eval_time:53998ms +Serialized model: 106023671 bytes +Code size: 98482 bytes +Serialized model int6+lzma: 4808664 bytes +Total submission size int6+lzma: 4907146 bytes +final_int6_roundtrip val_loss:6.1192 val_bpb:3.6242 eval_time:53678ms +final_int6_roundtrip_exact val_loss:6.11924756 val_bpb:3.62416309 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_102057-7kghfexn/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_102057-7kghfexn/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_102057-7kghfexn/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 
+aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_102057-7kghfexn/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_102057-7kghfexn/files/wandb-metadata.json new file mode 100644 index 0000000000..125b916774 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_102057-7kghfexn/files/wandb-metadata.json @@ -0,0 +1,50 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T10:20:57.757647Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.0" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "b38248cf6d4a1387d06b2906628c717e59747b11" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "39163809792" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "srh6199e3woms4qu9azho2s6gvavn90t" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103645-sx3ojo04/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103645-sx3ojo04/files/output.log new file mode 100644 index 0000000000..355a614aaf --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103645-sx3ojo04/files/output.log @@ -0,0 +1,3 @@ +wandb:initialized +warmup_step:1/5 +warmup_step:2/5 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103645-sx3ojo04/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103645-sx3ojo04/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103645-sx3ojo04/files/requirements.txt @@ 
-0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103645-sx3ojo04/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103645-sx3ojo04/files/wandb-metadata.json new file mode 100644 index 0000000000..28a9c2ba18 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103645-sx3ojo04/files/wandb-metadata.json @@ -0,0 +1,50 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T10:36:45.118964Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.0" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "b38248cf6d4a1387d06b2906628c717e59747b11" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": 
"/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "39164932096" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "yx89lnasccbpekl0j4o7x9y4zum59zhs" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103737-eswa6xue/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103737-eswa6xue/files/output.log new file mode 100644 index 0000000000..59b3f03cc7 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103737-eswa6xue/files/output.log @@ -0,0 +1,32 @@ +wandb:initialized +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['10183.6', '11333.2', '12572.8', '13910.4', '15222.6', '8143.9', '9247.0', '10423.5', '11670.2', '12876.4'] growth=['1.115', '1.113', '1.109', '1.106', '1.094', '1.148', '1.135', '1.127', '1.120', '1.103'] +step:1/50 train_loss:6.9310 train_time:1949ms step_avg:1949.49ms +step:2/50 train_loss:8.5267 train_time:3878ms step_avg:1938.79ms +step:3/50 train_loss:7.6283 train_time:5839ms step_avg:1946.17ms +step:4/50 train_loss:7.3204 train_time:7799ms step_avg:1949.66ms +step:5/50 train_loss:7.1281 train_time:9759ms step_avg:1951.73ms +step:6/50 train_loss:7.0824 train_time:11719ms step_avg:1953.12ms +step:7/50 train_loss:7.0695 train_time:13679ms step_avg:1954.13ms +step:8/50 train_loss:6.9486 train_time:15639ms step_avg:1954.91ms +step:9/50 train_loss:6.6021 train_time:17599ms step_avg:1955.45ms +step:10/50 train_loss:6.2460 train_time:19559ms step_avg:1955.87ms +step:20/50 train_loss:4.9600 train_time:39160ms step_avg:1958.01ms +step:25/50 val_loss:4.4394 val_bpb:2.6293 train_time:48999ms step_avg:1959.95ms h_norms=['14881.1', '16985.9', '19606.6', '22793.0', '26627.8', '9172.8', '11563.4', '14325.3', '17588.9', '21265.7'] growth=['1.122', '1.141', '1.154', '1.163', '1.168', '1.293', '1.261', '1.239', '1.228', '1.209'] +step:30/50 train_loss:4.2822 train_time:58772ms step_avg:1959.06ms +step:40/50 train_loss:3.9688 train_time:78387ms step_avg:1959.68ms +step:50/50 train_loss:3.7881 train_time:98126ms step_avg:1962.52ms +step:50/50 val_loss:3.7348 val_bpb:2.2120 train_time:98160ms step_avg:1963.20ms h_norms=['24218.6', '27362.1', '31613.8', '37171.5', '44612.3', '10585.9', '14797.2', '19826.0', '26015.5', '32857.0'] growth=['1.092', '1.130', '1.155', '1.176', '1.200', '1.492', '1.398', '1.340', '1.312', '1.263'] +peak memory allocated: 42589 MiB reserved: 43756 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:5.9155 val_bpb:3.5035 eval_time:53996ms +Serialized model: 106023671 bytes +Code size: 98482 bytes +Serialized model int6+lzma: 4809880 bytes +Total submission size int6+lzma: 4908362 bytes +final_int6_roundtrip val_loss:6.1179 val_bpb:3.6234 eval_time:53675ms +final_int6_roundtrip_exact val_loss:6.11791815 val_bpb:3.62337574 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103737-eswa6xue/files/requirements.txt 
b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103737-eswa6xue/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103737-eswa6xue/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103737-eswa6xue/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103737-eswa6xue/files/wandb-metadata.json new file mode 100644 index 0000000000..2449cdbaea --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103737-eswa6xue/files/wandb-metadata.json @@ -0,0 +1,50 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T10:37:37.845855Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.0" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": 
"https://github.com/nestamidavaine/parameter-golf.git", + "commit": "b38248cf6d4a1387d06b2906628c717e59747b11" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "39165120512" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "czvy3uj9pzs3a75jgw4247r47mg09bgt" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_104722-1bh0d9xu/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_104722-1bh0d9xu/files/config.yaml new file mode 100644 index 0000000000..d6e48b92ca --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_104722-1bh0d9xu/files/config.yaml @@ -0,0 +1,95 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + dldmmu7898u3wxp50x9s5v424dyq7s0p: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.0" + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "39165960192" + email: nesta.midavaine@prosus.com + executable: /home/nesta/parameter-golf/.venv/bin/python3 + git: + commit: b38248cf6d4a1387d06b2906628c717e59747b11 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + python: CPython 3.12.3 + root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback + startedAt: "2026-03-26T10:47:22.053609Z" + writerId: dldmmu7898u3wxp50x9s5v424dyq7s0p + m: [] + python_version: 3.12.3 + t: + "1": + - 1 + "2": + - 1 + "3": + - 2 + - 13 + - 16 + - 61 + "4": 3.12.3 + "5": 0.25.1 + "10": + - 20 + "12": 0.25.1 + "13": linux-x86_64 +core_end: + value: 8 +core_start: + value: 3 +feedback_mode: + value: diagonal +feedback_rank: + value: 2 +iterations: + value: 50 +jacobian_proxy_weight: + value: 0 +matrix_lr: + value: 0.025 +model_dim: + value: 512 +n_params: + value: 26927198 +num_layers: + value: 11 +num_passes: + value: 2 +residual_scale_init: + value: 0.5 +scalar_lr: + value: 0.025 +seed: + value: 1337 +train_batch_tokens: + value: 786432 +train_seq_len: + value: 2048 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_104722-1bh0d9xu/files/output.log 
b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_104722-1bh0d9xu/files/output.log new file mode 100644 index 0000000000..094928789d --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_104722-1bh0d9xu/files/output.log @@ -0,0 +1,32 @@ +wandb:initialized +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['10183.4', '11332.7', '12572.5', '13910.0', '15222.2', '8143.5', '9246.5', '10422.9', '11669.4', '12875.7'] growth=['1.115', '1.113', '1.109', '1.106', '1.094', '1.148', '1.135', '1.127', '1.120', '1.103'] +step:1/50 train_loss:6.9310 train_time:1951ms step_avg:1951.04ms +step:2/50 train_loss:8.5267 train_time:3880ms step_avg:1940.13ms +step:3/50 train_loss:7.6283 train_time:5841ms step_avg:1947.13ms +step:4/50 train_loss:7.3204 train_time:7802ms step_avg:1950.55ms +step:5/50 train_loss:7.1281 train_time:9763ms step_avg:1952.51ms +step:6/50 train_loss:7.0824 train_time:11723ms step_avg:1953.92ms +step:7/50 train_loss:7.0689 train_time:13684ms step_avg:1954.91ms +step:8/50 train_loss:6.9485 train_time:15646ms step_avg:1955.69ms +step:9/50 train_loss:6.6014 train_time:17607ms step_avg:1956.32ms +step:10/50 train_loss:6.2455 train_time:19569ms step_avg:1956.88ms +step:20/50 train_loss:4.9608 train_time:39179ms step_avg:1958.94ms +step:25/50 val_loss:4.4416 val_bpb:2.6306 train_time:49019ms step_avg:1960.76ms h_norms=['14908.8', '17039.5', '19688.6', '22908.3', '26782.7', '9197.9', '11626.4', '14426.1', '17733.4', '21464.3'] growth=['1.123', '1.143', '1.155', '1.164', '1.169', '1.296', '1.264', '1.241', '1.229', '1.210'] +step:30/50 train_loss:4.2882 train_time:58795ms step_avg:1959.83ms +step:40/50 train_loss:3.9513 train_time:78416ms step_avg:1960.41ms +step:50/50 train_loss:3.7675 train_time:98151ms step_avg:1963.03ms +step:50/50 val_loss:3.7271 val_bpb:2.2074 train_time:98185ms step_avg:1963.71ms h_norms=['24528.6', '27553.7', '31725.9', '37229.2', '44851.6', '10551.0', '14775.8', '19896.1', '26232.7', '33500.8'] growth=['1.086', '1.123', '1.151', '1.173', '1.205', '1.487', '1.400', '1.347', '1.318', '1.277'] +peak memory allocated: 42589 MiB reserved: 43756 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:5.9153 val_bpb:3.5034 eval_time:53981ms +Serialized model: 106023671 bytes +Code size: 98507 bytes +Serialized model int6+lzma: 4812040 bytes +Total submission size int6+lzma: 4910547 bytes +final_int6_roundtrip val_loss:6.1184 val_bpb:3.6236 eval_time:53666ms +final_int6_roundtrip_exact val_loss:6.11836447 val_bpb:3.62364007 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_104722-1bh0d9xu/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_104722-1bh0d9xu/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_104722-1bh0d9xu/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 
+annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_104722-1bh0d9xu/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_104722-1bh0d9xu/files/wandb-metadata.json new file mode 100644 index 0000000000..7010564d2e --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_104722-1bh0d9xu/files/wandb-metadata.json @@ -0,0 +1,50 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T10:47:22.053609Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.0" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "b38248cf6d4a1387d06b2906628c717e59747b11" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "39165960192" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": 
"GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "dldmmu7898u3wxp50x9s5v424dyq7s0p" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_104722-1bh0d9xu/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_104722-1bh0d9xu/files/wandb-summary.json new file mode 100644 index 0000000000..480b5ccc06 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_104722-1bh0d9xu/files/wandb-summary.json @@ -0,0 +1 @@ +{"_timestamp":1.7745223135023565e+09,"val_loss":3.7271137242587393,"val_bpb":2.207406685677728,"_runtime":396.160674534,"train_loss":3.767535924911499,"step_avg_ms":1963.0270427200594,"_step":50,"_wandb":{"runtime":396},"lr_scale":1} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_105405-dd5xlg1l/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_105405-dd5xlg1l/files/config.yaml new file mode 100644 index 0000000000..84d26cfa25 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_105405-dd5xlg1l/files/config.yaml @@ -0,0 +1,95 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + 7zt9kqhoenh4znl80mx5k2htd8t6ll0v: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.001" + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "39166717952" + email: nesta.midavaine@prosus.com + executable: /home/nesta/parameter-golf/.venv/bin/python3 + git: + commit: b38248cf6d4a1387d06b2906628c717e59747b11 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + python: CPython 3.12.3 + root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback + startedAt: "2026-03-26T10:54:05.714551Z" + writerId: 7zt9kqhoenh4znl80mx5k2htd8t6ll0v + m: [] + python_version: 3.12.3 + t: + "1": + - 1 + "2": + - 1 + "3": + - 2 + - 13 + - 16 + - 61 + "4": 3.12.3 + "5": 0.25.1 + "10": + - 20 + "12": 0.25.1 + "13": linux-x86_64 +core_end: + value: 8 +core_start: + value: 3 +feedback_mode: + value: diagonal +feedback_rank: + value: 2 +iterations: + value: 50 +jacobian_proxy_weight: + value: 0.001 +matrix_lr: + value: 0.025 +model_dim: + value: 512 +n_params: + value: 26927198 +num_layers: + value: 11 +num_passes: + value: 2 +residual_scale_init: + value: 0.5 +scalar_lr: + value: 0.025 +seed: + value: 1337 +train_batch_tokens: + value: 786432 +train_seq_len: + value: 2048 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_105405-dd5xlg1l/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_105405-dd5xlg1l/files/output.log new file 
mode 100644 index 0000000000..5d14acfddd --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_105405-dd5xlg1l/files/output.log @@ -0,0 +1,32 @@ +wandb:initialized +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['10183.3', '11332.9', '12572.4', '13909.9', '15221.9', '8143.5', '9246.2', '10422.5', '11669.1', '12875.2'] growth=['1.115', '1.113', '1.109', '1.106', '1.094', '1.148', '1.135', '1.127', '1.120', '1.103'] +step:1/50 train_loss:6.9310 train_time:1969ms step_avg:1968.99ms +step:2/50 train_loss:8.5267 train_time:3913ms step_avg:1956.75ms +step:3/50 train_loss:7.6283 train_time:5891ms step_avg:1963.51ms +step:4/50 train_loss:7.3204 train_time:7867ms step_avg:1966.84ms +step:5/50 train_loss:7.1282 train_time:9845ms step_avg:1969.03ms +step:6/50 train_loss:7.0824 train_time:11823ms step_avg:1970.45ms +step:7/50 train_loss:7.0693 train_time:13800ms step_avg:1971.43ms +step:8/50 train_loss:6.9483 train_time:15777ms step_avg:1972.14ms +step:9/50 train_loss:6.6021 train_time:17755ms step_avg:1972.76ms +step:10/50 train_loss:6.2458 train_time:19732ms step_avg:1973.24ms +step:20/50 train_loss:4.9604 train_time:39495ms step_avg:1974.76ms +step:25/50 val_loss:4.4376 val_bpb:2.6282 train_time:49413ms step_avg:1976.52ms h_norms=['14875.2', '16993.9', '19634.3', '22841.7', '26706.4', '9197.5', '11608.3', '14391.8', '17687.0', '21405.1'] growth=['1.123', '1.142', '1.155', '1.163', '1.169', '1.296', '1.262', '1.240', '1.229', '1.210'] +step:30/50 train_loss:4.2700 train_time:59266ms step_avg:1975.53ms +step:40/50 train_loss:3.9376 train_time:79041ms step_avg:1976.01ms +step:50/50 train_loss:3.7755 train_time:98947ms step_avg:1978.94ms +step:50/50 val_loss:3.7384 val_bpb:2.2141 train_time:98981ms step_avg:1979.62ms h_norms=['24517.2', '27557.3', '31658.9', '36931.0', '44051.8', '10467.3', '14516.5', '19325.2', '25155.9', '31522.6'] growth=['1.087', '1.124', '1.149', '1.167', '1.193', '1.475', '1.387', '1.331', '1.302', '1.253'] +peak memory allocated: 42782 MiB reserved: 44140 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:5.9157 val_bpb:3.5036 eval_time:54130ms +Serialized model: 106023671 bytes +Code size: 98507 bytes +Serialized model int6+lzma: 4808692 bytes +Total submission size int6+lzma: 4907199 bytes +final_int6_roundtrip val_loss:6.1186 val_bpb:3.6238 eval_time:53816ms +final_int6_roundtrip_exact val_loss:6.11859679 val_bpb:3.62377766 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_105405-dd5xlg1l/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_105405-dd5xlg1l/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_105405-dd5xlg1l/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 
+nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_105405-dd5xlg1l/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_105405-dd5xlg1l/files/wandb-metadata.json new file mode 100644 index 0000000000..a7e9ba9e75 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_105405-dd5xlg1l/files/wandb-metadata.json @@ -0,0 +1,50 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T10:54:05.714551Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.001" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "b38248cf6d4a1387d06b2906628c717e59747b11" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "39166717952" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "7zt9kqhoenh4znl80mx5k2htd8t6ll0v" +} \ No newline at end of file diff --git 
a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_105405-dd5xlg1l/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_105405-dd5xlg1l/files/wandb-summary.json new file mode 100644 index 0000000000..7723db47fc --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_105405-dd5xlg1l/files/wandb-summary.json @@ -0,0 +1 @@ +{"_step":50,"_runtime":397.849574269,"lr_scale":1,"_wandb":{"runtime":397},"_timestamp":1.774522718399421e+09,"step_avg_ms":1978.93598405979,"train_loss":3.775528907775879,"val_loss":3.7384158033878103,"val_bpb":2.2141004135533198} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110050-pmxdy841/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110050-pmxdy841/files/config.yaml new file mode 100644 index 0000000000..8e39d129f7 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110050-pmxdy841/files/config.yaml @@ -0,0 +1,95 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + g4rj9sg7cepqs6jxzrfgkojfbv7f2gir: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.01" + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "39167303680" + email: nesta.midavaine@prosus.com + executable: /home/nesta/parameter-golf/.venv/bin/python3 + git: + commit: b38248cf6d4a1387d06b2906628c717e59747b11 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + python: CPython 3.12.3 + root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback + startedAt: "2026-03-26T11:00:50.777009Z" + writerId: g4rj9sg7cepqs6jxzrfgkojfbv7f2gir + m: [] + python_version: 3.12.3 + t: + "1": + - 1 + "2": + - 1 + "3": + - 2 + - 13 + - 16 + - 61 + "4": 3.12.3 + "5": 0.25.1 + "10": + - 20 + "12": 0.25.1 + "13": linux-x86_64 +core_end: + value: 8 +core_start: + value: 3 +feedback_mode: + value: diagonal +feedback_rank: + value: 2 +iterations: + value: 50 +jacobian_proxy_weight: + value: 0.01 +matrix_lr: + value: 0.025 +model_dim: + value: 512 +n_params: + value: 26927198 +num_layers: + value: 11 +num_passes: + value: 2 +residual_scale_init: + value: 0.5 +scalar_lr: + value: 0.025 +seed: + value: 1337 +train_batch_tokens: + value: 786432 +train_seq_len: + value: 2048 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110050-pmxdy841/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110050-pmxdy841/files/output.log new file mode 100644 index 0000000000..a929e78e0c --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110050-pmxdy841/files/output.log @@ 
-0,0 +1,32 @@ +wandb:initialized +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['10183.7', '11333.2', '12572.7', '13910.3', '15222.4', '8143.9', '9246.6', '10423.1', '11669.9', '12876.1'] growth=['1.115', '1.113', '1.109', '1.106', '1.094', '1.148', '1.135', '1.127', '1.120', '1.103'] +step:1/50 train_loss:6.9310 train_time:1965ms step_avg:1964.66ms +step:2/50 train_loss:8.5267 train_time:3905ms step_avg:1952.55ms +step:3/50 train_loss:7.6283 train_time:5878ms step_avg:1959.29ms +step:4/50 train_loss:7.3204 train_time:7850ms step_avg:1962.54ms +step:5/50 train_loss:7.1282 train_time:9822ms step_avg:1964.43ms +step:6/50 train_loss:7.0825 train_time:11795ms step_avg:1965.78ms +step:7/50 train_loss:7.0698 train_time:13768ms step_avg:1966.81ms +step:8/50 train_loss:6.9490 train_time:15741ms step_avg:1967.65ms +step:9/50 train_loss:6.6024 train_time:17715ms step_avg:1968.31ms +step:10/50 train_loss:6.2457 train_time:19688ms step_avg:1968.83ms +step:20/50 train_loss:4.9609 train_time:39413ms step_avg:1970.66ms +step:25/50 val_loss:4.4376 val_bpb:2.6282 train_time:49313ms step_avg:1972.51ms h_norms=['14907.9', '17014.7', '19640.8', '22830.5', '26674.6', '9180.5', '11569.7', '14334.2', '17600.8', '21284.5'] growth=['1.122', '1.141', '1.154', '1.162', '1.168', '1.294', '1.260', '1.239', '1.228', '1.209'] +step:30/50 train_loss:4.2769 train_time:59149ms step_avg:1971.62ms +step:40/50 train_loss:3.9519 train_time:78889ms step_avg:1972.21ms +step:50/50 train_loss:3.7667 train_time:98755ms step_avg:1975.11ms +step:50/50 val_loss:3.7317 val_bpb:2.2101 train_time:98790ms step_avg:1975.79ms h_norms=['24105.4', '27189.2', '31353.5', '36701.3', '43829.7', '10500.5', '14546.0', '19432.5', '25342.5', '31770.4'] growth=['1.093', '1.128', '1.153', '1.171', '1.194', '1.480', '1.385', '1.336', '1.304', '1.254'] +peak memory allocated: 42782 MiB reserved: 44140 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:5.9158 val_bpb:3.5037 eval_time:54005ms +Serialized model: 106023671 bytes +Code size: 98507 bytes +Serialized model int6+lzma: 4808548 bytes +Total submission size int6+lzma: 4907055 bytes +final_int6_roundtrip val_loss:6.1173 val_bpb:3.6230 eval_time:53697ms +final_int6_roundtrip_exact val_loss:6.11731613 val_bpb:3.62301918 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110050-pmxdy841/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110050-pmxdy841/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110050-pmxdy841/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 
+nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110050-pmxdy841/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110050-pmxdy841/files/wandb-metadata.json new file mode 100644 index 0000000000..afaab527eb --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110050-pmxdy841/files/wandb-metadata.json @@ -0,0 +1,50 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T11:00:50.777009Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.01" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "b38248cf6d4a1387d06b2906628c717e59747b11" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "39167303680" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "g4rj9sg7cepqs6jxzrfgkojfbv7f2gir" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110050-pmxdy841/files/wandb-summary.json 
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110050-pmxdy841/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110050-pmxdy841/files/wandb-summary.json
new file mode 100644
index 0000000000..2ff3f2fd12
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110050-pmxdy841/files/wandb-summary.json
@@ -0,0 +1 @@
+{"val_bpb":2.2101466433035046,"_step":50,"_runtime":396.900782582,"_wandb":{"runtime":396},"_timestamp":1.774523123213491e+09,"val_loss":3.731740027937702,"step_avg_ms":1975.1069359398389,"train_loss":3.766688823699951,"lr_scale":1}
\ No newline at end of file
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110735-wq77le9z/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110735-wq77le9z/files/config.yaml
new file mode 100644
index 0000000000..9e3f80b726
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110735-wq77le9z/files/config.yaml
@@ -0,0 +1,95 @@
+core_end:
+  value: 8
+core_start:
+  value: 3
+feedback_mode:
+  value: diagonal
+feedback_rank:
+  value: 2
+iterations:
+  value: 50
+jacobian_proxy_weight:
+  value: 0.1
+matrix_lr:
+  value: 0.025
+model_dim:
+  value: 512
+n_params:
+  value: 26927198
+num_layers:
+  value: 11
+num_passes:
+  value: 2
+residual_scale_init:
+  value: 0.5
+scalar_lr:
+  value: 0.025
+seed:
+  value: 1337
+train_batch_tokens:
+  value: 786432
+train_seq_len:
+  value: 2048
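A minimal sketch of what `feedback_mode: diagonal` with `feedback_rank: 2` and `residual_scale_init: 0.5` plausibly parameterize as a correction applied between recurrence passes; all names here are illustrative, and the real module lives in `feedback.py`:

```python
import torch
import torch.nn as nn


class DiagonalLowRankFeedback(nn.Module):
    """Illustrative sketch: h <- h + s * (d * h + (h V) U^T), a cheap learned
    correction between recurrence passes (diagonal gain plus rank-r term)."""

    def __init__(self, dim: int, rank: int = 2, residual_scale_init: float = 0.5):
        super().__init__()
        self.d = nn.Parameter(torch.zeros(dim))        # per-channel diagonal gain
        self.u = nn.Parameter(torch.zeros(dim, rank))  # low-rank output factor
        self.v = nn.Parameter(torch.randn(dim, rank) * dim**-0.5)
        self.s = nn.Parameter(torch.tensor(residual_scale_init))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        correction = self.d * h + (h @ self.v) @ self.u.T
        return h + self.s * correction
```

Initializing `d` and `u` at zero makes the correction start as a pure residual scaled by `s`, so the module can only help once training moves it away from the identity.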
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110735-wq77le9z/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110735-wq77le9z/files/output.log
new file mode 100644
index 0000000000..7b128f42dc
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110735-wq77le9z/files/output.log
@@ -0,0 +1,32 @@
+wandb:initialized
+warmup_step:1/5
+warmup_step:2/5
+warmup_step:3/5
+warmup_step:4/5
+warmup_step:5/5
+step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['10183.5', '11332.8', '12572.3', '13909.9', '15221.9', '8143.4', '9246.3', '10422.6', '11669.3', '12875.4'] growth=['1.115', '1.113', '1.109', '1.106', '1.094', '1.148', '1.135', '1.127', '1.120', '1.103']
+step:1/50 train_loss:6.9310 train_time:1964ms step_avg:1963.98ms
+step:2/50 train_loss:8.5267 train_time:3904ms step_avg:1952.08ms
+step:3/50 train_loss:7.6283 train_time:5877ms step_avg:1958.95ms
+step:4/50 train_loss:7.3203 train_time:7849ms step_avg:1962.28ms
+step:5/50 train_loss:7.1281 train_time:9821ms step_avg:1964.23ms
+step:6/50 train_loss:7.0823 train_time:11793ms step_avg:1965.52ms
+step:7/50 train_loss:7.0690 train_time:13766ms step_avg:1966.50ms
+step:8/50 train_loss:6.9488 train_time:15738ms step_avg:1967.24ms
+step:9/50 train_loss:6.6003 train_time:17711ms step_avg:1967.83ms
+step:10/50 train_loss:6.2452 train_time:19683ms step_avg:1968.32ms
+step:20/50 train_loss:4.9631 train_time:39401ms step_avg:1970.07ms
+step:25/50 val_loss:4.4339 val_bpb:2.6260 train_time:49298ms step_avg:1971.91ms h_norms=['14892.2', '17012.7', '19653.8', '22861.1', '26720.5', '9181.8', '11585.2', '14367.3', '17660.0', '21364.8'] growth=['1.122', '1.142', '1.155', '1.163', '1.169', '1.294', '1.262', '1.240', '1.229', '1.210']
+step:30/50 train_loss:4.2683 train_time:59129ms step_avg:1970.98ms
+step:40/50 train_loss:3.9478 train_time:78860ms step_avg:1971.49ms
+step:50/50 train_loss:3.7715 train_time:98733ms step_avg:1974.66ms
+step:50/50 val_loss:3.7726 val_bpb:2.2343 train_time:98767ms step_avg:1975.34ms h_norms=['23953.3', '26694.0', '30396.2', '35111.3', '41668.7', '10222.5', '13936.6', '18323.8', '23608.5', '29594.6'] growth=['1.079', '1.114', '1.139', '1.155', '1.187', '1.441', '1.363', '1.315', '1.288', '1.254']
+peak memory allocated: 42782 MiB reserved: 44140 MiB
+ema:applying EMA weights
+DIAGNOSTIC post_ema val_loss:5.9155 val_bpb:3.5035 eval_time:53980ms
+Serialized model: 106023671 bytes
+Code size: 98507 bytes
+Serialized model int6+lzma: 4811364 bytes
+Total submission size int6+lzma: 4909871 bytes
+final_int6_roundtrip val_loss:6.1183 val_bpb:3.6236 eval_time:53671ms
+final_int6_roundtrip_exact val_loss:6.11825506 val_bpb:3.62357527
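The `h_norms` / `growth` arrays in these logs track hidden-state magnitude across the unrolled layers and passes; growth ratios drifting upward over training are the early-warning signal for the norm amplification a shared recurrent core is prone to. A sketch of one way such a diagnostic can be collected (the exact definition of `growth` in the training script may differ; this assumes consecutive-activation ratios):

```python
import torch


@torch.no_grad()
def norm_growth(hiddens: list[torch.Tensor]) -> tuple[list[str], list[str]]:
    """hiddens: per-layer/per-pass activations in forward order, e.g. captured
    with forward hooks. Returns strings shaped like the log lines above."""
    norms = [h.float().norm().item() for h in hiddens]
    growth = [b / a for a, b in zip(norms, norms[1:])]
    return [f"{n:.1f}" for n in norms], [f"{g:.3f}" for g in growth]
```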
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110735-wq77le9z/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110735-wq77le9z/files/wandb-summary.json
new file mode 100644
index 0000000000..e9ae3f06e1
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110735-wq77le9z/files/wandb-summary.json
@@ -0,0 +1 @@
+{"_wandb":{"runtime":396},"_timestamp":1.7745235273271282e+09,"val_bpb":2.234328113232034,"lr_scale":1,"_runtime":396.498539837,"val_loss":3.7725694269914163,"train_loss":3.7714996337890625,"step_avg_ms":1974.656167600042,"_step":50}
\ No newline at end of file
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_111419-rr366tug/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_111419-rr366tug/files/config.yaml
new file mode 100644
index 0000000000..ed1b347833
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_111419-rr366tug/files/config.yaml
@@ -0,0 +1,95 @@
+core_end:
+  value: 8
+core_start:
+  value: 3
+feedback_mode:
+  value: diagonal
+feedback_rank:
+  value: 2
+iterations:
+  value: 50
+jacobian_proxy_weight:
+  value: 0
+matrix_lr:
+  value: 0.025
+model_dim:
+  value: 512
+n_params:
+  value: 26927199
+num_layers:
+  value: 11
+num_passes:
+  value: 3
+residual_scale_init:
+  value: 0.5
+scalar_lr:
+  value: 0.025
+seed:
+  value: 1337
+train_batch_tokens:
+  value: 786432
+train_seq_len:
+  value: 2048
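The `jacobian_proxy_weight` sweep across these runs (0.0, 0.001, 0.01, 0.1) regularizes the shared core's local gain, the quantity whose powers govern how perturbations compound across recurrence passes. A minimal sketch of one such proxy, estimating the directional gain with a single random probe; this is illustrative, not the exact penalty in `train_gpt_recurrent.py`:

```python
import torch


def jacobian_gain_proxy(core, h: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    """Finite-difference estimate of the core's gain along a random direction.
    Penalizing (gain - 1)^2 pushes the shared update toward non-expansive maps,
    so injected (e.g. quantization) noise is not amplified pass over pass."""
    v = torch.randn_like(h)
    v = v / v.norm().clamp(min=1e-12)
    out = core(h)
    out_pert = core(h + eps * v)
    gain = (out_pert - out).norm() / eps
    return (gain - 1.0).pow(2)


# usage sketch: loss = ce_loss + jacobian_proxy_weight * jacobian_gain_proxy(core, h)
```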
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_111419-rr366tug/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_111419-rr366tug/files/output.log
new file mode 100644
index 0000000000..a87a91730e
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_111419-rr366tug/files/output.log
@@ -0,0 +1,32 @@
+wandb:initialized
+warmup_step:1/5
+warmup_step:2/5
+warmup_step:3/5
+warmup_step:4/5
+warmup_step:5/5
+step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms h_norms=['10017.6', '11136.0', '12344.2', '13648.8', '14930.5', '8131.3', '9221.1', '10380.0', '11610.4', '12803.6', '8145.6', '9247.8', '10417.9', '11656.5', '12858.6'] growth=['1.114', '1.112', '1.108', '1.106', '1.094', '1.146', '1.134', '1.126', '1.119', '1.103', '1.148', '1.135', '1.127', '1.119', '1.103']
+step:1/50 train_loss:6.9310 train_time:2550ms step_avg:2549.53ms
+step:2/50 train_loss:8.4366 train_time:5080ms step_avg:2540.20ms
+step:3/50 train_loss:7.5697 train_time:7644ms step_avg:2547.96ms
+step:4/50 train_loss:7.3743 train_time:10207ms step_avg:2551.71ms
+step:5/50 train_loss:7.1786 train_time:12769ms step_avg:2553.85ms
+step:6/50 train_loss:7.0957 train_time:15332ms step_avg:2555.30ms
+step:7/50 train_loss:7.1042 train_time:17895ms step_avg:2556.47ms
+step:8/50 train_loss:6.9979 train_time:20459ms step_avg:2557.34ms
+step:9/50 train_loss:6.6202 train_time:23024ms step_avg:2558.19ms
+step:10/50 train_loss:6.2350 train_time:25588ms step_avg:2558.77ms
+step:20/50 train_loss:4.9204 train_time:51213ms step_avg:2560.66ms
+step:25/50 val_loss:4.4026 val_bpb:2.6074 train_time:64061ms step_avg:2562.45ms h_norms=['15315.5', '18476.1', '22903.1', '28959.2', '36915.4', '10165.2', '14051.6', '19152.7', '25903.6', '34591.3', '10305.9', '14457.4', '19935.1', '27281.8', '36795.6'] growth=['1.162', '1.206', '1.240', '1.264', '1.275', '1.433', '1.382', '1.363', '1.352', '1.335', '1.453', '1.403', '1.379', '1.369', '1.349']
+step:30/50 train_loss:4.2536 train_time:76845ms step_avg:2561.51ms
+step:40/50 train_loss:3.9338 train_time:102476ms step_avg:2561.91ms
+step:50/50 train_loss:3.7603 train_time:128239ms step_avg:2564.77ms
+step:50/50 val_loss:3.7159 val_bpb:2.2007 train_time:128273ms step_avg:2565.45ms h_norms=['25099.4', '31959.1', '42336.1', '57728.8', '79426.7', '13463.9', '21452.0', '32001.9', '46436.8', '66208.5', '13616.6', '21790.7', '32681.5', '47611.1', '68100.6'] growth=['1.181', '1.273', '1.325', '1.364', '1.376', '1.898', '1.593', '1.492', '1.451', '1.426', '1.919', '1.600', '1.500', '1.457', '1.430']
+peak memory allocated: 54994 MiB reserved: 56152 MiB
+ema:applying EMA weights
+DIAGNOSTIC post_ema val_loss:5.9270 val_bpb:3.5103 eval_time:70209ms
+Serialized model: 106023671 bytes
+Code size: 98507 bytes
+Serialized model int6+lzma: 4807120 bytes
+Total submission size int6+lzma: 4905627 bytes
+final_int6_roundtrip val_loss:6.1461 val_bpb:3.6401 eval_time:69816ms
+final_int6_roundtrip_exact val_loss:6.14613800 val_bpb:3.64008912
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_111419-rr366tug/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_111419-rr366tug/files/wandb-summary.json
new file mode 100644
index 0000000000..70891cb091
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_111419-rr366tug/files/wandb-summary.json
@@ -0,0 +1 @@
+{"_step":50,"val_loss":3.715856487409959,"step_avg_ms":2564.772228019865,"lr_scale":1,"_timestamp":1.7745240125574348e+09,"val_bpb":2.2007395159263687,"_wandb":{"runtime":510},"_runtime":510.275896055,"train_loss":3.7602663040161133}
\ No newline at end of file
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_112256-w5b84094/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_112256-w5b84094/files/config.yaml
new file mode 100644
index 0000000000..ae19cb42cf
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_112256-w5b84094/files/config.yaml
@@ -0,0 +1,95 @@
+core_end:
+  value: 8
+core_start:
+  value: 3
+feedback_mode:
+  value: diagonal
+feedback_rank:
+  value: 2
+iterations:
+  value: 50
+jacobian_proxy_weight:
+  value: 0.001
+matrix_lr:
+  value: 0.025
+model_dim:
+  value: 512
+n_params:
+  value: 26927199
+num_layers:
+  value: 11
+num_passes:
+  value: 3
+residual_scale_init:
+  value: 0.5
+scalar_lr:
+  value: 0.025
+seed:
+  value: 1337
+train_batch_tokens:
+  value: 786432
+train_seq_len:
+  value: 2048
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_112256-w5b84094/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_112256-w5b84094/files/output.log
new file mode 100644
index 0000000000..4939084d7f
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_112256-w5b84094/files/output.log
@@ -0,0 +1,32 @@
+wandb:initialized
+warmup_step:1/5
+warmup_step:2/5
+warmup_step:3/5
+warmup_step:4/5
+warmup_step:5/5
+step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms h_norms=['10017.3', '11135.8', '12344.1', '13648.4', '14930.2', '8131.2', '9221.3', '10379.9', '11610.4', '12803.6', '8145.3', '9247.4', '10417.4', '11656.1', '12858.2'] growth=['1.114', '1.112', '1.109', '1.106', '1.094', '1.146', '1.134', '1.126', '1.119', '1.103', '1.148', '1.135', '1.127', '1.119', '1.103']
+step:1/50 train_loss:6.9310 train_time:2562ms step_avg:2562.16ms
+step:2/50 train_loss:8.4366 train_time:5105ms step_avg:2552.70ms
+step:3/50 train_loss:7.5697 train_time:7680ms step_avg:2560.17ms
+step:4/50 train_loss:7.3744 train_time:10255ms step_avg:2563.83ms
+step:5/50 train_loss:7.1786 train_time:12830ms step_avg:2566.10ms
+step:6/50 train_loss:7.0956 train_time:15406ms step_avg:2567.62ms
+step:7/50 train_loss:7.1032 train_time:17983ms step_avg:2569.03ms
+step:8/50 train_loss:6.9995 train_time:20559ms step_avg:2569.91ms
+step:9/50 train_loss:6.6208 train_time:23136ms step_avg:2570.65ms
+step:10/50 train_loss:6.2351 train_time:25713ms step_avg:2571.25ms
+step:20/50 train_loss:4.9225 train_time:51465ms step_avg:2573.26ms
+step:25/50 val_loss:4.4045 val_bpb:2.6086 train_time:64376ms step_avg:2575.05ms h_norms=['15389.1', '18568.1', '22944.3', '28900.3', '36680.3', '10100.2', '13942.7', '18871.1', '25382.8', '33689.8', '10227.6', '14320.0', '19633.2', '26714.6', '35810.5'] growth=['1.162', '1.207', '1.236', '1.260', '1.269', '1.424', '1.380', '1.353', '1.345', '1.327', '1.442', '1.400', '1.371', '1.361', '1.340']
+step:30/50 train_loss:4.2545 train_time:77216ms step_avg:2573.88ms
+step:40/50 train_loss:3.9681 train_time:102972ms step_avg:2574.30ms
+step:50/50 train_loss:3.7646 train_time:128885ms step_avg:2577.70ms
+step:50/50 val_loss:3.7386 val_bpb:2.2142 train_time:128919ms step_avg:2578.39ms h_norms=['25102.1', '31613.3', '41339.7', '55618.3', '75871.6', '12896.3', '20216.5', '29704.2', '42853.9', '60665.7', '13066.2', '20603.3', '30423.4', '44103.2', '62670.9'] growth=['1.172', '1.259', '1.308', '1.345', '1.364', '1.818', '1.568', '1.469', '1.443', '1.416', '1.842', '1.577', '1.477', '1.450', '1.421']
+peak memory allocated: 55186 MiB reserved: 56536 MiB
+ema:applying EMA weights
+DIAGNOSTIC post_ema val_loss:5.9267 val_bpb:3.5101 eval_time:70241ms
+Serialized model: 106023671 bytes
+Code size: 98507 bytes
+Serialized model int6+lzma: 4808880 bytes
+Total submission size int6+lzma: 4907387 bytes
+final_int6_roundtrip val_loss:6.1461 val_bpb:3.6400 eval_time:69823ms
+final_int6_roundtrip_exact val_loss:6.14605869 val_bpb:3.64004215
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_112256-w5b84094/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_112256-w5b84094/files/wandb-summary.json
new file mode 100644
index 0000000000..5149be8370
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_112256-w5b84094/files/wandb-summary.json
@@ -0,0 +1 @@
+{"_wandb":{"runtime":511},"train_loss":3.764610767364502,"_step":50,"_runtime":511.283981529,"_timestamp":1.774524530800683e+09,"step_avg_ms":2577.7040203400247,"lr_scale":1,"val_bpb":2.2142178775199093,"val_loss":3.7386141363750314}
\ No newline at end of file
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_113135-l86ibk0l/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_113135-l86ibk0l/files/config.yaml
new file mode 100644
index 0000000000..1796edd229
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_113135-l86ibk0l/files/config.yaml
@@ -0,0 +1,95 @@
+core_end:
+  value: 8
+core_start:
+  value: 3
+feedback_mode:
+  value: diagonal
+feedback_rank:
+  value: 2
+iterations:
+  value: 50
+jacobian_proxy_weight:
+  value: 0.01
+matrix_lr:
+  value: 0.025
+model_dim:
+  value: 512
+n_params:
+  value: 26927199
+num_layers:
+  value: 11
+num_passes:
+  value: 3
+residual_scale_init:
+  value: 0.5
+scalar_lr:
+  value: 0.025
+seed:
+  value: 1337
+train_batch_tokens:
+  value: 786432
+train_seq_len:
+  value: 2048
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_113135-l86ibk0l/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_113135-l86ibk0l/files/output.log
new file mode 100644
index 0000000000..68160f35fb
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_113135-l86ibk0l/files/output.log
@@ -0,0 +1,32 @@
+wandb:initialized
+warmup_step:1/5
+warmup_step:2/5
+warmup_step:3/5
+warmup_step:4/5
+warmup_step:5/5
+step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms h_norms=['10017.8', '11136.4', '12344.8', '13649.3', '14931.1', '8131.1', '9221.2', '10380.0', '11610.7', '12803.9', '8145.5', '9247.7', '10417.7', '11656.5', '12858.6'] growth=['1.114', '1.112', '1.109', '1.106', '1.094', '1.146', '1.134', '1.126', '1.119', '1.103', '1.148', '1.135', '1.127', '1.119', '1.103']
+step:1/50 train_loss:6.9310 train_time:2558ms step_avg:2558.48ms
+step:2/50 train_loss:8.4366 train_time:5098ms step_avg:2548.90ms
+step:3/50 train_loss:7.5697 train_time:7670ms step_avg:2556.78ms
+step:4/50 train_loss:7.3743 train_time:10241ms step_avg:2560.27ms
+step:5/50 train_loss:7.1786 train_time:12811ms step_avg:2562.30ms
+step:6/50 train_loss:7.0957 train_time:15383ms step_avg:2563.78ms
+step:7/50 train_loss:7.1044 train_time:17953ms step_avg:2564.76ms
+step:8/50 train_loss:6.9980 train_time:20525ms step_avg:2565.64ms
+step:9/50 train_loss:6.6201 train_time:23097ms step_avg:2566.38ms
+step:10/50 train_loss:6.2350 train_time:25670ms step_avg:2567.05ms
+step:20/50 train_loss:4.9193 train_time:51381ms step_avg:2569.05ms
+step:25/50 val_loss:4.4049 val_bpb:2.6088 train_time:64272ms step_avg:2570.88ms h_norms=['15720.0', '18809.2', '23004.6', '28594.8', '35769.8', '9791.7', '13137.0', '17413.1', '22947.7', '29845.4', '9846.4', '13323.4', '17863.5', '23724.8', '31147.9'] growth=['1.160', '1.197', '1.223', '1.243', '1.251', '1.380', '1.342', '1.325', '1.318', '1.301', '1.388', '1.353', '1.341', '1.328', '1.313']
+step:30/50 train_loss:4.2516 train_time:77099ms step_avg:2569.96ms
+step:40/50 train_loss:3.9318 train_time:102821ms step_avg:2570.53ms
+step:50/50 train_loss:3.7709 train_time:128698ms step_avg:2573.96ms
+step:50/50 val_loss:3.7547 val_bpb:2.2238 train_time:128732ms step_avg:2574.64ms h_norms=['26472.4', '32406.0', '41016.4', '53232.8', '69910.2', '11501.1', '17001.4', '24006.2', '33299.6', '45175.2', '11702.2', '17433.6', '24741.2', '34474.7', '46973.8'] growth=['1.156', '1.224', '1.266', '1.298', '1.313', '1.621', '1.478', '1.412', '1.387', '1.357', '1.649', '1.490', '1.419', '1.393', '1.363']
+peak memory allocated: 55186 MiB reserved: 56536 MiB
+ema:applying EMA weights
+DIAGNOSTIC post_ema val_loss:5.9290 val_bpb:3.5115 eval_time:70088ms
+Serialized model: 106023671 bytes
+Code size: 98507 bytes
+Serialized model int6+lzma: 4808852 bytes
+Total submission size int6+lzma: 4907359 bytes
+final_int6_roundtrip val_loss:6.1489 val_bpb:3.6417 eval_time:69675ms
+final_int6_roundtrip_exact val_loss:6.14887885 val_bpb:3.64171241
"memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "guxaav71qrqdqyi004ekla2tm6jqy7hq" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_113135-l86ibk0l/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_113135-l86ibk0l/files/wandb-summary.json new file mode 100644 index 0000000000..91335c6f1a --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_113135-l86ibk0l/files/wandb-summary.json @@ -0,0 +1 @@ +{"_runtime":509.0728732,"train_loss":3.7708992958068848,"val_bpb":2.2237740575643943,"_step":50,"_wandb":{"runtime":509},"val_loss":3.754749346088031,"_timestamp":1.7745250486946363e+09,"lr_scale":1,"step_avg_ms":2573.96071168012} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114011-n43r2rb3/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114011-n43r2rb3/files/config.yaml new file mode 100644 index 0000000000..bc27b84971 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114011-n43r2rb3/files/config.yaml @@ -0,0 +1,95 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + 3quhs4ayq7ifvml7ew8yo7mtbqymx7fb: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.1" + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "39170318336" + email: nesta.midavaine@prosus.com + executable: /home/nesta/parameter-golf/.venv/bin/python3 + git: + commit: b38248cf6d4a1387d06b2906628c717e59747b11 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + python: CPython 3.12.3 + root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback + startedAt: "2026-03-26T11:40:11.906209Z" + writerId: 3quhs4ayq7ifvml7ew8yo7mtbqymx7fb + m: [] + python_version: 3.12.3 + t: + "1": + - 1 + "2": + - 1 + "3": + - 2 + - 13 + - 16 + - 61 + "4": 3.12.3 + "5": 0.25.1 + "10": + - 20 + "12": 0.25.1 + "13": linux-x86_64 +core_end: + value: 8 +core_start: + value: 3 +feedback_mode: + value: diagonal +feedback_rank: + value: 2 +iterations: + value: 50 +jacobian_proxy_weight: + value: 0.1 +matrix_lr: + value: 0.025 +model_dim: + value: 512 +n_params: + value: 26927199 +num_layers: + value: 11 +num_passes: + value: 3 +residual_scale_init: + value: 0.5 +scalar_lr: + value: 0.025 +seed: + value: 1337 +train_batch_tokens: + value: 786432 +train_seq_len: + value: 2048 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114011-n43r2rb3/files/output.log 
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114011-n43r2rb3/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114011-n43r2rb3/files/output.log
new file mode 100644
index 0000000000..3f49ae6f56
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114011-n43r2rb3/files/output.log
@@ -0,0 +1,32 @@
+wandb:initialized
+warmup_step:1/5
+warmup_step:2/5
+warmup_step:3/5
+warmup_step:4/5
+warmup_step:5/5
+step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['10016.7', '11135.2', '12343.3', '13647.7', '14929.4', '8130.4', '9220.7', '10379.4', '11609.9', '12803.1', '8144.8', '9247.1', '10417.0', '11655.6', '12857.6'] growth=['1.114', '1.112', '1.108', '1.106', '1.094', '1.146', '1.134', '1.126', '1.119', '1.103', '1.148', '1.135', '1.127', '1.119', '1.103']
+step:1/50 train_loss:6.9310 train_time:2558ms step_avg:2558.36ms
+step:2/50 train_loss:8.4366 train_time:5098ms step_avg:2548.82ms
+step:3/50 train_loss:7.5697 train_time:7669ms step_avg:2556.49ms
+step:4/50 train_loss:7.3744 train_time:10241ms step_avg:2560.13ms
+step:5/50 train_loss:7.1785 train_time:12812ms step_avg:2562.33ms
+step:6/50 train_loss:7.0958 train_time:15382ms step_avg:2563.72ms
+step:7/50 train_loss:7.1043 train_time:17954ms step_avg:2564.81ms
+step:8/50 train_loss:6.9978 train_time:20526ms step_avg:2565.71ms
+step:9/50 train_loss:6.6203 train_time:23098ms step_avg:2566.44ms
+step:10/50 train_loss:6.2351 train_time:25670ms step_avg:2566.98ms
+step:20/50 train_loss:4.9173 train_time:51378ms step_avg:2568.92ms
+step:25/50 val_loss:4.4081 val_bpb:2.6107 train_time:64271ms step_avg:2570.86ms h_norms=['15784.7', '18418.3', '21808.6', '26014.4', '31082.9', '9350.4', '11977.4', '15124.3', '18884.1', '23160.1', '9355.0', '12040.7', '15261.1', '19113.8', '23523.2'] growth=['1.142', '1.167', '1.184', '1.193', '1.195', '1.318', '1.281', '1.263', '1.249', '1.226', '1.319', '1.287', '1.267', '1.252', '1.231']
+step:30/50 train_loss:4.2496 train_time:77100ms step_avg:2569.99ms
+step:40/50 train_loss:3.9324 train_time:102820ms step_avg:2570.49ms
+step:50/50 train_loss:3.7771 train_time:128694ms step_avg:2573.88ms
+step:50/50 val_loss:3.7294 val_bpb:2.2088 train_time:128728ms step_avg:2574.56ms h_norms=['25140.9', '28962.0', '34040.5', '40736.8', '49246.7', '9966.3', '13653.2', '17956.7', '23197.5', '28845.5', '9949.3', '13730.0', '18206.3', '23701.7', '29420.4'] growth=['1.103', '1.152', '1.175', '1.197', '1.209', '1.405', '1.370', '1.315', '1.292', '1.243', '1.402', '1.380', '1.326', '1.302', '1.241']
+peak memory allocated: 55186 MiB reserved: 56536 MiB
+ema:applying EMA weights
+DIAGNOSTIC post_ema val_loss:5.9285 val_bpb:3.5112 eval_time:70073ms
+Serialized model: 106023671 bytes
+Code size: 98507 bytes
+Serialized model int6+lzma: 4808068 bytes
+Total submission size int6+lzma: 4906575 bytes
+final_int6_roundtrip val_loss:6.1465 val_bpb:3.6403 eval_time:69664ms
+final_int6_roundtrip_exact val_loss:6.14650992 val_bpb:3.64030939
+ "/": { + "total": "1330227675136", + "used": "39170318336" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "3quhs4ayq7ifvml7ew8yo7mtbqymx7fb" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114011-n43r2rb3/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114011-n43r2rb3/files/wandb-summary.json new file mode 100644 index 0000000000..3caf0d7ad9 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114011-n43r2rb3/files/wandb-summary.json @@ -0,0 +1 @@ +{"train_loss":3.7771451473236084,"val_bpb":2.2087608504636838,"_timestamp":1.7745255652383268e+09,"_step":50,"_wandb":{"runtime":510},"_runtime":510.138606162,"val_loss":3.7294001747761674,"lr_scale":1,"step_avg_ms":2573.8786139001604} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114848-v63k38ck/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114848-v63k38ck/files/output.log new file mode 100644 index 0000000000..05fcf59779 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114848-v63k38ck/files/output.log @@ -0,0 +1,22 @@ +wandb:initialized +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['9829.5', '10906.5', '12074.5', '13339.5', '14586.6', '8110.7', '9182.3', '10321.5', '11532.3', '12712.6', '8125.6', '9210.3', '10362.9', '11582.1', '12773.0', '8128.7', '9216.6', '10372.5', '11592.1', '12787.0'] growth=['1.111', '1.110', '1.107', '1.105', '1.093', '1.143', '1.132', '1.124', '1.117', '1.102', '1.145', '1.133', '1.125', '1.118', '1.103', '1.146', '1.134', '1.125', '1.118', '1.103'] +step:1/50 train_loss:6.9310 train_time:3144ms step_avg:3143.65ms +step:2/50 train_loss:8.3561 train_time:6272ms step_avg:3135.84ms +step:3/50 train_loss:7.5194 train_time:9433ms step_avg:3144.22ms +step:4/50 train_loss:7.4222 train_time:12595ms step_avg:3148.75ms +step:5/50 train_loss:7.2305 train_time:15755ms step_avg:3151.08ms +step:6/50 train_loss:7.1013 train_time:18919ms step_avg:3153.09ms +step:7/50 train_loss:7.0637 train_time:22082ms step_avg:3154.61ms +step:8/50 train_loss:6.9894 train_time:25245ms step_avg:3155.59ms +step:9/50 train_loss:6.6143 train_time:28406ms step_avg:3156.21ms +step:10/50 train_loss:6.2371 train_time:31569ms step_avg:3156.90ms +step:20/50 train_loss:4.8567 train_time:63180ms step_avg:3158.98ms +step:25/50 val_loss:4.3859 val_bpb:2.5976 train_time:89760ms step_avg:3590.41ms h_norms=['15124.4', '18985.0', '24623.8', '32563.5', '43308.4', '10881.2', '15824.5', '22366.4', '31313.8', '43085.3', '10968.9', '16075.0', '22846.2', '32046.9', '44227.1', '10951.1', '16019.2', '22824.3', '32056.4', '44286.6'] growth=['1.193', '1.255', '1.297', '1.322', '1.330', '1.534', '1.454', '1.413', '1.400', '1.376', '1.546', '1.466', '1.421', '1.403', '1.380', '1.544', '1.463', '1.425', '1.404', '1.382'] +step:30/50 train_loss:4.2191 train_time:125306ms step_avg:4176.87ms +step:40/50 train_loss:3.9125 train_time:196653ms step_avg:4916.32ms +step:50/50 
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114848-v63k38ck/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114848-v63k38ck/files/wandb-metadata.json
new file mode 100644
index 0000000000..91d137ac18
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114848-v63k38ck/files/wandb-metadata.json
@@ -0,0 +1,50 @@
+{
+  "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39",
+  "python": "CPython 3.12.3",
+  "startedAt": "2026-03-26T11:48:48.952058Z",
+  "args": [
+    "--feedback-mode",
+    "diagonal",
+    "--feedback-rank",
+    "2",
+    "--residual-scale-init",
+    "0.5",
+    "--jacobian-proxy-weight",
+    "0.0"
+  ],
+  "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py",
+  "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py",
+  "codePathLocal": "train_gpt_recurrent.py",
+  "git": {
+    "remote": "https://github.com/nestamidavaine/parameter-golf.git",
+    "commit": "b38248cf6d4a1387d06b2906628c717e59747b11"
+  },
+  "email": "nesta.midavaine@prosus.com",
+  "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback",
+  "host": "computeinstance-e00c09e8zde17qbk32",
+  "executable": "/home/nesta/parameter-golf/.venv/bin/python3",
+  "cpu_count": 8,
+  "cpu_count_logical": 16,
+  "gpu": "NVIDIA H200",
+  "gpu_count": 1,
+  "disk": {
+    "/": {
+      "total": "1330227675136",
+      "used": "39170912256"
+    }
+  },
+  "memory": {
+    "total": "211069919232"
+  },
+  "gpu_nvidia": [
+    {
+      "name": "NVIDIA H200",
+      "memoryTotal": "150754820096",
+      "cudaCores": 16896,
+      "architecture": "Hopper",
+      "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846"
+    }
+  ],
+  "cudaVersion": "13.0",
+  "writerId": "fi5csp7779etgc5vltp4adqcjwicj737"
+}
\ No newline at end of file
"records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "b38248cf6d4a1387d06b2906628c717e59747b11" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "39170912256" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "fi5csp7779etgc5vltp4adqcjwicj737" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115141-bq7ercgn/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115141-bq7ercgn/files/output.log new file mode 100644 index 0000000000..8d2e7e8ff8 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115141-bq7ercgn/files/output.log @@ -0,0 +1,18 @@ +wandb:initialized +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['10161.7', '11313.4', '12557.3', '13898.0', '15219.9', '16827.9', '18528.4', '20318.6', '22230.4', '24153.9', '22046.5', '24114.3', '26314.0', '28667.9', '31058.9'] growth=['1.116', '1.113', '1.110', '1.107', '1.095', '1.106', '1.101', '1.097', '1.094', '1.087', '1.098', '1.094', '1.091', '1.089', '1.083'] +step:1/50 train_loss:6.9310 train_time:6304ms step_avg:6303.72ms +step:2/50 train_loss:8.4504 train_time:12557ms step_avg:6278.34ms +step:3/50 train_loss:7.5634 train_time:18879ms step_avg:6293.09ms +step:4/50 train_loss:7.3652 train_time:25104ms step_avg:6276.05ms +step:5/50 train_loss:7.1863 train_time:30783ms step_avg:6156.62ms +step:6/50 train_loss:7.1201 train_time:36471ms step_avg:6078.55ms +step:7/50 train_loss:7.1222 train_time:42201ms step_avg:6028.70ms +step:8/50 train_loss:7.0087 train_time:47839ms step_avg:5979.83ms +step:9/50 train_loss:6.6202 train_time:53571ms step_avg:5952.33ms +step:10/50 train_loss:6.2665 train_time:59252ms step_avg:5925.21ms +step:20/50 train_loss:5.1492 train_time:116187ms step_avg:5809.33ms diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115141-bq7ercgn/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115141-bq7ercgn/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115141-bq7ercgn/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 
+pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115141-bq7ercgn/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115141-bq7ercgn/files/wandb-metadata.json new file mode 100644 index 0000000000..9e2af819aa --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115141-bq7ercgn/files/wandb-metadata.json @@ -0,0 +1,51 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T11:51:41.177220Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.0", + "--no-interpass-rmsnorm" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "b38248cf6d4a1387d06b2906628c717e59747b11" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "39171293184" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": 
"Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "0z1atj36j5zp0and9x300x1b0fk2hrr4" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115741-648c04nz/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115741-648c04nz/files/config.yaml new file mode 100644 index 0000000000..9c7ae81f09 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115741-648c04nz/files/config.yaml @@ -0,0 +1,96 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + 7pytqgh34iofqn6y02pxdb7c4hwkq1nt: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.0" + - --no-interpass-rmsnorm + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "39171923968" + email: nesta.midavaine@prosus.com + executable: /home/nesta/parameter-golf/.venv/bin/python3 + git: + commit: b38248cf6d4a1387d06b2906628c717e59747b11 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + python: CPython 3.12.3 + root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback + startedAt: "2026-03-26T11:57:41.701879Z" + writerId: 7pytqgh34iofqn6y02pxdb7c4hwkq1nt + m: [] + python_version: 3.12.3 + t: + "1": + - 1 + "2": + - 1 + "3": + - 13 + - 16 + "4": 3.12.3 + "5": 0.25.1 + "10": + - 20 + "12": 0.25.1 + "13": linux-x86_64 +core_end: + value: 8 +core_start: + value: 3 +feedback_mode: + value: diagonal +feedback_rank: + value: 2 +interpass_rmsnorm: + value: false +iterations: + value: 50 +jacobian_proxy_weight: + value: 0 +matrix_lr: + value: 0.025 +model_dim: + value: 512 +n_params: + value: 26927199 +num_layers: + value: 11 +num_passes: + value: 3 +residual_scale_init: + value: 0.5 +scalar_lr: + value: 0.025 +seed: + value: 1337 +train_batch_tokens: + value: 786432 +train_seq_len: + value: 2048 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115741-648c04nz/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115741-648c04nz/files/output.log new file mode 100644 index 0000000000..274a9a3c1c --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115741-648c04nz/files/output.log @@ -0,0 +1,50 @@ +wandb:initialized +Traceback (most recent call last): + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2081, in + main() + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1757, in main + warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer) + 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__ + return super().__call__(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1026, in forward + x, h_core_in, h_core_out = self._forward_hidden(input_ids, feedback_fn, stabilizer) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1002, in _forward_hidden + x, raw_v = self.blocks[j](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 781, in forward + x_out = x_out + self.mlp_scale.to(dtype=x_out.dtype)[None, None, :] * self.mlp(self.mlp_norm(x_out) * self.ln_scale_factor, up_w, down_w) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 742, in forward + x = F.leaky_relu(F.linear(x, up_w.to(x.dtype)), negative_slope=0.5) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 217.94 MiB is free. Process 241184 has 67.76 GiB memory in use. Process 242252 has 55.38 GiB memory in use. Including non-PyTorch memory, this process has 16.43 GiB memory in use. 
Of the allocated memory 15.67 GiB is allocated by PyTorch, and 99.72 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf) diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115741-648c04nz/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115741-648c04nz/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115741-648c04nz/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115741-648c04nz/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115741-648c04nz/files/wandb-metadata.json new file mode 100644 index 0000000000..150c8cee92 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115741-648c04nz/files/wandb-metadata.json @@ -0,0 +1,51 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": 
"2026-03-26T11:57:41.701879Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.0", + "--no-interpass-rmsnorm" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "b38248cf6d4a1387d06b2906628c717e59747b11" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "39171923968" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "7pytqgh34iofqn6y02pxdb7c4hwkq1nt" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115741-648c04nz/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115741-648c04nz/files/wandb-summary.json new file mode 100644 index 0000000000..b0a620d0c1 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115741-648c04nz/files/wandb-summary.json @@ -0,0 +1 @@ +{"_wandb":{"runtime":0},"_runtime":0} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115749-qjzpp5d7/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115749-qjzpp5d7/files/config.yaml new file mode 100644 index 0000000000..9097efd5a5 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115749-qjzpp5d7/files/config.yaml @@ -0,0 +1,96 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + npfid9038p76794z8vba45it7q4zecmx: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.1" + - --no-interpass-rmsnorm + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "39172128768" + email: nesta.midavaine@prosus.com + executable: /home/nesta/parameter-golf/.venv/bin/python3 + git: + commit: b38248cf6d4a1387d06b2906628c717e59747b11 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + python: 
CPython 3.12.3
+ root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback
+ startedAt: "2026-03-26T11:57:49.733244Z"
+ writerId: npfid9038p76794z8vba45it7q4zecmx
+ m: []
+ python_version: 3.12.3
+ t:
+ "1":
+ - 1
+ "2":
+ - 1
+ "3":
+ - 13
+ - 16
+ "4": 3.12.3
+ "5": 0.25.1
+ "10":
+ - 20
+ "12": 0.25.1
+ "13": linux-x86_64
+core_end:
+ value: 8
+core_start:
+ value: 3
+feedback_mode:
+ value: diagonal
+feedback_rank:
+ value: 2
+interpass_rmsnorm:
+ value: false
+iterations:
+ value: 50
+jacobian_proxy_weight:
+ value: 0.1
+matrix_lr:
+ value: 0.025
+model_dim:
+ value: 512
+n_params:
+ value: 26927199
+num_layers:
+ value: 11
+num_passes:
+ value: 3
+residual_scale_init:
+ value: 0.5
+scalar_lr:
+ value: 0.025
+seed:
+ value: 1337
+train_batch_tokens:
+ value: 786432
+train_seq_len:
+ value: 2048
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115749-qjzpp5d7/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115749-qjzpp5d7/files/output.log
new file mode 100644
index 0000000000..274a9a3c1c
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115749-qjzpp5d7/files/output.log
@@ -0,0 +1,50 @@
+wandb:initialized
+Traceback (most recent call last):
+ File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2081, in <module>
+ main()
+ File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1757, in main
+ warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__
+ return super().__call__(*args, **kwargs)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
+ return self._call_impl(*args, **kwargs)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
+ return forward_call(*args, **kwargs)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper
+ return fn(*args, **kwargs)
+ ^^^^^^^^^^^^^^^^^^^
+ File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
+ return self._call_impl(*args, **kwargs)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
+ return forward_call(*args, **kwargs)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1026, in forward
+ x, h_core_in, h_core_out = self._forward_hidden(input_ids, feedback_fn, stabilizer)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1002, in _forward_hidden
+ x, raw_v = self.blocks[j](x, x0, q_w, k_w, v_w, out_w, up_w, down_w,
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File 
"/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 781, in forward + x_out = x_out + self.mlp_scale.to(dtype=x_out.dtype)[None, None, :] * self.mlp(self.mlp_norm(x_out) * self.ln_scale_factor, up_w, down_w) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 742, in forward + x = F.leaky_relu(F.linear(x, up_w.to(x.dtype)), negative_slope=0.5) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 217.94 MiB is free. Process 241184 has 67.76 GiB memory in use. Process 242252 has 55.38 GiB memory in use. Including non-PyTorch memory, this process has 16.43 GiB memory in use. Of the allocated memory 15.67 GiB is allocated by PyTorch, and 99.72 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf) diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115749-qjzpp5d7/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115749-qjzpp5d7/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115749-qjzpp5d7/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115749-qjzpp5d7/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115749-qjzpp5d7/files/wandb-metadata.json new file mode 100644 index 0000000000..a020307416 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115749-qjzpp5d7/files/wandb-metadata.json @@ -0,0 +1,51 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T11:57:49.733244Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm" + ], + "program": 
"/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "b38248cf6d4a1387d06b2906628c717e59747b11" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "39172128768" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "npfid9038p76794z8vba45it7q4zecmx" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115749-qjzpp5d7/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115749-qjzpp5d7/files/wandb-summary.json new file mode 100644 index 0000000000..1d476fc886 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115749-qjzpp5d7/files/wandb-summary.json @@ -0,0 +1 @@ +{"_runtime":0,"_wandb":{"runtime":0}} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115924-xtlv4t52/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115924-xtlv4t52/files/config.yaml new file mode 100644 index 0000000000..4e1842592e --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115924-xtlv4t52/files/config.yaml @@ -0,0 +1,98 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + a5k6glqja7g5x8087ep1dybi7qd02owy: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.0" + - --no-interpass-rmsnorm + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "39172550656" + email: nesta.midavaine@prosus.com + executable: /home/nesta/parameter-golf/.venv/bin/python3 + git: + commit: b38248cf6d4a1387d06b2906628c717e59747b11 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + python: CPython 3.12.3 + root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback + startedAt: "2026-03-26T11:59:24.875015Z" + writerId: a5k6glqja7g5x8087ep1dybi7qd02owy + m: [] + python_version: 
3.12.3 + t: + "1": + - 1 + "2": + - 1 + "3": + - 2 + - 13 + - 16 + - 61 + "4": 3.12.3 + "5": 0.25.1 + "10": + - 20 + "12": 0.25.1 + "13": linux-x86_64 +core_end: + value: 8 +core_start: + value: 3 +feedback_mode: + value: diagonal +feedback_rank: + value: 2 +interpass_rmsnorm: + value: false +iterations: + value: 50 +jacobian_proxy_weight: + value: 0 +matrix_lr: + value: 0.025 +model_dim: + value: 512 +n_params: + value: 26927199 +num_layers: + value: 11 +num_passes: + value: 3 +residual_scale_init: + value: 0.5 +scalar_lr: + value: 0.025 +seed: + value: 1337 +train_batch_tokens: + value: 786432 +train_seq_len: + value: 2048 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115924-xtlv4t52/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115924-xtlv4t52/files/output.log new file mode 100644 index 0000000000..5e93f4daea --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115924-xtlv4t52/files/output.log @@ -0,0 +1,32 @@ +wandb:initialized +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms h_norms=['10160.2', '11311.6', '12555.0', '13896.3', '15332.5', '16950.8', '18657.9', '20459.0', '22380.8', '24439.6', '22273.3', '24354.9', '26573.0', '28946.4', '31482.6'] growth=['1.116', '1.113', '1.110', '1.107', '1.103', '1.106', '1.101', '1.097', '1.094', '1.092', '1.098', '1.093', '1.091', '1.089', '1.088'] +step:1/50 train_loss:6.9310 train_time:2457ms step_avg:2456.70ms +step:2/50 train_loss:8.4480 train_time:4895ms step_avg:2447.55ms +step:3/50 train_loss:7.5656 train_time:7366ms step_avg:2455.22ms +step:4/50 train_loss:7.3715 train_time:9835ms step_avg:2458.84ms +step:5/50 train_loss:7.1882 train_time:12305ms step_avg:2460.94ms +step:6/50 train_loss:7.1200 train_time:14774ms step_avg:2462.35ms +step:7/50 train_loss:7.1275 train_time:17244ms step_avg:2463.46ms +step:8/50 train_loss:7.0234 train_time:19715ms step_avg:2464.42ms +step:9/50 train_loss:6.6287 train_time:22185ms step_avg:2465.05ms +step:10/50 train_loss:6.2775 train_time:24656ms step_avg:2465.57ms +step:20/50 train_loss:5.2073 train_time:49354ms step_avg:2467.72ms +step:25/50 val_loss:4.6001 val_bpb:2.7244 train_time:61743ms step_avg:2469.70ms h_norms=['18012.8', '20576.1', '23734.4', '27658.0', '32476.6', '39019.5', '47115.3', '57297.5', '70090.2', '85988.1', '66911.8', '82425.7', '101585.5', '125680.3', '156003.0'] growth=['1.133', '1.142', '1.153', '1.165', '1.174', '1.201', '1.207', '1.216', '1.223', '1.227', '1.228', '1.232', '1.232', '1.237', '1.241'] +step:30/50 train_loss:4.3938 train_time:74062ms step_avg:2468.73ms +step:40/50 train_loss:4.0561 train_time:98772ms step_avg:2469.31ms +step:50/50 train_loss:3.8233 train_time:123613ms step_avg:2472.25ms +step:50/50 val_loss:3.7814 val_bpb:2.2396 train_time:123647ms step_avg:2472.93ms h_norms=['31577.0', '34240.9', '37755.1', '42362.9', '48395.2', '56432.3', '66325.1', '79064.0', '95012.0', '114485.4', '84419.3', '101309.9', '122596.9', '148869.6', '181094.1'] growth=['1.068', '1.084', '1.103', '1.122', '1.142', '1.166', '1.175', '1.192', '1.202', '1.205', '1.193', '1.200', '1.210', '1.214', '1.216'] +peak memory allocated: 54207 MiB reserved: 55384 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:5.9416 val_bpb:3.5190 eval_time:67959ms +Serialized model: 106023671 bytes +Code size: 98931 bytes +Serialized model int6+lzma: 4809652 
bytes +Total submission size int6+lzma: 4908583 bytes +final_int6_roundtrip val_loss:6.1789 val_bpb:3.6595 eval_time:67564ms +final_int6_roundtrip_exact val_loss:6.17886280 val_bpb:3.65947059 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115924-xtlv4t52/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115924-xtlv4t52/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115924-xtlv4t52/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115924-xtlv4t52/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115924-xtlv4t52/files/wandb-metadata.json new file mode 100644 index 0000000000..813cd140ce --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115924-xtlv4t52/files/wandb-metadata.json @@ -0,0 +1,51 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T11:59:24.875015Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.0", + 
"--no-interpass-rmsnorm" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "b38248cf6d4a1387d06b2906628c717e59747b11" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "39172550656" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "a5k6glqja7g5x8087ep1dybi7qd02owy" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115924-xtlv4t52/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115924-xtlv4t52/files/wandb-summary.json new file mode 100644 index 0000000000..11fab24bc9 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115924-xtlv4t52/files/wandb-summary.json @@ -0,0 +1 @@ +{"val_loss":3.781418496150666,"val_bpb":2.239569030432122,"_timestamp":1.7745267061100767e+09,"lr_scale":1,"step_avg_ms":2472.2500372401555,"train_loss":3.8232855796813965,"_wandb":{"runtime":493},"_runtime":493.602379103,"_step":50} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_120745-6rfmco93/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_120745-6rfmco93/files/config.yaml new file mode 100644 index 0000000000..65b18a5321 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_120745-6rfmco93/files/config.yaml @@ -0,0 +1,98 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + qkgd2d3r5fmqvm20ng3nbm8vgtccoso2: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.1" + - --no-interpass-rmsnorm + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "39173230592" + email: nesta.midavaine@prosus.com + executable: /home/nesta/parameter-golf/.venv/bin/python3 + git: + commit: b38248cf6d4a1387d06b2906628c717e59747b11 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: 
/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + python: CPython 3.12.3 + root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback + startedAt: "2026-03-26T12:07:45.519065Z" + writerId: qkgd2d3r5fmqvm20ng3nbm8vgtccoso2 + m: [] + python_version: 3.12.3 + t: + "1": + - 1 + "2": + - 1 + "3": + - 2 + - 13 + - 16 + - 61 + "4": 3.12.3 + "5": 0.25.1 + "10": + - 20 + "12": 0.25.1 + "13": linux-x86_64 +core_end: + value: 8 +core_start: + value: 3 +feedback_mode: + value: diagonal +feedback_rank: + value: 2 +interpass_rmsnorm: + value: false +iterations: + value: 50 +jacobian_proxy_weight: + value: 0.1 +matrix_lr: + value: 0.025 +model_dim: + value: 512 +n_params: + value: 26927199 +num_layers: + value: 11 +num_passes: + value: 3 +residual_scale_init: + value: 0.5 +scalar_lr: + value: 0.025 +seed: + value: 1337 +train_batch_tokens: + value: 786432 +train_seq_len: + value: 2048 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_120745-6rfmco93/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_120745-6rfmco93/files/output.log new file mode 100644 index 0000000000..4f5b257c05 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_120745-6rfmco93/files/output.log @@ -0,0 +1,32 @@ +wandb:initialized +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['10143.6', '11247.2', '12436.8', '13721.2', '15093.2', '16638.6', '18260.9', '19974.1', '21802.2', '23760.8', '21655.4', '23628.3', '25729.0', '27978.2', '30382.6'] growth=['1.110', '1.109', '1.106', '1.103', '1.100', '1.102', '1.098', '1.094', '1.092', '1.090', '1.095', '1.091', '1.089', '1.087', '1.086'] +step:1/50 train_loss:6.9310 train_time:2474ms step_avg:2473.98ms +step:2/50 train_loss:8.4480 train_time:4928ms step_avg:2463.85ms +step:3/50 train_loss:7.5657 train_time:7414ms step_avg:2471.28ms +step:4/50 train_loss:7.4125 train_time:9901ms step_avg:2475.24ms +step:5/50 train_loss:7.2581 train_time:12387ms step_avg:2477.37ms +step:6/50 train_loss:7.1563 train_time:14873ms step_avg:2478.80ms +step:7/50 train_loss:7.1205 train_time:17358ms step_avg:2479.79ms +step:8/50 train_loss:7.0021 train_time:19845ms step_avg:2480.59ms +step:9/50 train_loss:6.6191 train_time:22332ms step_avg:2481.32ms +step:10/50 train_loss:6.2241 train_time:24818ms step_avg:2481.82ms +step:20/50 train_loss:4.8854 train_time:49674ms step_avg:2483.72ms +step:25/50 val_loss:4.4102 val_bpb:2.6119 train_time:62144ms step_avg:2485.74ms h_norms=['12925.3', '12168.2', '11607.0', '11186.7', '10890.7', '10691.3', '10518.0', '10429.2', '10384.9', '10395.0', '10468.4', '10350.9', '10317.8', '10323.0', '10377.5'] growth=['0.930', '0.941', '0.954', '0.964', '0.974', '0.982', '0.984', '0.992', '0.996', '1.001', '0.987', '0.989', '0.997', '1.001', '1.005'] +step:30/50 train_loss:4.2124 train_time:74549ms step_avg:2484.96ms +step:40/50 train_loss:3.9336 train_time:99426ms step_avg:2485.66ms +step:50/50 train_loss:3.7638 train_time:124432ms step_avg:2488.64ms +step:50/50 val_loss:3.7456 val_bpb:2.2184 train_time:124466ms step_avg:2489.33ms h_norms=['20394.8', '18235.1', '16671.4', '15574.6', '14825.8', '14555.8', '14297.6', '14121.5', '14031.0', '13984.3', '14335.7', '14174.8', '14069.5', '14035.7', '14026.9'] growth=['0.871', '0.894', '0.914', '0.934', '0.952', 
'0.982', '0.982', '0.988', '0.994', '0.997', '0.991', '0.989', '0.993', '0.998', '0.999'] +peak memory allocated: 54399 MiB reserved: 55768 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:5.9394 val_bpb:3.5176 eval_time:68125ms +Serialized model: 106023671 bytes +Code size: 98931 bytes +Serialized model int6+lzma: 4804840 bytes +Total submission size int6+lzma: 4903771 bytes +final_int6_roundtrip val_loss:6.1350 val_bpb:3.6335 eval_time:67734ms +final_int6_roundtrip_exact val_loss:6.13503683 val_bpb:3.63351438 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_120745-6rfmco93/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_120745-6rfmco93/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_120745-6rfmco93/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_120745-6rfmco93/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_120745-6rfmco93/files/wandb-metadata.json new file mode 100644 index 0000000000..d95274eca3 --- /dev/null +++ 
b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_120745-6rfmco93/files/wandb-metadata.json @@ -0,0 +1,51 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T12:07:45.519065Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "b38248cf6d4a1387d06b2906628c717e59747b11" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "39173230592" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "qkgd2d3r5fmqvm20ng3nbm8vgtccoso2" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_120745-6rfmco93/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_120745-6rfmco93/files/wandb-summary.json new file mode 100644 index 0000000000..1a19c1110c --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_120745-6rfmco93/files/wandb-summary.json @@ -0,0 +1 @@ +{"_wandb":{"runtime":494},"val_bpb":2.2183812965919136,"lr_scale":1,"_runtime":494.840216454,"val_loss":3.745643895079573,"_timestamp":1.7745272083926284e+09,"train_loss":3.763766288757324,"step_avg_ms":2488.642202020128,"_step":50} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_123015-qgrbnv6t/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_123015-qgrbnv6t/files/config.yaml new file mode 100644 index 0000000000..48b1728e58 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_123015-qgrbnv6t/files/config.yaml @@ -0,0 +1,98 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + esobk79omtrnfsnpn87sdj4fh76agicw: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.1" + - --no-interpass-rmsnorm + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "39285575680" + email: nesta.midavaine@prosus.com + executable: /home/nesta/parameter-golf/.venv/bin/python3 + git: + commit: b38248cf6d4a1387d06b2906628c717e59747b11 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: 
NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + python: CPython 3.12.3 + root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback + startedAt: "2026-03-26T12:30:15.851208Z" + writerId: esobk79omtrnfsnpn87sdj4fh76agicw + m: [] + python_version: 3.12.3 + t: + "1": + - 1 + "2": + - 1 + "3": + - 2 + - 13 + - 16 + - 61 + "4": 3.12.3 + "5": 0.25.1 + "10": + - 20 + "12": 0.25.1 + "13": linux-x86_64 +core_end: + value: 8 +core_start: + value: 3 +feedback_mode: + value: diagonal +feedback_rank: + value: 2 +interpass_rmsnorm: + value: false +iterations: + value: 50 +jacobian_proxy_weight: + value: 0.1 +matrix_lr: + value: 0.025 +model_dim: + value: 512 +n_params: + value: 26927200 +num_layers: + value: 11 +num_passes: + value: 4 +residual_scale_init: + value: 0.5 +scalar_lr: + value: 0.025 +seed: + value: 1337 +train_batch_tokens: + value: 786432 +train_seq_len: + value: 2048 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_123015-qgrbnv6t/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_123015-qgrbnv6t/files/output.log new file mode 100644 index 0000000000..c5cb39b734 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_123015-qgrbnv6t/files/output.log @@ -0,0 +1,32 @@ +wandb:initialized +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms h_norms=['9918.1', '10850.5', '11857.1', '12933.0', '14082.2', '15372.6', '16714.7', '18128.7', '19628.1', '21227.1', '19383.5', '20984.5', '22685.3', '24491.8', '26411.9', '24206.9', '26133.0', '28177.6', '30331.6', '32611.0'] growth=['1.094', '1.094', '1.093', '1.091', '1.089', '1.092', '1.087', '1.085', '1.083', '1.081', '1.086', '1.083', '1.081', '1.080', '1.078', '1.082', '1.080', '1.078', '1.076', '1.075'] +step:1/50 train_loss:6.9310 train_time:3032ms step_avg:3031.88ms +step:2/50 train_loss:8.3473 train_time:6048ms step_avg:3024.07ms +step:3/50 train_loss:7.5117 train_time:9096ms step_avg:3032.08ms +step:4/50 train_loss:7.5611 train_time:12145ms step_avg:3036.23ms +step:5/50 train_loss:7.3188 train_time:15194ms step_avg:3038.75ms +step:6/50 train_loss:7.0774 train_time:18243ms step_avg:3040.49ms +step:7/50 train_loss:6.9519 train_time:21292ms step_avg:3041.67ms +step:8/50 train_loss:6.9005 train_time:24342ms step_avg:3042.69ms +step:9/50 train_loss:6.5418 train_time:27391ms step_avg:3043.47ms +step:10/50 train_loss:6.1552 train_time:30442ms step_avg:3044.24ms +step:20/50 train_loss:4.8491 train_time:60936ms step_avg:3046.80ms +step:25/50 val_loss:4.3716 val_bpb:2.5891 train_time:76222ms step_avg:3048.87ms h_norms=['12739.2', '11639.7', '10834.5', '10248.2', '9888.2', '9584.1', '9359.5', '9269.6', '9263.6', '9383.8', '9327.1', '9186.1', '9174.7', '9243.6', '9433.6', '9207.4', '9116.2', '9157.3', '9280.5', '9523.6'] growth=['0.899', '0.914', '0.931', '0.946', '0.965', '0.969', '0.977', '0.990', '0.999', '1.013', '0.979', '0.985', '0.999', '1.008', '1.021', '0.984', '0.990', '1.005', '1.013', '1.026'] 
+step:30/50 train_loss:4.2292 train_time:91444ms step_avg:3048.14ms +step:40/50 train_loss:3.9319 train_time:121963ms step_avg:3049.08ms +step:50/50 train_loss:3.7393 train_time:152613ms step_avg:3052.25ms +step:50/50 val_loss:3.7133 val_bpb:2.1992 train_time:152647ms step_avg:3052.93ms h_norms=['19200.7', '16705.4', '14993.2', '13850.4', '13137.0', '13070.5', '12983.4', '12953.9', '12984.4', '13061.0', '13002.1', '13046.2', '13096.0', '13184.0', '13296.2', '13048.7', '13170.7', '13265.1', '13387.0', '13519.2'] growth=['0.844', '0.870', '0.898', '0.924', '0.948', '0.995', '0.993', '0.998', '1.002', '1.006', '1.011', '1.003', '1.004', '1.007', '1.009', '1.021', '1.009', '1.007', '1.009', '1.010'] +peak memory allocated: 66516 MiB reserved: 67876 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:5.9477 val_bpb:3.5225 eval_time:83302ms +Serialized model: 106023671 bytes +Code size: 99082 bytes +Serialized model int6+lzma: 4803604 bytes +Total submission size int6+lzma: 4902686 bytes +final_int6_roundtrip val_loss:6.1439 val_bpb:3.6388 eval_time:82807ms +final_int6_roundtrip_exact val_loss:6.14393907 val_bpb:3.63878679 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_123015-qgrbnv6t/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_123015-qgrbnv6t/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_123015-qgrbnv6t/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 
+frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_123015-qgrbnv6t/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_123015-qgrbnv6t/files/wandb-metadata.json new file mode 100644 index 0000000000..c30f2510cf --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_123015-qgrbnv6t/files/wandb-metadata.json @@ -0,0 +1,51 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T12:30:15.851208Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "b38248cf6d4a1387d06b2906628c717e59747b11" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "39285575680" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "esobk79omtrnfsnpn87sdj4fh76agicw" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_123015-qgrbnv6t/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_123015-qgrbnv6t/files/wandb-summary.json new file mode 100644 index 0000000000..376a7987a5 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_123015-qgrbnv6t/files/wandb-summary.json @@ -0,0 +1 @@ +{"_timestamp":1.7745286351104448e+09,"lr_scale":1,"val_bpb":2.199228377998411,"_step":50,"train_loss":3.739262819290161,"step_avg_ms":3052.2532462800154,"val_loss":3.7133049943175975,"_runtime":600.315707325,"_wandb":{"runtime":600}} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_124119-cf7n2jes/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_124119-cf7n2jes/files/output.log new file mode 100644 index 0000000000..ff4b052b54 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_124119-cf7n2jes/files/output.log @@ -0,0 +1,32 @@ +wandb:initialized +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms h_norms=['9918.0', '10850.4', '11857.1', '12934.1', '14082.9', 
'15373.2', '16715.3', '18129.5', '19630.8', '21229.7', '19385.0', '20986.2', '22687.0', '24496.2', '26416.1', '24210.2', '26136.7', '28180.1', '30337.3', '32616.2'] growth=['1.094', '1.094', '1.093', '1.091', '1.089', '1.092', '1.087', '1.085', '1.083', '1.081', '1.086', '1.083', '1.081', '1.080', '1.078', '1.082', '1.080', '1.078', '1.077', '1.075'] +step:1/50 train_loss:6.9310 train_time:3036ms step_avg:3035.59ms +step:2/50 train_loss:8.3473 train_time:6057ms step_avg:3028.64ms +step:3/50 train_loss:7.5117 train_time:9110ms step_avg:3036.83ms +step:4/50 train_loss:7.5612 train_time:12164ms step_avg:3040.98ms +step:5/50 train_loss:7.3191 train_time:15217ms step_avg:3043.48ms +step:6/50 train_loss:7.0778 train_time:18271ms step_avg:3045.21ms +step:7/50 train_loss:6.9515 train_time:21325ms step_avg:3046.36ms +step:8/50 train_loss:6.8989 train_time:24379ms step_avg:3047.33ms +step:9/50 train_loss:6.5423 train_time:27433ms step_avg:3048.11ms +step:10/50 train_loss:6.1552 train_time:30487ms step_avg:3048.71ms +step:20/50 train_loss:4.8543 train_time:61023ms step_avg:3051.16ms +step:25/50 val_loss:4.3745 val_bpb:2.5908 train_time:76333ms step_avg:3053.30ms h_norms=['12750.8', '11662.4', '10864.5', '10284.5', '9926.1', '9627.8', '9403.7', '9313.4', '9309.0', '9428.9', '9374.8', '9233.4', '9221.1', '9291.2', '9480.8', '9258.3', '9166.0', '9206.0', '9330.2', '9572.9'] growth=['0.900', '0.915', '0.932', '0.947', '0.965', '0.970', '0.977', '0.990', '1.000', '1.013', '0.979', '0.985', '0.999', '1.008', '1.020', '0.985', '0.990', '1.004', '1.013', '1.026'] +step:30/50 train_loss:4.2237 train_time:91579ms step_avg:3052.64ms +step:40/50 train_loss:3.9233 train_time:122152ms step_avg:3053.79ms +step:50/50 train_loss:3.7378 train_time:152866ms step_avg:3057.32ms +step:50/50 val_loss:3.7019 val_bpb:2.1925 train_time:152900ms step_avg:3058.01ms h_norms=['18557.7', '16139.2', '14492.7', '13414.8', '12747.7', '12664.0', '12590.1', '12595.2', '12645.7', '12735.2', '12603.5', '12652.5', '12735.0', '12838.0', '12962.0', '12653.1', '12772.9', '12898.4', '13031.2', '13175.0'] growth=['0.841', '0.870', '0.898', '0.926', '0.950', '0.993', '0.994', '1.000', '1.004', '1.007', '1.009', '1.004', '1.007', '1.008', '1.010', '1.018', '1.009', '1.010', '1.010', '1.011'] +peak memory allocated: 66516 MiB reserved: 67876 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:5.9474 val_bpb:3.5224 eval_time:83475ms +Serialized model: 106023671 bytes +Code size: 99082 bytes +Serialized model int6+lzma: 4801388 bytes +Total submission size int6+lzma: 4900470 bytes +final_int6_roundtrip val_loss:6.1436 val_bpb:3.6386 eval_time:82976ms +final_int6_roundtrip_exact val_loss:6.14358204 val_bpb:3.63857534 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_124119-cf7n2jes/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_124119-cf7n2jes/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_124119-cf7n2jes/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 
+pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_124119-cf7n2jes/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_124119-cf7n2jes/files/wandb-metadata.json new file mode 100644 index 0000000000..509652a9d6 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_124119-cf7n2jes/files/wandb-metadata.json @@ -0,0 +1,51 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T12:41:19.243398Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "b38248cf6d4a1387d06b2906628c717e59747b11" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "39286394880" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": 
"Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "h5pmd33bz5si4l1d16p7aftz6ie9kpgq" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/output.log new file mode 100644 index 0000000000..cf02289148 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/output.log @@ -0,0 +1,18 @@ +wandb:initialized +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms h_norms=['9917.9', '10757.9', '11684.5', '12697.7', '13792.7', '14968.7', '16232.0', '17584.0', '19026.6', '20567.1', '18830.4', '20360.3', '21989.6', '23732.8', '25596.8', '23508.1', '25360.6', '27334.0', '29434.0', '31674.1'] growth=['1.081', '1.085', '1.086', '1.087', '1.086', '1.085', '1.084', '1.083', '1.082', '1.081', '1.082', '1.081', '1.080', '1.079', '1.079', '1.079', '1.079', '1.078', '1.077', '1.076'] +step:1/50 train_loss:6.9310 train_time:3149ms step_avg:3148.80ms +step:2/50 train_loss:8.4139 train_time:6265ms step_avg:3132.33ms +step:3/50 train_loss:7.5940 train_time:9413ms step_avg:3137.77ms +step:4/50 train_loss:7.3379 train_time:12563ms step_avg:3140.75ms +step:5/50 train_loss:7.1846 train_time:15710ms step_avg:3142.03ms +step:6/50 train_loss:7.1330 train_time:18856ms step_avg:3142.72ms +step:7/50 train_loss:7.0579 train_time:22001ms step_avg:3142.94ms +step:8/50 train_loss:6.8826 train_time:25148ms step_avg:3143.55ms +step:9/50 train_loss:6.5374 train_time:28300ms step_avg:3144.44ms +step:10/50 train_loss:6.1407 train_time:31445ms step_avg:3144.50ms +step:20/50 train_loss:4.7836 train_time:62947ms step_avg:3147.35ms diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 
+nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/wandb-metadata.json new file mode 100644 index 0000000000..3f6e5b6fa1 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/wandb-metadata.json @@ -0,0 +1,51 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T12:52:42.386699Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "b38248cf6d4a1387d06b2906628c717e59747b11" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "39287189504" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "vbqzv6lrtlyc5qq1jo1gdcj8ooc1lprq" +} \ No newline at end of file diff --git a/report.md b/report.md new file mode 100644 index 0000000000..37f4f7e26d --- /dev/null +++ b/report.md @@ -0,0 +1,118 @@ +# Report — Recurrent Core + Learned Feedback + QAT Experiment + +## Overview + +This report covers the setup and analysis of the **Recurrent Core with Learned Error-Feedback Correction and Full-Rollout QAT** submission for the OpenAI Parameter Golf challenge (16MB artifact, 10-minute training budget on 8xH100). + +### Challenge Context +The current SOTA on the leaderboard is **1.1194 BPB** (LeakyReLU² + Legal TTT + Parallel Muon). 
This submission builds on that record by adding:
+- A shared recurrent core (stem/core/tail architecture)
+- STE-based fake quantization during training (full-rollout QAT)
+- Learned low-rank error-feedback correction for quantization residuals
+
+## Architecture Summary
+
+### Stem / Core / Tail Partitioning
+```
+Input → Stem (3 layers) → Core (2 shared layers × 3 passes) → Tail (3 layers) → Output
+            ↓                                                     ↑
+            skip connections ────────────────────────────────→ consumed
+```
+
+- **Total unique layers:** 8 (3 stem + 2 core + 3 tail)
+- **Effective depth:** 12 (3 + 2×3 + 3) via weight reuse in the core
+- **Model parameters:** 19,679,297 (base) + 2,560 (feedback)
+
+### Error Feedback Module
+The learned feedback correction compensates for quantization error amplification across recurrence passes:
+
+```
+e_k     = U(V^T h_k)          — low-rank residual approximation (rank 2)
+c_k     = diag(d) · e_k       — diagonal correction (512 params)
+h_{k+1} = f_{W_q}(h_k + c_k)  — corrected recurrent update
+```
+
+Correction is inactive on pass 0 (no prior quantization residual exists).
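+
+A minimal PyTorch sketch of this module follows; the class name, initialization, and rollout pseudocode are illustrative assumptions rather than the record's exact code, but the shapes reproduce the 2,560 feedback parameters (512·2 for V, 2·512 for U, 512 for d):
+
+```python
+import torch
+import torch.nn as nn
+
+class LearnedFeedback(nn.Module):
+    """Rank-2 residual estimate with a learned diagonal gate (sketch)."""
+    def __init__(self, dim: int = 512, rank: int = 2):
+        super().__init__()
+        self.V = nn.Parameter(torch.randn(dim, rank) * 0.02)  # down-projection
+        self.U = nn.Parameter(torch.zeros(rank, dim))         # up-projection
+        self.d = nn.Parameter(torch.zeros(dim))               # diagonal gate
+
+    def forward(self, h: torch.Tensor) -> torch.Tensor:
+        e = (h @ self.V) @ self.U  # e_k = U(V^T h_k): rank-2 residual estimate
+        return h + self.d * e      # h_k + c_k, with c_k = diag(d) · e_k
+
+# Rollout sketch; pass 0 skips the correction, as noted above:
+#   h = stem(x)
+#   for k in range(num_passes):
+#       if k > 0:
+#           h = feedback(h)
+#       h = core(h)  # shared block(s), fake-quantized under full-rollout QAT
+#   logits = tail(h)
+```
+
+Zero-initializing `U` and `d` in this sketch makes the correction start as the identity, so enabling the module cannot perturb the rollout before any quantization residual statistics have been learned.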
+### Key Components Preserved from SOTA
+| Component | Detail |
+|-----------|--------|
+| Activation | LeakyReLU(0.5)² |
+| BigramHash | 1536 |
+| XSA | Last 4 unique layers |
+| Partial RoPE | 16/64 dims |
+| LayerNorm scaling | 1/√(layer+1) |
+| VE128 | Layers 6,7 |
+| Weight averaging | EMA(0.997) + SWA(every 50) |
+| Export | GPTQ-lite int6 + lzma |
+| Optimizer | Parallel Muon |
+
+## Environment
+
+| Component | Value |
+|-----------|-------|
+| GPU | 1× NVIDIA H200 (141GB HBM3e) |
+| CUDA | 13.0 |
+| PyTorch | 2.11.0+cu130 |
+| Flash Attention | FA3 Hopper (pre-built) |
+
+## Experiment Status
+
+### What Was Completed
+1. **Full environment setup** — PyTorch + FA3 + dependencies + data
+2. **Model verification** — Forward pass produces a valid loss (6.94 at init)
+3. **Code compatibility fixes** — Disabled `torch.compile` for PyTorch nightly compat
+4. **Run scripts created** — `smoke_test.sh` (50 steps) and `full_run_1gpu.sh` (80 min)
+
+### What Remains (Shell Sandbox Failure)
+The Cursor IDE shell tool entered a non-functional state during the session, preventing execution of the training runs. The scripts are ready to run manually:
+
+```bash
+# Smoke test (~5 min)
+bash records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/smoke_test.sh
+
+# Full run (80 min on 1 GPU ≈ 10 min on 8 GPUs)
+bash records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT/full_run_1gpu.sh
+```
+
+## Expected Results
+
+### Script 3 (Learned Feedback) — the strongest variant
+
+Based on the submission README and the architecture design:
+
+1. **QAT alone** (Script 1) should significantly reduce the quantization gap that plagued previous recurrent approaches (PR #363 saw 900× error amplification)
+
+2. **Learned diagonal feedback** (Script 3) should outperform both QAT-only and fixed feedback by:
+   - Adapting the correction to the actual error distribution
+   - Adding only 2,560 extra parameters (negligible impact on artifact size)
+   - Remaining compatible with the existing GPTQ-lite int6 + lzma export pipeline
+
+3. **Expected BPB range:** If recurrence successfully adds depth without degradation, the model should reach BPB comparable to or better than the base 11-layer record (~1.12-1.13), with the added benefit of fewer unique layers (8 vs 11).
+
+### Scaling on 1 GPU
+The `full_run_1gpu.sh` script uses `grad_accum_steps=8` to simulate the 8-GPU batch size. At 80 minutes on 1 GPU, this approximates the 10-minute 8-GPU training budget in terms of optimizer steps completed. The key metrics to watch:
+
+- **val_bpb during training** — Should decrease steadily
+- **post-EMA val_bpb** — Should be the best checkpoint quality
+- **int6 roundtrip val_bpb** — The official metric (after quantization + compression)
+- **sliding window eval** — Typically improves over standard eval by ~0.01-0.02 BPB
+
+## Key Design Questions (Experimental Plan)
+
+| Experiment | Script | Question |
+|-----------|--------|----------|
+| A | QAT only | Does QAT alone fix recurrence quantization? |
+| B | Fixed feedback | Does a tiny correction help beyond QAT? |
+| C | Learned feedback | Does learned feedback beat fixed at the same budget? |
+| D | Learned + TTT | Which TTT regime is safest for shared weights? |
+| E | Learned + stabilizers | Do clipping/scaling/Jacobian penalty help? |
+
+Script 3 (learned feedback) is expected to be the best because it can adapt its correction to the actual quantization error distribution during training, while the fixed version uses a static identity or diagonal.
+
+## Recommendations
+
+1. **Run Script 3 first** — It is the main experimental target with the highest expected performance
+2. **Compare against QAT-only** — Script 1 provides the ablation baseline
+3. **Monitor h_norms and growth_ratios** — The stabilizer diagnostics will show whether recurrence is staying stable
+4. **Check int6 roundtrip quality** — The gap between pre- and post-quantization BPB is the key metric for whether QAT+feedback is working
diff --git a/run_baseline_50step.sh b/run_baseline_50step.sh
new file mode 100755
index 0000000000..9a53a3395c
--- /dev/null
+++ b/run_baseline_50step.sh
@@ -0,0 +1,98 @@
+#!/bin/bash
+set -uo pipefail
+
+PYTHON="/home/nesta/parameter-golf/.venv/bin/python3"
+SCRIPT_DIR="/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon"
+cd "$SCRIPT_DIR"
+
+set -a; source /home/nesta/parameter-golf/.env; set +a
+
+export PYTHONUNBUFFERED=1
+export TORCH_COMPILE_DISABLE=1
+export DATA_PATH="../../../data/datasets/fineweb10B_sp1024"
+export TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model"
+export SEED=1337
+export ITERATIONS=50
+export MAX_WALLCLOCK_SECONDS=900
+export VAL_LOSS_EVERY=25
+export TRAIN_LOG_EVERY=10
+export WARMUP_STEPS=5
+export WARMDOWN_ITERS=10
+export TRAIN_BATCH_TOKENS=786432
+export TRAIN_SEQ_LEN=2048
+export EVAL_SEQ_LEN=2048
+export EVAL_STRIDE=0
+export NUM_LAYERS=11
+export MODEL_DIM=512
+export NUM_HEADS=8
+export NUM_KV_HEADS=4
+export BIGRAM_VOCAB_SIZE=1536
+export XSA_LAST_N=4
+export ROPE_DIMS=16
+export LN_SCALE=1
+export VE_ENABLED=1
+export VE_DIM=128
+export VE_LAYERS="9,10"
+export MATRIX_LR=0.025
+export SCALAR_LR=0.025
+export TIED_EMBED_LR=0.035
+export MUON_MOMENTUM=0.99
+export MUON_MOMENTUM_WARMUP_START=0.92
+export MUON_MOMENTUM_WARMUP_STEPS=5
+export MUON_WD=0.04
+export ADAM_WD=0.04
+export GRAD_CLIP_NORM=0.3
+export SWA_ENABLED=0
+export TTT_ENABLED=0
+
+LOG="/home/nesta/parameter-golf/baseline_50step.log"
+echo "START baseline SOTA 50-step ($(date +%H:%M:%S))"
+
+$PYTHON -c "
+import subprocess, sys, os, re, time
+
+os.environ['WANDB_PROJECT'] = 'parameter-golf'
+os.environ['WANDB_NAME'] = 'baseline_SOTA_50step'
+
+import wandb
+wandb.init(
+    project='parameter-golf',
+    name='baseline_SOTA_50step',
+    config={
+        'method': 'baseline_SOTA',
+        'num_layers': 11, 'model_dim': 512, 'num_heads': 8,
+        'num_passes': 1, 'recurrence': False,
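+        # Illustrative batch arithmetic (assumption: train_gpt.py applies the
+        # same 8-way gradient accumulation seen in the recurrent smoke logs):
+        # 786432 tokens/step = 8 micro-batches x 98304 tokens
+        #                    = 8 x 48 sequences of length 2048.
+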
'train_batch_tokens': 786432, 'train_seq_len': 2048, + 'iterations': 50, 'seed': 1337, + }, +) + +proc = subprocess.Popen( + [sys.executable, 'train_gpt.py'], + stdout=subprocess.PIPE, stderr=subprocess.STDOUT, + text=True, bufsize=1, +) + +logf = open('$LOG', 'w') +for line in proc.stdout: + logf.write(line) + logf.flush() + print(line, end='') + + m = re.search(r'step:(\d+)/\d+ val_loss:([\d.]+) val_bpb:([\d.]+) train_time:(\d+)ms step_avg:([\d.]+)ms', line) + if m: + step = int(m.group(1)) + wandb.log({'val_loss': float(m.group(2)), 'val_bpb': float(m.group(3)), 'step_avg_ms': float(m.group(5))}, step=step) + + m = re.search(r'step:(\d+)/\d+ train_loss:([\d.]+) train_time:(\d+)ms step_avg:([\d.]+)ms', line) + if m: + step = int(m.group(1)) + wandb.log({'train_loss': float(m.group(2)), 'step_avg_ms': float(m.group(4))}, step=step) + +proc.wait() +logf.close() +wandb.finish() +print(f'EXIT CODE: {proc.returncode}') +" 2>&1 + +echo "DONE baseline ($(date +%H:%M:%S))" diff --git a/sky.yaml b/sky.yaml deleted file mode 100644 index 6c8753456b..0000000000 --- a/sky.yaml +++ /dev/null @@ -1,78 +0,0 @@ -name: nesta-propensity-rl - -envs: - N_GPUS: 8 - CONFIG_NAME: propensity_ppo_8gpu - PREPROCESS: 0 - MAX_SAMPLES: 0 - WANDB_PROJECT_NAME: prosus-propensity-ppo - WANDB_EXPERIMENT_NAME: qwen3-4b-ppo-8gpu-v1 - TOTAL_EPOCHS: 1 - CONFIG_FOLDER: threshold_prediction - SOURCE_PARQUET: /my_data/propensity/promotions/input_data/propensity_cpo_prediction_v3_equal_sampled - DATA_DIR: /tmp/propensity_cpo - -file_mounts: - /my_data: - source: s3://lcm-ifood-data - mode: MOUNT - -resources: - cloud: nebius - region: eu-north1 - accelerators: H200:8 - disk_size: 1000 - ports: - - 8265 - image_id: docker:verlai/verl:vllm011.latest - -num_nodes: 1 - -workdir: . - -secrets: - WANDB_API_KEY: null - -setup: | - set -euo pipefail - - rm -rf verl - pip cache purge - - git clone https://github.com/volcengine/verl.git - cd verl - pip3 install -v -e .[vllm] - cd .. - - if [[ -n "${WANDB_API_KEY:-}" ]]; then - python3 -c "import wandb; wandb.login(relogin=True, key='${WANDB_API_KEY}')" - fi - - export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True - -run: | - set -euo pipefail - - # Preprocess data if requested and not already done - if [[ "${PREPROCESS}" == "1" ]]; then - echo "Preprocessing dataset..." - MAX_SAMPLES_ARG="" - if [[ "${MAX_SAMPLES}" -gt 0 ]]; then - MAX_SAMPLES_ARG="--max_samples ${MAX_SAMPLES}" - fi - python3 custom/propensity/datasets/prepare_data.py \ - --input_path "${SOURCE_PARQUET}" \ - --local_save_dir "${DATA_DIR}" \ - ${MAX_SAMPLES_ARG} - echo "Preprocessing complete." - else - echo "Skipping preprocessing (PREPROCESS=${PREPROCESS}, data exists=$(test -f ${DATA_DIR}/train.parquet && echo yes || echo no))." - fi - - export VLLM_USE_V1=1 - - python3 -m verl.trainer.main_ppo \ - --config-path="${PWD}/configs/experiments/${CONFIG_FOLDER}" \ - --config-name="${CONFIG_NAME}" - - echo "Training completed!" \ No newline at end of file diff --git a/sky_recurrent.yaml b/sky_recurrent.yaml deleted file mode 100644 index f4cfab36b4..0000000000 --- a/sky_recurrent.yaml +++ /dev/null @@ -1,57 +0,0 @@ -name: param-golf-recurrent - -envs: - SEED: 1337 - SCRIPT: train_bestbase_recurrent_feedback_learned.py - FEEDBACK_MODE: diagonal - FEEDBACK_RANK: 2 - TTT_REGIME: tail_only - TTT_ENABLED: 0 - -resources: - cloud: nebius - region: eu-north1 - accelerators: H200:8 - disk_size: 200 - image_id: docker:nvcr.io/nvidia/pytorch:24.12-py3 - -num_nodes: 1 - -workdir: . 
- -setup: | - set -euo pipefail - - pip install sentencepiece numpy huggingface-hub 2>/dev/null - pip install flash-attn --no-build-isolation 2>/dev/null || true - - # Download FineWeb dataset + tokenizer (~2 min) - if [ ! -d "data/datasets/fineweb10B_sp1024" ]; then - echo "Downloading FineWeb dataset..." - python data/cached_challenge_fineweb.py - else - echo "Dataset already present." - fi - -run: | - set -euo pipefail - - RECORD_DIR="records/track_10min_16mb/2026-03-25_RecurrentCore_LearnedFeedback_QAT" - cd "$RECORD_DIR" - - export DATA_PATH="../../../data/datasets/fineweb10B_sp1024" - export TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model" - export SEED="${SEED}" - export MAX_WALLCLOCK_SECONDS=600 - export TTT_ENABLED="${TTT_ENABLED}" - - echo "Running ${SCRIPT} with SEED=${SEED}..." - echo "Data: ${DATA_PATH}" - echo "Feedback: mode=${FEEDBACK_MODE} rank=${FEEDBACK_RANK}" - - torchrun --standalone --nproc_per_node=8 "${SCRIPT}" \ - --feedback-mode "${FEEDBACK_MODE}" \ - --feedback-rank "${FEEDBACK_RANK}" \ - --ttt-regime "${TTT_REGIME}" - - echo "Training completed! Logs in logs/" diff --git a/smoke_3pass.log b/smoke_3pass.log new file mode 100644 index 0000000000..d01ebdf564 --- /dev/null +++ b/smoke_3pass.log @@ -0,0 +1,45 @@ +logs/17edc995-7ab2-45c5-a3c7-81e9b7e989e8.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=3 core_end=8 num_passes=3 stem=3 core=5 tail=3 +model_params:26927199 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[7, 8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:30 warmup_steps:5 max_wallclock_seconds:600.000 +seed:1337 +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/30 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['10017.6', '11136.1', '12344.6', '13649.1', '14930.9', '8131.2', '9221.2', '10380.1', '11610.1', '12803.3', '8145.6', '9247.6', '10417.8', '11656.3', '12858.4'] growth=['1.114', '1.112', '1.109', '1.106', '1.094', '1.146', '1.134', '1.126', '1.118', '1.103', '1.148', '1.135', '1.127', '1.119', '1.103'] +step:1/30 train_loss:6.9310 train_time:2559ms step_avg:2558.98ms +step:2/30 train_loss:8.4366 train_time:5099ms step_avg:2549.65ms +step:3/30 train_loss:7.5697 train_time:7672ms step_avg:2557.48ms +step:4/30 train_loss:7.3744 train_time:10245ms step_avg:2561.17ms +step:5/30 train_loss:7.1785 train_time:12816ms step_avg:2563.29ms +step:6/30 train_loss:7.0957 train_time:15388ms step_avg:2564.71ms +step:7/30 train_loss:7.1043 train_time:17962ms step_avg:2565.93ms +step:8/30 train_loss:6.9979 train_time:20534ms step_avg:2566.77ms +step:9/30 train_loss:6.6202 train_time:23107ms step_avg:2567.47ms +step:10/30 train_loss:6.2350 train_time:25680ms step_avg:2568.01ms +step:15/30 val_loss:5.4108 val_bpb:3.2046 train_time:38665ms step_avg:2577.69ms h_norms=['12890.8', '15652.2', '19044.1', '23221.1', '28158.7', '9400.8', '12125.6', '15396.3', '19428.0', '24216.3', '9485.6', '12358.7', '15833.4', '20185.5', 
'25411.9'] growth=['1.206', '1.214', '1.217', '1.219', '1.213', '1.325', '1.290', '1.270', '1.262', '1.246', '1.337', '1.303', '1.281', '1.275', '1.259'] +step:20/30 train_loss:4.9219 train_time:51499ms step_avg:2574.94ms +step:30/30 train_loss:4.2511 train_time:77236ms step_avg:2574.53ms +step:30/30 val_loss:4.1921 val_bpb:2.4828 train_time:77269ms step_avg:2575.63ms h_norms=['18485.5', '22357.3', '27748.3', '35088.0', '44617.9', '10213.3', '14090.6', '19092.3', '25597.2', '33766.0', '10296.8', '14342.7', '19555.4', '26454.9', '35157.1'] growth=['1.167', '1.209', '1.241', '1.265', '1.272', '1.440', '1.380', '1.355', '1.341', '1.319', '1.451', '1.393', '1.363', '1.353', '1.329'] +peak memory allocated: 55186 MiB reserved: 56536 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:5.9678 val_bpb:3.5345 eval_time:70242ms +Serialized model: 106023671 bytes +Code size: 96617 bytes +Serialized model int6+lzma: 4718536 bytes +Total submission size int6+lzma: 4815153 bytes +final_int6_roundtrip val_loss:5.9995 val_bpb:3.5532 eval_time:69827ms +final_int6_roundtrip_exact val_loss:5.99946477 val_bpb:3.55322097 +run_3pass.sh: line 20: 209886 Killed PYTHONUNBUFFERED=1 TORCH_COMPILE_DISABLE=1 DATA_PATH="../../../data/datasets/fineweb10B_sp1024" TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model" SEED=1337 ITERATIONS=30 MAX_WALLCLOCK_SECONDS=600 VAL_LOSS_EVERY=15 TRAIN_LOG_EVERY=10 WARMUP_STEPS=5 WARMDOWN_ITERS=5 TRAIN_BATCH_TOKENS=786432 TRAIN_SEQ_LEN=2048 EVAL_SEQ_LEN=2048 EVAL_STRIDE=0 NUM_LAYERS=11 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4 BIGRAM_VOCAB_SIZE=1536 XSA_LAST_N=4 ROPE_DIMS=16 LN_SCALE=1 VE_ENABLED=1 VE_DIM=128 VE_LAYERS="9,10" MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 MUON_MOMENTUM_WARMUP_STEPS=5 MUON_WD=0.04 ADAM_WD=0.04 GRAD_CLIP_NORM=0.3 SWA_ENABLED=0 LATE_QAT=0 TTT_ENABLED=0 CORE_START=3 CORE_END=8 NUM_PASSES=3 CORE_QUANT_ENABLED=0 /home/nesta/parameter-golf/.venv/bin/python3 train_gpt_recurrent.py --feedback-mode diagonal --feedback-rank 2 --residual-scale-init 0.5 --jacobian-proxy-weight 0.01 diff --git a/smoke_noclip.log b/smoke_noclip.log new file mode 100644 index 0000000000..8b6bee8864 --- /dev/null +++ b/smoke_noclip.log @@ -0,0 +1,48 @@ +=== Smoke test: same params as full run, 50 iterations, torch.compile ENABLED === +logs/34b659e5-596a-4c30-9cca-d9e5659557e2.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=3 core_end=8 num_passes=2 stem=3 core=5 tail=3 +model_params:26927198 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[7, 8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:50 warmup_steps:5 max_wallclock_seconds:600.000 +seed:1337 +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms +step:1/50 train_loss:6.9310 train_time:772ms step_avg:771.98ms +step:2/50 train_loss:8.5366 train_time:1533ms step_avg:766.72ms +step:3/50 
train_loss:7.6283 train_time:2348ms step_avg:782.72ms +step:4/50 train_loss:7.3352 train_time:3157ms step_avg:789.28ms +step:5/50 train_loss:7.1457 train_time:3963ms step_avg:792.62ms +step:6/50 train_loss:7.0996 train_time:4777ms step_avg:796.13ms +step:7/50 train_loss:7.0716 train_time:5583ms step_avg:797.59ms +step:8/50 train_loss:6.9514 train_time:6393ms step_avg:799.14ms +step:9/50 train_loss:6.6007 train_time:7201ms step_avg:800.15ms +step:10/50 train_loss:6.2459 train_time:8013ms step_avg:801.34ms +step:20/50 train_loss:4.9217 train_time:16134ms step_avg:806.71ms +step:25/50 val_loss:4.4307 val_bpb:2.6241 train_time:20236ms step_avg:809.42ms +step:30/50 train_loss:4.2558 train_time:24275ms step_avg:809.16ms +step:40/50 train_loss:3.9787 train_time:32426ms step_avg:810.64ms +step:50/50 train_loss:3.7904 train_time:40586ms step_avg:811.73ms +step:50/50 val_loss:3.7588 val_bpb:2.2262 train_time:40624ms step_avg:812.48ms +peak memory allocated: 32531 MiB reserved: 32598 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:5.9162 val_bpb:3.5039 eval_time:20570ms +Serialized model: 106023671 bytes +Code size: 96617 bytes +Serialized model int6+lzma: 4808332 bytes +Total submission size int6+lzma: 4904949 bytes +final_int6_roundtrip val_loss:6.1231 val_bpb:3.6264 eval_time:25525ms +final_int6_roundtrip_exact val_loss:6.12306792 val_bpb:3.62642572 +smoke_test.sh: line 55: 201409 Killed PYTHONUNBUFFERED=1 DATA_PATH="../../../data/datasets/fineweb10B_sp1024" TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model" SEED=1337 ITERATIONS=50 MAX_WALLCLOCK_SECONDS=600 VAL_LOSS_EVERY=25 TRAIN_LOG_EVERY=10 WARMUP_STEPS=5 WARMDOWN_ITERS=10 TRAIN_BATCH_TOKENS=786432 TRAIN_SEQ_LEN=2048 EVAL_SEQ_LEN=2048 EVAL_STRIDE=0 NUM_LAYERS=11 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4 BIGRAM_VOCAB_SIZE=1536 XSA_LAST_N=4 ROPE_DIMS=16 LN_SCALE=1 VE_ENABLED=1 VE_DIM=128 VE_LAYERS="9,10" MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 MUON_MOMENTUM_WARMUP_STEPS=5 MUON_WD=0.04 ADAM_WD=0.04 GRAD_CLIP_NORM=0.3 SWA_ENABLED=1 SWA_EVERY=10 LATE_QAT=0 TTT_ENABLED=0 CORE_START=3 CORE_END=8 NUM_PASSES=2 CORE_QUANT_ENABLED=0 $PYTHON train_gpt_recurrent.py --feedback-mode diagonal --feedback-rank 2 --residual-scale-init 0.5 --jacobian-proxy-weight 0.01 diff --git a/smoke_passes.log b/smoke_passes.log new file mode 100644 index 0000000000..88df97e45c --- /dev/null +++ b/smoke_passes.log @@ -0,0 +1,49 @@ + +======================================== + NUM_PASSES=2 (30 steps) +======================================== +logs/cb21ee17-0238-480a-bab1-31abb392b62f.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=3 core_end=8 num_passes=2 stem=3 core=5 tail=3 +model_params:26927198 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[7, 8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:30 warmup_steps:5 max_wallclock_seconds:600.000 +seed:1337 +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 
+warmup_step:5/5 +step:0/30 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['10183.1', '11332.4', '12572.0', '13909.5', '15221.6', '8143.6', '9246.7', '10423.1', '11669.9', '12876.1'] growth=['1.115', '1.113', '1.109', '1.106', '1.094', '1.148', '1.135', '1.127', '1.120', '1.103'] +step:1/30 train_loss:6.9310 train_time:1964ms step_avg:1963.81ms +step:2/30 train_loss:8.5267 train_time:3903ms step_avg:1951.36ms +step:3/30 train_loss:7.6283 train_time:5874ms step_avg:1957.95ms +step:4/30 train_loss:7.3204 train_time:7845ms step_avg:1961.15ms +step:5/30 train_loss:7.1282 train_time:9815ms step_avg:1963.01ms +step:6/30 train_loss:7.0823 train_time:11786ms step_avg:1964.30ms +step:7/30 train_loss:7.0694 train_time:13756ms step_avg:1965.15ms +step:8/30 train_loss:6.9485 train_time:15727ms step_avg:1965.93ms +step:9/30 train_loss:6.6023 train_time:17699ms step_avg:1966.53ms +step:10/30 train_loss:6.2459 train_time:19670ms step_avg:1967.04ms +step:15/30 val_loss:5.4370 val_bpb:3.2201 train_time:29560ms step_avg:1970.64ms h_norms=['13059.2', '15785.8', '19048.4', '22971.2', '27484.0', '9278.5', '11837.1', '14855.2', '18507.6', '22745.8'] growth=['1.208', '1.209', '1.207', '1.206', '1.196', '1.308', '1.276', '1.255', '1.246', '1.229'] +step:20/30 train_loss:4.9597 train_time:39476ms step_avg:1973.82ms +step:30/30 train_loss:4.2722 train_time:59203ms step_avg:1973.44ms +step:30/30 val_loss:4.2093 val_bpb:2.4930 train_time:59236ms step_avg:1974.53ms h_norms=['16776.6', '19077.7', '22030.9', '25685.0', '30232.6', '9592.6', '12498.9', '15904.3', '19931.8', '24501.9'] growth=['1.114', '1.137', '1.155', '1.166', '1.177', '1.352', '1.303', '1.272', '1.253', '1.229'] +peak memory allocated: 42782 MiB reserved: 44140 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:5.9546 val_bpb:3.5267 eval_time:54001ms +Serialized model: 106023671 bytes +Code size: 96617 bytes +Serialized model int6+lzma: 4716008 bytes +Total submission size int6+lzma: 4812625 bytes +final_int6_roundtrip val_loss:5.9865 val_bpb:3.5455 eval_time:53683ms +final_int6_roundtrip_exact val_loss:5.98646698 val_bpb:3.54552295 +smoke_passes.sh: line 50: 203751 Killed env $COMMON_ENV NUM_PASSES=$PASSES $PYTHON train_gpt_recurrent.py --feedback-mode diagonal --feedback-rank 2 --residual-scale-init 0.5 --jacobian-proxy-weight 0.01 diff --git a/smoke_passes2.log b/smoke_passes2.log new file mode 100644 index 0000000000..146d0de9de --- /dev/null +++ b/smoke_passes2.log @@ -0,0 +1,26 @@ + +======================================== + NUM_PASSES=2 (30 steps) +======================================== +logs/4d899ae6-1bf8-4054-b64c-755b08712440.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=3 core_end=8 num_passes=2 stem=3 core=5 tail=3 +model_params:26927198 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[7, 8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:30 warmup_steps:5 max_wallclock_seconds:600.000 +seed:1337 +warmup_step:1/5 +warmup_step:2/5 
+warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/30 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms h_norms=['10183.1', '11332.7', '12572.4', '13909.9', '15222.2', '8143.2', '9246.2', '10422.7', '11669.4', '12875.8'] growth=['1.115', '1.113', '1.109', '1.106', '1.094', '1.148', '1.135', '1.127', '1.120', '1.103'] +smoke_passes.sh: line 50: 207705 Killed $PYTHON train_gpt_recurrent.py --feedback-mode diagonal --feedback-rank 2 --residual-scale-init 0.5 --jacobian-proxy-weight 0.01 diff --git a/test_4pass_noRMS_j0.1.log b/test_4pass_noRMS_j0.1.log new file mode 100644 index 0000000000..558cdce0d6 --- /dev/null +++ b/test_4pass_noRMS_j0.1.log @@ -0,0 +1,79 @@ +logs/e5b07d36-68a7-472c-9e61-88d88ee79a60.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=3 core_end=8 num_passes=4 stem=3 core=5 tail=3 +model_params:26927200 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:50 warmup_steps:5 max_wallclock_seconds:900.000 +seed:1337 +wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY. +wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin +wandb: WARNING Using a boolean value for 'reinit' is deprecated. Use 'return_previous' or 'finish_previous' instead. +wandb: setting up run qgrbnv6t +wandb: Tracking run with wandb version 0.25.1 +wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_123015-qgrbnv6t +wandb: Run `wandb offline` to turn off syncing. 
+wandb: Syncing run test_4pass_noRMS_j0.1_80GBcap +wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf +wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/qgrbnv6t +wandb:initialized +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms h_norms=['9918.1', '10850.5', '11857.1', '12933.0', '14082.2', '15372.6', '16714.7', '18128.7', '19628.1', '21227.1', '19383.5', '20984.5', '22685.3', '24491.8', '26411.9', '24206.9', '26133.0', '28177.6', '30331.6', '32611.0'] growth=['1.094', '1.094', '1.093', '1.091', '1.089', '1.092', '1.087', '1.085', '1.083', '1.081', '1.086', '1.083', '1.081', '1.080', '1.078', '1.082', '1.080', '1.078', '1.076', '1.075'] +step:1/50 train_loss:6.9310 train_time:3032ms step_avg:3031.88ms +step:2/50 train_loss:8.3473 train_time:6048ms step_avg:3024.07ms +step:3/50 train_loss:7.5117 train_time:9096ms step_avg:3032.08ms +step:4/50 train_loss:7.5611 train_time:12145ms step_avg:3036.23ms +step:5/50 train_loss:7.3188 train_time:15194ms step_avg:3038.75ms +step:6/50 train_loss:7.0774 train_time:18243ms step_avg:3040.49ms +step:7/50 train_loss:6.9519 train_time:21292ms step_avg:3041.67ms +step:8/50 train_loss:6.9005 train_time:24342ms step_avg:3042.69ms +step:9/50 train_loss:6.5418 train_time:27391ms step_avg:3043.47ms +step:10/50 train_loss:6.1552 train_time:30442ms step_avg:3044.24ms +step:20/50 train_loss:4.8491 train_time:60936ms step_avg:3046.80ms +step:25/50 val_loss:4.3716 val_bpb:2.5891 train_time:76222ms step_avg:3048.87ms h_norms=['12739.2', '11639.7', '10834.5', '10248.2', '9888.2', '9584.1', '9359.5', '9269.6', '9263.6', '9383.8', '9327.1', '9186.1', '9174.7', '9243.6', '9433.6', '9207.4', '9116.2', '9157.3', '9280.5', '9523.6'] growth=['0.899', '0.914', '0.931', '0.946', '0.965', '0.969', '0.977', '0.990', '0.999', '1.013', '0.979', '0.985', '0.999', '1.008', '1.021', '0.984', '0.990', '1.005', '1.013', '1.026'] +step:30/50 train_loss:4.2292 train_time:91444ms step_avg:3048.14ms +step:40/50 train_loss:3.9319 train_time:121963ms step_avg:3049.08ms +step:50/50 train_loss:3.7393 train_time:152613ms step_avg:3052.25ms +step:50/50 val_loss:3.7133 val_bpb:2.1992 train_time:152647ms step_avg:3052.93ms h_norms=['19200.7', '16705.4', '14993.2', '13850.4', '13137.0', '13070.5', '12983.4', '12953.9', '12984.4', '13061.0', '13002.1', '13046.2', '13096.0', '13184.0', '13296.2', '13048.7', '13170.7', '13265.1', '13387.0', '13519.2'] growth=['0.844', '0.870', '0.898', '0.924', '0.948', '0.995', '0.993', '0.998', '1.002', '1.006', '1.011', '1.003', '1.004', '1.007', '1.009', '1.021', '1.009', '1.007', '1.009', '1.010'] +peak memory allocated: 66516 MiB reserved: 67876 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:5.9477 val_bpb:3.5225 eval_time:83302ms +Serialized model: 106023671 bytes +Code size: 99082 bytes +Serialized model int6+lzma: 4803604 bytes +Total submission size int6+lzma: 4902686 bytes +final_int6_roundtrip val_loss:6.1439 val_bpb:3.6388 eval_time:82807ms +final_int6_roundtrip_exact val_loss:6.14393907 val_bpb:3.63878679 +wandb: uploading data; updating run metadata +wandb: uploading data +wandb: uploading history steps 15-15, summary, console lines 30-31 +wandb: +wandb: Run history: +wandb: lr_scale ▁▁▁▁▁▁▁▁▁▁▁▁▁▁ +wandb: step_avg_ms ▃▁▃▄▅▅▅▆▆▆▇▇▇█ +wandb: train_loss ▆█▇▇▆▆▆▆▅▅▃▂▁▁ +wandb: val_bpb █▂▁ +wandb: val_loss █▂▁ +wandb: +wandb: Run summary: +wandb: lr_scale 1 +wandb: step_avg_ms 3052.25325 +wandb: 
train_loss 3.73926 +wandb: val_bpb 2.19923 +wandb: val_loss 3.7133 +wandb: +wandb: 🚀 View run test_4pass_noRMS_j0.1_80GBcap at: https://wandb.ai/propensity/parameter-golf/runs/qgrbnv6t +wandb: ⭐️ View project at: https://wandb.ai/propensity/parameter-golf +wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s) +wandb: Find logs at: ./wandb/run-20260326_123015-qgrbnv6t/logs diff --git a/test_4pass_qat.log b/test_4pass_qat.log new file mode 100644 index 0000000000..7938fa4784 --- /dev/null +++ b/test_4pass_qat.log @@ -0,0 +1,43 @@ +logs/8fb4628b-3f66-468a-8d0a-b534be5ca9c9.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=3 core_end=8 num_passes=4 stem=3 core=5 tail=3 +model_params:26927200 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:50 warmup_steps:5 max_wallclock_seconds:900.000 +seed:1337 +wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY. +wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin +wandb: WARNING Using a boolean value for 'reinit' is deprecated. Use 'return_previous' or 'finish_previous' instead. +wandb: setting up run meaoom9b +wandb: Tracking run with wandb version 0.25.1 +wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b +wandb: Run `wandb offline` to turn off syncing. 
+wandb: Syncing run test_4pass_noRMS_j0.1_QAT +wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf +wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/meaoom9b +wandb:initialized +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms h_norms=['9917.9', '10757.9', '11684.5', '12697.7', '13792.7', '14968.7', '16232.0', '17584.0', '19026.6', '20567.1', '18830.4', '20360.3', '21989.6', '23732.8', '25596.8', '23508.1', '25360.6', '27334.0', '29434.0', '31674.1'] growth=['1.081', '1.085', '1.086', '1.087', '1.086', '1.085', '1.084', '1.083', '1.082', '1.081', '1.082', '1.081', '1.080', '1.079', '1.079', '1.079', '1.079', '1.078', '1.077', '1.076'] +step:1/50 train_loss:6.9310 train_time:3149ms step_avg:3148.80ms +step:2/50 train_loss:8.4139 train_time:6265ms step_avg:3132.33ms +step:3/50 train_loss:7.5940 train_time:9413ms step_avg:3137.77ms +step:4/50 train_loss:7.3379 train_time:12563ms step_avg:3140.75ms +step:5/50 train_loss:7.1846 train_time:15710ms step_avg:3142.03ms +step:6/50 train_loss:7.1330 train_time:18856ms step_avg:3142.72ms +step:7/50 train_loss:7.0579 train_time:22001ms step_avg:3142.94ms +step:8/50 train_loss:6.8826 train_time:25148ms step_avg:3143.55ms +step:9/50 train_loss:6.5374 train_time:28300ms step_avg:3144.44ms +step:10/50 train_loss:6.1407 train_time:31445ms step_avg:3144.50ms +step:20/50 train_loss:4.7836 train_time:62947ms step_avg:3147.35ms diff --git a/test_4pass_qat_stdout.log b/test_4pass_qat_stdout.log new file mode 100644 index 0000000000..c2f75f265e --- /dev/null +++ b/test_4pass_qat_stdout.log @@ -0,0 +1 @@ +START 4-pass no-RMSnorm jac=0.1 QAT, 80GB cap (12:52:38) diff --git a/test_4pass_stdout.log b/test_4pass_stdout.log new file mode 100644 index 0000000000..333ab740c6 --- /dev/null +++ b/test_4pass_stdout.log @@ -0,0 +1,3 @@ +START 4-pass no-RMSnorm jac=0.1, 80GB memory cap (12:30:11) +DONE => bpb@50=2.1992 int6=3.63878679 step=3052.25ms mem=66516MiB +FINISHED (12:40:21) diff --git a/test_4pass_ttt.log b/test_4pass_ttt.log new file mode 100644 index 0000000000..cd1c870fa2 --- /dev/null +++ b/test_4pass_ttt.log @@ -0,0 +1,57 @@ +logs/d889730e-3e41-4ead-bdf8-4ffa6a4ed64c.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=3 core_end=8 num_passes=4 stem=3 core=5 tail=3 +model_params:26927200 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:50 warmup_steps:5 max_wallclock_seconds:900.000 +seed:1337 +wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY. +wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin +wandb: WARNING Using a boolean value for 'reinit' is deprecated. Use 'return_previous' or 'finish_previous' instead. 
+wandb: setting up run cf7n2jes +wandb: Tracking run with wandb version 0.25.1 +wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_124119-cf7n2jes +wandb: Run `wandb offline` to turn off syncing. +wandb: Syncing run test_4pass_noRMS_j0.1_TTT +wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf +wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/cf7n2jes +wandb:initialized +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms h_norms=['9918.0', '10850.4', '11857.1', '12934.1', '14082.9', '15373.2', '16715.3', '18129.5', '19630.8', '21229.7', '19385.0', '20986.2', '22687.0', '24496.2', '26416.1', '24210.2', '26136.7', '28180.1', '30337.3', '32616.2'] growth=['1.094', '1.094', '1.093', '1.091', '1.089', '1.092', '1.087', '1.085', '1.083', '1.081', '1.086', '1.083', '1.081', '1.080', '1.078', '1.082', '1.080', '1.078', '1.077', '1.075'] +step:1/50 train_loss:6.9310 train_time:3036ms step_avg:3035.59ms +step:2/50 train_loss:8.3473 train_time:6057ms step_avg:3028.64ms +step:3/50 train_loss:7.5117 train_time:9110ms step_avg:3036.83ms +step:4/50 train_loss:7.5612 train_time:12164ms step_avg:3040.98ms +step:5/50 train_loss:7.3191 train_time:15217ms step_avg:3043.48ms +step:6/50 train_loss:7.0778 train_time:18271ms step_avg:3045.21ms +step:7/50 train_loss:6.9515 train_time:21325ms step_avg:3046.36ms +step:8/50 train_loss:6.8989 train_time:24379ms step_avg:3047.33ms +step:9/50 train_loss:6.5423 train_time:27433ms step_avg:3048.11ms +step:10/50 train_loss:6.1552 train_time:30487ms step_avg:3048.71ms +step:20/50 train_loss:4.8543 train_time:61023ms step_avg:3051.16ms +step:25/50 val_loss:4.3745 val_bpb:2.5908 train_time:76333ms step_avg:3053.30ms h_norms=['12750.8', '11662.4', '10864.5', '10284.5', '9926.1', '9627.8', '9403.7', '9313.4', '9309.0', '9428.9', '9374.8', '9233.4', '9221.1', '9291.2', '9480.8', '9258.3', '9166.0', '9206.0', '9330.2', '9572.9'] growth=['0.900', '0.915', '0.932', '0.947', '0.965', '0.970', '0.977', '0.990', '1.000', '1.013', '0.979', '0.985', '0.999', '1.008', '1.020', '0.985', '0.990', '1.004', '1.013', '1.026'] +step:30/50 train_loss:4.2237 train_time:91579ms step_avg:3052.64ms +step:40/50 train_loss:3.9233 train_time:122152ms step_avg:3053.79ms +step:50/50 train_loss:3.7378 train_time:152866ms step_avg:3057.32ms +step:50/50 val_loss:3.7019 val_bpb:2.1925 train_time:152900ms step_avg:3058.01ms h_norms=['18557.7', '16139.2', '14492.7', '13414.8', '12747.7', '12664.0', '12590.1', '12595.2', '12645.7', '12735.2', '12603.5', '12652.5', '12735.0', '12838.0', '12962.0', '12653.1', '12772.9', '12898.4', '13031.2', '13175.0'] growth=['0.841', '0.870', '0.898', '0.926', '0.950', '0.993', '0.994', '1.000', '1.004', '1.007', '1.009', '1.004', '1.007', '1.008', '1.010', '1.018', '1.009', '1.010', '1.010', '1.011'] +peak memory allocated: 66516 MiB reserved: 67876 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:5.9474 val_bpb:3.5224 eval_time:83475ms +Serialized model: 106023671 bytes +Code size: 99082 bytes +Serialized model int6+lzma: 4801388 bytes +Total submission size int6+lzma: 4900470 bytes +final_int6_roundtrip val_loss:6.1436 val_bpb:3.6386 eval_time:82976ms +final_int6_roundtrip_exact val_loss:6.14358204 val_bpb:3.63857534 diff --git a/test_4pass_ttt_stdout.log b/test_4pass_ttt_stdout.log new file mode 100644 index 0000000000..8f7abd9734 
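The export numbers in the TTT log above make the size budget concrete: 26,927,200 parameters serialize to 106,023,671 bytes in fp32, would pack to 26,927,200 × 6/8 ≈ 20.2 MB at 6 bits per weight, and the logged int6+lzma artifact is 4,801,388 bytes, so lzma contributes roughly a further 4× on top of the bit packing; code (99,082 bytes) is added to reach the 4,900,470-byte total. A self-contained sketch of the packing and compression stages only, using symmetric per-tensor int6 for illustration (the record's real exporter is the GPTQ-lite path in `quant.py`):

```python
import lzma
import torch

def int6_lzma_size(state_dict: dict[str, torch.Tensor]) -> int:
    """Illustrative size estimate: symmetric per-tensor int6 quantization,
    values bit-packed at 6 bits each, then lzma over the payload."""
    payload = bytearray()
    for w in state_dict.values():
        flat = w.detach().float().flatten()
        scale = flat.abs().max().clamp(min=1e-12) / 31.0
        q = torch.clamp(torch.round(flat / scale), -31, 31).to(torch.int64) + 31
        acc, bits = 0, 0
        for v in q.tolist():              # naive MSB-first 6-bit packer
            acc, bits = (acc << 6) | v, bits + 6
            while bits >= 8:
                bits -= 8
                payload.append((acc >> bits) & 0xFF)
            acc &= (1 << bits) - 1        # drop already-emitted bits
        if bits:                          # flush the final partial byte
            payload.append((acc << (8 - bits)) & 0xFF)
    return len(lzma.compress(bytes(payload), preset=9))
```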
--- /dev/null +++ b/test_4pass_ttt_stdout.log @@ -0,0 +1 @@ +START 4-pass no-RMSnorm jac=0.1 + TTT, 80GB cap (12:41:15) From 8caedf76dc67e9cc7ea454d34e96983490e29e63 Mon Sep 17 00:00:00 2001 From: nesta Date: Fri, 27 Mar 2026 10:45:57 +0000 Subject: [PATCH 03/23] things are looking decent --- .../run-20260326_125242-meaoom9b/files/output.log | 12 ++++++++++++ test_4pass_qat.log | 12 ++++++++++++ 2 files changed, 24 insertions(+) diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/output.log index cf02289148..932fd79c6a 100644 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/output.log +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/output.log @@ -16,3 +16,15 @@ step:8/50 train_loss:6.8826 train_time:25148ms step_avg:3143.55ms step:9/50 train_loss:6.5374 train_time:28300ms step_avg:3144.44ms step:10/50 train_loss:6.1407 train_time:31445ms step_avg:3144.50ms step:20/50 train_loss:4.7836 train_time:62947ms step_avg:3147.35ms +step:25/50 val_loss:4.3985 val_bpb:2.6051 train_time:78763ms step_avg:3150.51ms h_norms=['12132.8', '10932.9', '9995.3', '9323.7', '8882.5', '8488.8', '8191.2', '8002.8', '7969.0', '8058.7', '8126.2', '7929.0', '7834.6', '7885.5', '8052.5', '7921.9', '7788.9', '7759.4', '7870.9', '8096.2'] growth=['0.889', '0.901', '0.914', '0.933', '0.953', '0.956', '0.965', '0.977', '0.996', '1.011', '0.967', '0.976', '0.988', '1.006', '1.021', '0.974', '0.983', '0.996', '1.014', '1.029'] +step:30/50 train_loss:4.2134 train_time:94516ms step_avg:3150.53ms +step:40/50 train_loss:3.9354 train_time:126096ms step_avg:3152.40ms +step:50/50 train_loss:3.7653 train_time:157849ms step_avg:3156.98ms +step:50/50 val_loss:3.7283 val_bpb:2.2081 train_time:157883ms step_avg:3157.66ms h_norms=['19485.3', '16686.2', '14563.5', '13010.1', '11889.3', '11357.6', '10866.1', '10420.7', '10135.7', '9955.0', '10905.5', '10559.7', '10219.8', '10021.9', '9904.3', '10582.1', '10345.6', '10085.3', '9954.8', '9885.5'] growth=['0.843', '0.856', '0.873', '0.893', '0.914', '0.955', '0.957', '0.959', '0.973', '0.982', '0.971', '0.968', '0.968', '0.981', '0.988', '0.985', '0.978', '0.975', '0.987', '0.993'] +peak memory allocated: 66515 MiB reserved: 67880 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:5.9490 val_bpb:3.5233 eval_time:83476ms +Serialized model: 106023671 bytes +Code size: 99082 bytes +Serialized model int6+lzma: 4795396 bytes +Total submission size int6+lzma: 4894478 bytes diff --git a/test_4pass_qat.log b/test_4pass_qat.log index 7938fa4784..d40d4965ef 100644 --- a/test_4pass_qat.log +++ b/test_4pass_qat.log @@ -41,3 +41,15 @@ step:8/50 train_loss:6.8826 train_time:25148ms step_avg:3143.55ms step:9/50 train_loss:6.5374 train_time:28300ms step_avg:3144.44ms step:10/50 train_loss:6.1407 train_time:31445ms step_avg:3144.50ms step:20/50 train_loss:4.7836 train_time:62947ms step_avg:3147.35ms +step:25/50 val_loss:4.3985 val_bpb:2.6051 train_time:78763ms step_avg:3150.51ms h_norms=['12132.8', '10932.9', '9995.3', '9323.7', '8882.5', '8488.8', '8191.2', '8002.8', '7969.0', '8058.7', '8126.2', '7929.0', '7834.6', '7885.5', '8052.5', '7921.9', '7788.9', '7759.4', '7870.9', '8096.2'] growth=['0.889', '0.901', '0.914', '0.933', '0.953', '0.956', '0.965', '0.977', '0.996', '1.011', '0.967', '0.976', '0.988', 
'1.006', '1.021', '0.974', '0.983', '0.996', '1.014', '1.029'] +step:30/50 train_loss:4.2134 train_time:94516ms step_avg:3150.53ms +step:40/50 train_loss:3.9354 train_time:126096ms step_avg:3152.40ms +step:50/50 train_loss:3.7653 train_time:157849ms step_avg:3156.98ms +step:50/50 val_loss:3.7283 val_bpb:2.2081 train_time:157883ms step_avg:3157.66ms h_norms=['19485.3', '16686.2', '14563.5', '13010.1', '11889.3', '11357.6', '10866.1', '10420.7', '10135.7', '9955.0', '10905.5', '10559.7', '10219.8', '10021.9', '9904.3', '10582.1', '10345.6', '10085.3', '9954.8', '9885.5'] growth=['0.843', '0.856', '0.873', '0.893', '0.914', '0.955', '0.957', '0.959', '0.973', '0.982', '0.971', '0.968', '0.968', '0.981', '0.988', '0.985', '0.978', '0.975', '0.987', '0.993'] +peak memory allocated: 66515 MiB reserved: 67880 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:5.9490 val_bpb:3.5233 eval_time:83476ms +Serialized model: 106023671 bytes +Code size: 99082 bytes +Serialized model int6+lzma: 4795396 bytes +Total submission size int6+lzma: 4894478 bytes From 4a6317dbe6f6c4adb344d639175e6cfac10ed2d9 Mon Sep 17 00:00:00 2001 From: nesta Date: Fri, 27 Mar 2026 10:46:07 +0000 Subject: [PATCH 04/23] logs should not be committed --- baseline_stdout.log | 110 +-- eval_2p3c_ttt_4pass.log | 81 ++ eval_ttt_2pass.log | 48 ++ eval_ttt_6pass.log | 43 ++ full_2pass_3core.log | 430 +++++++++++ full_4pass.log | 145 ++++ full_4pass_stdout.log | Bin 0 -> 2445 bytes full_baseline.log | 723 ++++++++++++++++++ full_lora_r8.log | 217 ++++++ lora_r8_stdout.log | 34 + lora_test_500step.log | 151 ++++ lora_test_r8_500step.log | 92 +++ lora_test_r8_stdout.log | 4 + lora_test_stdout.log | 10 + lora_test_v2_stdout.log | 9 + records/full_baseline(save).log | 723 ++++++++++++++++++ .../eval_ttt_passes.sh | 82 ++ .../lora-fix-plan.md | 204 +++++ .../run_2pass_3core.sh | 82 ++ .../run_baseline_4pass.sh | 83 ++ .../run_full_4pass.sh | 81 ++ .../run_lora_test.sh | 79 ++ .../run_lora_test_r8.sh | 83 ++ .../sweep_passes.sh | 88 +++ .../train_gpt_recurrent.py | 120 ++- .../wandb/debug-internal.log | 2 +- .../wandb/debug.log | 2 +- .../wandb/latest-run | 2 +- .../files/config.yaml | 98 +++ .../files/output.log | 2 + .../files/wandb-summary.json | 1 + .../files/config.yaml | 98 +++ .../files/output.log | 32 + .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 51 ++ .../files/wandb-summary.json | 1 + .../files/output.log | 19 + .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 51 ++ .../files/output.log | 35 + .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 51 ++ .../files/output.log | 32 + .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 51 ++ .../files/config.yaml | 96 +++ .../files/output.log | 67 ++ .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 51 ++ .../files/wandb-summary.json | 1 + .../files/output.log | 122 +++ .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 51 ++ .../files/output.log | 42 + .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 53 ++ .../files/output.log | 51 ++ .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 53 ++ .../files/config.yaml | 100 +++ .../files/output.log | 100 +++ .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 53 ++ .../files/wandb-summary.json | 1 + .../files/output.log | 93 +++ .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 53 ++ .../files/output.log | 35 + .../files/requirements.txt | 101 +++ 
.../files/wandb-metadata.json | 53 ++ .../files/output.log | 42 + .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 53 ++ .../files/output.log | 42 + .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 53 ++ .../files/output.log | 32 + .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 53 ++ .../files/output.log | 23 + .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 53 ++ .../files/output.log | 42 + .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 53 ++ .../files/output.log | 36 + .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 53 ++ .../files/output.log | 48 ++ .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 53 ++ .../files/output.log | 35 + .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 57 ++ .../files/output.log | 39 + .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 57 ++ .../files/output.log | 39 + .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 57 ++ .../files/config.yaml | 106 +++ .../files/output.log | 100 +++ .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 59 ++ .../files/wandb-summary.json | 1 + .../files/config.yaml | 104 +++ .../files/output.log | 50 ++ .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 59 ++ .../files/wandb-summary.json | 1 + .../files/config.yaml | 104 +++ .../files/output.log | 14 + .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 59 ++ .../files/wandb-summary.json | 1 + .../files/config.yaml | 104 +++ .../files/output.log | 50 ++ .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 59 ++ .../files/wandb-summary.json | 1 + .../files/output.log | 63 ++ .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 59 ++ .../files/output.log | 65 ++ .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 55 ++ .../files/config.yaml | 101 +++ .../files/output.log | 88 +++ .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 55 ++ .../files/wandb-summary.json | 1 + .../files/config.yaml | 101 +++ .../files/output.log | 186 +++++ .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 55 ++ .../files/wandb-summary.json | 1 + .../files/config.yaml | 98 +++ .../files/output.log | 41 + .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 53 ++ .../files/wandb-summary.json | 1 + .../files/config.yaml | 98 +++ .../files/output.log | 41 + .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 53 ++ .../files/wandb-summary.json | 1 + .../files/config.yaml | 98 +++ .../files/output.log | 109 +++ .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 53 ++ .../files/wandb-summary.json | 1 + .../files/config.yaml | 98 +++ .../files/output.log | 41 + .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 53 ++ .../files/wandb-summary.json | 1 + .../files/config.yaml | 98 +++ .../files/output.log | 41 + .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 53 ++ .../files/wandb-summary.json | 1 + .../files/config.yaml | 98 +++ .../files/output.log | 109 +++ .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 53 ++ .../files/wandb-summary.json | 1 + .../files/config.yaml | 100 +++ .../files/output.log | 319 ++++++++ .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 53 ++ .../files/wandb-summary.json | 1 + .../files/config.yaml | 98 +++ 
.../files/output.log | 67 ++ .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 53 ++ .../files/wandb-summary.json | 1 + .../files/config.yaml | 98 +++ .../files/output.log | 67 ++ .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 53 ++ .../files/wandb-summary.json | 1 + .../files/config.yaml | 98 +++ .../files/output.log | 67 ++ .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 53 ++ .../files/wandb-summary.json | 1 + .../files/config.yaml | 98 +++ .../files/output.log | 41 + .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 53 ++ .../files/wandb-summary.json | 1 + .../files/config.yaml | 100 +++ .../files/output.log | 379 +++++++++ .../files/requirements.txt | 101 +++ .../files/wandb-metadata.json | 53 ++ .../files/wandb-summary.json | 1 + sweep_5pass.log | 78 ++ sweep_6pass.log | 44 ++ sweep_passes_results.txt | 2 + sweep_stdout.log | 3 + test_4pass_qat.log | 23 + test_4pass_qat_stdout.log | 2 + 202 files changed, 15307 insertions(+), 83 deletions(-) create mode 100644 eval_2p3c_ttt_4pass.log create mode 100644 eval_ttt_2pass.log create mode 100644 eval_ttt_6pass.log create mode 100644 full_2pass_3core.log create mode 100644 full_4pass.log create mode 100644 full_4pass_stdout.log create mode 100644 full_baseline.log create mode 100644 full_lora_r8.log create mode 100644 lora_r8_stdout.log create mode 100644 lora_test_500step.log create mode 100644 lora_test_r8_500step.log create mode 100644 lora_test_r8_stdout.log create mode 100644 lora_test_stdout.log create mode 100644 lora_test_v2_stdout.log create mode 100644 records/full_baseline(save).log create mode 100755 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/eval_ttt_passes.sh create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/lora-fix-plan.md create mode 100755 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_2pass_3core.sh create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_baseline_4pass.sh create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_full_4pass.sh create mode 100755 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_lora_test.sh create mode 100755 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_lora_test_r8.sh create mode 100755 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/sweep_passes.sh create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_130804-ce1th47g/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_130804-ce1th47g/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_130804-ce1th47g/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_130804-ce1th47g/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_130804-ce1th47g/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132011-5rznhtcu/files/output.log create mode 100644 
records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132011-5rznhtcu/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132011-5rznhtcu/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132634-x2sl50qr/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132634-x2sl50qr/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132634-x2sl50qr/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133036-mo9kb26s/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133036-mo9kb26s/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133036-mo9kb26s/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133353-atic3pnd/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133353-atic3pnd/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133353-atic3pnd/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133353-atic3pnd/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133353-atic3pnd/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133558-nwftkz5m/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133558-nwftkz5m/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133558-nwftkz5m/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_151933-hmzzit2n/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_151933-hmzzit2n/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_151933-hmzzit2n/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_152723-snlqmhr2/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_152723-snlqmhr2/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_152723-snlqmhr2/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_154455-y5p28i5r/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_154455-y5p28i5r/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_154455-y5p28i5r/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_154455-y5p28i5r/files/wandb-metadata.json create mode 100644 
records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_154455-y5p28i5r/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_163311-3jt79ap8/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_163311-3jt79ap8/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_163311-3jt79ap8/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_171548-0nylbkj0/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_171548-0nylbkj0/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_171548-0nylbkj0/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_172142-fes18y77/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_172142-fes18y77/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_172142-fes18y77/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_173159-f48k3ztp/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_173159-f48k3ztp/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_173159-f48k3ztp/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174230-5r5mblr1/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174230-5r5mblr1/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174230-5r5mblr1/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174616-zd1j5kg5/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174616-zd1j5kg5/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174616-zd1j5kg5/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174902-ponrb7vw/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174902-ponrb7vw/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174902-ponrb7vw/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_175915-2o340uez/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_175915-2o340uez/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_175915-2o340uez/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_180447-w7yechln/files/output.log create mode 100644 
records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_180447-w7yechln/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_180447-w7yechln/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_181843-pwdgzvp6/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_181843-pwdgzvp6/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_181843-pwdgzvp6/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_182338-0jlwabms/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_182338-0jlwabms/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_182338-0jlwabms/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_183144-z6wj6zap/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_183144-z6wj6zap/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_183144-z6wj6zap/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_184007-n9zy31jn/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_184007-n9zy31jn/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_184007-n9zy31jn/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_184007-n9zy31jn/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_184007-n9zy31jn/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-8n0ize2o/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-8n0ize2o/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-8n0ize2o/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-8n0ize2o/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-8n0ize2o/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-9ekyp2ua/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-9ekyp2ua/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-9ekyp2ua/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-9ekyp2ua/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-9ekyp2ua/files/wandb-summary.json create mode 100644 
records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-jsf1k0pv/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-jsf1k0pv/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-jsf1k0pv/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-jsf1k0pv/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-jsf1k0pv/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_194337-b8eb1lhl/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_194337-b8eb1lhl/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_194337-b8eb1lhl/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_200826-h4wnno7e/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_200826-h4wnno7e/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_200826-h4wnno7e/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_203403-fzb1y9o8/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_203403-fzb1y9o8/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_203403-fzb1y9o8/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_203403-fzb1y9o8/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_203403-fzb1y9o8/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_205139-3z8g4kez/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_205139-3z8g4kez/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_205139-3z8g4kez/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_205139-3z8g4kez/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_205139-3z8g4kez/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-ngp8wevn/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-ngp8wevn/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-ngp8wevn/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-ngp8wevn/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-ngp8wevn/files/wandb-summary.json create mode 100644 
records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-w0ibl7rl/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-w0ibl7rl/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-w0ibl7rl/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-w0ibl7rl/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-w0ibl7rl/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/config.yaml create mode 100644 
records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/output.log create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/wandb-summary.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/output.log create mode 100644 
records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/requirements.txt create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/wandb-metadata.json create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/wandb-summary.json create mode 100644 sweep_5pass.log create mode 100644 sweep_6pass.log create mode 100644 sweep_passes_results.txt create mode 100644 sweep_stdout.log diff --git a/baseline_stdout.log b/baseline_stdout.log index 5a653b4e0b..67f3158584 100644 --- a/baseline_stdout.log +++ b/baseline_stdout.log @@ -1,75 +1,35 @@ -START baseline SOTA 50-step (12:20:35) -wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY. -wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin -wandb: setting up run nx8viusx -wandb: Tracking run with wandb version 0.25.1 -wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/wandb/run-20260326_122037-nx8viusx -wandb: Run `wandb offline` to turn off syncing. -wandb: Syncing run baseline_SOTA_50step -wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf -wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/nx8viusx -logs/bff51f18-fbb9-43cc-9903-c84284e4e76d.txt -val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model -train_loader:dataset:fineweb10B_sp1024 train_shards:10 -val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 -model_params:26928220 -mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 -XSA:last_4 active_layers:[7, 8, 9, 10] -world_size:1 grad_accum_steps:8 -sdp_backends:cudnn=False flash=True mem_efficient=False math=False -attention_mode:gqa num_heads:8 num_kv_heads:4 -tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 -train_batch_tokens:786432 train_seq_len:2048 iterations:50 warmup_steps:5 max_wallclock_seconds:900.000 -seed:1337 -warmup_step:1/5 -warmup_step:2/5 -warmup_step:3/5 -warmup_step:4/5 -warmup_step:5/5 -step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms -step:1/50 train_loss:6.9310 train_time:1335ms step_avg:1334.89ms -step:2/50 train_loss:8.6894 train_time:2639ms step_avg:1319.33ms -step:3/50 train_loss:7.7641 train_time:3975ms step_avg:1325.02ms -step:4/50 train_loss:7.2309 train_time:5311ms step_avg:1327.85ms -step:5/50 train_loss:7.1292 train_time:6648ms step_avg:1329.55ms -step:6/50 train_loss:7.1698 train_time:7983ms step_avg:1330.57ms -step:7/50 train_loss:7.1045 train_time:9320ms step_avg:1331.38ms -step:8/50 train_loss:6.9776 train_time:10656ms step_avg:1331.99ms -step:9/50 train_loss:6.6169 train_time:11993ms step_avg:1332.53ms -step:10/50 train_loss:6.2604 train_time:13330ms step_avg:1332.96ms -step:20/50 train_loss:5.1681 train_time:26695ms step_avg:1334.74ms -step:25/50 val_loss:4.6120 val_bpb:2.7315 train_time:33413ms step_avg:1336.54ms -step:30/50 train_loss:4.3901 train_time:40068ms step_avg:1335.60ms -step:40/50 train_loss:4.0167 train_time:53443ms step_avg:1336.07ms -step:50/50 train_loss:3.8262 train_time:66820ms step_avg:1336.40ms -step:50/50 val_loss:3.7856 val_bpb:2.2421 train_time:66853ms step_avg:1337.06ms -peak memory allocated: 30083 MiB reserved: 31168 MiB -ema:applying EMA 
weights -DIAGNOSTIC post_ema val_loss:5.8987 val_bpb:3.4935 eval_time:38419ms -Serialized model: 106027446 bytes -Code size: 89458 bytes -Serialized model int6+lzma: 4809376 bytes -Total submission size int6+lzma: 4898834 bytes -final_int6_roundtrip val_loss:6.0576 val_bpb:3.5876 eval_time:38209ms -final_int6_roundtrip_exact val_loss:6.05759208 val_bpb:3.58764724 -wandb: updating run metadata -wandb: uploading history steps 15-15, summary -wandb: -wandb: Run history: -wandb: step_avg_ms ▁███████████████ -wandb: train_loss ▅█▇▆▆▆▆▆▅▅▃▂▁▁ -wandb: val_bpb █▃▁ -wandb: val_loss █▃▁ -wandb: -wandb: Run summary: -wandb: step_avg_ms 1337.06 -wandb: train_loss 3.8262 -wandb: val_bpb 2.2421 -wandb: val_loss 3.7856 -wandb: -wandb: 🚀 View run baseline_SOTA_50step at: https://wandb.ai/propensity/parameter-golf/runs/nx8viusx -wandb: ⭐️ View project at: https://wandb.ai/propensity/parameter-golf -wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s) -wandb: Find logs at: ./wandb/run-20260326_122037-nx8viusx/logs -EXIT CODE: -9 -DONE baseline (12:28:16) +START full run: 4-pass baseline (no LoRA) TTT SWA, 80min (Thu Mar 26 23:01:24 UTC 2026) + +=== FINAL RESULTS === +stopping_early: wallclock_cap train_time:4800814ms step:3456/20000 +peak memory allocated: 50545 MiB reserved: 50594 MiB +final_int6_roundtrip_exact val_loss:1.95735252 val_bpb:1.15925441 +final_int6_sliding_window_exact val_loss:1.91642779 val_bpb:1.13501949 +legal_ttt_exact val_loss:1.91163996 val_bpb:1.13218386 +FINISHED (Fri Mar 27 01:46:16 UTC 2026) +python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__ + return self.current_callable(inputs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run + out = model(new_inputs) + ^^^^^^^^^^^^^^^^^ + File "/tmp/torchinductor_nesta/wm/cwmtq2g54vzhxzvsc4odvxx7srlroi2tosqcftjhmyw2c637ogrq.py", line 12255, in call + buf54 = empty_strided_cuda((48, 2048, 512), (1048576, 512, 1), torch.bfloat16) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 18.69 MiB is free. Process 448538 has 758.00 MiB memory in use. Process 448539 has 758.00 MiB memory in use. Process 448534 has 51.16 GiB memory in use. Process 448547 has 38.69 GiB memory in use. Including non-PyTorch memory, this process has 48.42 GiB memory in use. Of the allocated memory 47.76 GiB is allocated by PyTorch, and 3.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf) +wandb: +wandb: 🚀 View run full_4pass_baseline_80min at: https://wandb.ai/propensity/parameter-golf/runs/jkh80zal +wandb: Find logs at: wandb/run-20260326_230158-jkh80zal/logs +wandb: +wandb: 🚀 View run full_4pass_baseline_80min at: https://wandb.ai/propensity/parameter-golf/runs/fsi4c82a +wandb: Find logs at: wandb/run-20260326_230158-fsi4c82a/logs +wandb: +wandb: 🚀 View run full_4pass_baseline_80min at: https://wandb.ai/propensity/parameter-golf/runs/43bipylb +wandb: Find logs at: wandb/run-20260326_230158-43bipylb/logs +wandb: +wandb: 🚀 View run full_4pass_baseline_80min at: https://wandb.ai/propensity/parameter-golf/runs/zcabiozu +wandb: Find logs at: wandb/run-20260326_230158-zcabiozu/logs +FINISHED (Thu Mar 26 23:02:28 UTC 2026) diff --git a/eval_2p3c_ttt_4pass.log b/eval_2p3c_ttt_4pass.log new file mode 100644 index 0000000000..e8b2bd7106 --- /dev/null +++ b/eval_2p3c_ttt_4pass.log @@ -0,0 +1,81 @@ +=== TTT eval: 2p3c model with 4 passes (Fri Mar 27 10:29:27 UTC 2026) === +logs/ef6cddfc-76b9-4b54-9b3d-cf13e771c176.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=4 core_end=7 num_passes=2 stem=4 core=3 tail=4 +eval_only: loading checkpoint /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/final_model.pt +eval_only: overriding num_passes 2 -> 4 +eval_only: ResidualScale padded/trimmed 2 -> 4 +eval_only: running TTT with 4 passes +ttt_sliding:start chunks=1893 chunk_tokens=32768 total_windows=969088 stride=64 ttt_lr=0.002 ttt_epochs=3 freeze_blocks=0 +ttt_sliding:params unfrozen=26927712 frozen=0 + ttt_chunk [1/1893] bpb=1.225438 time=1.9s + ttt_chunk [11/1893] bpb=1.126805 time=16.7s + ttt_chunk [21/1893] bpb=1.133151 time=31.3s + ttt_chunk [31/1893] bpb=1.136732 time=46.0s + ttt_chunk [41/1893] bpb=1.131236 time=60.7s + ttt_chunk [51/1893] bpb=1.131592 time=75.4s + ttt_chunk [61/1893] bpb=1.134411 time=90.1s + ttt_chunk [71/1893] bpb=1.131938 time=104.8s + ttt_chunk [81/1893] bpb=1.127734 time=119.5s + ttt_chunk [91/1893] bpb=1.126218 time=134.2s + ttt_chunk [101/1893] bpb=1.126657 time=148.9s + ttt_chunk [111/1893] bpb=1.126324 time=163.6s + ttt_chunk [121/1893] bpb=1.122359 time=178.3s + ttt_chunk [131/1893] bpb=1.121249 time=192.9s + ttt_chunk [141/1893] bpb=1.119837 time=207.6s + ttt_chunk [151/1893] bpb=1.119739 time=222.3s + ttt_chunk [161/1893] bpb=1.120343 time=237.0s + ttt_chunk [171/1893] bpb=1.122176 time=251.7s + ttt_chunk [181/1893] bpb=1.122053 time=266.4s + ttt_chunk [191/1893] bpb=1.124291 time=281.1s + ttt_chunk [201/1893] bpb=1.123696 time=295.8s + ttt_chunk [211/1893] bpb=1.122586 time=310.5s + ttt_chunk [221/1893] bpb=1.123345 time=325.2s + ttt_chunk [231/1893] bpb=1.122959 time=339.9s + ttt_chunk [241/1893] bpb=1.123089 time=354.6s + ttt_chunk [251/1893] bpb=1.122554 time=369.3s + ttt_chunk [261/1893] bpb=1.121768 time=384.0s + ttt_chunk [271/1893] bpb=1.120767 time=398.7s + ttt_chunk [281/1893] bpb=1.122262 time=413.4s + ttt_chunk [291/1893] bpb=1.121808 time=428.1s + ttt_chunk [301/1893] bpb=1.122611 time=442.8s + ttt_chunk [311/1893] bpb=1.122628 time=457.4s + ttt_chunk [321/1893] 
bpb=1.123312 time=472.1s + ttt_chunk [331/1893] bpb=1.122708 time=486.8s + ttt_chunk [341/1893] bpb=1.122283 time=501.5s + ttt_chunk [351/1893] bpb=1.122944 time=516.2s + ttt_chunk [361/1893] bpb=1.123664 time=530.9s + ttt_chunk [371/1893] bpb=1.123511 time=545.6s + ttt_chunk [381/1893] bpb=1.123245 time=560.3s + ttt_chunk [391/1893] bpb=1.123914 time=575.0s + ttt_chunk [401/1893] bpb=1.123421 time=589.7s + ttt_chunk [411/1893] bpb=1.122449 time=604.4s + ttt_chunk [421/1893] bpb=1.122552 time=619.1s + ttt_chunk [431/1893] bpb=1.122931 time=633.9s + ttt_chunk [441/1893] bpb=1.122301 time=648.6s + ttt_chunk [451/1893] bpb=1.122421 time=663.3s + ttt_chunk [461/1893] bpb=1.122281 time=678.0s + ttt_chunk [471/1893] bpb=1.121841 time=692.7s + ttt_chunk [481/1893] bpb=1.121629 time=707.4s + ttt_chunk [491/1893] bpb=1.121779 time=722.0s + ttt_chunk [501/1893] bpb=1.121517 time=736.7s + ttt_chunk [511/1893] bpb=1.121016 time=751.4s + ttt_chunk [521/1893] bpb=1.120679 time=766.1s + ttt_chunk [531/1893] bpb=1.121376 time=780.8s + ttt_chunk [541/1893] bpb=1.121495 time=795.5s + ttt_chunk [551/1893] bpb=1.120956 time=810.2s + ttt_chunk [561/1893] bpb=1.120809 time=824.9s + ttt_chunk [571/1893] bpb=1.120526 time=839.6s + ttt_chunk [581/1893] bpb=1.120159 time=854.3s + ttt_chunk [591/1893] bpb=1.119599 time=869.0s + ttt_chunk [601/1893] bpb=1.119592 time=883.7s + ttt_chunk [611/1893] bpb=1.119278 time=898.4s + ttt_chunk [621/1893] bpb=1.119108 time=913.1s + ttt_chunk [631/1893] bpb=1.118851 time=927.8s + ttt_chunk [641/1893] bpb=1.118397 time=942.5s + ttt_chunk [651/1893] bpb=1.117944 time=957.2s + ttt_chunk [661/1893] bpb=1.117832 time=971.9s + ttt_chunk [671/1893] bpb=1.117352 time=986.5s diff --git a/eval_ttt_2pass.log b/eval_ttt_2pass.log new file mode 100644 index 0000000000..ad9fb90c38 --- /dev/null +++ b/eval_ttt_2pass.log @@ -0,0 +1,48 @@ +=== TTT eval with 2 passes (Fri Mar 27 07:48:42 UTC 2026) === +logs/3d962823-8bd1-4a24-83f4-9b45908379dc.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=3 core_end=8 num_passes=4 stem=3 core=5 tail=3 +eval_only: loading checkpoint /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/final_model.pt +eval_only: overriding num_passes 4 -> 2 +eval_only: ResidualScale padded/trimmed 4 -> 2 +eval_only: running TTT with 2 passes +ttt_sliding:start chunks=1893 chunk_tokens=32768 total_windows=969088 stride=64 ttt_lr=0.002 ttt_epochs=3 freeze_blocks=2 +ttt_sliding:params unfrozen=26923086 frozen=4112 + ttt_chunk [1/1893] bpb=1.375031 time=1.6s + ttt_chunk [11/1893] bpb=1.273915 time=13.6s + ttt_chunk [21/1893] bpb=1.263373 time=25.4s + ttt_chunk [31/1893] bpb=1.252818 time=37.3s + ttt_chunk [41/1893] bpb=1.237500 time=49.1s + ttt_chunk [51/1893] bpb=1.230685 time=61.0s + ttt_chunk [61/1893] bpb=1.227812 time=72.8s + ttt_chunk [71/1893] bpb=1.221164 time=84.7s + ttt_chunk [81/1893] bpb=1.213689 time=96.5s + ttt_chunk [91/1893] bpb=1.209732 time=108.4s + ttt_chunk [101/1893] bpb=1.208368 time=120.2s + ttt_chunk [111/1893] bpb=1.206418 time=132.1s + ttt_chunk [121/1893] bpb=1.200992 time=143.9s + ttt_chunk [131/1893] bpb=1.198638 time=155.8s + ttt_chunk [141/1893] bpb=1.195947 time=167.6s + ttt_chunk [151/1893] bpb=1.195006 time=179.5s + 
ttt_chunk [161/1893] bpb=1.194817 time=191.3s + ttt_chunk [171/1893] bpb=1.195854 time=203.2s + ttt_chunk [181/1893] bpb=1.195031 time=215.0s + ttt_chunk [191/1893] bpb=1.196735 time=226.9s + ttt_chunk [201/1893] bpb=1.195529 time=238.7s + ttt_chunk [211/1893] bpb=1.193885 time=250.6s + ttt_chunk [221/1893] bpb=1.194117 time=262.4s + ttt_chunk [231/1893] bpb=1.193253 time=274.3s + ttt_chunk [241/1893] bpb=1.192978 time=286.2s + ttt_chunk [251/1893] bpb=1.191952 time=298.0s + ttt_chunk [261/1893] bpb=1.190810 time=309.9s + ttt_chunk [271/1893] bpb=1.189461 time=321.7s + ttt_chunk [281/1893] bpb=1.190595 time=333.6s + ttt_chunk [291/1893] bpb=1.189724 time=345.4s + ttt_chunk [301/1893] bpb=1.190174 time=357.3s + ttt_chunk [311/1893] bpb=1.189856 time=369.1s + ttt_chunk [321/1893] bpb=1.190142 time=381.0s + ttt_chunk [331/1893] bpb=1.189247 time=392.8s + ttt_chunk [341/1893] bpb=1.188533 time=404.7s diff --git a/eval_ttt_6pass.log b/eval_ttt_6pass.log new file mode 100644 index 0000000000..caccd4a71d --- /dev/null +++ b/eval_ttt_6pass.log @@ -0,0 +1,43 @@ +=== TTT eval with 6 passes (Fri Mar 27 07:55:54 UTC 2026) === +logs/6763e472-c0ea-4251-bb8e-abcdc8e88af2.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=3 core_end=8 num_passes=4 stem=3 core=5 tail=3 +eval_only: loading checkpoint /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/final_model.pt +eval_only: overriding num_passes 4 -> 6 +eval_only: ResidualScale padded/trimmed 4 -> 6 +eval_only: running TTT with 6 passes +ttt_sliding:start chunks=1893 chunk_tokens=32768 total_windows=969088 stride=64 ttt_lr=0.002 ttt_epochs=3 freeze_blocks=2 +ttt_sliding:params unfrozen=26923090 frozen=4112 + ttt_chunk [1/1893] bpb=1.237269 time=3.0s + ttt_chunk [11/1893] bpb=1.139461 time=28.2s + ttt_chunk [21/1893] bpb=1.144261 time=53.5s + ttt_chunk [31/1893] bpb=1.145955 time=78.7s + ttt_chunk [41/1893] bpb=1.140368 time=103.9s + ttt_chunk [51/1893] bpb=1.140828 time=129.2s + ttt_chunk [61/1893] bpb=1.143879 time=154.4s + ttt_chunk [71/1893] bpb=1.141455 time=179.6s + ttt_chunk [81/1893] bpb=1.137388 time=204.9s + ttt_chunk [91/1893] bpb=1.135963 time=230.1s + ttt_chunk [101/1893] bpb=1.136666 time=255.3s + ttt_chunk [111/1893] bpb=1.136601 time=280.6s + ttt_chunk [121/1893] bpb=1.132785 time=305.8s + ttt_chunk [131/1893] bpb=1.131875 time=331.0s + ttt_chunk [141/1893] bpb=1.130561 time=356.3s + ttt_chunk [151/1893] bpb=1.130627 time=381.5s + ttt_chunk [161/1893] bpb=1.131328 time=406.8s + ttt_chunk [171/1893] bpb=1.133288 time=432.0s + ttt_chunk [181/1893] bpb=1.133196 time=457.2s + ttt_chunk [191/1893] bpb=1.135561 time=482.5s + ttt_chunk [201/1893] bpb=1.135020 time=507.7s + ttt_chunk [211/1893] bpb=1.134037 time=533.0s + ttt_chunk [221/1893] bpb=1.134850 time=558.2s + ttt_chunk [231/1893] bpb=1.134504 time=583.5s + ttt_chunk [241/1893] bpb=1.134729 time=608.7s + ttt_chunk [251/1893] bpb=1.134203 time=633.9s + ttt_chunk [261/1893] bpb=1.133486 time=659.2s + ttt_chunk [271/1893] bpb=1.132534 time=684.4s + ttt_chunk [281/1893] bpb=1.134099 time=709.6s + ttt_chunk [291/1893] bpb=1.133685 time=734.9s diff --git a/full_2pass_3core.log b/full_2pass_3core.log new file mode 100644 index 0000000000..d880fa8afb --- /dev/null 
+++ b/full_2pass_3core.log @@ -0,0 +1,430 @@ +START full run: 2-pass 3-core (layers 4-6) TTT SWA, 80min (Fri Mar 27 08:09:55 UTC 2026) +logs/12c21058-a818-4f9e-b953-341912eb25a7.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=4 core_end=7 num_passes=2 stem=4 core=3 tail=4 +model_params:26927710 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[7, 8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:4800.000 +seed:1337 +wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY. +wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin +wandb: WARNING Using a boolean value for 'reinit' is deprecated. Use 'return_previous' or 'finish_previous' instead. +wandb: setting up run p8sqkbqa +wandb: Tracking run with wandb version 0.25.1 +wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa +wandb: Run `wandb offline` to turn off syncing. +wandb: Syncing run 2pass_3core_80min +wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf +wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/p8sqkbqa +wandb:initialized +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/20000 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms +step:1/20000 train_loss:6.9310 grad_norm:0.3715 train_time:739ms step_avg:739.22ms +step:2/20000 train_loss:8.4759 grad_norm:3.5698 train_time:1441ms step_avg:720.50ms +step:3/20000 train_loss:7.5787 grad_norm:2.0259 train_time:2207ms step_avg:735.61ms +step:4/20000 train_loss:7.3563 grad_norm:1.4981 train_time:2972ms step_avg:742.93ms +step:5/20000 train_loss:7.1725 grad_norm:1.6464 train_time:3729ms step_avg:745.72ms +step:6/20000 train_loss:7.1055 grad_norm:1.5402 train_time:4489ms step_avg:748.20ms +step:7/20000 train_loss:7.0940 grad_norm:1.7366 train_time:5253ms step_avg:750.42ms +step:8/20000 train_loss:6.9891 grad_norm:1.2439 train_time:6012ms step_avg:751.51ms +step:9/20000 train_loss:6.6063 grad_norm:0.9271 train_time:6770ms step_avg:752.25ms +step:10/20000 train_loss:6.2335 grad_norm:0.8702 train_time:7539ms step_avg:753.87ms +step:50/20000 train_loss:3.7232 grad_norm:0.7000 train_time:38194ms step_avg:763.87ms +step:100/20000 train_loss:3.2045 grad_norm:0.8824 train_time:76601ms step_avg:766.01ms +step:150/20000 train_loss:2.8496 grad_norm:0.3729 train_time:114956ms step_avg:766.37ms +step:200/20000 train_loss:2.6184 grad_norm:0.3442 train_time:153346ms step_avg:766.73ms +step:250/20000 train_loss:2.6059 
grad_norm:0.2754 train_time:191710ms step_avg:766.84ms +step:300/20000 train_loss:2.4690 grad_norm:0.3187 train_time:230112ms step_avg:767.04ms +step:350/20000 train_loss:2.5041 grad_norm:0.1730 train_time:268510ms step_avg:767.17ms +step:400/20000 train_loss:2.4246 grad_norm:0.2139 train_time:306906ms step_avg:767.26ms +step:450/20000 train_loss:2.2470 grad_norm:0.1624 train_time:345332ms step_avg:767.40ms +step:500/20000 train_loss:2.3033 grad_norm:0.1807 train_time:383744ms step_avg:767.49ms +step:500/20000 val_loss:2.3232 val_bpb:1.3759 train_time:383791ms step_avg:767.58ms +step:550/20000 train_loss:2.3638 grad_norm:0.1696 train_time:422169ms step_avg:767.58ms +step:600/20000 train_loss:2.2655 grad_norm:0.1474 train_time:460616ms step_avg:767.69ms +step:650/20000 train_loss:2.2434 grad_norm:0.1741 train_time:499080ms step_avg:767.82ms +step:700/20000 train_loss:2.3153 grad_norm:0.1303 train_time:537548ms step_avg:767.93ms +step:750/20000 train_loss:2.2883 grad_norm:0.0978 train_time:576022ms step_avg:768.03ms +step:800/20000 train_loss:2.2638 grad_norm:0.0930 train_time:614511ms step_avg:768.14ms +step:850/20000 train_loss:2.1916 grad_norm:0.0668 train_time:653031ms step_avg:768.27ms +step:900/20000 train_loss:2.1043 grad_norm:0.0706 train_time:691568ms step_avg:768.41ms +step:950/20000 train_loss:2.3072 grad_norm:0.0614 train_time:730098ms step_avg:768.52ms +step:1000/20000 train_loss:2.2373 grad_norm:0.0756 train_time:768630ms step_avg:768.63ms +step:1000/20000 val_loss:2.1827 val_bpb:1.2927 train_time:768677ms step_avg:768.68ms +step:1050/20000 train_loss:2.1577 grad_norm:0.0518 train_time:807163ms step_avg:768.73ms +step:1100/20000 train_loss:2.1862 grad_norm:0.1172 train_time:845684ms step_avg:768.80ms +step:1150/20000 train_loss:2.1385 grad_norm:0.0562 train_time:884226ms step_avg:768.89ms +step:1200/20000 train_loss:2.1870 grad_norm:0.1121 train_time:922748ms step_avg:768.96ms +step:1250/20000 train_loss:2.2093 grad_norm:0.0998 train_time:961291ms step_avg:769.03ms +step:1300/20000 train_loss:2.1788 grad_norm:0.0968 train_time:999829ms step_avg:769.10ms +step:1350/20000 train_loss:2.1574 grad_norm:0.1471 train_time:1038366ms step_avg:769.16ms +step:1400/20000 train_loss:2.1680 grad_norm:0.0608 train_time:1076900ms step_avg:769.21ms +step:1450/20000 train_loss:2.1623 grad_norm:0.0881 train_time:1115450ms step_avg:769.28ms +step:1500/20000 train_loss:2.1325 grad_norm:0.0715 train_time:1153976ms step_avg:769.32ms +step:1500/20000 val_loss:2.1196 val_bpb:1.2554 train_time:1154023ms step_avg:769.35ms +step:1550/20000 train_loss:2.1034 grad_norm:0.0555 train_time:1192498ms step_avg:769.35ms +step:1600/20000 train_loss:2.1841 grad_norm:0.0690 train_time:1231045ms step_avg:769.40ms +step:1650/20000 train_loss:1.9679 grad_norm:0.0623 train_time:1269584ms step_avg:769.44ms +step:1700/20000 train_loss:2.0979 grad_norm:0.1024 train_time:1308158ms step_avg:769.50ms +step:1750/20000 train_loss:2.0668 grad_norm:0.0916 train_time:1346716ms step_avg:769.55ms +step:1800/20000 train_loss:2.1070 grad_norm:0.0947 train_time:1385264ms step_avg:769.59ms +step:1850/20000 train_loss:2.1232 grad_norm:0.0632 train_time:1423808ms step_avg:769.63ms +step:1900/20000 train_loss:2.0734 grad_norm:0.0745 train_time:1462341ms step_avg:769.65ms +step:1950/20000 train_loss:2.0600 grad_norm:0.1123 train_time:1500897ms step_avg:769.69ms +step:2000/20000 train_loss:2.3179 grad_norm:0.0744 train_time:1539440ms step_avg:769.72ms +step:2000/20000 val_loss:2.0970 val_bpb:1.2420 train_time:1539487ms step_avg:769.74ms 
+step:2050/20000 train_loss:2.0834 grad_norm:0.0576 train_time:1577995ms step_avg:769.75ms +step:2100/20000 train_loss:2.0619 grad_norm:0.0535 train_time:1616543ms step_avg:769.78ms +step:2150/20000 train_loss:2.0442 grad_norm:0.0773 train_time:1655080ms step_avg:769.80ms +step:2200/20000 train_loss:2.2001 grad_norm:0.0771 train_time:1693618ms step_avg:769.83ms +step:2250/20000 train_loss:2.0929 grad_norm:0.0580 train_time:1732173ms step_avg:769.85ms +step:2300/20000 train_loss:2.0744 grad_norm:0.0679 train_time:1770710ms step_avg:769.87ms +step:2350/20000 train_loss:2.0331 grad_norm:0.1111 train_time:1809258ms step_avg:769.90ms +step:2400/20000 train_loss:2.1487 grad_norm:0.0711 train_time:1847814ms step_avg:769.92ms +step:2450/20000 train_loss:2.1108 grad_norm:0.0571 train_time:1886353ms step_avg:769.94ms +step:2500/20000 train_loss:2.0727 grad_norm:0.0554 train_time:1924887ms step_avg:769.95ms +step:2500/20000 val_loss:2.0778 val_bpb:1.2306 train_time:1924935ms step_avg:769.97ms +step:2550/20000 train_loss:2.0748 grad_norm:0.0686 train_time:1963451ms step_avg:769.98ms +step:2600/20000 train_loss:2.0557 grad_norm:0.0604 train_time:2001997ms step_avg:770.00ms +step:2650/20000 train_loss:2.0616 grad_norm:0.0686 train_time:2040534ms step_avg:770.01ms +step:2700/20000 train_loss:2.0896 grad_norm:0.1260 train_time:2079065ms step_avg:770.02ms +step:2750/20000 train_loss:2.0721 grad_norm:0.0683 train_time:2117609ms step_avg:770.04ms +step:2800/20000 train_loss:2.1120 grad_norm:0.0617 train_time:2156137ms step_avg:770.05ms +step:2850/20000 train_loss:2.0672 grad_norm:0.0618 train_time:2194684ms step_avg:770.06ms +step:2900/20000 train_loss:2.0771 grad_norm:0.0606 train_time:2233233ms step_avg:770.08ms +step:2950/20000 train_loss:2.1240 grad_norm:0.0555 train_time:2271768ms step_avg:770.09ms +step:3000/20000 train_loss:2.0159 grad_norm:0.1451 train_time:2310313ms step_avg:770.10ms +step:3000/20000 val_loss:2.0668 val_bpb:1.2241 train_time:2310360ms step_avg:770.12ms +step:3050/20000 train_loss:2.0182 grad_norm:0.0595 train_time:2348849ms step_avg:770.11ms +step:3100/20000 train_loss:2.0905 grad_norm:0.1292 train_time:2387390ms step_avg:770.13ms +step:3150/20000 train_loss:2.1075 grad_norm:0.0625 train_time:2425926ms step_avg:770.14ms +step:3200/20000 train_loss:2.0823 grad_norm:0.0562 train_time:2464469ms step_avg:770.15ms +step:3250/20000 train_loss:2.0517 grad_norm:0.0675 train_time:2502996ms step_avg:770.15ms +step:3300/20000 train_loss:2.0328 grad_norm:0.0879 train_time:2541545ms step_avg:770.17ms +step:3350/20000 train_loss:2.0720 grad_norm:0.0532 train_time:2580093ms step_avg:770.18ms +step:3400/20000 train_loss:2.1303 grad_norm:0.1521 train_time:2618607ms step_avg:770.18ms +step:3450/20000 train_loss:2.0795 grad_norm:0.0880 train_time:2657152ms step_avg:770.19ms +step:3500/20000 train_loss:2.0592 grad_norm:0.0568 train_time:2695692ms step_avg:770.20ms +step:3500/20000 val_loss:2.0566 val_bpb:1.2180 train_time:2695739ms step_avg:770.21ms +step:3550/20000 train_loss:2.0263 grad_norm:0.1106 train_time:2734239ms step_avg:770.21ms +step:3600/20000 train_loss:2.0242 grad_norm:0.0546 train_time:2772777ms step_avg:770.22ms +step:3650/20000 train_loss:2.0434 grad_norm:0.0792 train_time:2811292ms step_avg:770.22ms +step:3700/20000 train_loss:2.0376 grad_norm:0.0698 train_time:2849835ms step_avg:770.23ms +step:3750/20000 train_loss:2.0472 grad_norm:0.0748 train_time:2888385ms step_avg:770.24ms +step:3800/20000 train_loss:2.0429 grad_norm:0.0871 train_time:2926943ms step_avg:770.25ms +step:3850/20000 
train_loss:2.0756 grad_norm:0.0799 train_time:2965493ms step_avg:770.26ms +step:3900/20000 train_loss:2.0692 grad_norm:0.0643 train_time:3004035ms step_avg:770.27ms +step:3950/20000 train_loss:2.0267 grad_norm:0.0667 train_time:3042573ms step_avg:770.27ms +step:4000/20000 train_loss:2.0478 grad_norm:0.1092 train_time:3081111ms step_avg:770.28ms +step:4000/20000 val_loss:2.0501 val_bpb:1.2142 train_time:3081158ms step_avg:770.29ms +step:4050/20000 train_loss:2.0524 grad_norm:0.0595 train_time:3119653ms step_avg:770.28ms +step:4100/20000 train_loss:1.9268 grad_norm:0.0611 train_time:3158208ms step_avg:770.29ms +step:4150/20000 train_loss:2.0567 grad_norm:0.0683 train_time:3196731ms step_avg:770.30ms +step:4200/20000 train_loss:2.0971 grad_norm:0.0701 train_time:3235281ms step_avg:770.30ms +step:4250/20000 train_loss:2.0465 grad_norm:0.0670 train_time:3273811ms step_avg:770.31ms +step:4300/20000 train_loss:2.0333 grad_norm:0.0543 train_time:3312336ms step_avg:770.31ms +step:4350/20000 train_loss:2.0258 grad_norm:0.0616 train_time:3350887ms step_avg:770.32ms +step:4400/20000 train_loss:2.0344 grad_norm:0.0614 train_time:3389423ms step_avg:770.32ms +step:4450/20000 train_loss:2.0645 grad_norm:0.0506 train_time:3427944ms step_avg:770.32ms +step:4500/20000 train_loss:2.0679 grad_norm:0.0550 train_time:3466466ms step_avg:770.33ms +step:4500/20000 val_loss:2.0466 val_bpb:1.2121 train_time:3466512ms step_avg:770.34ms +step:4550/20000 train_loss:2.0316 grad_norm:0.0607 train_time:3505002ms step_avg:770.33ms +step:4600/20000 train_loss:1.9490 grad_norm:0.0543 train_time:3543530ms step_avg:770.33ms +step:4650/20000 train_loss:2.0275 grad_norm:0.0590 train_time:3582052ms step_avg:770.33ms +step:4700/20000 train_loss:2.0597 grad_norm:0.1022 train_time:3620609ms step_avg:770.34ms +step:4750/20000 train_loss:2.0241 grad_norm:0.1000 train_time:3659144ms step_avg:770.35ms +step:4800/20000 train_loss:2.0349 grad_norm:0.0580 train_time:3697672ms step_avg:770.35ms +step:4850/20000 train_loss:2.0473 grad_norm:0.0532 train_time:3736199ms step_avg:770.35ms +step:4900/20000 train_loss:2.0297 grad_norm:0.0634 train_time:3774723ms step_avg:770.35ms +step:4950/20000 train_loss:1.9799 grad_norm:0.0504 train_time:3813238ms step_avg:770.35ms +step:5000/20000 train_loss:2.0735 grad_norm:0.0549 train_time:3851805ms step_avg:770.36ms +step:5000/20000 val_loss:2.0184 val_bpb:1.1954 train_time:3851852ms step_avg:770.37ms +step:5050/20000 train_loss:1.9940 grad_norm:0.0666 train_time:3890332ms step_avg:770.36ms +step:5100/20000 train_loss:1.9998 grad_norm:0.0478 train_time:3928885ms step_avg:770.37ms +step:5150/20000 train_loss:2.0985 grad_norm:0.0906 train_time:3967424ms step_avg:770.37ms +step:5200/20000 train_loss:2.0041 grad_norm:0.0450 train_time:4005964ms step_avg:770.38ms +step:5250/20000 train_loss:1.9757 grad_norm:0.0451 train_time:4044484ms step_avg:770.38ms +step:5300/20000 train_loss:1.9557 grad_norm:0.0544 train_time:4083005ms step_avg:770.38ms +step:5350/20000 train_loss:1.9972 grad_norm:0.0399 train_time:4121527ms step_avg:770.38ms +step:5400/20000 train_loss:2.0035 grad_norm:0.0433 train_time:4160051ms step_avg:770.38ms +step:5450/20000 train_loss:2.0130 grad_norm:0.0411 train_time:4198561ms step_avg:770.38ms +step:5500/20000 train_loss:2.0100 grad_norm:0.0376 train_time:4237081ms step_avg:770.38ms +step:5500/20000 val_loss:1.9818 val_bpb:1.1737 train_time:4237128ms step_avg:770.39ms +step:5550/20000 train_loss:1.9694 grad_norm:0.0464 train_time:4275608ms step_avg:770.38ms +step:5600/20000 train_loss:1.9396 
grad_norm:0.0419 train_time:4314139ms step_avg:770.38ms +step:5650/20000 train_loss:2.0040 grad_norm:0.0377 train_time:4352662ms step_avg:770.38ms +step:5700/20000 train_loss:1.9579 grad_norm:0.0492 train_time:4391196ms step_avg:770.39ms +step:5750/20000 train_loss:1.9341 grad_norm:0.0370 train_time:4429712ms step_avg:770.38ms +step:5800/20000 train_loss:1.8494 grad_norm:0.0476 train_time:4468236ms step_avg:770.39ms +step:5850/20000 train_loss:1.8418 grad_norm:0.0404 train_time:4506763ms step_avg:770.39ms +swa:start step:5900 +step:5900/20000 train_loss:1.9171 grad_norm:0.0399 train_time:4545281ms step_avg:770.39ms +step:5950/20000 train_loss:1.9844 grad_norm:0.0432 train_time:4583947ms step_avg:770.41ms +late_qat:enabled step:5976 scale:0.1496 core_quant:on +step:6000/20000 train_loss:1.9423 grad_norm:0.0376 train_time:4656702ms step_avg:776.12ms +step:6000/20000 val_loss:1.9425 val_bpb:1.1505 train_time:4656808ms step_avg:776.13ms +step:6050/20000 train_loss:1.9037 grad_norm:0.0447 train_time:4695045ms step_avg:776.04ms +step:6100/20000 train_loss:1.9130 grad_norm:0.0341 train_time:4733411ms step_avg:775.97ms +step:6150/20000 train_loss:1.9282 grad_norm:0.0331 train_time:4771676ms step_avg:775.88ms +step:6188/20000 val_loss:1.9310 val_bpb:1.1437 train_time:4800792ms step_avg:775.82ms +stopping_early: wallclock_cap train_time:4800792ms step:6188/20000 +peak memory allocated: 28656 MiB reserved: 28704 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:1.9264 val_bpb:1.1409 eval_time:18187ms +Serialized model: 106025719 bytes +Code size: 105268 bytes +Serialized model int6+lzma: 16459152 bytes +Total submission size int6+lzma: 16564420 bytes +final_int6_roundtrip val_loss:1.9355 val_bpb:1.1463 eval_time:36060ms +final_int6_roundtrip_exact val_loss:1.93549859 val_bpb:1.14631129 +final_int6_sliding_window val_loss:1.8956 val_bpb:1.1227 stride:64 eval_time:643423ms +final_int6_sliding_window_exact val_loss:1.89556561 val_bpb:1.12266370 +final_int8_zlib_roundtrip_exact val_loss:1.89556561 val_bpb:1.12266370 +ttt_sliding:start chunks=1893 chunk_tokens=32768 total_windows=969088 stride=64 ttt_lr=0.002 ttt_epochs=3 freeze_blocks=2 +ttt_sliding:params unfrozen=26923598 frozen=4112 + ttt_chunk [1/1893] bpb=1.213507 time=1.2s + ttt_chunk [11/1893] bpb=1.115005 time=11.9s + ttt_chunk [21/1893] bpb=1.124115 time=22.5s + ttt_chunk [31/1893] bpb=1.129630 time=33.2s + ttt_chunk [41/1893] bpb=1.125661 time=43.9s + ttt_chunk [51/1893] bpb=1.127271 time=54.6s + ttt_chunk [61/1893] bpb=1.131076 time=65.2s + ttt_chunk [71/1893] bpb=1.129445 time=75.9s + ttt_chunk [81/1893] bpb=1.125956 time=86.6s + ttt_chunk [91/1893] bpb=1.125065 time=97.3s + ttt_chunk [101/1893] bpb=1.126017 time=107.9s + ttt_chunk [111/1893] bpb=1.126115 time=118.6s + ttt_chunk [121/1893] bpb=1.122503 time=129.3s + ttt_chunk [131/1893] bpb=1.121744 time=139.9s + ttt_chunk [141/1893] bpb=1.120618 time=150.6s + ttt_chunk [151/1893] bpb=1.120790 time=161.3s + ttt_chunk [161/1893] bpb=1.121623 time=172.0s + ttt_chunk [171/1893] bpb=1.123693 time=182.6s + ttt_chunk [181/1893] bpb=1.123772 time=193.3s + ttt_chunk [191/1893] bpb=1.126204 time=204.0s + ttt_chunk [201/1893] bpb=1.125769 time=214.6s + ttt_chunk [211/1893] bpb=1.124798 time=225.3s + ttt_chunk [221/1893] bpb=1.125694 time=236.0s + ttt_chunk [231/1893] bpb=1.125431 time=246.7s + ttt_chunk [241/1893] bpb=1.125671 time=257.3s + ttt_chunk [251/1893] bpb=1.125230 time=268.0s + ttt_chunk [261/1893] bpb=1.124541 time=278.7s + ttt_chunk [271/1893] bpb=1.123621 time=289.3s + ttt_chunk 
[281/1893] bpb=1.125199 time=300.0s + ttt_chunk [291/1893] bpb=1.124816 time=310.7s + ttt_chunk [301/1893] bpb=1.125690 time=321.3s + ttt_chunk [311/1893] bpb=1.125780 time=332.0s + ttt_chunk [321/1893] bpb=1.126532 time=342.7s + ttt_chunk [331/1893] bpb=1.125992 time=353.4s + ttt_chunk [341/1893] bpb=1.125619 time=364.0s + ttt_chunk [351/1893] bpb=1.126331 time=374.7s + ttt_chunk [361/1893] bpb=1.127106 time=385.4s + ttt_chunk [371/1893] bpb=1.126994 time=396.0s + ttt_chunk [381/1893] bpb=1.126771 time=406.7s + ttt_chunk [391/1893] bpb=1.127473 time=417.4s + ttt_chunk [401/1893] bpb=1.127019 time=428.0s + ttt_chunk [411/1893] bpb=1.126072 time=438.7s + ttt_chunk [421/1893] bpb=1.126211 time=449.4s + ttt_chunk [431/1893] bpb=1.126627 time=460.1s + ttt_chunk [441/1893] bpb=1.126026 time=470.7s + ttt_chunk [451/1893] bpb=1.126178 time=481.4s + ttt_chunk [461/1893] bpb=1.126064 time=492.1s + ttt_chunk [471/1893] bpb=1.125649 time=502.7s + ttt_chunk [481/1893] bpb=1.125465 time=513.4s + ttt_chunk [491/1893] bpb=1.125641 time=524.1s + ttt_chunk [501/1893] bpb=1.125403 time=534.8s + ttt_chunk [511/1893] bpb=1.124922 time=545.4s + ttt_chunk [521/1893] bpb=1.124606 time=556.1s + ttt_chunk [531/1893] bpb=1.125329 time=566.8s + ttt_chunk [541/1893] bpb=1.125457 time=577.5s + ttt_chunk [551/1893] bpb=1.124939 time=588.1s + ttt_chunk [561/1893] bpb=1.124808 time=598.8s + ttt_chunk [571/1893] bpb=1.124539 time=609.5s + ttt_chunk [581/1893] bpb=1.124189 time=620.1s + ttt_chunk [591/1893] bpb=1.123646 time=630.8s + ttt_chunk [601/1893] bpb=1.123663 time=641.5s + ttt_chunk [611/1893] bpb=1.123360 time=652.2s + ttt_chunk [621/1893] bpb=1.123207 time=662.8s + ttt_chunk [631/1893] bpb=1.122962 time=673.5s + ttt_chunk [641/1893] bpb=1.122517 time=684.2s + ttt_chunk [651/1893] bpb=1.122074 time=694.8s + ttt_chunk [661/1893] bpb=1.121970 time=705.5s + ttt_chunk [671/1893] bpb=1.121500 time=716.2s + ttt_chunk [681/1893] bpb=1.120954 time=726.9s + ttt_chunk [691/1893] bpb=1.121049 time=737.5s + ttt_chunk [701/1893] bpb=1.120237 time=748.2s + ttt_chunk [711/1893] bpb=1.120250 time=758.9s + ttt_chunk [721/1893] bpb=1.120159 time=769.5s + ttt_chunk [731/1893] bpb=1.120390 time=780.2s + ttt_chunk [741/1893] bpb=1.120277 time=790.9s + ttt_chunk [751/1893] bpb=1.119980 time=801.6s + ttt_chunk [761/1893] bpb=1.120120 time=812.2s + ttt_chunk [771/1893] bpb=1.119952 time=822.9s + ttt_chunk [781/1893] bpb=1.120123 time=833.6s + ttt_chunk [791/1893] bpb=1.119975 time=844.2s + ttt_chunk [801/1893] bpb=1.119907 time=854.9s + ttt_chunk [811/1893] bpb=1.119919 time=865.6s + ttt_chunk [821/1893] bpb=1.119806 time=876.3s + ttt_chunk [831/1893] bpb=1.119523 time=886.9s + ttt_chunk [841/1893] bpb=1.119272 time=897.6s + ttt_chunk [851/1893] bpb=1.119326 time=908.3s + ttt_chunk [861/1893] bpb=1.119397 time=919.0s + ttt_chunk [871/1893] bpb=1.119597 time=929.6s + ttt_chunk [881/1893] bpb=1.119595 time=940.3s + ttt_chunk [891/1893] bpb=1.119057 time=951.0s + ttt_chunk [901/1893] bpb=1.119066 time=961.6s + ttt_chunk [911/1893] bpb=1.118915 time=972.3s + ttt_chunk [921/1893] bpb=1.119057 time=983.0s + ttt_chunk [931/1893] bpb=1.119000 time=993.7s + ttt_chunk [941/1893] bpb=1.119211 time=1004.3s + ttt_chunk [951/1893] bpb=1.119510 time=1015.0s + ttt_chunk [961/1893] bpb=1.119814 time=1025.7s + ttt_chunk [971/1893] bpb=1.120172 time=1036.4s + ttt_chunk [981/1893] bpb=1.120386 time=1047.0s + ttt_chunk [991/1893] bpb=1.120296 time=1057.7s + ttt_chunk [1001/1893] bpb=1.120622 time=1068.4s + ttt_chunk [1011/1893] bpb=1.120769 time=1079.0s + 
ttt_chunk [1021/1893] bpb=1.121058 time=1089.7s + ttt_chunk [1031/1893] bpb=1.121451 time=1100.4s + ttt_chunk [1041/1893] bpb=1.121962 time=1111.0s + ttt_chunk [1051/1893] bpb=1.121834 time=1121.7s + ttt_chunk [1061/1893] bpb=1.121931 time=1132.4s + ttt_chunk [1071/1893] bpb=1.122074 time=1143.1s + ttt_chunk [1081/1893] bpb=1.122119 time=1153.7s + ttt_chunk [1091/1893] bpb=1.122380 time=1164.4s + ttt_chunk [1101/1893] bpb=1.122531 time=1175.1s + ttt_chunk [1111/1893] bpb=1.122350 time=1185.7s + ttt_chunk [1121/1893] bpb=1.122124 time=1196.4s + ttt_chunk [1131/1893] bpb=1.122020 time=1207.1s + ttt_chunk [1141/1893] bpb=1.121778 time=1217.7s + ttt_chunk [1151/1893] bpb=1.121805 time=1228.4s + ttt_chunk [1161/1893] bpb=1.121646 time=1239.1s + ttt_chunk [1171/1893] bpb=1.121470 time=1249.7s + ttt_chunk [1181/1893] bpb=1.121248 time=1260.4s + ttt_chunk [1191/1893] bpb=1.121404 time=1271.1s + ttt_chunk [1201/1893] bpb=1.121610 time=1281.8s + ttt_chunk [1211/1893] bpb=1.121209 time=1292.4s + ttt_chunk [1221/1893] bpb=1.121550 time=1303.1s + ttt_chunk [1231/1893] bpb=1.121493 time=1313.8s + ttt_chunk [1241/1893] bpb=1.121200 time=1324.4s + ttt_chunk [1251/1893] bpb=1.120654 time=1335.1s + ttt_chunk [1261/1893] bpb=1.120400 time=1345.8s + ttt_chunk [1271/1893] bpb=1.120154 time=1356.4s + ttt_chunk [1281/1893] bpb=1.119845 time=1367.1s + ttt_chunk [1291/1893] bpb=1.119603 time=1377.8s + ttt_chunk [1301/1893] bpb=1.119559 time=1388.4s + ttt_chunk [1311/1893] bpb=1.119294 time=1399.1s + ttt_chunk [1321/1893] bpb=1.119006 time=1409.8s + ttt_chunk [1331/1893] bpb=1.118778 time=1420.5s + ttt_chunk [1341/1893] bpb=1.118650 time=1431.1s + ttt_chunk [1351/1893] bpb=1.118499 time=1441.8s + ttt_chunk [1361/1893] bpb=1.118620 time=1452.5s + ttt_chunk [1371/1893] bpb=1.118833 time=1463.1s + ttt_chunk [1381/1893] bpb=1.119039 time=1473.8s + ttt_chunk [1391/1893] bpb=1.118831 time=1484.5s + ttt_chunk [1401/1893] bpb=1.118873 time=1495.1s + ttt_chunk [1411/1893] bpb=1.118989 time=1505.8s + ttt_chunk [1421/1893] bpb=1.118980 time=1516.5s + ttt_chunk [1431/1893] bpb=1.118955 time=1527.1s + ttt_chunk [1441/1893] bpb=1.119428 time=1537.8s + ttt_chunk [1451/1893] bpb=1.119298 time=1548.5s + ttt_chunk [1461/1893] bpb=1.119224 time=1559.2s + ttt_chunk [1471/1893] bpb=1.119815 time=1569.8s + ttt_chunk [1481/1893] bpb=1.119692 time=1580.5s + ttt_chunk [1491/1893] bpb=1.120064 time=1591.2s + ttt_chunk [1501/1893] bpb=1.120046 time=1601.9s + ttt_chunk [1511/1893] bpb=1.119995 time=1612.5s + ttt_chunk [1521/1893] bpb=1.120109 time=1623.2s + ttt_chunk [1531/1893] bpb=1.120314 time=1633.9s + ttt_chunk [1541/1893] bpb=1.120389 time=1644.6s + ttt_chunk [1551/1893] bpb=1.120624 time=1655.2s + ttt_chunk [1561/1893] bpb=1.120711 time=1665.9s + ttt_chunk [1571/1893] bpb=1.120856 time=1676.6s + ttt_chunk [1581/1893] bpb=1.121011 time=1687.2s + ttt_chunk [1591/1893] bpb=1.121070 time=1697.9s + ttt_chunk [1601/1893] bpb=1.121194 time=1708.6s + ttt_chunk [1611/1893] bpb=1.121454 time=1719.3s + ttt_chunk [1621/1893] bpb=1.121318 time=1729.9s + ttt_chunk [1631/1893] bpb=1.121361 time=1740.6s + ttt_chunk [1641/1893] bpb=1.121385 time=1751.3s + ttt_chunk [1651/1893] bpb=1.121438 time=1762.0s + ttt_chunk [1661/1893] bpb=1.121577 time=1772.6s + ttt_chunk [1671/1893] bpb=1.121754 time=1783.3s + ttt_chunk [1681/1893] bpb=1.121845 time=1794.0s + ttt_chunk [1691/1893] bpb=1.121951 time=1804.6s + ttt_chunk [1701/1893] bpb=1.122049 time=1815.3s + ttt_chunk [1711/1893] bpb=1.122031 time=1826.0s + ttt_chunk [1721/1893] bpb=1.121864 time=1836.7s + 
ttt_chunk [1731/1893] bpb=1.121961 time=1847.3s + ttt_chunk [1741/1893] bpb=1.121701 time=1858.0s + ttt_chunk [1751/1893] bpb=1.121579 time=1868.7s + ttt_chunk [1761/1893] bpb=1.121622 time=1879.3s + ttt_chunk [1771/1893] bpb=1.121568 time=1890.0s + ttt_chunk [1781/1893] bpb=1.121464 time=1900.7s + ttt_chunk [1791/1893] bpb=1.121129 time=1911.3s + ttt_chunk [1801/1893] bpb=1.121118 time=1922.0s + ttt_chunk [1811/1893] bpb=1.120975 time=1932.7s + ttt_chunk [1821/1893] bpb=1.121035 time=1943.3s + ttt_chunk [1831/1893] bpb=1.120887 time=1954.0s + ttt_chunk [1841/1893] bpb=1.120931 time=1964.7s + ttt_chunk [1851/1893] bpb=1.120761 time=1975.4s + ttt_chunk [1861/1893] bpb=1.120682 time=1986.0s + ttt_chunk [1871/1893] bpb=1.120616 time=1996.7s + ttt_chunk [1881/1893] bpb=1.120376 time=2007.4s + ttt_chunk [1891/1893] bpb=1.120360 time=2018.0s + ttt_chunk [1893/1893] bpb=1.120391 time=2019.8s +ttt_sliding:done val_loss=1.891728 val_bpb=1.120391 elapsed=2019.8s +legal_ttt val_loss:1.8917 val_bpb:1.1204 eval_time:2020286ms +legal_ttt_exact val_loss:1.89172798 val_bpb:1.12039083 +wandb: updating run metadata +wandb: uploading output.log; uploading wandb-summary.json; uploading config.yaml +wandb: uploading output.log; uploading wandb-summary.json +wandb: uploading history steps 134-134, summary, console lines 374-378 +wandb: +wandb: Run history: +wandb: grad_norm █▆▅▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ +wandb: lr_scale ██████████████████████████████▇▆▆▅▅▄▄▃▃▁ +wandb: step_avg_ms ▁▃▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇█ +wandb: train_loss ████▃▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ +wandb: val_bpb █▂▁▁▁▁▁▁▁▁▁▁▁▁ +wandb: val_loss █▂▁▁▁▁▁▁▁▁▁▁▁▁ +wandb: +wandb: Run summary: +wandb: grad_norm 0.03315 +wandb: lr_scale 0.02205 +wandb: step_avg_ms 775.88232 +wandb: train_loss 1.92824 +wandb: val_bpb 1.14365 +wandb: val_loss 1.93101 +wandb: +wandb: 🚀 View run 2pass_3core_80min at: https://wandb.ai/propensity/parameter-golf/runs/p8sqkbqa +wandb: ⭐️ View project at: https://wandb.ai/propensity/parameter-golf +wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s) +wandb: Find logs at: ./wandb/run-20260327_080959-p8sqkbqa/logs diff --git a/full_4pass.log b/full_4pass.log new file mode 100644 index 0000000000..c23be1dbfb --- /dev/null +++ b/full_4pass.log @@ -0,0 +1,145 @@ +logs/f702ea4d-79e5-47e5-811d-eed8d0b7c25a.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=3 core_end=8 num_passes=4 stem=3 core=5 tail=3 +model_params:26927200 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:4800.000 +seed:1337 +wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY. +wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin +wandb: WARNING Using a boolean value for 'reinit' is deprecated. 
Use 'return_previous' or 'finish_previous' instead. +wandb: setting up run nwftkz5m +wandb: Tracking run with wandb version 0.25.1 +wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133558-nwftkz5m +wandb: Run `wandb offline` to turn off syncing. +wandb: Syncing run full_4pass_noRMS_j0.1_QAT_80min +wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf +wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/nwftkz5m +wandb:initialized +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/20000 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms +step:1/20000 train_loss:6.9310 train_time:1286ms step_avg:1285.72ms +step:2/20000 train_loss:8.4243 train_time:2520ms step_avg:1259.75ms +step:3/20000 train_loss:7.5899 train_time:3791ms step_avg:1263.65ms +step:4/20000 train_loss:7.3604 train_time:5065ms step_avg:1266.15ms +step:5/20000 train_loss:7.2017 train_time:6337ms step_avg:1267.44ms +step:6/20000 train_loss:7.1139 train_time:7608ms step_avg:1267.94ms +step:7/20000 train_loss:7.0266 train_time:8884ms step_avg:1269.15ms +step:8/20000 train_loss:6.8703 train_time:10155ms step_avg:1269.33ms +step:9/20000 train_loss:6.5277 train_time:11432ms step_avg:1270.24ms +step:10/20000 train_loss:6.1364 train_time:12711ms step_avg:1271.13ms +step:50/20000 train_loss:3.7012 train_time:66255ms step_avg:1325.09ms +step:100/20000 train_loss:3.1707 train_time:133824ms step_avg:1338.24ms +step:150/20000 train_loss:2.8377 train_time:201511ms step_avg:1343.41ms +step:200/20000 train_loss:2.6154 train_time:269226ms step_avg:1346.13ms +step:250/20000 train_loss:2.6168 train_time:336977ms step_avg:1347.91ms +step:300/20000 train_loss:2.4755 train_time:404746ms step_avg:1349.15ms +step:350/20000 train_loss:2.5226 train_time:472525ms step_avg:1350.07ms +step:400/20000 train_loss:2.4300 train_time:540349ms step_avg:1350.87ms +step:450/20000 train_loss:2.2536 train_time:608178ms step_avg:1351.51ms +step:500/20000 train_loss:2.3113 train_time:676567ms step_avg:1353.13ms +step:500/20000 val_loss:2.3274 val_bpb:1.3784 train_time:676611ms step_avg:1353.22ms +step:550/20000 train_loss:2.3695 train_time:744686ms step_avg:1353.98ms +step:600/20000 train_loss:2.2671 train_time:812797ms step_avg:1354.66ms +step:650/20000 train_loss:2.2373 train_time:880946ms step_avg:1355.30ms +step:700/20000 train_loss:2.3081 train_time:949104ms step_avg:1355.86ms +step:750/20000 train_loss:2.2737 train_time:1017254ms step_avg:1356.34ms +step:800/20000 train_loss:2.2530 train_time:1085409ms step_avg:1356.76ms +step:850/20000 train_loss:2.1855 train_time:1153602ms step_avg:1357.18ms +step:900/20000 train_loss:2.1047 train_time:1221810ms step_avg:1357.57ms +step:950/20000 train_loss:2.3058 train_time:1290011ms step_avg:1357.91ms +step:1000/20000 train_loss:2.2370 train_time:1358693ms step_avg:1358.69ms +step:1000/20000 val_loss:2.1807 val_bpb:1.2916 train_time:1358737ms step_avg:1358.74ms +step:1050/20000 train_loss:2.1633 train_time:1426912ms step_avg:1358.96ms +step:1100/20000 train_loss:2.1901 train_time:1495141ms step_avg:1359.22ms +step:1150/20000 train_loss:2.1451 train_time:1563371ms 
step_avg:1359.45ms +step:1200/20000 train_loss:2.1925 train_time:1631660ms step_avg:1359.72ms +step:1250/20000 train_loss:2.2160 train_time:1699949ms step_avg:1359.96ms +step:1300/20000 train_loss:2.1869 train_time:1768261ms step_avg:1360.20ms +step:1350/20000 train_loss:2.1586 train_time:1836588ms step_avg:1360.44ms +step:1400/20000 train_loss:2.1726 train_time:1904911ms step_avg:1360.65ms +step:1450/20000 train_loss:2.1689 train_time:1973248ms step_avg:1360.86ms +step:1500/20000 train_loss:2.1391 train_time:2042104ms step_avg:1361.40ms +step:1500/20000 val_loss:2.1246 val_bpb:1.2583 train_time:2042149ms step_avg:1361.43ms +step:1550/20000 train_loss:2.1090 train_time:2110481ms step_avg:1361.60ms +step:1600/20000 train_loss:2.1871 train_time:2178869ms step_avg:1361.79ms +step:1650/20000 train_loss:1.9698 train_time:2247283ms step_avg:1361.99ms +step:1700/20000 train_loss:2.0933 train_time:2315736ms step_avg:1362.20ms +step:1750/20000 train_loss:2.0614 train_time:2384155ms step_avg:1362.37ms +step:1800/20000 train_loss:2.0974 train_time:2452595ms step_avg:1362.55ms +step:1850/20000 train_loss:2.1094 train_time:2521066ms step_avg:1362.74ms +step:1900/20000 train_loss:2.0530 train_time:2589507ms step_avg:1362.90ms +step:1950/20000 train_loss:2.0371 train_time:2657961ms step_avg:1363.06ms +step:2000/20000 train_loss:2.2915 train_time:2726880ms step_avg:1363.44ms +step:2000/20000 val_loss:2.0686 val_bpb:1.2252 train_time:2726924ms step_avg:1363.46ms +step:2050/20000 train_loss:2.0580 train_time:2795382ms step_avg:1363.60ms +step:2100/20000 train_loss:2.0309 train_time:2863872ms step_avg:1363.75ms +step:2150/20000 train_loss:2.0080 train_time:2932366ms step_avg:1363.89ms +step:2200/20000 train_loss:2.1611 train_time:3000869ms step_avg:1364.03ms +step:2250/20000 train_loss:2.0531 train_time:3069359ms step_avg:1364.16ms +step:2300/20000 train_loss:2.0314 train_time:3137845ms step_avg:1364.28ms +step:2350/20000 train_loss:1.9839 train_time:3206343ms step_avg:1364.40ms +step:2400/20000 train_loss:2.0978 train_time:3274856ms step_avg:1364.52ms +step:2450/20000 train_loss:2.0583 train_time:3343349ms step_avg:1364.63ms +step:2500/20000 train_loss:2.0143 train_time:3412281ms step_avg:1364.91ms +step:2500/20000 val_loss:2.0210 val_bpb:1.1969 train_time:3412325ms step_avg:1364.93ms +step:2550/20000 train_loss:2.0163 train_time:3480766ms step_avg:1365.01ms +step:2600/20000 train_loss:1.9947 train_time:3549233ms step_avg:1365.09ms +step:2650/20000 train_loss:1.9997 train_time:3617731ms step_avg:1365.18ms +step:2700/20000 train_loss:2.0195 train_time:3686191ms step_avg:1365.26ms +step:2750/20000 train_loss:2.0010 train_time:3754675ms step_avg:1365.34ms +step:2800/20000 train_loss:2.0359 train_time:3823161ms step_avg:1365.41ms +swa:start step:2850 +step:2850/20000 train_loss:1.9860 train_time:3891626ms step_avg:1365.48ms +step:2900/20000 train_loss:2.0033 train_time:3960176ms step_avg:1365.58ms +step:2950/20000 train_loss:2.0417 train_time:4028712ms step_avg:1365.67ms +late_qat:enabled step:2990 scale:0.1498 +step:3000/20000 train_loss:1.9297 train_time:4097686ms step_avg:1365.90ms +step:3000/20000 val_loss:1.9846 val_bpb:1.1754 train_time:4097782ms step_avg:1365.93ms +step:3050/20000 train_loss:1.9368 train_time:4166118ms step_avg:1365.94ms +step:3100/20000 train_loss:2.0003 train_time:4234401ms step_avg:1365.94ms +step:3150/20000 train_loss:2.0099 train_time:4302671ms step_avg:1365.93ms +step:3200/20000 train_loss:1.9846 train_time:4370945ms step_avg:1365.92ms +step:3250/20000 train_loss:1.9515 
train_time:4439218ms step_avg:1365.91ms +step:3300/20000 train_loss:1.9330 train_time:4507468ms step_avg:1365.90ms +step:3450/20000 train_loss:1.9670 train_time:4712315ms step_avg:1365.89ms +step:3500/20000 train_loss:1.9497 train_time:4781271ms step_avg:1366.08ms +step:3500/20000 val_loss:1.9558 val_bpb:1.1583 train_time:4781368ms step_avg:1366.11ms +step:3514/20000 val_loss:1.9557 val_bpb:1.1583 train_time:4800532ms step_avg:1366.12ms +stopping_early: wallclock_cap train_time:4800532ms step:3514/20000 +peak memory allocated: 50545 MiB reserved: 50594 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:1.9532 val_bpb:1.1568 eval_time:32961ms +Serialized model: 106023671 bytes +Code size: 99082 bytes +Serialized model int6+lzma: 14754232 bytes +Total submission size int6+lzma: 14853314 bytes +final_int6_roundtrip val_loss:1.9685 val_bpb:1.1659 eval_time:63637ms +final_int6_roundtrip_exact val_loss:1.96850574 val_bpb:1.16585998 diff --git a/full_4pass_stdout.log b/full_4pass_stdout.log new file mode 100644 index 0000000000000000000000000000000000000000..eb5842dba835db640d1191bbcc9a7b512b535268 GIT binary patch literal 2445 [base85-encoded binary patch data garbled and truncated in extraction; omitted] +Traceback (most recent call last): + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2148, in <module> + main() + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1813, in main + warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__ + return super().__call__(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1056, in forward + def forward(self, input_ids: Tensor, target_ids: Tensor, + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn + return
fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1200, in forward + return compiled_fn(full_args) + ^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 566, in runtime_wrapper + all_outs = call_func_at_runtime_with_args( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args + out = normalize_as_list(f(args)) + ^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 105, in g + return f(*args) + ^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 596, in apply + return super().apply(*args, **kwargs) # type: ignore[misc] + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2503, in forward + fw_outs = call_func_at_runtime_with_args( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args + out = normalize_as_list(f(args)) + ^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 783, in wrapper + return compiled_fn(runtime_args) + ^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1011, in inner_fn + outs = compiled_fn(args) + ^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__ + return self.current_callable(inputs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run + out = model(new_inputs) + ^^^^^^^^^^^^^^^^^ + File "/tmp/torchinductor_nesta/3d/c3dzegwqai5cgqo2lurmxd3pfex6kuqjwm3ooueib5swt7sgm5kg.py", line 9077, in call + buf776 = empty_strided_cuda((48, 2048, 512), (1048576, 512, 1), torch.bfloat16) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 24.69 MiB is free. Process 448538 has 754.00 MiB memory in use. Process 448539 has 756.00 MiB memory in use. Process 448534 has 51.16 GiB memory in use. Including non-PyTorch memory, this process has 38.69 GiB memory in use. Process 448537 has 48.42 GiB memory in use. Of the allocated memory 38.01 GiB is allocated by PyTorch, and 15.38 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf) +Traceback (most recent call last): + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2148, in <module> + main() + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1813, in main + warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__ + return super().__call__(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1056, in forward + def forward(self, input_ids: Tensor, target_ids: Tensor, + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1200, in forward + return compiled_fn(full_args) + ^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 566, in runtime_wrapper + all_outs = call_func_at_runtime_with_args( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args + out = normalize_as_list(f(args)) + ^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 105, in g + return f(*args) + ^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 596, in apply + return super().apply(*args, **kwargs) # type: ignore[misc] + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2503, in forward + fw_outs = call_func_at_runtime_with_args( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args + out =
normalize_as_list(f(args)) + ^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 783, in wrapper + return compiled_fn(runtime_args) + ^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1011, in inner_fn + outs = compiled_fn(args) + ^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__ + return self.current_callable(inputs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run + out = model(new_inputs) + ^^^^^^^^^^^^^^^^^ + File "/tmp/torchinductor_nesta/3d/c3dzegwqai5cgqo2lurmxd3pfex6kuqjwm3ooueib5swt7sgm5kg.py", line 7110, in call + buf5 = empty_strided_cuda((98304, 512), (512, 1), torch.bfloat16) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 20.69 MiB is free. Including non-PyTorch memory, this process has 758.00 MiB memory in use. Process 448539 has 756.00 MiB memory in use. Process 448534 has 51.16 GiB memory in use. Process 448547 has 38.69 GiB memory in use. Process 448537 has 48.42 GiB memory in use. Of the allocated memory 127.50 MiB is allocated by PyTorch, and 18.50 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf) +Traceback (most recent call last): + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2148, in <module> + main() + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1813, in main + warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__ + return super().__call__(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File
"/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1056, in forward + def forward(self, input_ids: Tensor, target_ids: Tensor, + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1200, in forward + return compiled_fn(full_args) + ^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 566, in runtime_wrapper + all_outs = call_func_at_runtime_with_args( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args + out = normalize_as_list(f(args)) + ^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 105, in g + return f(*args) + ^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 596, in apply + return super().apply(*args, **kwargs) # type: ignore[misc] + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2503, in forward + fw_outs = call_func_at_runtime_with_args( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args + out = normalize_as_list(f(args)) + ^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 783, in wrapper + return compiled_fn(runtime_args) + ^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1011, in inner_fn + outs = compiled_fn(args) + ^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__ + return self.current_callable(inputs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run + out = model(new_inputs) + ^^^^^^^^^^^^^^^^^ + File "/tmp/torchinductor_nesta/3d/c3dzegwqai5cgqo2lurmxd3pfex6kuqjwm3ooueib5swt7sgm5kg.py", line 7110, in call + buf5 = empty_strided_cuda((98304, 512), (512, 1), torch.bfloat16) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 18.69 MiB is free. Process 448538 has 758.00 MiB memory in use. Including non-PyTorch memory, this process has 758.00 MiB memory in use. Process 448534 has 51.16 GiB memory in use. Process 448547 has 38.69 GiB memory in use. Process 448537 has 48.42 GiB memory in use. Of the allocated memory 127.50 MiB is allocated by PyTorch, and 18.50 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf) +Traceback (most recent call last): + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2148, in <module> + main() + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1814, in main + (warmup_loss * grad_scale).backward() + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_tensor.py", line 631, in backward + torch.autograd.backward( + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/__init__.py", line 381, in backward + _engine_run_backward( + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/graph.py", line 869, in _engine_run_backward + return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 317, in apply + return user_fn(self, *args) + ^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2748, in backward + return impl_fn() + ^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2734, in impl_fn + out = CompiledFunction._backward_impl(ctx, all_args) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2918, in _backward_impl + out = call_func_at_runtime_with_args( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args + out = normalize_as_list(f(args)) + ^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__ + return self.current_callable(inputs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run + out = model(new_inputs) + ^^^^^^^^^^^^^^^^^ + File "/tmp/torchinductor_nesta/wm/cwmtq2g54vzhxzvsc4odvxx7srlroi2tosqcftjhmyw2c637ogrq.py", line 12255, in call + buf54 = empty_strided_cuda((48, 2048, 512), (1048576, 512, 1), torch.bfloat16) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 18.69 MiB is free. Process 448538 has 758.00 MiB memory in use. Process 448539 has 758.00 MiB memory in use. Process 448534 has 51.16 GiB memory in use. Process 448547 has 38.69 GiB memory in use. Including non-PyTorch memory, this process has 48.42 GiB memory in use. Of the allocated memory 47.76 GiB is allocated by PyTorch, and 3.37 MiB is reserved by PyTorch but unallocated.
If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf) +wandb: +wandb: 🚀 View run full_4pass_baseline_80min at: https://wandb.ai/propensity/parameter-golf/runs/jkh80zal +wandb: Find logs at: wandb/run-20260326_230158-jkh80zal/logs +wandb: +wandb: 🚀 View run full_4pass_baseline_80min at: https://wandb.ai/propensity/parameter-golf/runs/fsi4c82a +wandb: Find logs at: wandb/run-20260326_230158-fsi4c82a/logs +wandb: +wandb: 🚀 View run full_4pass_baseline_80min at: https://wandb.ai/propensity/parameter-golf/runs/43bipylb +wandb: Find logs at: wandb/run-20260326_230158-43bipylb/logs +wandb: +wandb: 🚀 View run full_4pass_baseline_80min at: https://wandb.ai/propensity/parameter-golf/runs/zcabiozu +wandb: Find logs at: wandb/run-20260326_230158-zcabiozu/logs +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/20000 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms +step:1/20000 train_loss:6.9310 grad_norm:0.3717 train_time:1291ms step_avg:1291.09ms +step:2/20000 train_loss:8.3536 grad_norm:3.5393 train_time:2598ms step_avg:1298.81ms +step:3/20000 train_loss:7.5089 grad_norm:1.8069 train_time:3954ms step_avg:1318.13ms +step:4/20000 train_loss:7.5822 grad_norm:1.8725 train_time:5317ms step_avg:1329.29ms +step:5/20000 train_loss:7.3524 grad_norm:1.8843 train_time:6673ms step_avg:1334.64ms +step:6/20000 train_loss:7.0868 grad_norm:1.7131 train_time:8028ms step_avg:1338.00ms +step:7/20000 train_loss:6.9401 grad_norm:2.0897 train_time:9384ms step_avg:1340.63ms +step:8/20000 train_loss:6.8952 grad_norm:1.4534 train_time:10745ms step_avg:1343.15ms +step:9/20000 train_loss:6.5431 grad_norm:1.0222 train_time:12102ms step_avg:1344.70ms +step:10/20000 train_loss:6.1427 grad_norm:0.9715 train_time:13466ms step_avg:1346.55ms +step:50/20000 train_loss:3.6903 grad_norm:0.9422 train_time:68054ms step_avg:1361.07ms +step:100/20000 train_loss:3.1184 grad_norm:0.5410 train_time:136293ms step_avg:1362.93ms +step:150/20000 train_loss:2.7752 grad_norm:0.3613 train_time:205070ms step_avg:1367.13ms +step:200/20000 train_loss:2.5614 grad_norm:0.2693 train_time:273305ms step_avg:1366.53ms +step:250/20000 train_loss:2.5709 grad_norm:0.2522 train_time:341556ms step_avg:1366.22ms +step:300/20000 train_loss:2.4364 grad_norm:0.2295 train_time:409825ms step_avg:1366.08ms +step:350/20000 train_loss:2.4859 grad_norm:0.2104 train_time:478072ms step_avg:1365.92ms +step:400/20000 train_loss:2.3988 grad_norm:0.1555 train_time:546341ms step_avg:1365.85ms +step:450/20000 train_loss:2.2317 grad_norm:0.1958 train_time:614614ms step_avg:1365.81ms +step:500/20000 train_loss:2.2898 grad_norm:0.1775 train_time:682900ms step_avg:1365.80ms +step:500/20000 val_loss:2.3130 val_bpb:1.3699 train_time:682945ms step_avg:1365.89ms +step:550/20000 train_loss:2.3492 grad_norm:0.1559 train_time:751209ms step_avg:1365.83ms +step:600/20000 train_loss:2.2513 grad_norm:0.1438 train_time:819544ms step_avg:1365.91ms +step:650/20000 train_loss:2.2323 grad_norm:0.1536 train_time:888368ms step_avg:1366.72ms 
+step:700/20000 train_loss:2.3026 grad_norm:0.1020 train_time:956783ms step_avg:1366.83ms +step:750/20000 train_loss:2.2750 grad_norm:0.1105 train_time:1025183ms step_avg:1366.91ms +step:800/20000 train_loss:2.2546 grad_norm:0.1031 train_time:1093599ms step_avg:1367.00ms +step:850/20000 train_loss:2.1799 grad_norm:0.0737 train_time:1162084ms step_avg:1367.16ms +step:900/20000 train_loss:2.0960 grad_norm:0.0817 train_time:1230597ms step_avg:1367.33ms +step:950/20000 train_loss:2.2968 grad_norm:0.0953 train_time:1299094ms step_avg:1367.47ms +step:1000/20000 train_loss:2.2247 grad_norm:0.0713 train_time:1367589ms step_avg:1367.59ms +step:1000/20000 val_loss:2.1722 val_bpb:1.2865 train_time:1367633ms step_avg:1367.63ms +step:1050/20000 train_loss:2.1500 grad_norm:0.1469 train_time:1436112ms step_avg:1367.73ms +step:1100/20000 train_loss:2.1744 grad_norm:0.0794 train_time:1504991ms step_avg:1368.17ms +step:1150/20000 train_loss:2.1290 grad_norm:0.0672 train_time:1573762ms step_avg:1368.49ms +step:1200/20000 train_loss:2.1756 grad_norm:0.0636 train_time:1642514ms step_avg:1368.76ms +step:1250/20000 train_loss:2.1991 grad_norm:0.0599 train_time:1711283ms step_avg:1369.03ms +step:1300/20000 train_loss:2.1695 grad_norm:0.1132 train_time:1780070ms step_avg:1369.28ms +step:1350/20000 train_loss:2.1436 grad_norm:0.1200 train_time:1848866ms step_avg:1369.53ms +step:1400/20000 train_loss:2.1553 grad_norm:0.0700 train_time:1917654ms step_avg:1369.75ms +step:1450/20000 train_loss:2.1501 grad_norm:0.0631 train_time:1986442ms step_avg:1369.96ms +step:1500/20000 train_loss:2.1193 grad_norm:0.0733 train_time:2055220ms step_avg:1370.15ms +step:1500/20000 val_loss:2.1071 val_bpb:1.2479 train_time:2055264ms step_avg:1370.18ms +step:1550/20000 train_loss:2.0928 grad_norm:0.0758 train_time:2124013ms step_avg:1370.33ms +step:1600/20000 train_loss:2.1722 grad_norm:0.0814 train_time:2193129ms step_avg:1370.71ms +step:1650/20000 train_loss:1.9557 grad_norm:0.0655 train_time:2261915ms step_avg:1370.86ms +step:1700/20000 train_loss:2.0848 grad_norm:0.0634 train_time:2330710ms step_avg:1371.01ms +step:1750/20000 train_loss:2.0562 grad_norm:0.0759 train_time:2399493ms step_avg:1371.14ms +step:1800/20000 train_loss:2.0964 grad_norm:0.0645 train_time:2468259ms step_avg:1371.26ms +step:1850/20000 train_loss:2.1107 grad_norm:0.0831 train_time:2537046ms step_avg:1371.38ms +step:1900/20000 train_loss:2.0580 grad_norm:0.0648 train_time:2605824ms step_avg:1371.49ms +step:1950/20000 train_loss:2.0431 grad_norm:0.0981 train_time:2674651ms step_avg:1371.62ms +step:2000/20000 train_loss:2.2944 grad_norm:0.0838 train_time:2743419ms step_avg:1371.71ms +step:2000/20000 val_loss:2.0763 val_bpb:1.2297 train_time:2743463ms step_avg:1371.73ms +step:2050/20000 train_loss:2.0607 grad_norm:0.1013 train_time:2812501ms step_avg:1371.95ms +step:2100/20000 train_loss:2.0358 grad_norm:0.0558 train_time:2881257ms step_avg:1372.03ms +step:2150/20000 train_loss:2.0142 grad_norm:0.0526 train_time:2950035ms step_avg:1372.11ms +step:2200/20000 train_loss:2.1668 grad_norm:0.0614 train_time:3018808ms step_avg:1372.19ms +step:2250/20000 train_loss:2.0604 grad_norm:0.0644 train_time:3087562ms step_avg:1372.25ms +step:2300/20000 train_loss:2.0377 grad_norm:0.1123 train_time:3156291ms step_avg:1372.30ms +step:2350/20000 train_loss:1.9923 grad_norm:0.0511 train_time:3225042ms step_avg:1372.36ms +step:2400/20000 train_loss:2.1062 grad_norm:0.0682 train_time:3293804ms step_avg:1372.42ms +step:2450/20000 train_loss:2.0650 grad_norm:0.0639 train_time:3362565ms 
step_avg:1372.48ms +step:2500/20000 train_loss:2.0208 grad_norm:0.0580 train_time:3431320ms step_avg:1372.53ms +step:2500/20000 val_loss:2.0279 val_bpb:1.2010 train_time:3431364ms step_avg:1372.55ms +step:2550/20000 train_loss:2.0211 grad_norm:0.0558 train_time:3500393ms step_avg:1372.70ms +step:2600/20000 train_loss:2.0001 grad_norm:0.0479 train_time:3569165ms step_avg:1372.76ms +step:2650/20000 train_loss:2.0040 grad_norm:0.0582 train_time:3637929ms step_avg:1372.80ms +step:2700/20000 train_loss:2.0265 grad_norm:0.0542 train_time:3706703ms step_avg:1372.85ms +step:2750/20000 train_loss:2.0077 grad_norm:0.0457 train_time:3775459ms step_avg:1372.89ms +step:2800/20000 train_loss:2.0415 grad_norm:0.0569 train_time:3844241ms step_avg:1372.94ms +step:2850/20000 train_loss:1.9900 grad_norm:0.0487 train_time:3913011ms step_avg:1372.99ms +step:2900/20000 train_loss:2.0045 grad_norm:0.0438 train_time:3981769ms step_avg:1373.02ms +step:2950/20000 train_loss:2.0440 grad_norm:0.0447 train_time:4050513ms step_avg:1373.06ms +step:3000/20000 train_loss:1.9316 grad_norm:0.0567 train_time:4119545ms step_avg:1373.18ms +step:3000/20000 val_loss:1.9838 val_bpb:1.1749 train_time:4119590ms step_avg:1373.20ms +step:3050/20000 train_loss:1.9372 grad_norm:0.0506 train_time:4188300ms step_avg:1373.21ms +step:3100/20000 train_loss:1.9990 grad_norm:0.0465 train_time:4257075ms step_avg:1373.25ms +step:3150/20000 train_loss:2.0077 grad_norm:0.0401 train_time:4325837ms step_avg:1373.28ms +swa:start step:3200 +step:3200/20000 train_loss:1.9812 grad_norm:0.0445 train_time:4394566ms step_avg:1373.30ms +late_qat:enabled step:3241 scale:0.1495 core_quant:on +step:3250/20000 train_loss:1.9531 grad_norm:0.0567 train_time:4519079ms step_avg:1390.49ms +step:3300/20000 train_loss:1.9296 grad_norm:0.0386 train_time:4587540ms step_avg:1390.16ms +step:3350/20000 train_loss:1.9653 grad_norm:0.0394 train_time:4655858ms step_avg:1389.81ms +step:3400/20000 train_loss:2.0099 grad_norm:0.0483 train_time:4724204ms step_avg:1389.47ms +step:3450/20000 train_loss:1.9637 grad_norm:0.0369 train_time:4792535ms step_avg:1389.14ms +step:3456/20000 val_loss:1.9505 val_bpb:1.1552 train_time:4800814ms step_avg:1389.12ms +stopping_early: wallclock_cap train_time:4800814ms step:3456/20000 +peak memory allocated: 50545 MiB reserved: 50594 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:1.9472 val_bpb:1.1532 eval_time:32839ms +Serialized model: 106023671 bytes +Code size: 102633 bytes +Serialized model int6+lzma: 16373548 bytes +Total submission size int6+lzma: 16476181 bytes +final_int6_roundtrip val_loss:1.9574 val_bpb:1.1593 eval_time:39862ms +final_int6_roundtrip_exact val_loss:1.95735252 val_bpb:1.15925441 +final_int6_sliding_window val_loss:1.9164 val_bpb:1.1350 stride:64 eval_time:1105486ms +final_int6_sliding_window_exact val_loss:1.91642779 val_bpb:1.13501949 +final_int8_zlib_roundtrip_exact val_loss:1.91642779 val_bpb:1.13501949 +ttt_sliding:start chunks=1893 chunk_tokens=32768 total_windows=969088 stride=64 ttt_lr=0.002 ttt_epochs=3 freeze_blocks=2 +ttt_sliding:params unfrozen=26923088 frozen=4112 + ttt_chunk [1/1893] bpb=1.226275 time=1.9s + ttt_chunk [11/1893] bpb=1.128206 time=20.5s + ttt_chunk [21/1893] bpb=1.137378 time=39.0s + ttt_chunk [31/1893] bpb=1.142175 time=57.6s + ttt_chunk [41/1893] bpb=1.138228 time=76.1s + ttt_chunk [51/1893] bpb=1.139877 time=94.6s + ttt_chunk [61/1893] bpb=1.143695 time=113.2s + ttt_chunk [71/1893] bpb=1.141806 time=131.7s + ttt_chunk [81/1893] bpb=1.138175 time=150.2s + ttt_chunk [91/1893] 
bpb=1.137107 time=168.8s + ttt_chunk [101/1893] bpb=1.138115 time=187.3s + ttt_chunk [111/1893] bpb=1.138295 time=205.9s + ttt_chunk [121/1893] bpb=1.134671 time=224.4s + ttt_chunk [131/1893] bpb=1.133939 time=242.9s + ttt_chunk [141/1893] bpb=1.132766 time=261.5s + ttt_chunk [151/1893] bpb=1.132980 time=280.0s + ttt_chunk [161/1893] bpb=1.133800 time=298.6s + ttt_chunk [171/1893] bpb=1.135874 time=317.1s + ttt_chunk [181/1893] bpb=1.135884 time=335.6s + ttt_chunk [191/1893] bpb=1.138340 time=354.2s + ttt_chunk [201/1893] bpb=1.137866 time=372.7s + ttt_chunk [211/1893] bpb=1.136957 time=391.2s + ttt_chunk [221/1893] bpb=1.137842 time=409.8s + ttt_chunk [231/1893] bpb=1.137565 time=428.3s + ttt_chunk [241/1893] bpb=1.137849 time=446.8s + ttt_chunk [251/1893] bpb=1.137360 time=465.4s + ttt_chunk [261/1893] bpb=1.136692 time=483.9s + ttt_chunk [271/1893] bpb=1.135780 time=502.5s + ttt_chunk [281/1893] bpb=1.137389 time=521.0s + ttt_chunk [291/1893] bpb=1.137018 time=539.5s + ttt_chunk [301/1893] bpb=1.137918 time=558.1s + ttt_chunk [311/1893] bpb=1.138001 time=576.6s + ttt_chunk [321/1893] bpb=1.138708 time=595.1s + ttt_chunk [331/1893] bpb=1.138179 time=613.7s + ttt_chunk [341/1893] bpb=1.137832 time=632.2s + ttt_chunk [351/1893] bpb=1.138543 time=650.8s + ttt_chunk [361/1893] bpb=1.139301 time=669.3s + ttt_chunk [371/1893] bpb=1.139185 time=687.8s + ttt_chunk [381/1893] bpb=1.138924 time=706.4s + ttt_chunk [391/1893] bpb=1.139607 time=724.9s + ttt_chunk [401/1893] bpb=1.139172 time=743.4s + ttt_chunk [411/1893] bpb=1.138218 time=762.0s + ttt_chunk [421/1893] bpb=1.138334 time=780.5s + ttt_chunk [431/1893] bpb=1.138777 time=799.1s + ttt_chunk [441/1893] bpb=1.138161 time=817.6s + ttt_chunk [451/1893] bpb=1.138301 time=836.1s + ttt_chunk [461/1893] bpb=1.138190 time=854.7s + ttt_chunk [471/1893] bpb=1.137746 time=873.2s + ttt_chunk [481/1893] bpb=1.137597 time=891.8s + ttt_chunk [491/1893] bpb=1.137722 time=910.3s + ttt_chunk [501/1893] bpb=1.137492 time=928.8s + ttt_chunk [511/1893] bpb=1.137017 time=947.4s + ttt_chunk [521/1893] bpb=1.136714 time=965.9s + ttt_chunk [531/1893] bpb=1.137443 time=984.5s + ttt_chunk [541/1893] bpb=1.137557 time=1003.0s + ttt_chunk [551/1893] bpb=1.137019 time=1021.5s + ttt_chunk [561/1893] bpb=1.136885 time=1040.1s + ttt_chunk [571/1893] bpb=1.136621 time=1058.6s + ttt_chunk [581/1893] bpb=1.136257 time=1077.2s + ttt_chunk [591/1893] bpb=1.135719 time=1095.7s + ttt_chunk [601/1893] bpb=1.135711 time=1114.2s + ttt_chunk [611/1893] bpb=1.135386 time=1132.8s + ttt_chunk [621/1893] bpb=1.135235 time=1151.3s + ttt_chunk [631/1893] bpb=1.134973 time=1169.9s + ttt_chunk [641/1893] bpb=1.134519 time=1188.4s + ttt_chunk [651/1893] bpb=1.134057 time=1206.9s + ttt_chunk [661/1893] bpb=1.133947 time=1225.5s + ttt_chunk [671/1893] bpb=1.133482 time=1244.0s + ttt_chunk [681/1893] bpb=1.132918 time=1262.6s + ttt_chunk [691/1893] bpb=1.132994 time=1281.1s + ttt_chunk [701/1893] bpb=1.132163 time=1299.6s + ttt_chunk [711/1893] bpb=1.132176 time=1318.2s + ttt_chunk [721/1893] bpb=1.132090 time=1336.7s + ttt_chunk [731/1893] bpb=1.132331 time=1355.2s + ttt_chunk [741/1893] bpb=1.132205 time=1373.8s + ttt_chunk [751/1893] bpb=1.131884 time=1392.3s + ttt_chunk [761/1893] bpb=1.132028 time=1410.8s + ttt_chunk [771/1893] bpb=1.131860 time=1429.4s + ttt_chunk [781/1893] bpb=1.132024 time=1447.9s + ttt_chunk [791/1893] bpb=1.131869 time=1466.4s + ttt_chunk [801/1893] bpb=1.131804 time=1485.0s + ttt_chunk [811/1893] bpb=1.131817 time=1503.5s + ttt_chunk [821/1893] bpb=1.131702 
time=1522.1s + ttt_chunk [831/1893] bpb=1.131418 time=1540.6s + ttt_chunk [841/1893] bpb=1.131180 time=1559.1s + ttt_chunk [851/1893] bpb=1.131241 time=1577.7s + ttt_chunk [861/1893] bpb=1.131312 time=1596.2s + ttt_chunk [871/1893] bpb=1.131521 time=1614.7s + ttt_chunk [881/1893] bpb=1.131519 time=1633.3s + ttt_chunk [891/1893] bpb=1.130978 time=1651.8s + ttt_chunk [901/1893] bpb=1.130995 time=1670.3s + ttt_chunk [911/1893] bpb=1.130849 time=1688.9s + ttt_chunk [921/1893] bpb=1.130984 time=1707.4s + ttt_chunk [931/1893] bpb=1.130928 time=1726.0s + ttt_chunk [941/1893] bpb=1.131129 time=1744.5s + ttt_chunk [951/1893] bpb=1.131431 time=1763.0s + ttt_chunk [961/1893] bpb=1.131741 time=1781.6s + ttt_chunk [971/1893] bpb=1.132107 time=1800.1s + ttt_chunk [981/1893] bpb=1.132319 time=1818.6s + ttt_chunk [991/1893] bpb=1.132236 time=1837.2s + ttt_chunk [1001/1893] bpb=1.132567 time=1855.7s + ttt_chunk [1011/1893] bpb=1.132723 time=1874.3s + ttt_chunk [1021/1893] bpb=1.133011 time=1892.8s + ttt_chunk [1031/1893] bpb=1.133400 time=1911.3s + ttt_chunk [1041/1893] bpb=1.133897 time=1929.9s + ttt_chunk [1051/1893] bpb=1.133756 time=1948.4s + ttt_chunk [1061/1893] bpb=1.133865 time=1967.0s + ttt_chunk [1071/1893] bpb=1.134029 time=1985.5s + ttt_chunk [1081/1893] bpb=1.134076 time=2004.1s + ttt_chunk [1091/1893] bpb=1.134326 time=2022.7s + ttt_chunk [1101/1893] bpb=1.134469 time=2041.2s + ttt_chunk [1111/1893] bpb=1.134274 time=2059.8s + ttt_chunk [1121/1893] bpb=1.134049 time=2078.3s + ttt_chunk [1131/1893] bpb=1.133943 time=2096.9s + ttt_chunk [1141/1893] bpb=1.133705 time=2115.4s + ttt_chunk [1151/1893] bpb=1.133733 time=2134.0s + ttt_chunk [1161/1893] bpb=1.133569 time=2152.5s + ttt_chunk [1171/1893] bpb=1.133389 time=2171.1s + ttt_chunk [1181/1893] bpb=1.133164 time=2189.6s + ttt_chunk [1191/1893] bpb=1.133317 time=2208.2s + ttt_chunk [1201/1893] bpb=1.133519 time=2226.8s + ttt_chunk [1211/1893] bpb=1.133117 time=2245.3s + ttt_chunk [1221/1893] bpb=1.133455 time=2263.9s + ttt_chunk [1231/1893] bpb=1.133394 time=2282.4s + ttt_chunk [1241/1893] bpb=1.133104 time=2300.9s + ttt_chunk [1251/1893] bpb=1.132567 time=2319.5s + ttt_chunk [1261/1893] bpb=1.132300 time=2338.0s + ttt_chunk [1271/1893] bpb=1.132047 time=2356.6s + ttt_chunk [1281/1893] bpb=1.131738 time=2375.1s + ttt_chunk [1291/1893] bpb=1.131494 time=2393.7s + ttt_chunk [1301/1893] bpb=1.131443 time=2412.2s + ttt_chunk [1311/1893] bpb=1.131173 time=2430.7s + ttt_chunk [1321/1893] bpb=1.130872 time=2449.3s + ttt_chunk [1331/1893] bpb=1.130632 time=2467.8s + ttt_chunk [1341/1893] bpb=1.130505 time=2486.4s + ttt_chunk [1351/1893] bpb=1.130352 time=2504.9s + ttt_chunk [1361/1893] bpb=1.130484 time=2523.5s + ttt_chunk [1371/1893] bpb=1.130705 time=2542.0s + ttt_chunk [1381/1893] bpb=1.130910 time=2560.5s + ttt_chunk [1391/1893] bpb=1.130695 time=2579.1s + ttt_chunk [1401/1893] bpb=1.130724 time=2597.6s + ttt_chunk [1411/1893] bpb=1.130831 time=2616.2s + ttt_chunk [1421/1893] bpb=1.130815 time=2634.7s + ttt_chunk [1431/1893] bpb=1.130791 time=2653.3s + ttt_chunk [1441/1893] bpb=1.131256 time=2671.8s + ttt_chunk [1451/1893] bpb=1.131119 time=2691.1s + ttt_chunk [1461/1893] bpb=1.131048 time=2709.6s + ttt_chunk [1471/1893] bpb=1.131643 time=2728.2s + ttt_chunk [1481/1893] bpb=1.131517 time=2746.7s + ttt_chunk [1491/1893] bpb=1.131890 time=2765.3s + ttt_chunk [1501/1893] bpb=1.131872 time=2783.8s + ttt_chunk [1511/1893] bpb=1.131833 time=2802.3s + ttt_chunk [1521/1893] bpb=1.131945 time=2820.9s + ttt_chunk [1531/1893] bpb=1.132160 time=2839.4s + 
ttt_chunk [1541/1893] bpb=1.132230 time=2858.0s + ttt_chunk [1551/1893] bpb=1.132470 time=2876.5s + ttt_chunk [1561/1893] bpb=1.132554 time=2895.1s + ttt_chunk [1571/1893] bpb=1.132686 time=2913.6s + ttt_chunk [1581/1893] bpb=1.132836 time=2932.1s + ttt_chunk [1591/1893] bpb=1.132902 time=2950.7s + ttt_chunk [1601/1893] bpb=1.133020 time=2969.2s + ttt_chunk [1611/1893] bpb=1.133281 time=2987.8s + ttt_chunk [1621/1893] bpb=1.133141 time=3006.3s + ttt_chunk [1631/1893] bpb=1.133187 time=3024.8s + ttt_chunk [1641/1893] bpb=1.133212 time=3043.4s + ttt_chunk [1651/1893] bpb=1.133269 time=3061.9s + ttt_chunk [1661/1893] bpb=1.133410 time=3080.5s + ttt_chunk [1671/1893] bpb=1.133595 time=3099.0s + ttt_chunk [1681/1893] bpb=1.133686 time=3117.5s + ttt_chunk [1691/1893] bpb=1.133787 time=3136.1s + ttt_chunk [1701/1893] bpb=1.133884 time=3154.6s + ttt_chunk [1711/1893] bpb=1.133862 time=3173.2s + ttt_chunk [1721/1893] bpb=1.133701 time=3191.7s + ttt_chunk [1731/1893] bpb=1.133797 time=3210.2s + ttt_chunk [1741/1893] bpb=1.133534 time=3228.8s + ttt_chunk [1751/1893] bpb=1.133407 time=3247.3s + ttt_chunk [1761/1893] bpb=1.133444 time=3265.9s + ttt_chunk [1771/1893] bpb=1.133395 time=3284.4s + ttt_chunk [1781/1893] bpb=1.133298 time=3303.0s + ttt_chunk [1791/1893] bpb=1.132959 time=3321.5s + ttt_chunk [1801/1893] bpb=1.132941 time=3340.0s + ttt_chunk [1811/1893] bpb=1.132795 time=3358.6s + ttt_chunk [1821/1893] bpb=1.132853 time=3377.1s + ttt_chunk [1831/1893] bpb=1.132699 time=3395.7s + ttt_chunk [1841/1893] bpb=1.132738 time=3414.2s + ttt_chunk [1851/1893] bpb=1.132559 time=3432.7s + ttt_chunk [1861/1893] bpb=1.132478 time=3451.3s + ttt_chunk [1871/1893] bpb=1.132413 time=3469.8s + ttt_chunk [1881/1893] bpb=1.132170 time=3488.4s + ttt_chunk [1891/1893] bpb=1.132153 time=3506.9s + ttt_chunk [1893/1893] bpb=1.132184 time=3509.9s +ttt_sliding:done val_loss=1.911640 val_bpb=1.132184 elapsed=3510.0s +legal_ttt val_loss:1.9116 val_bpb:1.1322 eval_time:3510399ms +legal_ttt_exact val_loss:1.91163996 val_bpb:1.13218386 +wandb: updating run metadata +wandb: uploading output.log; uploading wandb-summary.json; uploading config.yaml +wandb: uploading data +wandb: +wandb: Run history: +wandb: grad_norm ▂█▅▅▄▃▃▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ +wandb: lr_scale ██████████████████████▇▇▇▆▆▅▅▄▄▄▃▃▃▂▂▂▂▁ +wandb: step_avg_ms ▁▂▃▄▄▅▆▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇█ +wandb: train_loss ▆█▇▇▇▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ +wandb: val_bpb █▂▁▁▁▁▁▁ +wandb: val_loss █▂▁▁▁▁▁▁ +wandb: +wandb: Run summary: +wandb: grad_norm 0.03694 +wandb: lr_scale 0.00374 +wandb: step_avg_ms 1389.14072 +wandb: train_loss 1.96371 +wandb: val_bpb 1.1552 +wandb: val_loss 1.95051 +wandb: +wandb: 🚀 View run full_4pass_baseline_80min at: https://wandb.ai/propensity/parameter-golf/runs/qltwebo4 +wandb: ⭐️ View project at: https://wandb.ai/propensity/parameter-golf +wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s) +wandb: Find logs at: ./wandb/run-20260326_230156-qltwebo4/logs diff --git a/full_lora_r8.log b/full_lora_r8.log new file mode 100644 index 0000000000..c485d989f8 --- /dev/null +++ b/full_lora_r8.log @@ -0,0 +1,217 @@ +START full run: 4-pass LoRA-r8 delayed-warmup TTT SWA, 80min (Thu Mar 26 20:51:35 UTC 2026) +logs/58255815-919c-41aa-976c-7ba3f80fab8e.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin 
tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +lora: rank=8 params=1228800 +recurrence: core_start=3 core_end=8 num_passes=4 stem=3 core=5 tail=3 +lora_optimizer: lr=0.0025000000000000005 (scalar_lr * 0.1) +model_params:28156000 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:4800.000 +seed:1337 +wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY. +wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin +wandb: WARNING Using a boolean value for 'reinit' is deprecated. Use 'return_previous' or 'finish_previous' instead. +wandb: setting up run 3z8g4kez +wandb: Tracking run with wandb version 0.25.1 +wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_205139-3z8g4kez +wandb: Run `wandb offline` to turn off syncing. +wandb: Syncing run full_4pass_lora_r8_delayed_80min +wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf +wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/3z8g4kez +wandb:initialized +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/20000 val_loss:6.9291 val_bpb:4.1038 train_time:0ms step_avg:0.01ms +step:1/20000 train_loss:6.9303 grad_norm:0.3807 train_time:1305ms step_avg:1305.20ms +step:2/20000 train_loss:8.2624 grad_norm:3.4023 train_time:2653ms step_avg:1326.56ms +step:3/20000 train_loss:7.4846 grad_norm:1.6936 train_time:4015ms step_avg:1338.21ms +step:4/20000 train_loss:7.7154 grad_norm:1.9838 train_time:5385ms step_avg:1346.32ms +step:5/20000 train_loss:7.4456 grad_norm:2.1207 train_time:6747ms step_avg:1349.42ms +step:6/20000 train_loss:7.0896 grad_norm:1.7550 train_time:8108ms step_avg:1351.41ms +step:7/20000 train_loss:6.8569 grad_norm:2.3306 train_time:9470ms step_avg:1352.92ms +step:8/20000 train_loss:6.7973 grad_norm:1.6453 train_time:10833ms step_avg:1354.11ms +step:9/20000 train_loss:6.5582 grad_norm:1.2844 train_time:12199ms step_avg:1355.49ms +step:10/20000 train_loss:6.2034 grad_norm:1.2514 train_time:13569ms step_avg:1356.86ms +step:50/20000 train_loss:3.6856 grad_norm:0.8303 train_time:68670ms step_avg:1373.39ms +step:100/20000 train_loss:3.1157 grad_norm:0.4168 train_time:137708ms step_avg:1377.08ms +step:150/20000 train_loss:2.7810 grad_norm:0.3675 train_time:206699ms step_avg:1377.99ms +step:200/20000 train_loss:2.5703 grad_norm:0.3384 train_time:275646ms step_avg:1378.23ms +step:250/20000 train_loss:2.5789 grad_norm:0.2913 train_time:344574ms step_avg:1378.30ms +step:300/20000 train_loss:2.4357 grad_norm:0.2132 train_time:413532ms step_avg:1378.44ms +step:350/20000 train_loss:2.4866 grad_norm:0.2055 train_time:483575ms step_avg:1381.64ms +step:400/20000 train_loss:2.4144 grad_norm:0.2487 train_time:552589ms 
step_avg:1381.47ms +step:450/20000 train_loss:2.2333 grad_norm:0.1527 train_time:621533ms step_avg:1381.18ms +step:500/20000 train_loss:2.2865 grad_norm:0.1511 train_time:690504ms step_avg:1381.01ms +step:500/20000 val_loss:2.3117 val_bpb:1.3691 train_time:690515ms step_avg:1381.03ms +step:550/20000 train_loss:2.3480 grad_norm:0.1354 train_time:760462ms step_avg:1382.66ms +step:600/20000 train_loss:2.2536 grad_norm:0.2002 train_time:829477ms step_avg:1382.46ms +step:650/20000 train_loss:2.2306 grad_norm:0.1194 train_time:898608ms step_avg:1382.47ms +step:700/20000 train_loss:2.3041 grad_norm:0.1617 train_time:967715ms step_avg:1382.45ms +step:750/20000 train_loss:2.2754 grad_norm:0.1308 train_time:1036878ms step_avg:1382.50ms +step:800/20000 train_loss:2.2542 grad_norm:0.1211 train_time:1106111ms step_avg:1382.64ms +step:850/20000 train_loss:2.1786 grad_norm:0.0688 train_time:1175361ms step_avg:1382.78ms +step:900/20000 train_loss:2.0929 grad_norm:0.0751 train_time:1244692ms step_avg:1382.99ms +step:950/20000 train_loss:2.2961 grad_norm:0.1601 train_time:1314910ms step_avg:1384.12ms +step:1000/20000 train_loss:2.2261 grad_norm:0.0833 train_time:1384163ms step_avg:1384.16ms +step:1000/20000 val_loss:2.1725 val_bpb:1.2867 train_time:1384175ms step_avg:1384.17ms +step:1050/20000 train_loss:2.1507 grad_norm:0.1497 train_time:1453425ms step_avg:1384.21ms +step:1100/20000 train_loss:2.1755 grad_norm:0.0691 train_time:1522714ms step_avg:1384.29ms +step:1150/20000 train_loss:2.1286 grad_norm:0.0721 train_time:1592948ms step_avg:1385.17ms +step:1200/20000 train_loss:2.1760 grad_norm:0.0782 train_time:1662222ms step_avg:1385.19ms +step:1250/20000 train_loss:2.2002 grad_norm:0.0675 train_time:1731489ms step_avg:1385.19ms +step:1300/20000 train_loss:2.1691 grad_norm:0.0856 train_time:1800790ms step_avg:1385.22ms +step:1350/20000 train_loss:2.1443 grad_norm:0.0825 train_time:1870108ms step_avg:1385.27ms +step:1400/20000 train_loss:2.1563 grad_norm:0.0820 train_time:1939417ms step_avg:1385.30ms +step:1450/20000 train_loss:2.1534 grad_norm:0.0778 train_time:2008708ms step_avg:1385.32ms +step:1500/20000 train_loss:2.1264 grad_norm:0.1636 train_time:2077976ms step_avg:1385.32ms +step:1500/20000 val_loss:2.1131 val_bpb:1.2515 train_time:2077988ms step_avg:1385.33ms +step:1550/20000 train_loss:2.0937 grad_norm:0.0739 train_time:2148175ms step_avg:1385.92ms +step:1600/20000 train_loss:2.1730 grad_norm:0.0742 train_time:2217401ms step_avg:1385.88ms +step:1650/20000 train_loss:1.9579 grad_norm:0.0866 train_time:2286669ms step_avg:1385.86ms +step:1700/20000 train_loss:2.0866 grad_norm:0.0640 train_time:2356000ms step_avg:1385.88ms +step:1750/20000 train_loss:2.0575 grad_norm:0.0784 train_time:2425290ms step_avg:1385.88ms +step:1800/20000 train_loss:2.0953 grad_norm:0.0589 train_time:2495476ms step_avg:1386.38ms +step:1850/20000 train_loss:2.1090 grad_norm:0.1099 train_time:2564723ms step_avg:1386.34ms +step:1900/20000 train_loss:2.0553 grad_norm:0.0538 train_time:2633991ms step_avg:1386.31ms +step:1950/20000 train_loss:2.0417 grad_norm:0.0691 train_time:2703295ms step_avg:1386.31ms +step:2000/20000 train_loss:2.2933 grad_norm:0.0634 train_time:2772559ms step_avg:1386.28ms +step:2000/20000 val_loss:2.0736 val_bpb:1.2281 train_time:2772571ms step_avg:1386.29ms +step:2050/20000 train_loss:2.0610 grad_norm:0.0643 train_time:2841839ms step_avg:1386.26ms +step:2100/20000 train_loss:2.0352 grad_norm:0.0542 train_time:2911097ms step_avg:1386.24ms +step:2150/20000 train_loss:2.0150 grad_norm:0.0748 train_time:2980382ms 
step_avg:1386.22ms +step:2200/20000 train_loss:2.1675 grad_norm:0.0647 train_time:3050764ms step_avg:1386.71ms +step:2250/20000 train_loss:2.0588 grad_norm:0.0651 train_time:3120289ms step_avg:1386.80ms +step:2300/20000 train_loss:2.0371 grad_norm:0.0742 train_time:3189823ms step_avg:1386.88ms +step:2350/20000 train_loss:1.9911 grad_norm:0.0819 train_time:3259324ms step_avg:1386.95ms +step:2400/20000 train_loss:2.1049 grad_norm:0.0508 train_time:3329685ms step_avg:1387.37ms +step:2450/20000 train_loss:2.0658 grad_norm:0.0537 train_time:3398968ms step_avg:1387.33ms +step:2500/20000 train_loss:2.0210 grad_norm:0.0627 train_time:3468256ms step_avg:1387.30ms +step:2500/20000 val_loss:2.0271 val_bpb:1.2005 train_time:3468267ms step_avg:1387.31ms +step:2550/20000 train_loss:2.0220 grad_norm:0.0459 train_time:3537589ms step_avg:1387.29ms +step:2600/20000 train_loss:1.9997 grad_norm:0.0445 train_time:3606904ms step_avg:1387.27ms +step:2650/20000 train_loss:2.0041 grad_norm:0.0439 train_time:3676209ms step_avg:1387.25ms +step:2700/20000 train_loss:2.0259 grad_norm:0.0450 train_time:3745493ms step_avg:1387.22ms +step:2750/20000 train_loss:2.0067 grad_norm:0.0443 train_time:3814755ms step_avg:1387.18ms +step:2800/20000 train_loss:2.0409 grad_norm:0.0486 train_time:3884987ms step_avg:1387.50ms +step:2850/20000 train_loss:1.9897 grad_norm:0.0474 train_time:3954266ms step_avg:1387.46ms +step:2900/20000 train_loss:2.0047 grad_norm:0.0898 train_time:4023547ms step_avg:1387.43ms +step:2950/20000 train_loss:2.0428 grad_norm:0.0410 train_time:4092882ms step_avg:1387.42ms +step:3000/20000 train_loss:1.9290 grad_norm:0.0543 train_time:4163141ms step_avg:1387.71ms +step:3000/20000 val_loss:1.9832 val_bpb:1.1746 train_time:4163153ms step_avg:1387.72ms +step:3050/20000 train_loss:1.9349 grad_norm:0.0457 train_time:4232654ms step_avg:1387.76ms +step:3100/20000 train_loss:1.9977 grad_norm:0.0425 train_time:4302197ms step_avg:1387.81ms +swa:start step:3150 +step:3150/20000 train_loss:2.0068 grad_norm:0.0401 train_time:4371737ms step_avg:1387.85ms +step:3200/20000 train_loss:1.9809 grad_norm:0.0467 train_time:4441129ms step_avg:1387.85ms +late_qat:enabled step:3204 scale:0.1497 core_quant:on +step:3250/20000 train_loss:1.9488 grad_norm:0.0413 train_time:4597242ms step_avg:1414.54ms +step:3300/20000 train_loss:1.9274 grad_norm:0.0417 train_time:4666636ms step_avg:1414.13ms +step:3350/20000 train_loss:1.9650 grad_norm:0.0352 train_time:4736129ms step_avg:1413.77ms +step:3396/20000 val_loss:1.9539 val_bpb:1.1572 train_time:4800035ms step_avg:1413.44ms +stopping_early: wallclock_cap train_time:4800035ms step:3396/20000 +peak memory allocated: 50639 MiB reserved: 50682 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:1.9508 val_bpb:1.1554 eval_time:33632ms +Serialized model: 110942659 bytes +Code size: 102570 bytes +Serialized model int6+lzma: 17439360 bytes +Total submission size int6+lzma: 17541930 bytes +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] torch._dynamo hit config.recompile_limit (8) +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] function: 'forward' (/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py:1054) +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] last reason: 0/7: self._modules['blocks']._modules['0']._modules['attn']._modules['rotary']._cos_cached is None # self._cos_cached is None # 
records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py:590 in forward +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] User stack trace: +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1056, in forward +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] x, h_core_in, h_core_out = self._forward_hidden(input_ids, feedback_fn, stabilizer) +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1004, in _forward_hidden +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] x, raw_v = self.blocks[i](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 783, in forward +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] attn_out, raw_v = self.attn(self.attn_norm(x_in) * self.ln_scale_factor, q_w, k_w, v_w, out_w, v_embed=v_embed, v0=v0) +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 678, in forward +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] cos, sin = self.rotary(seqlen, x.device, q.dtype) +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 590, in forward +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] self._cos_cached is None +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] To diagnose recompilation issues, see https://docs.pytorch.org/docs/main/user_guide/torch_compiler/compile/programming_model.recompilation.html +Traceback (most recent call last): + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1773, in _compile + raise_unimplemented_cache_limit_exceeded() + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1757, in raise_unimplemented_cache_limit_exceeded + unimplemented( + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/exc.py", line 634, in unimplemented + raise Unsupported(msg, gb_type, skip_frame) +torch._dynamo.exc.Unsupported: Dynamo recompile limit exceeded + Explanation: Dynamo attempted to recompile the code object too many times, exceeding the recompile_limit cache size limit (currently set to 8). Excessive recompilations can degrade performance due to the compilation overhead of each recompilation. + Hint: To monitor recompilations, enable TORCH_LOGS=recompiles. If recompilations are expected, consider increasing torch._dynamo.config.recompile_limit to an appropriate value. + Hint: See https://docs.pytorch.org/docs/main/user_guide/torch_compiler/compile/programming_model.recompilation.html for tips on dealing with recompilations. 
+
+  Developer debug context: Limit type: recompile_limit
+
+ For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0039.html
+
+The above exception was the direct cause of the following exception:
+
+Traceback (most recent call last):
+  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2146, in <module>
+    main()
+  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2084, in main
+    q_val_loss, q_val_bpb = eval_val(
+                            ^^^^^^^^^
+  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 359, in eval_val
+    batch_loss = model(x, y).detach()
+                 ^^^^^^^^^^^
+  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__
+    return super().__call__(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
+    return self._call_impl(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
+    return forward_call(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper
+    return fn(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^
+  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
+    return self._call_impl(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
+    return forward_call(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 2316, in __call__
+    result = self._torchdynamo_orig_backend(
+             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 729, in __call__
+    result = _compile(
+             ^^^^^^^^^
+  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1780, in _compile
+    raise FailOnRecompileLimitHit(
+torch._dynamo.exc.FailOnRecompileLimitHit: Hard failure due to fullgraph=True
+wandb:
+wandb: 🚀 View run full_4pass_lora_r8_delayed_80min at: https://wandb.ai/propensity/parameter-golf/runs/3z8g4kez
+wandb: Find logs at: wandb/run-20260326_205139-3z8g4kez/logs
diff --git a/lora_r8_stdout.log b/lora_r8_stdout.log
new file mode 100644
index 0000000000..a471f3d9df
--- /dev/null
+++ b/lora_r8_stdout.log
@@ -0,0 +1,34 @@
+START full run: 4-pass LoRA-r8 delayed-warmup TTT SWA, 80min (Thu Mar 26 20:51:35 UTC 2026)
+
+FAILED (exit=1)
+  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__
+    return super().__call__(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
+    return self._call_impl(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File
"/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 2316, in __call__ + result = self._torchdynamo_orig_backend( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 729, in __call__ + result = _compile( + ^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1780, in _compile + raise FailOnRecompileLimitHit( +torch._dynamo.exc.FailOnRecompileLimitHit: Hard failure due to fullgraph=True +wandb: +wandb: 🚀 View run full_4pass_lora_r8_delayed_80min at: https://wandb.ai/propensity/parameter-golf/runs/3z8g4kez +wandb: Find logs at: wandb/run-20260326_205139-3z8g4kez/logs +FINISHED (Thu Mar 26 22:22:22 UTC 2026) diff --git a/lora_test_500step.log b/lora_test_500step.log new file mode 100644 index 0000000000..e41be67234 --- /dev/null +++ b/lora_test_500step.log @@ -0,0 +1,151 @@ +START LoRA test: 4-pass r8, 500 steps (Thu Mar 26 15:44:51 UTC 2026) +logs/45057043-f913-4db9-8041-977fbf564feb.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +lora: rank=2 params=307200 +recurrence: core_start=3 core_end=8 num_passes=4 stem=3 core=5 tail=3 +lora_optimizer: lr=0.0025000000000000005 (scalar_lr * 0.1) +model_params:27234400 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:500 warmup_steps:20 max_wallclock_seconds:0.000 +seed:1337 +wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY. +wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin +wandb: WARNING Using a boolean value for 'reinit' is deprecated. Use 'return_previous' or 'finish_previous' instead. +wandb: setting up run y5p28i5r +wandb: Tracking run with wandb version 0.25.1 +wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_154455-y5p28i5r +wandb: Run `wandb offline` to turn off syncing. 
+wandb: Syncing run lora_test_500step_r2_fixed +wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf +wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/y5p28i5r +wandb:initialized +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/500 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['14828.0', '14150.7', '13588.8', '13118.6', '12709.1', '12463.7', '12206.9', '11992.7', '11805.3', '11630.9', '12068.4', '11893.6', '11752.5', '11630.9', '11516.5', '11809.8', '11709.4', '11634.5', '11573.3', '11514.1'] growth=['0.947', '0.954', '0.960', '0.965', '0.969', '0.981', '0.979', '0.982', '0.984', '0.985', '0.988', '0.986', '0.988', '0.990', '0.990', '0.995', '0.992', '0.994', '0.995', '0.995'] +step:1/500 train_loss:6.9310 grad_norm:0.3717 train_time:3145ms step_avg:3144.67ms +step:2/500 train_loss:8.3473 grad_norm:3.5331 train_time:6277ms step_avg:3138.33ms +step:3/500 train_loss:7.5115 grad_norm:1.8520 train_time:9436ms step_avg:3145.34ms +step:4/500 train_loss:7.5611 grad_norm:1.8993 train_time:12598ms step_avg:3149.58ms +step:5/500 train_loss:7.3182 grad_norm:1.9103 train_time:15757ms step_avg:3151.41ms +step:6/500 train_loss:7.0753 grad_norm:1.7013 train_time:18921ms step_avg:3153.55ms +step:7/500 train_loss:6.9528 grad_norm:2.0667 train_time:22082ms step_avg:3154.56ms +step:8/500 train_loss:6.9028 grad_norm:1.4281 train_time:25244ms step_avg:3155.47ms +step:9/500 train_loss:6.5408 grad_norm:1.0079 train_time:28404ms step_avg:3156.00ms +step:10/500 train_loss:6.1499 grad_norm:0.9864 train_time:31565ms step_avg:3156.49ms +step:20/500 train_loss:4.7832 grad_norm:1.0980 train_time:63165ms step_avg:3158.27ms +step:30/500 train_loss:4.1875 grad_norm:1.0659 train_time:94790ms step_avg:3159.67ms +step:40/500 train_loss:3.8630 grad_norm:0.8877 train_time:126560ms step_avg:3164.01ms +step:50/500 train_loss:3.6884 grad_norm:0.7170 train_time:158209ms step_avg:3164.18ms +step:50/500 val_loss:3.6586 val_bpb:2.1668 train_time:158241ms step_avg:3164.81ms h_norms=['12735.5', '11200.0', '10164.1', '9525.6', '9159.4', '9155.7', '9167.3', '9205.2', '9275.0', '9389.5', '9135.8', '9214.4', '9298.5', '9399.7', '9535.9', '9187.1', '9297.4', '9404.6', '9522.6', '9670.8'] growth=['0.845', '0.879', '0.908', '0.937', '0.962', '1.000', '1.001', '1.004', '1.008', '1.012', '1.011', '1.009', '1.009', '1.011', '1.014', '1.016', '1.012', '1.012', '1.013', '1.016'] +step:60/500 train_loss:3.5065 grad_norm:1.0078 train_time:189842ms step_avg:3164.03ms +step:70/500 train_loss:3.4063 grad_norm:0.7213 train_time:221493ms step_avg:3164.18ms +step:80/500 train_loss:3.3329 grad_norm:0.5494 train_time:253155ms step_avg:3164.43ms +step:90/500 train_loss:3.1786 grad_norm:0.4390 train_time:284819ms step_avg:3164.66ms +step:100/500 train_loss:3.1304 grad_norm:0.5572 train_time:316454ms step_avg:3164.54ms +step:100/500 val_loss:3.0898 val_bpb:1.8300 train_time:316486ms step_avg:3164.86ms h_norms=['13387.7', '11921.1', '11209.5', '11093.3', '11390.5', '11631.9', '12018.0', '12495.4', '13068.8', '13826.8', '12333.3', '12773.9', '13286.8', '13868.9', '14643.5', '13114.6', '13524.0', '14018.2', '14576.3', '15343.0'] growth=['0.849', '0.890', '0.940', '0.990', 
'1.027', '1.021', '1.033', '1.040', '1.046', '1.058', '1.018', '1.036', '1.040', '1.044', '1.056', '1.006', '1.031', '1.037', '1.040', '1.053'] +step:110/500 train_loss:3.0264 grad_norm:0.3633 train_time:348121ms step_avg:3164.73ms +step:120/500 train_loss:2.9410 grad_norm:0.3409 train_time:379800ms step_avg:3165.00ms +step:130/500 train_loss:2.8723 grad_norm:0.3861 train_time:411469ms step_avg:3165.15ms +step:140/500 train_loss:2.8200 grad_norm:0.2985 train_time:443103ms step_avg:3165.02ms +step:150/500 train_loss:2.7805 grad_norm:0.3407 train_time:474745ms step_avg:3164.97ms +step:150/500 val_loss:2.7663 val_bpb:1.6384 train_time:474777ms step_avg:3165.18ms h_norms=['14917.2', '13747.5', '13056.9', '12886.7', '13272.0', '13836.5', '13983.7', '14170.1', '14538.1', '15295.8', '14298.5', '14552.7', '14814.1', '15249.9', '16101.4', '14844.7', '15109.6', '15379.1', '15827.9', '16720.2'] growth=['0.910', '0.922', '0.950', '0.987', '1.030', '1.043', '1.011', '1.013', '1.026', '1.052', '1.027', '1.018', '1.018', '1.029', '1.056', '1.002', '1.018', '1.018', '1.029', '1.056'] +step:160/500 train_loss:2.7721 grad_norm:0.4282 train_time:506415ms step_avg:3165.09ms +step:170/500 train_loss:2.7118 grad_norm:0.3055 train_time:538058ms step_avg:3165.05ms +step:180/500 train_loss:2.6200 grad_norm:0.2472 train_time:569702ms step_avg:3165.01ms +step:190/500 train_loss:2.6444 grad_norm:0.3218 train_time:601362ms step_avg:3165.06ms +step:200/500 train_loss:2.5645 grad_norm:0.2424 train_time:633157ms step_avg:3165.79ms +step:200/500 val_loss:2.6022 val_bpb:1.5411 train_time:633189ms step_avg:3165.94ms h_norms=['17029.4', '15746.8', '14904.1', '14604.7', '14967.2', '16073.3', '15851.9', '15730.9', '15825.5', '16423.3', '16300.0', '16203.6', '16170.9', '16336.4', '17025.2', '16594.0', '16523.8', '16512.3', '16692.3', '17399.1'] growth=['0.922', '0.925', '0.946', '0.980', '1.025', '1.074', '0.986', '0.992', '1.006', '1.038', '1.056', '0.994', '0.998', '1.010', '1.042', '1.026', '0.996', '0.999', '1.011', '1.042'] +step:210/500 train_loss:2.5650 grad_norm:0.3549 train_time:664794ms step_avg:3165.68ms +step:220/500 train_loss:2.6050 grad_norm:0.3495 train_time:696436ms step_avg:3165.62ms +step:230/500 train_loss:2.5417 grad_norm:0.3301 train_time:728081ms step_avg:3165.57ms +step:240/500 train_loss:2.5378 grad_norm:0.2441 train_time:759737ms step_avg:3165.57ms +step:250/500 train_loss:2.5756 grad_norm:0.3062 train_time:791358ms step_avg:3165.43ms +step:250/500 val_loss:2.5268 val_bpb:1.4965 train_time:791390ms step_avg:3165.56ms h_norms=['18846.3', '17410.6', '16438.8', '15959.5', '16305.4', '17764.0', '17388.4', '17140.6', '17049.7', '17635.7', '17928.4', '17683.9', '17472.3', '17397.4', '17960.3', '18072.6', '17862.5', '17642.2', '17546.1', '18026.9'] growth=['0.917', '0.924', '0.944', '0.971', '1.022', '1.089', '0.979', '0.986', '0.995', '1.034', '1.076', '0.986', '0.988', '0.996', '1.032', '1.045', '0.988', '0.988', '0.995', '1.027'] +step:260/500 train_loss:2.5344 grad_norm:0.2526 train_time:823044ms step_avg:3165.55ms +step:270/500 train_loss:2.4962 grad_norm:0.3197 train_time:854686ms step_avg:3165.50ms +step:280/500 train_loss:2.4329 grad_norm:0.2331 train_time:886309ms step_avg:3165.39ms +step:290/500 train_loss:2.4736 grad_norm:0.2550 train_time:917933ms step_avg:3165.29ms +step:300/500 train_loss:2.4380 grad_norm:0.2264 train_time:949572ms step_avg:3165.24ms +step:300/500 val_loss:2.4532 val_bpb:1.4529 train_time:949604ms step_avg:3165.35ms h_norms=['20941.1', '19182.7', '18052.0', '17457.7', '17865.7', 
'19511.0', '19019.9', '18704.8', '18540.8', '19344.5', '19674.8', '19330.4', '18996.2', '18791.5', '19454.1', '19743.2', '19426.2', '19055.5', '18799.3', '19275.4'] growth=['0.909', '0.916', '0.941', '0.967', '1.023', '1.092', '0.975', '0.983', '0.991', '1.043', '1.080', '0.982', '0.983', '0.989', '1.035', '1.050', '0.984', '0.981', '0.987', '1.025'] +step:310/500 train_loss:2.3557 grad_norm:0.2355 train_time:981222ms step_avg:3165.23ms +step:320/500 train_loss:2.4268 grad_norm:0.2275 train_time:1012875ms step_avg:3165.24ms +step:330/500 train_loss:2.4731 grad_norm:0.2582 train_time:1044522ms step_avg:3165.22ms +step:340/500 train_loss:2.3813 grad_norm:0.2222 train_time:1076174ms step_avg:3165.22ms +step:350/500 train_loss:2.4827 grad_norm:0.2099 train_time:1107944ms step_avg:3165.55ms +step:350/500 val_loss:2.4038 val_bpb:1.4237 train_time:1107975ms step_avg:3165.64ms h_norms=['23083.6', '21048.2', '19811.9', '19011.2', '19310.5', '21330.8', '20798.1', '20502.5', '20108.1', '20815.6', '21418.1', '21061.6', '20705.5', '20224.1', '20676.4', '21384.5', '21039.3', '20616.6', '20063.3', '20274.8'] growth=['0.904', '0.912', '0.941', '0.960', '1.016', '1.105', '0.975', '0.986', '0.981', '1.035', '1.096', '0.983', '0.983', '0.977', '1.022', '1.067', '0.984', '0.980', '0.973', '1.011'] +step:360/500 train_loss:2.2488 grad_norm:0.2012 train_time:1139604ms step_avg:3165.57ms +step:370/500 train_loss:2.4498 grad_norm:0.1714 train_time:1171266ms step_avg:3165.58ms +step:380/500 train_loss:2.3948 grad_norm:0.2284 train_time:1202904ms step_avg:3165.54ms +step:390/500 train_loss:2.3586 grad_norm:0.1837 train_time:1234583ms step_avg:3165.60ms +step:400/500 train_loss:2.4046 grad_norm:0.2053 train_time:1266209ms step_avg:3165.52ms +step:400/500 val_loss:2.3705 val_bpb:1.4040 train_time:1266241ms step_avg:3165.60ms h_norms=['25593.2', '23180.9', '21902.2', '20833.8', '21042.7', '23067.7', '22480.5', '22319.1', '21743.0', '22667.8', '23199.6', '22798.3', '22512.9', '21781.8', '22345.9', '23150.2', '22743.8', '22358.7', '21535.2', '21760.3'] growth=['0.893', '0.906', '0.945', '0.951', '1.010', '1.096', '0.975', '0.993', '0.974', '1.043', '1.091', '0.983', '0.987', '0.968', '1.026', '1.067', '0.982', '0.983', '0.963', '1.010'] +step:410/500 train_loss:2.3694 grad_norm:0.2156 train_time:1297880ms step_avg:3165.56ms +step:420/500 train_loss:2.4022 grad_norm:0.1854 train_time:1329534ms step_avg:3165.56ms +step:430/500 train_loss:2.3222 grad_norm:0.2276 train_time:1361181ms step_avg:3165.54ms +step:440/500 train_loss:2.4095 grad_norm:0.1854 train_time:1392812ms step_avg:3165.48ms +step:450/500 train_loss:2.2365 grad_norm:0.2154 train_time:1424471ms step_avg:3165.49ms +step:450/500 val_loss:2.3386 val_bpb:1.3851 train_time:1424502ms step_avg:3165.56ms h_norms=['28710.0', '25807.1', '24367.3', '23021.1', '23326.2', '25712.0', '25093.1', '24936.9', '24053.5', '25235.9', '25868.9', '25395.7', '25094.3', '23995.7', '24591.2', '25723.8', '25209.5', '24761.3', '23522.8', '23651.5'] growth=['0.884', '0.899', '0.944', '0.945', '1.013', '1.102', '0.976', '0.994', '0.965', '1.049', '1.100', '0.982', '0.988', '0.956', '1.025', '1.077', '0.980', '0.982', '0.950', '1.005'] +step:460/500 train_loss:2.3715 grad_norm:0.2246 train_time:1456119ms step_avg:3165.48ms +step:470/500 train_loss:2.3155 grad_norm:0.2162 train_time:1487752ms step_avg:3165.43ms +step:480/500 train_loss:2.2437 grad_norm:0.1745 train_time:1519404ms step_avg:3165.42ms +step:490/500 train_loss:2.2728 grad_norm:0.2349 train_time:1551048ms step_avg:3165.41ms 
+step:500/500 train_loss:2.2855 grad_norm:0.1518 train_time:1582691ms step_avg:3165.38ms +step:500/500 val_loss:2.3106 val_bpb:1.3684 train_time:1582722ms step_avg:3165.44ms h_norms=['31454.2', '27922.0', '26377.7', '24565.6', '24573.1', '27173.7', '26402.9', '26559.0', '25384.0', '26739.2', '27427.3', '26814.3', '26736.1', '25274.3', '25895.2', '27331.6', '26687.2', '26396.8', '24736.9', '24660.2'] growth=['0.877', '0.888', '0.945', '0.931', '1.000', '1.106', '0.972', '1.006', '0.956', '1.053', '1.108', '0.978', '0.997', '0.945', '1.025', '1.088', '0.976', '0.989', '0.937', '0.997'] +peak memory allocated: 66505 MiB reserved: 67408 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:2.6566 val_bpb:1.5734 eval_time:85304ms +Serialized model: 107256259 bytes +Code size: 102003 bytes +Serialized model int6+lzma: 9596524 bytes +Total submission size int6+lzma: 9698527 bytes +final_int6_roundtrip val_loss:2.7108 val_bpb:1.6055 eval_time:84596ms +final_int6_roundtrip_exact val_loss:2.71075277 val_bpb:1.60546048 +wandb: updating run metadata +wandb: uploading history steps 59-59, summary, console lines 98-99 +wandb: +wandb: Run history: +wandb: grad_norm █▅▅▅▄▄▃▃▃▃▃▂▂▂▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ +wandb: lr_scale ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ +wandb: step_avg_ms ▃▁▃▄▅▆▆▆████████████████████████████████ +wandb: train_loss █▇▇▇▇▆▆▅▄▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ +wandb: val_bpb █▃▂▂▁▁▁▁▁▁▁ +wandb: val_loss █▃▂▂▁▁▁▁▁▁▁ +wandb: +wandb: Run summary: +wandb: grad_norm 0.15179 +wandb: lr_scale 1 +wandb: step_avg_ms 3165.38107 +wandb: train_loss 2.28552 +wandb: val_bpb 1.36844 +wandb: val_loss 2.31056 +wandb: +wandb: 🚀 View run lora_test_500step_r2_fixed at: https://wandb.ai/propensity/parameter-golf/runs/y5p28i5r +wandb: ⭐️ View project at: https://wandb.ai/propensity/parameter-golf +wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s) +wandb: Find logs at: ./wandb/run-20260326_154455-y5p28i5r/logs diff --git a/lora_test_r8_500step.log b/lora_test_r8_500step.log new file mode 100644 index 0000000000..33ecb6e4e1 --- /dev/null +++ b/lora_test_r8_500step.log @@ -0,0 +1,92 @@ +START LoRA test: 4-pass r8, 500 steps (Thu Mar 26 20:08:22 UTC 2026) +logs/906c7481-3ed5-48d5-a973-19e9e09e305e.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +lora: rank=8 params=1228800 optimizer=scalar lr=0.025 +recurrence: core_start=3 core_end=8 num_passes=4 stem=3 core=5 tail=3 +model_params:28156000 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:500 warmup_steps:20 max_wallclock_seconds:0.000 +seed:1337 +wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY. +wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin +wandb: WARNING Using a boolean value for 'reinit' is deprecated. Use 'return_previous' or 'finish_previous' instead. 
+wandb: setting up run h4wnno7e +wandb: Tracking run with wandb version 0.25.1 +wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_200826-h4wnno7e +wandb: Run `wandb offline` to turn off syncing. +wandb: Syncing run lora_test_500step_r8_fixed +wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf +wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/h4wnno7e +wandb:initialized +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.02ms jpw:0.1000 +step:1/500 train_loss:6.9303 train_time:3121ms step_avg:3121.37ms +step:2/500 train_loss:8.2964 train_time:6284ms step_avg:3141.84ms +step:3/500 train_loss:7.7322 train_time:9457ms step_avg:3152.21ms +step:4/500 train_loss:8.5580 train_time:12628ms step_avg:3156.95ms +step:5/500 train_loss:8.4686 train_time:15797ms step_avg:3159.39ms +step:6/500 train_loss:7.7993 train_time:18963ms step_avg:3160.57ms +step:7/500 train_loss:7.2392 train_time:22130ms step_avg:3161.45ms +step:8/500 train_loss:7.0090 train_time:25296ms step_avg:3162.02ms +step:9/500 train_loss:6.5969 train_time:28465ms step_avg:3162.73ms +step:10/500 train_loss:6.4712 train_time:31634ms step_avg:3163.42ms +step:20/500 train_loss:5.4373 train_time:63316ms step_avg:3165.82ms +step:30/500 train_loss:4.7425 train_time:95013ms step_avg:3167.10ms +step:40/500 train_loss:4.5571 train_time:126854ms step_avg:3171.35ms +step:50/500 train_loss:4.2765 train_time:158552ms step_avg:3171.04ms +step:50/500 val_loss:4.2310 val_bpb:2.5059 train_time:158584ms step_avg:3171.67ms h_norms=['128020.9', '142329.3', '150817.1', '164904.4', '172878.4', '221093.7', '304101.2', '290425.7', '279737.3', '288668.9', '254141.1', '302318.5', '342082.8', '372346.5', '408459.7', '357174.3', '431074.6', '494662.8', '539688.3', '601738.6'] growth=['1.137', '1.112', '1.060', '1.093', '1.048', '1.279', '1.375', '0.955', '0.963', '1.032', '1.207', '1.190', '1.132', '1.088', '1.097', '1.197', '1.207', '1.148', '1.091', '1.115'] jpw:0.1000 +step:60/500 train_loss:4.0440 train_time:190235ms step_avg:3170.59ms +step:70/500 train_loss:3.9054 train_time:221921ms step_avg:3170.31ms +step:80/500 train_loss:3.8129 train_time:253616ms step_avg:3170.20ms +step:90/500 train_loss:3.6730 train_time:285312ms step_avg:3170.13ms +step:100/500 train_loss:3.6064 train_time:317006ms step_avg:3170.06ms +step:100/500 val_loss:3.5712 val_bpb:2.1151 train_time:317038ms step_avg:3170.38ms h_norms=['217506.4', '238701.1', '250839.5', '253256.4', '255121.4', '309055.8', '367379.8', '398821.4', '411101.1', '432634.1', '369051.3', '439129.6', '489535.2', '523037.9', '563631.0', '484894.3', '584870.6', '666437.0', '722848.2', '791578.9'] growth=['1.119', '1.097', '1.051', '1.010', '1.007', '1.211', '1.189', '1.086', '1.031', '1.052', '1.215', '1.190', '1.115', '1.068', '1.078', '1.215', '1.206', '1.139', '1.085', '1.095'] jpw:0.1000 +step:110/500 train_loss:3.5100 train_time:348728ms step_avg:3170.25ms +step:120/500 train_loss:3.4055 train_time:380452ms step_avg:3170.43ms +step:130/500 train_loss:3.3291 train_time:412209ms step_avg:3170.84ms +step:140/500 
train_loss:3.2646 train_time:443910ms step_avg:3170.78ms +step:150/500 train_loss:3.1985 train_time:475595ms step_avg:3170.63ms +step:150/500 val_loss:3.1731 val_bpb:1.8793 train_time:475626ms step_avg:3170.84ms h_norms=['202425.6', '219581.3', '229199.7', '231251.1', '232042.3', '280260.3', '331272.2', '359679.4', '373077.9', '392866.2', '327880.4', '388136.0', '431546.2', '460585.2', '494839.2', '422156.4', '505718.2', '574404.4', '622526.9', '679339.4'] growth=['1.104', '1.085', '1.044', '1.009', '1.003', '1.208', '1.182', '1.086', '1.037', '1.053', '1.208', '1.184', '1.112', '1.067', '1.074', '1.212', '1.198', '1.136', '1.084', '1.091'] jpw:0.1000 +step:160/500 train_loss:3.1467 train_time:507288ms step_avg:3170.55ms +step:170/500 train_loss:3.0741 train_time:538985ms step_avg:3170.50ms +step:180/500 train_loss:2.9611 train_time:570664ms step_avg:3170.36ms +step:190/500 train_loss:2.9609 train_time:602355ms step_avg:3170.29ms +step:200/500 train_loss:2.8727 train_time:634178ms step_avg:3170.89ms +step:200/500 val_loss:2.9169 val_bpb:1.7275 train_time:634210ms step_avg:3171.05ms h_norms=['195918.4', '211776.8', '222459.3', '225583.9', '227291.6', '278823.8', '321587.3', '350774.8', '365721.4', '384108.5', '319066.1', '375610.7', '414454.6', '441410.4', '472513.9', '402015.0', '479212.5', '542916.8', '583835.1', '633030.1'] growth=['1.112', '1.081', '1.050', '1.014', '1.008', '1.227', '1.153', '1.091', '1.043', '1.050', '1.204', '1.177', '1.103', '1.065', '1.070', '1.204', '1.192', '1.133', '1.075', '1.084'] jpw:0.1000 +step:210/500 train_loss:2.8522 train_time:665890ms step_avg:3170.90ms +step:220/500 train_loss:2.9033 train_time:697577ms step_avg:3170.81ms +step:230/500 train_loss:2.8202 train_time:729281ms step_avg:3170.79ms +step:240/500 train_loss:2.8185 train_time:760982ms step_avg:3170.76ms +step:250/500 train_loss:2.8500 train_time:792694ms step_avg:3170.78ms +step:250/500 val_loss:2.7950 val_bpb:1.6554 train_time:792726ms step_avg:3170.90ms h_norms=['187987.2', '202979.7', '214368.9', '220031.3', '223903.4', '284093.7', '324382.2', '353471.8', '371760.4', '389398.9', '318921.8', '370811.2', '408036.2', '436269.2', '464205.0', '395210.2', '465609.5', '525084.6', '560371.8', '601627.7'] growth=['1.132', '1.080', '1.056', '1.026', '1.018', '1.269', '1.142', '1.090', '1.052', '1.047', '1.203', '1.163', '1.100', '1.069', '1.064', '1.198', '1.178', '1.128', '1.067', '1.074'] jpw:0.1000 +step:260/500 train_loss:2.7996 train_time:824394ms step_avg:3170.75ms +step:270/500 train_loss:2.7542 train_time:856111ms step_avg:3170.78ms +step:280/500 train_loss:2.6927 train_time:887831ms step_avg:3170.82ms +step:290/500 train_loss:2.7306 train_time:919538ms step_avg:3170.82ms diff --git a/lora_test_r8_stdout.log b/lora_test_r8_stdout.log new file mode 100644 index 0000000000..206206c684 --- /dev/null +++ b/lora_test_r8_stdout.log @@ -0,0 +1,4 @@ +START LoRA test: 4-pass r8, 500 steps (Thu Mar 26 20:08:22 UTC 2026) +Terminated +run_lora_test_r8.sh: line 70: syntax error near unexpected token `then' +run_lora_test_r8.sh: line 70: `EXIT -ne 0 ]; then' diff --git a/lora_test_stdout.log b/lora_test_stdout.log new file mode 100644 index 0000000000..e564cf1bc3 --- /dev/null +++ b/lora_test_stdout.log @@ -0,0 +1,10 @@ +START LoRA test: 4-pass r8, 500 steps (Thu Mar 26 15:27:19 UTC 2026) +Terminated + +=== FINAL RESULTS === +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.02ms 
h_norms=['3412002.5', '3162921.2', '9173489.0', '7602557.5', '6195891.5', '5488329.5', '11289328.0', '9569341.0', '7780140.0', '6543904.0', '3864158.2', '3367786.2', '3060453.0', '2578833.0', '2199645.5', '3025920.8', '2612817.8', '2319063.8', '1985769.4', '1676057.1'] growth=['40.638', '0.927', '2.900', '0.829', '0.815', '0.886', '2.057', '0.848', '0.813', '0.841', '0.926', '0.872', '0.909', '0.843', '0.853', '0.893', '0.863', '0.888', '0.856', '0.844'] +step:50/500 val_loss:5.6210 val_bpb:3.3291 train_time:156832ms step_avg:3136.63ms h_norms=['1483117.2', '2001691.6', '5509635.5', '3288294.0', '2112858.5', '9030652.0', '23049164.0', '14409614.0', '9373722.0', '5569532.0', '1012941.9', '764749.4', '575192.3', '488720.7', '685087.8', '1171268.4', '1059284.2', '703405.6', '689976.9', '514776.4'] growth=['1.514', '1.350', '2.752', '0.597', '0.643', '4.274', '2.552', '0.625', '0.651', '0.594', '0.774', '0.755', '0.752', '0.850', '1.402', '1.114', '0.904', '0.664', '0.981', '0.746'] +step:100/500 val_loss:4.2203 val_bpb:2.4995 train_time:313716ms step_avg:3137.16ms h_norms=['2539958.8', '1891926.2', '1842518.8', '1259791.9', '834439.0', '2452532.2', '7493377.5', '4728913.5', '3378154.8', '1967445.1', '10342628.0', '6845392.5', '4585671.5', '3120444.5', '2068628.8', '1138807.9', '1212089.2', '1098464.0', '943749.6', '733953.3'] growth=['3.252', '0.745', '0.974', '0.684', '0.662', '2.939', '3.055', '0.631', '0.714', '0.582', '12.042', '0.662', '0.670', '0.680', '0.663', '1.142', '1.064', '0.906', '0.859', '0.778'] +step:150/500 val_loss:3.6143 val_bpb:2.1406 train_time:470698ms step_avg:3137.98ms h_norms=['1691694.6', '1214504.1', '1032287.4', '726402.4', '507756.3', '1093338.5', '3688532.0', '2256534.5', '1410329.9', '885931.9', '2614546.0', '1920724.0', '1389947.5', '1044625.5', '756565.3', '642715.2', '699743.8', '648021.7', '565602.9', '444204.0'] growth=['3.923', '0.718', '0.850', '0.704', '0.699', '2.153', '3.374', '0.612', '0.625', '0.628', '5.054', '0.735', '0.724', '0.752', '0.724', '1.191', '1.089', '0.926', '0.873', '0.785'] +FINISHED (Thu Mar 26 15:44:36 UTC 2026) diff --git a/lora_test_v2_stdout.log b/lora_test_v2_stdout.log new file mode 100644 index 0000000000..4027fa7a9c --- /dev/null +++ b/lora_test_v2_stdout.log @@ -0,0 +1,9 @@ +START LoRA test: 4-pass r8, 500 steps (Thu Mar 26 15:44:51 UTC 2026) + +=== FINAL RESULTS === +DIAGNOSTIC post_ema val_loss:2.6566 val_bpb:1.5734 eval_time:85304ms +final_int6_roundtrip val_loss:2.7108 val_bpb:1.6055 eval_time:84596ms +final_int6_roundtrip_exact val_loss:2.71075277 val_bpb:1.60546048 +wandb: val_bpb █▃▂▂▁▁▁▁▁▁▁ +wandb: val_bpb 1.36844 +FINISHED (Thu Mar 26 16:31:26 UTC 2026) diff --git a/records/full_baseline(save).log b/records/full_baseline(save).log new file mode 100644 index 0000000000..e5f8897ff6 --- /dev/null +++ b/records/full_baseline(save).log @@ -0,0 +1,723 @@ +START full run: 4-pass baseline (no LoRA) TTT SWA, 80min (Thu Mar 26 23:01:24 UTC 2026) +logs/f98390b0-4cf3-4710-b1f1-a48bf3145a00.txt +logs/5733dda4-95d3-4001-9992-a963924a436b.txt +logs/6326f57e-9e2a-4921-b0d9-138db532ff5b.txt +logs/03b40738-df0a-452f-9ffb-6cfcc1cfc8e8.txt +logs/a7ef90b3-b0b5-4e27-87c5-52fc39836bed.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards 
pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=3 core_end=8 num_passes=4 stem=3 core=5 tail=3 +model_params:26927200 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:4800.000 +seed:1337 +wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY. +wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin +wandb: WARNING Using a boolean value for 'reinit' is deprecated. Use 'return_previous' or 'finish_previous' instead. +wandb: setting up run qltwebo4 +wandb: Tracking run with wandb version 0.25.1 +wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4 +wandb: Run `wandb offline` to turn off syncing. 
+wandb: Syncing run full_4pass_baseline_80min +wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf +wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/qltwebo4 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=3 core_end=8 num_passes=4 stem=3 core=5 tail=3 +wandb:initialized +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=3 core_end=8 num_passes=4 stem=3 core=5 tail=3 +model_params:26927200 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:4800.000 +seed:1337 +model_params:26927200 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:4800.000 +seed:1337 +wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY. +wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY. +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=3 core_end=8 num_passes=4 stem=3 core=5 tail=3 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=3 core_end=8 num_passes=4 stem=3 core=5 tail=3 +model_params:26927200 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:4800.000 +seed:1337 +wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin +wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin +wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY. +model_params:26927200 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:4800.000 +seed:1337 +wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY. +wandb: WARNING Using a boolean value for 'reinit' is deprecated. Use 'return_previous' or 'finish_previous' instead. +wandb: WARNING Using a boolean value for 'reinit' is deprecated. Use 'return_previous' or 'finish_previous' instead. +wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. 
Use `wandb login --relogin` to force relogin +wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin +wandb: WARNING Using a boolean value for 'reinit' is deprecated. Use 'return_previous' or 'finish_previous' instead. +wandb: WARNING Using a boolean value for 'reinit' is deprecated. Use 'return_previous' or 'finish_previous' instead. +wandb: setting up run fsi4c82a +wandb: setting up run zcabiozu +wandb: Tracking run with wandb version 0.25.1 +wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu +wandb: Run `wandb offline` to turn off syncing. +wandb: Syncing run full_4pass_baseline_80min +wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf +wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/zcabiozu +wandb: Tracking run with wandb version 0.25.1 +wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a +wandb: Run `wandb offline` to turn off syncing. +wandb: Syncing run full_4pass_baseline_80min +wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf +wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/fsi4c82a +wandb:initialized +wandb:initialized +wandb: setting up run 43bipylb +wandb: setting up run jkh80zal +wandb: Tracking run with wandb version 0.25.1 +wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal +wandb: Run `wandb offline` to turn off syncing. +wandb: Syncing run full_4pass_baseline_80min +wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf +wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/jkh80zal +wandb: Tracking run with wandb version 0.25.1 +wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb +wandb: Run `wandb offline` to turn off syncing. 
+wandb: Syncing run full_4pass_baseline_80min +wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf +wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/43bipylb +wandb:initialized +wandb:initialized +Traceback (most recent call last): + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2148, in + main() + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1813, in main + warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__ + return super().__call__(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1056, in forward + def forward(self, input_ids: Tensor, target_ids: Tensor, + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1200, in forward + return compiled_fn(full_args) + ^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 566, in runtime_wrapper + all_outs = call_func_at_runtime_with_args( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args + out = normalize_as_list(f(args)) + ^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 105, in g + return f(*args) + ^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 596, in apply + return super().apply(*args, **kwargs) # type: ignore[misc] + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2503, in forward + fw_outs = call_func_at_runtime_with_args( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File 
"/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args + out = normalize_as_list(f(args)) + ^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 783, in wrapper + return compiled_fn(runtime_args) + ^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1011, in inner_fn + outs = compiled_fn(args) + ^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__ + return self.current_callable(inputs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run + out = model(new_inputs) + ^^^^^^^^^^^^^^^^^ + File "/tmp/torchinductor_nesta/3d/c3dzegwqai5cgqo2lurmxd3pfex6kuqjwm3ooueib5swt7sgm5kg.py", line 9077, in call + buf776 = empty_strided_cuda((48, 2048, 512), (1048576, 512, 1), torch.bfloat16) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 24.69 MiB is free. Process 448538 has 754.00 MiB memory in use. Process 448539 has 756.00 MiB memory in use. Process 448534 has 51.16 GiB memory in use. Including non-PyTorch memory, this process has 38.69 GiB memory in use. Process 448537 has 48.42 GiB memory in use. Of the allocated memory 38.01 GiB is allocated by PyTorch, and 15.38 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf) +Traceback (most recent call last): + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2148, in + main() + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1813, in main + warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__ + return super().__call__(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1056, in forward + def forward(self, input_ids: Tensor, target_ids: Tensor, + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1200, in forward + return compiled_fn(full_args) + ^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 566, in runtime_wrapper + all_outs = call_func_at_runtime_with_args( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args + out = normalize_as_list(f(args)) + ^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 105, in g + return f(*args) + ^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 596, in apply + return super().apply(*args, **kwargs) # type: ignore[misc] + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2503, in forward + fw_outs = call_func_at_runtime_with_args( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args + out = 
normalize_as_list(f(args)) + ^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 783, in wrapper + return compiled_fn(runtime_args) + ^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1011, in inner_fn + outs = compiled_fn(args) + ^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__ + return self.current_callable(inputs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run + out = model(new_inputs) + ^^^^^^^^^^^^^^^^^ + File "/tmp/torchinductor_nesta/3d/c3dzegwqai5cgqo2lurmxd3pfex6kuqjwm3ooueib5swt7sgm5kg.py", line 7110, in call + buf5 = empty_strided_cuda((98304, 512), (512, 1), torch.bfloat16) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 20.69 MiB is free. Including non-PyTorch memory, this process has 758.00 MiB memory in use. Process 448539 has 756.00 MiB memory in use. Process 448534 has 51.16 GiB memory in use. Process 448547 has 38.69 GiB memory in use. Process 448537 has 48.42 GiB memory in use. Of the allocated memory 127.50 MiB is allocated by PyTorch, and 18.50 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf) +Traceback (most recent call last): + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2148, in + main() + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1813, in main + warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__ + return super().__call__(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File 
"/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1056, in forward + def forward(self, input_ids: Tensor, target_ids: Tensor, + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1200, in forward + return compiled_fn(full_args) + ^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 566, in runtime_wrapper + all_outs = call_func_at_runtime_with_args( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args + out = normalize_as_list(f(args)) + ^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 105, in g + return f(*args) + ^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 596, in apply + return super().apply(*args, **kwargs) # type: ignore[misc] + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2503, in forward + fw_outs = call_func_at_runtime_with_args( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args + out = normalize_as_list(f(args)) + ^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 783, in wrapper + return compiled_fn(runtime_args) + ^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1011, in inner_fn + outs = compiled_fn(args) + ^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__ + return self.current_callable(inputs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run + out = model(new_inputs) + ^^^^^^^^^^^^^^^^^ + File "/tmp/torchinductor_nesta/3d/c3dzegwqai5cgqo2lurmxd3pfex6kuqjwm3ooueib5swt7sgm5kg.py", line 7110, in call + buf5 = empty_strided_cuda((98304, 512), (512, 1), torch.bfloat16) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 18.69 MiB is free. Process 448538 has 758.00 MiB memory in use. Including non-PyTorch memory, this process has 758.00 MiB memory in use. Process 448534 has 51.16 GiB memory in use. Process 448547 has 38.69 GiB memory in use. Process 448537 has 48.42 GiB memory in use. Of the allocated memory 127.50 MiB is allocated by PyTorch, and 18.50 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf) +Traceback (most recent call last): + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2148, in + main() + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1814, in main + (warmup_loss * grad_scale).backward() + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_tensor.py", line 631, in backward + torch.autograd.backward( + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/__init__.py", line 381, in backward + _engine_run_backward( + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/graph.py", line 869, in _engine_run_backward + return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 317, in apply + return user_fn(self, *args) + ^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2748, in backward + return impl_fn() + ^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2734, in impl_fn + out = CompiledFunction._backward_impl(ctx, all_args) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2918, in _backward_impl + out = call_func_at_runtime_with_args( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args + out = normalize_as_list(f(args)) + ^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__ + return self.current_callable(inputs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run + out = model(new_inputs) + ^^^^^^^^^^^^^^^^^ + File "/tmp/torchinductor_nesta/wm/cwmtq2g54vzhxzvsc4odvxx7srlroi2tosqcftjhmyw2c637ogrq.py", line 12255, in call + buf54 = empty_strided_cuda((48, 2048, 512), (1048576, 512, 1), torch.bfloat16) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 18.69 MiB is free. Process 448538 has 758.00 MiB memory in use. Process 448539 has 758.00 MiB memory in use. Process 448534 has 51.16 GiB memory in use. Process 448547 has 38.69 GiB memory in use. Including non-PyTorch memory, this process has 48.42 GiB memory in use. Of the allocated memory 47.76 GiB is allocated by PyTorch, and 3.37 MiB is reserved by PyTorch but unallocated. 
If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf) +wandb: +wandb: 🚀 View run full_4pass_baseline_80min at: https://wandb.ai/propensity/parameter-golf/runs/jkh80zal +wandb: Find logs at: wandb/run-20260326_230158-jkh80zal/logs +wandb: +wandb: 🚀 View run full_4pass_baseline_80min at: https://wandb.ai/propensity/parameter-golf/runs/fsi4c82a +wandb: Find logs at: wandb/run-20260326_230158-fsi4c82a/logs +wandb: +wandb: 🚀 View run full_4pass_baseline_80min at: https://wandb.ai/propensity/parameter-golf/runs/43bipylb +wandb: Find logs at: wandb/run-20260326_230158-43bipylb/logs +wandb: +wandb: 🚀 View run full_4pass_baseline_80min at: https://wandb.ai/propensity/parameter-golf/runs/zcabiozu +wandb: Find logs at: wandb/run-20260326_230158-zcabiozu/logs +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/20000 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms +step:1/20000 train_loss:6.9310 grad_norm:0.3717 train_time:1291ms step_avg:1291.09ms +step:2/20000 train_loss:8.3536 grad_norm:3.5393 train_time:2598ms step_avg:1298.81ms +step:3/20000 train_loss:7.5089 grad_norm:1.8069 train_time:3954ms step_avg:1318.13ms +step:4/20000 train_loss:7.5822 grad_norm:1.8725 train_time:5317ms step_avg:1329.29ms +step:5/20000 train_loss:7.3524 grad_norm:1.8843 train_time:6673ms step_avg:1334.64ms +step:6/20000 train_loss:7.0868 grad_norm:1.7131 train_time:8028ms step_avg:1338.00ms +step:7/20000 train_loss:6.9401 grad_norm:2.0897 train_time:9384ms step_avg:1340.63ms +step:8/20000 train_loss:6.8952 grad_norm:1.4534 train_time:10745ms step_avg:1343.15ms +step:9/20000 train_loss:6.5431 grad_norm:1.0222 train_time:12102ms step_avg:1344.70ms +step:10/20000 train_loss:6.1427 grad_norm:0.9715 train_time:13466ms step_avg:1346.55ms +step:50/20000 train_loss:3.6903 grad_norm:0.9422 train_time:68054ms step_avg:1361.07ms +step:100/20000 train_loss:3.1184 grad_norm:0.5410 train_time:136293ms step_avg:1362.93ms +step:150/20000 train_loss:2.7752 grad_norm:0.3613 train_time:205070ms step_avg:1367.13ms +step:200/20000 train_loss:2.5614 grad_norm:0.2693 train_time:273305ms step_avg:1366.53ms +step:250/20000 train_loss:2.5709 grad_norm:0.2522 train_time:341556ms step_avg:1366.22ms +step:300/20000 train_loss:2.4364 grad_norm:0.2295 train_time:409825ms step_avg:1366.08ms +step:350/20000 train_loss:2.4859 grad_norm:0.2104 train_time:478072ms step_avg:1365.92ms +step:400/20000 train_loss:2.3988 grad_norm:0.1555 train_time:546341ms step_avg:1365.85ms +step:450/20000 train_loss:2.2317 grad_norm:0.1958 train_time:614614ms step_avg:1365.81ms +step:500/20000 train_loss:2.2898 grad_norm:0.1775 train_time:682900ms step_avg:1365.80ms +step:500/20000 val_loss:2.3130 val_bpb:1.3699 train_time:682945ms step_avg:1365.89ms +step:550/20000 train_loss:2.3492 grad_norm:0.1559 train_time:751209ms step_avg:1365.83ms +step:600/20000 train_loss:2.2513 grad_norm:0.1438 train_time:819544ms step_avg:1365.91ms +step:650/20000 train_loss:2.2323 grad_norm:0.1536 train_time:888368ms step_avg:1366.72ms 
+step:700/20000 train_loss:2.3026 grad_norm:0.1020 train_time:956783ms step_avg:1366.83ms +step:750/20000 train_loss:2.2750 grad_norm:0.1105 train_time:1025183ms step_avg:1366.91ms +step:800/20000 train_loss:2.2546 grad_norm:0.1031 train_time:1093599ms step_avg:1367.00ms +step:850/20000 train_loss:2.1799 grad_norm:0.0737 train_time:1162084ms step_avg:1367.16ms +step:900/20000 train_loss:2.0960 grad_norm:0.0817 train_time:1230597ms step_avg:1367.33ms +step:950/20000 train_loss:2.2968 grad_norm:0.0953 train_time:1299094ms step_avg:1367.47ms +step:1000/20000 train_loss:2.2247 grad_norm:0.0713 train_time:1367589ms step_avg:1367.59ms +step:1000/20000 val_loss:2.1722 val_bpb:1.2865 train_time:1367633ms step_avg:1367.63ms +step:1050/20000 train_loss:2.1500 grad_norm:0.1469 train_time:1436112ms step_avg:1367.73ms +step:1100/20000 train_loss:2.1744 grad_norm:0.0794 train_time:1504991ms step_avg:1368.17ms +step:1150/20000 train_loss:2.1290 grad_norm:0.0672 train_time:1573762ms step_avg:1368.49ms +step:1200/20000 train_loss:2.1756 grad_norm:0.0636 train_time:1642514ms step_avg:1368.76ms +step:1250/20000 train_loss:2.1991 grad_norm:0.0599 train_time:1711283ms step_avg:1369.03ms +step:1300/20000 train_loss:2.1695 grad_norm:0.1132 train_time:1780070ms step_avg:1369.28ms +step:1350/20000 train_loss:2.1436 grad_norm:0.1200 train_time:1848866ms step_avg:1369.53ms +step:1400/20000 train_loss:2.1553 grad_norm:0.0700 train_time:1917654ms step_avg:1369.75ms +step:1450/20000 train_loss:2.1501 grad_norm:0.0631 train_time:1986442ms step_avg:1369.96ms +step:1500/20000 train_loss:2.1193 grad_norm:0.0733 train_time:2055220ms step_avg:1370.15ms +step:1500/20000 val_loss:2.1071 val_bpb:1.2479 train_time:2055264ms step_avg:1370.18ms +step:1550/20000 train_loss:2.0928 grad_norm:0.0758 train_time:2124013ms step_avg:1370.33ms +step:1600/20000 train_loss:2.1722 grad_norm:0.0814 train_time:2193129ms step_avg:1370.71ms +step:1650/20000 train_loss:1.9557 grad_norm:0.0655 train_time:2261915ms step_avg:1370.86ms +step:1700/20000 train_loss:2.0848 grad_norm:0.0634 train_time:2330710ms step_avg:1371.01ms +step:1750/20000 train_loss:2.0562 grad_norm:0.0759 train_time:2399493ms step_avg:1371.14ms +step:1800/20000 train_loss:2.0964 grad_norm:0.0645 train_time:2468259ms step_avg:1371.26ms +step:1850/20000 train_loss:2.1107 grad_norm:0.0831 train_time:2537046ms step_avg:1371.38ms +step:1900/20000 train_loss:2.0580 grad_norm:0.0648 train_time:2605824ms step_avg:1371.49ms +step:1950/20000 train_loss:2.0431 grad_norm:0.0981 train_time:2674651ms step_avg:1371.62ms +step:2000/20000 train_loss:2.2944 grad_norm:0.0838 train_time:2743419ms step_avg:1371.71ms +step:2000/20000 val_loss:2.0763 val_bpb:1.2297 train_time:2743463ms step_avg:1371.73ms +step:2050/20000 train_loss:2.0607 grad_norm:0.1013 train_time:2812501ms step_avg:1371.95ms +step:2100/20000 train_loss:2.0358 grad_norm:0.0558 train_time:2881257ms step_avg:1372.03ms +step:2150/20000 train_loss:2.0142 grad_norm:0.0526 train_time:2950035ms step_avg:1372.11ms +step:2200/20000 train_loss:2.1668 grad_norm:0.0614 train_time:3018808ms step_avg:1372.19ms +step:2250/20000 train_loss:2.0604 grad_norm:0.0644 train_time:3087562ms step_avg:1372.25ms +step:2300/20000 train_loss:2.0377 grad_norm:0.1123 train_time:3156291ms step_avg:1372.30ms +step:2350/20000 train_loss:1.9923 grad_norm:0.0511 train_time:3225042ms step_avg:1372.36ms +step:2400/20000 train_loss:2.1062 grad_norm:0.0682 train_time:3293804ms step_avg:1372.42ms +step:2450/20000 train_loss:2.0650 grad_norm:0.0639 train_time:3362565ms 
step_avg:1372.48ms +step:2500/20000 train_loss:2.0208 grad_norm:0.0580 train_time:3431320ms step_avg:1372.53ms +step:2500/20000 val_loss:2.0279 val_bpb:1.2010 train_time:3431364ms step_avg:1372.55ms +step:2550/20000 train_loss:2.0211 grad_norm:0.0558 train_time:3500393ms step_avg:1372.70ms +step:2600/20000 train_loss:2.0001 grad_norm:0.0479 train_time:3569165ms step_avg:1372.76ms +step:2650/20000 train_loss:2.0040 grad_norm:0.0582 train_time:3637929ms step_avg:1372.80ms +step:2700/20000 train_loss:2.0265 grad_norm:0.0542 train_time:3706703ms step_avg:1372.85ms +step:2750/20000 train_loss:2.0077 grad_norm:0.0457 train_time:3775459ms step_avg:1372.89ms +step:2800/20000 train_loss:2.0415 grad_norm:0.0569 train_time:3844241ms step_avg:1372.94ms +step:2850/20000 train_loss:1.9900 grad_norm:0.0487 train_time:3913011ms step_avg:1372.99ms +step:2900/20000 train_loss:2.0045 grad_norm:0.0438 train_time:3981769ms step_avg:1373.02ms +step:2950/20000 train_loss:2.0440 grad_norm:0.0447 train_time:4050513ms step_avg:1373.06ms +step:3000/20000 train_loss:1.9316 grad_norm:0.0567 train_time:4119545ms step_avg:1373.18ms +step:3000/20000 val_loss:1.9838 val_bpb:1.1749 train_time:4119590ms step_avg:1373.20ms +step:3050/20000 train_loss:1.9372 grad_norm:0.0506 train_time:4188300ms step_avg:1373.21ms +step:3100/20000 train_loss:1.9990 grad_norm:0.0465 train_time:4257075ms step_avg:1373.25ms +step:3150/20000 train_loss:2.0077 grad_norm:0.0401 train_time:4325837ms step_avg:1373.28ms +swa:start step:3200 +step:3200/20000 train_loss:1.9812 grad_norm:0.0445 train_time:4394566ms step_avg:1373.30ms +late_qat:enabled step:3241 scale:0.1495 core_quant:on +step:3250/20000 train_loss:1.9531 grad_norm:0.0567 train_time:4519079ms step_avg:1390.49ms +step:3300/20000 train_loss:1.9296 grad_norm:0.0386 train_time:4587540ms step_avg:1390.16ms +step:3350/20000 train_loss:1.9653 grad_norm:0.0394 train_time:4655858ms step_avg:1389.81ms +step:3400/20000 train_loss:2.0099 grad_norm:0.0483 train_time:4724204ms step_avg:1389.47ms +step:3450/20000 train_loss:1.9637 grad_norm:0.0369 train_time:4792535ms step_avg:1389.14ms +step:3456/20000 val_loss:1.9505 val_bpb:1.1552 train_time:4800814ms step_avg:1389.12ms +stopping_early: wallclock_cap train_time:4800814ms step:3456/20000 +peak memory allocated: 50545 MiB reserved: 50594 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:1.9472 val_bpb:1.1532 eval_time:32839ms +Serialized model: 106023671 bytes +Code size: 102633 bytes +Serialized model int6+lzma: 16373548 bytes +Total submission size int6+lzma: 16476181 bytes +final_int6_roundtrip val_loss:1.9574 val_bpb:1.1593 eval_time:39862ms +final_int6_roundtrip_exact val_loss:1.95735252 val_bpb:1.15925441 +final_int6_sliding_window val_loss:1.9164 val_bpb:1.1350 stride:64 eval_time:1105486ms +final_int6_sliding_window_exact val_loss:1.91642779 val_bpb:1.13501949 +final_int8_zlib_roundtrip_exact val_loss:1.91642779 val_bpb:1.13501949 +ttt_sliding:start chunks=1893 chunk_tokens=32768 total_windows=969088 stride=64 ttt_lr=0.002 ttt_epochs=3 freeze_blocks=2 +ttt_sliding:params unfrozen=26923088 frozen=4112 + ttt_chunk [1/1893] bpb=1.226275 time=1.9s + ttt_chunk [11/1893] bpb=1.128206 time=20.5s + ttt_chunk [21/1893] bpb=1.137378 time=39.0s + ttt_chunk [31/1893] bpb=1.142175 time=57.6s + ttt_chunk [41/1893] bpb=1.138228 time=76.1s + ttt_chunk [51/1893] bpb=1.139877 time=94.6s + ttt_chunk [61/1893] bpb=1.143695 time=113.2s + ttt_chunk [71/1893] bpb=1.141806 time=131.7s + ttt_chunk [81/1893] bpb=1.138175 time=150.2s + ttt_chunk [91/1893] 
bpb=1.137107 time=168.8s + ttt_chunk [101/1893] bpb=1.138115 time=187.3s + ttt_chunk [111/1893] bpb=1.138295 time=205.9s + ttt_chunk [121/1893] bpb=1.134671 time=224.4s + ttt_chunk [131/1893] bpb=1.133939 time=242.9s + ttt_chunk [141/1893] bpb=1.132766 time=261.5s + ttt_chunk [151/1893] bpb=1.132980 time=280.0s + ttt_chunk [161/1893] bpb=1.133800 time=298.6s + ttt_chunk [171/1893] bpb=1.135874 time=317.1s + ttt_chunk [181/1893] bpb=1.135884 time=335.6s + ttt_chunk [191/1893] bpb=1.138340 time=354.2s + ttt_chunk [201/1893] bpb=1.137866 time=372.7s + ttt_chunk [211/1893] bpb=1.136957 time=391.2s + ttt_chunk [221/1893] bpb=1.137842 time=409.8s + ttt_chunk [231/1893] bpb=1.137565 time=428.3s + ttt_chunk [241/1893] bpb=1.137849 time=446.8s + ttt_chunk [251/1893] bpb=1.137360 time=465.4s + ttt_chunk [261/1893] bpb=1.136692 time=483.9s + ttt_chunk [271/1893] bpb=1.135780 time=502.5s + ttt_chunk [281/1893] bpb=1.137389 time=521.0s + ttt_chunk [291/1893] bpb=1.137018 time=539.5s + ttt_chunk [301/1893] bpb=1.137918 time=558.1s + ttt_chunk [311/1893] bpb=1.138001 time=576.6s + ttt_chunk [321/1893] bpb=1.138708 time=595.1s + ttt_chunk [331/1893] bpb=1.138179 time=613.7s + ttt_chunk [341/1893] bpb=1.137832 time=632.2s + ttt_chunk [351/1893] bpb=1.138543 time=650.8s + ttt_chunk [361/1893] bpb=1.139301 time=669.3s + ttt_chunk [371/1893] bpb=1.139185 time=687.8s + ttt_chunk [381/1893] bpb=1.138924 time=706.4s + ttt_chunk [391/1893] bpb=1.139607 time=724.9s + ttt_chunk [401/1893] bpb=1.139172 time=743.4s + ttt_chunk [411/1893] bpb=1.138218 time=762.0s + ttt_chunk [421/1893] bpb=1.138334 time=780.5s + ttt_chunk [431/1893] bpb=1.138777 time=799.1s + ttt_chunk [441/1893] bpb=1.138161 time=817.6s + ttt_chunk [451/1893] bpb=1.138301 time=836.1s + ttt_chunk [461/1893] bpb=1.138190 time=854.7s + ttt_chunk [471/1893] bpb=1.137746 time=873.2s + ttt_chunk [481/1893] bpb=1.137597 time=891.8s + ttt_chunk [491/1893] bpb=1.137722 time=910.3s + ttt_chunk [501/1893] bpb=1.137492 time=928.8s + ttt_chunk [511/1893] bpb=1.137017 time=947.4s + ttt_chunk [521/1893] bpb=1.136714 time=965.9s + ttt_chunk [531/1893] bpb=1.137443 time=984.5s + ttt_chunk [541/1893] bpb=1.137557 time=1003.0s + ttt_chunk [551/1893] bpb=1.137019 time=1021.5s + ttt_chunk [561/1893] bpb=1.136885 time=1040.1s + ttt_chunk [571/1893] bpb=1.136621 time=1058.6s + ttt_chunk [581/1893] bpb=1.136257 time=1077.2s + ttt_chunk [591/1893] bpb=1.135719 time=1095.7s + ttt_chunk [601/1893] bpb=1.135711 time=1114.2s + ttt_chunk [611/1893] bpb=1.135386 time=1132.8s + ttt_chunk [621/1893] bpb=1.135235 time=1151.3s + ttt_chunk [631/1893] bpb=1.134973 time=1169.9s + ttt_chunk [641/1893] bpb=1.134519 time=1188.4s + ttt_chunk [651/1893] bpb=1.134057 time=1206.9s + ttt_chunk [661/1893] bpb=1.133947 time=1225.5s + ttt_chunk [671/1893] bpb=1.133482 time=1244.0s + ttt_chunk [681/1893] bpb=1.132918 time=1262.6s + ttt_chunk [691/1893] bpb=1.132994 time=1281.1s + ttt_chunk [701/1893] bpb=1.132163 time=1299.6s + ttt_chunk [711/1893] bpb=1.132176 time=1318.2s + ttt_chunk [721/1893] bpb=1.132090 time=1336.7s + ttt_chunk [731/1893] bpb=1.132331 time=1355.2s + ttt_chunk [741/1893] bpb=1.132205 time=1373.8s + ttt_chunk [751/1893] bpb=1.131884 time=1392.3s + ttt_chunk [761/1893] bpb=1.132028 time=1410.8s + ttt_chunk [771/1893] bpb=1.131860 time=1429.4s + ttt_chunk [781/1893] bpb=1.132024 time=1447.9s + ttt_chunk [791/1893] bpb=1.131869 time=1466.4s + ttt_chunk [801/1893] bpb=1.131804 time=1485.0s + ttt_chunk [811/1893] bpb=1.131817 time=1503.5s + ttt_chunk [821/1893] bpb=1.131702 
time=1522.1s + ttt_chunk [831/1893] bpb=1.131418 time=1540.6s + ttt_chunk [841/1893] bpb=1.131180 time=1559.1s + ttt_chunk [851/1893] bpb=1.131241 time=1577.7s + ttt_chunk [861/1893] bpb=1.131312 time=1596.2s + ttt_chunk [871/1893] bpb=1.131521 time=1614.7s + ttt_chunk [881/1893] bpb=1.131519 time=1633.3s + ttt_chunk [891/1893] bpb=1.130978 time=1651.8s + ttt_chunk [901/1893] bpb=1.130995 time=1670.3s + ttt_chunk [911/1893] bpb=1.130849 time=1688.9s + ttt_chunk [921/1893] bpb=1.130984 time=1707.4s + ttt_chunk [931/1893] bpb=1.130928 time=1726.0s + ttt_chunk [941/1893] bpb=1.131129 time=1744.5s + ttt_chunk [951/1893] bpb=1.131431 time=1763.0s + ttt_chunk [961/1893] bpb=1.131741 time=1781.6s + ttt_chunk [971/1893] bpb=1.132107 time=1800.1s + ttt_chunk [981/1893] bpb=1.132319 time=1818.6s + ttt_chunk [991/1893] bpb=1.132236 time=1837.2s + ttt_chunk [1001/1893] bpb=1.132567 time=1855.7s + ttt_chunk [1011/1893] bpb=1.132723 time=1874.3s + ttt_chunk [1021/1893] bpb=1.133011 time=1892.8s + ttt_chunk [1031/1893] bpb=1.133400 time=1911.3s + ttt_chunk [1041/1893] bpb=1.133897 time=1929.9s + ttt_chunk [1051/1893] bpb=1.133756 time=1948.4s + ttt_chunk [1061/1893] bpb=1.133865 time=1967.0s + ttt_chunk [1071/1893] bpb=1.134029 time=1985.5s + ttt_chunk [1081/1893] bpb=1.134076 time=2004.1s + ttt_chunk [1091/1893] bpb=1.134326 time=2022.7s + ttt_chunk [1101/1893] bpb=1.134469 time=2041.2s + ttt_chunk [1111/1893] bpb=1.134274 time=2059.8s + ttt_chunk [1121/1893] bpb=1.134049 time=2078.3s + ttt_chunk [1131/1893] bpb=1.133943 time=2096.9s + ttt_chunk [1141/1893] bpb=1.133705 time=2115.4s + ttt_chunk [1151/1893] bpb=1.133733 time=2134.0s + ttt_chunk [1161/1893] bpb=1.133569 time=2152.5s + ttt_chunk [1171/1893] bpb=1.133389 time=2171.1s + ttt_chunk [1181/1893] bpb=1.133164 time=2189.6s + ttt_chunk [1191/1893] bpb=1.133317 time=2208.2s + ttt_chunk [1201/1893] bpb=1.133519 time=2226.8s + ttt_chunk [1211/1893] bpb=1.133117 time=2245.3s + ttt_chunk [1221/1893] bpb=1.133455 time=2263.9s + ttt_chunk [1231/1893] bpb=1.133394 time=2282.4s + ttt_chunk [1241/1893] bpb=1.133104 time=2300.9s + ttt_chunk [1251/1893] bpb=1.132567 time=2319.5s + ttt_chunk [1261/1893] bpb=1.132300 time=2338.0s + ttt_chunk [1271/1893] bpb=1.132047 time=2356.6s + ttt_chunk [1281/1893] bpb=1.131738 time=2375.1s + ttt_chunk [1291/1893] bpb=1.131494 time=2393.7s + ttt_chunk [1301/1893] bpb=1.131443 time=2412.2s + ttt_chunk [1311/1893] bpb=1.131173 time=2430.7s + ttt_chunk [1321/1893] bpb=1.130872 time=2449.3s + ttt_chunk [1331/1893] bpb=1.130632 time=2467.8s + ttt_chunk [1341/1893] bpb=1.130505 time=2486.4s + ttt_chunk [1351/1893] bpb=1.130352 time=2504.9s + ttt_chunk [1361/1893] bpb=1.130484 time=2523.5s + ttt_chunk [1371/1893] bpb=1.130705 time=2542.0s + ttt_chunk [1381/1893] bpb=1.130910 time=2560.5s + ttt_chunk [1391/1893] bpb=1.130695 time=2579.1s + ttt_chunk [1401/1893] bpb=1.130724 time=2597.6s + ttt_chunk [1411/1893] bpb=1.130831 time=2616.2s + ttt_chunk [1421/1893] bpb=1.130815 time=2634.7s + ttt_chunk [1431/1893] bpb=1.130791 time=2653.3s + ttt_chunk [1441/1893] bpb=1.131256 time=2671.8s + ttt_chunk [1451/1893] bpb=1.131119 time=2691.1s + ttt_chunk [1461/1893] bpb=1.131048 time=2709.6s + ttt_chunk [1471/1893] bpb=1.131643 time=2728.2s + ttt_chunk [1481/1893] bpb=1.131517 time=2746.7s + ttt_chunk [1491/1893] bpb=1.131890 time=2765.3s + ttt_chunk [1501/1893] bpb=1.131872 time=2783.8s + ttt_chunk [1511/1893] bpb=1.131833 time=2802.3s + ttt_chunk [1521/1893] bpb=1.131945 time=2820.9s + ttt_chunk [1531/1893] bpb=1.132160 time=2839.4s + 
ttt_chunk [1541/1893] bpb=1.132230 time=2858.0s + ttt_chunk [1551/1893] bpb=1.132470 time=2876.5s + ttt_chunk [1561/1893] bpb=1.132554 time=2895.1s + ttt_chunk [1571/1893] bpb=1.132686 time=2913.6s + ttt_chunk [1581/1893] bpb=1.132836 time=2932.1s + ttt_chunk [1591/1893] bpb=1.132902 time=2950.7s + ttt_chunk [1601/1893] bpb=1.133020 time=2969.2s + ttt_chunk [1611/1893] bpb=1.133281 time=2987.8s + ttt_chunk [1621/1893] bpb=1.133141 time=3006.3s + ttt_chunk [1631/1893] bpb=1.133187 time=3024.8s + ttt_chunk [1641/1893] bpb=1.133212 time=3043.4s + ttt_chunk [1651/1893] bpb=1.133269 time=3061.9s + ttt_chunk [1661/1893] bpb=1.133410 time=3080.5s + ttt_chunk [1671/1893] bpb=1.133595 time=3099.0s + ttt_chunk [1681/1893] bpb=1.133686 time=3117.5s + ttt_chunk [1691/1893] bpb=1.133787 time=3136.1s + ttt_chunk [1701/1893] bpb=1.133884 time=3154.6s + ttt_chunk [1711/1893] bpb=1.133862 time=3173.2s + ttt_chunk [1721/1893] bpb=1.133701 time=3191.7s + ttt_chunk [1731/1893] bpb=1.133797 time=3210.2s + ttt_chunk [1741/1893] bpb=1.133534 time=3228.8s + ttt_chunk [1751/1893] bpb=1.133407 time=3247.3s + ttt_chunk [1761/1893] bpb=1.133444 time=3265.9s + ttt_chunk [1771/1893] bpb=1.133395 time=3284.4s + ttt_chunk [1781/1893] bpb=1.133298 time=3303.0s + ttt_chunk [1791/1893] bpb=1.132959 time=3321.5s + ttt_chunk [1801/1893] bpb=1.132941 time=3340.0s + ttt_chunk [1811/1893] bpb=1.132795 time=3358.6s + ttt_chunk [1821/1893] bpb=1.132853 time=3377.1s + ttt_chunk [1831/1893] bpb=1.132699 time=3395.7s + ttt_chunk [1841/1893] bpb=1.132738 time=3414.2s + ttt_chunk [1851/1893] bpb=1.132559 time=3432.7s + ttt_chunk [1861/1893] bpb=1.132478 time=3451.3s + ttt_chunk [1871/1893] bpb=1.132413 time=3469.8s + ttt_chunk [1881/1893] bpb=1.132170 time=3488.4s + ttt_chunk [1891/1893] bpb=1.132153 time=3506.9s + ttt_chunk [1893/1893] bpb=1.132184 time=3509.9s +ttt_sliding:done val_loss=1.911640 val_bpb=1.132184 elapsed=3510.0s +legal_ttt val_loss:1.9116 val_bpb:1.1322 eval_time:3510399ms +legal_ttt_exact val_loss:1.91163996 val_bpb:1.13218386 +wandb: updating run metadata +wandb: uploading output.log; uploading wandb-summary.json; uploading config.yaml +wandb: uploading data +wandb: +wandb: Run history: +wandb: grad_norm ▂█▅▅▄▃▃▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ +wandb: lr_scale ██████████████████████▇▇▇▆▆▅▅▄▄▄▃▃▃▂▂▂▂▁ +wandb: step_avg_ms ▁▂▃▄▄▅▆▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇█ +wandb: train_loss ▆█▇▇▇▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ +wandb: val_bpb █▂▁▁▁▁▁▁ +wandb: val_loss █▂▁▁▁▁▁▁ +wandb: +wandb: Run summary: +wandb: grad_norm 0.03694 +wandb: lr_scale 0.00374 +wandb: step_avg_ms 1389.14072 +wandb: train_loss 1.96371 +wandb: val_bpb 1.1552 +wandb: val_loss 1.95051 +wandb: +wandb: 🚀 View run full_4pass_baseline_80min at: https://wandb.ai/propensity/parameter-golf/runs/qltwebo4 +wandb: ⭐️ View project at: https://wandb.ai/propensity/parameter-golf +wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s) +wandb: Find logs at: ./wandb/run-20260326_230156-qltwebo4/logs diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/eval_ttt_passes.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/eval_ttt_passes.sh new file mode 100755 index 0000000000..0f01607e15 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/eval_ttt_passes.sh @@ -0,0 +1,82 @@ +#!/bin/bash +set -uo pipefail + +PYTHON="/home/nesta/parameter-golf/.venv/bin/python3" +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +cd "$SCRIPT_DIR" + +set -a; source /home/nesta/parameter-golf/.env; set +a 
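+# Evaluate the already-trained 4-pass checkpoint (final_model.pt) at 2 and 6
+# recurrence passes, so the eval-time pass count deliberately differs from the
+# NUM_PASSES=4 used in training; this probes pass-count generalization.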
+ +export PYTHONUNBUFFERED=1 +export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True +export CUDA_VISIBLE_DEVICES=0 +export DATA_PATH="../../../data/datasets/fineweb10B_sp1024" +export TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model" +export SEED=1337 +export ITERATIONS=9000 +export MAX_WALLCLOCK_SECONDS=600 +export WARMUP_STEPS=20 +export WARMDOWN_ITERS=3500 +export TRAIN_BATCH_TOKENS=786432 +export TRAIN_SEQ_LEN=2048 +export EVAL_SEQ_LEN=2048 +export EVAL_STRIDE=64 +export NUM_LAYERS=11 +export MODEL_DIM=512 +export NUM_HEADS=8 +export NUM_KV_HEADS=4 +export BIGRAM_VOCAB_SIZE=1536 +export XSA_LAST_N=4 +export ROPE_DIMS=16 +export LN_SCALE=1 +export VE_ENABLED=1 +export VE_DIM=128 +export VE_LAYERS="9,10" +export MATRIX_LR=0.025 +export SCALAR_LR=0.025 +export TIED_EMBED_LR=0.035 +export MUON_MOMENTUM=0.99 +export MUON_MOMENTUM_WARMUP_START=0.92 +export MUON_MOMENTUM_WARMUP_STEPS=700 +export MUON_WD=0.04 +export ADAM_WD=0.04 +export GRAD_CLIP_NORM=0.3 +export SWA_ENABLED=0 +export TTT_ENABLED=1 +export CORE_START=3 +export CORE_END=8 +export CORE_QUANT_ENABLED=0 +export CORE_QUANT_BITS=6 +export NUM_PASSES=4 +export LORA_RANK=0 +export WANDB_MODE=disabled + +CKPT="$SCRIPT_DIR/final_model.pt" + +for PASSES in 2 6; do + LOG="/home/nesta/parameter-golf/eval_ttt_${PASSES}pass.log" + echo "=== TTT eval with ${PASSES} passes ($(date)) ===" | tee "$LOG" + + $PYTHON train_gpt_recurrent.py \ + --feedback-mode diagonal --feedback-rank 2 \ + --residual-scale-init 0.5 \ + --jacobian-proxy-weight 0.1 \ + --no-interpass-rmsnorm \ + --lora-rank 0 \ + --eval-only-passes "$PASSES" \ + --eval-only-checkpoint "$CKPT" \ + >> "$LOG" 2>&1 + + EXIT=$? + echo "" + if [ $EXIT -ne 0 ]; then + echo "FAILED ${PASSES}-pass (exit=$EXIT)" + tail -20 "$LOG" + else + echo "=== ${PASSES}-pass RESULTS ===" + grep 'legal_ttt_exact' "$LOG" + fi + echo "" +done + +echo "ALL DONE ($(date))" diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/lora-fix-plan.md b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/lora-fix-plan.md new file mode 100644 index 0000000000..c983e4d46b --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/lora-fix-plan.md @@ -0,0 +1,204 @@ +# LoRA Stability Fix Plan + +The LoRA per-pass adapters are causing training instability (40× growth ratios, loss spiking to 28.9). Three root causes, all in `train_gpt_recurrent.py`. Apply all fixes. + +--- + +## Fix 1: Add rsLoRA scaling to the forward pass + +**File:** `train_gpt_recurrent.py` +**Location:** `GPT._forward_hidden`, inside the core loop where LoRA is applied + +The raw `B @ A` product is added to weights with no scaling. At rank 8, the output magnitude is √8 ≈ 2.83× too large. Apply `α/√r` scaling (rsLoRA). 
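+
+As a quick sanity check of the √r claim, here is a stand-alone snippet (illustrative only, not part of the patch; it uses Gaussian stand-ins at roughly the kaiming scale rather than the real adapters):
+
+```python
+import torch
+
+torch.manual_seed(0)
+d = 512
+x = torch.randn(d)
+for r in (1, 2, 8):
+    A = torch.randn(r, d) / d**0.5  # entries ~ N(0, 1/d), comparable to the kaiming scale
+    B = torch.randn(d, r) / d**0.5
+    y = B @ (A @ x)
+    print(f"r={r}  |BAx|={y.norm().item():.2f}  with 1/sqrt(r): {y.norm().item() / r**0.5:.2f}")
+```
+
+The unscaled magnitude grows like √r (about 2.8× at r=8) while the `1/√r`-scaled version stays flat, which is exactly the rank-dependence the rsLoRA factor removes.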
+
+**Find this block:**
+```python
+if self.lora_rank > 0:
+    ci = j - self.core_start
+    q_w = q_w + self.lora_B_q[k, ci] @ self.lora_A_q[k, ci]
+    k_w = k_w + self.lora_B_k[k, ci] @ self.lora_A_k[k, ci]
+    v_w = v_w + self.lora_B_v[k, ci] @ self.lora_A_v[k, ci]
+    out_w = out_w + self.lora_B_out[k, ci] @ self.lora_A_out[k, ci]
+    up_w = up_w + self.lora_B_up[k, ci] @ self.lora_A_up[k, ci]
+    down_w = down_w + self.lora_B_down[k, ci] @ self.lora_A_down[k, ci]
+```
+
+**Replace with:**
+```python
+if self.lora_rank > 0:
+    ci = j - self.core_start
+    s = self._lora_scale  # precomputed 1.0 / sqrt(rank)
+    q_w = q_w + s * (self.lora_B_q[k, ci] @ self.lora_A_q[k, ci])
+    k_w = k_w + s * (self.lora_B_k[k, ci] @ self.lora_A_k[k, ci])
+    v_w = v_w + s * (self.lora_B_v[k, ci] @ self.lora_A_v[k, ci])
+    out_w = out_w + s * (self.lora_B_out[k, ci] @ self.lora_A_out[k, ci])
+    up_w = up_w + s * (self.lora_B_up[k, ci] @ self.lora_A_up[k, ci])
+    down_w = down_w + s * (self.lora_B_down[k, ci] @ self.lora_A_down[k, ci])
+```
+
+**Also add in `GPT.__init__`**, after the LoRA parameter creation block:
+```python
+self._lora_scale = 1.0 / math.sqrt(lora_rank) if lora_rank > 0 else 1.0
+```
+
+---
+
+## Fix 2: Shrink the LoRA A initialization (keep B at zero)
+
+**File:** `train_gpt_recurrent.py`
+**Location:** `GPT.__init__`, the LoRA parameter creation block
+
+Kaiming init on A makes `||A||` ≈ √dim ≈ 22.6. After one gradient step on B, `||BA||` is proportional to this — far too large. B is already zero-initialized, so `BA` is a no-op at initialization; the fix is to shrink A, not to zero it. Do **not** zero both factors: the gradient of each is proportional to the other (∂L/∂A = Bᵀ·G and ∂L/∂B = G·Aᵀ, with G the upstream gradient), so with A = B = 0 both gradients vanish and the adapters never train at all. Keep B at zero and give A a small init, so the LoRA contribution starts at zero and grows gradually from the first step.
+
+**Find this block:**
+```python
+if lora_rank > 0 and self.num_core > 0 and num_passes > 1:
+    nc, np_, r = self.num_core, num_passes, lora_rank
+    for wname, in_d, out_d in [
+        ("q", model_dim, model_dim), ("out", model_dim, model_dim),
+        ("k", model_dim, kv_dim), ("v", model_dim, kv_dim),
+        ("up", model_dim, mlp_dim), ("down", mlp_dim, model_dim),
+    ]:
+        A = nn.Parameter(torch.empty(np_, nc, r, in_d))
+        B = nn.Parameter(torch.zeros(np_, nc, out_d, r))
+        nn.init.kaiming_uniform_(A, a=math.sqrt(5))
+        setattr(self, f"lora_A_{wname}", A)
+        setattr(self, f"lora_B_{wname}", B)
+```
+
+**Replace with:**
+```python
+if lora_rank > 0 and self.num_core > 0 and num_passes > 1:
+    nc, np_, r = self.num_core, num_passes, lora_rank
+    for wname, in_d, out_d in [
+        ("q", model_dim, model_dim), ("out", model_dim, model_dim),
+        ("k", model_dim, kv_dim), ("v", model_dim, kv_dim),
+        ("up", model_dim, mlp_dim), ("down", mlp_dim, model_dim),
+    ]:
+        A = nn.Parameter(torch.empty(np_, nc, r, in_d))
+        B = nn.Parameter(torch.zeros(np_, nc, out_d, r))  # zero B keeps BA a no-op at init
+        nn.init.normal_(A, std=0.02)  # small A instead of kaiming's ||A|| ≈ 22.6
+        setattr(self, f"lora_A_{wname}", A)
+        setattr(self, f"lora_B_{wname}", B)
+    self._lora_scale = 1.0 / math.sqrt(lora_rank)
+```
+
+---
+
+## Fix 3: Give LoRA params their own optimizer group with lower learning rate
+
+**File:** `train_gpt_recurrent.py`
+**Location:** `main()`, in the optimizer setup section
+
+Currently LoRA params are added to `extra_scalar_params` and trained at `scalar_lr=0.025`. LoRA matrices are 2D weight matrices, not scalars — they need a separate, lower learning rate.
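+
+For orientation, the `base_lr` key used below assumes the training loop's scheduler rescales every param group from a stored `base_lr` by one shared multiplier (the `lr_scale` curve in the wandb history is that multiplier). A minimal sketch of that convention, not the repo's exact code:
+
+```python
+# Hypothetical sketch of the assumed base_lr convention: one schedule
+# multiplier applied uniformly across all optimizers and param groups.
+def apply_lr_scale(optimizers, lr_scale: float) -> None:
+    for opt in optimizers:
+        for group in opt.param_groups:
+            group["lr"] = group["base_lr"] * lr_scale
+```
+
+Under that convention the new LoRA group follows the same warmdown shape as every other group, just at one tenth the magnitude.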
+ +**Find this block:** +```python +if base_model.lora_rank > 0: + lora_params = [p for n, p in base_model.named_parameters() if "lora_" in n] + for p in lora_params: + p.data = p.data.float() + extra_scalar_params.extend(lora_params) + log0(f"lora: rank={base_model.lora_rank} params={sum(p.numel() for p in lora_params)}") +``` + +**Replace with:** +```python +if base_model.lora_rank > 0: + lora_params = [p for n, p in base_model.named_parameters() if "lora_" in n] + for p in lora_params: + p.data = p.data.float() + # Do NOT add to extra_scalar_params — LoRA gets its own optimizer + log0(f"lora: rank={base_model.lora_rank} params={sum(p.numel() for p in lora_params)}") +``` + +**Then, after `optimizer_scalar` is created, add a new optimizer:** +```python +optimizer_lora = None +if base_model.lora_rank > 0: + lora_lr = args.scalar_lr * 0.1 # 10× lower than scalar_lr + optimizer_lora = torch.optim.AdamW( + [{"params": lora_params, "lr": lora_lr, "base_lr": lora_lr}], + betas=(args.beta1, args.beta2), + eps=args.adam_eps, + weight_decay=args.adam_wd, + fused=True, + ) + # Add LoRA params to replicated_params for distributed all-reduce + replicated_params.extend(lora_params) + log0(f"lora_optimizer: lr={lora_lr} (scalar_lr * 0.1)") +``` + +**Update the optimizers list** (find where it's defined): +```python +optimizers: list[torch.optim.Optimizer] = [optimizer_tok, optimizer_muon, optimizer_scalar] +if optimizer_head is not None: + optimizers.append(optimizer_head) +if optimizer_lora is not None: + optimizers.append(optimizer_lora) +``` + +--- + +## Fix 4: Reduce default LoRA rank from 8 to 2 + +**File:** `train_gpt_recurrent.py` +**Location:** `Hyperparameters` class and CLI args + +Rank 8 across 6 weight types × 5 core layers × 4 passes = 1.2M params of perturbation surface. Rank 2 gives 307K params — enough for per-pass differentiation, small enough that the Jacobian proxy loss can control it. + +**In `Hyperparameters`:** +```python +lora_rank = int(os.environ.get("LORA_RANK", 0)) # no change needed, default is already 0 +``` + +**In the run script, change `--lora-rank 8` to `--lora-rank 2`:** +```bash +--lora-rank 2 +``` + +--- + +## Fix 5: Increase Jacobian proxy weight for LoRA runs + +With LoRA perturbations, the Jacobian proxy loss needs to work harder to keep things contractive. The per-pass weight deltas create additional expansive directions that the loss must counteract. + +**In the run script:** +```bash +--jacobian-proxy-weight 0.1 +``` + +This was already discussed but confirm it's set to 0.1, not 0.01. 
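+
+For reference, the proxy being weighted here is a contractivity penalty on the shared core; the exact form lives in `train_gpt_recurrent.py`, but a minimal finite-difference sketch of the idea (illustrative only, under that assumption) looks like:
+
+```python
+import torch
+
+def jacobian_proxy_loss(core_fn, h, eps: float = 1e-2):
+    """Illustrative sketch: estimate the local expansion of one core pass
+    by finite differences and penalize ratios above 1, so that repeated
+    passes stay contractive."""
+    noise = eps * torch.randn_like(h)
+    growth = (core_fn(h + noise) - core_fn(h)).norm() / noise.norm()
+    return torch.relu(growth - 1.0).pow(2)
+```
+
+With per-pass LoRA deltas in play, the expansion this penalty sees includes the adapter contribution, hence the heavier weight.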
+ +--- + +## Summary + +| Fix | What | Where | Impact | +|-----|------|-------|--------| +| 1 | rsLoRA scaling `1/√r` | `_forward_hidden` + `__init__` | Reduces LoRA output magnitude by √r | +| 2 | Zero init A and B | `__init__` LoRA creation | LoRA is no-op at init, learns gradually | +| 3 | Separate optimizer at 0.1× LR | `main()` optimizer setup | Prevents LoRA params from overshooting | +| 4 | Rank 8 → rank 2 | Run script CLI arg | 4× less perturbation surface | +| 5 | Jacobian weight 0.01 → 0.1 | Run script CLI arg | Stronger contractivity pressure | + +## Test command after fixes: +```bash +NUM_PASSES=4 \ +CORE_START=3 \ +CORE_END=8 \ +ITERATIONS=500 \ +VAL_LOSS_EVERY=50 \ +TRAIN_LOG_EVERY=10 \ +python train_gpt_recurrent.py \ + --feedback-mode diagonal \ + --feedback-rank 2 \ + --jacobian-proxy-weight 0.1 \ + --lora-rank 2 \ + --no-interpass-rmsnorm +``` + +## Expected behavior after fixes: +- Growth ratios at step 0: ~1.0-1.2 (same as without LoRA, since LoRA is zero-initialized) +- Growth ratios at step 50: ~1.0-1.3 (LoRA starting to contribute, Jacobian loss keeping it in check) +- No loss spikes above 10 in the first 20 steps +- Train loss should track the non-LoRA 4-pass run closely for the first ~100 steps, then gradually improve as LoRA learns per-pass specialization diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_2pass_3core.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_2pass_3core.sh new file mode 100755 index 0000000000..b7209ec85c --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_2pass_3core.sh @@ -0,0 +1,82 @@ +#!/bin/bash +set -uo pipefail + +PYTHON="/home/nesta/parameter-golf/.venv/bin/python3" +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +cd "$SCRIPT_DIR" + +set -a; source /home/nesta/parameter-golf/.env; set +a + +export PYTHONUNBUFFERED=1 +export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True +export CUDA_VISIBLE_DEVICES=0 +export DATA_PATH="../../../data/datasets/fineweb10B_sp1024" +export TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model" +export SEED=1337 +export ITERATIONS=20000 +export MAX_WALLCLOCK_SECONDS=4800 +export VAL_LOSS_EVERY=500 +export TRAIN_LOG_EVERY=50 +export WARMUP_STEPS=20 +export WARMDOWN_ITERS=1700 +export TRAIN_BATCH_TOKENS=786432 +export TRAIN_SEQ_LEN=2048 +export EVAL_SEQ_LEN=2048 +export EVAL_STRIDE=64 +export NUM_LAYERS=11 +export MODEL_DIM=512 +export NUM_HEADS=8 +export NUM_KV_HEADS=4 +export BIGRAM_VOCAB_SIZE=1536 +export XSA_LAST_N=4 +export ROPE_DIMS=16 +export LN_SCALE=1 +export VE_ENABLED=1 +export VE_DIM=128 +export VE_LAYERS="9,10" +export MATRIX_LR=0.025 +export SCALAR_LR=0.025 +export TIED_EMBED_LR=0.035 +export MUON_MOMENTUM=0.99 +export MUON_MOMENTUM_WARMUP_START=0.92 +export MUON_MOMENTUM_WARMUP_STEPS=700 +export MUON_WD=0.04 +export ADAM_WD=0.04 +export GRAD_CLIP_NORM=0.3 +export SWA_ENABLED=1 +export SWA_EVERY=50 +export TTT_ENABLED=1 +export CORE_START=4 +export CORE_END=7 +export CORE_QUANT_ENABLED=0 +export CORE_QUANT_BITS=6 +export NUM_PASSES=2 +export LORA_RANK=0 +export WANDB_PROJECT="parameter-golf" +export WANDB_NAME="2pass_3core_80min" + +LOG="/home/nesta/parameter-golf/full_2pass_3core.log" +echo "START full run: 2-pass 3-core (layers 4-6) TTT SWA, 80min ($(date))" | tee "$LOG" + +$PYTHON train_gpt_recurrent.py \ + --feedback-mode diagonal --feedback-rank 2 \ + --residual-scale-init 0.5 \ + --jacobian-proxy-weight 0.1 \ + --no-interpass-rmsnorm \ + --lora-rank 0 \ + >> "$LOG" 2>&1 + +EXIT=$? 
+echo "" +if [ $EXIT -ne 0 ]; then + echo "FAILED (exit=$EXIT)" + tail -30 "$LOG" +else + echo "=== FINAL RESULTS ===" + grep 'stopping_early\|peak memory' "$LOG" + grep 'final_int6_roundtrip_exact' "$LOG" + grep 'final_int6_sliding_window_exact' "$LOG" + grep 'legal_ttt_exact' "$LOG" +fi + +echo "FINISHED ($(date))" diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_baseline_4pass.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_baseline_4pass.sh new file mode 100644 index 0000000000..394baede74 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_baseline_4pass.sh @@ -0,0 +1,83 @@ +#!/bin/bash +set -uo pipefail + +PYTHON="/home/nesta/parameter-golf/.venv/bin/python3" +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +cd "$SCRIPT_DIR" + +set -a; source /home/nesta/parameter-golf/.env; set +a + +export PYTHONUNBUFFERED=1 +export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True +export CUDA_VISIBLE_DEVICES=0 +export DATA_PATH="../../../data/datasets/fineweb10B_sp1024" +export TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model" +export SEED=1337 +export ITERATIONS=20000 +export MAX_WALLCLOCK_SECONDS=4800 +export VAL_LOSS_EVERY=500 +export TRAIN_LOG_EVERY=50 +export WARMUP_STEPS=20 +export WARMDOWN_ITERS=1700 +export TRAIN_BATCH_TOKENS=786432 +export TRAIN_SEQ_LEN=2048 +export EVAL_SEQ_LEN=2048 +export EVAL_STRIDE=64 +export NUM_LAYERS=11 +export MODEL_DIM=512 +export NUM_HEADS=8 +export NUM_KV_HEADS=4 +export BIGRAM_VOCAB_SIZE=1536 +export XSA_LAST_N=4 +export ROPE_DIMS=16 +export LN_SCALE=1 +export VE_ENABLED=1 +export VE_DIM=128 +export VE_LAYERS="9,10" +export MATRIX_LR=0.025 +export SCALAR_LR=0.025 +export TIED_EMBED_LR=0.035 +export MUON_MOMENTUM=0.99 +export MUON_MOMENTUM_WARMUP_START=0.92 +export MUON_MOMENTUM_WARMUP_STEPS=700 +export MUON_WD=0.04 +export ADAM_WD=0.04 +export GRAD_CLIP_NORM=0.3 +export SWA_ENABLED=1 +export SWA_EVERY=50 +export TTT_ENABLED=1 +export CORE_START=3 +export CORE_END=8 +export CORE_QUANT_ENABLED=0 +export CORE_QUANT_BITS=6 +export NUM_PASSES=4 +export LORA_RANK=0 +export WANDB_PROJECT="parameter-golf" +export WANDB_NAME="full_4pass_baseline_80min" + +LOG="/home/nesta/parameter-golf/full_baseline.log" +echo "START full run: 4-pass baseline (no LoRA) TTT SWA, 80min ($(date))" | tee "$LOG" + +$PYTHON train_gpt_recurrent.py \ + --feedback-mode diagonal --feedback-rank 2 \ + --residual-scale-init 0.5 \ + --jacobian-proxy-weight 0.1 \ + --no-interpass-rmsnorm \ + --lora-rank 0 \ + >> "$LOG" 2>&1 + +EXIT=$? 
+echo "" +if [ $EXIT -ne 0 ]; then + echo "FAILED (exit=$EXIT)" + tail -30 "$LOG" +else + echo "=== FINAL RESULTS ===" + grep 'stopping_early\|peak memory' "$LOG" + grep 'final_int6_roundtrip_exact' "$LOG" + grep 'final_int6_sliding_window_exact' "$LOG" + grep 'final_int6_sliding_window_s64_exact' "$LOG" + grep 'legal_ttt_exact' "$LOG" +fi + +echo "FINISHED ($(date))" diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_full_4pass.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_full_4pass.sh new file mode 100644 index 0000000000..ec225c8e1b --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_full_4pass.sh @@ -0,0 +1,81 @@ +#!/bin/bash +set -uo pipefail + +PYTHON="/home/nesta/parameter-golf/.venv/bin/python3" +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +cd "$SCRIPT_DIR" + +set -a; source /home/nesta/parameter-golf/.env; set +a + +export PYTHONUNBUFFERED=1 +export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True +export DATA_PATH="../../../data/datasets/fineweb10B_sp1024" +export TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model" +export SEED=1337 +export ITERATIONS=20000 +export MAX_WALLCLOCK_SECONDS=4800 +export VAL_LOSS_EVERY=500 +export TRAIN_LOG_EVERY=50 +export WARMUP_STEPS=20 +export WARMDOWN_ITERS=1700 +export TRAIN_BATCH_TOKENS=786432 +export TRAIN_SEQ_LEN=2048 +export EVAL_SEQ_LEN=2048 +export EVAL_STRIDE=64 +export NUM_LAYERS=11 +export MODEL_DIM=512 +export NUM_HEADS=8 +export NUM_KV_HEADS=4 +export BIGRAM_VOCAB_SIZE=1536 +export XSA_LAST_N=4 +export ROPE_DIMS=16 +export LN_SCALE=1 +export VE_ENABLED=1 +export VE_DIM=128 +export VE_LAYERS="9,10" +export MATRIX_LR=0.025 +export SCALAR_LR=0.025 +export TIED_EMBED_LR=0.035 +export MUON_MOMENTUM=0.99 +export MUON_MOMENTUM_WARMUP_START=0.92 +export MUON_MOMENTUM_WARMUP_STEPS=700 +export MUON_WD=0.04 +export ADAM_WD=0.04 +export GRAD_CLIP_NORM=0.3 +export SWA_ENABLED=1 +export SWA_EVERY=50 +export TTT_ENABLED=1 +export CORE_START=3 +export CORE_END=8 +export CORE_QUANT_ENABLED=0 +export CORE_QUANT_BITS=6 +export NUM_PASSES=4 +export LORA_RANK=2 +export WANDB_PROJECT="parameter-golf" +export WANDB_NAME="full_4pass_fixedLR_lateQAT_LoRA_TTT_80min" + +LOG="/home/nesta/parameter-golf/full_4pass_v2.log" +echo "START full run: 4-pass LoRA-r8 fixedLR lateQAT TTT, 80min ($(date))" + +$PYTHON train_gpt_recurrent.py \ + --feedback-mode diagonal --feedback-rank 2 \ + --residual-scale-init 0.5 \ + --jacobian-proxy-weight 0.1 \ + --no-interpass-rmsnorm \ + --lora-rank 2 \ + > "$LOG" 2>&1 + +EXIT=$? 
+echo "" +if [ $EXIT -ne 0 ]; then + echo "FAILED (exit=$EXIT)" + tail -30 "$LOG" +else + echo "=== FINAL RESULTS ===" + grep 'stopping_early\|peak memory' "$LOG" + grep 'final_int6_roundtrip_exact' "$LOG" + grep 'final_int6_sliding_window_exact' "$LOG" + grep 'final_int6_sliding_window_s64_exact' "$LOG" +fi + +echo "FINISHED ($(date))" diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_lora_test.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_lora_test.sh new file mode 100755 index 0000000000..52f13cbe5d --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_lora_test.sh @@ -0,0 +1,79 @@ +#!/bin/bash +set -uo pipefail + +PYTHON="/home/nesta/parameter-golf/.venv/bin/python3" +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +cd "$SCRIPT_DIR" + +set -a; source /home/nesta/parameter-golf/.env; set +a + +export PYTHONUNBUFFERED=1 +export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True +export DATA_PATH="../../../data/datasets/fineweb10B_sp1024" +export TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model" +export SEED=1337 +export ITERATIONS=500 +export MAX_WALLCLOCK_SECONDS=0 +export VAL_LOSS_EVERY=50 +export TRAIN_LOG_EVERY=10 +export WARMUP_STEPS=20 +export WARMDOWN_ITERS=0 +export TRAIN_BATCH_TOKENS=786432 +export TRAIN_SEQ_LEN=2048 +export EVAL_SEQ_LEN=2048 +export EVAL_STRIDE=0 +export NUM_LAYERS=11 +export MODEL_DIM=512 +export NUM_HEADS=8 +export NUM_KV_HEADS=4 +export BIGRAM_VOCAB_SIZE=1536 +export XSA_LAST_N=4 +export ROPE_DIMS=16 +export LN_SCALE=1 +export VE_ENABLED=1 +export VE_DIM=128 +export VE_LAYERS="9,10" +export MATRIX_LR=0.025 +export SCALAR_LR=0.025 +export TIED_EMBED_LR=0.035 +export MUON_MOMENTUM=0.99 +export MUON_MOMENTUM_WARMUP_START=0.92 +export MUON_MOMENTUM_WARMUP_STEPS=700 +export MUON_WD=0.04 +export ADAM_WD=0.04 +export GRAD_CLIP_NORM=0.3 +export SWA_ENABLED=0 +export LATE_QAT_THRESHOLD=0 +export TTT_ENABLED=0 +export CORE_START=3 +export CORE_END=8 +export CORE_QUANT_ENABLED=0 +export CORE_QUANT_BITS=6 +export NUM_PASSES=4 +export LORA_RANK=2 +export WANDB_PROJECT="parameter-golf" +export WANDB_NAME="lora_test_500step_r2_fixed" +export TORCH_COMPILE_DISABLE=1 + +LOG="/home/nesta/parameter-golf/lora_test_500step.log" +echo "START LoRA test: 4-pass r8, 500 steps ($(date))" | tee "$LOG" + +$PYTHON train_gpt_recurrent.py \ + --feedback-mode diagonal --feedback-rank 2 \ + --residual-scale-init 0.5 \ + --jacobian-proxy-weight 0.1 \ + --no-interpass-rmsnorm \ + --lora-rank 2 \ + >> "$LOG" 2>&1 + +EXIT=$? 
+echo "" +if [ $EXIT -ne 0 ]; then + echo "FAILED (exit=$EXIT)" + tail -30 "$LOG" +else + echo "=== FINAL RESULTS ===" + grep 'val_bpb' "$LOG" | tail -5 +fi + +echo "FINISHED ($(date))" diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_lora_test_r8.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_lora_test_r8.sh new file mode 100755 index 0000000000..3ebe4b3249 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_lora_test_r8.sh @@ -0,0 +1,83 @@ +#!/bin/bash +set -uo pipefail + +PYTHON="/home/nesta/parameter-golf/.venv/bin/python3" +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +cd "$SCRIPT_DIR" + +set -a; source /home/nesta/parameter-golf/.env; set +a + +export PYTHONUNBUFFERED=1 +export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True +export DATA_PATH="../../../data/datasets/fineweb10B_sp1024" +export TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model" +export SEED=1337 +export ITERATIONS=20000 +export MAX_WALLCLOCK_SECONDS=4800 +export VAL_LOSS_EVERY=500 +export TRAIN_LOG_EVERY=50 +export WARMUP_STEPS=20 +export WARMDOWN_ITERS=1700 +export TRAIN_BATCH_TOKENS=786432 +export TRAIN_SEQ_LEN=2048 +export EVAL_SEQ_LEN=2048 +export EVAL_STRIDE=64 +export NUM_LAYERS=11 +export MODEL_DIM=512 +export NUM_HEADS=8 +export NUM_KV_HEADS=4 +export BIGRAM_VOCAB_SIZE=1536 +export XSA_LAST_N=4 +export ROPE_DIMS=16 +export LN_SCALE=1 +export VE_ENABLED=1 +export VE_DIM=128 +export VE_LAYERS="9,10" +export MATRIX_LR=0.025 +export SCALAR_LR=0.025 +export TIED_EMBED_LR=0.035 +export MUON_MOMENTUM=0.99 +export MUON_MOMENTUM_WARMUP_START=0.92 +export MUON_MOMENTUM_WARMUP_STEPS=700 +export MUON_WD=0.04 +export ADAM_WD=0.04 +export GRAD_CLIP_NORM=0.3 +export SWA_ENABLED=1 +export SWA_EVERY=50 +export TTT_ENABLED=1 +export CORE_START=3 +export CORE_END=8 +export CORE_QUANT_ENABLED=0 +export CORE_QUANT_BITS=6 +export NUM_PASSES=4 +export LORA_RANK=8 +export WANDB_PROJECT="parameter-golf" +export WANDB_NAME="full_4pass_lora_r8_delayed_80min" + +LOG="/home/nesta/parameter-golf/full_lora_r8.log" +echo "START full run: 4-pass LoRA-r8 delayed-warmup TTT SWA, 80min ($(date))" | tee "$LOG" + +$PYTHON train_gpt_recurrent.py \ + --feedback-mode diagonal --feedback-rank 2 \ + --residual-scale-init 0.5 \ + --jacobian-proxy-weight 0.1 \ + --no-interpass-rmsnorm \ + --lora-rank 8 \ + --lora-warmup-steps 1500 \ + >> "$LOG" 2>&1 + +EXIT=$? 
+echo "" +if [ $EXIT -ne 0 ]; then + echo "FAILED (exit=$EXIT)" + tail -30 "$LOG" +else + echo "=== FINAL RESULTS ===" + grep 'stopping_early\|peak memory' "$LOG" + grep 'final_int6_roundtrip_exact' "$LOG" + grep 'final_int6_sliding_window_exact' "$LOG" + grep 'final_int6_sliding_window_s64_exact' "$LOG" + grep 'legal_ttt_exact' "$LOG" +fi + +echo "FINISHED ($(date))" diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/sweep_passes.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/sweep_passes.sh new file mode 100755 index 0000000000..56a9ce472b --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/sweep_passes.sh @@ -0,0 +1,88 @@ +#!/bin/bash +set -uo pipefail + +PYTHON="/home/nesta/parameter-golf/.venv/bin/python3" +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +cd "$SCRIPT_DIR" + +set -a; source /home/nesta/parameter-golf/.env; set +a + +export PYTHONUNBUFFERED=1 +export TORCH_COMPILE_DISABLE=1 +export DATA_PATH="../../../data/datasets/fineweb10B_sp1024" +export TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model" +export SEED=1337 +export ITERATIONS=50 +export MAX_WALLCLOCK_SECONDS=900 +export VAL_LOSS_EVERY=25 +export TRAIN_LOG_EVERY=10 +export WARMUP_STEPS=5 +export WARMDOWN_ITERS=10 +export TRAIN_BATCH_TOKENS=786432 +export TRAIN_SEQ_LEN=2048 +export EVAL_SEQ_LEN=2048 +export EVAL_STRIDE=0 +export NUM_LAYERS=11 +export MODEL_DIM=512 +export NUM_HEADS=8 +export NUM_KV_HEADS=4 +export BIGRAM_VOCAB_SIZE=1536 +export XSA_LAST_N=4 +export ROPE_DIMS=16 +export LN_SCALE=1 +export VE_ENABLED=1 +export VE_DIM=128 +export VE_LAYERS="9,10" +export MATRIX_LR=0.025 +export SCALAR_LR=0.025 +export TIED_EMBED_LR=0.035 +export MUON_MOMENTUM=0.99 +export MUON_MOMENTUM_WARMUP_START=0.92 +export MUON_MOMENTUM_WARMUP_STEPS=5 +export MUON_WD=0.04 +export ADAM_WD=0.04 +export GRAD_CLIP_NORM=0.3 +export SWA_ENABLED=0 +export LATE_QAT=0 +export TTT_ENABLED=0 +export CORE_START=3 +export CORE_END=8 +export CORE_QUANT_ENABLED=0 +export WANDB_PROJECT="parameter-golf" + +RESULTS="/home/nesta/parameter-golf/sweep_passes_results.txt" +echo "=== Pass count sweep: noRMS, jac=0.1 ===" > "$RESULTS" + +for PASSES in 5 6 8; do + export NUM_PASSES=$PASSES + export WANDB_NAME="sweep_${PASSES}pass_noRMS_j0.1" + LOG="/home/nesta/parameter-golf/sweep_${PASSES}pass.log" + + echo "[$PASSES-pass] START ($(date +%H:%M:%S))" + + $PYTHON train_gpt_recurrent.py \ + --feedback-mode diagonal --feedback-rank 2 \ + --residual-scale-init 0.5 \ + --jacobian-proxy-weight 0.1 \ + --no-interpass-rmsnorm \ + > "$LOG" 2>&1 || { + echo "[$PASSES-pass] FAILED (exit=$?)" + MEM=$(grep 'OutOfMemory\|peak memory' "$LOG" | tail -1) + echo "$PASSES-pass: FAILED $MEM" >> "$RESULTS" + tail -5 "$LOG" + continue + } + + BPB_50=$(grep 'step:50/.*val_bpb:' "$LOG" | head -1 | sed 's/.*val_bpb:\([0-9.]*\).*/\1/' || echo "N/A") + INT6_BPB=$(grep 'final_int6_roundtrip_exact.*val_bpb:' "$LOG" | head -1 | sed 's/.*val_bpb:\([0-9.]*\).*/\1/' || echo "N/A") + MEM=$(grep 'peak memory' "$LOG" | head -1 | sed 's/.*allocated: \([0-9]*\) MiB.*/\1/' || echo "N/A") + STEP_AVG=$(grep 'step:50/.*step_avg:' "$LOG" | head -1 | sed 's/.*step_avg:\([0-9.]*\)ms.*/\1/' || echo "N/A") + GROWTH=$(grep 'step:50/.*growth=' "$LOG" | head -1 | sed "s/.*growth=\[//;s/\].*//;s/'//g" || echo "N/A") + + echo "[$PASSES-pass] DONE => bpb@50=$BPB_50 int6=$INT6_BPB step=${STEP_AVG}ms mem=${MEM}MiB" + echo "$PASSES-pass: bpb=$BPB_50 int6=$INT6_BPB step=${STEP_AVG}ms mem=${MEM}MiB" >> "$RESULTS" +done + +echo "" >> 
"$RESULTS" +echo "=== SWEEP COMPLETE ($(date)) ===" >> "$RESULTS" +cat "$RESULTS" diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py index 1eebea808a..a82ec73612 100644 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py @@ -20,6 +20,8 @@ import numpy as np import sentencepiece as spm import torch +import torch._dynamo +torch._dynamo.config.recompile_limit = 32 import torch.distributed as dist import torch.nn.functional as F from torch import Tensor, nn @@ -114,6 +116,7 @@ class Hyperparameters: num_passes = int(os.environ.get("NUM_PASSES", 1)) core_quant_bits = int(os.environ.get("CORE_QUANT_BITS", 6)) core_quant_enabled = bool(int(os.environ.get("CORE_QUANT_ENABLED", "0"))) + lora_rank = int(os.environ.get("LORA_RANK", 0)) # --- Batched Newton-Schulz orthogonalization --- @@ -835,6 +838,7 @@ def __init__( core_quant_enabled: bool = False, residual_scale: nn.Module | None = None, interpass_rmsnorm: bool = True, + lora_rank: int = 0, ): super().__init__() self._ve_target_dim = num_kv_heads * (model_dim // num_heads) @@ -856,6 +860,7 @@ def __init__( self.num_core = self.core_end - core_start self.num_tail = num_layers - self.core_end self.residual_scale = residual_scale + self.lora_rank = lora_rank self.tok_emb = nn.Embedding(vocab_size, model_dim) self.bigram = BigramHashEmbedding(bigram_vocab_size, bigram_dim, model_dim) if bigram_vocab_size > 0 else None self.smear = SmearGate(model_dim) @@ -870,6 +875,21 @@ def __init__( self.kv_bank = nn.Parameter(torch.empty(2 * num_layers, kv_dim, model_dim)) self.mlp_up_bank = nn.Parameter(torch.empty(num_layers, mlp_dim, model_dim)) self.mlp_down_bank = nn.Parameter(torch.empty(num_layers, model_dim, mlp_dim)) + # Per-pass LoRA adapters for recurrent core (scaled B @ A added to bank weights) + self._lora_scale = 1.0 / math.sqrt(lora_rank) if lora_rank > 0 else 1.0 + self.register_buffer('_lora_step_mul', torch.ones((), dtype=torch.float32), persistent=False) + if lora_rank > 0 and self.num_core > 0 and num_passes > 1: + nc, np_, r = self.num_core, num_passes, lora_rank + for wname, in_d, out_d in [ + ("q", model_dim, model_dim), ("out", model_dim, model_dim), + ("k", model_dim, kv_dim), ("v", model_dim, kv_dim), + ("up", model_dim, mlp_dim), ("down", mlp_dim, model_dim), + ]: + A = nn.Parameter(torch.empty(np_, nc, r, in_d)) + nn.init.normal_(A, mean=0.0, std=0.01) + B = nn.Parameter(torch.zeros(np_, nc, out_d, r)) + setattr(self, f"lora_A_{wname}", A) + setattr(self, f"lora_B_{wname}", B) self.blocks = nn.ModuleList( [ Block( @@ -1002,6 +1022,15 @@ def _forward_hidden(self, input_ids: Tensor, feedback_fn=None, h_prev = x ve = self._get_ve(j, input_ids, ve_cache) q_w, k_w, v_w, out_w, up_w, down_w = self._get_bank_weights(j) + if self.lora_rank > 0: + ci = j - self.core_start + s = self._lora_scale * self._lora_step_mul + q_w = q_w + s * (self.lora_B_q[k, ci] @ self.lora_A_q[k, ci]) + k_w = k_w + s * (self.lora_B_k[k, ci] @ self.lora_A_k[k, ci]) + v_w = v_w + s * (self.lora_B_v[k, ci] @ self.lora_A_v[k, ci]) + out_w = out_w + s * (self.lora_B_out[k, ci] @ self.lora_A_out[k, ci]) + up_w = up_w + s * (self.lora_B_up[k, ci] @ self.lora_A_up[k, ci]) + down_w = down_w + s * (self.lora_B_down[k, ci] @ self.lora_A_down[k, ci]) x, raw_v = self.blocks[j](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, v_embed=ve, 
v0=v0) if v0 is None and raw_v is not None: @@ -1463,6 +1492,14 @@ def parse_args() -> argparse.Namespace: g.add_argument("--residual-scale-init", type=float, default=0.5) g.add_argument("--jacobian-proxy-weight", type=float, default=0.01) g.add_argument("--no-interpass-rmsnorm", action="store_true") + g.add_argument("--lora-rank", type=int, default=0) + g.add_argument("--lora-warmup-steps", type=int, default=0, + help="Linearly ramp LoRA scale from 0 to 1 over this many steps.") + g = parser.add_argument_group("eval-only") + g.add_argument("--eval-only-passes", type=int, default=None, + help="Skip training; load final_model.pt and run TTT eval with this many passes.") + g.add_argument("--eval-only-checkpoint", type=str, default="final_model.pt", + help="Checkpoint path for --eval-only-passes mode.") return parser.parse_args() def main() -> None: @@ -1571,6 +1608,7 @@ def log0(msg: str, console: bool = True) -> None: core_quant_enabled=args.core_quant_enabled, residual_scale=None, interpass_rmsnorm=not cli.no_interpass_rmsnorm, + lora_rank=cli.lora_rank or args.lora_rank, ).to(device).bfloat16() # Banks stay FP32 (like CastedLinear weights), cast to BF16 in forward base_model.qo_bank.data = base_model.qo_bank.data.float() @@ -1609,9 +1647,56 @@ def feedback_fn(h, pass_idx): residual_scale = ResidualScale(args.num_passes, cli.residual_scale_init).to(device) base_model.residual_scale = residual_scale extra_scalar_params.extend(residual_scale.parameters()) + lora_params: list[nn.Parameter] = [] + if base_model.lora_rank > 0: + lora_params = [p for n, p in base_model.named_parameters() if "lora_" in n] + for p in lora_params: + p.data = p.data.float() + log0(f"lora: rank={base_model.lora_rank} params={sum(p.numel() for p in lora_params)}") log0(f"recurrence: core_start={args.core_start} core_end={args.core_end} " f"num_passes={args.num_passes} stem={base_model.num_stem} " f"core={base_model.num_core} tail={base_model.num_tail}") + + # --- Eval-only mode: load checkpoint, override passes, run TTT, exit --- + if cli.eval_only_passes is not None: + ckpt_path = cli.eval_only_checkpoint + log0(f"eval_only: loading checkpoint {ckpt_path}") + ckpt_sd = torch.load(ckpt_path, map_location=device, weights_only=True) + base_model.load_state_dict(ckpt_sd, strict=True) + base_model.qo_bank.data = base_model.qo_bank.data.float() + base_model.kv_bank.data = base_model.kv_bank.data.float() + base_model.mlp_up_bank.data = base_model.mlp_up_bank.data.float() + base_model.mlp_down_bank.data = base_model.mlp_down_bank.data.float() + for module in base_model.modules(): + if isinstance(module, CastedLinear): + module.float() + restore_low_dim_params_to_fp32(base_model) + target_passes = cli.eval_only_passes + trained_passes = base_model.num_passes + log0(f"eval_only: overriding num_passes {trained_passes} -> {target_passes}") + base_model.num_passes = target_passes + if base_model.residual_scale is not None: + old_scales = base_model.residual_scale.scales.data + if target_passes != old_scales.shape[0]: + new_scales = torch.full((target_passes,), cli.residual_scale_init, + dtype=torch.float32, device=old_scales.device) + copy_len = min(target_passes, old_scales.shape[0]) + new_scales[:copy_len] = old_scales[:copy_len] + base_model.residual_scale.scales = nn.Parameter(new_scales) + log0(f"eval_only: ResidualScale padded/trimmed {old_scales.shape[0]} -> {target_passes}") + base_model.eval() + log0(f"eval_only: running TTT with {target_passes} passes") + ttt_loss, ttt_bpb = eval_val_sliding_ttt( + args, base_model, 
rank, world_size, device, + val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut, + stride=args.eval_stride, log0=log0, + ) + log0(f"legal_ttt val_loss:{ttt_loss:.4f} val_bpb:{ttt_bpb:.4f}") + log0(f"legal_ttt_exact val_loss:{ttt_loss:.8f} val_bpb:{ttt_bpb:.8f}") + if distributed: + dist.destroy_process_group() + return + # No DDP -- Parallel Muon handles bank grad communication via reduce-scatter, # and non-bank grads are manually all-reduced before Adam steps. compiled_model = torch.compile(base_model, dynamic=False, fullgraph=True) @@ -1689,9 +1774,23 @@ def feedback_fn(h, pass_idx): fused=True, ) replicated_params.append(base_model.lm_head.weight) + optimizer_lora = None + if lora_params: + lora_lr = args.scalar_lr * 0.1 + optimizer_lora = torch.optim.AdamW( + [{"params": lora_params, "lr": lora_lr, "base_lr": lora_lr}], + betas=(args.beta1, args.beta2), + eps=args.adam_eps, + weight_decay=args.adam_wd, + fused=True, + ) + replicated_params.extend(lora_params) + log0(f"lora_optimizer: lr={lora_lr} (scalar_lr * 0.1)") optimizers: list[torch.optim.Optimizer] = [optimizer_tok, optimizer_muon, optimizer_scalar] if optimizer_head is not None: optimizers.append(optimizer_head) + if optimizer_lora is not None: + optimizers.append(optimizer_lora) n_params = sum(p.numel() for p in base_model.parameters()) mtp_params = sum(p.numel() for p in base_model.mtp_heads.parameters()) log0(f"model_params:{n_params}") @@ -1838,9 +1937,12 @@ def lr_mul(step: int, elapsed_ms: float) -> float: break elapsed_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0) scale = lr_mul(step, elapsed_ms) - if args.late_qat_threshold > 0 and scale < args.late_qat_threshold and not CastedLinear._qat_enabled: + if args.late_qat_threshold > 0 and step > 100 and scale < args.late_qat_threshold and not CastedLinear._qat_enabled: CastedLinear._qat_enabled = True - log0(f"late_qat:enabled step:{step} scale:{scale:.4f}") + base_model.core_quant_enabled = True + log0(f"late_qat:enabled step:{step} scale:{scale:.4f} core_quant:on") + if base_model.lora_rank > 0 and cli.lora_warmup_steps > 0: + base_model._lora_step_mul.fill_(min(step / cli.lora_warmup_steps, 1.0)) zero_grad_all() train_loss = torch.zeros((), device=device) for micro_step in range(grad_accum_steps): @@ -1857,8 +1959,9 @@ def lr_mul(step: int, elapsed_ms: float) -> float: for opt in optimizers: for group in opt.param_groups: group["lr"] = group["base_lr"] * scale + grad_norm = None if args.grad_clip_norm > 0: - torch.nn.utils.clip_grad_norm_(base_model.parameters(), args.grad_clip_norm) + grad_norm = torch.nn.utils.clip_grad_norm_(base_model.parameters(), args.grad_clip_norm) # === 3-phase overlapped optimizer step === # Phase 1: Launch async reduce-scatter for banks (biggest first) optimizer_muon.launch_reduce_scatters() @@ -1871,6 +1974,8 @@ def lr_mul(step: int, elapsed_ms: float) -> float: optimizer_scalar.step() if optimizer_head is not None: optimizer_head.step() + if optimizer_lora is not None: + optimizer_lora.step() # Phase 3: Wait for RS, local NS5, all-gather (banks processed last) optimizer_muon.step() zero_grad_all() @@ -1901,12 +2006,16 @@ def lr_mul(step: int, elapsed_ms: float) -> float: ) if should_log_train: tl = train_loss.item() + gn_str = f" grad_norm:{grad_norm:.4f}" if grad_norm is not None else "" log0( - f"step:{step}/{args.iterations} train_loss:{tl:.4f} " + f"step:{step}/{args.iterations} train_loss:{tl:.4f}{gn_str} " f"train_time:{approx_training_time_ms:.0f}ms step_avg:{approx_training_time_ms / step:.2f}ms" ) 
if use_wandb: - _wandb.log({"train_loss": tl, "step_avg_ms": approx_training_time_ms / step, "lr_scale": scale}, step=step) + wlog = {"train_loss": tl, "step_avg_ms": approx_training_time_ms / step, "lr_scale": scale} + if grad_norm is not None: + wlog["grad_norm"] = float(grad_norm) + _wandb.log(wlog, step=step) reached_cap = max_wallclock_ms is not None and approx_training_time_ms >= max_wallclock_ms if distributed and max_wallclock_ms is not None: reached_cap_tensor = torch.tensor(int(reached_cap), device=device) @@ -2003,6 +2112,7 @@ def lr_mul(step: int, elapsed_ms: float) -> float: core_start=args.core_start, core_end=args.core_end, num_passes=args.num_passes, interpass_rmsnorm=not cli.no_interpass_rmsnorm, + lora_rank=cli.lora_rank or args.lora_rank, ).to(device).bfloat16() if residual_scale is not None: eval_rs = ResidualScale(args.num_passes, cli.residual_scale_init).to(device) diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/debug-internal.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/debug-internal.log index 8e9b7f0d45..4ddba16660 120000 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/debug-internal.log +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/debug-internal.log @@ -1 +1 @@ -run-20260326_125242-meaoom9b/logs/debug-internal.log \ No newline at end of file +run-20260327_080959-p8sqkbqa/logs/debug-internal.log \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/debug.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/debug.log index 733e002e6a..749bc8ab5b 120000 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/debug.log +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/debug.log @@ -1 +1 @@ -run-20260326_125242-meaoom9b/logs/debug.log \ No newline at end of file +run-20260327_080959-p8sqkbqa/logs/debug.log \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/latest-run b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/latest-run index 699dd5e8ca..602d8a22f1 120000 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/latest-run +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/latest-run @@ -1 +1 @@ -run-20260326_125242-meaoom9b \ No newline at end of file +run-20260327_080959-p8sqkbqa \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/config.yaml new file mode 100644 index 0000000000..b678a4fc67 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/config.yaml @@ -0,0 +1,98 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + vbqzv6lrtlyc5qq1jo1gdcj8ooc1lprq: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.1" + - --no-interpass-rmsnorm + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "39287189504" + email: nesta.midavaine@prosus.com + executable: /home/nesta/parameter-golf/.venv/bin/python3 + git: + commit: 
b38248cf6d4a1387d06b2906628c717e59747b11 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + python: CPython 3.12.3 + root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback + startedAt: "2026-03-26T12:52:42.386699Z" + writerId: vbqzv6lrtlyc5qq1jo1gdcj8ooc1lprq + m: [] + python_version: 3.12.3 + t: + "1": + - 1 + "2": + - 1 + "3": + - 2 + - 13 + - 16 + - 61 + "4": 3.12.3 + "5": 0.25.1 + "10": + - 20 + "12": 0.25.1 + "13": linux-x86_64 +core_end: + value: 8 +core_start: + value: 3 +feedback_mode: + value: diagonal +feedback_rank: + value: 2 +interpass_rmsnorm: + value: false +iterations: + value: 50 +jacobian_proxy_weight: + value: 0.1 +matrix_lr: + value: 0.025 +model_dim: + value: 512 +n_params: + value: 26927200 +num_layers: + value: 11 +num_passes: + value: 4 +residual_scale_init: + value: 0.5 +scalar_lr: + value: 0.025 +seed: + value: 1337 +train_batch_tokens: + value: 786432 +train_seq_len: + value: 2048 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/output.log index 932fd79c6a..f59955a5a8 100644 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/output.log +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/output.log @@ -28,3 +28,5 @@ Serialized model: 106023671 bytes Code size: 99082 bytes Serialized model int6+lzma: 4795396 bytes Total submission size int6+lzma: 4894478 bytes +final_int6_roundtrip val_loss:6.1634 val_bpb:3.6503 eval_time:82960ms +final_int6_roundtrip_exact val_loss:6.16344777 val_bpb:3.65034094 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/wandb-summary.json new file mode 100644 index 0000000000..ca4a995134 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/wandb-summary.json @@ -0,0 +1 @@ +{"_wandb":{"runtime":608},"_timestamp":1.7745299878809197e+09,"_step":50,"val_loss":3.728250972378476,"val_bpb":2.2080802280723275,"train_loss":3.765305995941162,"step_avg_ms":3156.9755141000496,"lr_scale":1,"_runtime":608.858490227} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_130804-ce1th47g/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_130804-ce1th47g/files/config.yaml new file mode 100644 index 0000000000..f0e7c9a992 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_130804-ce1th47g/files/config.yaml @@ -0,0 +1,98 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + rhynn20m0sn3ont2pbgskdqzb45bqnf1: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - 
--residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.1" + - --no-interpass-rmsnorm + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "39648141312" + email: nesta.midavaine@prosus.com + executable: /home/nesta/parameter-golf/.venv/bin/python3 + git: + commit: e07e44321b5c5af051343a3a16d83f0766e85597 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + python: CPython 3.12.3 + root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback + startedAt: "2026-03-26T13:08:04.378163Z" + writerId: rhynn20m0sn3ont2pbgskdqzb45bqnf1 + m: [] + python_version: 3.12.3 + t: + "1": + - 1 + "2": + - 1 + "3": + - 2 + - 13 + - 16 + - 61 + "4": 3.12.3 + "5": 0.25.1 + "10": + - 20 + "12": 0.25.1 + "13": linux-x86_64 +core_end: + value: 8 +core_start: + value: 3 +feedback_mode: + value: diagonal +feedback_rank: + value: 2 +interpass_rmsnorm: + value: false +iterations: + value: 50 +jacobian_proxy_weight: + value: 0.1 +matrix_lr: + value: 0.025 +model_dim: + value: 512 +n_params: + value: 26927201 +num_layers: + value: 11 +num_passes: + value: 5 +residual_scale_init: + value: 0.5 +scalar_lr: + value: 0.025 +seed: + value: 1337 +train_batch_tokens: + value: 786432 +train_seq_len: + value: 2048 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_130804-ce1th47g/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_130804-ce1th47g/files/output.log new file mode 100644 index 0000000000..b598d1d44f --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_130804-ce1th47g/files/output.log @@ -0,0 +1,32 @@ +wandb:initialized +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['9697.0', '10451.6', '11267.2', '12136.6', '13066.3', '14092.3', '15164.6', '16287.1', '17473.2', '18730.2', '17176.6', '18430.5', '19751.7', '21150.8', '22632.8', '20820.9', '22301.9', '23864.0', '25513.0', '27242.6', '25115.3', '26855.7', '28683.3', '30605.1', '32620.9'] growth=['1.076', '1.078', '1.078', '1.077', '1.077', '1.079', '1.076', '1.074', '1.073', '1.072', '1.075', '1.073', '1.072', '1.071', '1.070', '1.072', '1.071', '1.070', '1.069', '1.068', '1.070', '1.069', '1.068', '1.067', '1.066'] +step:1/50 train_loss:6.9310 train_time:3595ms step_avg:3594.71ms +step:2/50 train_loss:8.2519 train_time:7178ms step_avg:3588.94ms +step:3/50 train_loss:7.4903 train_time:10793ms step_avg:3597.54ms +step:4/50 train_loss:7.6972 train_time:14409ms step_avg:3602.19ms +step:5/50 train_loss:7.4012 train_time:18024ms step_avg:3604.78ms +step:6/50 train_loss:7.0545 train_time:21640ms step_avg:3606.66ms +step:7/50 train_loss:6.8316 train_time:25257ms step_avg:3608.12ms +step:8/50 train_loss:6.7797 train_time:28873ms 
step_avg:3609.17ms +step:9/50 train_loss:6.4829 train_time:32490ms step_avg:3610.01ms +step:10/50 train_loss:6.1437 train_time:36108ms step_avg:3610.80ms +step:20/50 train_loss:4.7380 train_time:72276ms step_avg:3613.81ms +step:25/50 val_loss:4.3185 val_bpb:2.5577 train_time:90403ms step_avg:3616.13ms h_norms=['12234.1', '10923.7', '9965.4', '9287.7', '8892.1', '8588.1', '8375.3', '8299.3', '8339.8', '8524.9', '8358.5', '8246.4', '8260.4', '8384.4', '8644.6', '8290.1', '8234.7', '8305.1', '8486.1', '8800.2', '8326.9', '8299.6', '8403.4', '8622.6', '8975.6'] growth=['0.878', '0.893', '0.912', '0.932', '0.957', '0.966', '0.975', '0.991', '1.005', '1.022', '0.978', '0.987', '1.002', '1.015', '1.031', '0.985', '0.993', '1.009', '1.022', '1.037', '0.988', '0.997', '1.013', '1.026', '1.041'] +step:30/50 train_loss:4.1582 train_time:108463ms step_avg:3615.44ms +step:40/50 train_loss:3.8947 train_time:144657ms step_avg:3616.43ms +step:50/50 train_loss:3.7328 train_time:180990ms step_avg:3619.81ms +step:50/50 val_loss:3.6922 val_bpb:2.1867 train_time:181025ms step_avg:3620.49ms h_norms=['17283.6', '14695.4', '12995.7', '11901.6', '11252.4', '11294.1', '11357.9', '11432.8', '11508.8', '11635.1', '11340.6', '11532.3', '11678.7', '11798.8', '11950.4', '11501.0', '11757.0', '11936.4', '12077.0', '12238.6', '11722.3', '12001.2', '12189.6', '12335.8', '12498.2'] growth=['0.819', '0.850', '0.884', '0.916', '0.945', '1.004', '1.006', '1.007', '1.007', '1.011', '1.023', '1.017', '1.013', '1.010', '1.013', '1.034', '1.022', '1.015', '1.012', '1.013', '1.038', '1.024', '1.016', '1.012', '1.013'] +peak memory allocated: 78629 MiB reserved: 79984 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:5.9839 val_bpb:3.5440 eval_time:98671ms +Serialized model: 106023671 bytes +Code size: 99082 bytes +Serialized model int6+lzma: 4803484 bytes +Total submission size int6+lzma: 4902566 bytes +final_int6_roundtrip val_loss:6.1921 val_bpb:3.6673 eval_time:98102ms +final_int6_roundtrip_exact val_loss:6.19214207 val_bpb:3.66733532 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_130804-ce1th47g/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_130804-ce1th47g/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_130804-ce1th47g/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 
+networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_130804-ce1th47g/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_130804-ce1th47g/files/wandb-metadata.json new file mode 100644 index 0000000000..d230625b56 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_130804-ce1th47g/files/wandb-metadata.json @@ -0,0 +1,51 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T13:08:04.378163Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "39648141312" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "rhynn20m0sn3ont2pbgskdqzb45bqnf1" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_130804-ce1th47g/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_130804-ce1th47g/files/wandb-summary.json new file mode 100644 index 0000000000..54f85aa3df --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_130804-ce1th47g/files/wandb-summary.json @@ -0,0 +1 @@ 
+{"train_loss":3.7327795028686523,"val_loss":3.6922127628653922,"_timestamp":1.774530980960416e+09,"val_bpb":2.1867363704644256,"step_avg_ms":3619.808673739899,"_step":50,"_runtime":710.136496234,"lr_scale":1,"_wandb":{"runtime":710}} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132011-5rznhtcu/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132011-5rznhtcu/files/output.log new file mode 100644 index 0000000000..42f9f123fe --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132011-5rznhtcu/files/output.log @@ -0,0 +1,19 @@ +wandb:initialized +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['9459.8', '10026.0', '10638.5', '11293.0', '11994.1', '12754.8', '13546.4', '14370.5', '15234.0', '16144.6', '14936.2', '15835.7', '16776.4', '17764.6', '18807.4', '17439.0', '18467.7', '19543.9', '20664.7', '21838.3', '20287.3', '21442.5', '22656.2', '23915.0', '25239.7', '23491.8', '24792.6', '26161.8', '27579.4', '29075.0'] growth=['1.057', '1.060', '1.061', '1.062', '1.062', '1.063', '1.062', '1.061', '1.060', '1.060', '1.061', '1.060', '1.059', '1.059', '1.059', '1.060', '1.059', '1.058', '1.057', '1.057', '1.058', '1.057', '1.057', '1.056', '1.055', '1.057', '1.055', '1.055', '1.054', '1.054'] +step:1/50 train_loss:6.9310 train_time:4157ms step_avg:4156.91ms +step:2/50 train_loss:8.1611 train_time:8307ms step_avg:4153.58ms +step:3/50 train_loss:7.5076 train_time:12490ms step_avg:4163.29ms +step:4/50 train_loss:7.6959 train_time:16672ms step_avg:4167.94ms +step:5/50 train_loss:7.4174 train_time:20854ms step_avg:4170.71ms +step:6/50 train_loss:7.1131 train_time:25035ms step_avg:4172.58ms +step:7/50 train_loss:6.9487 train_time:29219ms step_avg:4174.14ms +step:8/50 train_loss:6.7735 train_time:33402ms step_avg:4175.31ms +step:9/50 train_loss:6.4261 train_time:37586ms step_avg:4176.27ms +step:10/50 train_loss:6.0743 train_time:41771ms step_avg:4177.07ms +step:20/50 train_loss:4.7079 train_time:83751ms step_avg:4187.54ms +step:25/50 val_loss:4.2787 val_bpb:2.5341 t diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132011-5rznhtcu/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132011-5rznhtcu/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132011-5rznhtcu/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 
+python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132011-5rznhtcu/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132011-5rznhtcu/files/wandb-metadata.json new file mode 100644 index 0000000000..e1aaadbfe9 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132011-5rznhtcu/files/wandb-metadata.json @@ -0,0 +1,51 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T13:20:11.134489Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "39649058816" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "xsj35rs3wkhoxundi982hutqr3l1l7mn" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132634-x2sl50qr/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132634-x2sl50qr/files/output.log new file 
mode 100644 index 0000000000..86e820ede8 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132634-x2sl50qr/files/output.log @@ -0,0 +1,35 @@ +wandb:initialized +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/20000 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms +step:1/20000 train_loss:6.9310 train_time:1288ms step_avg:1287.95ms +step:2/20000 train_loss:8.3536 train_time:2591ms step_avg:1295.41ms +step:3/20000 train_loss:7.5089 train_time:3939ms step_avg:1313.01ms +step:4/20000 train_loss:7.5822 train_time:5294ms step_avg:1323.40ms +step:5/20000 train_loss:7.3524 train_time:6645ms step_avg:1328.95ms +step:6/20000 train_loss:7.0866 train_time:7993ms step_avg:1332.18ms +step:7/20000 train_loss:6.9398 train_time:9356ms step_avg:1336.57ms +step:8/20000 train_loss:6.8951 train_time:10726ms step_avg:1340.75ms +step:9/20000 train_loss:6.5426 train_time:12083ms step_avg:1342.55ms +step:10/20000 train_loss:6.1426 train_time:13437ms step_avg:1343.72ms +step:50/20000 train_loss:3.6826 train_time:68838ms step_avg:1376.77ms +step:100/20000 train_loss:3.1286 train_time:138070ms step_avg:1380.70ms +step:150/20000 train_loss:2.7613 train_time:272923ms step_avg:1819.49ms diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132634-x2sl50qr/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132634-x2sl50qr/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132634-x2sl50qr/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 
+click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132634-x2sl50qr/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132634-x2sl50qr/files/wandb-metadata.json new file mode 100644 index 0000000000..5143899619 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132634-x2sl50qr/files/wandb-metadata.json @@ -0,0 +1,51 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T13:26:34.814695Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "39649619968" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "kfir03rica7dai0u58fwumltoas9wnln" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133036-mo9kb26s/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133036-mo9kb26s/files/output.log new file mode 100644 index 0000000000..0622a06eba --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133036-mo9kb26s/files/output.log @@ -0,0 +1,32 @@ +wandb:initialized +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/20000 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms +step:1/20000 train_loss:6.9310 train_time:2909ms step_avg:2908.98ms 
+step:2/20000 train_loss:8.4243 train_time:5662ms step_avg:2830.81ms +step:3/20000 train_loss:7.8519 train_time:8500ms step_avg:2833.22ms +step:4/20000 train_loss:7.1213 train_time:11343ms step_avg:2835.65ms +step:5/20000 train_loss:6.5923 train_time:14192ms step_avg:2838.40ms +step:6/20000 train_loss:6.3670 train_time:17042ms step_avg:2840.33ms +step:7/20000 train_loss:6.2103 train_time:19891ms step_avg:2841.54ms +step:8/20000 train_loss:6.1333 train_time:22735ms step_avg:2841.84ms +step:9/20000 train_loss:6.0992 train_time:25576ms step_avg:2841.78ms +step:10/20000 train_loss:5.9961 train_time:28419ms step_avg:2841.93ms diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133036-mo9kb26s/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133036-mo9kb26s/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133036-mo9kb26s/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133036-mo9kb26s/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133036-mo9kb26s/files/wandb-metadata.json new file mode 100644 index 0000000000..6ba34c3527 --- /dev/null +++ 
b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133036-mo9kb26s/files/wandb-metadata.json @@ -0,0 +1,51 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T13:30:36.462475Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "39852167168" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "k8oaolw8fvkuv3t9jkeyvfpvbgtalxyy" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133353-atic3pnd/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133353-atic3pnd/files/config.yaml new file mode 100644 index 0000000000..0efa0f2ab1 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133353-atic3pnd/files/config.yaml @@ -0,0 +1,96 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + c6ejn0w4gay1zi0rou3jtsnxmm8tprci: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.1" + - --no-interpass-rmsnorm + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "39973859328" + email: nesta.midavaine@prosus.com + executable: /home/nesta/parameter-golf/.venv/bin/python3 + git: + commit: e07e44321b5c5af051343a3a16d83f0766e85597 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + python: CPython 3.12.3 + root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback + startedAt: "2026-03-26T13:33:53.996998Z" + writerId: c6ejn0w4gay1zi0rou3jtsnxmm8tprci + m: [] + python_version: 3.12.3 + t: + "1": + - 1 + "2": + - 1 + "3": + - 13 + 
- 16
+      "4": 3.12.3
+      "5": 0.25.1
+      "10":
+        - 20
+      "12": 0.25.1
+      "13": linux-x86_64
+core_end:
+  value: 8
+core_start:
+  value: 3
+feedback_mode:
+  value: diagonal
+feedback_rank:
+  value: 2
+interpass_rmsnorm:
+  value: false
+iterations:
+  value: 20000
+jacobian_proxy_weight:
+  value: 0.1
+matrix_lr:
+  value: 0.025
+model_dim:
+  value: 512
+n_params:
+  value: 26927200
+num_layers:
+  value: 11
+num_passes:
+  value: 4
+residual_scale_init:
+  value: 0.5
+scalar_lr:
+  value: 0.025
+seed:
+  value: 1337
+train_batch_tokens:
+  value: 786432
+train_seq_len:
+  value: 2048
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133353-atic3pnd/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133353-atic3pnd/files/output.log
new file mode 100644
index 0000000000..0b74980ee8
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133353-atic3pnd/files/output.log
@@ -0,0 +1,67 @@
+wandb:initialized
+Traceback (most recent call last):
+  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2084, in <module>
+    main()
+  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1760, in main
+    warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer)
+                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__
+    return super().__call__(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
+    return self._call_impl(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
+    return forward_call(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper
+    return fn(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^
+  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
+    return self._call_impl(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
+    return forward_call(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1027, in forward
+    def forward(self, input_ids: Tensor, target_ids: Tensor,
+  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn
+    return fn(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^
+  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1200, in forward
+    return compiled_fn(full_args)
+           ^^^^^^^^^^^^^^^^^^^^^^
+  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 566, in runtime_wrapper
+    all_outs = call_func_at_runtime_with_args(
+               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File
"/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args + out = normalize_as_list(f(args)) + ^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 105, in g + return f(*args) + ^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 596, in apply + return super().apply(*args, **kwargs) # type: ignore[misc] + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2503, in forward + fw_outs = call_func_at_runtime_with_args( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args + out = normalize_as_list(f(args)) + ^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 783, in wrapper + return compiled_fn(runtime_args) + ^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1011, in inner_fn + outs = compiled_fn(args) + ^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__ + return self.current_callable(inputs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run + out = model(new_inputs) + ^^^^^^^^^^^^^^^^^ + File "/tmp/torchinductor_nesta/bj/cbjz4wub4qksf632gjklhdzgwbvw2qz6t5g7gsnwo3esf3zyblfo.py", line 10736, in call + buf771 = empty_strided_cuda((48, 2048, 1536), (3145728, 1536, 1), torch.bfloat16) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 99.88 MiB is free. Process 273188 has 51.64 GiB memory in use. Process 275202 has 51.56 GiB memory in use. Including non-PyTorch memory, this process has 36.48 GiB memory in use. 79.97 GiB allowed; Of the allocated memory 35.81 GiB is allocated by PyTorch, and 11.63 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf) diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133353-atic3pnd/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133353-atic3pnd/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133353-atic3pnd/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133353-atic3pnd/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133353-atic3pnd/files/wandb-metadata.json new file mode 100644 index 0000000000..186f272501 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133353-atic3pnd/files/wandb-metadata.json @@ -0,0 +1,51 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T13:33:53.996998Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm" + ], + "program": 
"/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "39973859328" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "c6ejn0w4gay1zi0rou3jtsnxmm8tprci" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133353-atic3pnd/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133353-atic3pnd/files/wandb-summary.json new file mode 100644 index 0000000000..e563fa0a44 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133353-atic3pnd/files/wandb-summary.json @@ -0,0 +1 @@ +{"_runtime":30,"_wandb":{"runtime":30}} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133558-nwftkz5m/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133558-nwftkz5m/files/output.log new file mode 100644 index 0000000000..b0a576e071 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133558-nwftkz5m/files/output.log @@ -0,0 +1,122 @@ +wandb:initialized +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/20000 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms +step:1/20000 train_loss:6.9310 train_time:1286ms step_avg:1285.72ms +step:2/20000 train_loss:8.4243 train_time:2520ms step_avg:1259.75ms +step:3/20000 train_loss:7.5899 train_time:3791ms step_avg:1263.65ms +step:4/20000 train_loss:7.3604 train_time:5065ms step_avg:1266.15ms +step:5/20000 train_loss:7.2017 train_time:6337ms step_avg:1267.44ms +step:6/20000 train_loss:7.1139 train_time:7608ms step_avg:1267.94ms +step:7/20000 train_loss:7.0266 train_time:8884ms step_avg:1269.15ms +step:8/20000 train_loss:6.8703 train_time:10155ms step_avg:1269.33ms +step:9/20000 train_loss:6.5277 train_time:11432ms step_avg:1270.24ms +step:10/20000 train_loss:6.1364 train_time:12711ms step_avg:1271.13ms +step:50/20000 train_loss:3.7012 train_time:66255ms step_avg:1325.09ms +step:100/20000 train_loss:3.1707 train_time:133824ms step_avg:1338.24ms +step:150/20000 train_loss:2.8377 train_time:201511ms step_avg:1343.41ms +step:200/20000 train_loss:2.6154 
train_time:269226ms step_avg:1346.13ms +step:250/20000 train_loss:2.6168 train_time:336977ms step_avg:1347.91ms +step:300/20000 train_loss:2.4755 train_time:404746ms step_avg:1349.15ms +step:350/20000 train_loss:2.5226 train_time:472525ms step_avg:1350.07ms +step:400/20000 train_loss:2.4300 train_time:540349ms step_avg:1350.87ms +step:450/20000 train_loss:2.2536 train_time:608178ms step_avg:1351.51ms +step:500/20000 train_loss:2.3113 train_time:676567ms step_avg:1353.13ms +step:500/20000 val_loss:2.3274 val_bpb:1.3784 train_time:676611ms step_avg:1353.22ms +step:550/20000 train_loss:2.3695 train_time:744686ms step_avg:1353.98ms +step:600/20000 train_loss:2.2671 train_time:812797ms step_avg:1354.66ms +step:650/20000 train_loss:2.2373 train_time:880946ms step_avg:1355.30ms +step:700/20000 train_loss:2.3081 train_time:949104ms step_avg:1355.86ms +step:750/20000 train_loss:2.2737 train_time:1017254ms step_avg:1356.34ms +step:800/20000 train_loss:2.2530 train_time:1085409ms step_avg:1356.76ms +step:850/20000 train_loss:2.1855 train_time:1153602ms step_avg:1357.18ms +step:900/20000 train_loss:2.1047 train_time:1221810ms step_avg:1357.57ms +step:950/20000 train_loss:2.3058 train_time:1290011ms step_avg:1357.91ms +step:1000/20000 train_loss:2.2370 train_time:1358693ms step_avg:1358.69ms +step:1000/20000 val_loss:2.1807 val_bpb:1.2916 train_time:1358737ms step_avg:1358.74ms +step:1050/20000 train_loss:2.1633 train_time:1426912ms step_avg:1358.96ms +step:1100/20000 train_loss:2.1901 train_time:1495141ms step_avg:1359.22ms +step:1150/20000 train_loss:2.1451 train_time:1563371ms step_avg:1359.45ms +step:1200/20000 train_loss:2.1925 train_time:1631660ms step_avg:1359.72ms +step:1250/20000 train_loss:2.2160 train_time:1699949ms step_avg:1359.96ms +step:1300/20000 train_loss:2.1869 train_time:1768261ms step_avg:1360.20ms +step:1350/20000 train_loss:2.1586 train_time:1836588ms step_avg:1360.44ms +step:1400/20000 train_loss:2.1726 train_time:1904911ms step_avg:1360.65ms +step:1450/20000 train_loss:2.1689 train_time:1973248ms step_avg:1360.86ms +step:1500/20000 train_loss:2.1391 train_time:2042104ms step_avg:1361.40ms +step:1500/20000 val_loss:2.1246 val_bpb:1.2583 train_time:2042149ms step_avg:1361.43ms +step:1550/20000 train_loss:2.1090 train_time:2110481ms step_avg:1361.60ms +step:1600/20000 train_loss:2.1871 train_time:2178869ms step_avg:1361.79ms +step:1650/20000 train_loss:1.9698 train_time:2247283ms step_avg:1361.99ms +step:1700/20000 train_loss:2.0933 train_time:2315736ms step_avg:1362.20ms +step:1750/20000 train_loss:2.0614 train_time:2384155ms step_avg:1362.37ms +step:1800/20000 train_loss:2.0974 train_time:2452595ms step_avg:1362.55ms +step:1850/20000 train_loss:2.1094 train_time:2521066ms step_avg:1362.74ms +step:1900/20000 train_loss:2.0530 train_time:2589507ms step_avg:1362.90ms +step:1950/20000 train_loss:2.0371 train_time:2657961ms step_avg:1363.06ms +step:2000/20000 train_loss:2.2915 train_time:2726880ms step_avg:1363.44ms +step:2000/20000 val_loss:2.0686 val_bpb:1.2252 train_time:2726924ms step_avg:1363.46ms +step:2050/20000 train_loss:2.0580 train_time:2795382ms step_avg:1363.60ms +step:2100/20000 train_loss:2.0309 train_time:2863872ms step_avg:1363.75ms +step:2150/20000 train_loss:2.0080 train_time:2932366ms step_avg:1363.89ms +step:2200/20000 train_loss:2.1611 train_time:3000869ms step_avg:1364.03ms +step:2250/20000 train_loss:2.0531 train_time:3069359ms step_avg:1364.16ms +step:2300/20000 train_loss:2.0314 train_time:3137845ms step_avg:1364.28ms +step:2350/20000 train_loss:1.9839 
train_time:3206343ms step_avg:1364.40ms +step:2400/20000 train_loss:2.0978 train_time:3274856ms step_avg:1364.52ms +step:2450/20000 train_loss:2.0583 train_time:3343349ms step_avg:1364.63ms +step:2500/20000 train_loss:2.0143 train_time:3412281ms step_avg:1364.91ms +step:2500/20000 val_loss:2.0210 val_bpb:1.1969 train_time:3412325ms step_avg:1364.93ms +step:2550/20000 train_loss:2.0163 train_time:3480766ms step_avg:1365.01ms +step:2600/20000 train_loss:1.9947 train_time:3549233ms step_avg:1365.09ms +step:2650/20000 train_loss:1.9997 train_time:3617731ms step_avg:1365.18ms +step:2700/20000 train_loss:2.0195 train_time:3686191ms step_avg:1365.26ms +step:2750/20000 train_loss:2.0010 train_time:3754675ms step_avg:1365.34ms +step:2800/20000 train_loss:2.0359 train_time:3823161ms step_avg:1365.41ms +swa:start step:2850 +step:2850/20000 train_loss:1.9860 train_time:3891626ms step_avg:1365.48ms +step:2900/20000 train_loss:2.0033 train_time:3960176ms step_avg:1365.58ms +step:2950/20000 train_loss:2.0417 train_time:4028712ms step_avg:1365.67ms +late_qat:enabled step:2990 scale:0.1498 +step:3000/20000 train_loss:1.9297 train_time:4097686ms step_avg:1365.90ms +step:3000/20000 val_loss:1.9846 val_bpb:1.1754 train_time:4097782ms step_avg:1365.93ms +step:3050/20000 train_loss:1.9368 train_time:4166118ms step_avg:1365.94ms +step:3100/20000 train_loss:2.0003 train_time:4234401ms step_avg:1365.94ms +step:3150/20000 train_loss:2.0099 train_time:4302671ms step_avg:1365.93ms +step:3200/20000 train_loss:1.9846 train_time:4370945ms step_avg:1365.92ms +step:3250/20000 train_loss:1.9515 train_time:4439218ms step_avg:1365.91ms +step:3300/20000 train_loss:1.9330 train_time:4507468ms step_avg:1365.90ms +step:3350/20000 train_loss:1.9699 train_time:4575766ms step_avg:1365.90ms +step:3400/20000 train_loss:2.0133 train_time:4644041ms step_avg:1365.89ms +step:3450/20000 train_loss:1.9670 train_time:4712315ms step_avg:1365.89ms +step:3500/20000 train_loss:1.9497 train_time:4781271ms step_avg:1366.08ms +step:3500/20000 val_loss:1.9558 val_bpb:1.1583 train_time:4781368ms step_avg:1366.11ms +step:3514/20000 val_loss:1.9557 val_bpb:1.1583 train_time:4800532ms step_avg:1366.12ms +stopping_early: wallclock_cap train_time:4800532ms step:3514/20000 +peak memory allocated: 50545 MiB reserved: 50594 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:1.9532 val_bpb:1.1568 eval_time:32961ms +Serialized model: 106023671 bytes +Code size: 99082 bytes +Serialized model int6+lzma: 14754232 bytes +Total submission size int6+lzma: 14853314 bytes +final_int6_roundtrip val_loss:1.9685 val_bpb:1.1659 eval_time:63637ms +final_int6_roundtrip_exact val_loss:1.96850574 val_bpb:1.16585998 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133558-nwftkz5m/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133558-nwftkz5m/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133558-nwftkz5m/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 
+annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133558-nwftkz5m/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133558-nwftkz5m/files/wandb-metadata.json new file mode 100644 index 0000000000..1eb2a4d50d --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133558-nwftkz5m/files/wandb-metadata.json @@ -0,0 +1,51 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T13:35:58.369574Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "39974141952" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": 
"GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "gnflxawf44rwf4l3yxepjocuqeg7flu6" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_151933-hmzzit2n/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_151933-hmzzit2n/files/output.log new file mode 100644 index 0000000000..f7740182bb --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_151933-hmzzit2n/files/output.log @@ -0,0 +1,42 @@ +wandb:initialized +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.02ms h_norms=['2856693.8', '3617744.0', '3024587.2', '2505640.5', '2045268.8', '1692691.0', '1466293.1', '1317613.1', '1128710.0', '1036196.3', '1421818.8', '1274564.5', '1193117.0', '1029778.4', '1490634.8', '1094878.2', '980524.6', '772689.9', '2807584.8', '1688950.2'] growth=['37.088', '1.266', '0.836', '0.828', '0.816', '0.828', '0.866', '0.899', '0.857', '0.918', '0.893', '0.896', '0.936', '0.863', '1.448', '0.874', '0.896', '0.788', '3.634', '0.602'] +step:1/500 train_loss:6.9303 grad_norm:0.3807 train_time:3120ms step_avg:3120.20ms +late_qat:enabled step:1 scale:0.1125 core_quant:on +step:2/500 train_loss:8.2958 grad_norm:3.6201 train_time:6355ms step_avg:3177.30ms +step:3/500 train_loss:7.7840 grad_norm:1.9355 train_time:9607ms step_avg:3202.48ms +step:4/500 train_loss:7.6312 grad_norm:2.0781 train_time:12861ms step_avg:3215.33ms +step:5/500 train_loss:7.2389 grad_norm:1.7372 train_time:16114ms step_avg:3222.85ms +step:6/500 train_loss:6.9255 grad_norm:1.3250 train_time:19368ms step_avg:3227.98ms +step:7/500 train_loss:6.7056 grad_norm:1.6786 train_time:22623ms step_avg:3231.91ms +step:8/500 train_loss:6.5605 grad_norm:1.9357 train_time:25876ms step_avg:3234.53ms +step:9/500 train_loss:6.3963 grad_norm:1.8729 train_time:29129ms step_avg:3236.52ms +step:10/500 train_loss:6.2184 grad_norm:1.5032 train_time:32381ms step_avg:3238.14ms +step:20/500 train_loss:5.6501 grad_norm:0.3987 train_time:64880ms step_avg:3244.02ms +step:30/500 train_loss:5.4810 grad_norm:0.2723 train_time:97397ms step_avg:3246.58ms +step:40/500 train_loss:5.3294 grad_norm:0.3154 train_time:130061ms step_avg:3251.52ms +step:50/500 train_loss:5.1292 grad_norm:0.4699 train_time:162596ms step_avg:3251.93ms +step:50/500 val_loss:5.1050 val_bpb:3.0235 train_time:162628ms step_avg:3252.56ms h_norms=['5481.6', '5232.7', '5130.8', '5133.4', '5378.8', '5546.0', '6001.6', '6423.6', '6559.1', '6909.3', '6350.7', '7392.7', '7713.3', '8030.2', '8264.0', '7556.6', '7855.9', '8165.5', '8383.7', '8953.6'] growth=['0.776', '0.955', '0.981', '1.000', '1.048', '1.031', '1.082', '1.070', '1.021', '1.053', '1.047', '1.164', '1.043', '1.041', '1.029', '1.066', '1.040', '1.039', '1.027', '1.068'] +step:60/500 train_loss:4.8449 grad_norm:1.4378 train_time:195133ms step_avg:3252.21ms +step:70/500 train_loss:4.6316 grad_norm:0.9923 train_time:227701ms step_avg:3252.87ms +step:80/500 train_loss:4.4605 grad_norm:0.7018 train_time:285889ms step_avg:3573.62ms +step:90/500 train_loss:4.2816 grad_norm:0.4120 
train_time:355477ms step_avg:3949.75ms diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_151933-hmzzit2n/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_151933-hmzzit2n/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_151933-hmzzit2n/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_151933-hmzzit2n/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_151933-hmzzit2n/files/wandb-metadata.json new file mode 100644 index 0000000000..d62171ee9f --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_151933-hmzzit2n/files/wandb-metadata.json @@ -0,0 +1,53 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T15:19:33.286229Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm", + "--lora-rank", + "8" + ], + "program": 
"/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "40031674368" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "25w6qmdyrteywsquqtdyvpurhl41p049" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_152723-snlqmhr2/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_152723-snlqmhr2/files/output.log new file mode 100644 index 0000000000..1aefa45d4f --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_152723-snlqmhr2/files/output.log @@ -0,0 +1,51 @@ +wandb:initialized +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.02ms h_norms=['3412002.5', '3162921.2', '9173489.0', '7602557.5', '6195891.5', '5488329.5', '11289328.0', '9569341.0', '7780140.0', '6543904.0', '3864158.2', '3367786.2', '3060453.0', '2578833.0', '2199645.5', '3025920.8', '2612817.8', '2319063.8', '1985769.4', '1676057.1'] growth=['40.638', '0.927', '2.900', '0.829', '0.815', '0.886', '2.057', '0.848', '0.813', '0.841', '0.926', '0.872', '0.909', '0.843', '0.853', '0.893', '0.863', '0.888', '0.856', '0.844'] +step:1/500 train_loss:6.9303 grad_norm:0.3807 train_time:3123ms step_avg:3122.62ms +step:2/500 train_loss:8.2188 grad_norm:3.3853 train_time:6236ms step_avg:3118.00ms +step:3/500 train_loss:28.9039 grad_norm:69.9957 train_time:9371ms step_avg:3123.80ms +step:4/500 train_loss:25.5037 grad_norm:61.7113 train_time:12508ms step_avg:3126.98ms +step:5/500 train_loss:7.6034 grad_norm:1.4821 train_time:15651ms step_avg:3130.24ms +step:6/500 train_loss:26.0025 grad_norm:97.9849 train_time:18790ms step_avg:3131.59ms +step:7/500 train_loss:16.5605 grad_norm:40.5487 train_time:21927ms step_avg:3132.48ms +step:8/500 train_loss:11.7107 grad_norm:10.1270 train_time:25068ms step_avg:3133.44ms +step:9/500 train_loss:11.1058 grad_norm:11.1070 train_time:28207ms step_avg:3134.13ms +step:10/500 train_loss:8.9177 grad_norm:2.2146 train_time:31350ms step_avg:3134.98ms +step:20/500 train_loss:18.2283 grad_norm:171.1513 train_time:62685ms step_avg:3134.24ms +step:30/500 train_loss:7.3933 grad_norm:3.9780 train_time:93988ms 
step_avg:3132.95ms +step:40/500 train_loss:6.1536 grad_norm:0.7697 train_time:125446ms step_avg:3136.15ms +step:50/500 train_loss:5.7265 grad_norm:0.4587 train_time:156800ms step_avg:3136.00ms +step:50/500 val_loss:5.6210 val_bpb:3.3291 train_time:156832ms step_avg:3136.63ms h_norms=['1483117.2', '2001691.6', '5509635.5', '3288294.0', '2112858.5', '9030652.0', '23049164.0', '14409614.0', '9373722.0', '5569532.0', '1012941.9', '764749.4', '575192.3', '488720.7', '685087.8', '1171268.4', '1059284.2', '703405.6', '689976.9', '514776.4'] growth=['1.514', '1.350', '2.752', '0.597', '0.643', '4.274', '2.552', '0.625', '0.651', '0.594', '0.774', '0.755', '0.752', '0.850', '1.402', '1.114', '0.904', '0.664', '0.981', '0.746'] +step:60/500 train_loss:5.3182 grad_norm:1.1502 train_time:188173ms step_avg:3136.22ms +step:70/500 train_loss:4.9805 grad_norm:0.4400 train_time:219545ms step_avg:3136.35ms +step:80/500 train_loss:4.6777 grad_norm:0.3607 train_time:250922ms step_avg:3136.52ms +step:90/500 train_loss:4.4124 grad_norm:0.7668 train_time:282302ms step_avg:3136.69ms +step:100/500 train_loss:4.2591 grad_norm:0.5623 train_time:313684ms step_avg:3136.84ms +step:100/500 val_loss:4.2203 val_bpb:2.4995 train_time:313716ms step_avg:3137.16ms h_norms=['2539958.8', '1891926.2', '1842518.8', '1259791.9', '834439.0', '2452532.2', '7493377.5', '4728913.5', '3378154.8', '1967445.1', '10342628.0', '6845392.5', '4585671.5', '3120444.5', '2068628.8', '1138807.9', '1212089.2', '1098464.0', '943749.6', '733953.3'] growth=['3.252', '0.745', '0.974', '0.684', '0.662', '2.939', '3.055', '0.631', '0.714', '0.582', '12.042', '0.662', '0.670', '0.680', '0.663', '1.142', '1.064', '0.906', '0.859', '0.778'] +step:110/500 train_loss:4.1071 grad_norm:0.5719 train_time:345069ms step_avg:3136.99ms +step:120/500 train_loss:3.9553 grad_norm:0.6834 train_time:376449ms step_avg:3137.07ms +step:130/500 train_loss:3.8550 grad_norm:2.0622 train_time:407872ms step_avg:3137.47ms +step:140/500 train_loss:3.7516 grad_norm:0.7069 train_time:439278ms step_avg:3137.70ms +step:150/500 train_loss:3.6360 grad_norm:0.2515 train_time:470666ms step_avg:3137.77ms +step:150/500 val_loss:3.6143 val_bpb:2.1406 train_time:470698ms step_avg:3137.98ms h_norms=['1691694.6', '1214504.1', '1032287.4', '726402.4', '507756.3', '1093338.5', '3688532.0', '2256534.5', '1410329.9', '885931.9', '2614546.0', '1920724.0', '1389947.5', '1044625.5', '756565.3', '642715.2', '699743.8', '648021.7', '565602.9', '444204.0'] growth=['3.923', '0.718', '0.850', '0.704', '0.699', '2.153', '3.374', '0.612', '0.625', '0.628', '5.054', '0.735', '0.724', '0.752', '0.724', '1.191', '1.089', '0.926', '0.873', '0.785'] +step:160/500 train_loss:3.5437 grad_norm:0.7928 train_time:502073ms step_avg:3137.96ms +step:170/500 train_loss:3.3979 grad_norm:0.3021 train_time:533493ms step_avg:3138.19ms diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_152723-snlqmhr2/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_152723-snlqmhr2/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_152723-snlqmhr2/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 
+nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_152723-snlqmhr2/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_152723-snlqmhr2/files/wandb-metadata.json new file mode 100644 index 0000000000..4431072ff6 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_152723-snlqmhr2/files/wandb-metadata.json @@ -0,0 +1,53 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T15:27:23.336267Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm", + "--lora-rank", + "8" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "40032288768" + } + }, + 
"memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "g1k79ctd5ttim8lftwafyg1cyggreaf8" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_154455-y5p28i5r/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_154455-y5p28i5r/files/config.yaml new file mode 100644 index 0000000000..75480ae6fa --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_154455-y5p28i5r/files/config.yaml @@ -0,0 +1,100 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + nvlmclw7rkuhf512qzhh4icb15og7n8i: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.1" + - --no-interpass-rmsnorm + - --lora-rank + - "2" + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "40033751040" + email: nesta.midavaine@prosus.com + executable: /home/nesta/parameter-golf/.venv/bin/python3 + git: + commit: e07e44321b5c5af051343a3a16d83f0766e85597 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + python: CPython 3.12.3 + root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback + startedAt: "2026-03-26T15:44:55.039349Z" + writerId: nvlmclw7rkuhf512qzhh4icb15og7n8i + m: [] + python_version: 3.12.3 + t: + "1": + - 1 + "2": + - 1 + "3": + - 2 + - 13 + - 16 + - 61 + "4": 3.12.3 + "5": 0.25.1 + "10": + - 20 + "12": 0.25.1 + "13": linux-x86_64 +core_end: + value: 8 +core_start: + value: 3 +feedback_mode: + value: diagonal +feedback_rank: + value: 2 +interpass_rmsnorm: + value: false +iterations: + value: 500 +jacobian_proxy_weight: + value: 0.1 +matrix_lr: + value: 0.025 +model_dim: + value: 512 +n_params: + value: 27234400 +num_layers: + value: 11 +num_passes: + value: 4 +residual_scale_init: + value: 0.5 +scalar_lr: + value: 0.025 +seed: + value: 1337 +train_batch_tokens: + value: 786432 +train_seq_len: + value: 2048 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_154455-y5p28i5r/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_154455-y5p28i5r/files/output.log new file mode 100644 index 0000000000..37c9fe540c --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_154455-y5p28i5r/files/output.log @@ -0,0 +1,100 @@ +wandb:initialized +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 
+warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/500 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['14828.0', '14150.7', '13588.8', '13118.6', '12709.1', '12463.7', '12206.9', '11992.7', '11805.3', '11630.9', '12068.4', '11893.6', '11752.5', '11630.9', '11516.5', '11809.8', '11709.4', '11634.5', '11573.3', '11514.1'] growth=['0.947', '0.954', '0.960', '0.965', '0.969', '0.981', '0.979', '0.982', '0.984', '0.985', '0.988', '0.986', '0.988', '0.990', '0.990', '0.995', '0.992', '0.994', '0.995', '0.995'] +step:1/500 train_loss:6.9310 grad_norm:0.3717 train_time:3145ms step_avg:3144.67ms +step:2/500 train_loss:8.3473 grad_norm:3.5331 train_time:6277ms step_avg:3138.33ms +step:3/500 train_loss:7.5115 grad_norm:1.8520 train_time:9436ms step_avg:3145.34ms +step:4/500 train_loss:7.5611 grad_norm:1.8993 train_time:12598ms step_avg:3149.58ms +step:5/500 train_loss:7.3182 grad_norm:1.9103 train_time:15757ms step_avg:3151.41ms +step:6/500 train_loss:7.0753 grad_norm:1.7013 train_time:18921ms step_avg:3153.55ms +step:7/500 train_loss:6.9528 grad_norm:2.0667 train_time:22082ms step_avg:3154.56ms +step:8/500 train_loss:6.9028 grad_norm:1.4281 train_time:25244ms step_avg:3155.47ms +step:9/500 train_loss:6.5408 grad_norm:1.0079 train_time:28404ms step_avg:3156.00ms +step:10/500 train_loss:6.1499 grad_norm:0.9864 train_time:31565ms step_avg:3156.49ms +step:20/500 train_loss:4.7832 grad_norm:1.0980 train_time:63165ms step_avg:3158.27ms +step:30/500 train_loss:4.1875 grad_norm:1.0659 train_time:94790ms step_avg:3159.67ms +step:40/500 train_loss:3.8630 grad_norm:0.8877 train_time:126560ms step_avg:3164.01ms +step:50/500 train_loss:3.6884 grad_norm:0.7170 train_time:158209ms step_avg:3164.18ms +step:50/500 val_loss:3.6586 val_bpb:2.1668 train_time:158241ms step_avg:3164.81ms h_norms=['12735.5', '11200.0', '10164.1', '9525.6', '9159.4', '9155.7', '9167.3', '9205.2', '9275.0', '9389.5', '9135.8', '9214.4', '9298.5', '9399.7', '9535.9', '9187.1', '9297.4', '9404.6', '9522.6', '9670.8'] growth=['0.845', '0.879', '0.908', '0.937', '0.962', '1.000', '1.001', '1.004', '1.008', '1.012', '1.011', '1.009', '1.009', '1.011', '1.014', '1.016', '1.012', '1.012', '1.013', '1.016'] +step:60/500 train_loss:3.5065 grad_norm:1.0078 train_time:189842ms step_avg:3164.03ms +step:70/500 train_loss:3.4063 grad_norm:0.7213 train_time:221493ms step_avg:3164.18ms +step:80/500 train_loss:3.3329 grad_norm:0.5494 train_time:253155ms step_avg:3164.43ms +step:90/500 train_loss:3.1786 grad_norm:0.4390 train_time:284819ms step_avg:3164.66ms +step:100/500 train_loss:3.1304 grad_norm:0.5572 train_time:316454ms step_avg:3164.54ms +step:100/500 val_loss:3.0898 val_bpb:1.8300 train_time:316486ms step_avg:3164.86ms h_norms=['13387.7', '11921.1', '11209.5', '11093.3', '11390.5', '11631.9', '12018.0', '12495.4', '13068.8', '13826.8', '12333.3', '12773.9', '13286.8', '13868.9', '14643.5', '13114.6', '13524.0', '14018.2', '14576.3', '15343.0'] growth=['0.849', '0.890', '0.940', '0.990', '1.027', '1.021', '1.033', '1.040', '1.046', '1.058', '1.018', '1.036', '1.040', '1.044', '1.056', '1.006', '1.031', '1.037', '1.040', '1.053'] +step:110/500 train_loss:3.0264 grad_norm:0.3633 train_time:348121ms step_avg:3164.73ms +step:120/500 train_loss:2.9410 grad_norm:0.3409 train_time:379800ms step_avg:3165.00ms +step:130/500 train_loss:2.8723 grad_norm:0.3861 train_time:411469ms step_avg:3165.15ms +step:140/500 train_loss:2.8200 grad_norm:0.2985 
train_time:443103ms step_avg:3165.02ms +step:150/500 train_loss:2.7805 grad_norm:0.3407 train_time:474745ms step_avg:3164.97ms +step:150/500 val_loss:2.7663 val_bpb:1.6384 train_time:474777ms step_avg:3165.18ms h_norms=['14917.2', '13747.5', '13056.9', '12886.7', '13272.0', '13836.5', '13983.7', '14170.1', '14538.1', '15295.8', '14298.5', '14552.7', '14814.1', '15249.9', '16101.4', '14844.7', '15109.6', '15379.1', '15827.9', '16720.2'] growth=['0.910', '0.922', '0.950', '0.987', '1.030', '1.043', '1.011', '1.013', '1.026', '1.052', '1.027', '1.018', '1.018', '1.029', '1.056', '1.002', '1.018', '1.018', '1.029', '1.056'] +step:160/500 train_loss:2.7721 grad_norm:0.4282 train_time:506415ms step_avg:3165.09ms +step:170/500 train_loss:2.7118 grad_norm:0.3055 train_time:538058ms step_avg:3165.05ms +step:180/500 train_loss:2.6200 grad_norm:0.2472 train_time:569702ms step_avg:3165.01ms +step:190/500 train_loss:2.6444 grad_norm:0.3218 train_time:601362ms step_avg:3165.06ms +step:200/500 train_loss:2.5645 grad_norm:0.2424 train_time:633157ms step_avg:3165.79ms +step:200/500 val_loss:2.6022 val_bpb:1.5411 train_time:633189ms step_avg:3165.94ms h_norms=['17029.4', '15746.8', '14904.1', '14604.7', '14967.2', '16073.3', '15851.9', '15730.9', '15825.5', '16423.3', '16300.0', '16203.6', '16170.9', '16336.4', '17025.2', '16594.0', '16523.8', '16512.3', '16692.3', '17399.1'] growth=['0.922', '0.925', '0.946', '0.980', '1.025', '1.074', '0.986', '0.992', '1.006', '1.038', '1.056', '0.994', '0.998', '1.010', '1.042', '1.026', '0.996', '0.999', '1.011', '1.042'] +step:210/500 train_loss:2.5650 grad_norm:0.3549 train_time:664794ms step_avg:3165.68ms +step:220/500 train_loss:2.6050 grad_norm:0.3495 train_time:696436ms step_avg:3165.62ms +step:230/500 train_loss:2.5417 grad_norm:0.3301 train_time:728081ms step_avg:3165.57ms +step:240/500 train_loss:2.5378 grad_norm:0.2441 train_time:759737ms step_avg:3165.57ms +step:250/500 train_loss:2.5756 grad_norm:0.3062 train_time:791358ms step_avg:3165.43ms +step:250/500 val_loss:2.5268 val_bpb:1.4965 train_time:791390ms step_avg:3165.56ms h_norms=['18846.3', '17410.6', '16438.8', '15959.5', '16305.4', '17764.0', '17388.4', '17140.6', '17049.7', '17635.7', '17928.4', '17683.9', '17472.3', '17397.4', '17960.3', '18072.6', '17862.5', '17642.2', '17546.1', '18026.9'] growth=['0.917', '0.924', '0.944', '0.971', '1.022', '1.089', '0.979', '0.986', '0.995', '1.034', '1.076', '0.986', '0.988', '0.996', '1.032', '1.045', '0.988', '0.988', '0.995', '1.027'] +step:260/500 train_loss:2.5344 grad_norm:0.2526 train_time:823044ms step_avg:3165.55ms +step:270/500 train_loss:2.4962 grad_norm:0.3197 train_time:854686ms step_avg:3165.50ms +step:280/500 train_loss:2.4329 grad_norm:0.2331 train_time:886309ms step_avg:3165.39ms +step:290/500 train_loss:2.4736 grad_norm:0.2550 train_time:917933ms step_avg:3165.29ms +step:300/500 train_loss:2.4380 grad_norm:0.2264 train_time:949572ms step_avg:3165.24ms +step:300/500 val_loss:2.4532 val_bpb:1.4529 train_time:949604ms step_avg:3165.35ms h_norms=['20941.1', '19182.7', '18052.0', '17457.7', '17865.7', '19511.0', '19019.9', '18704.8', '18540.8', '19344.5', '19674.8', '19330.4', '18996.2', '18791.5', '19454.1', '19743.2', '19426.2', '19055.5', '18799.3', '19275.4'] growth=['0.909', '0.916', '0.941', '0.967', '1.023', '1.092', '0.975', '0.983', '0.991', '1.043', '1.080', '0.982', '0.983', '0.989', '1.035', '1.050', '0.984', '0.981', '0.987', '1.025'] +step:310/500 train_loss:2.3557 grad_norm:0.2355 train_time:981222ms step_avg:3165.23ms +step:320/500 
train_loss:2.4268 grad_norm:0.2275 train_time:1012875ms step_avg:3165.24ms +step:330/500 train_loss:2.4731 grad_norm:0.2582 train_time:1044522ms step_avg:3165.22ms +step:340/500 train_loss:2.3813 grad_norm:0.2222 train_time:1076174ms step_avg:3165.22ms +step:350/500 train_loss:2.4827 grad_norm:0.2099 train_time:1107944ms step_avg:3165.55ms +step:350/500 val_loss:2.4038 val_bpb:1.4237 train_time:1107975ms step_avg:3165.64ms h_norms=['23083.6', '21048.2', '19811.9', '19011.2', '19310.5', '21330.8', '20798.1', '20502.5', '20108.1', '20815.6', '21418.1', '21061.6', '20705.5', '20224.1', '20676.4', '21384.5', '21039.3', '20616.6', '20063.3', '20274.8'] growth=['0.904', '0.912', '0.941', '0.960', '1.016', '1.105', '0.975', '0.986', '0.981', '1.035', '1.096', '0.983', '0.983', '0.977', '1.022', '1.067', '0.984', '0.980', '0.973', '1.011'] +step:360/500 train_loss:2.2488 grad_norm:0.2012 train_time:1139604ms step_avg:3165.57ms +step:370/500 train_loss:2.4498 grad_norm:0.1714 train_time:1171266ms step_avg:3165.58ms +step:380/500 train_loss:2.3948 grad_norm:0.2284 train_time:1202904ms step_avg:3165.54ms +step:390/500 train_loss:2.3586 grad_norm:0.1837 train_time:1234583ms step_avg:3165.60ms +step:400/500 train_loss:2.4046 grad_norm:0.2053 train_time:1266209ms step_avg:3165.52ms +step:400/500 val_loss:2.3705 val_bpb:1.4040 train_time:1266241ms step_avg:3165.60ms h_norms=['25593.2', '23180.9', '21902.2', '20833.8', '21042.7', '23067.7', '22480.5', '22319.1', '21743.0', '22667.8', '23199.6', '22798.3', '22512.9', '21781.8', '22345.9', '23150.2', '22743.8', '22358.7', '21535.2', '21760.3'] growth=['0.893', '0.906', '0.945', '0.951', '1.010', '1.096', '0.975', '0.993', '0.974', '1.043', '1.091', '0.983', '0.987', '0.968', '1.026', '1.067', '0.982', '0.983', '0.963', '1.010'] +step:410/500 train_loss:2.3694 grad_norm:0.2156 train_time:1297880ms step_avg:3165.56ms +step:420/500 train_loss:2.4022 grad_norm:0.1854 train_time:1329534ms step_avg:3165.56ms +step:430/500 train_loss:2.3222 grad_norm:0.2276 train_time:1361181ms step_avg:3165.54ms +step:440/500 train_loss:2.4095 grad_norm:0.1854 train_time:1392812ms step_avg:3165.48ms +step:450/500 train_loss:2.2365 grad_norm:0.2154 train_time:1424471ms step_avg:3165.49ms +step:450/500 val_loss:2.3386 val_bpb:1.3851 train_time:1424502ms step_avg:3165.56ms h_norms=['28710.0', '25807.1', '24367.3', '23021.1', '23326.2', '25712.0', '25093.1', '24936.9', '24053.5', '25235.9', '25868.9', '25395.7', '25094.3', '23995.7', '24591.2', '25723.8', '25209.5', '24761.3', '23522.8', '23651.5'] growth=['0.884', '0.899', '0.944', '0.945', '1.013', '1.102', '0.976', '0.994', '0.965', '1.049', '1.100', '0.982', '0.988', '0.956', '1.025', '1.077', '0.980', '0.982', '0.950', '1.005'] +step:460/500 train_loss:2.3715 grad_norm:0.2246 train_time:1456119ms step_avg:3165.48ms +step:470/500 train_loss:2.3155 grad_norm:0.2162 train_time:1487752ms step_avg:3165.43ms +step:480/500 train_loss:2.2437 grad_norm:0.1745 train_time:1519404ms step_avg:3165.42ms +step:490/500 train_loss:2.2728 grad_norm:0.2349 train_time:1551048ms step_avg:3165.41ms +step:500/500 train_loss:2.2855 grad_norm:0.1518 train_time:1582691ms step_avg:3165.38ms +step:500/500 val_loss:2.3106 val_bpb:1.3684 train_time:1582722ms step_avg:3165.44ms h_norms=['31454.2', '27922.0', '26377.7', '24565.6', '24573.1', '27173.7', '26402.9', '26559.0', '25384.0', '26739.2', '27427.3', '26814.3', '26736.1', '25274.3', '25895.2', '27331.6', '26687.2', '26396.8', '24736.9', '24660.2'] growth=['0.877', '0.888', '0.945', '0.931', '1.000', 
'1.106', '0.972', '1.006', '0.956', '1.053', '1.108', '0.978', '0.997', '0.945', '1.025', '1.088', '0.976', '0.989', '0.937', '0.997'] +peak memory allocated: 66505 MiB reserved: 67408 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:2.6566 val_bpb:1.5734 eval_time:85304ms +Serialized model: 107256259 bytes +Code size: 102003 bytes +Serialized model int6+lzma: 9596524 bytes +Total submission size int6+lzma: 9698527 bytes +final_int6_roundtrip val_loss:2.7108 val_bpb:1.6055 eval_time:84596ms +final_int6_roundtrip_exact val_loss:2.71075277 val_bpb:1.60546048 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_154455-y5p28i5r/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_154455-y5p28i5r/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_154455-y5p28i5r/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_154455-y5p28i5r/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_154455-y5p28i5r/files/wandb-metadata.json new file mode 100644 index 0000000000..107ff97656 --- /dev/null +++ 
b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_154455-y5p28i5r/files/wandb-metadata.json @@ -0,0 +1,53 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T15:44:55.039349Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm", + "--lora-rank", + "2" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "40033751040" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "nvlmclw7rkuhf512qzhh4icb15og7n8i" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_154455-y5p28i5r/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_154455-y5p28i5r/files/wandb-summary.json new file mode 100644 index 0000000000..2c1cdb413b --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_154455-y5p28i5r/files/wandb-summary.json @@ -0,0 +1 @@ +{"_timestamp":1.7745424819536245e+09,"val_bpb":1.3684446567690138,"train_loss":2.2855165004730225,"_wandb":{"runtime":2788},"_runtime":2788.429911049,"step_avg_ms":3165.3810700719764,"grad_norm":0.15179167687892914,"val_loss":2.3105614811374906,"_step":500,"lr_scale":1} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_163311-3jt79ap8/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_163311-3jt79ap8/files/output.log new file mode 100644 index 0000000000..c167c75c19 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_163311-3jt79ap8/files/output.log @@ -0,0 +1,93 @@ +wandb:initialized +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/500 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['14775.5', '14115.8', '13566.0', '13105.3', '12700.6', '12471.2', '12236.8', '12036.6', '11858.5', '11689.9', '12098.3', '11947.3', '11822.2', '11710.3', '11602.1', '11865.1', '11790.1', '11731.6', '11680.5', '11627.6'] 
growth=['0.948', '0.955', '0.961', '0.966', '0.969', '0.982', '0.981', '0.984', '0.985', '0.986', '0.990', '0.988', '0.990', '0.991', '0.991', '0.997', '0.994', '0.995', '0.996', '0.995'] +step:1/500 train_loss:6.9310 grad_norm:0.3717 train_time:3140ms step_avg:3139.78ms +step:2/500 train_loss:8.3473 grad_norm:3.5331 train_time:6273ms step_avg:3136.63ms +step:3/500 train_loss:7.5115 grad_norm:1.8520 train_time:9435ms step_avg:3145.04ms +step:4/500 train_loss:7.5611 grad_norm:1.8997 train_time:12598ms step_avg:3149.51ms +step:5/500 train_loss:7.3178 grad_norm:1.9097 train_time:15759ms step_avg:3151.86ms +step:6/500 train_loss:7.0757 grad_norm:1.7063 train_time:18919ms step_avg:3153.23ms +step:7/500 train_loss:6.9543 grad_norm:2.0712 train_time:22079ms step_avg:3154.18ms +step:8/500 train_loss:6.9054 grad_norm:1.4262 train_time:25237ms step_avg:3154.66ms +step:9/500 train_loss:6.5427 grad_norm:1.0088 train_time:28396ms step_avg:3155.09ms +step:10/500 train_loss:6.1508 grad_norm:0.9852 train_time:31553ms step_avg:3155.25ms +step:20/500 train_loss:4.7862 grad_norm:1.1515 train_time:63123ms step_avg:3156.17ms +step:30/500 train_loss:4.1778 grad_norm:0.8901 train_time:94709ms step_avg:3156.95ms +step:40/500 train_loss:3.8576 grad_norm:1.1652 train_time:126432ms step_avg:3160.79ms +step:50/500 train_loss:3.6818 grad_norm:0.6740 train_time:158029ms step_avg:3160.57ms +step:50/500 val_loss:3.6685 val_bpb:2.1727 train_time:158061ms step_avg:3161.21ms h_norms=['12863.0', '11304.5', '10260.4', '9600.6', '9205.4', '9177.2', '9156.3', '9186.9', '9244.0', '9340.1', '9128.5', '9175.6', '9256.6', '9347.6', '9467.6', '9152.0', '9233.4', '9342.2', '9452.6', '9586.7'] growth=['0.847', '0.879', '0.908', '0.936', '0.959', '0.997', '0.998', '1.003', '1.006', '1.010', '1.007', '1.005', '1.009', '1.010', '1.013', '1.012', '1.009', '1.012', '1.012', '1.014'] +step:60/500 train_loss:3.5054 grad_norm:0.6287 train_time:189648ms step_avg:3160.80ms +step:70/500 train_loss:3.3867 grad_norm:0.3547 train_time:221269ms step_avg:3160.99ms +step:80/500 train_loss:3.3521 grad_norm:0.6155 train_time:252895ms step_avg:3161.19ms +step:90/500 train_loss:3.1908 grad_norm:0.5564 train_time:284515ms step_avg:3161.28ms +step:100/500 train_loss:3.1363 grad_norm:0.5304 train_time:316098ms step_avg:3160.98ms +step:100/500 val_loss:3.1033 val_bpb:1.8380 train_time:316130ms step_avg:3161.30ms h_norms=['13308.9', '11916.3', '11272.1', '11158.4', '11454.6', '11658.4', '12034.7', '12510.9', '13065.2', '13811.4', '12347.7', '12779.1', '13288.3', '13857.8', '14627.5', '13120.1', '13520.6', '14007.0', '14556.3', '15322.5'] growth=['0.851', '0.895', '0.946', '0.990', '1.027', '1.018', '1.032', '1.040', '1.044', '1.057', '1.014', '1.035', '1.040', '1.043', '1.056', '1.002', '1.031', '1.036', '1.039', '1.053'] +step:110/500 train_loss:3.0377 grad_norm:0.4577 train_time:347728ms step_avg:3161.16ms +step:120/500 train_loss:2.9310 grad_norm:0.2932 train_time:379357ms step_avg:3161.31ms +step:130/500 train_loss:2.8607 grad_norm:0.3171 train_time:411020ms step_avg:3161.69ms +step:140/500 train_loss:2.8210 grad_norm:0.3873 train_time:442615ms step_avg:3161.53ms +step:150/500 train_loss:2.7654 grad_norm:0.2930 train_time:474228ms step_avg:3161.52ms +step:150/500 val_loss:2.7671 val_bpb:1.6388 train_time:474260ms step_avg:3161.73ms h_norms=['14988.2', '13795.7', '13108.2', '12902.9', '13200.7', '13774.5', '13856.8', '14021.4', '14325.3', '14996.9', '14146.5', '14360.2', '14618.0', '15004.1', '15799.7', '14631.0', '14869.2', '15146.1', '15557.6', '16412.6'] 
growth=['0.912', '0.920', '0.950', '0.984', '1.023', '1.043', '1.006', '1.012', '1.022', '1.047', '1.026', '1.015', '1.018', '1.026', '1.053', '0.999', '1.016', '1.019', '1.027', '1.055'] +step:160/500 train_loss:2.7691 grad_norm:0.3606 train_time:505863ms step_avg:3161.64ms +step:170/500 train_loss:2.7187 grad_norm:0.4251 train_time:537482ms step_avg:3161.66ms +step:180/500 train_loss:2.6329 grad_norm:0.3579 train_time:569087ms step_avg:3161.59ms +step:190/500 train_loss:2.6453 grad_norm:0.3163 train_time:600712ms step_avg:3161.64ms +step:200/500 train_loss:2.5662 grad_norm:0.2988 train_time:632432ms step_avg:3162.16ms +step:200/500 val_loss:2.6122 val_bpb:1.5471 train_time:632464ms step_avg:3162.32ms h_norms=['16776.0', '15547.2', '14750.1', '14438.0', '14724.7', '15842.7', '15653.2', '15553.5', '15587.1', '16103.0', '16018.9', '15971.6', '15973.8', '16090.8', '16737.7', '16299.2', '16286.8', '16309.8', '16448.2', '17131.7'] growth=['0.920', '0.927', '0.949', '0.979', '1.020', '1.076', '0.988', '0.994', '1.002', '1.033', '1.057', '0.997', '1.000', '1.007', '1.040', '1.024', '0.999', '1.001', '1.008', '1.042'] +step:210/500 train_loss:2.5532 grad_norm:0.2664 train_time:664045ms step_avg:3162.12ms +step:220/500 train_loss:2.6024 grad_norm:0.2987 train_time:695666ms step_avg:3162.12ms +step:230/500 train_loss:2.5356 grad_norm:0.2965 train_time:727281ms step_avg:3162.09ms +step:240/500 train_loss:2.5368 grad_norm:0.2684 train_time:758901ms step_avg:3162.09ms +step:250/500 train_loss:2.5806 grad_norm:0.3130 train_time:790509ms step_avg:3162.03ms +step:250/500 val_loss:2.5228 val_bpb:1.4941 train_time:790540ms step_avg:3162.16ms h_norms=['18876.3', '17455.2', '16574.6', '16150.7', '16451.4', '17948.8', '17544.9', '17340.5', '17260.6', '17849.1', '18128.4', '17863.6', '17701.0', '17631.6', '18226.5', '18294.9', '18057.8', '17883.5', '17791.2', '18311.4'] growth=['0.919', '0.925', '0.950', '0.974', '1.019', '1.091', '0.977', '0.988', '0.995', '1.034', '1.075', '0.985', '0.991', '0.996', '1.034', '1.041', '0.987', '0.990', '0.995', '1.029'] +step:260/500 train_loss:2.5372 grad_norm:0.2644 train_time:822155ms step_avg:3162.14ms +step:270/500 train_loss:2.4938 grad_norm:0.2812 train_time:853783ms step_avg:3162.16ms +step:280/500 train_loss:2.4306 grad_norm:0.2003 train_time:885408ms step_avg:3162.17ms +step:290/500 train_loss:2.4765 grad_norm:0.3092 train_time:917016ms step_avg:3162.12ms +step:300/500 train_loss:2.4326 grad_norm:0.1681 train_time:948617ms step_avg:3162.06ms +step:300/500 val_loss:2.4489 val_bpb:1.4504 train_time:948649ms step_avg:3162.16ms h_norms=['20956.8', '19310.5', '18350.0', '17752.1', '17947.3', '19637.2', '19181.5', '18935.7', '18738.5', '19319.1', '19714.4', '19430.9', '19185.8', '18962.7', '19445.1', '19787.3', '19502.4', '19205.8', '18928.7', '19244.6'] growth=['0.908', '0.921', '0.950', '0.967', '1.011', '1.094', '0.977', '0.987', '0.990', '1.031', '1.080', '0.986', '0.987', '0.988', '1.025', '1.049', '0.986', '0.985', '0.986', '1.017'] +step:310/500 train_loss:2.3587 grad_norm:0.2276 train_time:980220ms step_avg:3162.00ms +step:320/500 train_loss:2.4347 grad_norm:0.2859 train_time:1011827ms step_avg:3161.96ms +step:330/500 train_loss:2.4707 grad_norm:0.2418 train_time:1043434ms step_avg:3161.92ms +step:340/500 train_loss:2.3754 grad_norm:0.1686 train_time:1075069ms step_avg:3161.97ms +step:350/500 train_loss:2.4888 grad_norm:0.2348 train_time:1106822ms step_avg:3162.35ms +step:350/500 val_loss:2.4056 val_bpb:1.4248 train_time:1106854ms step_avg:3162.44ms 
h_norms=['23133.0', '21145.9', '20229.4', '19385.2', '19483.8', '21466.3', '20951.7', '20846.7', '20425.8', '21015.3', '21540.5', '21179.0', '20979.7', '20461.2', '20830.5', '21487.0', '21095.2', '20800.2', '20191.7', '20302.1'] growth=['0.900', '0.914', '0.957', '0.958', '1.005', '1.102', '0.976', '0.995', '0.980', '1.029', '1.088', '0.983', '0.991', '0.975', '1.018', '1.059', '0.982', '0.986', '0.971', '1.005'] +step:360/500 train_loss:2.2591 grad_norm:0.2411 train_time:1138444ms step_avg:3162.34ms +step:370/500 train_loss:2.4545 grad_norm:0.1909 train_time:1170061ms step_avg:3162.33ms +step:380/500 train_loss:2.3976 grad_norm:0.2061 train_time:1201677ms step_avg:3162.31ms +step:390/500 train_loss:2.3586 grad_norm:0.1674 train_time:1233312ms step_avg:3162.34ms +step:400/500 train_loss:2.4052 grad_norm:0.1943 train_time:1264905ms step_avg:3162.26ms +step:400/500 val_loss:2.3698 val_bpb:1.4036 train_time:1264937ms step_avg:3162.34ms h_norms=['25393.6', '23052.1', '22258.0', '21100.4', '21137.0', '23185.3', '22585.4', '22799.7', '22117.3', '22854.6', '23335.3', '22912.0', '22955.3', '22121.1', '22550.5', '23327.2', '22852.6', '22741.1', '21795.4', '21892.2'] growth=['0.894', '0.908', '0.966', '0.948', '1.002', '1.097', '0.974', '1.009', '0.970', '1.033', '1.088', '0.982', '1.002', '0.964', '1.019', '1.063', '0.980', '0.995', '0.958', '1.004'] +step:410/500 train_loss:2.3671 grad_norm:0.1937 train_time:1296526ms step_avg:3162.26ms +step:420/500 train_loss:2.4047 grad_norm:0.1777 train_time:1328148ms step_avg:3162.26ms +step:430/500 train_loss:2.3188 grad_norm:0.1700 train_time:1359758ms step_avg:3162.23ms +step:440/500 train_loss:2.4078 grad_norm:0.1629 train_time:1391376ms step_avg:3162.22ms +step:450/500 train_loss:2.2324 grad_norm:0.1648 train_time:1422989ms step_avg:3162.20ms +step:450/500 val_loss:2.3380 val_bpb:1.3847 train_time:1423021ms step_avg:3162.27ms h_norms=['28599.1', '25678.3', '24727.3', '23273.6', '23246.6', '25483.9', '24735.6', '24912.8', '23956.1', '24918.5', '25533.7', '24997.5', '24992.8', '23857.5', '24365.7', '25454.0', '24832.7', '24642.9', '23373.0', '23408.4'] growth=['0.878', '0.898', '0.963', '0.941', '0.999', '1.096', '0.971', '1.007', '0.962', '1.040', '1.094', '0.979', '1.000', '0.955', '1.021', '1.072', '0.976', '0.992', '0.948', '1.002'] +step:460/500 train_loss:2.3718 grad_norm:0.1999 train_time:1454626ms step_avg:3162.23ms +step:470/500 train_loss:2.3160 grad_norm:0.1574 train_time:1486238ms step_avg:3162.21ms +step:480/500 train_loss:2.2435 grad_norm:0.1638 train_time:1517858ms step_avg:3162.20ms +step:490/500 train_loss:2.2685 grad_norm:0.1750 train_time:1549486ms step_avg:3162.22ms +step:500/500 train_loss:2.2886 grad_norm:0.1629 train_time:1581107ms step_avg:3162.21ms +step:500/500 val_loss:2.3129 val_bpb:1.3698 train_time:1581139ms step_avg:3162.28ms h_norms=['31084.9', '27716.0', '27061.8', '25055.7', '24710.5', '27086.1', '26368.3', '27059.3', '25774.6', '26889.8', '27399.1', '26809.2', '27111.8', '25485.0', '25899.8', '27258.6', '26544.8', '26595.7', '24796.2', '24675.7'] growth=['0.870', '0.892', '0.976', '0.926', '0.986', '1.096', '0.973', '1.026', '0.953', '1.043', '1.098', '0.978', '1.011', '0.940', '1.016', '1.079', '0.974', '1.002', '0.932', '0.995'] +peak memory allocated: 66518 MiB reserved: 67422 MiB +ema:applying EMA weights diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_163311-3jt79ap8/files/requirements.txt 
b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_163311-3jt79ap8/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_163311-3jt79ap8/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_163311-3jt79ap8/files/wandb-metadata.json new file mode 100644 index 0000000000..023e4c993b --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_163311-3jt79ap8/files/wandb-metadata.json @@ -0,0 +1,53 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T16:33:11.139684Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm", + "--lora-rank", + "8" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": {
"remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "40032813056" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "ekws3lenfyhrzx9alioiina1wzdm3lpu" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_171548-0nylbkj0/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_171548-0nylbkj0/files/output.log new file mode 100644 index 0000000000..57d9b7b561 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_171548-0nylbkj0/files/output.log @@ -0,0 +1,35 @@ +wandb:initialized +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/500 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['14832.4', '14167.9', '13609.8', '13150.6', '12748.8', '12545.2', '12303.4', '12097.1', '11923.4', '11759.5', '12183.2', '12024.0', '11892.6', '11786.0', '11682.4', '11960.2', '11876.8', '11812.8', '11767.3', '11718.7'] growth=['0.949', '0.955', '0.961', '0.966', '0.969', '0.984', '0.981', '0.983', '0.986', '0.986', '0.992', '0.987', '0.989', '0.991', '0.991', '1.000', '0.993', '0.995', '0.996', '0.996'] +step:1/500 train_loss:6.9310 grad_norm:0.3717 train_time:3140ms step_avg:3140.38ms +step:2/500 train_loss:8.3473 grad_norm:3.5331 train_time:6270ms step_avg:3134.76ms +step:3/500 train_loss:7.5115 grad_norm:1.8521 train_time:9429ms step_avg:3142.91ms +step:4/500 train_loss:7.5611 grad_norm:1.8995 train_time:12585ms step_avg:3146.24ms +step:5/500 train_loss:7.3178 grad_norm:1.9091 train_time:15741ms step_avg:3148.20ms +step:6/500 train_loss:7.0749 grad_norm:1.6999 train_time:18894ms step_avg:3149.03ms +step:7/500 train_loss:6.9522 grad_norm:2.0634 train_time:22046ms step_avg:3149.47ms +step:8/500 train_loss:6.9027 grad_norm:1.4293 train_time:25201ms step_avg:3150.09ms +step:9/500 train_loss:6.5406 grad_norm:1.0086 train_time:28361ms step_avg:3151.26ms +step:10/500 train_loss:6.1498 grad_norm:0.9873 train_time:31520ms step_avg:3152.04ms +step:20/500 train_loss:4.7838 grad_norm:1.1062 train_time:63114ms step_avg:3155.70ms +step:30/500 train_loss:4.1864 grad_norm:0.9834 train_time:94733ms step_avg:3157.75ms +step:40/500 train_loss:3.8592 grad_norm:1.1661 train_time:126430ms step_avg:3160.76ms diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_171548-0nylbkj0/files/requirements.txt 
b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_171548-0nylbkj0/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_171548-0nylbkj0/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_171548-0nylbkj0/files/wandb-metadata.json new file mode 100644 index 0000000000..190d9cb412 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_171548-0nylbkj0/files/wandb-metadata.json @@ -0,0 +1,53 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T17:15:48.572448Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm", + "--lora-rank", + "8" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": {
"remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "40035905536" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "4ak4touiwxcjhxhy2hjzyl7n1anoaquq" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_172142-fes18y77/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_172142-fes18y77/files/output.log new file mode 100644 index 0000000000..2a4260a40a --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_172142-fes18y77/files/output.log @@ -0,0 +1,42 @@ +wandb:initialized +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.01ms h_norms=['14111.1', '13526.4', '13079.9', '12686.8', '12345.1', '12169.5', '12016.5', '11895.0', '11762.2', '11626.3', '11910.0', '11834.3', '11775.0', '11694.1', '11606.3', '11768.3', '11743.2', '11726.3', '11686.2', '11633.5'] growth=['0.946', '0.959', '0.967', '0.970', '0.973', '0.986', '0.987', '0.990', '0.989', '0.988', '0.993', '0.994', '0.995', '0.993', '0.992', '0.999', '0.998', '0.999', '0.997', '0.995'] +step:1/500 train_loss:6.9303 grad_norm:0.3807 train_time:3132ms step_avg:3132.23ms +step:2/500 train_loss:8.2518 grad_norm:3.3887 train_time:6252ms step_avg:3125.87ms +step:3/500 train_loss:7.4857 grad_norm:1.6765 train_time:9403ms step_avg:3134.45ms +step:4/500 train_loss:7.7545 grad_norm:2.0593 train_time:12554ms step_avg:3138.50ms +step:5/500 train_loss:7.4669 grad_norm:2.0969 train_time:15705ms step_avg:3141.04ms +step:6/500 train_loss:7.1116 grad_norm:1.8286 train_time:18855ms step_avg:3142.45ms +step:7/500 train_loss:6.8770 grad_norm:2.4120 train_time:22007ms step_avg:3143.88ms +step:8/500 train_loss:6.7917 grad_norm:1.6092 train_time:25159ms step_avg:3144.88ms +step:9/500 train_loss:6.5438 grad_norm:1.2773 train_time:28308ms step_avg:3145.33ms +step:10/500 train_loss:6.1776 grad_norm:1.1355 train_time:31457ms step_avg:3145.72ms +step:20/500 train_loss:4.8363 grad_norm:1.5152 train_time:62951ms step_avg:3147.57ms +step:30/500 train_loss:4.1657 grad_norm:0.7915 train_time:94460ms step_avg:3148.68ms +step:40/500 train_loss:3.8635 grad_norm:0.8288 train_time:126098ms step_avg:3152.46ms +step:50/500 train_loss:3.6893 grad_norm:1.0199 train_time:157622ms step_avg:3152.43ms +step:50/500 val_loss:3.6358 val_bpb:2.1533 train_time:157653ms step_avg:3153.06ms h_norms=['12813.5', '11251.6', '10224.6', '9522.0', 
'9100.5', '9110.7', '9077.5', '9115.4', '9128.8', '9187.3', '9069.4', '9101.9', '9187.7', '9232.6', '9313.7', '9094.9', '9161.1', '9273.7', '9332.8', '9426.5'] growth=['0.852', '0.878', '0.909', '0.931', '0.956', '1.001', '0.996', '1.004', '1.001', '1.006', '1.013', '1.004', '1.009', '1.005', '1.009', '1.020', '1.007', '1.012', '1.006', '1.010'] +step:60/500 train_loss:3.5102 grad_norm:0.6809 train_time:189149ms step_avg:3152.49ms +step:70/500 train_loss:3.3973 grad_norm:0.6864 train_time:220677ms step_avg:3152.52ms +step:80/500 train_loss:3.3157 grad_norm:0.5593 train_time:252203ms step_avg:3152.54ms +step:90/500 train_loss:3.1562 grad_norm:0.4835 train_time:283735ms step_avg:3152.61ms +step:100/500 train_loss:3.1161 grad_norm:0.5429 train_time:315255ms step_avg:3152.55ms
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_172142-fes18y77/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_172142-fes18y77/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_172142-fes18y77/files/wandb-metadata.json
b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_172142-fes18y77/files/wandb-metadata.json new file mode 100644 index 0000000000..9435a17610 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_172142-fes18y77/files/wandb-metadata.json @@ -0,0 +1,53 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T17:21:42.601391Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm", + "--lora-rank", + "8" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "40036614144" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "zsqyowgo1bd68e5g3pxandd4ri78ced7" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_173159-f48k3ztp/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_173159-f48k3ztp/files/output.log new file mode 100644 index 0000000000..95bd7f1c09 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_173159-f48k3ztp/files/output.log @@ -0,0 +1,42 @@ +wandb:initialized +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.02ms h_norms=['14329.1', '13723.6', '13269.1', '12851.2', '12501.0', '12349.8', '12190.5', '12066.6', '11911.7', '11770.1', '12059.5', '11984.8', '11930.4', '11833.9', '11738.3', '11921.0', '11898.5', '11878.8', '11820.1', '11770.2'] growth=['0.946', '0.958', '0.967', '0.969', '0.973', '0.988', '0.987', '0.990', '0.987', '0.988', '0.994', '0.994', '0.995', '0.992', '0.992', '1.001', '0.998', '0.998', '0.995', '0.996'] +step:1/500 train_loss:6.9303 grad_norm:0.3807 train_time:3140ms step_avg:3139.82ms +step:2/500 train_loss:8.2452 grad_norm:3.3679 train_time:6268ms step_avg:3133.81ms +step:3/500 train_loss:7.4836 grad_norm:1.6553 train_time:9427ms step_avg:3142.31ms +step:4/500 train_loss:7.7829 grad_norm:2.0735 train_time:12585ms step_avg:3146.17ms +step:5/500 train_loss:7.5027 
grad_norm:2.1035 train_time:15742ms step_avg:3148.49ms +step:6/500 train_loss:7.1378 grad_norm:1.8533 train_time:18903ms step_avg:3150.44ms +step:7/500 train_loss:6.8951 grad_norm:2.4164 train_time:22063ms step_avg:3151.85ms +step:8/500 train_loss:6.7800 grad_norm:1.6064 train_time:25224ms step_avg:3153.04ms +step:9/500 train_loss:6.5383 grad_norm:1.2870 train_time:28387ms step_avg:3154.06ms +step:10/500 train_loss:6.1716 grad_norm:1.1190 train_time:31546ms step_avg:3154.63ms +step:20/500 train_loss:4.8024 grad_norm:1.3364 train_time:63138ms step_avg:3156.91ms +step:30/500 train_loss:4.1822 grad_norm:2.0484 train_time:94744ms step_avg:3158.14ms +step:40/500 train_loss:3.8452 grad_norm:0.8175 train_time:126488ms step_avg:3162.19ms +step:50/500 train_loss:3.6888 grad_norm:0.9309 train_time:158100ms step_avg:3162.00ms +step:50/500 val_loss:3.6447 val_bpb:2.1586 train_time:158132ms step_avg:3162.64ms h_norms=['12971.7', '11306.1', '10212.0', '9466.3', '9016.7', '8957.7', '8905.6', '8930.6', '8942.4', '9004.0', '8900.7', '8919.4', '8995.0', '9040.1', '9125.4', '8916.2', '8974.5', '9075.4', '9139.3', '9237.6'] growth=['0.842', '0.872', '0.903', '0.927', '0.953', '0.993', '0.994', '1.003', '1.001', '1.007', '1.007', '1.002', '1.008', '1.005', '1.009', '1.015', '1.007', '1.011', '1.007', '1.011'] +step:60/500 train_loss:3.5046 grad_norm:0.9289 train_time:189719ms step_avg:3161.98ms +step:70/500 train_loss:3.3859 grad_norm:0.5209 train_time:221302ms step_avg:3161.45ms +step:80/500 train_loss:3.3231 grad_norm:0.4777 train_time:252892ms step_avg:3161.15ms +step:90/500 train_loss:3.1589 grad_norm:0.4236 train_time:284505ms step_avg:3161.17ms +step:100/500 train_loss:3.1087 grad_norm:0.3903 train_time:316117ms step_avg:3161.17ms
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_173159-f48k3ztp/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_173159-f48k3ztp/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_173159-f48k3ztp/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_173159-f48k3ztp/files/wandb-metadata.json new file mode 100644 index 0000000000..3ef4641876 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_173159-f48k3ztp/files/wandb-metadata.json @@ -0,0 +1,53 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T17:31:59.519889Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm", + "--lora-rank", + "8" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "40037466112" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "25g3y2yi8o8e70ilv3o8yfytq2c1btaa" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174230-5r5mblr1/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174230-5r5mblr1/files/output.log new file mode 100644 index 0000000000..6b4a192299 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174230-5r5mblr1/files/output.log @@ -0,0 +1,32 @@ +wandb:initialized +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20
+warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.02ms h_norms=['71486.6', '109852.4', '112898.0', '103194.9', '101669.1', '103905.5', '128262.2', '125516.1', '152206.8', '148436.4', '117193.6', '119566.9', '200935.0', '175293.6', '156000.0', '116143.9', '120580.6', '121930.9', '128013.4', '136703.7'] growth=['1.595', '1.537', '1.028', '0.914', '0.985', '1.022', '1.234', '0.979', '1.213', '0.975', '1.121', '1.020', '1.681', '0.872', '0.890', '1.062', '1.038', '1.011', '1.050', '1.068'] +step:1/500 train_loss:6.9303 grad_norm:0.3807 train_time:3141ms step_avg:3140.54ms +step:2/500 train_loss:8.1781 grad_norm:3.1870 train_time:6272ms step_avg:3136.02ms +step:3/500 train_loss:10.9302 grad_norm:11.9253 train_time:9425ms step_avg:3141.53ms +step:4/500 train_loss:7.6575 grad_norm:2.1410 train_time:12578ms step_avg:3144.58ms +step:5/500 train_loss:7.0804 grad_norm:1.6717 train_time:15735ms step_avg:3146.99ms +step:6/500 train_loss:6.9353 grad_norm:1.4697 train_time:18890ms step_avg:3148.35ms +step:7/500 train_loss:6.8485 grad_norm:2.6919 train_time:22046ms step_avg:3149.40ms +step:8/500 train_loss:6.7499 grad_norm:1.6549 train_time:25202ms step_avg:3150.24ms +step:9/500 train_loss:6.8418 grad_norm:1.7199 train_time:28357ms step_avg:3150.83ms +step:10/500 train_loss:6.5955 grad_norm:3.6129 train_time:31518ms step_avg:3151.83ms
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174230-5r5mblr1/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174230-5r5mblr1/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174230-5r5mblr1/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174230-5r5mblr1/files/wandb-metadata.json new file mode 100644 index 0000000000..b1bd36fe40 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174230-5r5mblr1/files/wandb-metadata.json @@ -0,0 +1,53 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T17:42:30.058350Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm", + "--lora-rank", + "8" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "40038367232" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "if1uipy6cslxcem4m8wtg1r9dyeo3v40" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174616-zd1j5kg5/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174616-zd1j5kg5/files/output.log new file mode 100644 index 0000000000..359bec767f --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174616-zd1j5kg5/files/output.log @@ -0,0 +1,23 @@ +wandb:initialized +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.01ms h_norms=['49006.3', '70788.1', '67619.4', '66895.7', '80691.9', '85686.5', '91540.4', '100243.9', '98369.8', '96924.4', '115874.1', '105822.4', '350994.9', '231898.9', '189787.3',
'83816.5', '104820.2', '91576.3', '75437.8', '74316.6'] growth=['1.006', '1.444', '0.955', '0.989', '1.206', '1.062', '1.068', '1.095', '0.981', '0.985', '1.418', '0.913', '3.317', '0.661', '0.818', '0.916', '1.251', '0.874', '0.824', '0.985'] +step:1/500 train_loss:6.9303 grad_norm:0.3807 train_time:3140ms step_avg:3139.77ms
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174616-zd1j5kg5/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174616-zd1j5kg5/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174616-zd1j5kg5/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174616-zd1j5kg5/files/wandb-metadata.json new file mode 100644 index 0000000000..6603f5733c --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174616-zd1j5kg5/files/wandb-metadata.json @@ -0,0 +1,53 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T17:46:16.648152Z", + "args": [ + "--feedback-mode",
"diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm", + "--lora-rank", + "8" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "40038879232" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "67jfqphebh4th2c6nenvohnmt0rb5hqw" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174902-ponrb7vw/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174902-ponrb7vw/files/output.log new file mode 100644 index 0000000000..9d6fc84aca --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174902-ponrb7vw/files/output.log @@ -0,0 +1,42 @@ +wandb:initialized +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.02ms h_norms=['101044.9', '100874.6', '129352.4', '167648.2', '181140.5', '180204.0', '175799.0', '197658.3', '355913.8', '317039.1', '212913.5', '192786.2', '329348.0', '289972.6', '272664.0', '228358.1', '225461.7', '218889.9', '214625.2', '213371.5'] growth=['2.311', '0.998', '1.282', '1.296', '1.080', '0.995', '0.976', '1.124', '1.801', '0.891', '1.072', '0.905', '1.708', '0.880', '0.940', '1.089', '0.987', '0.971', '0.981', '0.994'] +step:1/500 train_loss:6.9303 grad_norm:0.3807 train_time:3142ms step_avg:3142.50ms +step:2/500 train_loss:8.1781 grad_norm:3.1870 train_time:6271ms step_avg:3135.66ms +step:3/500 train_loss:10.9302 grad_norm:11.9253 train_time:9428ms step_avg:3142.66ms +step:4/500 train_loss:7.6576 grad_norm:2.1410 train_time:12584ms step_avg:3146.09ms +step:5/500 train_loss:7.0801 grad_norm:1.6701 train_time:15745ms step_avg:3148.90ms +step:6/500 train_loss:6.9355 grad_norm:1.4723 train_time:18906ms step_avg:3151.05ms +step:7/500 train_loss:6.8501 grad_norm:2.6989 train_time:22064ms step_avg:3152.00ms +step:8/500 train_loss:6.7493 grad_norm:1.6422 train_time:25222ms step_avg:3152.79ms +step:9/500 train_loss:6.8287 grad_norm:1.6899 train_time:28380ms step_avg:3153.31ms +step:10/500 train_loss:6.6041 grad_norm:3.7299 train_time:31539ms step_avg:3153.89ms 
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_175915-2o340uez/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_175915-2o340uez/files/output.log
new file mode 100644
index 0000000000..3d99758f31
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_175915-2o340uez/files/output.log
@@ -0,0 +1,36 @@
+wandb:initialized
+warmup_step:1/20
+warmup_step:2/20
+warmup_step:3/20
+warmup_step:4/20
+warmup_step:5/20
+warmup_step:6/20
+warmup_step:7/20
+warmup_step:8/20
+warmup_step:9/20
+warmup_step:10/20
+warmup_step:11/20
+warmup_step:12/20
+warmup_step:13/20
+warmup_step:14/20
+warmup_step:15/20
+warmup_step:16/20
+warmup_step:17/20
+warmup_step:18/20
+warmup_step:19/20
+warmup_step:20/20
+step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.01ms
+step:1/500 train_loss:6.9303 grad_norm:0.3807 train_time:3138ms step_avg:3138.10ms
+step:2/500 train_loss:6.6659 grad_norm:0.8083 train_time:6266ms step_avg:3133.06ms
+step:3/500 train_loss:6.0874 grad_norm:0.8065 train_time:9426ms step_avg:3141.98ms
+step:4/500 train_loss:5.8500 grad_norm:0.4762 train_time:12583ms step_avg:3145.81ms
+step:5/500 train_loss:5.8868 grad_norm:1.3528 train_time:15741ms step_avg:3148.16ms
+step:6/500 train_loss:5.8887 grad_norm:1.1622 train_time:18899ms step_avg:3149.82ms
+step:7/500 train_loss:5.8843 grad_norm:1.2053 train_time:22057ms step_avg:3150.96ms
+step:8/500 train_loss:5.8714 grad_norm:1.3344 train_time:25217ms step_avg:3152.11ms
+step:9/500 train_loss:5.8215 grad_norm:1.0506 train_time:28378ms step_avg:3153.13ms
+step:10/500 train_loss:5.6859 grad_norm:0.9420 train_time:31538ms step_avg:3153.85ms
+step:20/500 train_loss:5.5241 grad_norm:1.6334 train_time:63125ms step_avg:3156.24ms
+step:30/500 train_loss:4.5729 grad_norm:1.5870 train_time:94746ms step_avg:3158.20ms
+step:40/500 train_loss:4.2207 grad_norm:2.9475 train_time:126497ms step_avg:3162.43ms
+step:50/500 train_loss:3.8353 grad_norm:0.7305 train_time:158105ms step_avg:3162.11ms
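All of these runs pass `--feedback-mode diagonal --feedback-rank 2`, i.e. the inter-pass correction is a cheap diagonal gate plus a rank-2 term rather than a dense matrix. A sketch of one way such a module could look; the class name and exact composition are assumptions, only the flags above are from the logs:

```python
import torch
import torch.nn as nn

class DiagonalLowRankFeedback(nn.Module):
    """Feedback applied between recurrence passes: h + d*h + (h @ V) @ U^T.

    A diagonal gate costs O(dim) parameters and a rank-r pair costs
    O(2*dim*r), so rank 2 at dim 512 adds only ~2K weights per core.
    """
    def __init__(self, dim: int = 512, rank: int = 2):
        super().__init__()
        self.d = nn.Parameter(torch.zeros(dim))        # diagonal gate, identity at init
        self.v = nn.Parameter(torch.randn(dim, rank) / dim ** 0.5)
        self.u = nn.Parameter(torch.zeros(dim, rank))  # zero-init: module is a no-op at step 0

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.d * h + (h @ self.v) @ self.u.T

fb = DiagonalLowRankFeedback()
print(fb(torch.randn(4, 2048, 512)).shape)  # torch.Size([4, 2048, 512])
```

Zero-initializing the `u` factor and the diagonal keeps the feedback path inert at the start of training, so it can only help once the optimizer finds a useful correction.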
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_180447-w7yechln/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_180447-w7yechln/files/output.log
new file mode 100644
index 0000000000..995bbc6d62
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_180447-w7yechln/files/output.log
@@ -0,0 +1,48 @@
+wandb:initialized
+warmup_step:1/20
+warmup_step:2/20
+warmup_step:3/20
+warmup_step:4/20
+warmup_step:5/20
+warmup_step:6/20
+warmup_step:7/20
+warmup_step:8/20
+warmup_step:9/20
+warmup_step:10/20
+warmup_step:11/20
+warmup_step:12/20
+warmup_step:13/20
+warmup_step:14/20
+warmup_step:15/20
+warmup_step:16/20
+warmup_step:17/20
+warmup_step:18/20
+warmup_step:19/20
+warmup_step:20/20
+step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.02ms
+step:1/500 train_loss:6.9303 grad_norm:0.3807 train_time:3131ms step_avg:3130.65ms
+step:2/500 train_loss:6.0630 grad_norm:0.8852 train_time:6251ms step_avg:3125.47ms
+step:3/500 train_loss:6.2767 grad_norm:1.1566 train_time:9401ms step_avg:3133.57ms
+step:4/500 train_loss:6.8246 grad_norm:2.3769 train_time:12551ms step_avg:3137.76ms
+step:5/500 train_loss:7.1034 grad_norm:2.4893 train_time:15703ms step_avg:3140.51ms
+step:6/500 train_loss:6.7987 grad_norm:1.9131 train_time:18852ms step_avg:3142.07ms
+step:7/500 train_loss:6.8808 grad_norm:2.6979 train_time:22000ms step_avg:3142.92ms
+step:8/500 train_loss:6.8119 grad_norm:2.5149 train_time:25150ms step_avg:3143.70ms
+step:9/500 train_loss:6.6692 grad_norm:1.6300 train_time:28301ms step_avg:3144.53ms
+step:10/500 train_loss:6.4778 grad_norm:2.4469 train_time:31451ms step_avg:3145.07ms
+step:20/500 train_loss:5.0602 grad_norm:1.3291 train_time:62943ms step_avg:3147.14ms
+step:30/500 train_loss:4.8216 grad_norm:2.0738 train_time:94481ms step_avg:3149.37ms
+step:40/500 train_loss:4.1574 grad_norm:1.0588 train_time:126107ms step_avg:3152.67ms
+step:50/500 train_loss:3.8998 grad_norm:0.9799 train_time:157602ms step_avg:3152.05ms
+step:50/500 val_loss:3.8491 val_bpb:2.2797 train_time:157634ms step_avg:3152.68ms h_norms=['62087.4', '62948.8', '63328.1', '64415.2', '65555.9', '88174.0', '99297.2', '117964.5', '117972.9', '117057.8', '94021.8', '103414.3', '113567.6', '119372.5', '124603.3', '83857.0', '106503.6', '121953.5', '119428.6', '122676.6'] growth=['0.982', '1.014', '1.006', '1.017', '1.018', '1.345', '1.126', '1.188', '1.000', '0.992', '1.392', '1.100', '1.098', '1.051', '1.044', '1.165', '1.270', '1.145', '0.979', '1.027']
+step:60/500 train_loss:3.6849 grad_norm:0.5692 train_time:189133ms step_avg:3152.22ms
+step:70/500 train_loss:3.5871 grad_norm:0.6241 train_time:220645ms step_avg:3152.08ms
+step:80/500 train_loss:3.5228 grad_norm:0.4259 train_time:252153ms step_avg:3151.92ms
+step:90/500 train_loss:3.3905 grad_norm:0.5294 train_time:283679ms step_avg:3151.99ms
+step:100/500 train_loss:3.3403 grad_norm:0.4869 train_time:315174ms step_avg:3151.74ms
+step:100/500 val_loss:3.3095 val_bpb:1.9601 train_time:315205ms step_avg:3152.05ms h_norms=['57455.8', '58322.8', '59194.5', '60695.2', '61908.6', '74000.3', '96013.6', '104985.8', '108845.2', '110765.5', '83814.2', '95809.5', '103009.6', '108539.2', '113703.6', '75220.7', '86779.0', '96119.3', '103205.4', '107897.3'] growth=['0.983', '1.015', '1.015', '1.025', '1.020', '1.195', '1.297', '1.093', '1.037', '1.018', '1.346', '1.143', '1.075', '1.054', '1.048', '1.182', '1.154', '1.108', '1.074', '1.045']
+step:110/500 train_loss:3.2594 grad_norm:0.5430 train_time:346619ms step_avg:3151.08ms
+step:120/500 train_loss:3.1627 grad_norm:0.3443 train_time:378074ms step_avg:3150.62ms
+step:130/500 train_loss:3.0993 grad_norm:0.4013 train_time:409594ms step_avg:3150.72ms
+step:140/500 train_loss:3.0509 grad_norm:0.4336 train_time:441106ms step_avg:3150.76ms
+step:150/500 train_loss:2.9906 grad_norm:0.4297 train_time:472653ms step_avg:3151.02ms
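This run's growth ratios stay mostly near 1, and like the others it initializes the core's residual branch at `--residual-scale-init 0.5`, damping each pass's update. A sketch of what that scaling plausibly looks like; the module shape is an assumption, only the flag and its value come from the logs:

```python
import torch
import torch.nn as nn

class ScaledResidualPass(nn.Module):
    """One recurrence pass h <- h + alpha * f(h) with a learnable scalar alpha.

    Initializing alpha at 0.5 (the --residual-scale-init value in these runs)
    halves the per-pass update, shrinking the effective Jacobian of the loop
    and slowing pass-over-pass amplification of quantization noise.
    """
    def __init__(self, f: nn.Module, alpha_init: float = 0.5):
        super().__init__()
        self.f = f
        self.alpha = nn.Parameter(torch.tensor(alpha_init))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.alpha * self.f(h)

layer = ScaledResidualPass(nn.Sequential(nn.Linear(512, 512), nn.Tanh()))
print(layer(torch.randn(2, 512)).shape)  # torch.Size([2, 512])
```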
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_181843-pwdgzvp6/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_181843-pwdgzvp6/files/output.log
new file mode 100644
index 0000000000..0031d895ac
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_181843-pwdgzvp6/files/output.log
@@ -0,0 +1,35 @@
+wandb:initialized
+warmup_step:1/20
+warmup_step:2/20
+warmup_step:3/20
+warmup_step:4/20
+warmup_step:5/20
+warmup_step:6/20
+warmup_step:7/20
+warmup_step:8/20
+warmup_step:9/20
+warmup_step:10/20
+warmup_step:11/20
+warmup_step:12/20
+warmup_step:13/20
+warmup_step:14/20
+warmup_step:15/20
+warmup_step:16/20
+warmup_step:17/20
+warmup_step:18/20
+warmup_step:19/20
+warmup_step:20/20
+step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.02ms jpw:0.1000
+step:1/500 train_loss:6.9303 grad_norm:0.3807 train_time:3138ms step_avg:3137.92ms
+step:2/500 train_loss:6.0630 grad_norm:0.8852 train_time:6265ms step_avg:3132.44ms
+step:3/500 train_loss:6.2767 grad_norm:1.1566 train_time:9418ms step_avg:3139.33ms
+step:4/500 train_loss:6.8246 grad_norm:2.3705 train_time:12572ms step_avg:3142.95ms
+step:5/500 train_loss:7.1027 grad_norm:2.4915 train_time:15728ms step_avg:3145.56ms
+step:6/500 train_loss:7.8286 grad_norm:8.6216 train_time:18882ms step_avg:3146.95ms
+step:7/500 train_loss:6.9764 grad_norm:2.0638 train_time:22040ms step_avg:3148.60ms
+step:8/500 train_loss:6.6684 grad_norm:1.9308 train_time:25196ms step_avg:3149.51ms
+step:9/500 train_loss:6.6037 grad_norm:1.6760 train_time:28351ms step_avg:3150.11ms
+step:10/500 train_loss:6.3408 grad_norm:1.7243 train_time:31507ms step_avg:3150.71ms
+step:20/500 train_loss:5.1670 grad_norm:1.0415 train_time:63051ms step_avg:3152.55ms
+step:30/500 train_loss:4.5147 grad_norm:1.1903 train_time:94615ms step_avg:3153.85ms
+step:40/500 train_loss:4.0341 grad_norm:0.6994 train_time:126304ms step_avg:3157.60ms
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_181843-pwdgzvp6/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_181843-pwdgzvp6/files/wandb-metadata.json
new file mode 100644
index 0000000000..7f636b6a81
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_181843-pwdgzvp6/files/wandb-metadata.json
@@ -0,0 +1,57 @@
+{
+  "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39",
+  "python": "CPython 3.12.3",
+  "startedAt": "2026-03-26T18:18:43.776616Z",
+  "args": [
+    "--feedback-mode",
+    "diagonal",
+    "--feedback-rank",
+    "2",
+    "--residual-scale-init",
+    "0.5",
+    "--jacobian-proxy-weight",
+    "0.1",
+    "--jacobian-proxy-init",
+    "1.0",
+    "--jacobian-warmdown-steps",
+    "100",
+    "--no-interpass-rmsnorm",
+    "--lora-rank",
+    "8"
+  ],
+  "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py",
+  "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py",
+  "codePathLocal": "train_gpt_recurrent.py",
+  "git": {
+    "remote": "https://github.com/nestamidavaine/parameter-golf.git",
+    "commit": "e07e44321b5c5af051343a3a16d83f0766e85597"
+  },
+  "email": "nesta.midavaine@prosus.com",
+  "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback",
+  "host": "computeinstance-e00c09e8zde17qbk32",
+  "executable": "/home/nesta/parameter-golf/.venv/bin/python3",
+  "cpu_count": 8,
+  "cpu_count_logical": 16,
+  "gpu": "NVIDIA H200",
+  "gpu_count": 1,
+  "disk": {
+    "/": {
+      "total": "1330227675136",
+      "used": "40041984000"
+    }
+  },
+  "memory": {
+    "total": "211069919232"
+  },
+  "gpu_nvidia": [
+    {
+      "name": "NVIDIA H200",
+      "memoryTotal": "150754820096",
+      "cudaCores": 16896,
+      "architecture": "Hopper",
+      "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846"
+    }
+  ],
+  "cudaVersion": "13.0",
+  "writerId": "7cqvs9pug6pue5wub8p3t9muxgo3mx9j"
+}
\ No newline at end of file
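This is the first run to add `--jacobian-proxy-init 1.0 --jacobian-warmdown-steps 100`, and the later logs show the resulting schedule: `jpw` starts at the init value and anneals linearly down to the floor given by `--jacobian-proxy-weight 0.1` (with init 5.0 the logs read jpw:2.5990 at step 50 and jpw:0.1490 at step 100). A schedule consistent with those logged values, assuming the decay begins after the first optimizer step and that the step-0 line simply prints the base weight before the schedule first applies:

```python
def jacobian_proxy_weight(step: int, init: float, floor: float, warmdown_steps: int) -> float:
    """Linear warmdown from `init` to `floor` over `warmdown_steps` steps.

    Reproduces the logged values under the step-1 start assumption:
    init=5.0, floor=0.1 gives 2.599 at step 50 and 0.149 at step 100,
    matching the n9zy31jn output.log below.
    """
    progress = max(step - 1, 0) / warmdown_steps
    return max(floor, init - (init - floor) * progress)

for s in (1, 50, 100, 150):
    print(s, round(jacobian_proxy_weight(s, init=5.0, floor=0.1, warmdown_steps=100), 4))
# 1 -> 5.0, 50 -> 2.599, 100 -> 0.149, 150 -> 0.1
```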
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_182338-0jlwabms/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_182338-0jlwabms/files/output.log
new file mode 100644
index 0000000000..da423b9e56
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_182338-0jlwabms/files/output.log
@@ -0,0 +1,39 @@
+wandb:initialized
+warmup_step:1/20
+warmup_step:2/20
+warmup_step:3/20
+warmup_step:4/20
+warmup_step:5/20
+warmup_step:6/20
+warmup_step:7/20
+warmup_step:8/20
+warmup_step:9/20
+warmup_step:10/20
+warmup_step:11/20
+warmup_step:12/20
+warmup_step:13/20
+warmup_step:14/20
+warmup_step:15/20
+warmup_step:16/20
+warmup_step:17/20
+warmup_step:18/20
+warmup_step:19/20
+warmup_step:20/20
+step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.03ms jpw:0.1000
+step:1/500 train_loss:6.9303 grad_norm:0.3807 train_time:3134ms step_avg:3133.83ms
+step:2/500 train_loss:8.1781 grad_norm:3.1870 train_time:6255ms step_avg:3127.28ms
+step:3/500 train_loss:38.9994 grad_norm:116.1021 train_time:9401ms step_avg:3133.70ms
+step:4/500 train_loss:11.0719 grad_norm:18.4144 train_time:12551ms step_avg:3137.63ms
+step:5/500 train_loss:7.4167 grad_norm:1.3480 train_time:15699ms step_avg:3139.71ms
+step:6/500 train_loss:7.3220 grad_norm:4.6318 train_time:18851ms step_avg:3141.88ms
+step:7/500 train_loss:7.3726 grad_norm:2.9103 train_time:22002ms step_avg:3143.20ms
+step:8/500 train_loss:7.0895 grad_norm:2.4760 train_time:25155ms step_avg:3144.38ms
+step:9/500 train_loss:6.7142 grad_norm:1.5844 train_time:28307ms step_avg:3145.17ms
+step:10/500 train_loss:6.5562 grad_norm:2.4313 train_time:31456ms step_avg:3145.64ms
+step:20/500 train_loss:5.8714 grad_norm:3.6692 train_time:62966ms step_avg:3148.32ms
+step:30/500 train_loss:5.0120 grad_norm:1.4222 train_time:94460ms step_avg:3148.67ms
+step:40/500 train_loss:4.3942 grad_norm:0.4195 train_time:126054ms step_avg:3151.34ms
+step:50/500 train_loss:4.0610 grad_norm:0.4135 train_time:157574ms step_avg:3151.47ms
+step:50/500 val_loss:4.0211 val_bpb:2.3816 train_time:157605ms step_avg:3152.10ms h_norms=['106271.3', '105043.7', '106111.3', '108041.9', '104153.4', '127767.0', '319009.2', '273895.3', '225243.8', '193857.2', '141638.6', '148276.8', '156801.2', '166727.3', '166524.3', '147292.3', '167621.6', '181512.6', '192726.1', '193725.8'] growth=['0.971', '0.988', '1.010', '1.018', '0.964', '1.227', '2.497', '0.859', '0.822', '0.861', '1.310', '1.047', '1.057', '1.063', '0.999', '1.237', '1.138', '1.083', '1.062', '1.005'] jpw:0.5590
+step:60/500 train_loss:3.8035 grad_norm:0.5494 train_time:189124ms step_avg:3152.07ms
+step:70/500 train_loss:3.6830 grad_norm:0.5418 train_time:220632ms step_avg:3151.88ms
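With the proxy init raised to 1.0 the regularizer visibly bites early (train_loss spiking to 38.9994 with grad_norm 116.1021 at step 3 before recovering). The `jacobian-proxy` flags suggest a penalty on the size of the core's Jacobian-vector products; a finite-difference sketch of one such proxy, with all names hypothetical and the exact form an assumption:

```python
import torch

def jacobian_norm_proxy(core, h: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    """Finite-difference estimate of ||J v|| for a unit random direction v.

    Penalizing this toward values <= 1 discourages the shared core from
    amplifying small perturbations (such as weight-quantization noise)
    across recurrence passes.
    """
    v = torch.randn_like(h)
    v = v / v.norm().clamp_min(1e-12)
    jv = (core(h + eps * v) - core(h)) / eps  # ~ J @ v
    return jv.norm()

core = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.Tanh())
h = torch.randn(8, 512)
penalty = torch.relu(jacobian_norm_proxy(core, h) - 1.0) ** 2
loss = penalty * 0.1  # weighted by the scheduled jpw value
loss.backward()
```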
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_183144-z6wj6zap/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_183144-z6wj6zap/files/output.log
new file mode 100644
index 0000000000..3ec7213eb5
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_183144-z6wj6zap/files/output.log
@@ -0,0 +1,39 @@
+wandb:initialized
+warmup_step:1/20
+warmup_step:2/20
+warmup_step:3/20
+warmup_step:4/20
+warmup_step:5/20
+warmup_step:6/20
+warmup_step:7/20
+warmup_step:8/20
+warmup_step:9/20
+warmup_step:10/20
+warmup_step:11/20
+warmup_step:12/20
+warmup_step:13/20
+warmup_step:14/20
+warmup_step:15/20
+warmup_step:16/20
+warmup_step:17/20
+warmup_step:18/20
+warmup_step:19/20
+warmup_step:20/20
+step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.01ms jpw:0.1000
+step:1/500 train_loss:6.9303 grad_norm:0.3807 train_time:3128ms step_avg:3128.41ms
+step:2/500 train_loss:8.2222 grad_norm:3.3072 train_time:6251ms step_avg:3125.58ms
+step:3/500 train_loss:33.5787 grad_norm:96.8257 train_time:9398ms step_avg:3132.71ms
+step:4/500 train_loss:9.6490 grad_norm:12.2426 train_time:12549ms step_avg:3137.20ms
+step:5/500 train_loss:7.2469 grad_norm:1.2742 train_time:15703ms step_avg:3140.50ms
+step:6/500 train_loss:7.2557 grad_norm:4.1964 train_time:18853ms step_avg:3142.10ms
+step:7/500 train_loss:7.1683 grad_norm:2.3011 train_time:22004ms step_avg:3143.39ms
+step:8/500 train_loss:6.8964 grad_norm:2.7614 train_time:25156ms step_avg:3144.44ms
+step:9/500 train_loss:6.7751 grad_norm:3.1754 train_time:28305ms step_avg:3145.02ms
+step:10/500 train_loss:6.3857 grad_norm:1.2630 train_time:31455ms step_avg:3145.47ms
+step:20/500 train_loss:5.6141 grad_norm:1.3015 train_time:62945ms step_avg:3147.24ms
+step:30/500 train_loss:5.6210 grad_norm:9.5553 train_time:94425ms step_avg:3147.51ms
+step:40/500 train_loss:4.6846 grad_norm:0.5640 train_time:126016ms step_avg:3150.39ms
+step:50/500 train_loss:4.1659 grad_norm:0.2352 train_time:157487ms step_avg:3149.75ms
+step:50/500 val_loss:4.1176 val_bpb:2.4387 train_time:157519ms step_avg:3150.38ms h_norms=['335866.4', '282945.0', '272903.9', '222662.3', '156368.6', '961668.8', '881032.8', '713078.4', '516760.4', '397430.0', '153059.2', '147951.4', '140618.3', '126310.0', '114506.3', '125878.2', '191353.2', '171239.0', '146885.7', '127148.9'] growth=['1.930', '0.842', '0.965', '0.816', '0.702', '6.150', '0.916', '0.809', '0.725', '0.769', '0.966', '0.967', '0.950', '0.898', '0.907', '1.026', '1.520', '0.895', '0.858', '0.866'] jpw:0.5590
+step:60/500 train_loss:3.8723 grad_norm:0.4775 train_time:189007ms step_avg:3150.12ms
+step:70/500 train_loss:3.7416 grad_norm:0.5037 train_time:220553ms step_avg:3150.76ms
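This run's step-50 h_norms blow up to ~9.6e5 on one probe with a single-pass growth of 6.150, and like every run here it passes `--no-interpass-rmsnorm`, so nothing renormalizes the state between passes. A sketch of the inter-pass RMSNorm that flag presumably disables; the interface is assumed:

```python
import torch

class InterPassRMSNorm(torch.nn.Module):
    """RMS-normalize the recurrent state between passes.

    Renormalizing h between passes caps norm growth, at the cost of erasing
    magnitude information the next pass might want, which is presumably why
    these runs ablate it with --no-interpass-rmsnorm.
    """
    def __init__(self, dim: int = 512, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = torch.nn.Parameter(torch.ones(dim))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        rms = h.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return self.weight * h / rms

h = torch.randn(2, 16, 512) * 100.0  # exaggerated norm blow-up
print(InterPassRMSNorm()(h).pow(2).mean(-1).sqrt().mean())  # ~1.0
```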
"used": "40043270144" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "2vgq6klm9tmj4dc1tcpqmvwzjgdm14bn" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_184007-n9zy31jn/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_184007-n9zy31jn/files/config.yaml new file mode 100644 index 0000000000..0c556f5cfc --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_184007-n9zy31jn/files/config.yaml @@ -0,0 +1,106 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + p9t8sdzrgiip8uwe1safy5db893hm1c3: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.1" + - --jacobian-proxy-init + - "5.0" + - --jacobian-warmdown-steps + - "100" + - --lora-warmup-steps + - "50" + - --no-interpass-rmsnorm + - --lora-rank + - "8" + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "40043945984" + email: nesta.midavaine@prosus.com + executable: /home/nesta/parameter-golf/.venv/bin/python3 + git: + commit: e07e44321b5c5af051343a3a16d83f0766e85597 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + python: CPython 3.12.3 + root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback + startedAt: "2026-03-26T18:40:07.925502Z" + writerId: p9t8sdzrgiip8uwe1safy5db893hm1c3 + m: [] + python_version: 3.12.3 + t: + "1": + - 1 + "2": + - 1 + "3": + - 2 + - 13 + - 16 + - 61 + "4": 3.12.3 + "5": 0.25.1 + "10": + - 20 + "12": 0.25.1 + "13": linux-x86_64 +core_end: + value: 8 +core_start: + value: 3 +feedback_mode: + value: diagonal +feedback_rank: + value: 2 +interpass_rmsnorm: + value: false +iterations: + value: 500 +jacobian_proxy_weight: + value: 0.1 +matrix_lr: + value: 0.025 +model_dim: + value: 512 +n_params: + value: 28156000 +num_layers: + value: 11 +num_passes: + value: 4 +residual_scale_init: + value: 0.5 +scalar_lr: + value: 0.025 +seed: + value: 1337 +train_batch_tokens: + value: 786432 +train_seq_len: + value: 2048 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_184007-n9zy31jn/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_184007-n9zy31jn/files/output.log new file mode 100644 index 0000000000..9e72f53613 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_184007-n9zy31jn/files/output.log @@ -0,0 +1,100 @@ +wandb:initialized +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 
+warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.02ms jpw:0.1000 +step:1/500 train_loss:6.9303 grad_norm:0.3807 train_time:3131ms step_avg:3131.00ms +step:2/500 train_loss:8.2530 grad_norm:3.3922 train_time:6252ms step_avg:3126.10ms +step:3/500 train_loss:8.0060 grad_norm:9.9122 train_time:9401ms step_avg:3133.59ms +step:4/500 train_loss:7.1538 grad_norm:1.2770 train_time:12549ms step_avg:3137.34ms +step:5/500 train_loss:7.1510 grad_norm:1.7610 train_time:15698ms step_avg:3139.56ms +step:6/500 train_loss:6.9298 grad_norm:2.0765 train_time:18847ms step_avg:3141.20ms +step:7/500 train_loss:6.9181 grad_norm:3.2331 train_time:21996ms step_avg:3142.27ms +step:8/500 train_loss:6.7962 grad_norm:1.6115 train_time:25145ms step_avg:3143.09ms +step:9/500 train_loss:6.4521 grad_norm:1.1930 train_time:28294ms step_avg:3143.83ms +step:10/500 train_loss:6.1161 grad_norm:1.3597 train_time:31450ms step_avg:3144.97ms +step:20/500 train_loss:4.7664 grad_norm:2.1164 train_time:62972ms step_avg:3148.59ms +step:30/500 train_loss:4.3513 grad_norm:2.1193 train_time:94486ms step_avg:3149.53ms +step:40/500 train_loss:4.3373 grad_norm:6.4797 train_time:126119ms step_avg:3152.97ms +step:50/500 train_loss:3.9627 grad_norm:0.7730 train_time:157627ms step_avg:3152.54ms +step:50/500 val_loss:3.9200 val_bpb:2.3216 train_time:157659ms step_avg:3153.17ms h_norms=['64274.7', '65654.7', '65790.9', '63923.0', '61167.5', '71576.8', '76431.3', '80533.1', '84488.5', '86665.3', '87025.9', '725127.4', '632645.4', '544538.6', '445896.0', '80419.9', '91824.9', '110608.9', '131367.6', '145242.4'] growth=['0.989', '1.021', '1.002', '0.972', '0.957', '1.170', '1.068', '1.054', '1.049', '1.026', '1.357', '8.332', '0.872', '0.861', '0.819', '1.180', '1.142', '1.205', '1.188', '1.106'] jpw:2.5990 +step:60/500 train_loss:3.7054 grad_norm:0.9363 train_time:189133ms step_avg:3152.22ms +step:70/500 train_loss:3.5955 grad_norm:0.4140 train_time:220643ms step_avg:3152.04ms +step:80/500 train_loss:3.5108 grad_norm:0.4226 train_time:252172ms step_avg:3152.16ms +step:90/500 train_loss:3.3388 grad_norm:0.4857 train_time:283690ms step_avg:3152.12ms +step:100/500 train_loss:3.2920 grad_norm:0.4328 train_time:315205ms step_avg:3152.05ms +step:100/500 val_loss:3.2529 val_bpb:1.9266 train_time:315237ms step_avg:3152.37ms h_norms=['68496.9', '69533.5', '70329.0', '70151.5', '67282.9', '82184.2', '97588.3', '111284.9', '122709.0', '127061.8', '114348.1', '154423.4', '158482.6', '160652.5', '158037.5', '95279.7', '112085.8', '124794.6', '134699.2', '137489.2'] growth=['1.024', '1.015', '1.011', '0.997', '0.959', '1.221', '1.187', '1.140', '1.103', '1.035', '1.651', '1.350', '1.026', '1.014', '0.984', '1.374', '1.176', '1.113', '1.079', '1.021'] jpw:0.1490 +step:110/500 train_loss:3.1973 grad_norm:0.4638 train_time:346673ms step_avg:3151.57ms +step:120/500 train_loss:3.0966 grad_norm:0.3586 train_time:378163ms step_avg:3151.36ms +step:130/500 train_loss:3.0353 grad_norm:0.4942 train_time:409701ms step_avg:3151.55ms +step:140/500 train_loss:2.9792 grad_norm:0.3505 train_time:441225ms step_avg:3151.60ms +step:150/500 train_loss:2.9324 grad_norm:0.3410 train_time:472737ms step_avg:3151.58ms +step:150/500 val_loss:2.9123 val_bpb:1.7248 train_time:472768ms 
step_avg:3151.79ms h_norms=['54591.3', '54936.0', '54614.2', '52686.0', '48419.6', '56665.0', '62607.1', '67063.4', '69061.8', '67313.2', '78936.0', '89378.7', '89174.9', '84854.0', '79911.0', '75692.4', '81948.0', '80201.0', '77479.5', '72017.6'] growth=['1.016', '1.006', '0.994', '0.965', '0.919', '1.170', '1.105', '1.071', '1.030', '0.975', '1.623', '1.132', '0.998', '0.952', '0.942', '1.556', '1.083', '0.979', '0.966', '0.930'] jpw:0.1000 +step:160/500 train_loss:2.9131 grad_norm:0.3417 train_time:504245ms step_avg:3151.53ms +step:170/500 train_loss:2.8519 grad_norm:0.3201 train_time:535772ms step_avg:3151.60ms +step:180/500 train_loss:2.7615 grad_norm:0.3215 train_time:567320ms step_avg:3151.78ms +step:190/500 train_loss:2.7781 grad_norm:0.3936 train_time:598853ms step_avg:3151.86ms +step:200/500 train_loss:2.6955 grad_norm:0.3496 train_time:630517ms step_avg:3152.58ms +step:200/500 val_loss:2.7432 val_bpb:1.6247 train_time:630549ms step_avg:3152.74ms h_norms=['46320.0', '46179.5', '45761.4', '44508.9', '40976.6', '47797.8', '51896.9', '53164.9', '52566.1', '49366.9', '45993.8', '52287.0', '57769.8', '55335.3', '50418.6', '61436.7', '59953.2', '55890.7', '54343.5', '50288.3'] growth=['0.998', '0.997', '0.991', '0.973', '0.921', '1.166', '1.086', '1.024', '0.989', '0.939', '1.122', '1.137', '1.105', '0.958', '0.911', '1.498', '0.976', '0.932', '0.972', '0.925'] jpw:0.1000 +step:210/500 train_loss:2.6793 grad_norm:0.2864 train_time:662071ms step_avg:3152.72ms +step:220/500 train_loss:2.7259 grad_norm:0.2815 train_time:693632ms step_avg:3152.87ms +step:230/500 train_loss:2.6538 grad_norm:0.2559 train_time:725190ms step_avg:3153.00ms +step:240/500 train_loss:2.6575 grad_norm:0.3120 train_time:756724ms step_avg:3153.02ms +step:250/500 train_loss:2.7087 grad_norm:0.4401 train_time:788283ms step_avg:3153.13ms +step:250/500 val_loss:2.6326 val_bpb:1.5592 train_time:788315ms step_avg:3153.26ms h_norms=['43543.6', '43248.6', '42211.9', '41586.9', '38254.6', '45794.9', '47618.7', '47301.2', '46865.0', '43024.5', '41461.4', '45151.9', '46339.2', '43777.7', '39702.6', '44273.0', '48162.3', '46945.5', '44420.7', '40421.7'] growth=['0.993', '0.993', '0.976', '0.985', '0.920', '1.197', '1.040', '0.993', '0.991', '0.918', '1.086', '1.089', '1.026', '0.945', '0.907', '1.160', '1.088', '0.975', '0.946', '0.910'] jpw:0.1000 +step:260/500 train_loss:2.6426 grad_norm:0.3115 train_time:819839ms step_avg:3153.23ms +step:270/500 train_loss:2.6021 grad_norm:0.3289 train_time:851372ms step_avg:3153.23ms +step:280/500 train_loss:2.5391 grad_norm:0.2397 train_time:882901ms step_avg:3153.22ms +step:290/500 train_loss:2.5774 grad_norm:0.2781 train_time:914431ms step_avg:3153.21ms +step:300/500 train_loss:2.5344 grad_norm:0.2146 train_time:945978ms step_avg:3153.26ms +step:300/500 val_loss:2.5490 val_bpb:1.5097 train_time:946010ms step_avg:3153.37ms h_norms=['43232.1', '42432.1', '41223.9', '40796.4', '37886.6', '45413.0', '46563.8', '45871.2', '46402.4', '43360.7', '41003.4', '46803.1', '44047.2', '41094.8', '37367.9', '56900.2', '52918.1', '50843.4', '45060.9', '39790.1'] growth=['0.974', '0.981', '0.972', '0.990', '0.929', '1.199', '1.025', '0.985', '1.012', '0.934', '1.082', '1.141', '0.941', '0.933', '0.909', '1.499', '0.930', '0.961', '0.886', '0.883'] jpw:0.1000 +step:310/500 train_loss:2.4523 grad_norm:0.2264 train_time:977542ms step_avg:3153.36ms +step:320/500 train_loss:2.5183 grad_norm:0.2105 train_time:1009085ms step_avg:3153.39ms +step:330/500 train_loss:2.5571 grad_norm:0.2283 train_time:1040628ms 
step_avg:3153.42ms +step:340/500 train_loss:2.4750 grad_norm:0.4910 train_time:1072163ms step_avg:3153.42ms +step:350/500 train_loss:2.5743 grad_norm:0.2184 train_time:1103847ms step_avg:3153.85ms +step:350/500 val_loss:2.4914 val_bpb:1.4755 train_time:1103879ms step_avg:3153.94ms h_norms=['44224.9', '42616.1', '40801.7', '40665.7', '38003.5', '45954.9', '46072.2', '44968.3', '46243.3', '43608.8', '39877.4', '46404.4', '44214.0', '40903.7', '37660.7', '46128.9', '44751.9', '42821.9', '41310.8', '37880.0'] growth=['0.963', '0.964', '0.957', '0.997', '0.935', '1.209', '1.003', '0.976', '1.028', '0.943', '1.050', '1.164', '0.953', '0.925', '0.921', '1.213', '0.970', '0.957', '0.965', '0.917'] jpw:0.1000 +step:360/500 train_loss:2.3401 grad_norm:0.2802 train_time:1135398ms step_avg:3153.88ms +step:370/500 train_loss:2.5430 grad_norm:0.2968 train_time:1166946ms step_avg:3153.91ms +step:380/500 train_loss:2.4805 grad_norm:0.2258 train_time:1198510ms step_avg:3153.97ms +step:390/500 train_loss:2.4340 grad_norm:0.1956 train_time:1230116ms step_avg:3154.14ms +step:400/500 train_loss:2.4815 grad_norm:0.1874 train_time:1261679ms step_avg:3154.20ms +step:400/500 val_loss:2.4425 val_bpb:1.4466 train_time:1261711ms step_avg:3154.28ms h_norms=['46265.4', '44746.1', '42831.1', '42585.7', '40307.7', '47408.2', '47175.3', '46076.6', '47453.2', '45874.0', '42031.8', '46244.6', '48766.2', '44593.0', '39205.9', '48067.7', '47603.8', '45564.2', '42074.1', '38804.8'] growth=['0.948', '0.967', '0.957', '0.994', '0.947', '1.176', '0.995', '0.977', '1.030', '0.967', '1.041', '1.100', '1.055', '0.914', '0.879', '1.187', '0.990', '0.957', '0.923', '0.922'] jpw:0.1000 +step:410/500 train_loss:2.4447 grad_norm:0.2270 train_time:1293242ms step_avg:3154.25ms +step:420/500 train_loss:2.4766 grad_norm:0.1941 train_time:1324799ms step_avg:3154.28ms +step:430/500 train_loss:2.3966 grad_norm:0.2439 train_time:1356341ms step_avg:3154.28ms +step:440/500 train_loss:2.4855 grad_norm:0.2573 train_time:1387879ms step_avg:3154.27ms +step:450/500 train_loss:2.3007 grad_norm:0.5815 train_time:1419425ms step_avg:3154.28ms +step:450/500 val_loss:2.4005 val_bpb:1.4217 train_time:1419456ms step_avg:3154.35ms h_norms=['49465.2', '47383.0', '45019.3', '44070.9', '41581.5', '48767.1', '48045.7', '46626.6', '47351.6', '45564.7', '73151.9', '1831187.8', '1431135.6', '1073570.8', '836909.8', '68914.1', '62393.5', '62367.8', '57867.6', '52301.5'] growth=['0.938', '0.958', '0.950', '0.979', '0.944', '1.173', '0.985', '0.970', '1.016', '0.962', '1.777', '25.033', '0.782', '0.750', '0.780', '1.670', '0.905', '1.000', '0.928', '0.904'] jpw:0.1000 +step:460/500 train_loss:2.4303 grad_norm:0.1559 train_time:1450982ms step_avg:3154.31ms +step:470/500 train_loss:2.3816 grad_norm:0.1371 train_time:1482497ms step_avg:3154.25ms +step:480/500 train_loss:2.3382 grad_norm:0.2851 train_time:1514014ms step_avg:3154.20ms +step:490/500 train_loss:2.3456 grad_norm:0.4312 train_time:1545525ms step_avg:3154.13ms +step:500/500 train_loss:2.3579 grad_norm:0.2399 train_time:1577067ms step_avg:3154.13ms +step:500/500 val_loss:2.3751 val_bpb:1.4067 train_time:1577098ms step_avg:3154.20ms h_norms=['54182.7', '51741.9', '48776.3', '47916.3', '45263.7', '52401.3', '51762.4', '50104.6', '51713.6', '50582.7', '434613.8', '385628.6', '300553.7', '315624.6', '281217.0', '176020.4', '267042.6', '201712.7', '153802.4', '124577.2'] growth=['0.928', '0.955', '0.943', '0.982', '0.945', '1.158', '0.988', '0.968', '1.032', '0.978', '9.613', '0.887', '0.779', '1.050', '0.891', '3.893', 
'1.517', '0.755', '0.762', '0.810'] jpw:0.1000
+peak memory allocated: 66527 MiB reserved: 67428 MiB
+ema:applying EMA weights
+DIAGNOSTIC post_ema val_loss:2.7827 val_bpb:1.6481 eval_time:85106ms
+Serialized model: 110942659 bytes
+Code size: 103206 bytes
+Serialized model int6+lzma: 10736384 bytes
+Total submission size int6+lzma: 10839590 bytes
+final_int6_roundtrip val_loss:2.8519 val_bpb:1.6890 eval_time:84507ms
+final_int6_roundtrip_exact val_loss:2.85187392 val_bpb:1.68904037
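The closing lines report the export path inherited from the best record: the ~110.9 MB serialized model compresses to ~10.7 MB after int6 quantization plus lzma, and the int6 round-trip costs about 0.04 BPB on this run (1.6481 post-EMA to 1.6890). A rough sketch of how such a size figure can be measured, using per-tensor symmetric 6-bit quantization; the record's GPTQ-lite pipeline is more elaborate than this:

```python
import lzma
import torch

def int6_lzma_size(state_dict) -> int:
    """Quantize each tensor to 6-bit symmetric ints and lzma the result.

    Values are mapped to [-31, 31] per tensor and stored one per byte before
    compression; a real exporter would also pack four 6-bit values into
    three bytes and store the per-tensor scales alongside.
    """
    payload = bytearray()
    for t in state_dict.values():
        t = t.detach().float()
        scale = t.abs().max().clamp_min(1e-12) / 31.0
        q = torch.clamp(torch.round(t / scale), -31, 31).to(torch.int8)
        payload += q.numpy().tobytes()
    return len(lzma.compress(bytes(payload), preset=9))

model = torch.nn.Sequential(torch.nn.Linear(512, 1536), torch.nn.Linear(1536, 512))
print(int6_lzma_size(model.state_dict()), "bytes after int6+lzma")
```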
"Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T18:40:07.925502Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--jacobian-proxy-init", + "5.0", + "--jacobian-warmdown-steps", + "100", + "--lora-warmup-steps", + "50", + "--no-interpass-rmsnorm", + "--lora-rank", + "8" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "40043945984" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "p9t8sdzrgiip8uwe1safy5db893hm1c3" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_184007-n9zy31jn/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_184007-n9zy31jn/files/wandb-summary.json new file mode 100644 index 0000000000..9a7d2e44a8 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_184007-n9zy31jn/files/wandb-summary.json @@ -0,0 +1 @@ +{"lr_scale":1,"grad_norm":0.23986004292964935,"train_loss":2.357853889465332,"_wandb":{"runtime":2777},"_step":500,"val_loss":2.375083764247166,"val_bpb":1.4066583871911558,"step_avg_ms":3154.1338251400157,"_runtime":2777.546620495,"_timestamp":1.7745529855891504e+09} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-8n0ize2o/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-8n0ize2o/files/config.yaml new file mode 100644 index 0000000000..185fab703b --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-8n0ize2o/files/config.yaml @@ -0,0 +1,104 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + gqp3e35vfrjqc78cu18676xpicu32fzn: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.1" + - --jacobian-proxy-init + - "2.0" + - --jacobian-warmdown-steps + - "100" + - --lora-warmup-steps + - "50" + - --no-interpass-rmsnorm + - --lora-rank + - "8" + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "40052236288" + email: nesta.midavaine@prosus.com + executable: 
/home/nesta/parameter-golf/.venv/bin/python3
+ git:
+ commit: e07e44321b5c5af051343a3a16d83f0766e85597
+ remote: https://github.com/nestamidavaine/parameter-golf.git
+ gpu: NVIDIA H200
+ gpu_count: 1
+ gpu_nvidia:
+ - architecture: Hopper
+ cudaCores: 16896
+ memoryTotal: "150754820096"
+ name: NVIDIA H200
+ uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846
+ host: computeinstance-e00c09e8zde17qbk32
+ memory:
+ total: "211069919232"
+ os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39
+ program: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py
+ python: CPython 3.12.3
+ root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback
+ startedAt: "2026-03-26T19:35:30.290587Z"
+ writerId: gqp3e35vfrjqc78cu18676xpicu32fzn
+ m: []
+ python_version: 3.12.3
+ t:
+ "1":
+ - 1
+ "2":
+ - 1
+ "3":
+ - 13
+ - 16
+ "4": 3.12.3
+ "5": 0.25.1
+ "10":
+ - 20
+ "12": 0.25.1
+ "13": linux-x86_64
+core_end:
+ value: 8
+core_start:
+ value: 3
+feedback_mode:
+ value: diagonal
+feedback_rank:
+ value: 2
+interpass_rmsnorm:
+ value: false
+iterations:
+ value: 500
+jacobian_proxy_weight:
+ value: 0.1
+matrix_lr:
+ value: 0.025
+model_dim:
+ value: 512
+n_params:
+ value: 28156000
+num_layers:
+ value: 11
+num_passes:
+ value: 4
+residual_scale_init:
+ value: 0.5
+scalar_lr:
+ value: 0.025
+seed:
+ value: 1337
+train_batch_tokens:
+ value: 786432
+train_seq_len:
+ value: 2048
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-8n0ize2o/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-8n0ize2o/files/output.log
new file mode 100644
index 0000000000..ee502872af
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-8n0ize2o/files/output.log
@@ -0,0 +1,50 @@
+wandb:initialized
+Traceback (most recent call last):
+ File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2155, in <module>
+ main()
+ File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1815, in main
+ warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__
+ return super().__call__(*args, **kwargs)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
+ return self._call_impl(*args, **kwargs)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
+ return forward_call(*args, **kwargs)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper
+ return fn(*args, **kwargs)
+ ^^^^^^^^^^^^^^^^^^^
+ File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
+ return self._call_impl(*args, **kwargs)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
+ return forward_call(*args, **kwargs)
+ 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1061, in forward + x, h_core_in, h_core_out, pass_penalty = self._forward_hidden(input_ids, feedback_fn, stabilizer) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1033, in _forward_hidden + x, raw_v = self.blocks[j](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 786, in forward + x_out = x_out + self.mlp_scale.to(dtype=x_out.dtype)[None, None, :] * self.mlp(self.mlp_norm(x_out) * self.ln_scale_factor, up_w, down_w) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 747, in forward + x = F.leaky_relu(F.linear(x, up_w.to(x.dtype)), negative_slope=0.5) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 215.94 MiB is free. Including non-PyTorch memory, this process has 54.45 GiB memory in use. Process 386552 has 65.25 GiB memory in use. Process 386553 has 19.88 GiB memory in use. Of the allocated memory 53.67 GiB is allocated by PyTorch, and 120.86 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf) diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-8n0ize2o/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-8n0ize2o/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-8n0ize2o/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-8n0ize2o/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-8n0ize2o/files/wandb-metadata.json new file mode 100644 index 0000000000..b8fee3f7a7 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-8n0ize2o/files/wandb-metadata.json @@ -0,0 +1,59 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T19:35:30.290587Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--jacobian-proxy-init", + "2.0", + "--jacobian-warmdown-steps", 
+ "100", + "--lora-warmup-steps", + "50", + "--no-interpass-rmsnorm", + "--lora-rank", + "8" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "40052236288" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "gqp3e35vfrjqc78cu18676xpicu32fzn" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-8n0ize2o/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-8n0ize2o/files/wandb-summary.json new file mode 100644 index 0000000000..b0a620d0c1 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-8n0ize2o/files/wandb-summary.json @@ -0,0 +1 @@ +{"_wandb":{"runtime":0},"_runtime":0} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-9ekyp2ua/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-9ekyp2ua/files/config.yaml new file mode 100644 index 0000000000..956c0c7a25 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-9ekyp2ua/files/config.yaml @@ -0,0 +1,104 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + ijj0agehn16h6kvvawah0icafug73yke: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.1" + - --jacobian-proxy-init + - "2.0" + - --jacobian-warmdown-steps + - "100" + - --lora-warmup-steps + - "50" + - --no-interpass-rmsnorm + - --lora-rank + - "8" + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "40052224000" + email: nesta.midavaine@prosus.com + executable: /home/nesta/parameter-golf/.venv/bin/python3 + git: + commit: e07e44321b5c5af051343a3a16d83f0766e85597 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: 
/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py
+ python: CPython 3.12.3
+ root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback
+ startedAt: "2026-03-26T19:35:30.190074Z"
+ writerId: ijj0agehn16h6kvvawah0icafug73yke
+ m: []
+ python_version: 3.12.3
+ t:
+ "1":
+ - 1
+ "2":
+ - 1
+ "3":
+ - 13
+ - 16
+ "4": 3.12.3
+ "5": 0.25.1
+ "10":
+ - 20
+ "12": 0.25.1
+ "13": linux-x86_64
+core_end:
+ value: 8
+core_start:
+ value: 3
+feedback_mode:
+ value: diagonal
+feedback_rank:
+ value: 2
+interpass_rmsnorm:
+ value: false
+iterations:
+ value: 500
+jacobian_proxy_weight:
+ value: 0.1
+matrix_lr:
+ value: 0.025
+model_dim:
+ value: 512
+n_params:
+ value: 28156000
+num_layers:
+ value: 11
+num_passes:
+ value: 4
+residual_scale_init:
+ value: 0.5
+scalar_lr:
+ value: 0.025
+seed:
+ value: 1337
+train_batch_tokens:
+ value: 786432
+train_seq_len:
+ value: 2048
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-9ekyp2ua/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-9ekyp2ua/files/output.log
new file mode 100644
index 0000000000..0670c5c62f
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-9ekyp2ua/files/output.log
@@ -0,0 +1,14 @@
+wandb:initialized
+Traceback (most recent call last):
+ File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2155, in <module>
+ main()
+ File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1816, in main
+ (warmup_loss * grad_scale).backward()
+ File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_tensor.py", line 631, in backward
+ torch.autograd.backward(
+ File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/__init__.py", line 381, in backward
+ _engine_run_backward(
+ File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/graph.py", line 869, in _engine_run_backward
+ return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 576.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 509.94 MiB is free. Process 386554 has 54.45 GiB memory in use. Including non-PyTorch memory, this process has 64.96 GiB memory in use. Process 386553 has 19.88 GiB memory in use. Of the allocated memory 64.13 GiB is allocated by PyTorch, and 110.78 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf) diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-9ekyp2ua/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-9ekyp2ua/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-9ekyp2ua/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-9ekyp2ua/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-9ekyp2ua/files/wandb-metadata.json new file mode 100644 index 0000000000..bfaebcaa1c --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-9ekyp2ua/files/wandb-metadata.json @@ -0,0 +1,59 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T19:35:30.190074Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--jacobian-proxy-init", + "2.0", + "--jacobian-warmdown-steps", 
+ "100", + "--lora-warmup-steps", + "50", + "--no-interpass-rmsnorm", + "--lora-rank", + "8" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "40052224000" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "ijj0agehn16h6kvvawah0icafug73yke" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-9ekyp2ua/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-9ekyp2ua/files/wandb-summary.json new file mode 100644 index 0000000000..1d476fc886 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-9ekyp2ua/files/wandb-summary.json @@ -0,0 +1 @@ +{"_runtime":0,"_wandb":{"runtime":0}} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-jsf1k0pv/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-jsf1k0pv/files/config.yaml new file mode 100644 index 0000000000..9d55a97684 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-jsf1k0pv/files/config.yaml @@ -0,0 +1,104 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + vcb65sfrwojgh1nskerjp2lhh1kjjlgy: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.1" + - --jacobian-proxy-init + - "2.0" + - --jacobian-warmdown-steps + - "100" + - --lora-warmup-steps + - "50" + - --no-interpass-rmsnorm + - --lora-rank + - "8" + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "40052248576" + email: nesta.midavaine@prosus.com + executable: /home/nesta/parameter-golf/.venv/bin/python3 + git: + commit: e07e44321b5c5af051343a3a16d83f0766e85597 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: 
/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py
+ python: CPython 3.12.3
+ root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback
+ startedAt: "2026-03-26T19:35:30.424020Z"
+ writerId: vcb65sfrwojgh1nskerjp2lhh1kjjlgy
+ m: []
+ python_version: 3.12.3
+ t:
+ "1":
+ - 1
+ "2":
+ - 1
+ "3":
+ - 13
+ - 16
+ "4": 3.12.3
+ "5": 0.25.1
+ "10":
+ - 20
+ "12": 0.25.1
+ "13": linux-x86_64
+core_end:
+ value: 8
+core_start:
+ value: 3
+feedback_mode:
+ value: diagonal
+feedback_rank:
+ value: 2
+interpass_rmsnorm:
+ value: false
+iterations:
+ value: 500
+jacobian_proxy_weight:
+ value: 0.1
+matrix_lr:
+ value: 0.025
+model_dim:
+ value: 512
+n_params:
+ value: 28156000
+num_layers:
+ value: 11
+num_passes:
+ value: 4
+residual_scale_init:
+ value: 0.5
+scalar_lr:
+ value: 0.025
+seed:
+ value: 1337
+train_batch_tokens:
+ value: 786432
+train_seq_len:
+ value: 2048
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-jsf1k0pv/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-jsf1k0pv/files/output.log
new file mode 100644
index 0000000000..42fa78428c
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-jsf1k0pv/files/output.log
@@ -0,0 +1,50 @@
+wandb:initialized
+Traceback (most recent call last):
+ File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2155, in <module>
+ main()
+ File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1815, in main
+ warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__
+ return super().__call__(*args, **kwargs)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
+ return self._call_impl(*args, **kwargs)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
+ return forward_call(*args, **kwargs)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper
+ return fn(*args, **kwargs)
+ ^^^^^^^^^^^^^^^^^^^
+ File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
+ return self._call_impl(*args, **kwargs)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
+ return forward_call(*args, **kwargs)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1061, in forward
+ x, h_core_in, h_core_out, pass_penalty = self._forward_hidden(input_ids, feedback_fn, stabilizer)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1033, in _forward_hidden
+ x, raw_v = self.blocks[j](x, x0, q_w, 
k_w, v_w, out_w, up_w, down_w, + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 786, in forward + x_out = x_out + self.mlp_scale.to(dtype=x_out.dtype)[None, None, :] * self.mlp(self.mlp_norm(x_out) * self.ln_scale_factor, up_w, down_w) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 748, in forward + return F.linear(x.square(), down_w.to(x.dtype)) + ^^^^^^^^^^ +torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 576.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 195.94 MiB is free. Process 386554 has 54.46 GiB memory in use. Process 386552 has 65.25 GiB memory in use. Including non-PyTorch memory, this process has 19.88 GiB memory in use. Of the allocated memory 19.11 GiB is allocated by PyTorch, and 100.95 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf) diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-jsf1k0pv/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-jsf1k0pv/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-jsf1k0pv/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-jsf1k0pv/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-jsf1k0pv/files/wandb-metadata.json new file mode 100644 index 0000000000..b7808faf52 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-jsf1k0pv/files/wandb-metadata.json @@ -0,0 +1,59 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T19:35:30.424020Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--jacobian-proxy-init", + "2.0", + "--jacobian-warmdown-steps", 
+ "100", + "--lora-warmup-steps", + "50", + "--no-interpass-rmsnorm", + "--lora-rank", + "8" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "40052248576" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "vcb65sfrwojgh1nskerjp2lhh1kjjlgy" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-jsf1k0pv/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-jsf1k0pv/files/wandb-summary.json new file mode 100644 index 0000000000..1d476fc886 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-jsf1k0pv/files/wandb-summary.json @@ -0,0 +1 @@ +{"_runtime":0,"_wandb":{"runtime":0}} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_194337-b8eb1lhl/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_194337-b8eb1lhl/files/output.log new file mode 100644 index 0000000000..f109d80e95 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_194337-b8eb1lhl/files/output.log @@ -0,0 +1,63 @@ +wandb:initialized +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.02ms jpw:0.1000 +step:1/500 train_loss:6.9303 grad_norm:0.3807 train_time:3150ms step_avg:3149.64ms +step:2/500 train_loss:8.2530 grad_norm:3.3922 train_time:6286ms step_avg:3143.18ms +step:3/500 train_loss:7.4761 grad_norm:1.7568 train_time:9456ms step_avg:3151.99ms +step:4/500 train_loss:7.5765 grad_norm:1.7785 train_time:12627ms step_avg:3156.70ms +step:5/500 train_loss:7.3284 grad_norm:1.8550 train_time:15795ms step_avg:3159.04ms +step:6/500 train_loss:7.0817 grad_norm:1.4558 train_time:18966ms step_avg:3160.97ms +step:7/500 train_loss:6.9079 grad_norm:2.0922 train_time:22139ms step_avg:3162.66ms +step:8/500 train_loss:6.9275 grad_norm:2.0590 train_time:25309ms step_avg:3163.66ms +step:9/500 train_loss:6.6772 grad_norm:1.6278 train_time:28478ms step_avg:3164.21ms +step:10/500 train_loss:6.2414 grad_norm:1.7465 train_time:31650ms 
step_avg:3164.98ms +step:20/500 train_loss:5.1363 grad_norm:2.7138 train_time:63335ms step_avg:3166.75ms +step:30/500 train_loss:4.3551 grad_norm:1.3891 train_time:95058ms step_avg:3168.60ms +step:40/500 train_loss:4.0787 grad_norm:1.7074 train_time:126907ms step_avg:3172.68ms +step:50/500 train_loss:3.8860 grad_norm:0.8342 train_time:158631ms step_avg:3172.61ms +step:50/500 val_loss:3.8436 val_bpb:2.2764 train_time:158662ms step_avg:3173.24ms h_norms=['29399.1', '30127.6', '31287.8', '32221.2', '34106.9', '39952.1', '45442.3', '51958.3', '58827.5', '67490.0', '55560.2', '64617.7', '74357.9', '86351.5', '99810.2', '82827.4', '103212.4', '115017.4', '128838.7', '149489.9'] growth=['1.084', '1.025', '1.039', '1.030', '1.059', '1.171', '1.137', '1.143', '1.132', '1.147', '1.186', '1.163', '1.151', '1.161', '1.156', '1.254', '1.246', '1.114', '1.120', '1.160'] jpw:1.0690 +step:60/500 train_loss:3.6867 grad_norm:0.7324 train_time:190352ms step_avg:3172.53ms +step:70/500 train_loss:3.5703 grad_norm:1.0982 train_time:222085ms step_avg:3172.64ms +step:80/500 train_loss:3.5100 grad_norm:0.6625 train_time:253815ms step_avg:3172.68ms +step:90/500 train_loss:3.3401 grad_norm:0.6580 train_time:285565ms step_avg:3172.94ms +step:100/500 train_loss:3.2941 grad_norm:0.6615 train_time:317291ms step_avg:3172.91ms +step:100/500 val_loss:3.2441 val_bpb:1.9214 train_time:317323ms step_avg:3173.23ms h_norms=['47705.2', '50092.6', '51794.0', '52615.4', '53976.0', '64508.0', '74412.4', '84893.8', '95510.2', '106712.2', '85551.0', '100970.7', '116826.7', '133686.2', '150967.2', '117625.8', '139019.5', '162155.8', '186225.0', '211766.1'] growth=['1.052', '1.050', '1.034', '1.016', '1.026', '1.195', '1.154', '1.141', '1.125', '1.117', '1.208', '1.180', '1.157', '1.144', '1.129', '1.213', '1.182', '1.166', '1.148', '1.137'] jpw:0.1190 +step:110/500 train_loss:3.1981 grad_norm:0.6386 train_time:349008ms step_avg:3172.80ms +step:120/500 train_loss:3.1035 grad_norm:0.5124 train_time:380716ms step_avg:3172.63ms +step:130/500 train_loss:3.0258 grad_norm:0.3921 train_time:412455ms step_avg:3172.73ms +step:140/500 train_loss:2.9851 grad_norm:0.4252 train_time:444162ms step_avg:3172.58ms +step:150/500 train_loss:2.9393 grad_norm:0.5980 train_time:475876ms step_avg:3172.51ms +step:150/500 val_loss:2.9206 val_bpb:1.7298 train_time:475907ms step_avg:3172.71ms h_norms=['44309.3', '46766.3', '47516.9', '47905.3', '47774.2', '56567.2', '63533.7', '70064.7', '76357.1', '81758.3', '68481.7', '78087.2', '86586.4', '95168.0', '103378.6', '83962.7', '95447.8', '106783.7', '118504.9', '130044.9'] growth=['1.061', '1.055', '1.016', '1.008', '0.997', '1.184', '1.123', '1.103', '1.090', '1.071', '1.183', '1.140', '1.109', '1.099', '1.086', '1.172', '1.137', '1.119', '1.110', '1.097'] jpw:0.1000 +step:160/500 train_loss:2.9259 grad_norm:0.4479 train_time:507599ms step_avg:3172.49ms +step:170/500 train_loss:2.8777 grad_norm:0.4693 train_time:539353ms step_avg:3172.66ms +step:180/500 train_loss:2.7824 grad_norm:0.3977 train_time:571098ms step_avg:3172.77ms +step:190/500 train_loss:2.8183 grad_norm:0.5028 train_time:602819ms step_avg:3172.73ms +step:200/500 train_loss:2.7302 grad_norm:0.4966 train_time:634675ms step_avg:3173.37ms +step:200/500 val_loss:2.7679 val_bpb:1.6393 train_time:634706ms step_avg:3173.53ms h_norms=['43898.3', '46876.7', '47639.6', '47540.1', '46729.0', '54471.5', '60239.5', '64693.1', '68195.4', '70570.1', '61019.1', '68002.8', '73186.1', '77789.0', '81406.4', '69222.6', '76278.3', '82542.8', '88466.1', '93667.9'] 
growth=['1.112', '1.068', '1.016', '0.998', '0.983', '1.166', '1.106', '1.074', '1.054', '1.035', '1.147', '1.114', '1.076', '1.063', '1.047', '1.137', '1.102', '1.082', '1.072', '1.059'] jpw:0.1000 +step:210/500 train_loss:2.7159 grad_norm:0.3988 train_time:666398ms step_avg:3173.33ms +step:220/500 train_loss:2.7636 grad_norm:0.3531 train_time:698119ms step_avg:3173.27ms +step:230/500 train_loss:2.6959 grad_norm:0.3534 train_time:729840ms step_avg:3173.22ms +step:240/500 train_loss:2.6946 grad_norm:0.3099 train_time:761572ms step_avg:3173.21ms +step:250/500 train_loss:2.7419 grad_norm:0.3689 train_time:793293ms step_avg:3173.17ms +step:250/500 val_loss:2.6831 val_bpb:1.5891 train_time:793325ms step_avg:3173.30ms h_norms=['43230.1', '46896.1', '47276.5', '45457.8', '43084.9', '50881.0', '55210.7', '57425.2', '57111.8', '55992.8', '52524.6', '56946.6', '59050.3', '59130.6', '58425.0', '54273.5', '57840.7', '60095.9', '60604.2', '60888.6'] growth=['1.135', '1.085', '1.008', '0.962', '0.948', '1.181', '1.085', '1.040', '0.995', '0.980', '1.143', '1.084', '1.037', '1.001', '0.988', '1.113', '1.066', '1.039', '1.008', '1.005'] jpw:0.1000 +step:260/500 train_loss:2.6850 grad_norm:0.3416 train_time:825063ms step_avg:3173.32ms +step:270/500 train_loss:2.6478 grad_norm:0.3872 train_time:856842ms step_avg:3173.49ms diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_194337-b8eb1lhl/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_194337-b8eb1lhl/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_194337-b8eb1lhl/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 
+nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_194337-b8eb1lhl/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_194337-b8eb1lhl/files/wandb-metadata.json new file mode 100644 index 0000000000..e160fc8182 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_194337-b8eb1lhl/files/wandb-metadata.json @@ -0,0 +1,59 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T19:43:37.901583Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--jacobian-proxy-init", + "2.0", + "--jacobian-warmdown-steps", + "100", + "--lora-warmup-steps", + "50", + "--no-interpass-rmsnorm", + "--lora-rank", + "8" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "40052879360" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "sila933dt7ckpfpkkyotpswbesbod99u" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_200826-h4wnno7e/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_200826-h4wnno7e/files/output.log new file mode 100644 index 0000000000..d61a863916 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_200826-h4wnno7e/files/output.log @@ -0,0 +1,65 @@ +wandb:initialized +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.02ms jpw:0.1000 +step:1/500 train_loss:6.9303 train_time:3121ms step_avg:3121.37ms +step:2/500 train_loss:8.2964 train_time:6284ms step_avg:3141.84ms +step:3/500 train_loss:7.7322 
train_time:9457ms step_avg:3152.21ms +step:4/500 train_loss:8.5580 train_time:12628ms step_avg:3156.95ms +step:5/500 train_loss:8.4686 train_time:15797ms step_avg:3159.39ms +step:6/500 train_loss:7.7993 train_time:18963ms step_avg:3160.57ms +step:7/500 train_loss:7.2392 train_time:22130ms step_avg:3161.45ms +step:8/500 train_loss:7.0090 train_time:25296ms step_avg:3162.02ms +step:9/500 train_loss:6.5969 train_time:28465ms step_avg:3162.73ms +step:10/500 train_loss:6.4712 train_time:31634ms step_avg:3163.42ms +step:20/500 train_loss:5.4373 train_time:63316ms step_avg:3165.82ms +step:30/500 train_loss:4.7425 train_time:95013ms step_avg:3167.10ms +step:40/500 train_loss:4.5571 train_time:126854ms step_avg:3171.35ms +step:50/500 train_loss:4.2765 train_time:158552ms step_avg:3171.04ms +step:50/500 val_loss:4.2310 val_bpb:2.5059 train_time:158584ms step_avg:3171.67ms h_norms=['128020.9', '142329.3', '150817.1', '164904.4', '172878.4', '221093.7', '304101.2', '290425.7', '279737.3', '288668.9', '254141.1', '302318.5', '342082.8', '372346.5', '408459.7', '357174.3', '431074.6', '494662.8', '539688.3', '601738.6'] growth=['1.137', '1.112', '1.060', '1.093', '1.048', '1.279', '1.375', '0.955', '0.963', '1.032', '1.207', '1.190', '1.132', '1.088', '1.097', '1.197', '1.207', '1.148', '1.091', '1.115'] jpw:0.1000 +step:60/500 train_loss:4.0440 train_time:190235ms step_avg:3170.59ms +step:70/500 train_loss:3.9054 train_time:221921ms step_avg:3170.31ms +step:80/500 train_loss:3.8129 train_time:253616ms step_avg:3170.20ms +step:90/500 train_loss:3.6730 train_time:285312ms step_avg:3170.13ms +step:100/500 train_loss:3.6064 train_time:317006ms step_avg:3170.06ms +step:100/500 val_loss:3.5712 val_bpb:2.1151 train_time:317038ms step_avg:3170.38ms h_norms=['217506.4', '238701.1', '250839.5', '253256.4', '255121.4', '309055.8', '367379.8', '398821.4', '411101.1', '432634.1', '369051.3', '439129.6', '489535.2', '523037.9', '563631.0', '484894.3', '584870.6', '666437.0', '722848.2', '791578.9'] growth=['1.119', '1.097', '1.051', '1.010', '1.007', '1.211', '1.189', '1.086', '1.031', '1.052', '1.215', '1.190', '1.115', '1.068', '1.078', '1.215', '1.206', '1.139', '1.085', '1.095'] jpw:0.1000 +step:110/500 train_loss:3.5100 train_time:348728ms step_avg:3170.25ms +step:120/500 train_loss:3.4055 train_time:380452ms step_avg:3170.43ms +step:130/500 train_loss:3.3291 train_time:412209ms step_avg:3170.84ms +step:140/500 train_loss:3.2646 train_time:443910ms step_avg:3170.78ms +step:150/500 train_loss:3.1985 train_time:475595ms step_avg:3170.63ms +step:150/500 val_loss:3.1731 val_bpb:1.8793 train_time:475626ms step_avg:3170.84ms h_norms=['202425.6', '219581.3', '229199.7', '231251.1', '232042.3', '280260.3', '331272.2', '359679.4', '373077.9', '392866.2', '327880.4', '388136.0', '431546.2', '460585.2', '494839.2', '422156.4', '505718.2', '574404.4', '622526.9', '679339.4'] growth=['1.104', '1.085', '1.044', '1.009', '1.003', '1.208', '1.182', '1.086', '1.037', '1.053', '1.208', '1.184', '1.112', '1.067', '1.074', '1.212', '1.198', '1.136', '1.084', '1.091'] jpw:0.1000 +step:160/500 train_loss:3.1467 train_time:507288ms step_avg:3170.55ms +step:170/500 train_loss:3.0741 train_time:538985ms step_avg:3170.50ms +step:180/500 train_loss:2.9611 train_time:570664ms step_avg:3170.36ms +step:190/500 train_loss:2.9609 train_time:602355ms step_avg:3170.29ms +step:200/500 train_loss:2.8727 train_time:634178ms step_avg:3170.89ms +step:200/500 val_loss:2.9169 val_bpb:1.7275 train_time:634210ms step_avg:3171.05ms h_norms=['195918.4', 
'211776.8', '222459.3', '225583.9', '227291.6', '278823.8', '321587.3', '350774.8', '365721.4', '384108.5', '319066.1', '375610.7', '414454.6', '441410.4', '472513.9', '402015.0', '479212.5', '542916.8', '583835.1', '633030.1'] growth=['1.112', '1.081', '1.050', '1.014', '1.008', '1.227', '1.153', '1.091', '1.043', '1.050', '1.204', '1.177', '1.103', '1.065', '1.070', '1.204', '1.192', '1.133', '1.075', '1.084'] jpw:0.1000 +step:210/500 train_loss:2.8522 train_time:665890ms step_avg:3170.90ms +step:220/500 train_loss:2.9033 train_time:697577ms step_avg:3170.81ms +step:230/500 train_loss:2.8202 train_time:729281ms step_avg:3170.79ms +step:240/500 train_loss:2.8185 train_time:760982ms step_avg:3170.76ms +step:250/500 train_loss:2.8500 train_time:792694ms step_avg:3170.78ms +step:250/500 val_loss:2.7950 val_bpb:1.6554 train_time:792726ms step_avg:3170.90ms h_norms=['187987.2', '202979.7', '214368.9', '220031.3', '223903.4', '284093.7', '324382.2', '353471.8', '371760.4', '389398.9', '318921.8', '370811.2', '408036.2', '436269.2', '464205.0', '395210.2', '465609.5', '525084.6', '560371.8', '601627.7'] growth=['1.132', '1.080', '1.056', '1.026', '1.018', '1.269', '1.142', '1.090', '1.052', '1.047', '1.203', '1.163', '1.100', '1.069', '1.064', '1.198', '1.178', '1.128', '1.067', '1.074'] jpw:0.1000 +step:260/500 train_loss:2.7996 train_time:824394ms step_avg:3170.75ms +step:270/500 train_loss:2.7542 train_time:856111ms step_avg:3170.78ms +step:280/500 train_loss:2.6927 train_time:887831ms step_avg:3170.82ms +step:290/500 train_loss:2.7306 train_time:919538ms step_avg:3170.82ms diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_200826-h4wnno7e/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_200826-h4wnno7e/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_200826-h4wnno7e/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 
+wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_200826-h4wnno7e/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_200826-h4wnno7e/files/wandb-metadata.json new file mode 100644 index 0000000000..e6874d4b8c --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_200826-h4wnno7e/files/wandb-metadata.json @@ -0,0 +1,55 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T20:08:26.327933Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--lora-warmup-steps", + "50", + "--no-interpass-rmsnorm", + "--lora-rank", + "8" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "40054611968" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "fsaxjycqjdcmmjuyh1hh12xlw863arur" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_203403-fzb1y9o8/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_203403-fzb1y9o8/files/config.yaml new file mode 100644 index 0000000000..ba50bdf3ed --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_203403-fzb1y9o8/files/config.yaml @@ -0,0 +1,101 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + 2hzl1gm33hkpq8ol8tyxxvx8js89q6ty: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.1" + - --no-interpass-rmsnorm + - --lora-rank + - "8" + - --lora-warmup-steps + - "1500" + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: 
train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "40056643584" + email: nesta.midavaine@prosus.com + executable: /home/nesta/parameter-golf/.venv/bin/python3 + git: + commit: e07e44321b5c5af051343a3a16d83f0766e85597 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + python: CPython 3.12.3 + root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback + startedAt: "2026-03-26T20:34:03.046987Z" + writerId: 2hzl1gm33hkpq8ol8tyxxvx8js89q6ty + m: [] + python_version: 3.12.3 + t: + "1": + - 1 + "2": + - 1 + "3": + - 13 + - 16 + - 61 + "4": 3.12.3 + "5": 0.25.1 + "10": + - 20 + "12": 0.25.1 + "13": linux-x86_64 +core_end: + value: 8 +core_start: + value: 3 +feedback_mode: + value: diagonal +feedback_rank: + value: 2 +interpass_rmsnorm: + value: false +iterations: + value: 20000 +jacobian_proxy_weight: + value: 0.1 +matrix_lr: + value: 0.025 +model_dim: + value: 512 +n_params: + value: 28156000 +num_layers: + value: 11 +num_passes: + value: 4 +residual_scale_init: + value: 0.5 +scalar_lr: + value: 0.025 +seed: + value: 1337 +train_batch_tokens: + value: 786432 +train_seq_len: + value: 2048 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_203403-fzb1y9o8/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_203403-fzb1y9o8/files/output.log new file mode 100644 index 0000000000..ec78344062 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_203403-fzb1y9o8/files/output.log @@ -0,0 +1,88 @@ +wandb:initialized +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/20000 val_loss:6.9291 val_bpb:4.1038 train_time:0ms step_avg:0.01ms +step:1/20000 train_loss:6.9303 grad_norm:0.3807 train_time:79129ms step_avg:79129.16ms +late_qat:enabled step:1 scale:0.0351 core_quant:on +step:2/20000 train_loss:8.3329 grad_norm:3.6210 train_time:164629ms step_avg:82314.47ms +step:3/20000 train_loss:8.2999 grad_norm:3.5783 train_time:250620ms step_avg:83539.89ms +step:4/20000 train_loss:8.2004 grad_norm:3.5109 train_time:332890ms step_avg:83222.59ms +W0326 20:44:27.328000 404831 torch/_dynamo/convert_frame.py:1743] [0/8] torch._dynamo hit config.recompile_limit (8) +W0326 20:44:27.328000 404831 torch/_dynamo/convert_frame.py:1743] [0/8] function: 'forward' (/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py:1053) +W0326 20:44:27.328000 404831 torch/_dynamo/convert_frame.py:1743] [0/8] last reason: 0/7: self._lora_step_mul == 0.002 # s = self._lora_scale * getattr(self, '_lora_step_mul', 1.0) # 
records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py:1024 in _forward_hidden +W0326 20:44:27.328000 404831 torch/_dynamo/convert_frame.py:1743] [0/8] User stack trace: +W0326 20:44:27.328000 404831 torch/_dynamo/convert_frame.py:1743] [0/8] File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1055, in forward +W0326 20:44:27.328000 404831 torch/_dynamo/convert_frame.py:1743] [0/8] x, h_core_in, h_core_out = self._forward_hidden(input_ids, feedback_fn, stabilizer) +W0326 20:44:27.328000 404831 torch/_dynamo/convert_frame.py:1743] [0/8] File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1024, in _forward_hidden +W0326 20:44:27.328000 404831 torch/_dynamo/convert_frame.py:1743] [0/8] s = self._lora_scale * getattr(self, '_lora_step_mul', 1.0) +W0326 20:44:27.328000 404831 torch/_dynamo/convert_frame.py:1743] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". +W0326 20:44:27.328000 404831 torch/_dynamo/convert_frame.py:1743] [0/8] To diagnose recompilation issues, see https://docs.pytorch.org/docs/main/user_guide/torch_compiler/compile/programming_model.recompilation.html +Traceback (most recent call last): + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1773, in _compile + raise_unimplemented_cache_limit_exceeded() + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1757, in raise_unimplemented_cache_limit_exceeded + unimplemented( + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/exc.py", line 634, in unimplemented + raise Unsupported(msg, gb_type, skip_frame) +torch._dynamo.exc.Unsupported: Dynamo recompile limit exceeded + Explanation: Dynamo attempted to recompile the code object too many times, exceeding the recompile_limit cache size limit (currently set to 8). Excessive recompilations can degrade performance due to the compilation overhead of each recompilation. + Hint: To monitor recompilations, enable TORCH_LOGS=recompiles. If recompilations are expected, consider increasing torch._dynamo.config.recompile_limit to an appropriate value. + Hint: See https://docs.pytorch.org/docs/main/user_guide/torch_compiler/compile/programming_model.recompilation.html for tips on dealing with recompilations. 
+ + Developer debug context: Limit type: recompile_limit + + For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0039.html + +The above exception was the direct cause of the following exception: + +Traceback (most recent call last): + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2145, in <module> + main() + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1902, in main + loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__ + return super().__call__(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 2316, in __call__ + result = self._torchdynamo_orig_backend( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 729, in __call__ + result = _compile( + ^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1780, in _compile + raise FailOnRecompileLimitHit( +torch._dynamo.exc.FailOnRecompileLimitHit: Hard failure due to fullgraph=True diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_203403-fzb1y9o8/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_203403-fzb1y9o8/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_203403-fzb1y9o8/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0
+einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_203403-fzb1y9o8/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_203403-fzb1y9o8/files/wandb-metadata.json new file mode 100644 index 0000000000..ace0ec048d --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_203403-fzb1y9o8/files/wandb-metadata.json @@ -0,0 +1,55 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T20:34:03.046987Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm", + "--lora-rank", + "8", + "--lora-warmup-steps", + "1500" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "40056643584" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "2hzl1gm33hkpq8ol8tyxxvx8js89q6ty" +} \ No newline at 
end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_203403-fzb1y9o8/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_203403-fzb1y9o8/files/wandb-summary.json new file mode 100644 index 0000000000..4c77cd2d10 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_203403-fzb1y9o8/files/wandb-summary.json @@ -0,0 +1 @@ +{"_runtime":623.719037113,"val_loss":6.929145795668215,"lr_scale":0.03203231232190314,"grad_norm":3.5108964443206787,"train_loss":8.20039176940918,"_timestamp":1.7745578673258333e+09,"step_avg_ms":83222.5916167481,"_step":4,"val_bpb":4.1038304401177745,"_wandb":{"runtime":623}} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_205139-3z8g4kez/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_205139-3z8g4kez/files/config.yaml new file mode 100644 index 0000000000..5e3ada899d --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_205139-3z8g4kez/files/config.yaml @@ -0,0 +1,101 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + 1izso2rrzq197f6m1lqb13cnbo1l86f0: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.1" + - --no-interpass-rmsnorm + - --lora-rank + - "8" + - --lora-warmup-steps + - "1500" + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "40561086464" + email: nesta.midavaine@prosus.com + executable: /home/nesta/parameter-golf/.venv/bin/python3 + git: + commit: e07e44321b5c5af051343a3a16d83f0766e85597 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + python: CPython 3.12.3 + root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback + startedAt: "2026-03-26T20:51:39.232921Z" + writerId: 1izso2rrzq197f6m1lqb13cnbo1l86f0 + m: [] + python_version: 3.12.3 + t: + "1": + - 1 + "2": + - 1 + "3": + - 13 + - 16 + - 61 + "4": 3.12.3 + "5": 0.25.1 + "10": + - 20 + "12": 0.25.1 + "13": linux-x86_64 +core_end: + value: 8 +core_start: + value: 3 +feedback_mode: + value: diagonal +feedback_rank: + value: 2 +interpass_rmsnorm: + value: false +iterations: + value: 20000 +jacobian_proxy_weight: + value: 0.1 +matrix_lr: + value: 0.025 +model_dim: + value: 512 +n_params: + value: 28156000 +num_layers: + value: 11 +num_passes: + value: 4 +residual_scale_init: + value: 0.5 +scalar_lr: + value: 0.025 +seed: + value: 1337 +train_batch_tokens: + value: 786432 +train_seq_len: + value: 2048 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_205139-3z8g4kez/files/output.log 
b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_205139-3z8g4kez/files/output.log new file mode 100644 index 0000000000..c8f9580ede --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_205139-3z8g4kez/files/output.log @@ -0,0 +1,186 @@ +wandb:initialized +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/20000 val_loss:6.9291 val_bpb:4.1038 train_time:0ms step_avg:0.01ms +step:1/20000 train_loss:6.9303 grad_norm:0.3807 train_time:1305ms step_avg:1305.20ms +step:2/20000 train_loss:8.2624 grad_norm:3.4023 train_time:2653ms step_avg:1326.56ms +step:3/20000 train_loss:7.4846 grad_norm:1.6936 train_time:4015ms step_avg:1338.21ms +step:4/20000 train_loss:7.7154 grad_norm:1.9838 train_time:5385ms step_avg:1346.32ms +step:5/20000 train_loss:7.4456 grad_norm:2.1207 train_time:6747ms step_avg:1349.42ms +step:6/20000 train_loss:7.0896 grad_norm:1.7550 train_time:8108ms step_avg:1351.41ms +step:7/20000 train_loss:6.8569 grad_norm:2.3306 train_time:9470ms step_avg:1352.92ms +step:8/20000 train_loss:6.7973 grad_norm:1.6453 train_time:10833ms step_avg:1354.11ms +step:9/20000 train_loss:6.5582 grad_norm:1.2844 train_time:12199ms step_avg:1355.49ms +step:10/20000 train_loss:6.2034 grad_norm:1.2514 train_time:13569ms step_avg:1356.86ms +step:50/20000 train_loss:3.6856 grad_norm:0.8303 train_time:68670ms step_avg:1373.39ms +step:100/20000 train_loss:3.1157 grad_norm:0.4168 train_time:137708ms step_avg:1377.08ms +step:150/20000 train_loss:2.7810 grad_norm:0.3675 train_time:206699ms step_avg:1377.99ms +step:200/20000 train_loss:2.5703 grad_norm:0.3384 train_time:275646ms step_avg:1378.23ms +step:250/20000 train_loss:2.5789 grad_norm:0.2913 train_time:344574ms step_avg:1378.30ms +step:300/20000 train_loss:2.4357 grad_norm:0.2132 train_time:413532ms step_avg:1378.44ms +step:350/20000 train_loss:2.4866 grad_norm:0.2055 train_time:483575ms step_avg:1381.64ms +step:400/20000 train_loss:2.4144 grad_norm:0.2487 train_time:552589ms step_avg:1381.47ms +step:450/20000 train_loss:2.2333 grad_norm:0.1527 train_time:621533ms step_avg:1381.18ms +step:500/20000 train_loss:2.2865 grad_norm:0.1511 train_time:690504ms step_avg:1381.01ms +step:500/20000 val_loss:2.3117 val_bpb:1.3691 train_time:690515ms step_avg:1381.03ms +step:550/20000 train_loss:2.3480 grad_norm:0.1354 train_time:760462ms step_avg:1382.66ms +step:600/20000 train_loss:2.2536 grad_norm:0.2002 train_time:829477ms step_avg:1382.46ms +step:650/20000 train_loss:2.2306 grad_norm:0.1194 train_time:898608ms step_avg:1382.47ms +step:700/20000 train_loss:2.3041 grad_norm:0.1617 train_time:967715ms step_avg:1382.45ms +step:750/20000 train_loss:2.2754 grad_norm:0.1308 train_time:1036878ms step_avg:1382.50ms +step:800/20000 train_loss:2.2542 grad_norm:0.1211 train_time:1106111ms step_avg:1382.64ms +step:850/20000 train_loss:2.1786 grad_norm:0.0688 train_time:1175361ms step_avg:1382.78ms +step:900/20000 train_loss:2.0929 grad_norm:0.0751 train_time:1244692ms step_avg:1382.99ms +step:950/20000 train_loss:2.2961 grad_norm:0.1601 train_time:1314910ms step_avg:1384.12ms +step:1000/20000 train_loss:2.2261 grad_norm:0.0833 train_time:1384163ms step_avg:1384.16ms +step:1000/20000 
val_loss:2.1725 val_bpb:1.2867 train_time:1384175ms step_avg:1384.17ms +step:1050/20000 train_loss:2.1507 grad_norm:0.1497 train_time:1453425ms step_avg:1384.21ms +step:1100/20000 train_loss:2.1755 grad_norm:0.0691 train_time:1522714ms step_avg:1384.29ms +step:1150/20000 train_loss:2.1286 grad_norm:0.0721 train_time:1592948ms step_avg:1385.17ms +step:1200/20000 train_loss:2.1760 grad_norm:0.0782 train_time:1662222ms step_avg:1385.19ms +step:1250/20000 train_loss:2.2002 grad_norm:0.0675 train_time:1731489ms step_avg:1385.19ms +step:1300/20000 train_loss:2.1691 grad_norm:0.0856 train_time:1800790ms step_avg:1385.22ms +step:1350/20000 train_loss:2.1443 grad_norm:0.0825 train_time:1870108ms step_avg:1385.27ms +step:1400/20000 train_loss:2.1563 grad_norm:0.0820 train_time:1939417ms step_avg:1385.30ms +step:1450/20000 train_loss:2.1534 grad_norm:0.0778 train_time:2008708ms step_avg:1385.32ms +step:1500/20000 train_loss:2.1264 grad_norm:0.1636 train_time:2077976ms step_avg:1385.32ms +step:1500/20000 val_loss:2.1131 val_bpb:1.2515 train_time:2077988ms step_avg:1385.33ms +step:1550/20000 train_loss:2.0937 grad_norm:0.0739 train_time:2148175ms step_avg:1385.92ms +step:1600/20000 train_loss:2.1730 grad_norm:0.0742 train_time:2217401ms step_avg:1385.88ms +step:1650/20000 train_loss:1.9579 grad_norm:0.0866 train_time:2286669ms step_avg:1385.86ms +step:1700/20000 train_loss:2.0866 grad_norm:0.0640 train_time:2356000ms step_avg:1385.88ms +step:1750/20000 train_loss:2.0575 grad_norm:0.0784 train_time:2425290ms step_avg:1385.88ms +step:1800/20000 train_loss:2.0953 grad_norm:0.0589 train_time:2495476ms step_avg:1386.38ms +step:1850/20000 train_loss:2.1090 grad_norm:0.1099 train_time:2564723ms step_avg:1386.34ms +step:1900/20000 train_loss:2.0553 grad_norm:0.0538 train_time:2633991ms step_avg:1386.31ms +step:1950/20000 train_loss:2.0417 grad_norm:0.0691 train_time:2703295ms step_avg:1386.31ms +step:2000/20000 train_loss:2.2933 grad_norm:0.0634 train_time:2772559ms step_avg:1386.28ms +step:2000/20000 val_loss:2.0736 val_bpb:1.2281 train_time:2772571ms step_avg:1386.29ms +step:2050/20000 train_loss:2.0610 grad_norm:0.0643 train_time:2841839ms step_avg:1386.26ms +step:2100/20000 train_loss:2.0352 grad_norm:0.0542 train_time:2911097ms step_avg:1386.24ms +step:2150/20000 train_loss:2.0150 grad_norm:0.0748 train_time:2980382ms step_avg:1386.22ms +step:2200/20000 train_loss:2.1675 grad_norm:0.0647 train_time:3050764ms step_avg:1386.71ms +step:2250/20000 train_loss:2.0588 grad_norm:0.0651 train_time:3120289ms step_avg:1386.80ms +step:2300/20000 train_loss:2.0371 grad_norm:0.0742 train_time:3189823ms step_avg:1386.88ms +step:2350/20000 train_loss:1.9911 grad_norm:0.0819 train_time:3259324ms step_avg:1386.95ms +step:2400/20000 train_loss:2.1049 grad_norm:0.0508 train_time:3329685ms step_avg:1387.37ms +step:2450/20000 train_loss:2.0658 grad_norm:0.0537 train_time:3398968ms step_avg:1387.33ms +step:2500/20000 train_loss:2.0210 grad_norm:0.0627 train_time:3468256ms step_avg:1387.30ms +step:2500/20000 val_loss:2.0271 val_bpb:1.2005 train_time:3468267ms step_avg:1387.31ms +step:2550/20000 train_loss:2.0220 grad_norm:0.0459 train_time:3537589ms step_avg:1387.29ms +step:2600/20000 train_loss:1.9997 grad_norm:0.0445 train_time:3606904ms step_avg:1387.27ms +step:2650/20000 train_loss:2.0041 grad_norm:0.0439 train_time:3676209ms step_avg:1387.25ms +step:2700/20000 train_loss:2.0259 grad_norm:0.0450 train_time:3745493ms step_avg:1387.22ms +step:2750/20000 train_loss:2.0067 grad_norm:0.0443 train_time:3814755ms step_avg:1387.18ms 
+step:2800/20000 train_loss:2.0409 grad_norm:0.0486 train_time:3884987ms step_avg:1387.50ms +step:2850/20000 train_loss:1.9897 grad_norm:0.0474 train_time:3954266ms step_avg:1387.46ms +step:2900/20000 train_loss:2.0047 grad_norm:0.0898 train_time:4023547ms step_avg:1387.43ms +step:2950/20000 train_loss:2.0428 grad_norm:0.0410 train_time:4092882ms step_avg:1387.42ms +step:3000/20000 train_loss:1.9290 grad_norm:0.0543 train_time:4163141ms step_avg:1387.71ms +step:3000/20000 val_loss:1.9832 val_bpb:1.1746 train_time:4163153ms step_avg:1387.72ms +step:3050/20000 train_loss:1.9349 grad_norm:0.0457 train_time:4232654ms step_avg:1387.76ms +step:3100/20000 train_loss:1.9977 grad_norm:0.0425 train_time:4302197ms step_avg:1387.81ms +swa:start step:3150 +step:3150/20000 train_loss:2.0068 grad_norm:0.0401 train_time:4371737ms step_avg:1387.85ms +step:3200/20000 train_loss:1.9809 grad_norm:0.0467 train_time:4441129ms step_avg:1387.85ms +late_qat:enabled step:3204 scale:0.1497 core_quant:on +step:3250/20000 train_loss:1.9488 grad_norm:0.0413 train_time:4597242ms step_avg:1414.54ms +step:3300/20000 train_loss:1.9274 grad_norm:0.0417 train_time:4666636ms step_avg:1414.13ms +step:3350/20000 train_loss:1.9650 grad_norm:0.0352 train_time:4736129ms step_avg:1413.77ms +step:3396/20000 val_loss:1.9539 val_bpb:1.1572 train_time:4800035ms step_avg:1413.44ms +stopping_early: wallclock_cap train_time:4800035ms step:3396/20000 +peak memory allocated: 50639 MiB reserved: 50682 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:1.9508 val_bpb:1.1554 eval_time:33632ms +Serialized model: 110942659 bytes +Code size: 102570 bytes +Serialized model int6+lzma: 17439360 bytes +Total submission size int6+lzma: 17541930 bytes +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] torch._dynamo hit config.recompile_limit (8) +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] function: 'forward' (/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py:1054) +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] last reason: 0/7: self._modules['blocks']._modules['0']._modules['attn']._modules['rotary']._cos_cached is None # self._cos_cached is None # records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py:590 in forward +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] User stack trace: +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1056, in forward +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] x, h_core_in, h_core_out = self._forward_hidden(input_ids, feedback_fn, stabilizer) +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1004, in _forward_hidden +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] x, raw_v = self.blocks[i](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 783, in forward +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] attn_out, raw_v = self.attn(self.attn_norm(x_in) * self.ln_scale_factor, q_w, k_w, 
v_w, out_w, v_embed=v_embed, v0=v0) +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 678, in forward +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] cos, sin = self.rotary(seqlen, x.device, q.dtype) +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 590, in forward +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] self._cos_cached is None +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". +W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] To diagnose recompilation issues, see https://docs.pytorch.org/docs/main/user_guide/torch_compiler/compile/programming_model.recompilation.html +Traceback (most recent call last): + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1773, in _compile + raise_unimplemented_cache_limit_exceeded() + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1757, in raise_unimplemented_cache_limit_exceeded + unimplemented( + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/exc.py", line 634, in unimplemented + raise Unsupported(msg, gb_type, skip_frame) +torch._dynamo.exc.Unsupported: Dynamo recompile limit exceeded + Explanation: Dynamo attempted to recompile the code object too many times, exceeding the recompile_limit cache size limit (currently set to 8). Excessive recompilations can degrade performance due to the compilation overhead of each recompilation. + Hint: To monitor recompilations, enable TORCH_LOGS=recompiles. If recompilations are expected, consider increasing torch._dynamo.config.recompile_limit to an appropriate value. + Hint: See https://docs.pytorch.org/docs/main/user_guide/torch_compiler/compile/programming_model.recompilation.html for tips on dealing with recompilations. 
+ + Developer debug context: Limit type: recompile_limit + + For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0039.html + +The above exception was the direct cause of the following exception: + +Traceback (most recent call last): + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2146, in <module> + main() + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2084, in main + q_val_loss, q_val_bpb = eval_val( + ^^^^^^^^^ + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 359, in eval_val + batch_loss = model(x, y).detach() + ^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__ + return super().__call__(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 2316, in __call__ + result = self._torchdynamo_orig_backend( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 729, in __call__ + result = _compile( + ^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1780, in _compile + raise FailOnRecompileLimitHit( +torch._dynamo.exc.FailOnRecompileLimitHit: Hard failure due to fullgraph=True diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_205139-3z8g4kez/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_205139-3z8g4kez/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_205139-3z8g4kez/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9
+nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_205139-3z8g4kez/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_205139-3z8g4kez/files/wandb-metadata.json new file mode 100644 index 0000000000..837dbf0b25 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_205139-3z8g4kez/files/wandb-metadata.json @@ -0,0 +1,55 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T20:51:39.232921Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm", + "--lora-rank", + "8", + "--lora-warmup-steps", + "1500" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "40561086464" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": 
"GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "1izso2rrzq197f6m1lqb13cnbo1l86f0" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_205139-3z8g4kez/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_205139-3z8g4kez/files/wandb-summary.json new file mode 100644 index 0000000000..d4c44256cc --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_205139-3z8g4kez/files/wandb-summary.json @@ -0,0 +1 @@ +{"step_avg_ms":1413.769916971644,"val_bpb":1.1572276575893548,"grad_norm":0.035237833857536316,"_runtime":5434.912608135,"_step":3396,"train_loss":1.9649734497070312,"_wandb":{"runtime":5434},"_timestamp":1.7745636406532588e+09,"lr_scale":0.027152011510183278,"val_loss":1.9539304255431496} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-ngp8wevn/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-ngp8wevn/files/config.yaml new file mode 100644 index 0000000000..197fe403b2 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-ngp8wevn/files/config.yaml @@ -0,0 +1,98 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + adlct7preaiqxn7mxg83p4xd196yjo4e: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.1" + - --no-interpass-rmsnorm + - --lora-rank + - "0" + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "40804425728" + email: nesta.midavaine@prosus.com + executable: /home/nesta/parameter-golf/.venv/bin/python3 + git: + commit: e07e44321b5c5af051343a3a16d83f0766e85597 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + python: CPython 3.12.3 + root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback + startedAt: "2026-03-26T22:30:30.931868Z" + writerId: adlct7preaiqxn7mxg83p4xd196yjo4e + m: [] + python_version: 3.12.3 + t: + "1": + - 1 + "2": + - 1 + "3": + - 13 + - 16 + "4": 3.12.3 + "5": 0.25.1 + "10": + - 20 + "12": 0.25.1 + "13": linux-x86_64 +core_end: + value: 8 +core_start: + value: 3 +feedback_mode: + value: diagonal +feedback_rank: + value: 2 +interpass_rmsnorm: + value: false +iterations: + value: 20000 +jacobian_proxy_weight: + value: 0.1 +matrix_lr: + value: 0.025 +model_dim: + value: 512 +n_params: + value: 26927200 +num_layers: + value: 11 +num_passes: + value: 4 +residual_scale_init: + value: 0.5 +scalar_lr: + value: 0.025 +seed: + value: 1337 +train_batch_tokens: + value: 786432 +train_seq_len: + value: 2048 diff --git 
a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-ngp8wevn/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-ngp8wevn/files/output.log new file mode 100644 index 0000000000..62f3920e45 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-ngp8wevn/files/output.log @@ -0,0 +1,41 @@ +wandb:initialized +Traceback (most recent call last): + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2148, in <module> + main() + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1814, in main + (warmup_loss * grad_scale).backward() + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_tensor.py", line 631, in backward + torch.autograd.backward( + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/__init__.py", line 381, in backward + _engine_run_backward( + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/graph.py", line 869, in _engine_run_backward + return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 317, in apply + return user_fn(self, *args) + ^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2748, in backward + return impl_fn() + ^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2734, in impl_fn + out = CompiledFunction._backward_impl(ctx, all_args) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2918, in _backward_impl + out = call_func_at_runtime_with_args( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args + out = normalize_as_list(f(args)) + ^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__ + return self.current_callable(inputs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run + out = model(new_inputs) + ^^^^^^^^^^^^^^^^^ + File "/tmp/torchinductor_nesta/wm/cwmtq2g54vzhxzvsc4odvxx7srlroi2tosqcftjhmyw2c637ogrq.py", line 12255, in call + buf54 = empty_strided_cuda((48, 2048, 512), (1048576, 512, 1), torch.bfloat16) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 71.88 MiB is free. Process 438880 has 48.42 GiB memory in use. Including non-PyTorch memory, this process has 48.42 GiB memory in use. Process 438888 has 42.87 GiB memory in use.
Of the allocated memory 47.76 GiB is allocated by PyTorch, and 3.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf) diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-ngp8wevn/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-ngp8wevn/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-ngp8wevn/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-ngp8wevn/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-ngp8wevn/files/wandb-metadata.json new file mode 100644 index 0000000000..2c907a4c1d --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-ngp8wevn/files/wandb-metadata.json @@ -0,0 +1,53 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": 
"2026-03-26T22:30:30.931868Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm", + "--lora-rank", + "0" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "40804425728" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "adlct7preaiqxn7mxg83p4xd196yjo4e" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-ngp8wevn/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-ngp8wevn/files/wandb-summary.json new file mode 100644 index 0000000000..583ae8d107 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-ngp8wevn/files/wandb-summary.json @@ -0,0 +1 @@ +{"_runtime":25,"_wandb":{"runtime":25}} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-w0ibl7rl/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-w0ibl7rl/files/config.yaml new file mode 100644 index 0000000000..f857385967 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-w0ibl7rl/files/config.yaml @@ -0,0 +1,98 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + zqtwbvg1heqed3e66dzpgg33jzmausne: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.1" + - --no-interpass-rmsnorm + - --lora-rank + - "0" + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "40804429824" + email: nesta.midavaine@prosus.com + executable: /home/nesta/parameter-golf/.venv/bin/python3 + git: + commit: e07e44321b5c5af051343a3a16d83f0766e85597 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: 
/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + python: CPython 3.12.3 + root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback + startedAt: "2026-03-26T22:30:30.934540Z" + writerId: zqtwbvg1heqed3e66dzpgg33jzmausne + m: [] + python_version: 3.12.3 + t: + "1": + - 1 + "2": + - 1 + "3": + - 13 + - 16 + "4": 3.12.3 + "5": 0.25.1 + "10": + - 20 + "12": 0.25.1 + "13": linux-x86_64 +core_end: + value: 8 +core_start: + value: 3 +feedback_mode: + value: diagonal +feedback_rank: + value: 2 +interpass_rmsnorm: + value: false +iterations: + value: 20000 +jacobian_proxy_weight: + value: 0.1 +matrix_lr: + value: 0.025 +model_dim: + value: 512 +n_params: + value: 26927200 +num_layers: + value: 11 +num_passes: + value: 4 +residual_scale_init: + value: 0.5 +scalar_lr: + value: 0.025 +seed: + value: 1337 +train_batch_tokens: + value: 786432 +train_seq_len: + value: 2048 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-w0ibl7rl/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-w0ibl7rl/files/output.log new file mode 100644 index 0000000000..c5ed5853d1 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-w0ibl7rl/files/output.log @@ -0,0 +1,41 @@ +wandb:initialized +Traceback (most recent call last): + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2148, in <module> + main() + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1814, in main + (warmup_loss * grad_scale).backward() + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_tensor.py", line 631, in backward + torch.autograd.backward( + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/__init__.py", line 381, in backward + _engine_run_backward( + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/graph.py", line 869, in _engine_run_backward + return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 317, in apply + return user_fn(self, *args) + ^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2748, in backward + return impl_fn() + ^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2734, in impl_fn + out = CompiledFunction._backward_impl(ctx, all_args) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2918, in _backward_impl + out = call_func_at_runtime_with_args( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args + out = normalize_as_list(f(args)) + ^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn + return fn(*args,
**kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__ + return self.current_callable(inputs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run + out = model(new_inputs) + ^^^^^^^^^^^^^^^^^ + File "/tmp/torchinductor_nesta/wm/cwmtq2g54vzhxzvsc4odvxx7srlroi2tosqcftjhmyw2c637ogrq.py", line 12255, in call + buf54 = empty_strided_cuda((48, 2048, 512), (1048576, 512, 1), torch.bfloat16) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 71.88 MiB is free. Including non-PyTorch memory, this process has 48.42 GiB memory in use. Process 438887 has 48.42 GiB memory in use. Process 438888 has 42.87 GiB memory in use. Of the allocated memory 47.76 GiB is allocated by PyTorch, and 3.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf) diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-w0ibl7rl/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-w0ibl7rl/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-w0ibl7rl/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 
+nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-w0ibl7rl/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-w0ibl7rl/files/wandb-metadata.json new file mode 100644 index 0000000000..6b9182fcb1 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-w0ibl7rl/files/wandb-metadata.json @@ -0,0 +1,53 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T22:30:30.934540Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm", + "--lora-rank", + "0" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "40804429824" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "zqtwbvg1heqed3e66dzpgg33jzmausne" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-w0ibl7rl/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-w0ibl7rl/files/wandb-summary.json new file mode 100644 index 0000000000..583ae8d107 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-w0ibl7rl/files/wandb-summary.json @@ -0,0 +1 @@ +{"_runtime":25,"_wandb":{"runtime":25}} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/config.yaml new file mode 100644 index 0000000000..01b0b298e3 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/config.yaml @@ -0,0 +1,98 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + lhaxso6avidv8ae7vept3hphk5p3aarq: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - 
--jacobian-proxy-weight + - "0.1" + - --no-interpass-rmsnorm + - --lora-rank + - "0" + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "40804413440" + email: nesta.midavaine@prosus.com + executable: /home/nesta/parameter-golf/.venv/bin/python3 + git: + commit: e07e44321b5c5af051343a3a16d83f0766e85597 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + python: CPython 3.12.3 + root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback + startedAt: "2026-03-26T22:30:30.933872Z" + writerId: lhaxso6avidv8ae7vept3hphk5p3aarq + m: [] + python_version: 3.12.3 + t: + "1": + - 1 + "2": + - 1 + "3": + - 13 + - 16 + "4": 3.12.3 + "5": 0.25.1 + "10": + - 20 + "12": 0.25.1 + "13": linux-x86_64 +core_end: + value: 8 +core_start: + value: 3 +feedback_mode: + value: diagonal +feedback_rank: + value: 2 +interpass_rmsnorm: + value: false +iterations: + value: 20000 +jacobian_proxy_weight: + value: 0.1 +matrix_lr: + value: 0.025 +model_dim: + value: 512 +n_params: + value: 26927200 +num_layers: + value: 11 +num_passes: + value: 4 +residual_scale_init: + value: 0.5 +scalar_lr: + value: 0.025 +seed: + value: 1337 +train_batch_tokens: + value: 786432 +train_seq_len: + value: 2048 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/output.log new file mode 100644 index 0000000000..8daabbc0b7 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/output.log @@ -0,0 +1,109 @@ +wandb:initialized +Traceback (most recent call last): + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2148, in + main() + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1813, in main + warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__ + return super().__call__(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper + return fn(*args, **kwargs) + 
^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1056, in forward + def forward(self, input_ids: Tensor, target_ids: Tensor, + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1200, in forward + return compiled_fn(full_args) + ^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 566, in runtime_wrapper + all_outs = call_func_at_runtime_with_args( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args + out = normalize_as_list(f(args)) + ^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 105, in g + return f(*args) + ^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 596, in apply + return super().apply(*args, **kwargs) # type: ignore[misc] + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2503, in forward + fw_outs = call_func_at_runtime_with_args( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args + out = normalize_as_list(f(args)) + ^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 783, in wrapper + return compiled_fn(runtime_args) + ^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1011, in inner_fn + outs = compiled_fn(args) + ^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__ + return self.current_callable(inputs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run + out = model(new_inputs) + ^^^^^^^^^^^^^^^^^ + File "/tmp/torchinductor_nesta/3d/c3dzegwqai5cgqo2lurmxd3pfex6kuqjwm3ooueib5swt7sgm5kg.py", line 9307, in call + buf866 = torch.ops.flash_attn_3._flash_attn_forward.default(buf865, buf864, reinterpret_tensor(buf850, (48, 2048, 4, 64), (524288, 256, 64, 1), 0), None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 0.125, True, window_size_left=-1, window_size_right=-1, attention_chunk=0, softcap=0.0, rotary_interleaved=True, scheduler_metadata=None, num_splits=1, pack_gqa=None, sm_margin=0) + 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_ops.py", line 865, in __call__ + return self._op(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_library/autograd.py", line 112, in autograd_impl + result = forward_no_grad(*args, Metadata(keyset, keyword_only_args)) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_library/autograd.py", line 41, in forward_no_grad + result = op.redispatch(keyset & _C._after_autograd_keyset, *args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_ops.py", line 872, in redispatch + return self._handle.redispatch_boxed(keyset, *args, **kwargs) # type: ignore[return-value] + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_compile.py", line 54, in inner + return disable_fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 409, in __torch_dispatch__ + res = func(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_ops.py", line 865, in __call__ + return self._op(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_library/custom_ops.py", line 347, in backend_impl + result = self._backend_fns[device_type](*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_compile.py", line 54, in inner + return disable_fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_library/custom_ops.py", line 382, in wrapped_fn + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/flash_attention_hopper/flash_attn_interface.py", line 93, in _flash_attn_forward + out, softmax_lse, out_accum, softmax_lse_accum = flash_attn_3_cuda.fwd( + ^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_ops.py", line 1269, in __call__ + return self._op(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^ +RuntimeError: torch_call_dispatcher( "aten::new_empty", "", stack.data(), TORCH_ABI_VERSION) API call failed at /root/.venv/lib/python3.12/site-packages/torch/include/torch/csrc/stable/ops.h, line 579 diff --git 
a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/wandb-metadata.json new file mode 100644 index 0000000000..39b55d0876 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/wandb-metadata.json @@ -0,0 +1,53 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T22:30:30.933872Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm", + "--lora-rank", + "0" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": 
"records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "40804413440" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "lhaxso6avidv8ae7vept3hphk5p3aarq" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/wandb-summary.json new file mode 100644 index 0000000000..1b9bd7c2db --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/wandb-summary.json @@ -0,0 +1 @@ +{"_runtime":24,"_wandb":{"runtime":24}} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/config.yaml new file mode 100644 index 0000000000..8df337b67d --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/config.yaml @@ -0,0 +1,98 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + 9agqyslentaakuy1fpz0qzu1hu5kmu0l: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.1" + - --no-interpass-rmsnorm + - --lora-rank + - "0" + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "40805670912" + email: nesta.midavaine@prosus.com + executable: /home/nesta/parameter-golf/.venv/bin/python3 + git: + commit: e07e44321b5c5af051343a3a16d83f0766e85597 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + python: CPython 3.12.3 + root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback + startedAt: "2026-03-26T22:44:39.861248Z" + writerId: 9agqyslentaakuy1fpz0qzu1hu5kmu0l + m: [] + python_version: 3.12.3 + t: + "1": + - 1 + "2": + - 1 + "3": + - 13 + - 16 + "4": 3.12.3 + "5": 0.25.1 + "10": + - 20 
+ "12": 0.25.1 + "13": linux-x86_64 +core_end: + value: 8 +core_start: + value: 3 +feedback_mode: + value: diagonal +feedback_rank: + value: 2 +interpass_rmsnorm: + value: false +iterations: + value: 20000 +jacobian_proxy_weight: + value: 0.1 +matrix_lr: + value: 0.025 +model_dim: + value: 512 +n_params: + value: 26927200 +num_layers: + value: 11 +num_passes: + value: 4 +residual_scale_init: + value: 0.5 +scalar_lr: + value: 0.025 +seed: + value: 1337 +train_batch_tokens: + value: 786432 +train_seq_len: + value: 2048 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/output.log new file mode 100644 index 0000000000..bbe357f37b --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/output.log @@ -0,0 +1,41 @@ +wandb:initialized +Traceback (most recent call last): + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2148, in + main() + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1814, in main + (warmup_loss * grad_scale).backward() + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_tensor.py", line 631, in backward + torch.autograd.backward( + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/__init__.py", line 381, in backward + _engine_run_backward( + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/graph.py", line 869, in _engine_run_backward + return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 317, in apply + return user_fn(self, *args) + ^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2748, in backward + return impl_fn() + ^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2734, in impl_fn + out = CompiledFunction._backward_impl(ctx, all_args) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2918, in _backward_impl + out = call_func_at_runtime_with_args( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args + out = normalize_as_list(f(args)) + ^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__ + return self.current_callable(inputs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run + out = model(new_inputs) + ^^^^^^^^^^^^^^^^^ + File 
"/tmp/torchinductor_nesta/wm/cwmtq2g54vzhxzvsc4odvxx7srlroi2tosqcftjhmyw2c637ogrq.py", line 12255, in call + buf54 = empty_strided_cuda((48, 2048, 512), (1048576, 512, 1), torch.bfloat16) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 71.88 MiB is free. Including non-PyTorch memory, this process has 48.42 GiB memory in use. Process 443507 has 42.87 GiB memory in use. Process 443517 has 48.42 GiB memory in use. Of the allocated memory 47.76 GiB is allocated by PyTorch, and 3.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf) diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git 
a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/wandb-metadata.json new file mode 100644 index 0000000000..ac97ee8b8a --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/wandb-metadata.json @@ -0,0 +1,53 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T22:44:39.861248Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm", + "--lora-rank", + "0" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "40805670912" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "9agqyslentaakuy1fpz0qzu1hu5kmu0l" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/wandb-summary.json new file mode 100644 index 0000000000..583ae8d107 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/wandb-summary.json @@ -0,0 +1 @@ +{"_runtime":25,"_wandb":{"runtime":25}} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/config.yaml new file mode 100644 index 0000000000..7c610c10cd --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/config.yaml @@ -0,0 +1,98 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + f0sjpamcmkq5pwzdhixg6s73n17qk8gf: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.1" + - --no-interpass-rmsnorm + - --lora-rank + - "0" + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "40805675008" + email: nesta.midavaine@prosus.com + executable: 
/home/nesta/parameter-golf/.venv/bin/python3 + git: + commit: e07e44321b5c5af051343a3a16d83f0766e85597 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + python: CPython 3.12.3 + root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback + startedAt: "2026-03-26T22:44:39.939938Z" + writerId: f0sjpamcmkq5pwzdhixg6s73n17qk8gf + m: [] + python_version: 3.12.3 + t: + "1": + - 1 + "2": + - 1 + "3": + - 13 + - 16 + "4": 3.12.3 + "5": 0.25.1 + "10": + - 20 + "12": 0.25.1 + "13": linux-x86_64 +core_end: + value: 8 +core_start: + value: 3 +feedback_mode: + value: diagonal +feedback_rank: + value: 2 +interpass_rmsnorm: + value: false +iterations: + value: 20000 +jacobian_proxy_weight: + value: 0.1 +matrix_lr: + value: 0.025 +model_dim: + value: 512 +n_params: + value: 26927200 +num_layers: + value: 11 +num_passes: + value: 4 +residual_scale_init: + value: 0.5 +scalar_lr: + value: 0.025 +seed: + value: 1337 +train_batch_tokens: + value: 786432 +train_seq_len: + value: 2048 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/output.log new file mode 100644 index 0000000000..7c33dc22e1 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/output.log @@ -0,0 +1,41 @@ +wandb:initialized +Traceback (most recent call last): + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2148, in + main() + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1814, in main + (warmup_loss * grad_scale).backward() + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_tensor.py", line 631, in backward + torch.autograd.backward( + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/__init__.py", line 381, in backward + _engine_run_backward( + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/graph.py", line 869, in _engine_run_backward + return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 317, in apply + return user_fn(self, *args) + ^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2748, in backward + return impl_fn() + ^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2734, in impl_fn + out = CompiledFunction._backward_impl(ctx, all_args) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File 
"/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2918, in _backward_impl + out = call_func_at_runtime_with_args( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args + out = normalize_as_list(f(args)) + ^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__ + return self.current_callable(inputs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run + out = model(new_inputs) + ^^^^^^^^^^^^^^^^^ + File "/tmp/torchinductor_nesta/wm/cwmtq2g54vzhxzvsc4odvxx7srlroi2tosqcftjhmyw2c637ogrq.py", line 12255, in call + buf54 = empty_strided_cuda((48, 2048, 512), (1048576, 512, 1), torch.bfloat16) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 71.88 MiB is free. Process 443506 has 48.42 GiB memory in use. Process 443507 has 42.87 GiB memory in use. Including non-PyTorch memory, this process has 48.42 GiB memory in use. Of the allocated memory 47.76 GiB is allocated by PyTorch, and 3.38 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf) diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 
+cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/wandb-metadata.json new file mode 100644 index 0000000000..628ec3f286 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/wandb-metadata.json @@ -0,0 +1,53 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T22:44:39.939938Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm", + "--lora-rank", + "0" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "40805675008" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "f0sjpamcmkq5pwzdhixg6s73n17qk8gf" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/wandb-summary.json new file mode 100644 index 0000000000..2d7b734886 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/wandb-summary.json @@ -0,0 +1 @@ +{"_wandb":{"runtime":25},"_runtime":25} \ No newline at end of file diff --git 
a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/config.yaml new file mode 100644 index 0000000000..2595a840a3 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/config.yaml @@ -0,0 +1,98 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + axpacu681idgzbzy7k8dl2xn3o7luysd: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.1" + - --no-interpass-rmsnorm + - --lora-rank + - "0" + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "40805695488" + email: nesta.midavaine@prosus.com + executable: /home/nesta/parameter-golf/.venv/bin/python3 + git: + commit: e07e44321b5c5af051343a3a16d83f0766e85597 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + python: CPython 3.12.3 + root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback + startedAt: "2026-03-26T22:44:40.221244Z" + writerId: axpacu681idgzbzy7k8dl2xn3o7luysd + m: [] + python_version: 3.12.3 + t: + "1": + - 1 + "2": + - 1 + "3": + - 13 + - 16 + "4": 3.12.3 + "5": 0.25.1 + "10": + - 20 + "12": 0.25.1 + "13": linux-x86_64 +core_end: + value: 8 +core_start: + value: 3 +feedback_mode: + value: diagonal +feedback_rank: + value: 2 +interpass_rmsnorm: + value: false +iterations: + value: 20000 +jacobian_proxy_weight: + value: 0.1 +matrix_lr: + value: 0.025 +model_dim: + value: 512 +n_params: + value: 26927200 +num_layers: + value: 11 +num_passes: + value: 4 +residual_scale_init: + value: 0.5 +scalar_lr: + value: 0.025 +seed: + value: 1337 +train_batch_tokens: + value: 786432 +train_seq_len: + value: 2048 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/output.log new file mode 100644 index 0000000000..8daabbc0b7 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/output.log @@ -0,0 +1,109 @@ +wandb:initialized +Traceback (most recent call last): + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2148, in + main() + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1813, in main + warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__ + return super().__call__(*args, 
**kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1056, in forward + def forward(self, input_ids: Tensor, target_ids: Tensor, + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1200, in forward + return compiled_fn(full_args) + ^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 566, in runtime_wrapper + all_outs = call_func_at_runtime_with_args( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args + out = normalize_as_list(f(args)) + ^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 105, in g + return f(*args) + ^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 596, in apply + return super().apply(*args, **kwargs) # type: ignore[misc] + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2503, in forward + fw_outs = call_func_at_runtime_with_args( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args + out = normalize_as_list(f(args)) + ^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 783, in wrapper + return compiled_fn(runtime_args) + ^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1011, in inner_fn + outs = compiled_fn(args) + ^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__ + return self.current_callable(inputs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run + out = model(new_inputs) + 
^^^^^^^^^^^^^^^^^ + File "/tmp/torchinductor_nesta/3d/c3dzegwqai5cgqo2lurmxd3pfex6kuqjwm3ooueib5swt7sgm5kg.py", line 9307, in call + buf866 = torch.ops.flash_attn_3._flash_attn_forward.default(buf865, buf864, reinterpret_tensor(buf850, (48, 2048, 4, 64), (524288, 256, 64, 1), 0), None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 0.125, True, window_size_left=-1, window_size_right=-1, attention_chunk=0, softcap=0.0, rotary_interleaved=True, scheduler_metadata=None, num_splits=1, pack_gqa=None, sm_margin=0) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_ops.py", line 865, in __call__ + return self._op(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_library/autograd.py", line 112, in autograd_impl + result = forward_no_grad(*args, Metadata(keyset, keyword_only_args)) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_library/autograd.py", line 41, in forward_no_grad + result = op.redispatch(keyset & _C._after_autograd_keyset, *args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_ops.py", line 872, in redispatch + return self._handle.redispatch_boxed(keyset, *args, **kwargs) # type: ignore[return-value] + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_compile.py", line 54, in inner + return disable_fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 409, in __torch_dispatch__ + res = func(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_ops.py", line 865, in __call__ + return self._op(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_library/custom_ops.py", line 347, in backend_impl + result = self._backend_fns[device_type](*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_compile.py", line 54, in inner + return disable_fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_library/custom_ops.py", line 382, in wrapped_fn + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File 
"/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/flash_attention_hopper/flash_attn_interface.py", line 93, in _flash_attn_forward + out, softmax_lse, out_accum, softmax_lse_accum = flash_attn_3_cuda.fwd( + ^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_ops.py", line 1269, in __call__ + return self._op(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^ +RuntimeError: torch_call_dispatcher( "aten::new_empty", "", stack.data(), TORCH_ABI_VERSION) API call failed at /root/.venv/lib/python3.12/site-packages/torch/include/torch/csrc/stable/ops.h, line 579 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/wandb-metadata.json new file mode 100644 index 0000000000..7e88c9474d --- /dev/null +++ 
b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/wandb-metadata.json @@ -0,0 +1,53 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T22:44:40.221244Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm", + "--lora-rank", + "0" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "40805695488" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "axpacu681idgzbzy7k8dl2xn3o7luysd" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/wandb-summary.json new file mode 100644 index 0000000000..1b9bd7c2db --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/wandb-summary.json @@ -0,0 +1 @@ +{"_runtime":24,"_wandb":{"runtime":24}} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/config.yaml new file mode 100644 index 0000000000..04f424ddbc --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/config.yaml @@ -0,0 +1,100 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + anl00anvcf1sn6vkqysn8e39mj31kpzb: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.1" + - --no-interpass-rmsnorm + - --lora-rank + - "0" + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "40807182336" + email: nesta.midavaine@prosus.com + executable: /home/nesta/parameter-golf/.venv/bin/python3 + git: + commit: e07e44321b5c5af051343a3a16d83f0766e85597 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: 
GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + python: CPython 3.12.3 + root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback + startedAt: "2026-03-26T23:01:56.269199Z" + writerId: anl00anvcf1sn6vkqysn8e39mj31kpzb + m: [] + python_version: 3.12.3 + t: + "1": + - 1 + "2": + - 1 + "3": + - 2 + - 13 + - 16 + - 61 + "4": 3.12.3 + "5": 0.25.1 + "10": + - 20 + "12": 0.25.1 + "13": linux-x86_64 +core_end: + value: 8 +core_start: + value: 3 +feedback_mode: + value: diagonal +feedback_rank: + value: 2 +interpass_rmsnorm: + value: false +iterations: + value: 20000 +jacobian_proxy_weight: + value: 0.1 +matrix_lr: + value: 0.025 +model_dim: + value: 512 +n_params: + value: 26927200 +num_layers: + value: 11 +num_passes: + value: 4 +residual_scale_init: + value: 0.5 +scalar_lr: + value: 0.025 +seed: + value: 1337 +train_batch_tokens: + value: 786432 +train_seq_len: + value: 2048 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/output.log new file mode 100644 index 0000000000..af28a40e2a --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/output.log @@ -0,0 +1,319 @@ +wandb:initialized +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/20000 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms +step:1/20000 train_loss:6.9310 grad_norm:0.3717 train_time:1291ms step_avg:1291.09ms +step:2/20000 train_loss:8.3536 grad_norm:3.5393 train_time:2598ms step_avg:1298.81ms +step:3/20000 train_loss:7.5089 grad_norm:1.8069 train_time:3954ms step_avg:1318.13ms +step:4/20000 train_loss:7.5822 grad_norm:1.8725 train_time:5317ms step_avg:1329.29ms +step:5/20000 train_loss:7.3524 grad_norm:1.8843 train_time:6673ms step_avg:1334.64ms +step:6/20000 train_loss:7.0868 grad_norm:1.7131 train_time:8028ms step_avg:1338.00ms +step:7/20000 train_loss:6.9401 grad_norm:2.0897 train_time:9384ms step_avg:1340.63ms +step:8/20000 train_loss:6.8952 grad_norm:1.4534 train_time:10745ms step_avg:1343.15ms +step:9/20000 train_loss:6.5431 grad_norm:1.0222 train_time:12102ms step_avg:1344.70ms +step:10/20000 train_loss:6.1427 grad_norm:0.9715 train_time:13466ms step_avg:1346.55ms +step:50/20000 train_loss:3.6903 grad_norm:0.9422 train_time:68054ms step_avg:1361.07ms +step:100/20000 train_loss:3.1184 grad_norm:0.5410 train_time:136293ms step_avg:1362.93ms +step:150/20000 train_loss:2.7752 grad_norm:0.3613 train_time:205070ms step_avg:1367.13ms +step:200/20000 train_loss:2.5614 grad_norm:0.2693 train_time:273305ms step_avg:1366.53ms +step:250/20000 train_loss:2.5709 grad_norm:0.2522 train_time:341556ms step_avg:1366.22ms +step:300/20000 train_loss:2.4364 grad_norm:0.2295 train_time:409825ms step_avg:1366.08ms +step:350/20000 train_loss:2.4859 grad_norm:0.2104 train_time:478072ms 
step_avg:1365.92ms +step:400/20000 train_loss:2.3988 grad_norm:0.1555 train_time:546341ms step_avg:1365.85ms +step:450/20000 train_loss:2.2317 grad_norm:0.1958 train_time:614614ms step_avg:1365.81ms +step:500/20000 train_loss:2.2898 grad_norm:0.1775 train_time:682900ms step_avg:1365.80ms +step:500/20000 val_loss:2.3130 val_bpb:1.3699 train_time:682945ms step_avg:1365.89ms +step:550/20000 train_loss:2.3492 grad_norm:0.1559 train_time:751209ms step_avg:1365.83ms +step:600/20000 train_loss:2.2513 grad_norm:0.1438 train_time:819544ms step_avg:1365.91ms +step:650/20000 train_loss:2.2323 grad_norm:0.1536 train_time:888368ms step_avg:1366.72ms +step:700/20000 train_loss:2.3026 grad_norm:0.1020 train_time:956783ms step_avg:1366.83ms +step:750/20000 train_loss:2.2750 grad_norm:0.1105 train_time:1025183ms step_avg:1366.91ms +step:800/20000 train_loss:2.2546 grad_norm:0.1031 train_time:1093599ms step_avg:1367.00ms +step:850/20000 train_loss:2.1799 grad_norm:0.0737 train_time:1162084ms step_avg:1367.16ms +step:900/20000 train_loss:2.0960 grad_norm:0.0817 train_time:1230597ms step_avg:1367.33ms +step:950/20000 train_loss:2.2968 grad_norm:0.0953 train_time:1299094ms step_avg:1367.47ms +step:1000/20000 train_loss:2.2247 grad_norm:0.0713 train_time:1367589ms step_avg:1367.59ms +step:1000/20000 val_loss:2.1722 val_bpb:1.2865 train_time:1367633ms step_avg:1367.63ms +step:1050/20000 train_loss:2.1500 grad_norm:0.1469 train_time:1436112ms step_avg:1367.73ms +step:1100/20000 train_loss:2.1744 grad_norm:0.0794 train_time:1504991ms step_avg:1368.17ms +step:1150/20000 train_loss:2.1290 grad_norm:0.0672 train_time:1573762ms step_avg:1368.49ms +step:1200/20000 train_loss:2.1756 grad_norm:0.0636 train_time:1642514ms step_avg:1368.76ms +step:1250/20000 train_loss:2.1991 grad_norm:0.0599 train_time:1711283ms step_avg:1369.03ms +step:1300/20000 train_loss:2.1695 grad_norm:0.1132 train_time:1780070ms step_avg:1369.28ms +step:1350/20000 train_loss:2.1436 grad_norm:0.1200 train_time:1848866ms step_avg:1369.53ms +step:1400/20000 train_loss:2.1553 grad_norm:0.0700 train_time:1917654ms step_avg:1369.75ms +step:1450/20000 train_loss:2.1501 grad_norm:0.0631 train_time:1986442ms step_avg:1369.96ms +step:1500/20000 train_loss:2.1193 grad_norm:0.0733 train_time:2055220ms step_avg:1370.15ms +step:1500/20000 val_loss:2.1071 val_bpb:1.2479 train_time:2055264ms step_avg:1370.18ms +step:1550/20000 train_loss:2.0928 grad_norm:0.0758 train_time:2124013ms step_avg:1370.33ms +step:1600/20000 train_loss:2.1722 grad_norm:0.0814 train_time:2193129ms step_avg:1370.71ms +step:1650/20000 train_loss:1.9557 grad_norm:0.0655 train_time:2261915ms step_avg:1370.86ms +step:1700/20000 train_loss:2.0848 grad_norm:0.0634 train_time:2330710ms step_avg:1371.01ms +step:1750/20000 train_loss:2.0562 grad_norm:0.0759 train_time:2399493ms step_avg:1371.14ms +step:1800/20000 train_loss:2.0964 grad_norm:0.0645 train_time:2468259ms step_avg:1371.26ms +step:1850/20000 train_loss:2.1107 grad_norm:0.0831 train_time:2537046ms step_avg:1371.38ms +step:1900/20000 train_loss:2.0580 grad_norm:0.0648 train_time:2605824ms step_avg:1371.49ms +step:1950/20000 train_loss:2.0431 grad_norm:0.0981 train_time:2674651ms step_avg:1371.62ms +step:2000/20000 train_loss:2.2944 grad_norm:0.0838 train_time:2743419ms step_avg:1371.71ms +step:2000/20000 val_loss:2.0763 val_bpb:1.2297 train_time:2743463ms step_avg:1371.73ms +step:2050/20000 train_loss:2.0607 grad_norm:0.1013 train_time:2812501ms step_avg:1371.95ms +step:2100/20000 train_loss:2.0358 grad_norm:0.0558 train_time:2881257ms 
step_avg:1372.03ms +step:2150/20000 train_loss:2.0142 grad_norm:0.0526 train_time:2950035ms step_avg:1372.11ms +step:2200/20000 train_loss:2.1668 grad_norm:0.0614 train_time:3018808ms step_avg:1372.19ms +step:2250/20000 train_loss:2.0604 grad_norm:0.0644 train_time:3087562ms step_avg:1372.25ms +step:2300/20000 train_loss:2.0377 grad_norm:0.1123 train_time:3156291ms step_avg:1372.30ms +step:2350/20000 train_loss:1.9923 grad_norm:0.0511 train_time:3225042ms step_avg:1372.36ms +step:2400/20000 train_loss:2.1062 grad_norm:0.0682 train_time:3293804ms step_avg:1372.42ms +step:2450/20000 train_loss:2.0650 grad_norm:0.0639 train_time:3362565ms step_avg:1372.48ms +step:2500/20000 train_loss:2.0208 grad_norm:0.0580 train_time:3431320ms step_avg:1372.53ms +step:2500/20000 val_loss:2.0279 val_bpb:1.2010 train_time:3431364ms step_avg:1372.55ms +step:2550/20000 train_loss:2.0211 grad_norm:0.0558 train_time:3500393ms step_avg:1372.70ms +step:2600/20000 train_loss:2.0001 grad_norm:0.0479 train_time:3569165ms step_avg:1372.76ms +step:2650/20000 train_loss:2.0040 grad_norm:0.0582 train_time:3637929ms step_avg:1372.80ms +step:2700/20000 train_loss:2.0265 grad_norm:0.0542 train_time:3706703ms step_avg:1372.85ms +step:2750/20000 train_loss:2.0077 grad_norm:0.0457 train_time:3775459ms step_avg:1372.89ms +step:2800/20000 train_loss:2.0415 grad_norm:0.0569 train_time:3844241ms step_avg:1372.94ms +step:2850/20000 train_loss:1.9900 grad_norm:0.0487 train_time:3913011ms step_avg:1372.99ms +step:2900/20000 train_loss:2.0045 grad_norm:0.0438 train_time:3981769ms step_avg:1373.02ms +step:2950/20000 train_loss:2.0440 grad_norm:0.0447 train_time:4050513ms step_avg:1373.06ms +step:3000/20000 train_loss:1.9316 grad_norm:0.0567 train_time:4119545ms step_avg:1373.18ms +step:3000/20000 val_loss:1.9838 val_bpb:1.1749 train_time:4119590ms step_avg:1373.20ms +step:3050/20000 train_loss:1.9372 grad_norm:0.0506 train_time:4188300ms step_avg:1373.21ms +step:3100/20000 train_loss:1.9990 grad_norm:0.0465 train_time:4257075ms step_avg:1373.25ms +step:3150/20000 train_loss:2.0077 grad_norm:0.0401 train_time:4325837ms step_avg:1373.28ms +swa:start step:3200 +step:3200/20000 train_loss:1.9812 grad_norm:0.0445 train_time:4394566ms step_avg:1373.30ms +late_qat:enabled step:3241 scale:0.1495 core_quant:on +step:3250/20000 train_loss:1.9531 grad_norm:0.0567 train_time:4519079ms step_avg:1390.49ms +step:3300/20000 train_loss:1.9296 grad_norm:0.0386 train_time:4587540ms step_avg:1390.16ms +step:3350/20000 train_loss:1.9653 grad_norm:0.0394 train_time:4655858ms step_avg:1389.81ms +step:3400/20000 train_loss:2.0099 grad_norm:0.0483 train_time:4724204ms step_avg:1389.47ms +step:3450/20000 train_loss:1.9637 grad_norm:0.0369 train_time:4792535ms step_avg:1389.14ms +step:3456/20000 val_loss:1.9505 val_bpb:1.1552 train_time:4800814ms step_avg:1389.12ms +stopping_early: wallclock_cap train_time:4800814ms step:3456/20000 +peak memory allocated: 50545 MiB reserved: 50594 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:1.9472 val_bpb:1.1532 eval_time:32839ms +Serialized model: 106023671 bytes +Code size: 102633 bytes +Serialized model int6+lzma: 16373548 bytes +Total submission size int6+lzma: 16476181 bytes +final_int6_roundtrip val_loss:1.9574 val_bpb:1.1593 eval_time:39862ms +final_int6_roundtrip_exact val_loss:1.95735252 val_bpb:1.15925441 +final_int6_sliding_window val_loss:1.9164 val_bpb:1.1350 stride:64 eval_time:1105486ms +final_int6_sliding_window_exact val_loss:1.91642779 val_bpb:1.13501949 +final_int8_zlib_roundtrip_exact 
val_loss:1.91642779 val_bpb:1.13501949 +ttt_sliding:start chunks=1893 chunk_tokens=32768 total_windows=969088 stride=64 ttt_lr=0.002 ttt_epochs=3 freeze_blocks=2 +ttt_sliding:params unfrozen=26923088 frozen=4112 + ttt_chunk [1/1893] bpb=1.226275 time=1.9s + ttt_chunk [11/1893] bpb=1.128206 time=20.5s + ttt_chunk [21/1893] bpb=1.137378 time=39.0s + ttt_chunk [31/1893] bpb=1.142175 time=57.6s + ttt_chunk [41/1893] bpb=1.138228 time=76.1s + ttt_chunk [51/1893] bpb=1.139877 time=94.6s + ttt_chunk [61/1893] bpb=1.143695 time=113.2s + ttt_chunk [71/1893] bpb=1.141806 time=131.7s + ttt_chunk [81/1893] bpb=1.138175 time=150.2s + ttt_chunk [91/1893] bpb=1.137107 time=168.8s + ttt_chunk [101/1893] bpb=1.138115 time=187.3s + ttt_chunk [111/1893] bpb=1.138295 time=205.9s + ttt_chunk [121/1893] bpb=1.134671 time=224.4s + ttt_chunk [131/1893] bpb=1.133939 time=242.9s + ttt_chunk [141/1893] bpb=1.132766 time=261.5s + ttt_chunk [151/1893] bpb=1.132980 time=280.0s + ttt_chunk [161/1893] bpb=1.133800 time=298.6s + ttt_chunk [171/1893] bpb=1.135874 time=317.1s + ttt_chunk [181/1893] bpb=1.135884 time=335.6s + ttt_chunk [191/1893] bpb=1.138340 time=354.2s + ttt_chunk [201/1893] bpb=1.137866 time=372.7s + ttt_chunk [211/1893] bpb=1.136957 time=391.2s + ttt_chunk [221/1893] bpb=1.137842 time=409.8s + ttt_chunk [231/1893] bpb=1.137565 time=428.3s + ttt_chunk [241/1893] bpb=1.137849 time=446.8s + ttt_chunk [251/1893] bpb=1.137360 time=465.4s + ttt_chunk [261/1893] bpb=1.136692 time=483.9s + ttt_chunk [271/1893] bpb=1.135780 time=502.5s + ttt_chunk [281/1893] bpb=1.137389 time=521.0s + ttt_chunk [291/1893] bpb=1.137018 time=539.5s + ttt_chunk [301/1893] bpb=1.137918 time=558.1s + ttt_chunk [311/1893] bpb=1.138001 time=576.6s + ttt_chunk [321/1893] bpb=1.138708 time=595.1s + ttt_chunk [331/1893] bpb=1.138179 time=613.7s + ttt_chunk [341/1893] bpb=1.137832 time=632.2s + ttt_chunk [351/1893] bpb=1.138543 time=650.8s + ttt_chunk [361/1893] bpb=1.139301 time=669.3s + ttt_chunk [371/1893] bpb=1.139185 time=687.8s + ttt_chunk [381/1893] bpb=1.138924 time=706.4s + ttt_chunk [391/1893] bpb=1.139607 time=724.9s + ttt_chunk [401/1893] bpb=1.139172 time=743.4s + ttt_chunk [411/1893] bpb=1.138218 time=762.0s + ttt_chunk [421/1893] bpb=1.138334 time=780.5s + ttt_chunk [431/1893] bpb=1.138777 time=799.1s + ttt_chunk [441/1893] bpb=1.138161 time=817.6s + ttt_chunk [451/1893] bpb=1.138301 time=836.1s + ttt_chunk [461/1893] bpb=1.138190 time=854.7s + ttt_chunk [471/1893] bpb=1.137746 time=873.2s + ttt_chunk [481/1893] bpb=1.137597 time=891.8s + ttt_chunk [491/1893] bpb=1.137722 time=910.3s + ttt_chunk [501/1893] bpb=1.137492 time=928.8s + ttt_chunk [511/1893] bpb=1.137017 time=947.4s + ttt_chunk [521/1893] bpb=1.136714 time=965.9s + ttt_chunk [531/1893] bpb=1.137443 time=984.5s + ttt_chunk [541/1893] bpb=1.137557 time=1003.0s + ttt_chunk [551/1893] bpb=1.137019 time=1021.5s + ttt_chunk [561/1893] bpb=1.136885 time=1040.1s + ttt_chunk [571/1893] bpb=1.136621 time=1058.6s + ttt_chunk [581/1893] bpb=1.136257 time=1077.2s + ttt_chunk [591/1893] bpb=1.135719 time=1095.7s + ttt_chunk [601/1893] bpb=1.135711 time=1114.2s + ttt_chunk [611/1893] bpb=1.135386 time=1132.8s + ttt_chunk [621/1893] bpb=1.135235 time=1151.3s + ttt_chunk [631/1893] bpb=1.134973 time=1169.9s + ttt_chunk [641/1893] bpb=1.134519 time=1188.4s + ttt_chunk [651/1893] bpb=1.134057 time=1206.9s + ttt_chunk [661/1893] bpb=1.133947 time=1225.5s + ttt_chunk [671/1893] bpb=1.133482 time=1244.0s + ttt_chunk [681/1893] bpb=1.132918 time=1262.6s + ttt_chunk [691/1893] 
bpb=1.132994 time=1281.1s + ttt_chunk [701/1893] bpb=1.132163 time=1299.6s + ttt_chunk [711/1893] bpb=1.132176 time=1318.2s + ttt_chunk [721/1893] bpb=1.132090 time=1336.7s + ttt_chunk [731/1893] bpb=1.132331 time=1355.2s + ttt_chunk [741/1893] bpb=1.132205 time=1373.8s + ttt_chunk [751/1893] bpb=1.131884 time=1392.3s + ttt_chunk [761/1893] bpb=1.132028 time=1410.8s + ttt_chunk [771/1893] bpb=1.131860 time=1429.4s + ttt_chunk [781/1893] bpb=1.132024 time=1447.9s + ttt_chunk [791/1893] bpb=1.131869 time=1466.4s + ttt_chunk [801/1893] bpb=1.131804 time=1485.0s + ttt_chunk [811/1893] bpb=1.131817 time=1503.5s + ttt_chunk [821/1893] bpb=1.131702 time=1522.1s + ttt_chunk [831/1893] bpb=1.131418 time=1540.6s + ttt_chunk [841/1893] bpb=1.131180 time=1559.1s + ttt_chunk [851/1893] bpb=1.131241 time=1577.7s + ttt_chunk [861/1893] bpb=1.131312 time=1596.2s + ttt_chunk [871/1893] bpb=1.131521 time=1614.7s + ttt_chunk [881/1893] bpb=1.131519 time=1633.3s + ttt_chunk [891/1893] bpb=1.130978 time=1651.8s + ttt_chunk [901/1893] bpb=1.130995 time=1670.3s + ttt_chunk [911/1893] bpb=1.130849 time=1688.9s + ttt_chunk [921/1893] bpb=1.130984 time=1707.4s + ttt_chunk [931/1893] bpb=1.130928 time=1726.0s + ttt_chunk [941/1893] bpb=1.131129 time=1744.5s + ttt_chunk [951/1893] bpb=1.131431 time=1763.0s + ttt_chunk [961/1893] bpb=1.131741 time=1781.6s + ttt_chunk [971/1893] bpb=1.132107 time=1800.1s + ttt_chunk [981/1893] bpb=1.132319 time=1818.6s + ttt_chunk [991/1893] bpb=1.132236 time=1837.2s + ttt_chunk [1001/1893] bpb=1.132567 time=1855.7s + ttt_chunk [1011/1893] bpb=1.132723 time=1874.3s + ttt_chunk [1021/1893] bpb=1.133011 time=1892.8s + ttt_chunk [1031/1893] bpb=1.133400 time=1911.3s + ttt_chunk [1041/1893] bpb=1.133897 time=1929.9s + ttt_chunk [1051/1893] bpb=1.133756 time=1948.4s + ttt_chunk [1061/1893] bpb=1.133865 time=1967.0s + ttt_chunk [1071/1893] bpb=1.134029 time=1985.5s + ttt_chunk [1081/1893] bpb=1.134076 time=2004.1s + ttt_chunk [1091/1893] bpb=1.134326 time=2022.7s + ttt_chunk [1101/1893] bpb=1.134469 time=2041.2s + ttt_chunk [1111/1893] bpb=1.134274 time=2059.8s + ttt_chunk [1121/1893] bpb=1.134049 time=2078.3s + ttt_chunk [1131/1893] bpb=1.133943 time=2096.9s + ttt_chunk [1141/1893] bpb=1.133705 time=2115.4s + ttt_chunk [1151/1893] bpb=1.133733 time=2134.0s + ttt_chunk [1161/1893] bpb=1.133569 time=2152.5s + ttt_chunk [1171/1893] bpb=1.133389 time=2171.1s + ttt_chunk [1181/1893] bpb=1.133164 time=2189.6s + ttt_chunk [1191/1893] bpb=1.133317 time=2208.2s + ttt_chunk [1201/1893] bpb=1.133519 time=2226.8s + ttt_chunk [1211/1893] bpb=1.133117 time=2245.3s + ttt_chunk [1221/1893] bpb=1.133455 time=2263.9s + ttt_chunk [1231/1893] bpb=1.133394 time=2282.4s + ttt_chunk [1241/1893] bpb=1.133104 time=2300.9s + ttt_chunk [1251/1893] bpb=1.132567 time=2319.5s + ttt_chunk [1261/1893] bpb=1.132300 time=2338.0s + ttt_chunk [1271/1893] bpb=1.132047 time=2356.6s + ttt_chunk [1281/1893] bpb=1.131738 time=2375.1s + ttt_chunk [1291/1893] bpb=1.131494 time=2393.7s + ttt_chunk [1301/1893] bpb=1.131443 time=2412.2s + ttt_chunk [1311/1893] bpb=1.131173 time=2430.7s + ttt_chunk [1321/1893] bpb=1.130872 time=2449.3s + ttt_chunk [1331/1893] bpb=1.130632 time=2467.8s + ttt_chunk [1341/1893] bpb=1.130505 time=2486.4s + ttt_chunk [1351/1893] bpb=1.130352 time=2504.9s + ttt_chunk [1361/1893] bpb=1.130484 time=2523.5s + ttt_chunk [1371/1893] bpb=1.130705 time=2542.0s + ttt_chunk [1381/1893] bpb=1.130910 time=2560.5s + ttt_chunk [1391/1893] bpb=1.130695 time=2579.1s + ttt_chunk [1401/1893] bpb=1.130724 time=2597.6s + 
ttt_chunk [1411/1893] bpb=1.130831 time=2616.2s + ttt_chunk [1421/1893] bpb=1.130815 time=2634.7s + ttt_chunk [1431/1893] bpb=1.130791 time=2653.3s + ttt_chunk [1441/1893] bpb=1.131256 time=2671.8s + ttt_chunk [1451/1893] bpb=1.131119 time=2691.1s + ttt_chunk [1461/1893] bpb=1.131048 time=2709.6s + ttt_chunk [1471/1893] bpb=1.131643 time=2728.2s + ttt_chunk [1481/1893] bpb=1.131517 time=2746.7s + ttt_chunk [1491/1893] bpb=1.131890 time=2765.3s + ttt_chunk [1501/1893] bpb=1.131872 time=2783.8s + ttt_chunk [1511/1893] bpb=1.131833 time=2802.3s + ttt_chunk [1521/1893] bpb=1.131945 time=2820.9s + ttt_chunk [1531/1893] bpb=1.132160 time=2839.4s + ttt_chunk [1541/1893] bpb=1.132230 time=2858.0s + ttt_chunk [1551/1893] bpb=1.132470 time=2876.5s + ttt_chunk [1561/1893] bpb=1.132554 time=2895.1s + ttt_chunk [1571/1893] bpb=1.132686 time=2913.6s + ttt_chunk [1581/1893] bpb=1.132836 time=2932.1s + ttt_chunk [1591/1893] bpb=1.132902 time=2950.7s + ttt_chunk [1601/1893] bpb=1.133020 time=2969.2s + ttt_chunk [1611/1893] bpb=1.133281 time=2987.8s + ttt_chunk [1621/1893] bpb=1.133141 time=3006.3s + ttt_chunk [1631/1893] bpb=1.133187 time=3024.8s + ttt_chunk [1641/1893] bpb=1.133212 time=3043.4s + ttt_chunk [1651/1893] bpb=1.133269 time=3061.9s + ttt_chunk [1661/1893] bpb=1.133410 time=3080.5s + ttt_chunk [1671/1893] bpb=1.133595 time=3099.0s + ttt_chunk [1681/1893] bpb=1.133686 time=3117.5s + ttt_chunk [1691/1893] bpb=1.133787 time=3136.1s + ttt_chunk [1701/1893] bpb=1.133884 time=3154.6s + ttt_chunk [1711/1893] bpb=1.133862 time=3173.2s + ttt_chunk [1721/1893] bpb=1.133701 time=3191.7s + ttt_chunk [1731/1893] bpb=1.133797 time=3210.2s + ttt_chunk [1741/1893] bpb=1.133534 time=3228.8s + ttt_chunk [1751/1893] bpb=1.133407 time=3247.3s + ttt_chunk [1761/1893] bpb=1.133444 time=3265.9s + ttt_chunk [1771/1893] bpb=1.133395 time=3284.4s + ttt_chunk [1781/1893] bpb=1.133298 time=3303.0s + ttt_chunk [1791/1893] bpb=1.132959 time=3321.5s + ttt_chunk [1801/1893] bpb=1.132941 time=3340.0s + ttt_chunk [1811/1893] bpb=1.132795 time=3358.6s + ttt_chunk [1821/1893] bpb=1.132853 time=3377.1s + ttt_chunk [1831/1893] bpb=1.132699 time=3395.7s + ttt_chunk [1841/1893] bpb=1.132738 time=3414.2s + ttt_chunk [1851/1893] bpb=1.132559 time=3432.7s + ttt_chunk [1861/1893] bpb=1.132478 time=3451.3s + ttt_chunk [1871/1893] bpb=1.132413 time=3469.8s + ttt_chunk [1881/1893] bpb=1.132170 time=3488.4s + ttt_chunk [1891/1893] bpb=1.132153 time=3506.9s + ttt_chunk [1893/1893] bpb=1.132184 time=3509.9s +ttt_sliding:done val_loss=1.911640 val_bpb=1.132184 elapsed=3510.0s +legal_ttt val_loss:1.9116 val_bpb:1.1322 eval_time:3510399ms +legal_ttt_exact val_loss:1.91163996 val_bpb:1.13218386 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 
+annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/wandb-metadata.json new file mode 100644 index 0000000000..88cfe10ecc --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/wandb-metadata.json @@ -0,0 +1,53 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T23:01:56.269199Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm", + "--lora-rank", + "0" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "40807182336" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", 
+ "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "anl00anvcf1sn6vkqysn8e39mj31kpzb" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/wandb-summary.json new file mode 100644 index 0000000000..fec6c414c9 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/wandb-summary.json @@ -0,0 +1 @@ +{"val_bpb":1.1552031360520438,"grad_norm":0.036941949278116226,"_wandb":{"runtime":9854},"val_loss":1.9505121057311605,"train_loss":1.9637089967727661,"_step":3456,"_runtime":9854.375477361,"_timestamp":1.7745712654533072e+09,"step_avg_ms":1389.1407154272545,"lr_scale":0.0037399857268641938} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb/files/config.yaml new file mode 100644 index 0000000000..4281bc9e86 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb/files/config.yaml @@ -0,0 +1,98 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + 89ifin0p6a6n764umjq94imdjdo3oavd: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.1" + - --no-interpass-rmsnorm + - --lora-rank + - "0" + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "40807403520" + email: nesta.midavaine@prosus.com + executable: /home/nesta/parameter-golf/.venv/bin/python3 + git: + commit: e07e44321b5c5af051343a3a16d83f0766e85597 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + python: CPython 3.12.3 + root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback + startedAt: "2026-03-26T23:01:58.938056Z" + writerId: 89ifin0p6a6n764umjq94imdjdo3oavd + m: [] + python_version: 3.12.3 + t: + "1": + - 1 + "2": + - 1 + "3": + - 13 + - 16 + "4": 3.12.3 + "5": 0.25.1 + "10": + - 20 + "12": 0.25.1 + "13": linux-x86_64 +core_end: + value: 8 +core_start: + value: 3 +feedback_mode: + value: diagonal +feedback_rank: + value: 2 +interpass_rmsnorm: + value: false +iterations: + value: 20000 +jacobian_proxy_weight: + value: 0.1 +matrix_lr: + value: 0.025 +model_dim: + value: 512 +n_params: + value: 26927200 +num_layers: + value: 11 +num_passes: + value: 4 +residual_scale_init: + value: 0.5 +scalar_lr: + value: 0.025 +seed: + value: 1337 +train_batch_tokens: + value: 786432 +train_seq_len: + value: 2048 diff --git 
a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb/files/output.log new file mode 100644 index 0000000000..804e8ee9f1 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb/files/output.log @@ -0,0 +1,67 @@ +wandb:initialized +Traceback (most recent call last): + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2148, in + main() + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1813, in main + warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__ + return super().__call__(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1056, in forward + def forward(self, input_ids: Tensor, target_ids: Tensor, + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1200, in forward + return compiled_fn(full_args) + ^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 566, in runtime_wrapper + all_outs = call_func_at_runtime_with_args( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args + out = normalize_as_list(f(args)) + ^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 105, in g + return f(*args) + ^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 596, in apply + return super().apply(*args, **kwargs) # type: ignore[misc] + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File 
"/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2503, in forward + fw_outs = call_func_at_runtime_with_args( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args + out = normalize_as_list(f(args)) + ^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 783, in wrapper + return compiled_fn(runtime_args) + ^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1011, in inner_fn + outs = compiled_fn(args) + ^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__ + return self.current_callable(inputs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run + out = model(new_inputs) + ^^^^^^^^^^^^^^^^^ + File "/tmp/torchinductor_nesta/3d/c3dzegwqai5cgqo2lurmxd3pfex6kuqjwm3ooueib5swt7sgm5kg.py", line 7110, in call + buf5 = empty_strided_cuda((98304, 512), (512, 1), torch.bfloat16) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 18.69 MiB is free. Process 448538 has 758.00 MiB memory in use. Including non-PyTorch memory, this process has 758.00 MiB memory in use. Process 448534 has 51.16 GiB memory in use. Process 448547 has 38.69 GiB memory in use. Process 448537 has 48.42 GiB memory in use. Of the allocated memory 127.50 MiB is allocated by PyTorch, and 18.50 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf)
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb/files/wandb-metadata.json
new file mode 100644
index 0000000000..c34ff0f1da
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb/files/wandb-metadata.json
@@ -0,0 +1,53 @@
+{
+    "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39",
+    "python": "CPython 3.12.3",
+    "startedAt": "2026-03-26T23:01:58.938056Z",
+    "args": [
+        "--feedback-mode",
+        "diagonal",
+        "--feedback-rank",
+        "2",
+        "--residual-scale-init",
+        "0.5",
+        "--jacobian-proxy-weight",
+        "0.1",
+        "--no-interpass-rmsnorm",
+        "--lora-rank",
+        "0"
+    ],
+    
"program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "40807403520" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "89ifin0p6a6n764umjq94imdjdo3oavd" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb/files/wandb-summary.json new file mode 100644 index 0000000000..878f970977 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb/files/wandb-summary.json @@ -0,0 +1 @@ +{"_wandb":{"runtime":24},"_runtime":24} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a/files/config.yaml new file mode 100644 index 0000000000..53147aa88c --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a/files/config.yaml @@ -0,0 +1,98 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + kdbak0rtti6ej00l2ly46s696m03mosu: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.1" + - --no-interpass-rmsnorm + - --lora-rank + - "0" + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "40807370752" + email: nesta.midavaine@prosus.com + executable: /home/nesta/parameter-golf/.venv/bin/python3 + git: + commit: e07e44321b5c5af051343a3a16d83f0766e85597 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + python: CPython 3.12.3 + root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback + startedAt: "2026-03-26T23:01:58.577514Z" + writerId: 
kdbak0rtti6ej00l2ly46s696m03mosu + m: [] + python_version: 3.12.3 + t: + "1": + - 1 + "2": + - 1 + "3": + - 13 + - 16 + "4": 3.12.3 + "5": 0.25.1 + "10": + - 20 + "12": 0.25.1 + "13": linux-x86_64 +core_end: + value: 8 +core_start: + value: 3 +feedback_mode: + value: diagonal +feedback_rank: + value: 2 +interpass_rmsnorm: + value: false +iterations: + value: 20000 +jacobian_proxy_weight: + value: 0.1 +matrix_lr: + value: 0.025 +model_dim: + value: 512 +n_params: + value: 26927200 +num_layers: + value: 11 +num_passes: + value: 4 +residual_scale_init: + value: 0.5 +scalar_lr: + value: 0.025 +seed: + value: 1337 +train_batch_tokens: + value: 786432 +train_seq_len: + value: 2048 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a/files/output.log new file mode 100644 index 0000000000..90f71e3c3b --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a/files/output.log @@ -0,0 +1,67 @@ +wandb:initialized +Traceback (most recent call last): + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2148, in + main() + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1813, in main + warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__ + return super().__call__(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1056, in forward + def forward(self, input_ids: Tensor, target_ids: Tensor, + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1200, in forward + return compiled_fn(full_args) + ^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 566, in runtime_wrapper + all_outs = call_func_at_runtime_with_args( + 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args + out = normalize_as_list(f(args)) + ^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 105, in g + return f(*args) + ^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 596, in apply + return super().apply(*args, **kwargs) # type: ignore[misc] + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2503, in forward + fw_outs = call_func_at_runtime_with_args( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args + out = normalize_as_list(f(args)) + ^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 783, in wrapper + return compiled_fn(runtime_args) + ^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1011, in inner_fn + outs = compiled_fn(args) + ^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__ + return self.current_callable(inputs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run + out = model(new_inputs) + ^^^^^^^^^^^^^^^^^ + File "/tmp/torchinductor_nesta/3d/c3dzegwqai5cgqo2lurmxd3pfex6kuqjwm3ooueib5swt7sgm5kg.py", line 7110, in call + buf5 = empty_strided_cuda((98304, 512), (512, 1), torch.bfloat16) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 20.69 MiB is free. Including non-PyTorch memory, this process has 758.00 MiB memory in use. Process 448539 has 756.00 MiB memory in use. Process 448534 has 51.16 GiB memory in use. Process 448547 has 38.69 GiB memory in use. Process 448537 has 48.42 GiB memory in use. Of the allocated memory 127.50 MiB is allocated by PyTorch, and 18.50 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf)
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a/files/wandb-metadata.json
new file mode 100644
index 0000000000..c3eeb710e6
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a/files/wandb-metadata.json
@@ -0,0 +1,53 @@
+{
+    "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39",
+    "python": "CPython 3.12.3",
+    "startedAt": "2026-03-26T23:01:58.577514Z",
+    "args": [
+        "--feedback-mode",
+        "diagonal",
+        "--feedback-rank",
+        "2",
+        "--residual-scale-init",
+        "0.5",
+        "--jacobian-proxy-weight",
+        "0.1",
+        "--no-interpass-rmsnorm",
+        "--lora-rank",
+        "0"
+    ],
+    
"program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "40807370752" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "kdbak0rtti6ej00l2ly46s696m03mosu" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a/files/wandb-summary.json new file mode 100644 index 0000000000..1b9bd7c2db --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a/files/wandb-summary.json @@ -0,0 +1 @@ +{"_runtime":24,"_wandb":{"runtime":24}} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/config.yaml new file mode 100644 index 0000000000..983e5823e9 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/config.yaml @@ -0,0 +1,98 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + ttse5eng2ea37eubnzrl3mdow2wcm0qo: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.1" + - --no-interpass-rmsnorm + - --lora-rank + - "0" + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "40807399424" + email: nesta.midavaine@prosus.com + executable: /home/nesta/parameter-golf/.venv/bin/python3 + git: + commit: e07e44321b5c5af051343a3a16d83f0766e85597 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + python: CPython 3.12.3 + root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback + startedAt: "2026-03-26T23:01:58.980459Z" + writerId: 
ttse5eng2ea37eubnzrl3mdow2wcm0qo + m: [] + python_version: 3.12.3 + t: + "1": + - 1 + "2": + - 1 + "3": + - 13 + - 16 + "4": 3.12.3 + "5": 0.25.1 + "10": + - 20 + "12": 0.25.1 + "13": linux-x86_64 +core_end: + value: 8 +core_start: + value: 3 +feedback_mode: + value: diagonal +feedback_rank: + value: 2 +interpass_rmsnorm: + value: false +iterations: + value: 20000 +jacobian_proxy_weight: + value: 0.1 +matrix_lr: + value: 0.025 +model_dim: + value: 512 +n_params: + value: 26927200 +num_layers: + value: 11 +num_passes: + value: 4 +residual_scale_init: + value: 0.5 +scalar_lr: + value: 0.025 +seed: + value: 1337 +train_batch_tokens: + value: 786432 +train_seq_len: + value: 2048 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/output.log new file mode 100644 index 0000000000..9a78b745cc --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/output.log @@ -0,0 +1,67 @@ +wandb:initialized +Traceback (most recent call last): + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2148, in + main() + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1813, in main + warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__ + return super().__call__(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl + return self._call_impl(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl + return forward_call(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1056, in forward + def forward(self, input_ids: Tensor, target_ids: Tensor, + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1200, in forward + return compiled_fn(full_args) + ^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 566, in runtime_wrapper + all_outs = call_func_at_runtime_with_args( + 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args + out = normalize_as_list(f(args)) + ^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 105, in g + return f(*args) + ^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 596, in apply + return super().apply(*args, **kwargs) # type: ignore[misc] + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2503, in forward + fw_outs = call_func_at_runtime_with_args( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args + out = normalize_as_list(f(args)) + ^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 783, in wrapper + return compiled_fn(runtime_args) + ^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1011, in inner_fn + outs = compiled_fn(args) + ^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__ + return self.current_callable(inputs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run + out = model(new_inputs) + ^^^^^^^^^^^^^^^^^ + File "/tmp/torchinductor_nesta/3d/c3dzegwqai5cgqo2lurmxd3pfex6kuqjwm3ooueib5swt7sgm5kg.py", line 9077, in call + buf776 = empty_strided_cuda((48, 2048, 512), (1048576, 512, 1), torch.bfloat16) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 24.69 MiB is free. Process 448538 has 754.00 MiB memory in use. Process 448539 has 756.00 MiB memory in use. Process 448534 has 51.16 GiB memory in use. Including non-PyTorch memory, this process has 38.69 GiB memory in use. Process 448537 has 48.42 GiB memory in use. Of the allocated memory 38.01 GiB is allocated by PyTorch, and 15.38 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf) diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/wandb-metadata.json new file mode 100644 index 0000000000..5ffd205aaf --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/wandb-metadata.json @@ -0,0 +1,53 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T23:01:58.980459Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm", + "--lora-rank", + "0" + ], + 
"program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "40807399424" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "ttse5eng2ea37eubnzrl3mdow2wcm0qo" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/wandb-summary.json new file mode 100644 index 0000000000..1b9bd7c2db --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/wandb-summary.json @@ -0,0 +1 @@ +{"_runtime":24,"_wandb":{"runtime":24}} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/config.yaml new file mode 100644 index 0000000000..15ebd9f96c --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/config.yaml @@ -0,0 +1,98 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + mp6843cl3s2nwlrmg1chkb6ac5g1amp8: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.1" + - --no-interpass-rmsnorm + - --lora-rank + - "0" + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "40807366656" + email: nesta.midavaine@prosus.com + executable: /home/nesta/parameter-golf/.venv/bin/python3 + git: + commit: e07e44321b5c5af051343a3a16d83f0766e85597 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + python: CPython 3.12.3 + root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback + startedAt: "2026-03-26T23:01:58.565014Z" + writerId: 
mp6843cl3s2nwlrmg1chkb6ac5g1amp8 + m: [] + python_version: 3.12.3 + t: + "1": + - 1 + "2": + - 1 + "3": + - 13 + - 16 + "4": 3.12.3 + "5": 0.25.1 + "10": + - 20 + "12": 0.25.1 + "13": linux-x86_64 +core_end: + value: 8 +core_start: + value: 3 +feedback_mode: + value: diagonal +feedback_rank: + value: 2 +interpass_rmsnorm: + value: false +iterations: + value: 20000 +jacobian_proxy_weight: + value: 0.1 +matrix_lr: + value: 0.025 +model_dim: + value: 512 +n_params: + value: 26927200 +num_layers: + value: 11 +num_passes: + value: 4 +residual_scale_init: + value: 0.5 +scalar_lr: + value: 0.025 +seed: + value: 1337 +train_batch_tokens: + value: 786432 +train_seq_len: + value: 2048 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/output.log new file mode 100644 index 0000000000..ca77ba8372 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/output.log @@ -0,0 +1,41 @@ +wandb:initialized +Traceback (most recent call last): + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2148, in + main() + File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1814, in main + (warmup_loss * grad_scale).backward() + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_tensor.py", line 631, in backward + torch.autograd.backward( + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/__init__.py", line 381, in backward + _engine_run_backward( + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/graph.py", line 869, in _engine_run_backward + return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 317, in apply + return user_fn(self, *args) + ^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2748, in backward + return impl_fn() + ^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2734, in impl_fn + out = CompiledFunction._backward_impl(ctx, all_args) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2918, in _backward_impl + out = call_func_at_runtime_with_args( + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args + out = normalize_as_list(f(args)) + ^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn + return fn(*args, **kwargs) + ^^^^^^^^^^^^^^^^^^^ + File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__ + return self.current_callable(inputs) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + File 
"/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run + out = model(new_inputs) + ^^^^^^^^^^^^^^^^^ + File "/tmp/torchinductor_nesta/wm/cwmtq2g54vzhxzvsc4odvxx7srlroi2tosqcftjhmyw2c637ogrq.py", line 12255, in call + buf54 = empty_strided_cuda((48, 2048, 512), (1048576, 512, 1), torch.bfloat16) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 18.69 MiB is free. Process 448538 has 758.00 MiB memory in use. Process 448539 has 758.00 MiB memory in use. Process 448534 has 51.16 GiB memory in use. Process 448547 has 38.69 GiB memory in use. Including non-PyTorch memory, this process has 48.42 GiB memory in use. Of the allocated memory 47.76 GiB is allocated by PyTorch, and 3.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf) diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 +nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 
+nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/wandb-metadata.json new file mode 100644 index 0000000000..73249a2f8a --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/wandb-metadata.json @@ -0,0 +1,53 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-26T23:01:58.565014Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm", + "--lora-rank", + "0" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "40807366656" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "mp6843cl3s2nwlrmg1chkb6ac5g1amp8" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/wandb-summary.json new file mode 100644 index 0000000000..2d7b734886 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/wandb-summary.json @@ -0,0 +1 @@ +{"_wandb":{"runtime":25},"_runtime":25} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/config.yaml new file mode 100644 index 0000000000..87db347a14 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/config.yaml @@ -0,0 +1,100 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + fs7tgl75pdr37e41mtkm0hv9p03z7tmv: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.1" + - --no-interpass-rmsnorm + - --lora-rank + - "0" + codePath: 
records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "40990945280" + email: nesta.midavaine@prosus.com + executable: /home/nesta/parameter-golf/.venv/bin/python3 + git: + commit: e07e44321b5c5af051343a3a16d83f0766e85597 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + python: CPython 3.12.3 + root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback + startedAt: "2026-03-27T08:09:59.765769Z" + writerId: fs7tgl75pdr37e41mtkm0hv9p03z7tmv + m: [] + python_version: 3.12.3 + t: + "1": + - 1 + "2": + - 1 + "3": + - 2 + - 13 + - 16 + - 61 + "4": 3.12.3 + "5": 0.25.1 + "10": + - 20 + "12": 0.25.1 + "13": linux-x86_64 +core_end: + value: 7 +core_start: + value: 4 +feedback_mode: + value: diagonal +feedback_rank: + value: 2 +interpass_rmsnorm: + value: false +iterations: + value: 20000 +jacobian_proxy_weight: + value: 0.1 +matrix_lr: + value: 0.025 +model_dim: + value: 512 +n_params: + value: 26927710 +num_layers: + value: 11 +num_passes: + value: 2 +residual_scale_init: + value: 0.5 +scalar_lr: + value: 0.025 +seed: + value: 1337 +train_batch_tokens: + value: 786432 +train_seq_len: + value: 2048 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/output.log new file mode 100644 index 0000000000..50a741a513 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/output.log @@ -0,0 +1,379 @@ +wandb:initialized +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/20000 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms +step:1/20000 train_loss:6.9310 grad_norm:0.3715 train_time:739ms step_avg:739.22ms +step:2/20000 train_loss:8.4759 grad_norm:3.5698 train_time:1441ms step_avg:720.50ms +step:3/20000 train_loss:7.5787 grad_norm:2.0259 train_time:2207ms step_avg:735.61ms +step:4/20000 train_loss:7.3563 grad_norm:1.4981 train_time:2972ms step_avg:742.93ms +step:5/20000 train_loss:7.1725 grad_norm:1.6464 train_time:3729ms step_avg:745.72ms +step:6/20000 train_loss:7.1055 grad_norm:1.5402 train_time:4489ms step_avg:748.20ms +step:7/20000 train_loss:7.0940 grad_norm:1.7366 train_time:5253ms step_avg:750.42ms +step:8/20000 train_loss:6.9891 grad_norm:1.2439 train_time:6012ms step_avg:751.51ms +step:9/20000 train_loss:6.6063 grad_norm:0.9271 train_time:6770ms step_avg:752.25ms +step:10/20000 train_loss:6.2335 grad_norm:0.8702 train_time:7539ms step_avg:753.87ms 
+step:50/20000 train_loss:3.7232 grad_norm:0.7000 train_time:38194ms step_avg:763.87ms +step:100/20000 train_loss:3.2045 grad_norm:0.8824 train_time:76601ms step_avg:766.01ms +step:150/20000 train_loss:2.8496 grad_norm:0.3729 train_time:114956ms step_avg:766.37ms +step:200/20000 train_loss:2.6184 grad_norm:0.3442 train_time:153346ms step_avg:766.73ms +step:250/20000 train_loss:2.6059 grad_norm:0.2754 train_time:191710ms step_avg:766.84ms +step:300/20000 train_loss:2.4690 grad_norm:0.3187 train_time:230112ms step_avg:767.04ms +step:350/20000 train_loss:2.5041 grad_norm:0.1730 train_time:268510ms step_avg:767.17ms +step:400/20000 train_loss:2.4246 grad_norm:0.2139 train_time:306906ms step_avg:767.26ms +step:450/20000 train_loss:2.2470 grad_norm:0.1624 train_time:345332ms step_avg:767.40ms +step:500/20000 train_loss:2.3033 grad_norm:0.1807 train_time:383744ms step_avg:767.49ms +step:500/20000 val_loss:2.3232 val_bpb:1.3759 train_time:383791ms step_avg:767.58ms +step:550/20000 train_loss:2.3638 grad_norm:0.1696 train_time:422169ms step_avg:767.58ms +step:600/20000 train_loss:2.2655 grad_norm:0.1474 train_time:460616ms step_avg:767.69ms +step:650/20000 train_loss:2.2434 grad_norm:0.1741 train_time:499080ms step_avg:767.82ms +step:700/20000 train_loss:2.3153 grad_norm:0.1303 train_time:537548ms step_avg:767.93ms +step:750/20000 train_loss:2.2883 grad_norm:0.0978 train_time:576022ms step_avg:768.03ms +step:800/20000 train_loss:2.2638 grad_norm:0.0930 train_time:614511ms step_avg:768.14ms +step:850/20000 train_loss:2.1916 grad_norm:0.0668 train_time:653031ms step_avg:768.27ms +step:900/20000 train_loss:2.1043 grad_norm:0.0706 train_time:691568ms step_avg:768.41ms +step:950/20000 train_loss:2.3072 grad_norm:0.0614 train_time:730098ms step_avg:768.52ms +step:1000/20000 train_loss:2.2373 grad_norm:0.0756 train_time:768630ms step_avg:768.63ms +step:1000/20000 val_loss:2.1827 val_bpb:1.2927 train_time:768677ms step_avg:768.68ms +step:1050/20000 train_loss:2.1577 grad_norm:0.0518 train_time:807163ms step_avg:768.73ms +step:1100/20000 train_loss:2.1862 grad_norm:0.1172 train_time:845684ms step_avg:768.80ms +step:1150/20000 train_loss:2.1385 grad_norm:0.0562 train_time:884226ms step_avg:768.89ms +step:1200/20000 train_loss:2.1870 grad_norm:0.1121 train_time:922748ms step_avg:768.96ms +step:1250/20000 train_loss:2.2093 grad_norm:0.0998 train_time:961291ms step_avg:769.03ms +step:1300/20000 train_loss:2.1788 grad_norm:0.0968 train_time:999829ms step_avg:769.10ms +step:1350/20000 train_loss:2.1574 grad_norm:0.1471 train_time:1038366ms step_avg:769.16ms +step:1400/20000 train_loss:2.1680 grad_norm:0.0608 train_time:1076900ms step_avg:769.21ms +step:1450/20000 train_loss:2.1623 grad_norm:0.0881 train_time:1115450ms step_avg:769.28ms +step:1500/20000 train_loss:2.1325 grad_norm:0.0715 train_time:1153976ms step_avg:769.32ms +step:1500/20000 val_loss:2.1196 val_bpb:1.2554 train_time:1154023ms step_avg:769.35ms +step:1550/20000 train_loss:2.1034 grad_norm:0.0555 train_time:1192498ms step_avg:769.35ms +step:1600/20000 train_loss:2.1841 grad_norm:0.0690 train_time:1231045ms step_avg:769.40ms +step:1650/20000 train_loss:1.9679 grad_norm:0.0623 train_time:1269584ms step_avg:769.44ms +step:1700/20000 train_loss:2.0979 grad_norm:0.1024 train_time:1308158ms step_avg:769.50ms +step:1750/20000 train_loss:2.0668 grad_norm:0.0916 train_time:1346716ms step_avg:769.55ms +step:1800/20000 train_loss:2.1070 grad_norm:0.0947 train_time:1385264ms step_avg:769.59ms +step:1850/20000 train_loss:2.1232 grad_norm:0.0632 
train_time:1423808ms step_avg:769.63ms +step:1900/20000 train_loss:2.0734 grad_norm:0.0745 train_time:1462341ms step_avg:769.65ms +step:1950/20000 train_loss:2.0600 grad_norm:0.1123 train_time:1500897ms step_avg:769.69ms +step:2000/20000 train_loss:2.3179 grad_norm:0.0744 train_time:1539440ms step_avg:769.72ms +step:2000/20000 val_loss:2.0970 val_bpb:1.2420 train_time:1539487ms step_avg:769.74ms +step:2050/20000 train_loss:2.0834 grad_norm:0.0576 train_time:1577995ms step_avg:769.75ms +step:2100/20000 train_loss:2.0619 grad_norm:0.0535 train_time:1616543ms step_avg:769.78ms +step:2150/20000 train_loss:2.0442 grad_norm:0.0773 train_time:1655080ms step_avg:769.80ms +step:2200/20000 train_loss:2.2001 grad_norm:0.0771 train_time:1693618ms step_avg:769.83ms +step:2250/20000 train_loss:2.0929 grad_norm:0.0580 train_time:1732173ms step_avg:769.85ms +step:2300/20000 train_loss:2.0744 grad_norm:0.0679 train_time:1770710ms step_avg:769.87ms +step:2350/20000 train_loss:2.0331 grad_norm:0.1111 train_time:1809258ms step_avg:769.90ms +step:2400/20000 train_loss:2.1487 grad_norm:0.0711 train_time:1847814ms step_avg:769.92ms +step:2450/20000 train_loss:2.1108 grad_norm:0.0571 train_time:1886353ms step_avg:769.94ms +step:2500/20000 train_loss:2.0727 grad_norm:0.0554 train_time:1924887ms step_avg:769.95ms +step:2500/20000 val_loss:2.0778 val_bpb:1.2306 train_time:1924935ms step_avg:769.97ms +step:2550/20000 train_loss:2.0748 grad_norm:0.0686 train_time:1963451ms step_avg:769.98ms +step:2600/20000 train_loss:2.0557 grad_norm:0.0604 train_time:2001997ms step_avg:770.00ms +step:2650/20000 train_loss:2.0616 grad_norm:0.0686 train_time:2040534ms step_avg:770.01ms +step:2700/20000 train_loss:2.0896 grad_norm:0.1260 train_time:2079065ms step_avg:770.02ms +step:2750/20000 train_loss:2.0721 grad_norm:0.0683 train_time:2117609ms step_avg:770.04ms +step:2800/20000 train_loss:2.1120 grad_norm:0.0617 train_time:2156137ms step_avg:770.05ms +step:2850/20000 train_loss:2.0672 grad_norm:0.0618 train_time:2194684ms step_avg:770.06ms +step:2900/20000 train_loss:2.0771 grad_norm:0.0606 train_time:2233233ms step_avg:770.08ms +step:2950/20000 train_loss:2.1240 grad_norm:0.0555 train_time:2271768ms step_avg:770.09ms +step:3000/20000 train_loss:2.0159 grad_norm:0.1451 train_time:2310313ms step_avg:770.10ms +step:3000/20000 val_loss:2.0668 val_bpb:1.2241 train_time:2310360ms step_avg:770.12ms +step:3050/20000 train_loss:2.0182 grad_norm:0.0595 train_time:2348849ms step_avg:770.11ms +step:3100/20000 train_loss:2.0905 grad_norm:0.1292 train_time:2387390ms step_avg:770.13ms +step:3150/20000 train_loss:2.1075 grad_norm:0.0625 train_time:2425926ms step_avg:770.14ms +step:3200/20000 train_loss:2.0823 grad_norm:0.0562 train_time:2464469ms step_avg:770.15ms +step:3250/20000 train_loss:2.0517 grad_norm:0.0675 train_time:2502996ms step_avg:770.15ms +step:3300/20000 train_loss:2.0328 grad_norm:0.0879 train_time:2541545ms step_avg:770.17ms +step:3350/20000 train_loss:2.0720 grad_norm:0.0532 train_time:2580093ms step_avg:770.18ms +step:3400/20000 train_loss:2.1303 grad_norm:0.1521 train_time:2618607ms step_avg:770.18ms +step:3450/20000 train_loss:2.0795 grad_norm:0.0880 train_time:2657152ms step_avg:770.19ms +step:3500/20000 train_loss:2.0592 grad_norm:0.0568 train_time:2695692ms step_avg:770.20ms +step:3500/20000 val_loss:2.0566 val_bpb:1.2180 train_time:2695739ms step_avg:770.21ms +step:3550/20000 train_loss:2.0263 grad_norm:0.1106 train_time:2734239ms step_avg:770.21ms +step:3600/20000 train_loss:2.0242 grad_norm:0.0546 train_time:2772777ms 
step_avg:770.22ms +step:3650/20000 train_loss:2.0434 grad_norm:0.0792 train_time:2811292ms step_avg:770.22ms +step:3700/20000 train_loss:2.0376 grad_norm:0.0698 train_time:2849835ms step_avg:770.23ms +step:3750/20000 train_loss:2.0472 grad_norm:0.0748 train_time:2888385ms step_avg:770.24ms +step:3800/20000 train_loss:2.0429 grad_norm:0.0871 train_time:2926943ms step_avg:770.25ms +step:3850/20000 train_loss:2.0756 grad_norm:0.0799 train_time:2965493ms step_avg:770.26ms +step:3900/20000 train_loss:2.0692 grad_norm:0.0643 train_time:3004035ms step_avg:770.27ms +step:3950/20000 train_loss:2.0267 grad_norm:0.0667 train_time:3042573ms step_avg:770.27ms +step:4000/20000 train_loss:2.0478 grad_norm:0.1092 train_time:3081111ms step_avg:770.28ms +step:4000/20000 val_loss:2.0501 val_bpb:1.2142 train_time:3081158ms step_avg:770.29ms +step:4050/20000 train_loss:2.0524 grad_norm:0.0595 train_time:3119653ms step_avg:770.28ms +step:4100/20000 train_loss:1.9268 grad_norm:0.0611 train_time:3158208ms step_avg:770.29ms +step:4150/20000 train_loss:2.0567 grad_norm:0.0683 train_time:3196731ms step_avg:770.30ms +step:4200/20000 train_loss:2.0971 grad_norm:0.0701 train_time:3235281ms step_avg:770.30ms +step:4250/20000 train_loss:2.0465 grad_norm:0.0670 train_time:3273811ms step_avg:770.31ms +step:4300/20000 train_loss:2.0333 grad_norm:0.0543 train_time:3312336ms step_avg:770.31ms +step:4350/20000 train_loss:2.0258 grad_norm:0.0616 train_time:3350887ms step_avg:770.32ms +step:4400/20000 train_loss:2.0344 grad_norm:0.0614 train_time:3389423ms step_avg:770.32ms +step:4450/20000 train_loss:2.0645 grad_norm:0.0506 train_time:3427944ms step_avg:770.32ms +step:4500/20000 train_loss:2.0679 grad_norm:0.0550 train_time:3466466ms step_avg:770.33ms +step:4500/20000 val_loss:2.0466 val_bpb:1.2121 train_time:3466512ms step_avg:770.34ms +step:4550/20000 train_loss:2.0316 grad_norm:0.0607 train_time:3505002ms step_avg:770.33ms +step:4600/20000 train_loss:1.9490 grad_norm:0.0543 train_time:3543530ms step_avg:770.33ms +step:4650/20000 train_loss:2.0275 grad_norm:0.0590 train_time:3582052ms step_avg:770.33ms +step:4700/20000 train_loss:2.0597 grad_norm:0.1022 train_time:3620609ms step_avg:770.34ms +step:4750/20000 train_loss:2.0241 grad_norm:0.1000 train_time:3659144ms step_avg:770.35ms +step:4800/20000 train_loss:2.0349 grad_norm:0.0580 train_time:3697672ms step_avg:770.35ms +step:4850/20000 train_loss:2.0473 grad_norm:0.0532 train_time:3736199ms step_avg:770.35ms +step:4900/20000 train_loss:2.0297 grad_norm:0.0634 train_time:3774723ms step_avg:770.35ms +step:4950/20000 train_loss:1.9799 grad_norm:0.0504 train_time:3813238ms step_avg:770.35ms +step:5000/20000 train_loss:2.0735 grad_norm:0.0549 train_time:3851805ms step_avg:770.36ms +step:5000/20000 val_loss:2.0184 val_bpb:1.1954 train_time:3851852ms step_avg:770.37ms +step:5050/20000 train_loss:1.9940 grad_norm:0.0666 train_time:3890332ms step_avg:770.36ms +step:5100/20000 train_loss:1.9998 grad_norm:0.0478 train_time:3928885ms step_avg:770.37ms +step:5150/20000 train_loss:2.0985 grad_norm:0.0906 train_time:3967424ms step_avg:770.37ms +step:5200/20000 train_loss:2.0041 grad_norm:0.0450 train_time:4005964ms step_avg:770.38ms +step:5250/20000 train_loss:1.9757 grad_norm:0.0451 train_time:4044484ms step_avg:770.38ms +step:5300/20000 train_loss:1.9557 grad_norm:0.0544 train_time:4083005ms step_avg:770.38ms +step:5350/20000 train_loss:1.9972 grad_norm:0.0399 train_time:4121527ms step_avg:770.38ms +step:5400/20000 train_loss:2.0035 grad_norm:0.0433 train_time:4160051ms step_avg:770.38ms 
+step:5450/20000 train_loss:2.0130 grad_norm:0.0411 train_time:4198561ms step_avg:770.38ms +step:5500/20000 train_loss:2.0100 grad_norm:0.0376 train_time:4237081ms step_avg:770.38ms +step:5500/20000 val_loss:1.9818 val_bpb:1.1737 train_time:4237128ms step_avg:770.39ms +step:5550/20000 train_loss:1.9694 grad_norm:0.0464 train_time:4275608ms step_avg:770.38ms +step:5600/20000 train_loss:1.9396 grad_norm:0.0419 train_time:4314139ms step_avg:770.38ms +step:5650/20000 train_loss:2.0040 grad_norm:0.0377 train_time:4352662ms step_avg:770.38ms +step:5700/20000 train_loss:1.9579 grad_norm:0.0492 train_time:4391196ms step_avg:770.39ms +step:5750/20000 train_loss:1.9341 grad_norm:0.0370 train_time:4429712ms step_avg:770.38ms +step:5800/20000 train_loss:1.8494 grad_norm:0.0476 train_time:4468236ms step_avg:770.39ms +step:5850/20000 train_loss:1.8418 grad_norm:0.0404 train_time:4506763ms step_avg:770.39ms +swa:start step:5900 +step:5900/20000 train_loss:1.9171 grad_norm:0.0399 train_time:4545281ms step_avg:770.39ms +step:5950/20000 train_loss:1.9844 grad_norm:0.0432 train_time:4583947ms step_avg:770.41ms +late_qat:enabled step:5976 scale:0.1496 core_quant:on +step:6000/20000 train_loss:1.9423 grad_norm:0.0376 train_time:4656702ms step_avg:776.12ms +step:6000/20000 val_loss:1.9425 val_bpb:1.1505 train_time:4656808ms step_avg:776.13ms +step:6050/20000 train_loss:1.9037 grad_norm:0.0447 train_time:4695045ms step_avg:776.04ms +step:6100/20000 train_loss:1.9130 grad_norm:0.0341 train_time:4733411ms step_avg:775.97ms +step:6150/20000 train_loss:1.9282 grad_norm:0.0331 train_time:4771676ms step_avg:775.88ms +step:6188/20000 val_loss:1.9310 val_bpb:1.1437 train_time:4800792ms step_avg:775.82ms +stopping_early: wallclock_cap train_time:4800792ms step:6188/20000 +peak memory allocated: 28656 MiB reserved: 28704 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:1.9264 val_bpb:1.1409 eval_time:18187ms +Serialized model: 106025719 bytes +Code size: 105268 bytes +Serialized model int6+lzma: 16459152 bytes +Total submission size int6+lzma: 16564420 bytes +final_int6_roundtrip val_loss:1.9355 val_bpb:1.1463 eval_time:36060ms +final_int6_roundtrip_exact val_loss:1.93549859 val_bpb:1.14631129 +final_int6_sliding_window val_loss:1.8956 val_bpb:1.1227 stride:64 eval_time:643423ms +final_int6_sliding_window_exact val_loss:1.89556561 val_bpb:1.12266370 +final_int8_zlib_roundtrip_exact val_loss:1.89556561 val_bpb:1.12266370 +ttt_sliding:start chunks=1893 chunk_tokens=32768 total_windows=969088 stride=64 ttt_lr=0.002 ttt_epochs=3 freeze_blocks=2 +ttt_sliding:params unfrozen=26923598 frozen=4112 + ttt_chunk [1/1893] bpb=1.213507 time=1.2s + ttt_chunk [11/1893] bpb=1.115005 time=11.9s + ttt_chunk [21/1893] bpb=1.124115 time=22.5s + ttt_chunk [31/1893] bpb=1.129630 time=33.2s + ttt_chunk [41/1893] bpb=1.125661 time=43.9s + ttt_chunk [51/1893] bpb=1.127271 time=54.6s + ttt_chunk [61/1893] bpb=1.131076 time=65.2s + ttt_chunk [71/1893] bpb=1.129445 time=75.9s + ttt_chunk [81/1893] bpb=1.125956 time=86.6s + ttt_chunk [91/1893] bpb=1.125065 time=97.3s + ttt_chunk [101/1893] bpb=1.126017 time=107.9s + ttt_chunk [111/1893] bpb=1.126115 time=118.6s + ttt_chunk [121/1893] bpb=1.122503 time=129.3s + ttt_chunk [131/1893] bpb=1.121744 time=139.9s + ttt_chunk [141/1893] bpb=1.120618 time=150.6s + ttt_chunk [151/1893] bpb=1.120790 time=161.3s + ttt_chunk [161/1893] bpb=1.121623 time=172.0s + ttt_chunk [171/1893] bpb=1.123693 time=182.6s + ttt_chunk [181/1893] bpb=1.123772 time=193.3s + ttt_chunk [191/1893] bpb=1.126204 time=204.0s + 
ttt_chunk [201/1893] bpb=1.125769 time=214.6s + ttt_chunk [211/1893] bpb=1.124798 time=225.3s + ttt_chunk [221/1893] bpb=1.125694 time=236.0s + ttt_chunk [231/1893] bpb=1.125431 time=246.7s + ttt_chunk [241/1893] bpb=1.125671 time=257.3s + ttt_chunk [251/1893] bpb=1.125230 time=268.0s + ttt_chunk [261/1893] bpb=1.124541 time=278.7s + ttt_chunk [271/1893] bpb=1.123621 time=289.3s + ttt_chunk [281/1893] bpb=1.125199 time=300.0s + ttt_chunk [291/1893] bpb=1.124816 time=310.7s + ttt_chunk [301/1893] bpb=1.125690 time=321.3s + ttt_chunk [311/1893] bpb=1.125780 time=332.0s + ttt_chunk [321/1893] bpb=1.126532 time=342.7s + ttt_chunk [331/1893] bpb=1.125992 time=353.4s + ttt_chunk [341/1893] bpb=1.125619 time=364.0s + ttt_chunk [351/1893] bpb=1.126331 time=374.7s + ttt_chunk [361/1893] bpb=1.127106 time=385.4s + ttt_chunk [371/1893] bpb=1.126994 time=396.0s + ttt_chunk [381/1893] bpb=1.126771 time=406.7s + ttt_chunk [391/1893] bpb=1.127473 time=417.4s + ttt_chunk [401/1893] bpb=1.127019 time=428.0s + ttt_chunk [411/1893] bpb=1.126072 time=438.7s + ttt_chunk [421/1893] bpb=1.126211 time=449.4s + ttt_chunk [431/1893] bpb=1.126627 time=460.1s + ttt_chunk [441/1893] bpb=1.126026 time=470.7s + ttt_chunk [451/1893] bpb=1.126178 time=481.4s + ttt_chunk [461/1893] bpb=1.126064 time=492.1s + ttt_chunk [471/1893] bpb=1.125649 time=502.7s + ttt_chunk [481/1893] bpb=1.125465 time=513.4s + ttt_chunk [491/1893] bpb=1.125641 time=524.1s + ttt_chunk [501/1893] bpb=1.125403 time=534.8s + ttt_chunk [511/1893] bpb=1.124922 time=545.4s + ttt_chunk [521/1893] bpb=1.124606 time=556.1s + ttt_chunk [531/1893] bpb=1.125329 time=566.8s + ttt_chunk [541/1893] bpb=1.125457 time=577.5s + ttt_chunk [551/1893] bpb=1.124939 time=588.1s + ttt_chunk [561/1893] bpb=1.124808 time=598.8s + ttt_chunk [571/1893] bpb=1.124539 time=609.5s + ttt_chunk [581/1893] bpb=1.124189 time=620.1s + ttt_chunk [591/1893] bpb=1.123646 time=630.8s + ttt_chunk [601/1893] bpb=1.123663 time=641.5s + ttt_chunk [611/1893] bpb=1.123360 time=652.2s + ttt_chunk [621/1893] bpb=1.123207 time=662.8s + ttt_chunk [631/1893] bpb=1.122962 time=673.5s + ttt_chunk [641/1893] bpb=1.122517 time=684.2s + ttt_chunk [651/1893] bpb=1.122074 time=694.8s + ttt_chunk [661/1893] bpb=1.121970 time=705.5s + ttt_chunk [671/1893] bpb=1.121500 time=716.2s + ttt_chunk [681/1893] bpb=1.120954 time=726.9s + ttt_chunk [691/1893] bpb=1.121049 time=737.5s + ttt_chunk [701/1893] bpb=1.120237 time=748.2s + ttt_chunk [711/1893] bpb=1.120250 time=758.9s + ttt_chunk [721/1893] bpb=1.120159 time=769.5s + ttt_chunk [731/1893] bpb=1.120390 time=780.2s + ttt_chunk [741/1893] bpb=1.120277 time=790.9s + ttt_chunk [751/1893] bpb=1.119980 time=801.6s + ttt_chunk [761/1893] bpb=1.120120 time=812.2s + ttt_chunk [771/1893] bpb=1.119952 time=822.9s + ttt_chunk [781/1893] bpb=1.120123 time=833.6s + ttt_chunk [791/1893] bpb=1.119975 time=844.2s + ttt_chunk [801/1893] bpb=1.119907 time=854.9s + ttt_chunk [811/1893] bpb=1.119919 time=865.6s + ttt_chunk [821/1893] bpb=1.119806 time=876.3s + ttt_chunk [831/1893] bpb=1.119523 time=886.9s + ttt_chunk [841/1893] bpb=1.119272 time=897.6s + ttt_chunk [851/1893] bpb=1.119326 time=908.3s + ttt_chunk [861/1893] bpb=1.119397 time=919.0s + ttt_chunk [871/1893] bpb=1.119597 time=929.6s + ttt_chunk [881/1893] bpb=1.119595 time=940.3s + ttt_chunk [891/1893] bpb=1.119057 time=951.0s + ttt_chunk [901/1893] bpb=1.119066 time=961.6s + ttt_chunk [911/1893] bpb=1.118915 time=972.3s + ttt_chunk [921/1893] bpb=1.119057 time=983.0s + ttt_chunk [931/1893] bpb=1.119000 time=993.7s + 
ttt_chunk [941/1893] bpb=1.119211 time=1004.3s + ttt_chunk [951/1893] bpb=1.119510 time=1015.0s + ttt_chunk [961/1893] bpb=1.119814 time=1025.7s + ttt_chunk [971/1893] bpb=1.120172 time=1036.4s + ttt_chunk [981/1893] bpb=1.120386 time=1047.0s + ttt_chunk [991/1893] bpb=1.120296 time=1057.7s + ttt_chunk [1001/1893] bpb=1.120622 time=1068.4s + ttt_chunk [1011/1893] bpb=1.120769 time=1079.0s + ttt_chunk [1021/1893] bpb=1.121058 time=1089.7s + ttt_chunk [1031/1893] bpb=1.121451 time=1100.4s + ttt_chunk [1041/1893] bpb=1.121962 time=1111.0s + ttt_chunk [1051/1893] bpb=1.121834 time=1121.7s + ttt_chunk [1061/1893] bpb=1.121931 time=1132.4s + ttt_chunk [1071/1893] bpb=1.122074 time=1143.1s + ttt_chunk [1081/1893] bpb=1.122119 time=1153.7s + ttt_chunk [1091/1893] bpb=1.122380 time=1164.4s + ttt_chunk [1101/1893] bpb=1.122531 time=1175.1s + ttt_chunk [1111/1893] bpb=1.122350 time=1185.7s + ttt_chunk [1121/1893] bpb=1.122124 time=1196.4s + ttt_chunk [1131/1893] bpb=1.122020 time=1207.1s + ttt_chunk [1141/1893] bpb=1.121778 time=1217.7s + ttt_chunk [1151/1893] bpb=1.121805 time=1228.4s + ttt_chunk [1161/1893] bpb=1.121646 time=1239.1s + ttt_chunk [1171/1893] bpb=1.121470 time=1249.7s + ttt_chunk [1181/1893] bpb=1.121248 time=1260.4s + ttt_chunk [1191/1893] bpb=1.121404 time=1271.1s + ttt_chunk [1201/1893] bpb=1.121610 time=1281.8s + ttt_chunk [1211/1893] bpb=1.121209 time=1292.4s + ttt_chunk [1221/1893] bpb=1.121550 time=1303.1s + ttt_chunk [1231/1893] bpb=1.121493 time=1313.8s + ttt_chunk [1241/1893] bpb=1.121200 time=1324.4s + ttt_chunk [1251/1893] bpb=1.120654 time=1335.1s + ttt_chunk [1261/1893] bpb=1.120400 time=1345.8s + ttt_chunk [1271/1893] bpb=1.120154 time=1356.4s + ttt_chunk [1281/1893] bpb=1.119845 time=1367.1s + ttt_chunk [1291/1893] bpb=1.119603 time=1377.8s + ttt_chunk [1301/1893] bpb=1.119559 time=1388.4s + ttt_chunk [1311/1893] bpb=1.119294 time=1399.1s + ttt_chunk [1321/1893] bpb=1.119006 time=1409.8s + ttt_chunk [1331/1893] bpb=1.118778 time=1420.5s + ttt_chunk [1341/1893] bpb=1.118650 time=1431.1s + ttt_chunk [1351/1893] bpb=1.118499 time=1441.8s + ttt_chunk [1361/1893] bpb=1.118620 time=1452.5s + ttt_chunk [1371/1893] bpb=1.118833 time=1463.1s + ttt_chunk [1381/1893] bpb=1.119039 time=1473.8s + ttt_chunk [1391/1893] bpb=1.118831 time=1484.5s + ttt_chunk [1401/1893] bpb=1.118873 time=1495.1s + ttt_chunk [1411/1893] bpb=1.118989 time=1505.8s + ttt_chunk [1421/1893] bpb=1.118980 time=1516.5s + ttt_chunk [1431/1893] bpb=1.118955 time=1527.1s + ttt_chunk [1441/1893] bpb=1.119428 time=1537.8s + ttt_chunk [1451/1893] bpb=1.119298 time=1548.5s + ttt_chunk [1461/1893] bpb=1.119224 time=1559.2s + ttt_chunk [1471/1893] bpb=1.119815 time=1569.8s + ttt_chunk [1481/1893] bpb=1.119692 time=1580.5s + ttt_chunk [1491/1893] bpb=1.120064 time=1591.2s + ttt_chunk [1501/1893] bpb=1.120046 time=1601.9s + ttt_chunk [1511/1893] bpb=1.119995 time=1612.5s + ttt_chunk [1521/1893] bpb=1.120109 time=1623.2s + ttt_chunk [1531/1893] bpb=1.120314 time=1633.9s + ttt_chunk [1541/1893] bpb=1.120389 time=1644.6s + ttt_chunk [1551/1893] bpb=1.120624 time=1655.2s + ttt_chunk [1561/1893] bpb=1.120711 time=1665.9s + ttt_chunk [1571/1893] bpb=1.120856 time=1676.6s + ttt_chunk [1581/1893] bpb=1.121011 time=1687.2s + ttt_chunk [1591/1893] bpb=1.121070 time=1697.9s + ttt_chunk [1601/1893] bpb=1.121194 time=1708.6s + ttt_chunk [1611/1893] bpb=1.121454 time=1719.3s + ttt_chunk [1621/1893] bpb=1.121318 time=1729.9s + ttt_chunk [1631/1893] bpb=1.121361 time=1740.6s + ttt_chunk [1641/1893] bpb=1.121385 time=1751.3s + ttt_chunk 
[1651/1893] bpb=1.121438 time=1762.0s + ttt_chunk [1661/1893] bpb=1.121577 time=1772.6s + ttt_chunk [1671/1893] bpb=1.121754 time=1783.3s + ttt_chunk [1681/1893] bpb=1.121845 time=1794.0s + ttt_chunk [1691/1893] bpb=1.121951 time=1804.6s + ttt_chunk [1701/1893] bpb=1.122049 time=1815.3s + ttt_chunk [1711/1893] bpb=1.122031 time=1826.0s + ttt_chunk [1721/1893] bpb=1.121864 time=1836.7s + ttt_chunk [1731/1893] bpb=1.121961 time=1847.3s + ttt_chunk [1741/1893] bpb=1.121701 time=1858.0s + ttt_chunk [1751/1893] bpb=1.121579 time=1868.7s + ttt_chunk [1761/1893] bpb=1.121622 time=1879.3s + ttt_chunk [1771/1893] bpb=1.121568 time=1890.0s + ttt_chunk [1781/1893] bpb=1.121464 time=1900.7s + ttt_chunk [1791/1893] bpb=1.121129 time=1911.3s + ttt_chunk [1801/1893] bpb=1.121118 time=1922.0s + ttt_chunk [1811/1893] bpb=1.120975 time=1932.7s + ttt_chunk [1821/1893] bpb=1.121035 time=1943.3s + ttt_chunk [1831/1893] bpb=1.120887 time=1954.0s + ttt_chunk [1841/1893] bpb=1.120931 time=1964.7s + ttt_chunk [1851/1893] bpb=1.120761 time=1975.4s + ttt_chunk [1861/1893] bpb=1.120682 time=1986.0s + ttt_chunk [1871/1893] bpb=1.120616 time=1996.7s + ttt_chunk [1881/1893] bpb=1.120376 time=2007.4s + ttt_chunk [1891/1893] bpb=1.120360 time=2018.0s + ttt_chunk [1893/1893] bpb=1.120391 time=2019.8s +ttt_sliding:done val_loss=1.891728 val_bpb=1.120391 elapsed=2019.8s +legal_ttt val_loss:1.8917 val_bpb:1.1204 eval_time:2020286ms +legal_ttt_exact val_loss:1.89172798 val_bpb:1.12039083 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/requirements.txt new file mode 100644 index 0000000000..e3d59eea39 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/requirements.txt @@ -0,0 +1,101 @@ +mdurl==0.1.2 +nvidia-cudnn-cu13==9.19.0.56 +aiohttp==3.13.3 +nvidia-cufile==1.15.1.6 +charset-normalizer==3.4.6 +Jinja2==3.1.6 +hf-xet==1.4.2 +nvidia-cuda-nvrtc-cu12==12.8.93 +typer==0.24.1 +attrs==26.1.0 +certifi==2026.2.25 +triton==3.6.0 +nvidia-nccl-cu12==2.29.7 +wheel==0.46.3 +nvidia-nvtx-cu12==12.8.90 +gitdb==4.0.12 +dill==0.4.1 +nvidia-cuda-cupti==13.0.85 +tqdm==4.67.3 +pandas==3.0.1 +PyYAML==6.0.3 +annotated-types==0.7.0 +annotated-doc==0.0.4 +nvidia-nccl-cu13==2.28.9 +nvidia-cufft-cu12==11.3.3.83 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cudnn-cu12==9.20.0.48 +httpx==0.28.1 +packaging==26.0 +einops==0.8.2 +xxhash==3.6.0 +huggingface_hub==1.8.0 +Pygments==2.19.2 +markdown-it-py==4.0.0 +pydantic_core==2.41.5 +nvidia-cusparse-cu12==12.5.8.93 +cuda-toolkit==13.0.2 +rich==14.3.3 +six==1.17.0 +python-dateutil==2.9.0.post0 +nvidia-cusolver==12.0.4.66 +nvidia-nvshmem-cu13==3.4.5 +setuptools==81.0.0 +pyarrow==23.0.1 +typing_extensions==4.15.0 +MarkupSafe==3.0.3 +smmap==5.0.3 +filelock==3.25.2 +nvidia-nvtx==13.0.85 +multiprocess==0.70.19 +networkx==3.6.1 +pydantic==2.12.5 +nvidia-nvshmem-cu12==3.4.5 +nvidia-cublas-cu12==12.8.4.1 +anyio==4.13.0 +nvidia-cufft==12.0.0.61 +cuda-pathfinder==1.5.0 +mpmath==1.3.0 +cuda-bindings==13.2.0 +propcache==0.4.1 +yarl==1.23.0 +ninja==1.13.0 +typing-inspection==0.4.2 +idna==3.11 +h11==0.16.0 +urllib3==2.6.3 +multidict==6.7.1 +aiosignal==1.4.0 +nvidia-nvjitlink-cu12==12.8.93 +nvidia-cusparse==12.6.3.3 +aiohappyeyeballs==2.6.1 +psutil==7.2.2 +wandb==0.25.1 +protobuf==6.33.6 +click==8.3.1 +nvidia-cufile-cu12==1.13.1.3 +httpcore==1.0.9 +sentencepiece==0.2.1 +fsspec==2026.2.0 
+nvidia-curand-cu12==10.3.9.90 +nvidia-curand==10.4.0.35 +GitPython==3.1.46 +pip==26.0.1 +platformdirs==4.9.4 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti-cu12==12.8.90 +flash_attn==2.8.3 +nvidia-cusolver-cu12==11.7.3.90 +sympy==1.14.0 +torch==2.11.0 +numpy==2.4.3 +nvidia-cuda-runtime-cu12==12.8.90 +nvidia-cusparselt-cu13==0.8.0 +frozenlist==1.8.0 +sentry-sdk==2.56.0 +requests==2.33.0 +nvidia-cuda-runtime==13.0.96 +nvidia-nvjitlink==13.0.88 +nvidia-cusparselt-cu12==0.7.1 +shellingham==1.5.4 +datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/wandb-metadata.json new file mode 100644 index 0000000000..c079c01836 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/wandb-metadata.json @@ -0,0 +1,53 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-27T08:09:59.765769Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm", + "--lora-rank", + "0" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "e07e44321b5c5af051343a3a16d83f0766e85597" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python3", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "40990945280" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "fs7tgl75pdr37e41mtkm0hv9p03z7tmv" +} \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/wandb-summary.json new file mode 100644 index 0000000000..06a4f283c4 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/wandb-summary.json @@ -0,0 +1 @@ +{"_step":6188,"val_loss":1.9310067692893829,"grad_norm":0.03314758464694023,"_timestamp":1.7746041721319935e+09,"train_loss":1.92824387550354,"_wandb":{"runtime":7907},"val_bpb":1.1436509771287104,"lr_scale":0.02205410137804706,"step_avg_ms":775.8823216956074,"_runtime":7907.805803859} \ No newline at end of file diff --git a/sweep_5pass.log b/sweep_5pass.log new file mode 100644 index 0000000000..bee4819580 --- /dev/null +++ b/sweep_5pass.log @@ -0,0 +1,78 @@ +logs/6b6eee3d-3f48-4b4f-b563-24382676c38d.txt +val_bpb:enabled tokenizer_kind=sentencepiece 
tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=3 core_end=8 num_passes=5 stem=3 core=5 tail=3 +model_params:26927201 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:50 warmup_steps:5 max_wallclock_seconds:900.000 +seed:1337 +wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY. +wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin +wandb: WARNING Using a boolean value for 'reinit' is deprecated. Use 'return_previous' or 'finish_previous' instead. +wandb: setting up run ce1th47g +wandb: Tracking run with wandb version 0.25.1 +wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_130804-ce1th47g +wandb: Run `wandb offline` to turn off syncing. +wandb: Syncing run sweep_5pass_noRMS_j0.1 +wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf +wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/ce1th47g +wandb:initialized +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['9697.0', '10451.6', '11267.2', '12136.6', '13066.3', '14092.3', '15164.6', '16287.1', '17473.2', '18730.2', '17176.6', '18430.5', '19751.7', '21150.8', '22632.8', '20820.9', '22301.9', '23864.0', '25513.0', '27242.6', '25115.3', '26855.7', '28683.3', '30605.1', '32620.9'] growth=['1.076', '1.078', '1.078', '1.077', '1.077', '1.079', '1.076', '1.074', '1.073', '1.072', '1.075', '1.073', '1.072', '1.071', '1.070', '1.072', '1.071', '1.070', '1.069', '1.068', '1.070', '1.069', '1.068', '1.067', '1.066'] +step:1/50 train_loss:6.9310 train_time:3595ms step_avg:3594.71ms +step:2/50 train_loss:8.2519 train_time:7178ms step_avg:3588.94ms +step:3/50 train_loss:7.4903 train_time:10793ms step_avg:3597.54ms +step:4/50 train_loss:7.6972 train_time:14409ms step_avg:3602.19ms +step:5/50 train_loss:7.4012 train_time:18024ms step_avg:3604.78ms +step:6/50 train_loss:7.0545 train_time:21640ms step_avg:3606.66ms +step:7/50 train_loss:6.8316 train_time:25257ms step_avg:3608.12ms +step:8/50 train_loss:6.7797 train_time:28873ms step_avg:3609.17ms +step:9/50 train_loss:6.4829 train_time:32490ms step_avg:3610.01ms +step:10/50 train_loss:6.1437 train_time:36108ms step_avg:3610.80ms +step:20/50 train_loss:4.7380 train_time:72276ms step_avg:3613.81ms +step:25/50 val_loss:4.3185 val_bpb:2.5577 train_time:90403ms step_avg:3616.13ms h_norms=['12234.1', '10923.7', '9965.4', '9287.7', '8892.1', '8588.1', '8375.3', '8299.3', '8339.8', '8524.9', '8358.5', '8246.4', '8260.4', '8384.4', '8644.6', '8290.1', '8234.7', '8305.1', '8486.1', '8800.2', '8326.9', '8299.6', '8403.4', '8622.6', '8975.6'] growth=['0.878', '0.893', '0.912', '0.932', '0.957', '0.966', '0.975', '0.991', '1.005', '1.022', '0.978', '0.987', '1.002', '1.015', '1.031', '0.985', '0.993', 
'1.009', '1.022', '1.037', '0.988', '0.997', '1.013', '1.026', '1.041'] +step:30/50 train_loss:4.1582 train_time:108463ms step_avg:3615.44ms +step:40/50 train_loss:3.8947 train_time:144657ms step_avg:3616.43ms +step:50/50 train_loss:3.7328 train_time:180990ms step_avg:3619.81ms +step:50/50 val_loss:3.6922 val_bpb:2.1867 train_time:181025ms step_avg:3620.49ms h_norms=['17283.6', '14695.4', '12995.7', '11901.6', '11252.4', '11294.1', '11357.9', '11432.8', '11508.8', '11635.1', '11340.6', '11532.3', '11678.7', '11798.8', '11950.4', '11501.0', '11757.0', '11936.4', '12077.0', '12238.6', '11722.3', '12001.2', '12189.6', '12335.8', '12498.2'] growth=['0.819', '0.850', '0.884', '0.916', '0.945', '1.004', '1.006', '1.007', '1.007', '1.011', '1.023', '1.017', '1.013', '1.010', '1.013', '1.034', '1.022', '1.015', '1.012', '1.013', '1.038', '1.024', '1.016', '1.012', '1.013'] +peak memory allocated: 78629 MiB reserved: 79984 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:5.9839 val_bpb:3.5440 eval_time:98671ms +Serialized model: 106023671 bytes +Code size: 99082 bytes +Serialized model int6+lzma: 4803484 bytes +Total submission size int6+lzma: 4902566 bytes +final_int6_roundtrip val_loss:6.1921 val_bpb:3.6673 eval_time:98102ms +final_int6_roundtrip_exact val_loss:6.19214207 val_bpb:3.66733532 +wandb: updating run metadata +wandb: uploading history steps 15-15, summary, console lines 30-31 +wandb: +wandb: Run history: +wandb: lr_scale ▁▁▁▁▁▁▁▁▁▁▁▁▁▁ +wandb: step_avg_ms ▂▁▃▄▅▅▅▆▆▆▇▇▇█ +wandb: train_loss ▆█▇▇▇▆▆▆▅▅▃▂▁▁ +wandb: val_bpb █▂▁ +wandb: val_loss █▂▁ +wandb: +wandb: Run summary: +wandb: lr_scale 1 +wandb: step_avg_ms 3619.80867 +wandb: train_loss 3.73278 +wandb: val_bpb 2.18674 +wandb: val_loss 3.69221 +wandb: +wandb: 🚀 View run sweep_5pass_noRMS_j0.1 at: https://wandb.ai/propensity/parameter-golf/runs/ce1th47g +wandb: ⭐️ View project at: https://wandb.ai/propensity/parameter-golf +wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s) +wandb: Find logs at: ./wandb/run-20260326_130804-ce1th47g/logs diff --git a/sweep_6pass.log b/sweep_6pass.log new file mode 100644 index 0000000000..6c6f26804b --- /dev/null +++ b/sweep_6pass.log @@ -0,0 +1,44 @@ +logs/ee0de337-f03e-40d8-a0f6-5b16e06ada0f.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:10 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=3 core_end=8 num_passes=6 stem=3 core=5 tail=3 +model_params:26927202 +mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0 +XSA:last_4 active_layers:[8, 9, 10] +world_size:1 grad_accum_steps:8 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:50 warmup_steps:5 max_wallclock_seconds:900.000 +seed:1337 +wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY. +wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin +wandb: WARNING Using a boolean value for 'reinit' is deprecated. Use 'return_previous' or 'finish_previous' instead. 
+wandb: setting up run 5rznhtcu +wandb: Tracking run with wandb version 0.25.1 +wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132011-5rznhtcu +wandb: Run `wandb offline` to turn off syncing. +wandb: Syncing run sweep_6pass_noRMS_j0.1 +wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf +wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/5rznhtcu +wandb:initialized +warmup_step:1/5 +warmup_step:2/5 +warmup_step:3/5 +warmup_step:4/5 +warmup_step:5/5 +step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['9459.8', '10026.0', '10638.5', '11293.0', '11994.1', '12754.8', '13546.4', '14370.5', '15234.0', '16144.6', '14936.2', '15835.7', '16776.4', '17764.6', '18807.4', '17439.0', '18467.7', '19543.9', '20664.7', '21838.3', '20287.3', '21442.5', '22656.2', '23915.0', '25239.7', '23491.8', '24792.6', '26161.8', '27579.4', '29075.0'] growth=['1.057', '1.060', '1.061', '1.062', '1.062', '1.063', '1.062', '1.061', '1.060', '1.060', '1.061', '1.060', '1.059', '1.059', '1.059', '1.060', '1.059', '1.058', '1.057', '1.057', '1.058', '1.057', '1.057', '1.056', '1.055', '1.057', '1.055', '1.055', '1.054', '1.054'] +step:1/50 train_loss:6.9310 train_time:4157ms step_avg:4156.91ms +step:2/50 train_loss:8.1611 train_time:8307ms step_avg:4153.58ms +step:3/50 train_loss:7.5076 train_time:12490ms step_avg:4163.29ms +step:4/50 train_loss:7.6959 train_time:16672ms step_avg:4167.94ms +step:5/50 train_loss:7.4174 train_time:20854ms step_avg:4170.71ms +step:6/50 train_loss:7.1131 train_time:25035ms step_avg:4172.58ms +step:7/50 train_loss:6.9487 train_time:29219ms step_avg:4174.14ms +step:8/50 train_loss:6.7735 train_time:33402ms step_avg:4175.31ms +step:9/50 train_loss:6.4261 train_time:37586ms step_avg:4176.27ms +step:10/50 train_loss:6.0743 train_time:41771ms step_avg:4177.07ms +step:20/50 train_loss:4.7079 train_time:83751ms step_avg:4187.54ms +step:25/50 val_loss:4.2787 val_bpb:2.5341 train_time:104722ms step_avg:4188.87ms h_norms=['11702.3', '10378.0', '9422.4', '8775.5', '8411.0', '8171.4', '8013.7', '7986.7', '8068.5', '8259.8', '8039.9', '7983.9', '8037.6', '8189.1', '8437.5', '8051.7', '8055.2', '8157.9', '8354.3', '8641.4', '8155.2', '8190.2', '8320.5', '8544.8', '8857.6', '8314.2', '8362.6', '8506.3', '8746.7', '9076.3'] growth=['0.872', '0.887', '0.908', '0.931', '0.958', '0.972', '0.981', '0.997', '1.010', '1.024', '0.986', '0.993', '1.007', '1.019', '1.030', '0.995', '1.000', '1.013', '1.024', '1.034', '0.999', '1.004', '1.016', '1.027', '1.037', '1.000', '1.006', '1.017', '1.028', '1.038'] diff --git a/sweep_passes_results.txt b/sweep_passes_results.txt new file mode 100644 index 0000000000..015cd7f187 --- /dev/null +++ b/sweep_passes_results.txt @@ -0,0 +1,2 @@ +=== Pass count sweep: noRMS, jac=0.1 === +5-pass: bpb=2.1867 int6=3.66733532 step=3619.81ms mem=78629MiB diff --git a/sweep_stdout.log b/sweep_stdout.log new file mode 100644 index 0000000000..6aaf876682 --- /dev/null +++ b/sweep_stdout.log @@ -0,0 +1,3 @@ +[5-pass] START (13:08:00) +[5-pass] DONE => bpb@50=2.1867 int6=3.66733532 step=3619.81ms mem=78629MiB +[6-pass] START (13:20:07) diff --git a/test_4pass_qat.log b/test_4pass_qat.log index d40d4965ef..8dbb58f5df 100644 --- a/test_4pass_qat.log +++ b/test_4pass_qat.log @@ -53,3 +53,26 @@ Serialized model: 106023671 bytes Code size: 99082 bytes Serialized model int6+lzma: 4795396 bytes Total submission size int6+lzma: 4894478 bytes 
+final_int6_roundtrip val_loss:6.1634 val_bpb:3.6503 eval_time:82960ms
+final_int6_roundtrip_exact val_loss:6.16344777 val_bpb:3.65034094
+wandb: updating run metadata
+wandb: uploading history steps 15-15, summary, console lines 30-31
+wandb:
+wandb: Run history:
+wandb: lr_scale ▁▁▁▁▁▁▁▁▁▁▁▁▁▁
+wandb: step_avg_ms ▆▁▃▃▄▄▄▄▄▄▅▆▇█
+wandb: train_loss ▆█▇▆▆▆▆▆▅▅▃▂▁▁
+wandb: val_bpb █▂▁
+wandb: val_loss █▂▁
+wandb:
+wandb: Run summary:
+wandb: lr_scale 1
+wandb: step_avg_ms 3156.97551
+wandb: train_loss 3.76531
+wandb: val_bpb 2.20808
+wandb: val_loss 3.72825
+wandb:
+wandb: 🚀 View run test_4pass_noRMS_j0.1_QAT at: https://wandb.ai/propensity/parameter-golf/runs/meaoom9b
+wandb: ⭐️ View project at: https://wandb.ai/propensity/parameter-golf
+wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
+wandb: Find logs at: ./wandb/run-20260326_125242-meaoom9b/logs
diff --git a/test_4pass_qat_stdout.log b/test_4pass_qat_stdout.log
index c2f75f265e..23b4784782 100644
--- a/test_4pass_qat_stdout.log
+++ b/test_4pass_qat_stdout.log
@@ -1 +1,3 @@
 START 4-pass no-RMSnorm jac=0.1 QAT, 80GB cap (12:52:38)
+DONE => bpb@50=2.2081 int6=3.65034094 step=3156.98ms mem=66515MiB
+FINISHED (13:02:54)

From 05bee2974eb67a387f31987e1c805c5301bd7e97 Mon Sep 17 00:00:00 2001
From: nesta
Date: Fri, 27 Mar 2026 10:47:20 +0000
Subject: [PATCH 05/23] don't commit logs

---
 .gitignore | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/.gitignore b/.gitignore
index 3423c416a7..b05328f4f9 100644
--- a/.gitignore
+++ b/.gitignore
@@ -8,4 +8,5 @@ data/manifest.json
 data/docs_selected.jsonl
 .mypy_cache/
 .venv
-logs/
\ No newline at end of file
+logs/
+*.log
\ No newline at end of file

From 94ad908f7e0b4bee7bec127263517e72c9f40457 Mon Sep 17 00:00:00 2001
From: nesta
Date: Fri, 27 Mar 2026 10:47:28 +0000
Subject: [PATCH 06/23] amend

---
 .gitignore | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/.gitignore b/.gitignore
index b05328f4f9..8272195c85 100644
--- a/.gitignore
+++ b/.gitignore
@@ -9,4 +9,5 @@ data/docs_selected.jsonl
 .mypy_cache/
 .venv
 logs/
-*.log
\ No newline at end of file
+*.log
+*.txt
\ No newline at end of file

From 241e1db9e5cb7c8bd80b1c805f5e503d9e9a17c4 Mon Sep 17 00:00:00 2001
From: nesta
Date: Fri, 27 Mar 2026 14:35:55 +0000
Subject: [PATCH 07/23] simplify the submission folder

---
 eval_2p3c_ttt_4pass.log | 126 +
 .../README.md | 205 ++
 .../run_submission.sh | 122 +
 .../submission.json | 9 +
 .../train_gpt_recurrent.py | 118 +-
 .../wandb/debug-internal.log | 1 -
 .../wandb/debug.log | 1 -
 .../files/output.log | 32 -
 .../files/requirements.txt | 101 -
 .../files/output.log | 3 -
 .../files/requirements.txt | 101 -
 .../files/output.log | 32 -
 .../files/requirements.txt | 101 -
 .../files/output.log | 32 -
 .../files/requirements.txt | 101 -
 .../files/output.log | 32 -
 .../files/requirements.txt | 101 -
 .../files/output.log | 32 -
 .../files/requirements.txt | 101 -
 .../files/output.log | 32 -
 .../files/requirements.txt | 101 -
 .../files/output.log | 32 -
.../files/requirements.txt | 101 - .../files/output.log | 32 - .../files/requirements.txt | 101 - .../files/output.log | 32 - .../files/requirements.txt | 101 - .../files/output.log | 32 - .../files/requirements.txt | 101 - .../files/output.log | 32 - .../files/requirements.txt | 101 - .../files/output.log | 32 - .../files/requirements.txt | 101 - .../files/output.log | 19 - .../files/requirements.txt | 101 - .../files/output.log | 35 - .../files/requirements.txt | 101 - .../files/output.log | 32 - .../files/requirements.txt | 101 - .../files/output.log | 67 - .../files/requirements.txt | 101 - .../files/output.log | 122 - .../files/requirements.txt | 101 - .../files/output.log | 42 - .../files/requirements.txt | 101 - .../files/output.log | 51 - .../files/requirements.txt | 101 - .../files/output.log | 100 - .../files/requirements.txt | 101 - .../files/output.log | 93 - .../files/requirements.txt | 101 - .../files/output.log | 35 - .../files/requirements.txt | 101 - .../files/output.log | 42 - .../files/requirements.txt | 101 - .../files/output.log | 42 - .../files/requirements.txt | 101 - .../files/output.log | 32 - .../files/requirements.txt | 101 - .../files/output.log | 23 - .../files/requirements.txt | 101 - .../files/output.log | 42 - .../files/requirements.txt | 101 - .../files/output.log | 36 - .../files/requirements.txt | 101 - .../files/output.log | 48 - .../files/requirements.txt | 101 - .../files/output.log | 35 - .../files/requirements.txt | 101 - .../files/output.log | 39 - .../files/requirements.txt | 101 - .../files/output.log | 39 - .../files/requirements.txt | 101 - .../files/output.log | 100 - .../files/requirements.txt | 101 - .../files/output.log | 50 - .../files/requirements.txt | 101 - .../files/output.log | 14 - .../files/requirements.txt | 101 - .../files/output.log | 50 - .../files/requirements.txt | 101 - .../files/output.log | 63 - .../files/requirements.txt | 101 - .../files/output.log | 65 - .../files/requirements.txt | 101 - .../files/output.log | 88 - .../files/requirements.txt | 101 - .../files/output.log | 186 -- .../files/requirements.txt | 101 - .../files/output.log | 41 - .../files/requirements.txt | 101 - .../files/output.log | 41 - .../files/requirements.txt | 101 - .../files/output.log | 109 - .../files/requirements.txt | 101 - .../files/output.log | 41 - .../files/requirements.txt | 101 - .../files/output.log | 41 - .../files/requirements.txt | 101 - .../files/output.log | 109 - .../files/requirements.txt | 101 - .../files/output.log | 319 --- .../files/requirements.txt | 101 - .../files/output.log | 67 - .../files/requirements.txt | 101 - .../files/output.log | 67 - .../files/requirements.txt | 101 - .../files/output.log | 67 - .../files/requirements.txt | 101 - .../files/output.log | 41 - .../files/requirements.txt | 101 - .../files/output.log | 379 --- .../files/requirements.txt | 101 - .../ablation_no_rmsnorm.sh | 0 .../eval_ttt_passes.sh | 0 .../feedback.py | 138 ++ .../grid_search.sh | 0 .../lora-fix-plan.md | 0 .../recurrence-fixes.md | 0 .../run_2pass_3core.sh | 0 .../run_3pass.sh | 0 .../run_4pass_qat.sh | 0 .../run_4pass_test.sh | 0 .../run_4pass_ttt.sh | 0 .../run_baseline_4pass.sh | 0 .../run_full_1gpu.sh | 0 .../run_full_4pass.sh | 0 .../run_lora_test.sh | 0 .../run_lora_test_r8.sh | 0 .../smoke_passes.sh | 0 .../smoke_test.sh | 0 .../stability.py | 108 + .../sweep_passes.sh | 0 .../train_gpt_recurrent.py | 2194 +++++++++++++++++ .../wandb/latest-run | 0 .../files/wandb-metadata.json | 0 .../files/wandb-metadata.json | 0 
.../files/wandb-metadata.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/wandb-metadata.json | 0 .../files/wandb-metadata.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/wandb-metadata.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/wandb-metadata.json | 0 .../files/wandb-metadata.json | 0 .../files/wandb-metadata.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/wandb-metadata.json | 0 .../files/wandb-metadata.json | 0 .../files/wandb-metadata.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/wandb-metadata.json | 0 .../files/wandb-metadata.json | 0 .../files/wandb-metadata.json | 0 .../files/wandb-metadata.json | 0 .../files/wandb-metadata.json | 0 .../files/wandb-metadata.json | 0 .../files/wandb-metadata.json | 0 .../files/wandb-metadata.json | 0 .../files/wandb-metadata.json | 0 .../files/wandb-metadata.json | 0 .../files/wandb-metadata.json | 0 .../files/wandb-metadata.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/wandb-metadata.json | 0 .../files/wandb-metadata.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/config.yaml | 0 
.../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 .../files/config.yaml | 0 .../files/wandb-metadata.json | 0 .../files/wandb-summary.json | 0 282 files changed, 2917 insertions(+), 9833 deletions(-) create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/README.md create mode 100755 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_submission.sh create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/submission.json delete mode 120000 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/debug-internal.log delete mode 120000 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/debug.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_102057-7kghfexn/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_102057-7kghfexn/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103645-sx3ojo04/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103645-sx3ojo04/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103737-eswa6xue/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103737-eswa6xue/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_104722-1bh0d9xu/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_104722-1bh0d9xu/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_105405-dd5xlg1l/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_105405-dd5xlg1l/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110050-pmxdy841/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110050-pmxdy841/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110735-wq77le9z/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110735-wq77le9z/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_111419-rr366tug/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_111419-rr366tug/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_112256-w5b84094/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_112256-w5b84094/files/requirements.txt delete mode 100644 
records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_113135-l86ibk0l/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_113135-l86ibk0l/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114011-n43r2rb3/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114011-n43r2rb3/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114848-v63k38ck/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114848-v63k38ck/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115141-bq7ercgn/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115141-bq7ercgn/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115741-648c04nz/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115741-648c04nz/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115749-qjzpp5d7/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115749-qjzpp5d7/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115924-xtlv4t52/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115924-xtlv4t52/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_120745-6rfmco93/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_120745-6rfmco93/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_123015-qgrbnv6t/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_123015-qgrbnv6t/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_124119-cf7n2jes/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_124119-cf7n2jes/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_130804-ce1th47g/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_130804-ce1th47g/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132011-5rznhtcu/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132011-5rznhtcu/files/requirements.txt delete mode 100644 
records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132634-x2sl50qr/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132634-x2sl50qr/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133036-mo9kb26s/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133036-mo9kb26s/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133353-atic3pnd/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133353-atic3pnd/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133558-nwftkz5m/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133558-nwftkz5m/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_151933-hmzzit2n/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_151933-hmzzit2n/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_152723-snlqmhr2/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_152723-snlqmhr2/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_154455-y5p28i5r/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_154455-y5p28i5r/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_163311-3jt79ap8/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_163311-3jt79ap8/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_171548-0nylbkj0/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_171548-0nylbkj0/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_172142-fes18y77/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_172142-fes18y77/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_173159-f48k3ztp/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_173159-f48k3ztp/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174230-5r5mblr1/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174230-5r5mblr1/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174616-zd1j5kg5/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174616-zd1j5kg5/files/requirements.txt delete mode 100644 
records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174902-ponrb7vw/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174902-ponrb7vw/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_175915-2o340uez/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_175915-2o340uez/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_180447-w7yechln/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_180447-w7yechln/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_181843-pwdgzvp6/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_181843-pwdgzvp6/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_182338-0jlwabms/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_182338-0jlwabms/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_183144-z6wj6zap/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_183144-z6wj6zap/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_184007-n9zy31jn/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_184007-n9zy31jn/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-8n0ize2o/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-8n0ize2o/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-9ekyp2ua/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-9ekyp2ua/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-jsf1k0pv/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-jsf1k0pv/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_194337-b8eb1lhl/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_194337-b8eb1lhl/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_200826-h4wnno7e/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_200826-h4wnno7e/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_203403-fzb1y9o8/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_203403-fzb1y9o8/files/requirements.txt delete mode 100644 
records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_205139-3z8g4kez/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_205139-3z8g4kez/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-ngp8wevn/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-ngp8wevn/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-w0ibl7rl/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-w0ibl7rl/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/requirements.txt delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/output.log delete mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/requirements.txt rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 
2026-03-26_RecurrentSOTA_Feedback_BACKUP}/ablation_no_rmsnorm.sh (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/eval_ttt_passes.sh (100%) create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/feedback.py rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/grid_search.sh (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/lora-fix-plan.md (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/recurrence-fixes.md (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/run_2pass_3core.sh (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/run_3pass.sh (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/run_4pass_qat.sh (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/run_4pass_test.sh (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/run_4pass_ttt.sh (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/run_baseline_4pass.sh (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/run_full_1gpu.sh (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/run_full_4pass.sh (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/run_lora_test.sh (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/run_lora_test_r8.sh (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/smoke_passes.sh (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/smoke_test.sh (100%) create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/stability.py rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/sweep_passes.sh (100%) create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/train_gpt_recurrent.py rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/latest-run (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_102057-7kghfexn/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_103645-sx3ojo04/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_103737-eswa6xue/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_104722-1bh0d9xu/files/config.yaml (100%) rename 
records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_104722-1bh0d9xu/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_104722-1bh0d9xu/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_105405-dd5xlg1l/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_105405-dd5xlg1l/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_105405-dd5xlg1l/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_110050-pmxdy841/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_110050-pmxdy841/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_110050-pmxdy841/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_110735-wq77le9z/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_110735-wq77le9z/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_110735-wq77le9z/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_111419-rr366tug/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_111419-rr366tug/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_111419-rr366tug/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_112256-w5b84094/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_112256-w5b84094/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_112256-w5b84094/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_113135-l86ibk0l/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_113135-l86ibk0l/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_113135-l86ibk0l/files/wandb-summary.json (100%) rename 
records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_114011-n43r2rb3/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_114011-n43r2rb3/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_114011-n43r2rb3/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_114848-v63k38ck/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_115141-bq7ercgn/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_115741-648c04nz/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_115741-648c04nz/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_115741-648c04nz/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_115749-qjzpp5d7/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_115749-qjzpp5d7/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_115749-qjzpp5d7/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_115924-xtlv4t52/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_115924-xtlv4t52/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_115924-xtlv4t52/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_120745-6rfmco93/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_120745-6rfmco93/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_120745-6rfmco93/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_123015-qgrbnv6t/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_123015-qgrbnv6t/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_123015-qgrbnv6t/files/wandb-summary.json (100%) rename 
records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_124119-cf7n2jes/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_125242-meaoom9b/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_125242-meaoom9b/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_125242-meaoom9b/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_130804-ce1th47g/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_130804-ce1th47g/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_130804-ce1th47g/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_132011-5rznhtcu/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_132634-x2sl50qr/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_133036-mo9kb26s/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_133353-atic3pnd/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_133353-atic3pnd/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_133353-atic3pnd/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_133558-nwftkz5m/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_151933-hmzzit2n/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_152723-snlqmhr2/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_154455-y5p28i5r/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_154455-y5p28i5r/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_154455-y5p28i5r/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_163311-3jt79ap8/files/wandb-metadata.json 
(100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_171548-0nylbkj0/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_172142-fes18y77/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_173159-f48k3ztp/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_174230-5r5mblr1/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_174616-zd1j5kg5/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_174902-ponrb7vw/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_175915-2o340uez/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_180447-w7yechln/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_181843-pwdgzvp6/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_182338-0jlwabms/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_183144-z6wj6zap/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_184007-n9zy31jn/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_184007-n9zy31jn/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_184007-n9zy31jn/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_193530-8n0ize2o/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_193530-8n0ize2o/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_193530-8n0ize2o/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_193530-9ekyp2ua/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_193530-9ekyp2ua/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 
2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_193530-9ekyp2ua/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_193530-jsf1k0pv/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_193530-jsf1k0pv/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_193530-jsf1k0pv/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_194337-b8eb1lhl/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_200826-h4wnno7e/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_203403-fzb1y9o8/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_203403-fzb1y9o8/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_203403-fzb1y9o8/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_205139-3z8g4kez/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_205139-3z8g4kez/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_205139-3z8g4kez/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_223030-ngp8wevn/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_223030-ngp8wevn/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_223030-ngp8wevn/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_223030-w0ibl7rl/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_223030-w0ibl7rl/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_223030-w0ibl7rl/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_223030-z24l5l1s/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_223030-z24l5l1s/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 
2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_223030-z24l5l1s/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_224439-a7ap9e0a/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_224439-a7ap9e0a/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_224439-a7ap9e0a/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_224439-i5qgsj3x/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_224439-i5qgsj3x/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_224439-i5qgsj3x/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_224440-chtxlxg3/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_224440-chtxlxg3/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_224440-chtxlxg3/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_230156-qltwebo4/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_230156-qltwebo4/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_230156-qltwebo4/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_230158-43bipylb/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_230158-43bipylb/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_230158-43bipylb/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_230158-fsi4c82a/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_230158-fsi4c82a/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_230158-fsi4c82a/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_230158-jkh80zal/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 
2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_230158-jkh80zal/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_230158-jkh80zal/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_230158-zcabiozu/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_230158-zcabiozu/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260326_230158-zcabiozu/files/wandb-summary.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260327_080959-p8sqkbqa/files/config.yaml (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260327_080959-p8sqkbqa/files/wandb-metadata.json (100%) rename records/track_10min_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_RecurrentSOTA_Feedback_BACKUP}/wandb/run-20260327_080959-p8sqkbqa/files/wandb-summary.json (100%) diff --git a/eval_2p3c_ttt_4pass.log b/eval_2p3c_ttt_4pass.log index e8b2bd7106..69573584df 100644 --- a/eval_2p3c_ttt_4pass.log +++ b/eval_2p3c_ttt_4pass.log @@ -79,3 +79,129 @@ ttt_sliding:params unfrozen=26927712 frozen=0 ttt_chunk [651/1893] bpb=1.117944 time=957.2s ttt_chunk [661/1893] bpb=1.117832 time=971.9s ttt_chunk [671/1893] bpb=1.117352 time=986.5s + ttt_chunk [681/1893] bpb=1.116794 time=1001.2s + ttt_chunk [691/1893] bpb=1.116880 time=1015.9s + ttt_chunk [701/1893] bpb=1.116056 time=1030.6s + ttt_chunk [711/1893] bpb=1.116052 time=1045.3s + ttt_chunk [721/1893] bpb=1.115953 time=1060.0s + ttt_chunk [731/1893] bpb=1.116170 time=1074.7s + ttt_chunk [741/1893] bpb=1.116045 time=1089.4s + ttt_chunk [751/1893] bpb=1.115741 time=1104.1s + ttt_chunk [761/1893] bpb=1.115869 time=1118.8s + ttt_chunk [771/1893] bpb=1.115693 time=1133.5s + ttt_chunk [781/1893] bpb=1.115853 time=1148.2s + ttt_chunk [791/1893] bpb=1.115693 time=1162.9s + ttt_chunk [801/1893] bpb=1.115617 time=1177.6s + ttt_chunk [811/1893] bpb=1.115617 time=1192.3s + ttt_chunk [821/1893] bpb=1.115495 time=1207.0s + ttt_chunk [831/1893] bpb=1.115203 time=1221.7s + ttt_chunk [841/1893] bpb=1.114945 time=1236.4s + ttt_chunk [851/1893] bpb=1.114987 time=1251.2s + ttt_chunk [861/1893] bpb=1.115045 time=1265.9s + ttt_chunk [871/1893] bpb=1.115236 time=1280.6s + ttt_chunk [881/1893] bpb=1.115224 time=1295.2s + ttt_chunk [891/1893] bpb=1.114680 time=1309.9s + ttt_chunk [901/1893] bpb=1.114684 time=1324.6s + ttt_chunk [911/1893] bpb=1.114526 time=1339.3s + ttt_chunk [921/1893] bpb=1.114659 time=1354.0s + ttt_chunk [931/1893] bpb=1.114599 time=1368.7s + ttt_chunk [941/1893] bpb=1.114804 time=1383.4s + ttt_chunk [951/1893] bpb=1.115094 time=1398.1s + ttt_chunk [961/1893] bpb=1.115392 time=1412.8s + ttt_chunk [971/1893] bpb=1.115742 time=1427.5s + ttt_chunk [981/1893] bpb=1.115951 time=1442.2s + ttt_chunk [991/1893] bpb=1.115856 time=1456.9s + ttt_chunk [1001/1893] bpb=1.116177 time=1471.6s + ttt_chunk [1011/1893] bpb=1.116320 time=1486.3s + ttt_chunk [1021/1893] bpb=1.116605 time=1501.0s + ttt_chunk [1031/1893] bpb=1.116994 time=1515.7s + ttt_chunk [1041/1893] bpb=1.117500 time=1530.3s + ttt_chunk 
[1051/1893] bpb=1.117367 time=1545.0s + ttt_chunk [1061/1893] bpb=1.117461 time=1559.7s + ttt_chunk [1071/1893] bpb=1.117602 time=1574.4s + ttt_chunk [1081/1893] bpb=1.117644 time=1589.1s + ttt_chunk [1091/1893] bpb=1.117898 time=1603.8s + ttt_chunk [1101/1893] bpb=1.118047 time=1618.5s + ttt_chunk [1111/1893] bpb=1.117863 time=1633.2s + ttt_chunk [1121/1893] bpb=1.117633 time=1647.9s + ttt_chunk [1131/1893] bpb=1.117526 time=1662.6s + ttt_chunk [1141/1893] bpb=1.117282 time=1677.3s + ttt_chunk [1151/1893] bpb=1.117304 time=1692.0s + ttt_chunk [1161/1893] bpb=1.117144 time=1706.7s + ttt_chunk [1171/1893] bpb=1.116965 time=1721.4s + ttt_chunk [1181/1893] bpb=1.116743 time=1736.1s + ttt_chunk [1191/1893] bpb=1.116894 time=1750.8s + ttt_chunk [1201/1893] bpb=1.117095 time=1765.5s + ttt_chunk [1211/1893] bpb=1.116693 time=1780.2s + ttt_chunk [1221/1893] bpb=1.117030 time=1794.9s + ttt_chunk [1231/1893] bpb=1.116970 time=1809.6s + ttt_chunk [1241/1893] bpb=1.116674 time=1824.3s + ttt_chunk [1251/1893] bpb=1.116128 time=1839.0s + ttt_chunk [1261/1893] bpb=1.115870 time=1853.7s + ttt_chunk [1271/1893] bpb=1.115620 time=1868.5s + ttt_chunk [1281/1893] bpb=1.115309 time=1883.2s + ttt_chunk [1291/1893] bpb=1.115065 time=1897.8s + ttt_chunk [1301/1893] bpb=1.115022 time=1912.5s + ttt_chunk [1311/1893] bpb=1.114756 time=1927.2s + ttt_chunk [1321/1893] bpb=1.114465 time=1941.9s + ttt_chunk [1331/1893] bpb=1.114234 time=1956.6s + ttt_chunk [1341/1893] bpb=1.114106 time=1971.3s + ttt_chunk [1351/1893] bpb=1.113950 time=1986.0s + ttt_chunk [1361/1893] bpb=1.114068 time=2000.7s + ttt_chunk [1371/1893] bpb=1.114276 time=2015.4s + ttt_chunk [1381/1893] bpb=1.114476 time=2030.1s + ttt_chunk [1391/1893] bpb=1.114265 time=2044.8s + ttt_chunk [1401/1893] bpb=1.114305 time=2059.5s + ttt_chunk [1411/1893] bpb=1.114417 time=2074.2s + ttt_chunk [1421/1893] bpb=1.114402 time=2088.9s + ttt_chunk [1431/1893] bpb=1.114373 time=2103.6s + ttt_chunk [1441/1893] bpb=1.114842 time=2118.3s + ttt_chunk [1451/1893] bpb=1.114708 time=2132.9s + ttt_chunk [1461/1893] bpb=1.114629 time=2147.6s + ttt_chunk [1471/1893] bpb=1.115217 time=2162.3s + ttt_chunk [1481/1893] bpb=1.115092 time=2177.0s + ttt_chunk [1491/1893] bpb=1.115463 time=2191.7s + ttt_chunk [1501/1893] bpb=1.115441 time=2206.4s + ttt_chunk [1511/1893] bpb=1.115387 time=2221.1s + ttt_chunk [1521/1893] bpb=1.115500 time=2235.8s + ttt_chunk [1531/1893] bpb=1.115703 time=2250.5s + ttt_chunk [1541/1893] bpb=1.115777 time=2265.2s + ttt_chunk [1551/1893] bpb=1.116009 time=2279.9s + ttt_chunk [1561/1893] bpb=1.116094 time=2294.6s + ttt_chunk [1571/1893] bpb=1.116235 time=2309.3s + ttt_chunk [1581/1893] bpb=1.116386 time=2324.0s + ttt_chunk [1591/1893] bpb=1.116446 time=2338.7s + ttt_chunk [1601/1893] bpb=1.116569 time=2353.4s + ttt_chunk [1611/1893] bpb=1.116829 time=2368.1s + ttt_chunk [1621/1893] bpb=1.116690 time=2382.8s + ttt_chunk [1631/1893] bpb=1.116733 time=2397.5s + ttt_chunk [1641/1893] bpb=1.116754 time=2412.2s + ttt_chunk [1651/1893] bpb=1.116805 time=2426.9s + ttt_chunk [1661/1893] bpb=1.116943 time=2441.5s + ttt_chunk [1671/1893] bpb=1.117119 time=2456.2s + ttt_chunk [1681/1893] bpb=1.117209 time=2470.9s + ttt_chunk [1691/1893] bpb=1.117311 time=2485.7s + ttt_chunk [1701/1893] bpb=1.117409 time=2500.4s + ttt_chunk [1711/1893] bpb=1.117389 time=2515.1s + ttt_chunk [1721/1893] bpb=1.117225 time=2529.8s + ttt_chunk [1731/1893] bpb=1.117322 time=2544.5s + ttt_chunk [1741/1893] bpb=1.117062 time=2559.2s + ttt_chunk [1751/1893] bpb=1.116939 time=2573.9s + ttt_chunk 
[1761/1893] bpb=1.116981 time=2588.6s
+ ttt_chunk [1771/1893] bpb=1.116924 time=2603.3s
+ ttt_chunk [1781/1893] bpb=1.116820 time=2618.0s
+ ttt_chunk [1791/1893] bpb=1.116486 time=2632.7s
+ ttt_chunk [1801/1893] bpb=1.116475 time=2647.4s
+ ttt_chunk [1811/1893] bpb=1.116329 time=2662.1s
+ ttt_chunk [1821/1893] bpb=1.116389 time=2676.8s
+ ttt_chunk [1831/1893] bpb=1.116241 time=2691.5s
+ ttt_chunk [1841/1893] bpb=1.116282 time=2706.2s
+ ttt_chunk [1851/1893] bpb=1.116112 time=2720.9s
+ ttt_chunk [1861/1893] bpb=1.116034 time=2735.6s
+ ttt_chunk [1871/1893] bpb=1.115968 time=2750.3s
+ ttt_chunk [1881/1893] bpb=1.115727 time=2765.0s
+ ttt_chunk [1891/1893] bpb=1.115712 time=2779.7s
+ ttt_chunk [1893/1893] bpb=1.115741 time=2782.0s
+ttt_sliding:done val_loss=1.883877 val_bpb=1.115741 elapsed=2782.1s
+legal_ttt val_loss:1.8839 val_bpb:1.1157
+legal_ttt_exact val_loss:1.88387733 val_bpb:1.11574122
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/README.md b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/README.md
new file mode 100644
index 0000000000..5244d3f503
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/README.md
@@ -0,0 +1,205 @@
+# Recurrent Depth: 2-Pass Train + 4-Pass Eval with Error Feedback
+
+**val_bpb: TBD** (3-seed mean) | **~16 MB** | 8xH100 SXM
+
+## Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)
+
+| Seed     | step_avg | steps   | Pre-TTT bpb | **Post-TTT bpb** | TTT gain | TTT time | Artifact |
+| -------- | -------- | ------- | ----------- | ---------------- | -------- | -------- | -------- |
+| 1337     | TBD      | TBD     | TBD         | **TBD**          | TBD      | TBD      | TBD      |
+| 42       | TBD      | TBD     | TBD         | **TBD**          | TBD      | TBD      | TBD      |
+| 2025     | TBD      | TBD     | TBD         | **TBD**          | TBD      | TBD      | TBD      |
+| **Mean** | **TBD**  | **TBD** | **TBD**     | **TBD**          | **TBD**  | **TBD**  |          |
+
+## The Problem: Depth Recurrence Fails Under Competition Constraints
+
+[PR #363](https://github.com/openai/parameter-golf/pull/363) demonstrated that depth recurrence -- reusing a shared block of transformer layers multiple times -- saves parameters but *hurts* bpb under the 10-minute / 16 MB competition constraints. Their controlled experiments showed a **+0.025 bpb gap** (looped worse) due to two compounding taxes:
+
+1. **Quantization error amplification.** When shared weights are quantized to int6, the quantization error $\epsilon$ is injected at every pass. After $K$ passes through the same core, the cumulative error grows superlinearly. PR #363 measured this as a **0.37 bpb quantization gap** for a 3x-looped architecture vs near-zero for a flat model.
+2. **Step time overhead.** Each additional recurrence pass adds forward/backward compute through the core layers. With 5 core layers and 4 passes, PR #363 observed +32ms/step, translating to ~1200 fewer training steps in the 600s budget. The capacity benefit of shared weights cannot overcome the lost training signal.
+
+## Our Solution: Contractive Recurrence + Inference-Time Depth Scaling
+
+We address both taxes with three architectural mechanisms and a decoupled train/eval strategy.
+
+### 1. Learnable Residual Scaling (ResidualScale)
+
+Per-pass learnable scalars $\alpha_k$ contract the residual update, preventing hidden-state magnitude growth across passes:
+
+$$h_{k+1} = h_k + \alpha_k \cdot F(h_k + c_k)$$
+
+where $\alpha_k$ is initialized to 0.5 and learned during training. This ensures the recurrent dynamics are contractive -- later passes refine rather than amplify.
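+
+To make the contraction argument concrete (a first-order sketch, treating the correction $c_k$ as fixed): each pass applies the map $h \mapsto h + \alpha_k \cdot F(h + c_k)$, whose Jacobian is $I + \alpha_k J_F$. A perturbation injected into the core -- quantization error on the shared weights, say -- is therefore amplified over $K$ passes by at most
+
+$$\prod_{k=0}^{K-1} \bigl(1 + \alpha_k \lVert J_F \rVert\bigr),$$
+
+so learning small $\alpha_k$ shrinks the compounding factor itself instead of leaving stability to chance.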
+
+```python
+class ResidualScale(nn.Module):
+    def __init__(self, num_passes: int, init_value: float = 1.0):
+        super().__init__()
+        self.scales = nn.Parameter(
+            torch.full((num_passes,), init_value, dtype=torch.float32)
+        )
+
+    def forward(self, residual: Tensor, pass_idx: int) -> Tensor:
+        return self.scales[pass_idx].to(dtype=residual.dtype) * residual
+```
+
+### 2. Error Feedback Module
+
+A low-rank residual approximation estimates the accumulated error, and a learned diagonal correction compensates for it before each pass:
+
+$$e_k = U(V^\top h_k), \qquad c_k = \mathrm{diag}(d) \cdot e_k$$
+
+where $U, V \in \mathbb{R}^{d \times r}$ with rank $r=2$ and $d \in \mathbb{R}^d$ is a learnable diagonal. The correction is zero on pass 0 (no prior error to correct) and active on subsequent passes. Total parameter overhead: **2,560 params** ($2 \times 512 \times 2$ for $U$ and $V$ plus $512$ for $d$ -- negligible vs 26.9M model params).
+
+```python
+class ErrorFeedbackModule(nn.Module):
+    """Combined error-feedback path: residual -> correction.
+
+    e_k = U (V^T h_k)   -- low-rank residual approximation
+    c_k = diag(d) * e_k -- diagonal correction
+    """
+    def forward(self, h: Tensor, pass_idx: int) -> Tensor:
+        e = self.residual(h)     # Low-rank projection
+        c = self.correction(e)   # Diagonal scaling
+        mask = 1.0 if pass_idx > 0 else 0.0  # Inactive on first pass
+        return c * mask
+```
+
+### 3. Jacobian Proxy Loss
+
+A regularization term penalizes hidden-state growth ratios above 1.0, enforcing contractive dynamics without computing the full Jacobian:
+
+$$\mathcal{L}_J = \lambda \cdot \mathrm{ReLU}\left(\frac{\lVert h_{k+1} - h_k \rVert}{\lVert h_k \rVert + \epsilon} - 1\right)^{2}$$
+
+with $\lambda = 0.1$. This is a cheap finite-difference proxy for the spectral norm of the Jacobian $\partial h_{k+1}/\partial h_k$, encouraging it to stay below 1 (contractive map).
+
+```python
+def jacobian_proxy_loss(self, h_in: Tensor, h_out: Tensor) -> Tensor:
+    delta = h_out - h_in
+    ratio = delta.norm() / (h_in.norm() + self.eps)
+    return self.jacobian_proxy_weight * torch.relu(ratio - 1.0).square()
+```
+
+### 4. Train Cheap, Eval Deep
+
+The key insight: **train with 2 recurrence passes, evaluate with 4**. This largely sidesteps the step-time tax during training (our step time is only ~15% above the flat baseline, vs ~27% for 4-pass training), while still harvesting the depth benefit at inference time. The contractive mechanisms (ResidualScale + Jacobian proxy) ensure that adding passes at eval time does not cause hidden-state blowup: the learned dynamics stay stable well past the trained pass count (see the sweep below).
+
+After training completes and the checkpoint is saved, we override `num_passes` from 2 to 4 and pad the `ResidualScale` parameters for the additional passes (a condensed sketch follows the pass sweep below). The model then runs TTT and final evaluation with 4 effective core passes (20 effective layer evaluations: 4 stem + 3x4 core + 4 tail).
+
+```
+Architecture (11 unique layers, 20 effective at eval):
+  Stem [0-3] -> Core [4-6] x4 passes -> Tail [7-10]
+                ^^^^^^^^^^
+                Shared weights, reused 4 times at eval
+```
+
+### Eval-time pass sweep (1-GPU development, seed 1337)
+
+| Eval passes | TTT bpb    | vs 2-pass   |
+| ----------- | ---------- | ----------- |
+| 2 (=train)  | 1.1204     | baseline    |
+| 4           | **1.1157** | **-0.0047** |
+| 6           | 1.1166     | -0.0039     |
+| 8           | 1.1176     | -0.0029     |
+
+4 passes is the sweet spot: enough depth to improve token prediction, not so many that the contractive scaling dampens the signal. This result shows that our stability mechanisms successfully enable inference-time compute scaling for recurrent transformers.
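+
+For reference, a condensed sketch of that override (the helper name `override_eval_passes` is ours for exposition; the authoritative logic is the eval-override block in `train_gpt_recurrent.py`):
+
+```python
+import torch
+import torch.nn as nn
+
+def override_eval_passes(model, eval_passes: int, scale_init: float = 0.5) -> None:
+    """Train-cheap/eval-deep: run a model trained with 2 passes at `eval_passes`.
+
+    Scales learned for the trained passes are kept; scales for the extra
+    passes are padded with the training-time init value.
+    """
+    model.num_passes = eval_passes
+    rs = model.residual_scale
+    if rs is not None and eval_passes != rs.scales.shape[0]:
+        new_scales = torch.full((eval_passes,), scale_init,
+                                dtype=torch.float32, device=rs.scales.device)
+        copy_len = min(eval_passes, rs.scales.shape[0])
+        new_scales[:copy_len] = rs.scales.data[:copy_len]
+        rs.scales = nn.Parameter(new_scales)
+```
+
+Because TTT runs with `freeze_blocks=0`, the padded scales are then adapted along with every other parameter during evaluation.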
+
+## What Didn't Work
+
+### LoRA adapters for per-pass specialization
+
+We tried adding low-rank adapters (rank 2 and 8) to differentiate core-layer behavior across passes. Results:
+
+- **No bpb improvement**: rank-2 and rank-8 LoRA produced nearly identical loss curves to the baseline, even with careful warmup scheduling.
+- **Size constraint**: At rank 8, LoRA parameters pushed the total artifact over the 16 MB limit.
+- **Hypothesis**: The core layers already learn pass-invariant features through the ResidualScale mechanism; LoRA's per-pass deltas are redundant.
+
+### Training with more recurrence passes (4+)
+
+Direct training with 4 passes hits the step-time tax:
+
+- **4-pass training**: ~105ms/step on 8xH100 vs ~83ms for flat. In 600s: ~5700 steps vs ~7200.
+- **Result**: The ~1500 fewer training steps cost more bpb than the extra depth recovers.
+- **2-pass training + 4-pass eval**: ~96ms/step, ~6250 steps. Recovers most of the flat model's step count while gaining inference-time depth.
+
+## Architecture
+
+Built on the [PR #414](https://github.com/openai/parameter-golf/pull/414) stack with [PR #399](https://github.com/openai/parameter-golf/pull/399) Parallel Muon:
+
+| Component               | Setting                                           |
+| ----------------------- | ------------------------------------------------- |
+| Layers                  | 11 unique (512d, 8H, 4KV)                         |
+| Effective layers (eval) | 20 (4 stem + 3 core x4 + 4 tail)                  |
+| MLP                     | 3x with LeakyReLU(0.5)^2                          |
+| BigramHash              | 1536                                              |
+| XSA                     | Last 4 layers                                     |
+| RoPE                    | Partial (16/64 dims)                              |
+| LN Scale                | 1/sqrt(layer+1)                                   |
+| VE128                   | Layers 9-10                                       |
+| Recurrence core         | Layers 4-6, 2 passes (train), 4 passes (eval)     |
+| ResidualScale           | Per-pass learnable, init 0.5                      |
+| Error Feedback          | Diagonal mode, rank 2                             |
+| Jacobian proxy          | lambda=0.1                                        |
+| Weight avg              | EMA(0.997) + SWA(every 50)                        |
+| Quantization            | Late QAT (threshold 0.15) + GPTQ-lite int6 + lzma |
+| Optimizer               | Parameter Banking + Parallel Muon                 |
+
+### TTT Configuration
+
+Score-first legal TTT following [PR #461](https://github.com/openai/parameter-golf/pull/461):
+
+| Parameter        | Value                                    |
+| ---------------- | ---------------------------------------- |
+| Chunk size       | 32,768 tokens                            |
+| Optimizer        | SGD + momentum(0.9)                      |
+| Learning rate    | 0.002 (cosine decay across chunks)       |
+| Epochs per chunk | 3                                        |
+| Frozen blocks    | None (all blocks adapt, freeze_blocks=0) |
+| Gradient clip    | 1.0                                      |
+| Eval passes      | 4 (overridden from training's 2)         |
+
+## Run Command
+
+```bash
+cd records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback
+bash run_submission.sh
+```
+
+Or for a single seed:
+
+```bash
+NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=1536 XSA_LAST_N=4 \
+EMA_ENABLED=1 EMA_DECAY=0.997 SWA_ENABLED=1 SWA_EVERY=50 \
+ROPE_DIMS=16 LN_SCALE=1 LATE_QAT_THRESHOLD=0.15 \
+VE_ENABLED=1 VE_DIM=128 VE_LAYERS=9,10 \
+TTT_ENABLED=1 TTT_LR=0.002 TTT_EPOCHS=3 TTT_CHUNK_TOKENS=32768 \
+TTT_FREEZE_BLOCKS=0 TTT_MOMENTUM=0.9 TTT_BATCH_SEQS=32 TTT_GRAD_CLIP=1.0 \
+MUON_WD=0.04 ADAM_WD=0.04 \
+MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
+MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
+MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3500 \
+ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \
+CORE_START=4 CORE_END=7 NUM_PASSES=2 EVAL_PASSES=4 \
+SEED=1337 \
+torchrun --standalone --nproc_per_node=8 train_gpt_recurrent.py \
+  --feedback-mode diagonal --feedback-rank 2 \
+  --residual-scale-init 0.5 \
+  --jacobian-proxy-weight 0.1 \
+  --no-interpass-rmsnorm
+```
+
+## Credits
+ +- **Base model**: [PR #414](https://github.com/openai/parameter-golf/pull/414) by @signalrush +- **Optimizer (Parameter Banking + Parallel Muon)**: [PR #399](https://github.com/openai/parameter-golf/pull/399) by @abaybektursun +- **LeakyReLU^2 activation**: [PR #493](https://github.com/openai/parameter-golf/pull/493) by @parinzee +- **TTT recipe**: [PR #461](https://github.com/openai/parameter-golf/pull/461) by @Christopher-Lee-McClendon +- **Depth recurrence analysis**: [PR #363](https://github.com/openai/parameter-golf/pull/363) by @evangelinehelsinki (identified the quantization error amplification problem we solve here) + diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_submission.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_submission.sh new file mode 100755 index 0000000000..ebc722608d --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_submission.sh @@ -0,0 +1,122 @@ +#!/bin/bash +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +cd "$SCRIPT_DIR" + +if [ -f /home/nesta/parameter-golf/.env ]; then + set -a; source /home/nesta/parameter-golf/.env; set +a +fi + +export PYTHONUNBUFFERED=1 +export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True + +# --- Data paths --- +export DATA_PATH="../../../data/datasets/fineweb10B_sp1024" +export TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model" + +# --- Architecture (matches SOTA PR #549 / PR #414 stack) --- +export NUM_LAYERS=11 +export MODEL_DIM=512 +export NUM_HEADS=8 +export NUM_KV_HEADS=4 +export BIGRAM_VOCAB_SIZE=1536 +export XSA_LAST_N=4 +export ROPE_DIMS=16 +export LN_SCALE=1 +export VE_ENABLED=1 +export VE_DIM=128 +export VE_LAYERS="9,10" + +# --- Training schedule (matches SOTA 8xH100 settings) --- +export ITERATIONS=9000 +export MAX_WALLCLOCK_SECONDS=600 +export VAL_LOSS_EVERY=500 +export TRAIN_LOG_EVERY=50 +export WARMUP_STEPS=20 +export WARMDOWN_ITERS=3500 +export TRAIN_BATCH_TOKENS=786432 +export TRAIN_SEQ_LEN=2048 +export EVAL_SEQ_LEN=2048 +export EVAL_STRIDE=64 + +# --- Optimizer (matches SOTA) --- +export MATRIX_LR=0.025 +export SCALAR_LR=0.025 +export TIED_EMBED_LR=0.035 +export MUON_MOMENTUM=0.99 +export MUON_MOMENTUM_WARMUP_START=0.92 +export MUON_MOMENTUM_WARMUP_STEPS=1500 +export MUON_WD=0.04 +export ADAM_WD=0.04 +export GRAD_CLIP_NORM=0.3 + +# --- Weight averaging & quantization --- +export SWA_ENABLED=1 +export SWA_EVERY=50 +export LATE_QAT_THRESHOLD=0.15 + +# --- TTT (matches SOTA, freeze_blocks=0) --- +export TTT_ENABLED=1 +export TTT_LR=0.002 +export TTT_EPOCHS=3 +export TTT_CHUNK_TOKENS=32768 +export TTT_FREEZE_BLOCKS=0 +export TTT_MOMENTUM=0.9 +export TTT_BATCH_SEQS=32 +export TTT_GRAD_CLIP=1.0 + +# --- Recurrence (our contribution) --- +export CORE_START=4 +export CORE_END=7 +export NUM_PASSES=2 +export EVAL_PASSES=4 +export CORE_QUANT_ENABLED=0 + +# --- W&B --- +export WANDB_PROJECT="parameter-golf" + +echo "================================================================" +echo "Submission run: 2-pass train / 4-pass eval, 3 seeds" +echo "================================================================" + +for SEED in 1337 42 2025; do + export SEED + export WANDB_NAME="recurrent_2p4e_seed${SEED}" + LOG="${SCRIPT_DIR}/train_seed${SEED}.log" + + echo "" + echo "=== SEED=${SEED} started $(date) ===" + + torchrun --standalone --nproc_per_node=8 train_gpt_recurrent.py \ + --feedback-mode diagonal --feedback-rank 2 
\ --residual-scale-init 0.5 \ --jacobian-proxy-weight 0.1 \ --no-interpass-rmsnorm \ 2>&1 | tee "$LOG" && EXIT=0 || EXIT=$? + + # '&& EXIT=0 || EXIT=$?' records the pipeline status (pipefail -> torchrun's exit) without tripping 'set -e', so a failed seed no longer aborts the loop + echo "" + if [ $EXIT -ne 0 ]; then + echo "SEED=${SEED} FAILED (exit=$EXIT)" + tail -30 "$LOG" + else + echo "=== SEED=${SEED} RESULTS ===" + grep 'stopping_early\|peak memory' "$LOG" || true + grep 'Total submission size' "$LOG" || true + grep 'final_int6_sliding_window_exact' "$LOG" || true + grep 'legal_ttt_exact' "$LOG" || true + fi + echo "=== SEED=${SEED} finished $(date) ===" +done + +echo "" +echo "================================================================" +echo "All seeds complete. Results summary:" +echo "================================================================" +for SEED in 1337 42 2025; do + LOG="${SCRIPT_DIR}/train_seed${SEED}.log" + echo "--- Seed ${SEED} ---" + grep 'legal_ttt_exact' "$LOG" 2>/dev/null || echo " (no TTT result found)" + grep 'Total submission size' "$LOG" 2>/dev/null || echo " (no size found)" +done diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/submission.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/submission.json new file mode 100644 index 0000000000..f6a76205c4 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/submission.json @@ -0,0 +1,9 @@ +{ + "name": "Recurrent Depth: 2-Pass Train + 4-Pass Eval with Error Feedback", + "val_bpb": "<3-seed mean -- fill after runs>", + "bytes_total": "", + "blurb": "Depth recurrence with contractive hidden states (ResidualScale + Jacobian proxy) and quantization error feedback. Train 2-pass core (layers 4-6), eval 4-pass with TTT (freeze=0). Solves PR #363's quantization error amplification via learned error correction. Built on PR #414 stack + PR #399 Parallel Muon.", + "author": "abaybektursun", + "github_id": "abaybektursun", + "date": "2026-03-27" +} diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py index a82ec73612..88aeda6f06 100644 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py @@ -116,7 +116,7 @@ class Hyperparameters: num_passes = int(os.environ.get("NUM_PASSES", 1)) core_quant_bits = int(os.environ.get("CORE_QUANT_BITS", 6)) core_quant_enabled = bool(int(os.environ.get("CORE_QUANT_ENABLED", "0"))) - lora_rank = int(os.environ.get("LORA_RANK", 0)) + eval_passes = int(os.environ.get("EVAL_PASSES", 0)) # --- Batched Newton-Schulz orthogonalization --- @@ -838,7 +838,6 @@ def __init__( core_quant_enabled: bool = False, residual_scale: nn.Module | None = None, interpass_rmsnorm: bool = True, - lora_rank: int = 0, ): super().__init__() self._ve_target_dim = num_kv_heads * (model_dim // num_heads) @@ -860,7 +859,6 @@ def __init__( self.num_core = self.core_end - core_start self.num_tail = num_layers - self.core_end self.residual_scale = residual_scale - self.lora_rank = lora_rank self.tok_emb = nn.Embedding(vocab_size, model_dim) self.bigram = BigramHashEmbedding(bigram_vocab_size, bigram_dim, model_dim) if bigram_vocab_size > 0 else None self.smear = SmearGate(model_dim) @@ -875,21 +873,6 @@ def __init__( self.kv_bank = nn.Parameter(torch.empty(2 * num_layers, kv_dim, model_dim)) self.mlp_up_bank = nn.Parameter(torch.empty(num_layers, mlp_dim, model_dim)) self.mlp_down_bank = nn.Parameter(torch.empty(num_layers, model_dim, mlp_dim)) - #
Per-pass LoRA adapters for recurrent core (scaled B @ A added to bank weights) - self._lora_scale = 1.0 / math.sqrt(lora_rank) if lora_rank > 0 else 1.0 - self.register_buffer('_lora_step_mul', torch.ones((), dtype=torch.float32), persistent=False) - if lora_rank > 0 and self.num_core > 0 and num_passes > 1: - nc, np_, r = self.num_core, num_passes, lora_rank - for wname, in_d, out_d in [ - ("q", model_dim, model_dim), ("out", model_dim, model_dim), - ("k", model_dim, kv_dim), ("v", model_dim, kv_dim), - ("up", model_dim, mlp_dim), ("down", mlp_dim, model_dim), - ]: - A = nn.Parameter(torch.empty(np_, nc, r, in_d)) - nn.init.normal_(A, mean=0.0, std=0.01) - B = nn.Parameter(torch.zeros(np_, nc, out_d, r)) - setattr(self, f"lora_A_{wname}", A) - setattr(self, f"lora_B_{wname}", B) self.blocks = nn.ModuleList( [ Block( @@ -1022,15 +1005,6 @@ def _forward_hidden(self, input_ids: Tensor, feedback_fn=None, h_prev = x ve = self._get_ve(j, input_ids, ve_cache) q_w, k_w, v_w, out_w, up_w, down_w = self._get_bank_weights(j) - if self.lora_rank > 0: - ci = j - self.core_start - s = self._lora_scale * self._lora_step_mul - q_w = q_w + s * (self.lora_B_q[k, ci] @ self.lora_A_q[k, ci]) - k_w = k_w + s * (self.lora_B_k[k, ci] @ self.lora_A_k[k, ci]) - v_w = v_w + s * (self.lora_B_v[k, ci] @ self.lora_A_v[k, ci]) - out_w = out_w + s * (self.lora_B_out[k, ci] @ self.lora_A_out[k, ci]) - up_w = up_w + s * (self.lora_B_up[k, ci] @ self.lora_A_up[k, ci]) - down_w = down_w + s * (self.lora_B_down[k, ci] @ self.lora_A_down[k, ci]) x, raw_v = self.blocks[j](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, v_embed=ve, v0=v0) if v0 is None and raw_v is not None: @@ -1492,14 +1466,6 @@ def parse_args() -> argparse.Namespace: g.add_argument("--residual-scale-init", type=float, default=0.5) g.add_argument("--jacobian-proxy-weight", type=float, default=0.01) g.add_argument("--no-interpass-rmsnorm", action="store_true") - g.add_argument("--lora-rank", type=int, default=0) - g.add_argument("--lora-warmup-steps", type=int, default=0, - help="Linearly ramp LoRA scale from 0 to 1 over this many steps.") - g = parser.add_argument_group("eval-only") - g.add_argument("--eval-only-passes", type=int, default=None, - help="Skip training; load final_model.pt and run TTT eval with this many passes.") - g.add_argument("--eval-only-checkpoint", type=str, default="final_model.pt", - help="Checkpoint path for --eval-only-passes mode.") return parser.parse_args() def main() -> None: @@ -1608,7 +1574,6 @@ def log0(msg: str, console: bool = True) -> None: core_quant_enabled=args.core_quant_enabled, residual_scale=None, interpass_rmsnorm=not cli.no_interpass_rmsnorm, - lora_rank=cli.lora_rank or args.lora_rank, ).to(device).bfloat16() # Banks stay FP32 (like CastedLinear weights), cast to BF16 in forward base_model.qo_bank.data = base_model.qo_bank.data.float() @@ -1647,56 +1612,10 @@ def feedback_fn(h, pass_idx): residual_scale = ResidualScale(args.num_passes, cli.residual_scale_init).to(device) base_model.residual_scale = residual_scale extra_scalar_params.extend(residual_scale.parameters()) - lora_params: list[nn.Parameter] = [] - if base_model.lora_rank > 0: - lora_params = [p for n, p in base_model.named_parameters() if "lora_" in n] - for p in lora_params: - p.data = p.data.float() - log0(f"lora: rank={base_model.lora_rank} params={sum(p.numel() for p in lora_params)}") log0(f"recurrence: core_start={args.core_start} core_end={args.core_end} " f"num_passes={args.num_passes} stem={base_model.num_stem} " f"core={base_model.num_core} 
tail={base_model.num_tail}") - # --- Eval-only mode: load checkpoint, override passes, run TTT, exit --- - if cli.eval_only_passes is not None: - ckpt_path = cli.eval_only_checkpoint - log0(f"eval_only: loading checkpoint {ckpt_path}") - ckpt_sd = torch.load(ckpt_path, map_location=device, weights_only=True) - base_model.load_state_dict(ckpt_sd, strict=True) - base_model.qo_bank.data = base_model.qo_bank.data.float() - base_model.kv_bank.data = base_model.kv_bank.data.float() - base_model.mlp_up_bank.data = base_model.mlp_up_bank.data.float() - base_model.mlp_down_bank.data = base_model.mlp_down_bank.data.float() - for module in base_model.modules(): - if isinstance(module, CastedLinear): - module.float() - restore_low_dim_params_to_fp32(base_model) - target_passes = cli.eval_only_passes - trained_passes = base_model.num_passes - log0(f"eval_only: overriding num_passes {trained_passes} -> {target_passes}") - base_model.num_passes = target_passes - if base_model.residual_scale is not None: - old_scales = base_model.residual_scale.scales.data - if target_passes != old_scales.shape[0]: - new_scales = torch.full((target_passes,), cli.residual_scale_init, - dtype=torch.float32, device=old_scales.device) - copy_len = min(target_passes, old_scales.shape[0]) - new_scales[:copy_len] = old_scales[:copy_len] - base_model.residual_scale.scales = nn.Parameter(new_scales) - log0(f"eval_only: ResidualScale padded/trimmed {old_scales.shape[0]} -> {target_passes}") - base_model.eval() - log0(f"eval_only: running TTT with {target_passes} passes") - ttt_loss, ttt_bpb = eval_val_sliding_ttt( - args, base_model, rank, world_size, device, - val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut, - stride=args.eval_stride, log0=log0, - ) - log0(f"legal_ttt val_loss:{ttt_loss:.4f} val_bpb:{ttt_bpb:.4f}") - log0(f"legal_ttt_exact val_loss:{ttt_loss:.8f} val_bpb:{ttt_bpb:.8f}") - if distributed: - dist.destroy_process_group() - return - # No DDP -- Parallel Muon handles bank grad communication via reduce-scatter, # and non-bank grads are manually all-reduced before Adam steps. 
compiled_model = torch.compile(base_model, dynamic=False, fullgraph=True) @@ -1774,23 +1693,9 @@ def feedback_fn(h, pass_idx): fused=True, ) replicated_params.append(base_model.lm_head.weight) - optimizer_lora = None - if lora_params: - lora_lr = args.scalar_lr * 0.1 - optimizer_lora = torch.optim.AdamW( - [{"params": lora_params, "lr": lora_lr, "base_lr": lora_lr}], - betas=(args.beta1, args.beta2), - eps=args.adam_eps, - weight_decay=args.adam_wd, - fused=True, - ) - replicated_params.extend(lora_params) - log0(f"lora_optimizer: lr={lora_lr} (scalar_lr * 0.1)") optimizers: list[torch.optim.Optimizer] = [optimizer_tok, optimizer_muon, optimizer_scalar] if optimizer_head is not None: optimizers.append(optimizer_head) - if optimizer_lora is not None: - optimizers.append(optimizer_lora) n_params = sum(p.numel() for p in base_model.parameters()) mtp_params = sum(p.numel() for p in base_model.mtp_heads.parameters()) log0(f"model_params:{n_params}") @@ -1941,8 +1846,6 @@ def lr_mul(step: int, elapsed_ms: float) -> float: CastedLinear._qat_enabled = True base_model.core_quant_enabled = True log0(f"late_qat:enabled step:{step} scale:{scale:.4f} core_quant:on") - if base_model.lora_rank > 0 and cli.lora_warmup_steps > 0: - base_model._lora_step_mul.fill_(min(step / cli.lora_warmup_steps, 1.0)) zero_grad_all() train_loss = torch.zeros((), device=device) for micro_step in range(grad_accum_steps): @@ -1974,8 +1877,6 @@ def lr_mul(step: int, elapsed_ms: float) -> float: optimizer_scalar.step() if optimizer_head is not None: optimizer_head.step() - if optimizer_lora is not None: - optimizer_lora.step() # Phase 3: Wait for RS, local NS5, all-gather (banks processed last) optimizer_muon.step() zero_grad_all() @@ -2072,6 +1973,18 @@ def lr_mul(step: int, elapsed_ms: float) -> float: code_bytes = len(code.encode("utf-8")) log0(f"Serialized model: {model_bytes} bytes") log0(f"Code size: {code_bytes} bytes") + # Override passes for eval phase (train cheap, eval deep) + eval_num_passes = args.eval_passes if args.eval_passes > 0 else args.num_passes + if eval_num_passes != args.num_passes: + log0(f"eval_override: num_passes {args.num_passes} -> {eval_num_passes}") + base_model.num_passes = eval_num_passes + if base_model.residual_scale is not None: + old_s = base_model.residual_scale.scales.data + new_s = torch.full((eval_num_passes,), cli.residual_scale_init, + dtype=torch.float32, device=old_s.device) + copy_len = min(eval_num_passes, old_s.shape[0]) + new_s[:copy_len] = old_s[:copy_len] + base_model.residual_scale.scales = nn.Parameter(new_s) # Unbank 3D tensors into individual 2D tensors for quantization sd_cpu = {k: v.detach().cpu() for k, v in export_sd.items()} unbanked_sd = _unbank_state_dict(sd_cpu, args.num_layers) @@ -2110,12 +2023,11 @@ def lr_mul(step: int, elapsed_ms: float) -> float: ve_enabled=args.ve_enabled, ve_dim=args.ve_dim, ve_layers=args.ve_layers, gated_attention=args.gated_attention, value_residual=args.value_residual, core_start=args.core_start, core_end=args.core_end, - num_passes=args.num_passes, + num_passes=eval_num_passes, interpass_rmsnorm=not cli.no_interpass_rmsnorm, - lora_rank=cli.lora_rank or args.lora_rank, ).to(device).bfloat16() if residual_scale is not None: - eval_rs = ResidualScale(args.num_passes, cli.residual_scale_init).to(device) + eval_rs = ResidualScale(eval_num_passes, cli.residual_scale_init).to(device) eval_model.residual_scale = eval_rs eval_model.qo_bank.data = eval_model.qo_bank.data.float() eval_model.kv_bank.data = eval_model.kv_bank.data.float() diff 
--git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/debug-internal.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/debug-internal.log deleted file mode 120000 index 4ddba16660..0000000000 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/debug-internal.log +++ /dev/null @@ -1 +0,0 @@ -run-20260327_080959-p8sqkbqa/logs/debug-internal.log \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/debug.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/debug.log deleted file mode 120000 index 749bc8ab5b..0000000000 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/debug.log +++ /dev/null @@ -1 +0,0 @@ -run-20260327_080959-p8sqkbqa/logs/debug.log \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_102057-7kghfexn/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_102057-7kghfexn/files/output.log deleted file mode 100644 index c553d12bd9..0000000000 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_102057-7kghfexn/files/output.log +++ /dev/null @@ -1,32 +0,0 @@ -wandb:initialized -warmup_step:1/5 -warmup_step:2/5 -warmup_step:3/5 -warmup_step:4/5 -warmup_step:5/5 -step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['10184.1', '11333.6', '12573.3', '13910.9', '15223.2', '8143.8', '9246.6', '10423.0', '11669.9', '12876.4'] growth=['1.115', '1.113', '1.109', '1.106', '1.094', '1.148', '1.135', '1.127', '1.120', '1.103'] -step:1/50 train_loss:6.9310 train_time:1952ms step_avg:1952.03ms -step:2/50 train_loss:8.5267 train_time:3880ms step_avg:1940.18ms -step:3/50 train_loss:7.6283 train_time:5841ms step_avg:1947.02ms -step:4/50 train_loss:7.3205 train_time:7802ms step_avg:1950.38ms -step:5/50 train_loss:7.1281 train_time:9762ms step_avg:1952.40ms -step:6/50 train_loss:7.0824 train_time:11723ms step_avg:1953.81ms -step:7/50 train_loss:7.0693 train_time:13683ms step_avg:1954.74ms -step:8/50 train_loss:6.9484 train_time:15645ms step_avg:1955.59ms -step:9/50 train_loss:6.6018 train_time:17606ms step_avg:1956.19ms -step:10/50 train_loss:6.2455 train_time:19568ms step_avg:1956.75ms -step:20/50 train_loss:4.9604 train_time:39172ms step_avg:1958.61ms -step:25/50 val_loss:4.4390 val_bpb:2.6290 train_time:49012ms step_avg:1960.49ms h_norms=['14873.4', '16984.5', '19609.5', '22769.3', '26596.0', '9176.2', '11567.0', '14326.1', '17549.3', '21209.4'] growth=['1.122', '1.142', '1.155', '1.161', '1.168', '1.293', '1.261', '1.239', '1.225', '1.209'] -step:30/50 train_loss:4.2813 train_time:58788ms step_avg:1959.60ms -step:40/50 train_loss:3.9445 train_time:78404ms step_avg:1960.11ms -step:50/50 train_loss:3.7524 train_time:98153ms step_avg:1963.05ms -step:50/50 val_loss:3.7228 val_bpb:2.2048 train_time:98186ms step_avg:1963.73ms h_norms=['24291.9', '27271.2', '31301.3', '36517.1', '43661.8', '10419.3', '14412.6', '19169.9', '24972.2', '31506.7'] growth=['1.087', '1.123', '1.148', '1.167', '1.196', '1.469', '1.383', '1.330', '1.303', '1.262'] -peak memory allocated: 42589 MiB reserved: 43756 MiB -ema:applying EMA weights -DIAGNOSTIC post_ema val_loss:5.9163 val_bpb:3.5040 eval_time:53998ms -Serialized model: 106023671 bytes -Code size: 98482 bytes -Serialized model int6+lzma: 4808664 bytes -Total submission size int6+lzma: 4907146 bytes -final_int6_roundtrip val_loss:6.1192 val_bpb:3.6242 
eval_time:53678ms -final_int6_roundtrip_exact val_loss:6.11924756 val_bpb:3.62416309 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_102057-7kghfexn/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_102057-7kghfexn/files/requirements.txt deleted file mode 100644 index e3d59eea39..0000000000 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_102057-7kghfexn/files/requirements.txt +++ /dev/null @@ -1,101 +0,0 @@ -mdurl==0.1.2 -nvidia-cudnn-cu13==9.19.0.56 -aiohttp==3.13.3 -nvidia-cufile==1.15.1.6 -charset-normalizer==3.4.6 -Jinja2==3.1.6 -hf-xet==1.4.2 -nvidia-cuda-nvrtc-cu12==12.8.93 -typer==0.24.1 -attrs==26.1.0 -certifi==2026.2.25 -triton==3.6.0 -nvidia-nccl-cu12==2.29.7 -wheel==0.46.3 -nvidia-nvtx-cu12==12.8.90 -gitdb==4.0.12 -dill==0.4.1 -nvidia-cuda-cupti==13.0.85 -tqdm==4.67.3 -pandas==3.0.1 -PyYAML==6.0.3 -annotated-types==0.7.0 -annotated-doc==0.0.4 -nvidia-nccl-cu13==2.28.9 -nvidia-cufft-cu12==11.3.3.83 -nvidia-cuda-nvrtc==13.0.88 -nvidia-cudnn-cu12==9.20.0.48 -httpx==0.28.1 -packaging==26.0 -einops==0.8.2 -xxhash==3.6.0 -huggingface_hub==1.8.0 -Pygments==2.19.2 -markdown-it-py==4.0.0 -pydantic_core==2.41.5 -nvidia-cusparse-cu12==12.5.8.93 -cuda-toolkit==13.0.2 -rich==14.3.3 -six==1.17.0 -python-dateutil==2.9.0.post0 -nvidia-cusolver==12.0.4.66 -nvidia-nvshmem-cu13==3.4.5 -setuptools==81.0.0 -pyarrow==23.0.1 -typing_extensions==4.15.0 -MarkupSafe==3.0.3 -smmap==5.0.3 -filelock==3.25.2 -nvidia-nvtx==13.0.85 -multiprocess==0.70.19 -networkx==3.6.1 -pydantic==2.12.5 -nvidia-nvshmem-cu12==3.4.5 -nvidia-cublas-cu12==12.8.4.1 -anyio==4.13.0 -nvidia-cufft==12.0.0.61 -cuda-pathfinder==1.5.0 -mpmath==1.3.0 -cuda-bindings==13.2.0 -propcache==0.4.1 -yarl==1.23.0 -ninja==1.13.0 -typing-inspection==0.4.2 -idna==3.11 -h11==0.16.0 -urllib3==2.6.3 -multidict==6.7.1 -aiosignal==1.4.0 -nvidia-nvjitlink-cu12==12.8.93 -nvidia-cusparse==12.6.3.3 -aiohappyeyeballs==2.6.1 -psutil==7.2.2 -wandb==0.25.1 -protobuf==6.33.6 -click==8.3.1 -nvidia-cufile-cu12==1.13.1.3 -httpcore==1.0.9 -sentencepiece==0.2.1 -fsspec==2026.2.0 -nvidia-curand-cu12==10.3.9.90 -nvidia-curand==10.4.0.35 -GitPython==3.1.46 -pip==26.0.1 -platformdirs==4.9.4 -nvidia-cublas==13.1.0.3 -nvidia-cuda-cupti-cu12==12.8.90 -flash_attn==2.8.3 -nvidia-cusolver-cu12==11.7.3.90 -sympy==1.14.0 -torch==2.11.0 -numpy==2.4.3 -nvidia-cuda-runtime-cu12==12.8.90 -nvidia-cusparselt-cu13==0.8.0 -frozenlist==1.8.0 -sentry-sdk==2.56.0 -requests==2.33.0 -nvidia-cuda-runtime==13.0.96 -nvidia-nvjitlink==13.0.88 -nvidia-cusparselt-cu12==0.7.1 -shellingham==1.5.4 -datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103645-sx3ojo04/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103645-sx3ojo04/files/output.log deleted file mode 100644 index 355a614aaf..0000000000 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103645-sx3ojo04/files/output.log +++ /dev/null @@ -1,3 +0,0 @@ -wandb:initialized -warmup_step:1/5 -warmup_step:2/5 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103645-sx3ojo04/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103645-sx3ojo04/files/requirements.txt deleted file mode 100644 index e3d59eea39..0000000000 --- 
a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103645-sx3ojo04/files/requirements.txt +++ /dev/null @@ -1,101 +0,0 @@ -mdurl==0.1.2 -nvidia-cudnn-cu13==9.19.0.56 -aiohttp==3.13.3 -nvidia-cufile==1.15.1.6 -charset-normalizer==3.4.6 -Jinja2==3.1.6 -hf-xet==1.4.2 -nvidia-cuda-nvrtc-cu12==12.8.93 -typer==0.24.1 -attrs==26.1.0 -certifi==2026.2.25 -triton==3.6.0 -nvidia-nccl-cu12==2.29.7 -wheel==0.46.3 -nvidia-nvtx-cu12==12.8.90 -gitdb==4.0.12 -dill==0.4.1 -nvidia-cuda-cupti==13.0.85 -tqdm==4.67.3 -pandas==3.0.1 -PyYAML==6.0.3 -annotated-types==0.7.0 -annotated-doc==0.0.4 -nvidia-nccl-cu13==2.28.9 -nvidia-cufft-cu12==11.3.3.83 -nvidia-cuda-nvrtc==13.0.88 -nvidia-cudnn-cu12==9.20.0.48 -httpx==0.28.1 -packaging==26.0 -einops==0.8.2 -xxhash==3.6.0 -huggingface_hub==1.8.0 -Pygments==2.19.2 -markdown-it-py==4.0.0 -pydantic_core==2.41.5 -nvidia-cusparse-cu12==12.5.8.93 -cuda-toolkit==13.0.2 -rich==14.3.3 -six==1.17.0 -python-dateutil==2.9.0.post0 -nvidia-cusolver==12.0.4.66 -nvidia-nvshmem-cu13==3.4.5 -setuptools==81.0.0 -pyarrow==23.0.1 -typing_extensions==4.15.0 -MarkupSafe==3.0.3 -smmap==5.0.3 -filelock==3.25.2 -nvidia-nvtx==13.0.85 -multiprocess==0.70.19 -networkx==3.6.1 -pydantic==2.12.5 -nvidia-nvshmem-cu12==3.4.5 -nvidia-cublas-cu12==12.8.4.1 -anyio==4.13.0 -nvidia-cufft==12.0.0.61 -cuda-pathfinder==1.5.0 -mpmath==1.3.0 -cuda-bindings==13.2.0 -propcache==0.4.1 -yarl==1.23.0 -ninja==1.13.0 -typing-inspection==0.4.2 -idna==3.11 -h11==0.16.0 -urllib3==2.6.3 -multidict==6.7.1 -aiosignal==1.4.0 -nvidia-nvjitlink-cu12==12.8.93 -nvidia-cusparse==12.6.3.3 -aiohappyeyeballs==2.6.1 -psutil==7.2.2 -wandb==0.25.1 -protobuf==6.33.6 -click==8.3.1 -nvidia-cufile-cu12==1.13.1.3 -httpcore==1.0.9 -sentencepiece==0.2.1 -fsspec==2026.2.0 -nvidia-curand-cu12==10.3.9.90 -nvidia-curand==10.4.0.35 -GitPython==3.1.46 -pip==26.0.1 -platformdirs==4.9.4 -nvidia-cublas==13.1.0.3 -nvidia-cuda-cupti-cu12==12.8.90 -flash_attn==2.8.3 -nvidia-cusolver-cu12==11.7.3.90 -sympy==1.14.0 -torch==2.11.0 -numpy==2.4.3 -nvidia-cuda-runtime-cu12==12.8.90 -nvidia-cusparselt-cu13==0.8.0 -frozenlist==1.8.0 -sentry-sdk==2.56.0 -requests==2.33.0 -nvidia-cuda-runtime==13.0.96 -nvidia-nvjitlink==13.0.88 -nvidia-cusparselt-cu12==0.7.1 -shellingham==1.5.4 -datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103737-eswa6xue/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103737-eswa6xue/files/output.log deleted file mode 100644 index 59b3f03cc7..0000000000 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103737-eswa6xue/files/output.log +++ /dev/null @@ -1,32 +0,0 @@ -wandb:initialized -warmup_step:1/5 -warmup_step:2/5 -warmup_step:3/5 -warmup_step:4/5 -warmup_step:5/5 -step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['10183.6', '11333.2', '12572.8', '13910.4', '15222.6', '8143.9', '9247.0', '10423.5', '11670.2', '12876.4'] growth=['1.115', '1.113', '1.109', '1.106', '1.094', '1.148', '1.135', '1.127', '1.120', '1.103'] -step:1/50 train_loss:6.9310 train_time:1949ms step_avg:1949.49ms -step:2/50 train_loss:8.5267 train_time:3878ms step_avg:1938.79ms -step:3/50 train_loss:7.6283 train_time:5839ms step_avg:1946.17ms -step:4/50 train_loss:7.3204 train_time:7799ms step_avg:1949.66ms -step:5/50 train_loss:7.1281 train_time:9759ms step_avg:1951.73ms -step:6/50 train_loss:7.0824 train_time:11719ms step_avg:1953.12ms -step:7/50 train_loss:7.0695 
train_time:13679ms step_avg:1954.13ms -step:8/50 train_loss:6.9486 train_time:15639ms step_avg:1954.91ms -step:9/50 train_loss:6.6021 train_time:17599ms step_avg:1955.45ms -step:10/50 train_loss:6.2460 train_time:19559ms step_avg:1955.87ms -step:20/50 train_loss:4.9600 train_time:39160ms step_avg:1958.01ms -step:25/50 val_loss:4.4394 val_bpb:2.6293 train_time:48999ms step_avg:1959.95ms h_norms=['14881.1', '16985.9', '19606.6', '22793.0', '26627.8', '9172.8', '11563.4', '14325.3', '17588.9', '21265.7'] growth=['1.122', '1.141', '1.154', '1.163', '1.168', '1.293', '1.261', '1.239', '1.228', '1.209'] -step:30/50 train_loss:4.2822 train_time:58772ms step_avg:1959.06ms -step:40/50 train_loss:3.9688 train_time:78387ms step_avg:1959.68ms -step:50/50 train_loss:3.7881 train_time:98126ms step_avg:1962.52ms -step:50/50 val_loss:3.7348 val_bpb:2.2120 train_time:98160ms step_avg:1963.20ms h_norms=['24218.6', '27362.1', '31613.8', '37171.5', '44612.3', '10585.9', '14797.2', '19826.0', '26015.5', '32857.0'] growth=['1.092', '1.130', '1.155', '1.176', '1.200', '1.492', '1.398', '1.340', '1.312', '1.263'] -peak memory allocated: 42589 MiB reserved: 43756 MiB -ema:applying EMA weights -DIAGNOSTIC post_ema val_loss:5.9155 val_bpb:3.5035 eval_time:53996ms -Serialized model: 106023671 bytes -Code size: 98482 bytes -Serialized model int6+lzma: 4809880 bytes -Total submission size int6+lzma: 4908362 bytes -final_int6_roundtrip val_loss:6.1179 val_bpb:3.6234 eval_time:53675ms -final_int6_roundtrip_exact val_loss:6.11791815 val_bpb:3.62337574 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103737-eswa6xue/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103737-eswa6xue/files/requirements.txt deleted file mode 100644 index e3d59eea39..0000000000 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_103737-eswa6xue/files/requirements.txt +++ /dev/null @@ -1,101 +0,0 @@ -mdurl==0.1.2 -nvidia-cudnn-cu13==9.19.0.56 -aiohttp==3.13.3 -nvidia-cufile==1.15.1.6 -charset-normalizer==3.4.6 -Jinja2==3.1.6 -hf-xet==1.4.2 -nvidia-cuda-nvrtc-cu12==12.8.93 -typer==0.24.1 -attrs==26.1.0 -certifi==2026.2.25 -triton==3.6.0 -nvidia-nccl-cu12==2.29.7 -wheel==0.46.3 -nvidia-nvtx-cu12==12.8.90 -gitdb==4.0.12 -dill==0.4.1 -nvidia-cuda-cupti==13.0.85 -tqdm==4.67.3 -pandas==3.0.1 -PyYAML==6.0.3 -annotated-types==0.7.0 -annotated-doc==0.0.4 -nvidia-nccl-cu13==2.28.9 -nvidia-cufft-cu12==11.3.3.83 -nvidia-cuda-nvrtc==13.0.88 -nvidia-cudnn-cu12==9.20.0.48 -httpx==0.28.1 -packaging==26.0 -einops==0.8.2 -xxhash==3.6.0 -huggingface_hub==1.8.0 -Pygments==2.19.2 -markdown-it-py==4.0.0 -pydantic_core==2.41.5 -nvidia-cusparse-cu12==12.5.8.93 -cuda-toolkit==13.0.2 -rich==14.3.3 -six==1.17.0 -python-dateutil==2.9.0.post0 -nvidia-cusolver==12.0.4.66 -nvidia-nvshmem-cu13==3.4.5 -setuptools==81.0.0 -pyarrow==23.0.1 -typing_extensions==4.15.0 -MarkupSafe==3.0.3 -smmap==5.0.3 -filelock==3.25.2 -nvidia-nvtx==13.0.85 -multiprocess==0.70.19 -networkx==3.6.1 -pydantic==2.12.5 -nvidia-nvshmem-cu12==3.4.5 -nvidia-cublas-cu12==12.8.4.1 -anyio==4.13.0 -nvidia-cufft==12.0.0.61 -cuda-pathfinder==1.5.0 -mpmath==1.3.0 -cuda-bindings==13.2.0 -propcache==0.4.1 -yarl==1.23.0 -ninja==1.13.0 -typing-inspection==0.4.2 -idna==3.11 -h11==0.16.0 -urllib3==2.6.3 -multidict==6.7.1 -aiosignal==1.4.0 -nvidia-nvjitlink-cu12==12.8.93 -nvidia-cusparse==12.6.3.3 -aiohappyeyeballs==2.6.1 -psutil==7.2.2 -wandb==0.25.1 -protobuf==6.33.6 -click==8.3.1 
-nvidia-cufile-cu12==1.13.1.3 -httpcore==1.0.9 -sentencepiece==0.2.1 -fsspec==2026.2.0 -nvidia-curand-cu12==10.3.9.90 -nvidia-curand==10.4.0.35 -GitPython==3.1.46 -pip==26.0.1 -platformdirs==4.9.4 -nvidia-cublas==13.1.0.3 -nvidia-cuda-cupti-cu12==12.8.90 -flash_attn==2.8.3 -nvidia-cusolver-cu12==11.7.3.90 -sympy==1.14.0 -torch==2.11.0 -numpy==2.4.3 -nvidia-cuda-runtime-cu12==12.8.90 -nvidia-cusparselt-cu13==0.8.0 -frozenlist==1.8.0 -sentry-sdk==2.56.0 -requests==2.33.0 -nvidia-cuda-runtime==13.0.96 -nvidia-nvjitlink==13.0.88 -nvidia-cusparselt-cu12==0.7.1 -shellingham==1.5.4 -datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_104722-1bh0d9xu/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_104722-1bh0d9xu/files/output.log deleted file mode 100644 index 094928789d..0000000000 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_104722-1bh0d9xu/files/output.log +++ /dev/null @@ -1,32 +0,0 @@ -wandb:initialized -warmup_step:1/5 -warmup_step:2/5 -warmup_step:3/5 -warmup_step:4/5 -warmup_step:5/5 -step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['10183.4', '11332.7', '12572.5', '13910.0', '15222.2', '8143.5', '9246.5', '10422.9', '11669.4', '12875.7'] growth=['1.115', '1.113', '1.109', '1.106', '1.094', '1.148', '1.135', '1.127', '1.120', '1.103'] -step:1/50 train_loss:6.9310 train_time:1951ms step_avg:1951.04ms -step:2/50 train_loss:8.5267 train_time:3880ms step_avg:1940.13ms -step:3/50 train_loss:7.6283 train_time:5841ms step_avg:1947.13ms -step:4/50 train_loss:7.3204 train_time:7802ms step_avg:1950.55ms -step:5/50 train_loss:7.1281 train_time:9763ms step_avg:1952.51ms -step:6/50 train_loss:7.0824 train_time:11723ms step_avg:1953.92ms -step:7/50 train_loss:7.0689 train_time:13684ms step_avg:1954.91ms -step:8/50 train_loss:6.9485 train_time:15646ms step_avg:1955.69ms -step:9/50 train_loss:6.6014 train_time:17607ms step_avg:1956.32ms -step:10/50 train_loss:6.2455 train_time:19569ms step_avg:1956.88ms -step:20/50 train_loss:4.9608 train_time:39179ms step_avg:1958.94ms -step:25/50 val_loss:4.4416 val_bpb:2.6306 train_time:49019ms step_avg:1960.76ms h_norms=['14908.8', '17039.5', '19688.6', '22908.3', '26782.7', '9197.9', '11626.4', '14426.1', '17733.4', '21464.3'] growth=['1.123', '1.143', '1.155', '1.164', '1.169', '1.296', '1.264', '1.241', '1.229', '1.210'] -step:30/50 train_loss:4.2882 train_time:58795ms step_avg:1959.83ms -step:40/50 train_loss:3.9513 train_time:78416ms step_avg:1960.41ms -step:50/50 train_loss:3.7675 train_time:98151ms step_avg:1963.03ms -step:50/50 val_loss:3.7271 val_bpb:2.2074 train_time:98185ms step_avg:1963.71ms h_norms=['24528.6', '27553.7', '31725.9', '37229.2', '44851.6', '10551.0', '14775.8', '19896.1', '26232.7', '33500.8'] growth=['1.086', '1.123', '1.151', '1.173', '1.205', '1.487', '1.400', '1.347', '1.318', '1.277'] -peak memory allocated: 42589 MiB reserved: 43756 MiB -ema:applying EMA weights -DIAGNOSTIC post_ema val_loss:5.9153 val_bpb:3.5034 eval_time:53981ms -Serialized model: 106023671 bytes -Code size: 98507 bytes -Serialized model int6+lzma: 4812040 bytes -Total submission size int6+lzma: 4910547 bytes -final_int6_roundtrip val_loss:6.1184 val_bpb:3.6236 eval_time:53666ms -final_int6_roundtrip_exact val_loss:6.11836447 val_bpb:3.62364007 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_104722-1bh0d9xu/files/requirements.txt 
b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_104722-1bh0d9xu/files/requirements.txt deleted file mode 100644 index e3d59eea39..0000000000 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_104722-1bh0d9xu/files/requirements.txt +++ /dev/null @@ -1,101 +0,0 @@ -mdurl==0.1.2 -nvidia-cudnn-cu13==9.19.0.56 -aiohttp==3.13.3 -nvidia-cufile==1.15.1.6 -charset-normalizer==3.4.6 -Jinja2==3.1.6 -hf-xet==1.4.2 -nvidia-cuda-nvrtc-cu12==12.8.93 -typer==0.24.1 -attrs==26.1.0 -certifi==2026.2.25 -triton==3.6.0 -nvidia-nccl-cu12==2.29.7 -wheel==0.46.3 -nvidia-nvtx-cu12==12.8.90 -gitdb==4.0.12 -dill==0.4.1 -nvidia-cuda-cupti==13.0.85 -tqdm==4.67.3 -pandas==3.0.1 -PyYAML==6.0.3 -annotated-types==0.7.0 -annotated-doc==0.0.4 -nvidia-nccl-cu13==2.28.9 -nvidia-cufft-cu12==11.3.3.83 -nvidia-cuda-nvrtc==13.0.88 -nvidia-cudnn-cu12==9.20.0.48 -httpx==0.28.1 -packaging==26.0 -einops==0.8.2 -xxhash==3.6.0 -huggingface_hub==1.8.0 -Pygments==2.19.2 -markdown-it-py==4.0.0 -pydantic_core==2.41.5 -nvidia-cusparse-cu12==12.5.8.93 -cuda-toolkit==13.0.2 -rich==14.3.3 -six==1.17.0 -python-dateutil==2.9.0.post0 -nvidia-cusolver==12.0.4.66 -nvidia-nvshmem-cu13==3.4.5 -setuptools==81.0.0 -pyarrow==23.0.1 -typing_extensions==4.15.0 -MarkupSafe==3.0.3 -smmap==5.0.3 -filelock==3.25.2 -nvidia-nvtx==13.0.85 -multiprocess==0.70.19 -networkx==3.6.1 -pydantic==2.12.5 -nvidia-nvshmem-cu12==3.4.5 -nvidia-cublas-cu12==12.8.4.1 -anyio==4.13.0 -nvidia-cufft==12.0.0.61 -cuda-pathfinder==1.5.0 -mpmath==1.3.0 -cuda-bindings==13.2.0 -propcache==0.4.1 -yarl==1.23.0 -ninja==1.13.0 -typing-inspection==0.4.2 -idna==3.11 -h11==0.16.0 -urllib3==2.6.3 -multidict==6.7.1 -aiosignal==1.4.0 -nvidia-nvjitlink-cu12==12.8.93 -nvidia-cusparse==12.6.3.3 -aiohappyeyeballs==2.6.1 -psutil==7.2.2 -wandb==0.25.1 -protobuf==6.33.6 -click==8.3.1 -nvidia-cufile-cu12==1.13.1.3 -httpcore==1.0.9 -sentencepiece==0.2.1 -fsspec==2026.2.0 -nvidia-curand-cu12==10.3.9.90 -nvidia-curand==10.4.0.35 -GitPython==3.1.46 -pip==26.0.1 -platformdirs==4.9.4 -nvidia-cublas==13.1.0.3 -nvidia-cuda-cupti-cu12==12.8.90 -flash_attn==2.8.3 -nvidia-cusolver-cu12==11.7.3.90 -sympy==1.14.0 -torch==2.11.0 -numpy==2.4.3 -nvidia-cuda-runtime-cu12==12.8.90 -nvidia-cusparselt-cu13==0.8.0 -frozenlist==1.8.0 -sentry-sdk==2.56.0 -requests==2.33.0 -nvidia-cuda-runtime==13.0.96 -nvidia-nvjitlink==13.0.88 -nvidia-cusparselt-cu12==0.7.1 -shellingham==1.5.4 -datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_105405-dd5xlg1l/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_105405-dd5xlg1l/files/output.log deleted file mode 100644 index 5d14acfddd..0000000000 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_105405-dd5xlg1l/files/output.log +++ /dev/null @@ -1,32 +0,0 @@ -wandb:initialized -warmup_step:1/5 -warmup_step:2/5 -warmup_step:3/5 -warmup_step:4/5 -warmup_step:5/5 -step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['10183.3', '11332.9', '12572.4', '13909.9', '15221.9', '8143.5', '9246.2', '10422.5', '11669.1', '12875.2'] growth=['1.115', '1.113', '1.109', '1.106', '1.094', '1.148', '1.135', '1.127', '1.120', '1.103'] -step:1/50 train_loss:6.9310 train_time:1969ms step_avg:1968.99ms -step:2/50 train_loss:8.5267 train_time:3913ms step_avg:1956.75ms -step:3/50 train_loss:7.6283 train_time:5891ms step_avg:1963.51ms -step:4/50 train_loss:7.3204 train_time:7867ms 
step_avg:1966.84ms
-step:5/50 train_loss:7.1282 train_time:9845ms step_avg:1969.03ms
-step:6/50 train_loss:7.0824 train_time:11823ms step_avg:1970.45ms
-step:7/50 train_loss:7.0693 train_time:13800ms step_avg:1971.43ms
-step:8/50 train_loss:6.9483 train_time:15777ms step_avg:1972.14ms
-step:9/50 train_loss:6.6021 train_time:17755ms step_avg:1972.76ms
-step:10/50 train_loss:6.2458 train_time:19732ms step_avg:1973.24ms
-step:20/50 train_loss:4.9604 train_time:39495ms step_avg:1974.76ms
-step:25/50 val_loss:4.4376 val_bpb:2.6282 train_time:49413ms step_avg:1976.52ms h_norms=['14875.2', '16993.9', '19634.3', '22841.7', '26706.4', '9197.5', '11608.3', '14391.8', '17687.0', '21405.1'] growth=['1.123', '1.142', '1.155', '1.163', '1.169', '1.296', '1.262', '1.240', '1.229', '1.210']
-step:30/50 train_loss:4.2700 train_time:59266ms step_avg:1975.53ms
-step:40/50 train_loss:3.9376 train_time:79041ms step_avg:1976.01ms
-step:50/50 train_loss:3.7755 train_time:98947ms step_avg:1978.94ms
-step:50/50 val_loss:3.7384 val_bpb:2.2141 train_time:98981ms step_avg:1979.62ms h_norms=['24517.2', '27557.3', '31658.9', '36931.0', '44051.8', '10467.3', '14516.5', '19325.2', '25155.9', '31522.6'] growth=['1.087', '1.124', '1.149', '1.167', '1.193', '1.475', '1.387', '1.331', '1.302', '1.253']
-peak memory allocated: 42782 MiB reserved: 44140 MiB
-ema:applying EMA weights
-DIAGNOSTIC post_ema val_loss:5.9157 val_bpb:3.5036 eval_time:54130ms
-Serialized model: 106023671 bytes
-Code size: 98507 bytes
-Serialized model int6+lzma: 4808692 bytes
-Total submission size int6+lzma: 4907199 bytes
-final_int6_roundtrip val_loss:6.1186 val_bpb:3.6238 eval_time:53816ms
-final_int6_roundtrip_exact val_loss:6.11859679 val_bpb:3.62377766
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_105405-dd5xlg1l/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_105405-dd5xlg1l/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_105405-dd5xlg1l/files/requirements.txt
+++ /dev/null
@@ -1,101 +0,0 @@
[101 deleted lines omitted; identical to the first requirements.txt hunk above (index e3d59eea39)]
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110050-pmxdy841/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110050-pmxdy841/files/output.log
deleted file mode 100644
index a929e78e0c..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110050-pmxdy841/files/output.log
+++ /dev/null
@@ -1,32 +0,0 @@
-wandb:initialized
-warmup_step:1/5
-warmup_step:2/5
-warmup_step:3/5
-warmup_step:4/5
-warmup_step:5/5
-step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['10183.7', '11333.2', '12572.7', '13910.3', '15222.4', '8143.9', '9246.6', '10423.1', '11669.9', '12876.1'] growth=['1.115', '1.113', '1.109', '1.106', '1.094', '1.148', '1.135', '1.127', '1.120', '1.103']
-step:1/50 train_loss:6.9310 train_time:1965ms step_avg:1964.66ms
-step:2/50 train_loss:8.5267 train_time:3905ms step_avg:1952.55ms
-step:3/50 train_loss:7.6283 train_time:5878ms step_avg:1959.29ms
-step:4/50 train_loss:7.3204 train_time:7850ms step_avg:1962.54ms
-step:5/50 train_loss:7.1282 train_time:9822ms step_avg:1964.43ms
-step:6/50 train_loss:7.0825 train_time:11795ms step_avg:1965.78ms
-step:7/50 train_loss:7.0698 train_time:13768ms step_avg:1966.81ms
-step:8/50 train_loss:6.9490 train_time:15741ms step_avg:1967.65ms
-step:9/50 train_loss:6.6024 train_time:17715ms step_avg:1968.31ms
-step:10/50 train_loss:6.2457 train_time:19688ms step_avg:1968.83ms
-step:20/50 train_loss:4.9609 train_time:39413ms step_avg:1970.66ms
-step:25/50 val_loss:4.4376 val_bpb:2.6282 train_time:49313ms step_avg:1972.51ms h_norms=['14907.9', '17014.7', '19640.8', '22830.5', '26674.6', '9180.5', '11569.7', '14334.2', '17600.8', '21284.5'] growth=['1.122', '1.141', '1.154', '1.162', '1.168', '1.294', '1.260', '1.239', '1.228', '1.209']
-step:30/50 train_loss:4.2769 train_time:59149ms step_avg:1971.62ms
-step:40/50 train_loss:3.9519 train_time:78889ms step_avg:1972.21ms
-step:50/50 train_loss:3.7667 train_time:98755ms step_avg:1975.11ms
-step:50/50 val_loss:3.7317 val_bpb:2.2101 train_time:98790ms step_avg:1975.79ms h_norms=['24105.4', '27189.2', '31353.5', '36701.3', '43829.7', '10500.5', '14546.0', '19432.5', '25342.5', '31770.4'] growth=['1.093', '1.128', '1.153', '1.171', '1.194', '1.480', '1.385', '1.336', '1.304', '1.254']
-peak memory allocated: 42782 MiB reserved: 44140 MiB
-ema:applying EMA weights
-DIAGNOSTIC post_ema val_loss:5.9158 val_bpb:3.5037 eval_time:54005ms
-Serialized model: 106023671 bytes
-Code size: 98507 bytes
-Serialized model int6+lzma: 4808548 bytes
-Total submission size int6+lzma: 4907055 bytes
-final_int6_roundtrip val_loss:6.1173 val_bpb:3.6230 eval_time:53697ms
-final_int6_roundtrip_exact val_loss:6.11731613 val_bpb:3.62301918
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110050-pmxdy841/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110050-pmxdy841/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110050-pmxdy841/files/requirements.txt
+++ /dev/null
@@ -1,101 +0,0 @@
[101 deleted lines omitted; identical to the first requirements.txt hunk above (index e3d59eea39)]
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110735-wq77le9z/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110735-wq77le9z/files/output.log
deleted file mode 100644
index 7b128f42dc..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110735-wq77le9z/files/output.log
+++ /dev/null
@@ -1,32 +0,0 @@
-wandb:initialized
-warmup_step:1/5
-warmup_step:2/5
-warmup_step:3/5
-warmup_step:4/5
-warmup_step:5/5
-step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['10183.5', '11332.8', '12572.3', '13909.9', '15221.9', '8143.4', '9246.3', '10422.6', '11669.3', '12875.4'] growth=['1.115', '1.113', '1.109', '1.106', '1.094', '1.148', '1.135', '1.127', '1.120', '1.103']
-step:1/50 train_loss:6.9310 train_time:1964ms step_avg:1963.98ms
-step:2/50 train_loss:8.5267 train_time:3904ms step_avg:1952.08ms
-step:3/50 train_loss:7.6283 train_time:5877ms step_avg:1958.95ms
-step:4/50 train_loss:7.3203 train_time:7849ms step_avg:1962.28ms
-step:5/50 train_loss:7.1281 train_time:9821ms step_avg:1964.23ms
-step:6/50 train_loss:7.0823 train_time:11793ms step_avg:1965.52ms
-step:7/50 train_loss:7.0690 train_time:13766ms step_avg:1966.50ms
-step:8/50 train_loss:6.9488 train_time:15738ms step_avg:1967.24ms
-step:9/50 train_loss:6.6003 train_time:17711ms step_avg:1967.83ms
-step:10/50 train_loss:6.2452 train_time:19683ms step_avg:1968.32ms
-step:20/50 train_loss:4.9631 train_time:39401ms step_avg:1970.07ms
-step:25/50 val_loss:4.4339 val_bpb:2.6260 train_time:49298ms step_avg:1971.91ms h_norms=['14892.2', '17012.7', '19653.8', '22861.1', '26720.5', '9181.8', '11585.2', '14367.3', '17660.0', '21364.8'] growth=['1.122', '1.142', '1.155', '1.163', '1.169', '1.294', '1.262', '1.240', '1.229', '1.210']
-step:30/50 train_loss:4.2683 train_time:59129ms step_avg:1970.98ms
-step:40/50 train_loss:3.9478 train_time:78860ms step_avg:1971.49ms
-step:50/50 train_loss:3.7715 train_time:98733ms step_avg:1974.66ms
-step:50/50 val_loss:3.7726 val_bpb:2.2343 train_time:98767ms step_avg:1975.34ms h_norms=['23953.3', '26694.0', '30396.2', '35111.3', '41668.7', '10222.5', '13936.6', '18323.8', '23608.5', '29594.6'] growth=['1.079', '1.114', '1.139', '1.155', '1.187', '1.441', '1.363', '1.315', '1.288', '1.254']
-peak memory allocated: 42782 MiB reserved: 44140 MiB
-ema:applying EMA weights
-DIAGNOSTIC post_ema val_loss:5.9155 val_bpb:3.5035 eval_time:53980ms
-Serialized model: 106023671 bytes
-Code size: 98507 bytes
-Serialized model int6+lzma: 4811364 bytes
-Total submission size int6+lzma: 4909871 bytes
-final_int6_roundtrip val_loss:6.1183 val_bpb:3.6236 eval_time:53671ms
-final_int6_roundtrip_exact val_loss:6.11825506 val_bpb:3.62357527
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110735-wq77le9z/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110735-wq77le9z/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_110735-wq77le9z/files/requirements.txt
+++ /dev/null
@@ -1,101 +0,0 @@
[101 deleted lines omitted; identical to the first requirements.txt hunk above (index e3d59eea39)]
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_111419-rr366tug/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_111419-rr366tug/files/output.log
deleted file mode 100644
index a87a91730e..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_111419-rr366tug/files/output.log
+++ /dev/null
@@ -1,32 +0,0 @@
-wandb:initialized
-warmup_step:1/5
-warmup_step:2/5
-warmup_step:3/5
-warmup_step:4/5
-warmup_step:5/5
-step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms h_norms=['10017.6', '11136.0', '12344.2', '13648.8', '14930.5', '8131.3', '9221.1', '10380.0', '11610.4', '12803.6', '8145.6', '9247.8', '10417.9', '11656.5', '12858.6'] growth=['1.114', '1.112', '1.108', '1.106', '1.094', '1.146', '1.134', '1.126', '1.119', '1.103', '1.148', '1.135', '1.127', '1.119', '1.103']
-step:1/50 train_loss:6.9310 train_time:2550ms step_avg:2549.53ms
-step:2/50 train_loss:8.4366 train_time:5080ms step_avg:2540.20ms
-step:3/50 train_loss:7.5697 train_time:7644ms step_avg:2547.96ms
-step:4/50 train_loss:7.3743 train_time:10207ms step_avg:2551.71ms
-step:5/50 train_loss:7.1786 train_time:12769ms step_avg:2553.85ms
-step:6/50 train_loss:7.0957 train_time:15332ms step_avg:2555.30ms
-step:7/50 train_loss:7.1042 train_time:17895ms step_avg:2556.47ms
-step:8/50 train_loss:6.9979 train_time:20459ms step_avg:2557.34ms
-step:9/50 train_loss:6.6202 train_time:23024ms step_avg:2558.19ms
-step:10/50 train_loss:6.2350 train_time:25588ms step_avg:2558.77ms
-step:20/50 train_loss:4.9204 train_time:51213ms step_avg:2560.66ms
-step:25/50 val_loss:4.4026 val_bpb:2.6074 train_time:64061ms step_avg:2562.45ms h_norms=['15315.5', '18476.1', '22903.1', '28959.2', '36915.4', '10165.2', '14051.6', '19152.7', '25903.6', '34591.3', '10305.9', '14457.4', '19935.1', '27281.8', '36795.6'] growth=['1.162', '1.206', '1.240', '1.264', '1.275', '1.433', '1.382', '1.363', '1.352', '1.335', '1.453', '1.403', '1.379', '1.369', '1.349']
-step:30/50 train_loss:4.2536 train_time:76845ms step_avg:2561.51ms
-step:40/50 train_loss:3.9338 train_time:102476ms step_avg:2561.91ms
-step:50/50 train_loss:3.7603 train_time:128239ms step_avg:2564.77ms
-step:50/50 val_loss:3.7159 val_bpb:2.2007 train_time:128273ms step_avg:2565.45ms h_norms=['25099.4', '31959.1', '42336.1', '57728.8', '79426.7', '13463.9', '21452.0', '32001.9', '46436.8', '66208.5', '13616.6', '21790.7', '32681.5', '47611.1', '68100.6'] growth=['1.181', '1.273', '1.325', '1.364', '1.376', '1.898', '1.593', '1.492', '1.451', '1.426', '1.919', '1.600', '1.500', '1.457', '1.430']
-peak memory allocated: 54994 MiB reserved: 56152 MiB
-ema:applying EMA weights
-DIAGNOSTIC post_ema val_loss:5.9270 val_bpb:3.5103 eval_time:70209ms
-Serialized model: 106023671 bytes
-Code size: 98507 bytes
-Serialized model int6+lzma: 4807120 bytes
-Total submission size int6+lzma: 4905627 bytes
-final_int6_roundtrip val_loss:6.1461 val_bpb:3.6401 eval_time:69816ms
-final_int6_roundtrip_exact val_loss:6.14613800 val_bpb:3.64008912
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_111419-rr366tug/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_111419-rr366tug/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_111419-rr366tug/files/requirements.txt
+++ /dev/null
@@ -1,101 +0,0 @@
[101 deleted lines omitted; identical to the first requirements.txt hunk above (index e3d59eea39)]
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_112256-w5b84094/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_112256-w5b84094/files/output.log
deleted file mode 100644
index 4939084d7f..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_112256-w5b84094/files/output.log
+++ /dev/null
@@ -1,32 +0,0 @@
-wandb:initialized
-warmup_step:1/5
-warmup_step:2/5
-warmup_step:3/5
-warmup_step:4/5
-warmup_step:5/5
-step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms h_norms=['10017.3', '11135.8', '12344.1', '13648.4', '14930.2', '8131.2', '9221.3', '10379.9', '11610.4', '12803.6', '8145.3', '9247.4', '10417.4', '11656.1', '12858.2'] growth=['1.114', '1.112', '1.109', '1.106', '1.094', '1.146', '1.134', '1.126', '1.119', '1.103', '1.148', '1.135', '1.127', '1.119', '1.103']
-step:1/50 train_loss:6.9310 train_time:2562ms step_avg:2562.16ms
-step:2/50 train_loss:8.4366 train_time:5105ms step_avg:2552.70ms
-step:3/50 train_loss:7.5697 train_time:7680ms step_avg:2560.17ms
-step:4/50 train_loss:7.3744 train_time:10255ms step_avg:2563.83ms
-step:5/50 train_loss:7.1786 train_time:12830ms step_avg:2566.10ms
-step:6/50 train_loss:7.0956 train_time:15406ms step_avg:2567.62ms
-step:7/50 train_loss:7.1032 train_time:17983ms step_avg:2569.03ms
-step:8/50 train_loss:6.9995 train_time:20559ms step_avg:2569.91ms
-step:9/50 train_loss:6.6208 train_time:23136ms step_avg:2570.65ms
-step:10/50 train_loss:6.2351 train_time:25713ms step_avg:2571.25ms
-step:20/50 train_loss:4.9225 train_time:51465ms step_avg:2573.26ms
-step:25/50 val_loss:4.4045 val_bpb:2.6086 train_time:64376ms step_avg:2575.05ms h_norms=['15389.1', '18568.1', '22944.3', '28900.3', '36680.3', '10100.2', '13942.7', '18871.1', '25382.8', '33689.8', '10227.6', '14320.0', '19633.2', '26714.6', '35810.5'] growth=['1.162', '1.207', '1.236', '1.260', '1.269', '1.424', '1.380', '1.353', '1.345', '1.327', '1.442', '1.400', '1.371', '1.361', '1.340']
-step:30/50 train_loss:4.2545 train_time:77216ms step_avg:2573.88ms
-step:40/50 train_loss:3.9681 train_time:102972ms step_avg:2574.30ms
-step:50/50 train_loss:3.7646 train_time:128885ms step_avg:2577.70ms
-step:50/50 val_loss:3.7386 val_bpb:2.2142 train_time:128919ms step_avg:2578.39ms h_norms=['25102.1', '31613.3', '41339.7', '55618.3', '75871.6', '12896.3', '20216.5', '29704.2', '42853.9', '60665.7', '13066.2', '20603.3', '30423.4', '44103.2', '62670.9'] growth=['1.172', '1.259', '1.308', '1.345', '1.364', '1.818', '1.568', '1.469', '1.443', '1.416', '1.842', '1.577', '1.477', '1.450', '1.421']
-peak memory allocated: 55186 MiB reserved: 56536 MiB
-ema:applying EMA weights
-DIAGNOSTIC post_ema val_loss:5.9267 val_bpb:3.5101 eval_time:70241ms
-Serialized model: 106023671 bytes
-Code size: 98507 bytes
-Serialized model int6+lzma: 4808880 bytes
-Total submission size int6+lzma: 4907387 bytes
-final_int6_roundtrip val_loss:6.1461 val_bpb:3.6400 eval_time:69823ms
-final_int6_roundtrip_exact val_loss:6.14605869 val_bpb:3.64004215
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_112256-w5b84094/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_112256-w5b84094/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_112256-w5b84094/files/requirements.txt
+++ /dev/null
@@ -1,101 +0,0 @@
[101 deleted lines omitted; identical to the first requirements.txt hunk above (index e3d59eea39)]
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_113135-l86ibk0l/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_113135-l86ibk0l/files/output.log
deleted file mode 100644
index 68160f35fb..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_113135-l86ibk0l/files/output.log
+++ /dev/null
@@ -1,32 +0,0 @@
-wandb:initialized
-warmup_step:1/5
-warmup_step:2/5
-warmup_step:3/5
-warmup_step:4/5
-warmup_step:5/5
-step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms h_norms=['10017.8', '11136.4', '12344.8', '13649.3', '14931.1', '8131.1', '9221.2', '10380.0', '11610.7', '12803.9', '8145.5', '9247.7', '10417.7', '11656.5', '12858.6'] growth=['1.114', '1.112', '1.109', '1.106', '1.094', '1.146', '1.134', '1.126', '1.119', '1.103', '1.148', '1.135', '1.127', '1.119', '1.103']
-step:1/50 train_loss:6.9310 train_time:2558ms step_avg:2558.48ms
-step:2/50 train_loss:8.4366 train_time:5098ms step_avg:2548.90ms
-step:3/50 train_loss:7.5697 train_time:7670ms step_avg:2556.78ms
-step:4/50 train_loss:7.3743 train_time:10241ms step_avg:2560.27ms
-step:5/50 train_loss:7.1786 train_time:12811ms step_avg:2562.30ms
-step:6/50 train_loss:7.0957 train_time:15383ms step_avg:2563.78ms
-step:7/50 train_loss:7.1044 train_time:17953ms step_avg:2564.76ms
-step:8/50 train_loss:6.9980 train_time:20525ms step_avg:2565.64ms
-step:9/50 train_loss:6.6201 train_time:23097ms step_avg:2566.38ms
-step:10/50 train_loss:6.2350 train_time:25670ms step_avg:2567.05ms
-step:20/50 train_loss:4.9193 train_time:51381ms step_avg:2569.05ms
-step:25/50 val_loss:4.4049 val_bpb:2.6088 train_time:64272ms step_avg:2570.88ms h_norms=['15720.0', '18809.2', '23004.6', '28594.8', '35769.8', '9791.7', '13137.0', '17413.1', '22947.7', '29845.4', '9846.4', '13323.4', '17863.5', '23724.8', '31147.9'] growth=['1.160', '1.197', '1.223', '1.243', '1.251', '1.380', '1.342', '1.325', '1.318', '1.301', '1.388', '1.353', '1.341', '1.328', '1.313']
-step:30/50 train_loss:4.2516 train_time:77099ms step_avg:2569.96ms
-step:40/50 train_loss:3.9318 train_time:102821ms step_avg:2570.53ms
-step:50/50 train_loss:3.7709 train_time:128698ms step_avg:2573.96ms
-step:50/50 val_loss:3.7547 val_bpb:2.2238 train_time:128732ms step_avg:2574.64ms h_norms=['26472.4', '32406.0', '41016.4', '53232.8', '69910.2', '11501.1', '17001.4', '24006.2', '33299.6', '45175.2', '11702.2', '17433.6', '24741.2', '34474.7', '46973.8'] growth=['1.156', '1.224', '1.266', '1.298', '1.313', '1.621', '1.478', '1.412', '1.387', '1.357', '1.649', '1.490', '1.419', '1.393', '1.363']
-peak memory allocated: 55186 MiB reserved: 56536 MiB
-ema:applying EMA weights
-DIAGNOSTIC post_ema val_loss:5.9290 val_bpb:3.5115 eval_time:70088ms
-Serialized model: 106023671 bytes
-Code size: 98507 bytes
-Serialized model int6+lzma: 4808852 bytes
-Total submission size int6+lzma: 4907359 bytes
-final_int6_roundtrip val_loss:6.1489 val_bpb:3.6417 eval_time:69675ms
-final_int6_roundtrip_exact val_loss:6.14887885 val_bpb:3.64171241
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_113135-l86ibk0l/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_113135-l86ibk0l/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_113135-l86ibk0l/files/requirements.txt
+++ /dev/null
@@ -1,101 +0,0 @@
[101 deleted lines omitted; identical to the first requirements.txt hunk above (index e3d59eea39)]
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114011-n43r2rb3/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114011-n43r2rb3/files/output.log
deleted file mode 100644
index 3f49ae6f56..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114011-n43r2rb3/files/output.log
+++ /dev/null
@@ -1,32 +0,0 @@
-wandb:initialized
-warmup_step:1/5
-warmup_step:2/5
-warmup_step:3/5
-warmup_step:4/5
-warmup_step:5/5
-step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['10016.7', '11135.2', '12343.3', '13647.7', '14929.4', '8130.4', '9220.7', '10379.4', '11609.9', '12803.1', '8144.8', '9247.1', '10417.0', '11655.6', '12857.6'] growth=['1.114', '1.112', '1.108', '1.106', '1.094', '1.146', '1.134', '1.126', '1.119', '1.103', '1.148', '1.135', '1.127', '1.119', '1.103']
-step:1/50 train_loss:6.9310 train_time:2558ms step_avg:2558.36ms
-step:2/50 train_loss:8.4366 train_time:5098ms step_avg:2548.82ms
-step:3/50 train_loss:7.5697 train_time:7669ms step_avg:2556.49ms
-step:4/50 train_loss:7.3744 train_time:10241ms step_avg:2560.13ms
-step:5/50 train_loss:7.1785 train_time:12812ms step_avg:2562.33ms
-step:6/50 train_loss:7.0958 train_time:15382ms step_avg:2563.72ms
-step:7/50 train_loss:7.1043 train_time:17954ms step_avg:2564.81ms
-step:8/50 train_loss:6.9978 train_time:20526ms step_avg:2565.71ms
-step:9/50 train_loss:6.6203 train_time:23098ms step_avg:2566.44ms
-step:10/50 train_loss:6.2351 train_time:25670ms step_avg:2566.98ms
-step:20/50 train_loss:4.9173 train_time:51378ms step_avg:2568.92ms
-step:25/50 val_loss:4.4081 val_bpb:2.6107 train_time:64271ms step_avg:2570.86ms h_norms=['15784.7', '18418.3', '21808.6', '26014.4', '31082.9', '9350.4', '11977.4', '15124.3', '18884.1', '23160.1', '9355.0', '12040.7', '15261.1', '19113.8', '23523.2'] growth=['1.142', '1.167', '1.184', '1.193', '1.195', '1.318', '1.281', '1.263', '1.249', '1.226', '1.319', '1.287', '1.267', '1.252', '1.231']
-step:30/50 train_loss:4.2496 train_time:77100ms step_avg:2569.99ms
-step:40/50 train_loss:3.9324 train_time:102820ms step_avg:2570.49ms
-step:50/50 train_loss:3.7771 train_time:128694ms step_avg:2573.88ms
-step:50/50 val_loss:3.7294 val_bpb:2.2088 train_time:128728ms step_avg:2574.56ms h_norms=['25140.9', '28962.0', '34040.5', '40736.8', '49246.7', '9966.3', '13653.2', '17956.7', '23197.5', '28845.5', '9949.3', '13730.0', '18206.3', '23701.7', '29420.4'] growth=['1.103', '1.152', '1.175', '1.197', '1.209', '1.405', '1.370', '1.315', '1.292', '1.243', '1.402', '1.380', '1.326', '1.302', '1.241']
-peak memory allocated: 55186 MiB reserved: 56536 MiB
-ema:applying EMA weights
-DIAGNOSTIC post_ema val_loss:5.9285 val_bpb:3.5112 eval_time:70073ms
-Serialized model: 106023671 bytes
-Code size: 98507 bytes
-Serialized model int6+lzma: 4808068 bytes
-Total submission size int6+lzma: 4906575 bytes
-final_int6_roundtrip val_loss:6.1465 val_bpb:3.6403 eval_time:69664ms
-final_int6_roundtrip_exact val_loss:6.14650992 val_bpb:3.64030939
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114011-n43r2rb3/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114011-n43r2rb3/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114011-n43r2rb3/files/requirements.txt
+++ /dev/null
@@ -1,101 +0,0 @@
[101 deleted lines omitted; identical to the first requirements.txt hunk above (index e3d59eea39)]
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114848-v63k38ck/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114848-v63k38ck/files/output.log
deleted file mode 100644
index 05fcf59779..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114848-v63k38ck/files/output.log
+++ /dev/null
@@ -1,22 +0,0 @@
-wandb:initialized
-warmup_step:1/5
-warmup_step:2/5
-warmup_step:3/5
-warmup_step:4/5
-warmup_step:5/5
-step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['9829.5', '10906.5', '12074.5', '13339.5', '14586.6', '8110.7', '9182.3', '10321.5', '11532.3', '12712.6', '8125.6', '9210.3', '10362.9', '11582.1', '12773.0', '8128.7', '9216.6', '10372.5', '11592.1', '12787.0'] growth=['1.111', '1.110', '1.107', '1.105', '1.093', '1.143', '1.132', '1.124', '1.117', '1.102', '1.145', '1.133', '1.125', '1.118', '1.103', '1.146', '1.134', '1.125', '1.118', '1.103']
-step:1/50 train_loss:6.9310 train_time:3144ms step_avg:3143.65ms
-step:2/50 train_loss:8.3561 train_time:6272ms step_avg:3135.84ms
-step:3/50 train_loss:7.5194 train_time:9433ms step_avg:3144.22ms
-step:4/50 train_loss:7.4222 train_time:12595ms step_avg:3148.75ms
-step:5/50 train_loss:7.2305 train_time:15755ms step_avg:3151.08ms
-step:6/50 train_loss:7.1013 train_time:18919ms step_avg:3153.09ms
-step:7/50 train_loss:7.0637 train_time:22082ms step_avg:3154.61ms
-step:8/50 train_loss:6.9894 train_time:25245ms step_avg:3155.59ms
-step:9/50 train_loss:6.6143 train_time:28406ms step_avg:3156.21ms
-step:10/50 train_loss:6.2371 train_time:31569ms step_avg:3156.90ms
-step:20/50 train_loss:4.8567 train_time:63180ms step_avg:3158.98ms
-step:25/50 val_loss:4.3859 val_bpb:2.5976 train_time:89760ms step_avg:3590.41ms h_norms=['15124.4', '18985.0', '24623.8', '32563.5', '43308.4', '10881.2', '15824.5', '22366.4', '31313.8', '43085.3', '10968.9', '16075.0', '22846.2', '32046.9', '44227.1', '10951.1', '16019.2', '22824.3', '32056.4', '44286.6'] growth=['1.193', '1.255', '1.297', '1.322', '1.330', '1.534', '1.454', '1.413', '1.400', '1.376', '1.546', '1.466', '1.421', '1.403', '1.380', '1.544', '1.463', '1.425', '1.404', '1.382']
-step:30/50 train_loss:4.2191 train_time:125306ms step_avg:4176.87ms
-step:40/50 train_loss:3.9125 train_time:196653ms step_avg:4916.32ms
-step:50/50 train_loss:3.7387 train_time:273924ms step_avg:5478.49ms
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114848-v63k38ck/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114848-v63k38ck/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_114848-v63k38ck/files/requirements.txt
+++ /dev/null
@@ -1,101 +0,0 @@
[101 deleted lines omitted; identical to the first requirements.txt hunk above (index e3d59eea39)]
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115141-bq7ercgn/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115141-bq7ercgn/files/output.log
deleted file mode 100644
index 8d2e7e8ff8..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115141-bq7ercgn/files/output.log
+++ /dev/null
@@ -1,18 +0,0 @@
-wandb:initialized
-warmup_step:1/5
-warmup_step:2/5
-warmup_step:3/5
-warmup_step:4/5
-warmup_step:5/5
-step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['10161.7', '11313.4', '12557.3', '13898.0', '15219.9', '16827.9', '18528.4', '20318.6', '22230.4', '24153.9', '22046.5', '24114.3', '26314.0', '28667.9', '31058.9'] growth=['1.116', '1.113', '1.110', '1.107', '1.095', '1.106', '1.101', '1.097', '1.094', '1.087', '1.098', '1.094', '1.091', '1.089', '1.083']
-step:1/50 train_loss:6.9310 train_time:6304ms step_avg:6303.72ms
-step:2/50 train_loss:8.4504 train_time:12557ms step_avg:6278.34ms
-step:3/50 train_loss:7.5634 train_time:18879ms step_avg:6293.09ms
-step:4/50 train_loss:7.3652 train_time:25104ms step_avg:6276.05ms
-step:5/50 train_loss:7.1863 train_time:30783ms step_avg:6156.62ms
-step:6/50 train_loss:7.1201 train_time:36471ms step_avg:6078.55ms
-step:7/50 train_loss:7.1222 train_time:42201ms step_avg:6028.70ms
-step:8/50 train_loss:7.0087 train_time:47839ms step_avg:5979.83ms
-step:9/50 train_loss:6.6202 train_time:53571ms step_avg:5952.33ms
-step:10/50 train_loss:6.2665 train_time:59252ms step_avg:5925.21ms
-step:20/50 train_loss:5.1492 train_time:116187ms step_avg:5809.33ms
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115141-bq7ercgn/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115141-bq7ercgn/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115141-bq7ercgn/files/requirements.txt
+++ /dev/null
@@ -1,101 +0,0 @@
[101 deleted lines omitted; identical to the first requirements.txt hunk above (index e3d59eea39)]
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115741-648c04nz/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115741-648c04nz/files/output.log
deleted file mode 100644
index 274a9a3c1c..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115741-648c04nz/files/output.log
+++ /dev/null
@@ -1,50 +0,0 @@
-wandb:initialized
-Traceback (most recent call last):
- File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2081, in
- main()
- File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1757, in main
- warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer)
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__
- return super().__call__(*args, **kwargs)
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
- return self._call_impl(*args, **kwargs)
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
- return forward_call(*args, **kwargs)
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper
- return fn(*args, **kwargs)
- ^^^^^^^^^^^^^^^^^^^
- File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
- return self._call_impl(*args, **kwargs)
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
- return forward_call(*args, **kwargs)
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1026, in forward
- x, h_core_in, h_core_out =
self._forward_hidden(input_ids, feedback_fn, stabilizer) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1002, in _forward_hidden - x, raw_v = self.blocks[j](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl - return self._call_impl(*args, **kwargs) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl - return forward_call(*args, **kwargs) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 781, in forward - x_out = x_out + self.mlp_scale.to(dtype=x_out.dtype)[None, None, :] * self.mlp(self.mlp_norm(x_out) * self.ln_scale_factor, up_w, down_w) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl - return self._call_impl(*args, **kwargs) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl - return forward_call(*args, **kwargs) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 742, in forward - x = F.leaky_relu(F.linear(x, up_w.to(x.dtype)), negative_slope=0.5) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 217.94 MiB is free. Process 241184 has 67.76 GiB memory in use. Process 242252 has 55.38 GiB memory in use. Including non-PyTorch memory, this process has 16.43 GiB memory in use. Of the allocated memory 15.67 GiB is allocated by PyTorch, and 99.72 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf)
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115741-648c04nz/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115741-648c04nz/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115741-648c04nz/files/requirements.txt
+++ /dev/null
@@ -1,101 +0,0 @@
[101 deleted lines omitted; identical to the first requirements.txt hunk above (index e3d59eea39)]
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115749-qjzpp5d7/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115749-qjzpp5d7/files/output.log
deleted file mode 100644
index 274a9a3c1c..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115749-qjzpp5d7/files/output.log
+++ /dev/null
@@ -1,50 +0,0 @@
[50 deleted lines omitted; identical OOM traceback to the run-20260326_115741-648c04nz output.log hunk above (index 274a9a3c1c)]
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115749-qjzpp5d7/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115749-qjzpp5d7/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115749-qjzpp5d7/files/requirements.txt
+++ /dev/null
@@ -1,101 +0,0 @@
[101 deleted lines omitted; identical to the first requirements.txt hunk above (index e3d59eea39)]
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115924-xtlv4t52/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115924-xtlv4t52/files/output.log
deleted file mode 100644
index 5e93f4daea..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115924-xtlv4t52/files/output.log
+++ /dev/null
@@ -1,32 +0,0 @@
-wandb:initialized
-warmup_step:1/5
-warmup_step:2/5
-warmup_step:3/5 -warmup_step:4/5 -warmup_step:5/5 -step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms h_norms=['10160.2', '11311.6', '12555.0', '13896.3', '15332.5', '16950.8', '18657.9', '20459.0', '22380.8', '24439.6', '22273.3', '24354.9', '26573.0', '28946.4', '31482.6'] growth=['1.116', '1.113', '1.110', '1.107', '1.103', '1.106', '1.101', '1.097', '1.094', '1.092', '1.098', '1.093', '1.091', '1.089', '1.088'] -step:1/50 train_loss:6.9310 train_time:2457ms step_avg:2456.70ms -step:2/50 train_loss:8.4480 train_time:4895ms step_avg:2447.55ms -step:3/50 train_loss:7.5656 train_time:7366ms step_avg:2455.22ms -step:4/50 train_loss:7.3715 train_time:9835ms step_avg:2458.84ms -step:5/50 train_loss:7.1882 train_time:12305ms step_avg:2460.94ms -step:6/50 train_loss:7.1200 train_time:14774ms step_avg:2462.35ms -step:7/50 train_loss:7.1275 train_time:17244ms step_avg:2463.46ms -step:8/50 train_loss:7.0234 train_time:19715ms step_avg:2464.42ms -step:9/50 train_loss:6.6287 train_time:22185ms step_avg:2465.05ms -step:10/50 train_loss:6.2775 train_time:24656ms step_avg:2465.57ms -step:20/50 train_loss:5.2073 train_time:49354ms step_avg:2467.72ms -step:25/50 val_loss:4.6001 val_bpb:2.7244 train_time:61743ms step_avg:2469.70ms h_norms=['18012.8', '20576.1', '23734.4', '27658.0', '32476.6', '39019.5', '47115.3', '57297.5', '70090.2', '85988.1', '66911.8', '82425.7', '101585.5', '125680.3', '156003.0'] growth=['1.133', '1.142', '1.153', '1.165', '1.174', '1.201', '1.207', '1.216', '1.223', '1.227', '1.228', '1.232', '1.232', '1.237', '1.241'] -step:30/50 train_loss:4.3938 train_time:74062ms step_avg:2468.73ms -step:40/50 train_loss:4.0561 train_time:98772ms step_avg:2469.31ms -step:50/50 train_loss:3.8233 train_time:123613ms step_avg:2472.25ms -step:50/50 val_loss:3.7814 val_bpb:2.2396 train_time:123647ms step_avg:2472.93ms h_norms=['31577.0', '34240.9', '37755.1', '42362.9', '48395.2', '56432.3', '66325.1', '79064.0', '95012.0', '114485.4', '84419.3', '101309.9', '122596.9', '148869.6', '181094.1'] growth=['1.068', '1.084', '1.103', '1.122', '1.142', '1.166', '1.175', '1.192', '1.202', '1.205', '1.193', '1.200', '1.210', '1.214', '1.216'] -peak memory allocated: 54207 MiB reserved: 55384 MiB -ema:applying EMA weights -DIAGNOSTIC post_ema val_loss:5.9416 val_bpb:3.5190 eval_time:67959ms -Serialized model: 106023671 bytes -Code size: 98931 bytes -Serialized model int6+lzma: 4809652 bytes -Total submission size int6+lzma: 4908583 bytes -final_int6_roundtrip val_loss:6.1789 val_bpb:3.6595 eval_time:67564ms -final_int6_roundtrip_exact val_loss:6.17886280 val_bpb:3.65947059 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115924-xtlv4t52/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115924-xtlv4t52/files/requirements.txt deleted file mode 100644 index e3d59eea39..0000000000 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_115924-xtlv4t52/files/requirements.txt +++ /dev/null @@ -1,101 +0,0 @@ -mdurl==0.1.2 -nvidia-cudnn-cu13==9.19.0.56 -aiohttp==3.13.3 -nvidia-cufile==1.15.1.6 -charset-normalizer==3.4.6 -Jinja2==3.1.6 -hf-xet==1.4.2 -nvidia-cuda-nvrtc-cu12==12.8.93 -typer==0.24.1 -attrs==26.1.0 -certifi==2026.2.25 -triton==3.6.0 -nvidia-nccl-cu12==2.29.7 -wheel==0.46.3 -nvidia-nvtx-cu12==12.8.90 -gitdb==4.0.12 -dill==0.4.1 -nvidia-cuda-cupti==13.0.85 -tqdm==4.67.3 -pandas==3.0.1 -PyYAML==6.0.3 -annotated-types==0.7.0 -annotated-doc==0.0.4 
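The OOM crash that opens this section ends with PyTorch's own hint to set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`. A minimal sketch of honoring that hint, assuming it is done at the top of the entry script: the variable is read when the CUDA caching allocator first initializes, so it must be set before the first GPU allocation.

```python
# Hedged sketch: set the allocator config before torch ever touches the GPU.
import os

os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # noqa: E402  (imported after the env var on purpose)

if torch.cuda.is_available():
    _ = torch.zeros(1, device="cuda")  # first allocation now uses expandable segments
```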
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_120745-6rfmco93/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_120745-6rfmco93/files/output.log
deleted file mode 100644
index 4f5b257c05..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_120745-6rfmco93/files/output.log
+++ /dev/null
@@ -1,32 +0,0 @@
-wandb:initialized
-warmup_step:1/5
-warmup_step:2/5
-warmup_step:3/5
-warmup_step:4/5
-warmup_step:5/5
-step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['10143.6', '11247.2', '12436.8', '13721.2', '15093.2', '16638.6', '18260.9', '19974.1', '21802.2', '23760.8', '21655.4', '23628.3', '25729.0', '27978.2', '30382.6'] growth=['1.110', '1.109', '1.106', '1.103', '1.100', '1.102', '1.098', '1.094', '1.092', '1.090', '1.095', '1.091', '1.089', '1.087', '1.086']
-step:1/50 train_loss:6.9310 train_time:2474ms step_avg:2473.98ms
-step:2/50 train_loss:8.4480 train_time:4928ms step_avg:2463.85ms
-step:3/50 train_loss:7.5657 train_time:7414ms step_avg:2471.28ms
-step:4/50 train_loss:7.4125 train_time:9901ms step_avg:2475.24ms
-step:5/50 train_loss:7.2581 train_time:12387ms step_avg:2477.37ms
-step:6/50 train_loss:7.1563 train_time:14873ms step_avg:2478.80ms
-step:7/50 train_loss:7.1205 train_time:17358ms step_avg:2479.79ms
-step:8/50 train_loss:7.0021 train_time:19845ms step_avg:2480.59ms
-step:9/50 train_loss:6.6191 train_time:22332ms step_avg:2481.32ms
-step:10/50 train_loss:6.2241 train_time:24818ms step_avg:2481.82ms
-step:20/50 train_loss:4.8854 train_time:49674ms step_avg:2483.72ms
-step:25/50 val_loss:4.4102 val_bpb:2.6119 train_time:62144ms step_avg:2485.74ms h_norms=['12925.3', '12168.2', '11607.0', '11186.7', '10890.7', '10691.3', '10518.0', '10429.2', '10384.9', '10395.0', '10468.4', '10350.9', '10317.8', '10323.0', '10377.5'] growth=['0.930', '0.941', '0.954', '0.964', '0.974', '0.982', '0.984', '0.992', '0.996', '1.001', '0.987', '0.989', '0.997', '1.001', '1.005']
-step:30/50 train_loss:4.2124 train_time:74549ms step_avg:2484.96ms
-step:40/50 train_loss:3.9336 train_time:99426ms step_avg:2485.66ms
-step:50/50 train_loss:3.7638 train_time:124432ms step_avg:2488.64ms
-step:50/50 val_loss:3.7456 val_bpb:2.2184 train_time:124466ms step_avg:2489.33ms h_norms=['20394.8', '18235.1', '16671.4', '15574.6', '14825.8', '14555.8', '14297.6', '14121.5', '14031.0', '13984.3', '14335.7', '14174.8', '14069.5', '14035.7', '14026.9'] growth=['0.871', '0.894', '0.914', '0.934', '0.952', '0.982', '0.982', '0.988', '0.994', '0.997', '0.991', '0.989', '0.993', '0.998', '0.999']
-peak memory allocated: 54399 MiB reserved: 55768 MiB
-ema:applying EMA weights
-DIAGNOSTIC post_ema val_loss:5.9394 val_bpb:3.5176 eval_time:68125ms
-Serialized model: 106023671 bytes
-Code size: 98931 bytes
-Serialized model int6+lzma: 4804840 bytes
-Total submission size int6+lzma: 4903771 bytes
-final_int6_roundtrip val_loss:6.1350 val_bpb:3.6335 eval_time:67734ms
-final_int6_roundtrip_exact val_loss:6.13503683 val_bpb:3.63351438
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_120745-6rfmco93/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_120745-6rfmco93/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_120745-6rfmco93/files/requirements.txt
+++ /dev/null
@@ -1,101 +0,0 @@
[… 101 deleted lines elided: identical requirements.txt (blob e3d59eea39) …]
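The `h_norms`/`growth` arrays in the logs above track the hidden-state norm after every recurrence pass, plus the ratio between consecutive passes: values drifting well above 1.0 (as in the first run) mark an expanding loop, while values settling near 1.0 (as in this run) mark a roughly norm-preserving core. A hedged sketch of such a diagnostic; the function name and the convention that `states[0]` is the pre-core state are assumptions, not the actual hook in `train_gpt_recurrent.py`:

```python
import torch

@torch.no_grad()
def recurrence_norm_diagnostics(states: list[torch.Tensor]) -> tuple[list[str], list[str]]:
    """states[0] is the hidden state entering the shared core; states[1:] are
    the states after each pass. Returns (h_norms, growth) in the log's format,
    where growth[i] = ||states[i+1]|| / ||states[i]||, so both lists line up
    with the per-pass entries printed above."""
    norms = [s.float().norm().item() for s in states]
    h_norms = [f"{n:.1f}" for n in norms[1:]]
    growth = [f"{norms[i + 1] / norms[i]:.3f}" for i in range(len(norms) - 1)]
    return h_norms, growth
```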
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_123015-qgrbnv6t/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_123015-qgrbnv6t/files/output.log
deleted file mode 100644
index c5cb39b734..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_123015-qgrbnv6t/files/output.log
+++ /dev/null
@@ -1,32 +0,0 @@
-wandb:initialized
-warmup_step:1/5
-warmup_step:2/5
-warmup_step:3/5
-warmup_step:4/5
-warmup_step:5/5
-step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms h_norms=['9918.1', '10850.5', '11857.1', '12933.0', '14082.2', '15372.6', '16714.7', '18128.7', '19628.1', '21227.1', '19383.5', '20984.5', '22685.3', '24491.8', '26411.9', '24206.9', '26133.0', '28177.6', '30331.6', '32611.0'] growth=['1.094', '1.094', '1.093', '1.091', '1.089', '1.092', '1.087', '1.085', '1.083', '1.081', '1.086', '1.083', '1.081', '1.080', '1.078', '1.082', '1.080', '1.078', '1.076', '1.075']
-step:1/50 train_loss:6.9310 train_time:3032ms step_avg:3031.88ms
-step:2/50 train_loss:8.3473 train_time:6048ms step_avg:3024.07ms
-step:3/50 train_loss:7.5117 train_time:9096ms step_avg:3032.08ms
-step:4/50 train_loss:7.5611 train_time:12145ms step_avg:3036.23ms
-step:5/50 train_loss:7.3188 train_time:15194ms step_avg:3038.75ms
-step:6/50 train_loss:7.0774 train_time:18243ms step_avg:3040.49ms
-step:7/50 train_loss:6.9519 train_time:21292ms step_avg:3041.67ms
-step:8/50 train_loss:6.9005 train_time:24342ms step_avg:3042.69ms
-step:9/50 train_loss:6.5418 train_time:27391ms step_avg:3043.47ms
-step:10/50 train_loss:6.1552 train_time:30442ms step_avg:3044.24ms
-step:20/50 train_loss:4.8491 train_time:60936ms step_avg:3046.80ms
-step:25/50 val_loss:4.3716 val_bpb:2.5891 train_time:76222ms step_avg:3048.87ms h_norms=['12739.2', '11639.7', '10834.5', '10248.2', '9888.2', '9584.1', '9359.5', '9269.6', '9263.6', '9383.8', '9327.1', '9186.1', '9174.7', '9243.6', '9433.6', '9207.4', '9116.2', '9157.3', '9280.5', '9523.6'] growth=['0.899', '0.914', '0.931', '0.946', '0.965', '0.969', '0.977', '0.990', '0.999', '1.013', '0.979', '0.985', '0.999', '1.008', '1.021', '0.984', '0.990', '1.005', '1.013', '1.026']
-step:30/50 train_loss:4.2292 train_time:91444ms step_avg:3048.14ms
-step:40/50 train_loss:3.9319 train_time:121963ms step_avg:3049.08ms
-step:50/50 train_loss:3.7393 train_time:152613ms step_avg:3052.25ms
-step:50/50 val_loss:3.7133 val_bpb:2.1992 train_time:152647ms step_avg:3052.93ms h_norms=['19200.7', '16705.4', '14993.2', '13850.4', '13137.0', '13070.5', '12983.4', '12953.9', '12984.4', '13061.0', '13002.1', '13046.2', '13096.0', '13184.0', '13296.2', '13048.7', '13170.7', '13265.1', '13387.0', '13519.2'] growth=['0.844', '0.870', '0.898', '0.924', '0.948', '0.995', '0.993', '0.998', '1.002', '1.006', '1.011', '1.003', '1.004', '1.007', '1.009', '1.021', '1.009', '1.007', '1.009', '1.010']
-peak memory allocated: 66516 MiB reserved: 67876 MiB
-ema:applying EMA weights
-DIAGNOSTIC post_ema val_loss:5.9477 val_bpb:3.5225 eval_time:83302ms
-Serialized model: 106023671 bytes
-Code size: 99082 bytes
-Serialized model int6+lzma: 4803604 bytes
-Total submission size int6+lzma: 4902686 bytes
-final_int6_roundtrip val_loss:6.1439 val_bpb:3.6388 eval_time:82807ms
-final_int6_roundtrip_exact val_loss:6.14393907 val_bpb:3.63878679
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_123015-qgrbnv6t/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_123015-qgrbnv6t/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_123015-qgrbnv6t/files/requirements.txt
+++ /dev/null
@@ -1,101 +0,0 @@
[… 101 deleted lines elided: identical requirements.txt (blob e3d59eea39) …]
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_124119-cf7n2jes/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_124119-cf7n2jes/files/output.log
deleted file mode 100644
index ff4b052b54..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_124119-cf7n2jes/files/output.log
+++ /dev/null
@@ -1,32 +0,0 @@
-wandb:initialized
-warmup_step:1/5
-warmup_step:2/5
-warmup_step:3/5
-warmup_step:4/5
-warmup_step:5/5
-step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms h_norms=['9918.0', '10850.4', '11857.1', '12934.1', '14082.9', '15373.2', '16715.3', '18129.5', '19630.8', '21229.7', '19385.0', '20986.2', '22687.0', '24496.2', '26416.1', '24210.2', '26136.7', '28180.1', '30337.3', '32616.2'] growth=['1.094', '1.094', '1.093', '1.091', '1.089', '1.092', '1.087', '1.085', '1.083', '1.081', '1.086', '1.083', '1.081', '1.080', '1.078', '1.082', '1.080', '1.078', '1.077', '1.075']
-step:1/50 train_loss:6.9310 train_time:3036ms step_avg:3035.59ms
-step:2/50 train_loss:8.3473 train_time:6057ms step_avg:3028.64ms
-step:3/50 train_loss:7.5117 train_time:9110ms step_avg:3036.83ms
-step:4/50 train_loss:7.5612 train_time:12164ms step_avg:3040.98ms
-step:5/50 train_loss:7.3191 train_time:15217ms step_avg:3043.48ms
-step:6/50 train_loss:7.0778 train_time:18271ms step_avg:3045.21ms
-step:7/50 train_loss:6.9515 train_time:21325ms step_avg:3046.36ms
-step:8/50 train_loss:6.8989 train_time:24379ms step_avg:3047.33ms
-step:9/50 train_loss:6.5423 train_time:27433ms step_avg:3048.11ms
-step:10/50 train_loss:6.1552 train_time:30487ms step_avg:3048.71ms
-step:20/50 train_loss:4.8543 train_time:61023ms step_avg:3051.16ms
-step:25/50 val_loss:4.3745 val_bpb:2.5908 train_time:76333ms step_avg:3053.30ms h_norms=['12750.8', '11662.4', '10864.5', '10284.5', '9926.1', '9627.8', '9403.7', '9313.4', '9309.0', '9428.9', '9374.8', '9233.4', '9221.1', '9291.2', '9480.8', '9258.3', '9166.0', '9206.0', '9330.2', '9572.9'] growth=['0.900', '0.915', '0.932', '0.947', '0.965', '0.970', '0.977', '0.990', '1.000', '1.013', '0.979', '0.985', '0.999', '1.008', '1.020', '0.985', '0.990', '1.004', '1.013', '1.026']
-step:30/50 train_loss:4.2237 train_time:91579ms step_avg:3052.64ms
-step:40/50 train_loss:3.9233 train_time:122152ms step_avg:3053.79ms
-step:50/50 train_loss:3.7378 train_time:152866ms step_avg:3057.32ms
-step:50/50 val_loss:3.7019 val_bpb:2.1925 train_time:152900ms step_avg:3058.01ms h_norms=['18557.7', '16139.2', '14492.7', '13414.8', '12747.7', '12664.0', '12590.1', '12595.2', '12645.7', '12735.2', '12603.5', '12652.5', '12735.0', '12838.0', '12962.0', '12653.1', '12772.9', '12898.4', '13031.2', '13175.0'] growth=['0.841', '0.870', '0.898', '0.926', '0.950', '0.993', '0.994', '1.000', '1.004', '1.007', '1.009', '1.004', '1.007', '1.008', '1.010', '1.018', '1.009', '1.010', '1.010', '1.011']
-peak memory allocated: 66516 MiB reserved: 67876 MiB
-ema:applying EMA weights
-DIAGNOSTIC post_ema val_loss:5.9474 val_bpb:3.5224 eval_time:83475ms
-Serialized model: 106023671 bytes
-Code size: 99082 bytes
-Serialized model int6+lzma: 4801388 bytes
-Total submission size int6+lzma: 4900470 bytes
-final_int6_roundtrip val_loss:6.1436 val_bpb:3.6386 eval_time:82976ms
-final_int6_roundtrip_exact val_loss:6.14358204 val_bpb:3.63857534
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_124119-cf7n2jes/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_124119-cf7n2jes/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_124119-cf7n2jes/files/requirements.txt
+++ /dev/null
@@ -1,101 +0,0 @@
[… 101 deleted lines elided: identical requirements.txt (blob e3d59eea39) …]
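Each of these 50-step runs reports a "Serialized model int6+lzma" size near 4.8 MB against a 106 MB float serialization. The record's actual exporter is GPTQ-lite int6 + lzma; purely as a hedged sketch of the size accounting (per-tensor symmetric quantization to 6-bit levels, codes stored one per byte rather than bit-packed, then lzma), one could probe it like this:

```python
import io
import lzma

import torch

def int6_lzma_size(state_dict: dict[str, torch.Tensor]) -> int:
    """Rough size probe only: the real GPTQ-lite path quantizes with error
    feedback and packs 6-bit codes; this just rounds to levels in [-31, 31],
    stores them as int8, and measures the lzma-compressed payload."""
    payload = io.BytesIO()
    for _, w in state_dict.items():
        w = w.detach().cpu().float()
        scale = w.abs().max().clamp(min=1e-8) / 31.0
        q = torch.clamp(torch.round(w / scale), -31, 31).to(torch.int8)
        payload.write(q.numpy().tobytes())
    return len(lzma.compress(payload.getvalue(), preset=9))
```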
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/output.log
deleted file mode 100644
index f59955a5a8..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/output.log
+++ /dev/null
@@ -1,32 +0,0 @@
-wandb:initialized
-warmup_step:1/5
-warmup_step:2/5
-warmup_step:3/5
-warmup_step:4/5
-warmup_step:5/5
-step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms h_norms=['9917.9', '10757.9', '11684.5', '12697.7', '13792.7', '14968.7', '16232.0', '17584.0', '19026.6', '20567.1', '18830.4', '20360.3', '21989.6', '23732.8', '25596.8', '23508.1', '25360.6', '27334.0', '29434.0', '31674.1'] growth=['1.081', '1.085', '1.086', '1.087', '1.086', '1.085', '1.084', '1.083', '1.082', '1.081', '1.082', '1.081', '1.080', '1.079', '1.079', '1.079', '1.079', '1.078', '1.077', '1.076']
-step:1/50 train_loss:6.9310 train_time:3149ms step_avg:3148.80ms
-step:2/50 train_loss:8.4139 train_time:6265ms step_avg:3132.33ms
-step:3/50 train_loss:7.5940 train_time:9413ms step_avg:3137.77ms
-step:4/50 train_loss:7.3379 train_time:12563ms step_avg:3140.75ms
-step:5/50 train_loss:7.1846 train_time:15710ms step_avg:3142.03ms
-step:6/50 train_loss:7.1330 train_time:18856ms step_avg:3142.72ms
-step:7/50 train_loss:7.0579 train_time:22001ms step_avg:3142.94ms
-step:8/50 train_loss:6.8826 train_time:25148ms step_avg:3143.55ms
-step:9/50 train_loss:6.5374 train_time:28300ms step_avg:3144.44ms
-step:10/50 train_loss:6.1407 train_time:31445ms step_avg:3144.50ms
-step:20/50 train_loss:4.7836 train_time:62947ms step_avg:3147.35ms
-step:25/50 val_loss:4.3985 val_bpb:2.6051 train_time:78763ms step_avg:3150.51ms h_norms=['12132.8', '10932.9', '9995.3', '9323.7', '8882.5', '8488.8', '8191.2', '8002.8', '7969.0', '8058.7', '8126.2', '7929.0', '7834.6', '7885.5', '8052.5', '7921.9', '7788.9', '7759.4', '7870.9', '8096.2'] growth=['0.889', '0.901', '0.914', '0.933', '0.953', '0.956', '0.965', '0.977', '0.996', '1.011', '0.967', '0.976', '0.988', '1.006', '1.021', '0.974', '0.983', '0.996', '1.014', '1.029']
-step:30/50 train_loss:4.2134 train_time:94516ms step_avg:3150.53ms
-step:40/50 train_loss:3.9354 train_time:126096ms step_avg:3152.40ms
-step:50/50 train_loss:3.7653 train_time:157849ms step_avg:3156.98ms
-step:50/50 val_loss:3.7283 val_bpb:2.2081 train_time:157883ms step_avg:3157.66ms h_norms=['19485.3', '16686.2', '14563.5', '13010.1', '11889.3', '11357.6', '10866.1', '10420.7', '10135.7', '9955.0', '10905.5', '10559.7', '10219.8', '10021.9', '9904.3', '10582.1', '10345.6', '10085.3', '9954.8', '9885.5'] growth=['0.843', '0.856', '0.873', '0.893', '0.914', '0.955', '0.957', '0.959', '0.973', '0.982', '0.971', '0.968', '0.968', '0.981', '0.988', '0.985', '0.978', '0.975', '0.987', '0.993']
-peak memory allocated: 66515 MiB reserved: 67880 MiB
-ema:applying EMA weights
-DIAGNOSTIC post_ema val_loss:5.9490 val_bpb:3.5233 eval_time:83476ms
-Serialized model: 106023671 bytes
-Code size: 99082 bytes
-Serialized model int6+lzma: 4795396 bytes
-Total submission size int6+lzma: 4894478 bytes
-final_int6_roundtrip val_loss:6.1634 val_bpb:3.6503 eval_time:82960ms
-final_int6_roundtrip_exact val_loss:6.16344777 val_bpb:3.65034094
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_125242-meaoom9b/files/requirements.txt
+++ /dev/null
@@ -1,101 +0,0 @@
[… 101 deleted lines elided: identical requirements.txt (blob e3d59eea39) …]
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_130804-ce1th47g/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_130804-ce1th47g/files/output.log
deleted file mode 100644
index b598d1d44f..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_130804-ce1th47g/files/output.log
+++ /dev/null
@@ -1,32 +0,0 @@
-wandb:initialized
-warmup_step:1/5
-warmup_step:2/5
-warmup_step:3/5
-warmup_step:4/5
-warmup_step:5/5
-step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['9697.0', '10451.6', '11267.2', '12136.6', '13066.3', '14092.3', '15164.6', '16287.1', '17473.2', '18730.2', '17176.6', '18430.5', '19751.7', '21150.8', '22632.8', '20820.9', '22301.9', '23864.0', '25513.0', '27242.6', '25115.3', '26855.7', '28683.3', '30605.1', '32620.9'] growth=['1.076', '1.078', '1.078', '1.077', '1.077', '1.079', '1.076', '1.074', '1.073', '1.072', '1.075', '1.073', '1.072', '1.071', '1.070', '1.072', '1.071', '1.070', '1.069', '1.068', '1.070', '1.069', '1.068', '1.067', '1.066']
-step:1/50 train_loss:6.9310 train_time:3595ms step_avg:3594.71ms
-step:2/50 train_loss:8.2519 train_time:7178ms step_avg:3588.94ms
-step:3/50 train_loss:7.4903 train_time:10793ms step_avg:3597.54ms
-step:4/50 train_loss:7.6972 train_time:14409ms step_avg:3602.19ms
-step:5/50 train_loss:7.4012 train_time:18024ms step_avg:3604.78ms
-step:6/50 train_loss:7.0545 train_time:21640ms step_avg:3606.66ms
-step:7/50 train_loss:6.8316 train_time:25257ms step_avg:3608.12ms
-step:8/50 train_loss:6.7797 train_time:28873ms step_avg:3609.17ms
-step:9/50 train_loss:6.4829 train_time:32490ms step_avg:3610.01ms
-step:10/50 train_loss:6.1437 train_time:36108ms step_avg:3610.80ms
-step:20/50 train_loss:4.7380 train_time:72276ms step_avg:3613.81ms
-step:25/50 val_loss:4.3185 val_bpb:2.5577 train_time:90403ms step_avg:3616.13ms h_norms=['12234.1', '10923.7', '9965.4', '9287.7', '8892.1', '8588.1', '8375.3', '8299.3', '8339.8', '8524.9', '8358.5', '8246.4', '8260.4', '8384.4', '8644.6', '8290.1', '8234.7', '8305.1', '8486.1', '8800.2', '8326.9', '8299.6', '8403.4', '8622.6', '8975.6'] growth=['0.878', '0.893', '0.912', '0.932', '0.957', '0.966', '0.975', '0.991', '1.005', '1.022', '0.978', '0.987', '1.002', '1.015', '1.031', '0.985', '0.993', '1.009', '1.022', '1.037', '0.988', '0.997', '1.013', '1.026', '1.041']
-step:30/50 train_loss:4.1582 train_time:108463ms step_avg:3615.44ms
-step:40/50 train_loss:3.8947 train_time:144657ms step_avg:3616.43ms
-step:50/50 train_loss:3.7328 train_time:180990ms step_avg:3619.81ms
-step:50/50 val_loss:3.6922 val_bpb:2.1867 train_time:181025ms step_avg:3620.49ms h_norms=['17283.6', '14695.4', '12995.7', '11901.6', '11252.4', '11294.1', '11357.9', '11432.8', '11508.8', '11635.1', '11340.6', '11532.3', '11678.7', '11798.8', '11950.4', '11501.0', '11757.0', '11936.4', '12077.0', '12238.6', '11722.3', '12001.2', '12189.6', '12335.8', '12498.2'] growth=['0.819', '0.850', '0.884', '0.916', '0.945', '1.004', '1.006', '1.007', '1.007', '1.011', '1.023', '1.017', '1.013', '1.010', '1.013', '1.034', '1.022', '1.015', '1.012', '1.013', '1.038', '1.024', '1.016', '1.012', '1.013']
-peak memory allocated: 78629 MiB reserved: 79984 MiB
-ema:applying EMA weights
-DIAGNOSTIC post_ema val_loss:5.9839 val_bpb:3.5440 eval_time:98671ms
-Serialized model: 106023671 bytes
-Code size: 99082 bytes
-Serialized model int6+lzma: 4803484 bytes
-Total submission size int6+lzma: 4902566 bytes
-final_int6_roundtrip val_loss:6.1921 val_bpb:3.6673 eval_time:98102ms
-final_int6_roundtrip_exact val_loss:6.19214207 val_bpb:3.66733532
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_130804-ce1th47g/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_130804-ce1th47g/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_130804-ce1th47g/files/requirements.txt
+++ /dev/null
@@ -1,101 +0,0 @@
[… 101 deleted lines elided: identical requirements.txt (blob e3d59eea39) …]
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132011-5rznhtcu/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132011-5rznhtcu/files/output.log
deleted file mode 100644
index 42f9f123fe..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132011-5rznhtcu/files/output.log
+++ /dev/null
@@ -1,19 +0,0 @@
-wandb:initialized
-warmup_step:1/5
-warmup_step:2/5
-warmup_step:3/5
-warmup_step:4/5
-warmup_step:5/5
-step:0/50 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['9459.8', '10026.0', '10638.5', '11293.0', '11994.1', '12754.8', '13546.4', '14370.5', '15234.0', '16144.6', '14936.2', '15835.7', '16776.4', '17764.6', '18807.4', '17439.0', '18467.7', '19543.9', '20664.7', '21838.3', '20287.3', '21442.5', '22656.2', '23915.0', '25239.7', '23491.8', '24792.6', '26161.8', '27579.4', '29075.0'] growth=['1.057', '1.060', '1.061', '1.062', '1.062', '1.063', '1.062', '1.061', '1.060', '1.060', '1.061', '1.060', '1.059', '1.059', '1.059', '1.060', '1.059', '1.058', '1.057', '1.057', '1.058', '1.057', '1.057', '1.056', '1.055', '1.057', '1.055', '1.055', '1.054', '1.054']
-step:1/50 train_loss:6.9310 train_time:4157ms step_avg:4156.91ms
-step:2/50 train_loss:8.1611 train_time:8307ms step_avg:4153.58ms
-step:3/50 train_loss:7.5076 train_time:12490ms step_avg:4163.29ms
-step:4/50 train_loss:7.6959 train_time:16672ms step_avg:4167.94ms
-step:5/50 train_loss:7.4174 train_time:20854ms step_avg:4170.71ms
-step:6/50 train_loss:7.1131 train_time:25035ms step_avg:4172.58ms
-step:7/50 train_loss:6.9487 train_time:29219ms step_avg:4174.14ms
-step:8/50 train_loss:6.7735 train_time:33402ms step_avg:4175.31ms
-step:9/50 train_loss:6.4261 train_time:37586ms step_avg:4176.27ms
-step:10/50 train_loss:6.0743 train_time:41771ms step_avg:4177.07ms
-step:20/50 train_loss:4.7079 train_time:83751ms step_avg:4187.54ms
-step:25/50 val_loss:4.2787 val_bpb:2.5341 t
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132011-5rznhtcu/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132011-5rznhtcu/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132011-5rznhtcu/files/requirements.txt
+++ /dev/null
@@ -1,101 +0,0 @@
[… 101 deleted lines elided: identical requirements.txt (blob e3d59eea39) …]
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132634-x2sl50qr/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132634-x2sl50qr/files/output.log
deleted file mode 100644
index 86e820ede8..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132634-x2sl50qr/files/output.log
+++ /dev/null
@@ -1,35 +0,0 @@
-wandb:initialized
-warmup_step:1/20
-warmup_step:2/20
-warmup_step:3/20
-warmup_step:4/20
-warmup_step:5/20
-warmup_step:6/20
-warmup_step:7/20
-warmup_step:8/20
-warmup_step:9/20
-warmup_step:10/20
-warmup_step:11/20
-warmup_step:12/20
-warmup_step:13/20
-warmup_step:14/20
-warmup_step:15/20
-warmup_step:16/20
-warmup_step:17/20
-warmup_step:18/20
-warmup_step:19/20
-warmup_step:20/20
-step:0/20000 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms
-step:1/20000 train_loss:6.9310 train_time:1288ms step_avg:1287.95ms
-step:2/20000 train_loss:8.3536 train_time:2591ms step_avg:1295.41ms
-step:3/20000 train_loss:7.5089 train_time:3939ms step_avg:1313.01ms
-step:4/20000 train_loss:7.5822 train_time:5294ms step_avg:1323.40ms
-step:5/20000 train_loss:7.3524 train_time:6645ms step_avg:1328.95ms
-step:6/20000 train_loss:7.0866 train_time:7993ms step_avg:1332.18ms
-step:7/20000 train_loss:6.9398 train_time:9356ms step_avg:1336.57ms
-step:8/20000 train_loss:6.8951 train_time:10726ms step_avg:1340.75ms
-step:9/20000 train_loss:6.5426 train_time:12083ms step_avg:1342.55ms
-step:10/20000 train_loss:6.1426 train_time:13437ms step_avg:1343.72ms
-step:50/20000 train_loss:3.6826 train_time:68838ms step_avg:1376.77ms
-step:100/20000 train_loss:3.1286 train_time:138070ms step_avg:1380.70ms
-step:150/20000 train_loss:2.7613 train_time:272923ms step_avg:1819.49ms
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132634-x2sl50qr/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132634-x2sl50qr/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_132634-x2sl50qr/files/requirements.txt
+++ /dev/null
@@ -1,101 +0,0 @@
[… 101 deleted lines elided: identical requirements.txt (blob e3d59eea39) …]
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133036-mo9kb26s/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133036-mo9kb26s/files/output.log
deleted file mode 100644
index 0622a06eba..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133036-mo9kb26s/files/output.log
+++ /dev/null
@@ -1,32 +0,0 @@
-wandb:initialized
-warmup_step:1/20
-warmup_step:2/20
-warmup_step:3/20
-warmup_step:4/20
-warmup_step:5/20
-warmup_step:6/20
-warmup_step:7/20
-warmup_step:8/20
-warmup_step:9/20
-warmup_step:10/20
-warmup_step:11/20
-warmup_step:12/20
-warmup_step:13/20
-warmup_step:14/20
-warmup_step:15/20
-warmup_step:16/20
-warmup_step:17/20
-warmup_step:18/20
-warmup_step:19/20
-warmup_step:20/20
-step:0/20000 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms
-step:1/20000 train_loss:6.9310 train_time:2909ms step_avg:2908.98ms
-step:2/20000 train_loss:8.4243 train_time:5662ms step_avg:2830.81ms
-step:3/20000 train_loss:7.8519 train_time:8500ms step_avg:2833.22ms
-step:4/20000 train_loss:7.1213 train_time:11343ms step_avg:2835.65ms
-step:5/20000 train_loss:6.5923 train_time:14192ms step_avg:2838.40ms
-step:6/20000 train_loss:6.3670 train_time:17042ms step_avg:2840.33ms
-step:7/20000 train_loss:6.2103 train_time:19891ms step_avg:2841.54ms
-step:8/20000 train_loss:6.1333 train_time:22735ms step_avg:2841.84ms
-step:9/20000 train_loss:6.0992 train_time:25576ms step_avg:2841.78ms
-step:10/20000 train_loss:5.9961 train_time:28419ms step_avg:2841.93ms
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133036-mo9kb26s/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133036-mo9kb26s/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133036-mo9kb26s/files/requirements.txt
+++ /dev/null
@@ -1,101 +0,0 @@
[… 101 deleted lines elided: identical requirements.txt (blob e3d59eea39) …]
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133353-atic3pnd/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133353-atic3pnd/files/output.log
deleted file mode 100644
index 0b74980ee8..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133353-atic3pnd/files/output.log
+++ /dev/null
@@ -1,67 +0,0 @@
-wandb:initialized
-Traceback (most recent call last):
-  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2084, in <module>
-    main()
-  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1760, in main
-    warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer)
-                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__
-    return super().__call__(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
-    return self._call_impl(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
-    return forward_call(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper
-    return fn(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
-    return self._call_impl(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
-    return forward_call(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1027, in forward
-    def forward(self, input_ids: Tensor, target_ids: Tensor,
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn
-    return fn(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1200, in forward
-    return compiled_fn(full_args)
-           ^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 566, in runtime_wrapper
-    all_outs = call_func_at_runtime_with_args(
-               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args
-    out = normalize_as_list(f(args))
-          ^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 105, in g
-    return f(*args)
-           ^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 596, in apply
-    return super().apply(*args, **kwargs)  # type: ignore[misc]
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2503, in forward
-    fw_outs = call_func_at_runtime_with_args(
-              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args
-    out = normalize_as_list(f(args))
-          ^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 783, in wrapper
-    return compiled_fn(runtime_args)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1011, in inner_fn
-    outs = compiled_fn(args)
-           ^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__
-    return self.current_callable(inputs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run
-    out = model(new_inputs)
-          ^^^^^^^^^^^^^^^^^
-  File "/tmp/torchinductor_nesta/bj/cbjz4wub4qksf632gjklhdzgwbvw2qz6t5g7gsnwo3esf3zyblfo.py", line 10736, in call
-    buf771 = empty_strided_cuda((48, 2048, 1536), (3145728, 1536, 1), torch.bfloat16)
-             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 99.88 MiB is free. Process 273188 has 51.64 GiB memory in use. Process 275202 has 51.56 GiB memory in use. Including non-PyTorch memory, this process has 36.48 GiB memory in use. 79.97 GiB allowed; Of the allocated memory 35.81 GiB is allocated by PyTorch, and 11.63 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf)
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133353-atic3pnd/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133353-atic3pnd/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133353-atic3pnd/files/requirements.txt
+++ /dev/null
@@ -1,101 +0,0 @@
[… 101 deleted lines elided: identical requirements.txt (blob e3d59eea39) …]
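Unlike the first crash, this OOM reports "79.97 GiB allowed" with only ~36 GiB actually allocated by this process, which is the signature of a per-process allocator cap while several seed runs share GPU 0. A hedged sketch of installing such a cap; the 0.57 fraction is back-computed from 79.97 GiB on a 139.80 GiB device and is an assumption, not a value taken from the training scripts:

```python
import torch

# Illustrative cap: limit this process to ~57% of the device so that
# concurrent seed runs sharing GPU 0 cannot starve each other.
# 0.57 * 139.80 GiB ~= 79.7 GiB, matching the "allowed" figure above.
if torch.cuda.is_available():
    torch.cuda.set_per_process_memory_fraction(0.57, device=0)
```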
b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133558-nwftkz5m/files/output.log deleted file mode 100644 index b0a576e071..0000000000 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133558-nwftkz5m/files/output.log +++ /dev/null @@ -1,122 +0,0 @@ -wandb:initialized -warmup_step:1/20 -warmup_step:2/20 -warmup_step:3/20 -warmup_step:4/20 -warmup_step:5/20 -warmup_step:6/20 -warmup_step:7/20 -warmup_step:8/20 -warmup_step:9/20 -warmup_step:10/20 -warmup_step:11/20 -warmup_step:12/20 -warmup_step:13/20 -warmup_step:14/20 -warmup_step:15/20 -warmup_step:16/20 -warmup_step:17/20 -warmup_step:18/20 -warmup_step:19/20 -warmup_step:20/20 -step:0/20000 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms -step:1/20000 train_loss:6.9310 train_time:1286ms step_avg:1285.72ms -step:2/20000 train_loss:8.4243 train_time:2520ms step_avg:1259.75ms -step:3/20000 train_loss:7.5899 train_time:3791ms step_avg:1263.65ms -step:4/20000 train_loss:7.3604 train_time:5065ms step_avg:1266.15ms -step:5/20000 train_loss:7.2017 train_time:6337ms step_avg:1267.44ms -step:6/20000 train_loss:7.1139 train_time:7608ms step_avg:1267.94ms -step:7/20000 train_loss:7.0266 train_time:8884ms step_avg:1269.15ms -step:8/20000 train_loss:6.8703 train_time:10155ms step_avg:1269.33ms -step:9/20000 train_loss:6.5277 train_time:11432ms step_avg:1270.24ms -step:10/20000 train_loss:6.1364 train_time:12711ms step_avg:1271.13ms -step:50/20000 train_loss:3.7012 train_time:66255ms step_avg:1325.09ms -step:100/20000 train_loss:3.1707 train_time:133824ms step_avg:1338.24ms -step:150/20000 train_loss:2.8377 train_time:201511ms step_avg:1343.41ms -step:200/20000 train_loss:2.6154 train_time:269226ms step_avg:1346.13ms -step:250/20000 train_loss:2.6168 train_time:336977ms step_avg:1347.91ms -step:300/20000 train_loss:2.4755 train_time:404746ms step_avg:1349.15ms -step:350/20000 train_loss:2.5226 train_time:472525ms step_avg:1350.07ms -step:400/20000 train_loss:2.4300 train_time:540349ms step_avg:1350.87ms -step:450/20000 train_loss:2.2536 train_time:608178ms step_avg:1351.51ms -step:500/20000 train_loss:2.3113 train_time:676567ms step_avg:1353.13ms -step:500/20000 val_loss:2.3274 val_bpb:1.3784 train_time:676611ms step_avg:1353.22ms -step:550/20000 train_loss:2.3695 train_time:744686ms step_avg:1353.98ms -step:600/20000 train_loss:2.2671 train_time:812797ms step_avg:1354.66ms -step:650/20000 train_loss:2.2373 train_time:880946ms step_avg:1355.30ms -step:700/20000 train_loss:2.3081 train_time:949104ms step_avg:1355.86ms -step:750/20000 train_loss:2.2737 train_time:1017254ms step_avg:1356.34ms -step:800/20000 train_loss:2.2530 train_time:1085409ms step_avg:1356.76ms -step:850/20000 train_loss:2.1855 train_time:1153602ms step_avg:1357.18ms -step:900/20000 train_loss:2.1047 train_time:1221810ms step_avg:1357.57ms -step:950/20000 train_loss:2.3058 train_time:1290011ms step_avg:1357.91ms -step:1000/20000 train_loss:2.2370 train_time:1358693ms step_avg:1358.69ms -step:1000/20000 val_loss:2.1807 val_bpb:1.2916 train_time:1358737ms step_avg:1358.74ms -step:1050/20000 train_loss:2.1633 train_time:1426912ms step_avg:1358.96ms -step:1100/20000 train_loss:2.1901 train_time:1495141ms step_avg:1359.22ms -step:1150/20000 train_loss:2.1451 train_time:1563371ms step_avg:1359.45ms -step:1200/20000 train_loss:2.1925 train_time:1631660ms step_avg:1359.72ms -step:1250/20000 train_loss:2.2160 train_time:1699949ms step_avg:1359.96ms -step:1300/20000 train_loss:2.1869 train_time:1768261ms 
step_avg:1360.20ms -step:1350/20000 train_loss:2.1586 train_time:1836588ms step_avg:1360.44ms -step:1400/20000 train_loss:2.1726 train_time:1904911ms step_avg:1360.65ms -step:1450/20000 train_loss:2.1689 train_time:1973248ms step_avg:1360.86ms -step:1500/20000 train_loss:2.1391 train_time:2042104ms step_avg:1361.40ms -step:1500/20000 val_loss:2.1246 val_bpb:1.2583 train_time:2042149ms step_avg:1361.43ms -step:1550/20000 train_loss:2.1090 train_time:2110481ms step_avg:1361.60ms -step:1600/20000 train_loss:2.1871 train_time:2178869ms step_avg:1361.79ms -step:1650/20000 train_loss:1.9698 train_time:2247283ms step_avg:1361.99ms -step:1700/20000 train_loss:2.0933 train_time:2315736ms step_avg:1362.20ms -step:1750/20000 train_loss:2.0614 train_time:2384155ms step_avg:1362.37ms -step:1800/20000 train_loss:2.0974 train_time:2452595ms step_avg:1362.55ms -step:1850/20000 train_loss:2.1094 train_time:2521066ms step_avg:1362.74ms -step:1900/20000 train_loss:2.0530 train_time:2589507ms step_avg:1362.90ms -step:1950/20000 train_loss:2.0371 train_time:2657961ms step_avg:1363.06ms -step:2000/20000 train_loss:2.2915 train_time:2726880ms step_avg:1363.44ms -step:2000/20000 val_loss:2.0686 val_bpb:1.2252 train_time:2726924ms step_avg:1363.46ms -step:2050/20000 train_loss:2.0580 train_time:2795382ms step_avg:1363.60ms -step:2100/20000 train_loss:2.0309 train_time:2863872ms step_avg:1363.75ms -step:2150/20000 train_loss:2.0080 train_time:2932366ms step_avg:1363.89ms -step:2200/20000 train_loss:2.1611 train_time:3000869ms step_avg:1364.03ms -step:2250/20000 train_loss:2.0531 train_time:3069359ms step_avg:1364.16ms -step:2300/20000 train_loss:2.0314 train_time:3137845ms step_avg:1364.28ms -step:2350/20000 train_loss:1.9839 train_time:3206343ms step_avg:1364.40ms -step:2400/20000 train_loss:2.0978 train_time:3274856ms step_avg:1364.52ms -step:2450/20000 train_loss:2.0583 train_time:3343349ms step_avg:1364.63ms -step:2500/20000 train_loss:2.0143 train_time:3412281ms step_avg:1364.91ms -step:2500/20000 val_loss:2.0210 val_bpb:1.1969 train_time:3412325ms step_avg:1364.93ms -step:2550/20000 train_loss:2.0163 train_time:3480766ms step_avg:1365.01ms -step:2600/20000 train_loss:1.9947 train_time:3549233ms step_avg:1365.09ms -step:2650/20000 train_loss:1.9997 train_time:3617731ms step_avg:1365.18ms -step:2700/20000 train_loss:2.0195 train_time:3686191ms step_avg:1365.26ms -step:2750/20000 train_loss:2.0010 train_time:3754675ms step_avg:1365.34ms -step:2800/20000 train_loss:2.0359 train_time:3823161ms step_avg:1365.41ms -swa:start step:2850 -step:2850/20000 train_loss:1.9860 train_time:3891626ms step_avg:1365.48ms -step:2900/20000 train_loss:2.0033 train_time:3960176ms step_avg:1365.58ms -step:2950/20000 train_loss:2.0417 train_time:4028712ms step_avg:1365.67ms -late_qat:enabled step:2990 scale:0.1498 -step:3000/20000 train_loss:1.9297 train_time:4097686ms step_avg:1365.90ms -step:3000/20000 val_loss:1.9846 val_bpb:1.1754 train_time:4097782ms step_avg:1365.93ms -step:3050/20000 train_loss:1.9368 train_time:4166118ms step_avg:1365.94ms -step:3100/20000 train_loss:2.0003 train_time:4234401ms step_avg:1365.94ms -step:3150/20000 train_loss:2.0099 train_time:4302671ms step_avg:1365.93ms -step:3200/20000 train_loss:1.9846 train_time:4370945ms step_avg:1365.92ms -step:3250/20000 train_loss:1.9515 train_time:4439218ms step_avg:1365.91ms -step:3300/20000 train_loss:1.9330 train_time:4507468ms step_avg:1365.90ms -step:3350/20000 train_loss:1.9699 train_time:4575766ms step_avg:1365.90ms -step:3400/20000 train_loss:2.0133 
train_time:4644041ms step_avg:1365.89ms
-step:3450/20000 train_loss:1.9670 train_time:4712315ms step_avg:1365.89ms
-step:3500/20000 train_loss:1.9497 train_time:4781271ms step_avg:1366.08ms
-step:3500/20000 val_loss:1.9558 val_bpb:1.1583 train_time:4781368ms step_avg:1366.11ms
-step:3514/20000 val_loss:1.9557 val_bpb:1.1583 train_time:4800532ms step_avg:1366.12ms
-stopping_early: wallclock_cap train_time:4800532ms step:3514/20000
-peak memory allocated: 50545 MiB reserved: 50594 MiB
-ema:applying EMA weights
-DIAGNOSTIC post_ema val_loss:1.9532 val_bpb:1.1568 eval_time:32961ms
-Serialized model: 106023671 bytes
-Code size: 99082 bytes
-Serialized model int6+lzma: 14754232 bytes
-Total submission size int6+lzma: 14853314 bytes
-final_int6_roundtrip val_loss:1.9685 val_bpb:1.1659 eval_time:63637ms
-final_int6_roundtrip_exact val_loss:1.96850574 val_bpb:1.16585998
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133558-nwftkz5m/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133558-nwftkz5m/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_133558-nwftkz5m/files/requirements.txt
+++ /dev/null
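The `Serialized model int6+lzma` and `final_int6_roundtrip` lines above come from the export path: weights are quantized to int6, lzma-compressed for the size check, then dequantized and re-scored. A minimal sketch of that measurement, assuming symmetric per-row scales; the record's actual `quant.py` / GPTQ-lite path with error feedback may differ:

```python
# Hedged sketch of an int6+lzma size/roundtrip check. `int6_roundtrip`
# is an illustrative name, not the function used in quant.py.
import lzma
import torch

def int6_roundtrip(w: torch.Tensor):
    """Symmetric per-row int6 fake-quant: returns (dequantized, compressed bytes)."""
    qmax = 31  # symmetric int6 range [-31, 31], leaving -32 unused
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    q = (w / scale).round().clamp(-qmax, qmax).to(torch.int8)
    # one byte per value before compression; lzma recovers most of the 2 unused bits
    payload = lzma.compress(q.numpy().tobytes())
    return q.float() * scale, payload

w = torch.randn(512, 512)
w_hat, payload = int6_roundtrip(w)
print(f"lzma bytes: {len(payload)}, max abs err: {(w - w_hat).abs().max().item():.4f}")
```

Note the roundtrip gap in this run is small (1.1568 to 1.1659 bpb post-quant), i.e. the recurrent core held up under int6 here.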
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_151933-hmzzit2n/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_151933-hmzzit2n/files/output.log
deleted file mode 100644
index f7740182bb..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_151933-hmzzit2n/files/output.log
+++ /dev/null
@@ -1,42 +0,0 @@
-wandb:initialized
-warmup_step:1/20
-warmup_step:2/20
-warmup_step:3/20
-warmup_step:4/20
-warmup_step:5/20
-warmup_step:6/20
-warmup_step:7/20
-warmup_step:8/20
-warmup_step:9/20
-warmup_step:10/20
-warmup_step:11/20
-warmup_step:12/20
-warmup_step:13/20
-warmup_step:14/20
-warmup_step:15/20
-warmup_step:16/20
-warmup_step:17/20
-warmup_step:18/20
-warmup_step:19/20
-warmup_step:20/20
-step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.02ms h_norms=['2856693.8', '3617744.0', '3024587.2', '2505640.5', '2045268.8', '1692691.0', '1466293.1', '1317613.1', '1128710.0', '1036196.3', '1421818.8', '1274564.5', '1193117.0', '1029778.4', '1490634.8', '1094878.2', '980524.6', '772689.9', '2807584.8', '1688950.2'] growth=['37.088', '1.266', '0.836', '0.828', '0.816', '0.828', '0.866', '0.899', '0.857', '0.918', '0.893', '0.896', '0.936', '0.863', '1.448', '0.874', '0.896', '0.788', '3.634', '0.602']
-step:1/500 train_loss:6.9303 grad_norm:0.3807 train_time:3120ms step_avg:3120.20ms
-late_qat:enabled step:1 scale:0.1125 core_quant:on
-step:2/500 train_loss:8.2958 grad_norm:3.6201 train_time:6355ms step_avg:3177.30ms
-step:3/500 train_loss:7.7840 grad_norm:1.9355 train_time:9607ms step_avg:3202.48ms
-step:4/500 train_loss:7.6312 grad_norm:2.0781 train_time:12861ms step_avg:3215.33ms
-step:5/500 train_loss:7.2389 grad_norm:1.7372 train_time:16114ms step_avg:3222.85ms
-step:6/500 train_loss:6.9255 grad_norm:1.3250 train_time:19368ms step_avg:3227.98ms
-step:7/500 train_loss:6.7056 grad_norm:1.6786 train_time:22623ms step_avg:3231.91ms
-step:8/500 train_loss:6.5605 grad_norm:1.9357 train_time:25876ms step_avg:3234.53ms
-step:9/500 train_loss:6.3963 grad_norm:1.8729 train_time:29129ms step_avg:3236.52ms
-step:10/500 train_loss:6.2184 grad_norm:1.5032 train_time:32381ms step_avg:3238.14ms
-step:20/500 train_loss:5.6501 grad_norm:0.3987 train_time:64880ms step_avg:3244.02ms
-step:30/500 train_loss:5.4810 grad_norm:0.2723 train_time:97397ms step_avg:3246.58ms
-step:40/500 train_loss:5.3294 grad_norm:0.3154 train_time:130061ms step_avg:3251.52ms
-step:50/500 train_loss:5.1292 grad_norm:0.4699 train_time:162596ms step_avg:3251.93ms
-step:50/500 val_loss:5.1050 val_bpb:3.0235 train_time:162628ms step_avg:3252.56ms h_norms=['5481.6', '5232.7', '5130.8', '5133.4', '5378.8', '5546.0', '6001.6', '6423.6', '6559.1', '6909.3', '6350.7', '7392.7', '7713.3', '8030.2', '8264.0', '7556.6', '7855.9', '8165.5', '8383.7', '8953.6'] growth=['0.776', '0.955', '0.981', '1.000', '1.048', '1.031', '1.082', '1.070', '1.021', '1.053', '1.047', '1.164', '1.043', '1.041', '1.029', '1.066', '1.040', '1.039', '1.027', '1.068']
-step:60/500 train_loss:4.8449 grad_norm:1.4378 train_time:195133ms step_avg:3252.21ms
-step:70/500 train_loss:4.6316 grad_norm:0.9923 train_time:227701ms step_avg:3252.87ms
-step:80/500 train_loss:4.4605 grad_norm:0.7018 train_time:285889ms step_avg:3573.62ms
-step:90/500 train_loss:4.2816 grad_norm:0.4120 train_time:355477ms step_avg:3949.75ms
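This run logs `late_qat:enabled step:1 scale:0.1125 core_quant:on`, i.e. quantization-sized noise is injected into the shared core from the first step so the recurrence learns to tolerate the error it will see after export. A hedged sketch of that mechanism, assuming uniform noise at the int6 step size scaled by the logged ramp value; the names are illustrative and `train_bestbase_recurrent_qat.py` may implement it differently:

```python
# Hypothetical noisy-QAT perturbation: add up to half an int6 quantization
# step of uniform noise per weight, scaled by the logged ramp `scale`.
import torch

def qat_noise_(weight: torch.Tensor, scale: float, qmax: int = 31) -> None:
    """In-place: perturb `weight` by uniform noise on the int6 step size."""
    step = weight.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    weight.add_(torch.empty_like(weight).uniform_(-0.5, 0.5) * step * scale)

with torch.no_grad():
    w = torch.randn(512, 512)
    qat_noise_(w, scale=0.1125)  # the ramp value logged at step 1 above
```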
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_151933-hmzzit2n/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_151933-hmzzit2n/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_151933-hmzzit2n/files/requirements.txt
+++ /dev/null
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_152723-snlqmhr2/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_152723-snlqmhr2/files/output.log
deleted file mode 100644
index 1aefa45d4f..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_152723-snlqmhr2/files/output.log
+++ /dev/null
@@ -1,51 +0,0 @@
-wandb:initialized
-warmup_step:1/20
-warmup_step:2/20
-warmup_step:3/20
-warmup_step:4/20
-warmup_step:5/20
-warmup_step:6/20
-warmup_step:7/20
-warmup_step:8/20
-warmup_step:9/20
-warmup_step:10/20
-warmup_step:11/20
-warmup_step:12/20
-warmup_step:13/20
-warmup_step:14/20
-warmup_step:15/20
-warmup_step:16/20
-warmup_step:17/20
-warmup_step:18/20
-warmup_step:19/20
-warmup_step:20/20
-step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.02ms h_norms=['3412002.5', '3162921.2', '9173489.0', '7602557.5', '6195891.5', '5488329.5', '11289328.0', '9569341.0', '7780140.0', '6543904.0', '3864158.2', '3367786.2', '3060453.0',
'2578833.0', '2199645.5', '3025920.8', '2612817.8', '2319063.8', '1985769.4', '1676057.1'] growth=['40.638', '0.927', '2.900', '0.829', '0.815', '0.886', '2.057', '0.848', '0.813', '0.841', '0.926', '0.872', '0.909', '0.843', '0.853', '0.893', '0.863', '0.888', '0.856', '0.844'] -step:1/500 train_loss:6.9303 grad_norm:0.3807 train_time:3123ms step_avg:3122.62ms -step:2/500 train_loss:8.2188 grad_norm:3.3853 train_time:6236ms step_avg:3118.00ms -step:3/500 train_loss:28.9039 grad_norm:69.9957 train_time:9371ms step_avg:3123.80ms -step:4/500 train_loss:25.5037 grad_norm:61.7113 train_time:12508ms step_avg:3126.98ms -step:5/500 train_loss:7.6034 grad_norm:1.4821 train_time:15651ms step_avg:3130.24ms -step:6/500 train_loss:26.0025 grad_norm:97.9849 train_time:18790ms step_avg:3131.59ms -step:7/500 train_loss:16.5605 grad_norm:40.5487 train_time:21927ms step_avg:3132.48ms -step:8/500 train_loss:11.7107 grad_norm:10.1270 train_time:25068ms step_avg:3133.44ms -step:9/500 train_loss:11.1058 grad_norm:11.1070 train_time:28207ms step_avg:3134.13ms -step:10/500 train_loss:8.9177 grad_norm:2.2146 train_time:31350ms step_avg:3134.98ms -step:20/500 train_loss:18.2283 grad_norm:171.1513 train_time:62685ms step_avg:3134.24ms -step:30/500 train_loss:7.3933 grad_norm:3.9780 train_time:93988ms step_avg:3132.95ms -step:40/500 train_loss:6.1536 grad_norm:0.7697 train_time:125446ms step_avg:3136.15ms -step:50/500 train_loss:5.7265 grad_norm:0.4587 train_time:156800ms step_avg:3136.00ms -step:50/500 val_loss:5.6210 val_bpb:3.3291 train_time:156832ms step_avg:3136.63ms h_norms=['1483117.2', '2001691.6', '5509635.5', '3288294.0', '2112858.5', '9030652.0', '23049164.0', '14409614.0', '9373722.0', '5569532.0', '1012941.9', '764749.4', '575192.3', '488720.7', '685087.8', '1171268.4', '1059284.2', '703405.6', '689976.9', '514776.4'] growth=['1.514', '1.350', '2.752', '0.597', '0.643', '4.274', '2.552', '0.625', '0.651', '0.594', '0.774', '0.755', '0.752', '0.850', '1.402', '1.114', '0.904', '0.664', '0.981', '0.746'] -step:60/500 train_loss:5.3182 grad_norm:1.1502 train_time:188173ms step_avg:3136.22ms -step:70/500 train_loss:4.9805 grad_norm:0.4400 train_time:219545ms step_avg:3136.35ms -step:80/500 train_loss:4.6777 grad_norm:0.3607 train_time:250922ms step_avg:3136.52ms -step:90/500 train_loss:4.4124 grad_norm:0.7668 train_time:282302ms step_avg:3136.69ms -step:100/500 train_loss:4.2591 grad_norm:0.5623 train_time:313684ms step_avg:3136.84ms -step:100/500 val_loss:4.2203 val_bpb:2.4995 train_time:313716ms step_avg:3137.16ms h_norms=['2539958.8', '1891926.2', '1842518.8', '1259791.9', '834439.0', '2452532.2', '7493377.5', '4728913.5', '3378154.8', '1967445.1', '10342628.0', '6845392.5', '4585671.5', '3120444.5', '2068628.8', '1138807.9', '1212089.2', '1098464.0', '943749.6', '733953.3'] growth=['3.252', '0.745', '0.974', '0.684', '0.662', '2.939', '3.055', '0.631', '0.714', '0.582', '12.042', '0.662', '0.670', '0.680', '0.663', '1.142', '1.064', '0.906', '0.859', '0.778'] -step:110/500 train_loss:4.1071 grad_norm:0.5719 train_time:345069ms step_avg:3136.99ms -step:120/500 train_loss:3.9553 grad_norm:0.6834 train_time:376449ms step_avg:3137.07ms -step:130/500 train_loss:3.8550 grad_norm:2.0622 train_time:407872ms step_avg:3137.47ms -step:140/500 train_loss:3.7516 grad_norm:0.7069 train_time:439278ms step_avg:3137.70ms -step:150/500 train_loss:3.6360 grad_norm:0.2515 train_time:470666ms step_avg:3137.77ms -step:150/500 val_loss:3.6143 val_bpb:2.1406 train_time:470698ms step_avg:3137.98ms h_norms=['1691694.6', 
'1214504.1', '1032287.4', '726402.4', '507756.3', '1093338.5', '3688532.0', '2256534.5', '1410329.9', '885931.9', '2614546.0', '1920724.0', '1389947.5', '1044625.5', '756565.3', '642715.2', '699743.8', '648021.7', '565602.9', '444204.0'] growth=['3.923', '0.718', '0.850', '0.704', '0.699', '2.153', '3.374', '0.612', '0.625', '0.628', '5.054', '0.735', '0.724', '0.752', '0.724', '1.191', '1.089', '0.926', '0.873', '0.785']
-step:160/500 train_loss:3.5437 grad_norm:0.7928 train_time:502073ms step_avg:3137.96ms
-step:170/500 train_loss:3.3979 grad_norm:0.3021 train_time:533493ms step_avg:3138.19ms
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_152723-snlqmhr2/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_152723-snlqmhr2/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_152723-snlqmhr2/files/requirements.txt
+++ /dev/null
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_154455-y5p28i5r/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_154455-y5p28i5r/files/output.log
deleted file mode 100644
index 37c9fe540c..0000000000
---
a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_154455-y5p28i5r/files/output.log +++ /dev/null @@ -1,100 +0,0 @@ -wandb:initialized -warmup_step:1/20 -warmup_step:2/20 -warmup_step:3/20 -warmup_step:4/20 -warmup_step:5/20 -warmup_step:6/20 -warmup_step:7/20 -warmup_step:8/20 -warmup_step:9/20 -warmup_step:10/20 -warmup_step:11/20 -warmup_step:12/20 -warmup_step:13/20 -warmup_step:14/20 -warmup_step:15/20 -warmup_step:16/20 -warmup_step:17/20 -warmup_step:18/20 -warmup_step:19/20 -warmup_step:20/20 -step:0/500 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['14828.0', '14150.7', '13588.8', '13118.6', '12709.1', '12463.7', '12206.9', '11992.7', '11805.3', '11630.9', '12068.4', '11893.6', '11752.5', '11630.9', '11516.5', '11809.8', '11709.4', '11634.5', '11573.3', '11514.1'] growth=['0.947', '0.954', '0.960', '0.965', '0.969', '0.981', '0.979', '0.982', '0.984', '0.985', '0.988', '0.986', '0.988', '0.990', '0.990', '0.995', '0.992', '0.994', '0.995', '0.995'] -step:1/500 train_loss:6.9310 grad_norm:0.3717 train_time:3145ms step_avg:3144.67ms -step:2/500 train_loss:8.3473 grad_norm:3.5331 train_time:6277ms step_avg:3138.33ms -step:3/500 train_loss:7.5115 grad_norm:1.8520 train_time:9436ms step_avg:3145.34ms -step:4/500 train_loss:7.5611 grad_norm:1.8993 train_time:12598ms step_avg:3149.58ms -step:5/500 train_loss:7.3182 grad_norm:1.9103 train_time:15757ms step_avg:3151.41ms -step:6/500 train_loss:7.0753 grad_norm:1.7013 train_time:18921ms step_avg:3153.55ms -step:7/500 train_loss:6.9528 grad_norm:2.0667 train_time:22082ms step_avg:3154.56ms -step:8/500 train_loss:6.9028 grad_norm:1.4281 train_time:25244ms step_avg:3155.47ms -step:9/500 train_loss:6.5408 grad_norm:1.0079 train_time:28404ms step_avg:3156.00ms -step:10/500 train_loss:6.1499 grad_norm:0.9864 train_time:31565ms step_avg:3156.49ms -step:20/500 train_loss:4.7832 grad_norm:1.0980 train_time:63165ms step_avg:3158.27ms -step:30/500 train_loss:4.1875 grad_norm:1.0659 train_time:94790ms step_avg:3159.67ms -step:40/500 train_loss:3.8630 grad_norm:0.8877 train_time:126560ms step_avg:3164.01ms -step:50/500 train_loss:3.6884 grad_norm:0.7170 train_time:158209ms step_avg:3164.18ms -step:50/500 val_loss:3.6586 val_bpb:2.1668 train_time:158241ms step_avg:3164.81ms h_norms=['12735.5', '11200.0', '10164.1', '9525.6', '9159.4', '9155.7', '9167.3', '9205.2', '9275.0', '9389.5', '9135.8', '9214.4', '9298.5', '9399.7', '9535.9', '9187.1', '9297.4', '9404.6', '9522.6', '9670.8'] growth=['0.845', '0.879', '0.908', '0.937', '0.962', '1.000', '1.001', '1.004', '1.008', '1.012', '1.011', '1.009', '1.009', '1.011', '1.014', '1.016', '1.012', '1.012', '1.013', '1.016'] -step:60/500 train_loss:3.5065 grad_norm:1.0078 train_time:189842ms step_avg:3164.03ms -step:70/500 train_loss:3.4063 grad_norm:0.7213 train_time:221493ms step_avg:3164.18ms -step:80/500 train_loss:3.3329 grad_norm:0.5494 train_time:253155ms step_avg:3164.43ms -step:90/500 train_loss:3.1786 grad_norm:0.4390 train_time:284819ms step_avg:3164.66ms -step:100/500 train_loss:3.1304 grad_norm:0.5572 train_time:316454ms step_avg:3164.54ms -step:100/500 val_loss:3.0898 val_bpb:1.8300 train_time:316486ms step_avg:3164.86ms h_norms=['13387.7', '11921.1', '11209.5', '11093.3', '11390.5', '11631.9', '12018.0', '12495.4', '13068.8', '13826.8', '12333.3', '12773.9', '13286.8', '13868.9', '14643.5', '13114.6', '13524.0', '14018.2', '14576.3', '15343.0'] growth=['0.849', '0.890', '0.940', '0.990', '1.027', '1.021', '1.033', '1.040', '1.046', '1.058', 
'1.018', '1.036', '1.040', '1.044', '1.056', '1.006', '1.031', '1.037', '1.040', '1.053'] -step:110/500 train_loss:3.0264 grad_norm:0.3633 train_time:348121ms step_avg:3164.73ms -step:120/500 train_loss:2.9410 grad_norm:0.3409 train_time:379800ms step_avg:3165.00ms -step:130/500 train_loss:2.8723 grad_norm:0.3861 train_time:411469ms step_avg:3165.15ms -step:140/500 train_loss:2.8200 grad_norm:0.2985 train_time:443103ms step_avg:3165.02ms -step:150/500 train_loss:2.7805 grad_norm:0.3407 train_time:474745ms step_avg:3164.97ms -step:150/500 val_loss:2.7663 val_bpb:1.6384 train_time:474777ms step_avg:3165.18ms h_norms=['14917.2', '13747.5', '13056.9', '12886.7', '13272.0', '13836.5', '13983.7', '14170.1', '14538.1', '15295.8', '14298.5', '14552.7', '14814.1', '15249.9', '16101.4', '14844.7', '15109.6', '15379.1', '15827.9', '16720.2'] growth=['0.910', '0.922', '0.950', '0.987', '1.030', '1.043', '1.011', '1.013', '1.026', '1.052', '1.027', '1.018', '1.018', '1.029', '1.056', '1.002', '1.018', '1.018', '1.029', '1.056'] -step:160/500 train_loss:2.7721 grad_norm:0.4282 train_time:506415ms step_avg:3165.09ms -step:170/500 train_loss:2.7118 grad_norm:0.3055 train_time:538058ms step_avg:3165.05ms -step:180/500 train_loss:2.6200 grad_norm:0.2472 train_time:569702ms step_avg:3165.01ms -step:190/500 train_loss:2.6444 grad_norm:0.3218 train_time:601362ms step_avg:3165.06ms -step:200/500 train_loss:2.5645 grad_norm:0.2424 train_time:633157ms step_avg:3165.79ms -step:200/500 val_loss:2.6022 val_bpb:1.5411 train_time:633189ms step_avg:3165.94ms h_norms=['17029.4', '15746.8', '14904.1', '14604.7', '14967.2', '16073.3', '15851.9', '15730.9', '15825.5', '16423.3', '16300.0', '16203.6', '16170.9', '16336.4', '17025.2', '16594.0', '16523.8', '16512.3', '16692.3', '17399.1'] growth=['0.922', '0.925', '0.946', '0.980', '1.025', '1.074', '0.986', '0.992', '1.006', '1.038', '1.056', '0.994', '0.998', '1.010', '1.042', '1.026', '0.996', '0.999', '1.011', '1.042'] -step:210/500 train_loss:2.5650 grad_norm:0.3549 train_time:664794ms step_avg:3165.68ms -step:220/500 train_loss:2.6050 grad_norm:0.3495 train_time:696436ms step_avg:3165.62ms -step:230/500 train_loss:2.5417 grad_norm:0.3301 train_time:728081ms step_avg:3165.57ms -step:240/500 train_loss:2.5378 grad_norm:0.2441 train_time:759737ms step_avg:3165.57ms -step:250/500 train_loss:2.5756 grad_norm:0.3062 train_time:791358ms step_avg:3165.43ms -step:250/500 val_loss:2.5268 val_bpb:1.4965 train_time:791390ms step_avg:3165.56ms h_norms=['18846.3', '17410.6', '16438.8', '15959.5', '16305.4', '17764.0', '17388.4', '17140.6', '17049.7', '17635.7', '17928.4', '17683.9', '17472.3', '17397.4', '17960.3', '18072.6', '17862.5', '17642.2', '17546.1', '18026.9'] growth=['0.917', '0.924', '0.944', '0.971', '1.022', '1.089', '0.979', '0.986', '0.995', '1.034', '1.076', '0.986', '0.988', '0.996', '1.032', '1.045', '0.988', '0.988', '0.995', '1.027'] -step:260/500 train_loss:2.5344 grad_norm:0.2526 train_time:823044ms step_avg:3165.55ms -step:270/500 train_loss:2.4962 grad_norm:0.3197 train_time:854686ms step_avg:3165.50ms -step:280/500 train_loss:2.4329 grad_norm:0.2331 train_time:886309ms step_avg:3165.39ms -step:290/500 train_loss:2.4736 grad_norm:0.2550 train_time:917933ms step_avg:3165.29ms -step:300/500 train_loss:2.4380 grad_norm:0.2264 train_time:949572ms step_avg:3165.24ms -step:300/500 val_loss:2.4532 val_bpb:1.4529 train_time:949604ms step_avg:3165.35ms h_norms=['20941.1', '19182.7', '18052.0', '17457.7', '17865.7', '19511.0', '19019.9', '18704.8', '18540.8', 
'19344.5', '19674.8', '19330.4', '18996.2', '18791.5', '19454.1', '19743.2', '19426.2', '19055.5', '18799.3', '19275.4'] growth=['0.909', '0.916', '0.941', '0.967', '1.023', '1.092', '0.975', '0.983', '0.991', '1.043', '1.080', '0.982', '0.983', '0.989', '1.035', '1.050', '0.984', '0.981', '0.987', '1.025'] -step:310/500 train_loss:2.3557 grad_norm:0.2355 train_time:981222ms step_avg:3165.23ms -step:320/500 train_loss:2.4268 grad_norm:0.2275 train_time:1012875ms step_avg:3165.24ms -step:330/500 train_loss:2.4731 grad_norm:0.2582 train_time:1044522ms step_avg:3165.22ms -step:340/500 train_loss:2.3813 grad_norm:0.2222 train_time:1076174ms step_avg:3165.22ms -step:350/500 train_loss:2.4827 grad_norm:0.2099 train_time:1107944ms step_avg:3165.55ms -step:350/500 val_loss:2.4038 val_bpb:1.4237 train_time:1107975ms step_avg:3165.64ms h_norms=['23083.6', '21048.2', '19811.9', '19011.2', '19310.5', '21330.8', '20798.1', '20502.5', '20108.1', '20815.6', '21418.1', '21061.6', '20705.5', '20224.1', '20676.4', '21384.5', '21039.3', '20616.6', '20063.3', '20274.8'] growth=['0.904', '0.912', '0.941', '0.960', '1.016', '1.105', '0.975', '0.986', '0.981', '1.035', '1.096', '0.983', '0.983', '0.977', '1.022', '1.067', '0.984', '0.980', '0.973', '1.011'] -step:360/500 train_loss:2.2488 grad_norm:0.2012 train_time:1139604ms step_avg:3165.57ms -step:370/500 train_loss:2.4498 grad_norm:0.1714 train_time:1171266ms step_avg:3165.58ms -step:380/500 train_loss:2.3948 grad_norm:0.2284 train_time:1202904ms step_avg:3165.54ms -step:390/500 train_loss:2.3586 grad_norm:0.1837 train_time:1234583ms step_avg:3165.60ms -step:400/500 train_loss:2.4046 grad_norm:0.2053 train_time:1266209ms step_avg:3165.52ms -step:400/500 val_loss:2.3705 val_bpb:1.4040 train_time:1266241ms step_avg:3165.60ms h_norms=['25593.2', '23180.9', '21902.2', '20833.8', '21042.7', '23067.7', '22480.5', '22319.1', '21743.0', '22667.8', '23199.6', '22798.3', '22512.9', '21781.8', '22345.9', '23150.2', '22743.8', '22358.7', '21535.2', '21760.3'] growth=['0.893', '0.906', '0.945', '0.951', '1.010', '1.096', '0.975', '0.993', '0.974', '1.043', '1.091', '0.983', '0.987', '0.968', '1.026', '1.067', '0.982', '0.983', '0.963', '1.010'] -step:410/500 train_loss:2.3694 grad_norm:0.2156 train_time:1297880ms step_avg:3165.56ms -step:420/500 train_loss:2.4022 grad_norm:0.1854 train_time:1329534ms step_avg:3165.56ms -step:430/500 train_loss:2.3222 grad_norm:0.2276 train_time:1361181ms step_avg:3165.54ms -step:440/500 train_loss:2.4095 grad_norm:0.1854 train_time:1392812ms step_avg:3165.48ms -step:450/500 train_loss:2.2365 grad_norm:0.2154 train_time:1424471ms step_avg:3165.49ms -step:450/500 val_loss:2.3386 val_bpb:1.3851 train_time:1424502ms step_avg:3165.56ms h_norms=['28710.0', '25807.1', '24367.3', '23021.1', '23326.2', '25712.0', '25093.1', '24936.9', '24053.5', '25235.9', '25868.9', '25395.7', '25094.3', '23995.7', '24591.2', '25723.8', '25209.5', '24761.3', '23522.8', '23651.5'] growth=['0.884', '0.899', '0.944', '0.945', '1.013', '1.102', '0.976', '0.994', '0.965', '1.049', '1.100', '0.982', '0.988', '0.956', '1.025', '1.077', '0.980', '0.982', '0.950', '1.005'] -step:460/500 train_loss:2.3715 grad_norm:0.2246 train_time:1456119ms step_avg:3165.48ms -step:470/500 train_loss:2.3155 grad_norm:0.2162 train_time:1487752ms step_avg:3165.43ms -step:480/500 train_loss:2.2437 grad_norm:0.1745 train_time:1519404ms step_avg:3165.42ms -step:490/500 train_loss:2.2728 grad_norm:0.2349 train_time:1551048ms step_avg:3165.41ms -step:500/500 train_loss:2.2855 grad_norm:0.1518 
train_time:1582691ms step_avg:3165.38ms
-step:500/500 val_loss:2.3106 val_bpb:1.3684 train_time:1582722ms step_avg:3165.44ms h_norms=['31454.2', '27922.0', '26377.7', '24565.6', '24573.1', '27173.7', '26402.9', '26559.0', '25384.0', '26739.2', '27427.3', '26814.3', '26736.1', '25274.3', '25895.2', '27331.6', '26687.2', '26396.8', '24736.9', '24660.2'] growth=['0.877', '0.888', '0.945', '0.931', '1.000', '1.106', '0.972', '1.006', '0.956', '1.053', '1.108', '0.978', '0.997', '0.945', '1.025', '1.088', '0.976', '0.989', '0.937', '0.997']
-peak memory allocated: 66505 MiB reserved: 67408 MiB
-ema:applying EMA weights
-DIAGNOSTIC post_ema val_loss:2.6566 val_bpb:1.5734 eval_time:85304ms
-Serialized model: 107256259 bytes
-Code size: 102003 bytes
-Serialized model int6+lzma: 9596524 bytes
-Total submission size int6+lzma: 9698527 bytes
-final_int6_roundtrip val_loss:2.7108 val_bpb:1.6055 eval_time:84596ms
-final_int6_roundtrip_exact val_loss:2.71075277 val_bpb:1.60546048
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_154455-y5p28i5r/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_154455-y5p28i5r/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_154455-y5p28i5r/files/requirements.txt
+++ /dev/null
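In this run EMA actively hurt (val_bpb 1.3684 at step 500 vs 1.5734 after `ema:applying EMA weights`), and the int6 roundtrip landed at 1.6055. The `h_norms`/`growth` arrays printed at each eval are the per-pass amplification diagnostic; a minimal sketch of how such numbers can be produced, with `core` and the pass count as illustrative stand-ins for the shared recurrent block:

```python
# Hedged sketch of the h_norms/growth diagnostic: record ||h|| after each
# pass through the shared core and the ratio to the previous pass. Sustained
# growth > 1 is the amplification signature the recurrence plan warns about.
import torch

@torch.no_grad()
def recurrence_diagnostics(core, h: torch.Tensor, passes: int):
    norms, growth = [], []
    prev = h.norm().item()
    for _ in range(passes):
        h = core(h)
        cur = h.norm().item()
        norms.append(round(cur, 1))
        growth.append(round(cur / max(prev, 1e-8), 3))
        prev = cur
    return norms, growth

core = torch.nn.Linear(512, 512)  # stand-in for the shared recurrent block
norms, growth = recurrence_diagnostics(core, torch.randn(8, 512), passes=20)
print(f"h_norms={norms} growth={growth}")
```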
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_163311-3jt79ap8/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_163311-3jt79ap8/files/output.log
deleted file mode 100644
index c167c75c19..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_163311-3jt79ap8/files/output.log
+++ /dev/null
@@ -1,93 +0,0 @@
-wandb:initialized
-warmup_step:1/20
-warmup_step:2/20
-warmup_step:3/20
-warmup_step:4/20
-warmup_step:5/20
-warmup_step:6/20
-warmup_step:7/20
-warmup_step:8/20
-warmup_step:9/20
-warmup_step:10/20
-warmup_step:11/20
-warmup_step:12/20
-warmup_step:13/20
-warmup_step:14/20
-warmup_step:15/20
-warmup_step:16/20
-warmup_step:17/20
-warmup_step:18/20
-warmup_step:19/20
-warmup_step:20/20
-step:0/500 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['14775.5', '14115.8', '13566.0', '13105.3', '12700.6', '12471.2', '12236.8', '12036.6', '11858.5', '11689.9', '12098.3', '11947.3', '11822.2', '11710.3', '11602.1', '11865.1', '11790.1', '11731.6', '11680.5', '11627.6'] growth=['0.948', '0.955', '0.961', '0.966', '0.969', '0.982', '0.981', '0.984', '0.985', '0.986', '0.990', '0.988', '0.990', '0.991', '0.991', '0.997', '0.994', '0.995', '0.996', '0.995']
-step:1/500 train_loss:6.9310 grad_norm:0.3717 train_time:3140ms step_avg:3139.78ms
-step:2/500 train_loss:8.3473 grad_norm:3.5331 train_time:6273ms step_avg:3136.63ms
-step:3/500 train_loss:7.5115 grad_norm:1.8520 train_time:9435ms step_avg:3145.04ms
-step:4/500 train_loss:7.5611 grad_norm:1.8997 train_time:12598ms step_avg:3149.51ms
-step:5/500 train_loss:7.3178 grad_norm:1.9097 train_time:15759ms step_avg:3151.86ms
-step:6/500 train_loss:7.0757 grad_norm:1.7063 train_time:18919ms step_avg:3153.23ms
-step:7/500 train_loss:6.9543 grad_norm:2.0712 train_time:22079ms step_avg:3154.18ms
-step:8/500 train_loss:6.9054 grad_norm:1.4262 train_time:25237ms step_avg:3154.66ms
-step:9/500 train_loss:6.5427 grad_norm:1.0088 train_time:28396ms step_avg:3155.09ms
-step:10/500 train_loss:6.1508 grad_norm:0.9852 train_time:31553ms step_avg:3155.25ms
-step:20/500 train_loss:4.7862 grad_norm:1.1515 train_time:63123ms step_avg:3156.17ms
-step:30/500 train_loss:4.1778 grad_norm:0.8901 train_time:94709ms step_avg:3156.95ms
-step:40/500 train_loss:3.8576 grad_norm:1.1652 train_time:126432ms step_avg:3160.79ms
-step:50/500 train_loss:3.6818 grad_norm:0.6740 train_time:158029ms step_avg:3160.57ms
-step:50/500 val_loss:3.6685 val_bpb:2.1727 train_time:158061ms step_avg:3161.21ms h_norms=['12863.0', '11304.5', '10260.4', '9600.6', '9205.4', '9177.2', '9156.3', '9186.9', '9244.0', '9340.1', '9128.5', '9175.6', '9256.6', '9347.6', '9467.6', '9152.0', '9233.4', '9342.2', '9452.6', '9586.7'] growth=['0.847', '0.879', '0.908', '0.936', '0.959', '0.997', '0.998', '1.003', '1.006', '1.010', '1.007', '1.005', '1.009', '1.010', '1.013', '1.012', '1.009', '1.012', '1.012', '1.014']
-step:60/500 train_loss:3.5054 grad_norm:0.6287 train_time:189648ms step_avg:3160.80ms
-step:70/500 train_loss:3.3867 grad_norm:0.3547 train_time:221269ms step_avg:3160.99ms
-step:80/500 train_loss:3.3521 grad_norm:0.6155 train_time:252895ms step_avg:3161.19ms
-step:90/500 train_loss:3.1908 grad_norm:0.5564 train_time:284515ms step_avg:3161.28ms
-step:100/500 train_loss:3.1363 grad_norm:0.5304 train_time:316098ms step_avg:3160.98ms
-step:100/500 val_loss:3.1033 val_bpb:1.8380 train_time:316130ms step_avg:3161.30ms
h_norms=['13308.9', '11916.3', '11272.1', '11158.4', '11454.6', '11658.4', '12034.7', '12510.9', '13065.2', '13811.4', '12347.7', '12779.1', '13288.3', '13857.8', '14627.5', '13120.1', '13520.6', '14007.0', '14556.3', '15322.5'] growth=['0.851', '0.895', '0.946', '0.990', '1.027', '1.018', '1.032', '1.040', '1.044', '1.057', '1.014', '1.035', '1.040', '1.043', '1.056', '1.002', '1.031', '1.036', '1.039', '1.053'] -step:110/500 train_loss:3.0377 grad_norm:0.4577 train_time:347728ms step_avg:3161.16ms -step:120/500 train_loss:2.9310 grad_norm:0.2932 train_time:379357ms step_avg:3161.31ms -step:130/500 train_loss:2.8607 grad_norm:0.3171 train_time:411020ms step_avg:3161.69ms -step:140/500 train_loss:2.8210 grad_norm:0.3873 train_time:442615ms step_avg:3161.53ms -step:150/500 train_loss:2.7654 grad_norm:0.2930 train_time:474228ms step_avg:3161.52ms -step:150/500 val_loss:2.7671 val_bpb:1.6388 train_time:474260ms step_avg:3161.73ms h_norms=['14988.2', '13795.7', '13108.2', '12902.9', '13200.7', '13774.5', '13856.8', '14021.4', '14325.3', '14996.9', '14146.5', '14360.2', '14618.0', '15004.1', '15799.7', '14631.0', '14869.2', '15146.1', '15557.6', '16412.6'] growth=['0.912', '0.920', '0.950', '0.984', '1.023', '1.043', '1.006', '1.012', '1.022', '1.047', '1.026', '1.015', '1.018', '1.026', '1.053', '0.999', '1.016', '1.019', '1.027', '1.055'] -step:160/500 train_loss:2.7691 grad_norm:0.3606 train_time:505863ms step_avg:3161.64ms -step:170/500 train_loss:2.7187 grad_norm:0.4251 train_time:537482ms step_avg:3161.66ms -step:180/500 train_loss:2.6329 grad_norm:0.3579 train_time:569087ms step_avg:3161.59ms -step:190/500 train_loss:2.6453 grad_norm:0.3163 train_time:600712ms step_avg:3161.64ms -step:200/500 train_loss:2.5662 grad_norm:0.2988 train_time:632432ms step_avg:3162.16ms -step:200/500 val_loss:2.6122 val_bpb:1.5471 train_time:632464ms step_avg:3162.32ms h_norms=['16776.0', '15547.2', '14750.1', '14438.0', '14724.7', '15842.7', '15653.2', '15553.5', '15587.1', '16103.0', '16018.9', '15971.6', '15973.8', '16090.8', '16737.7', '16299.2', '16286.8', '16309.8', '16448.2', '17131.7'] growth=['0.920', '0.927', '0.949', '0.979', '1.020', '1.076', '0.988', '0.994', '1.002', '1.033', '1.057', '0.997', '1.000', '1.007', '1.040', '1.024', '0.999', '1.001', '1.008', '1.042'] -step:210/500 train_loss:2.5532 grad_norm:0.2664 train_time:664045ms step_avg:3162.12ms -step:220/500 train_loss:2.6024 grad_norm:0.2987 train_time:695666ms step_avg:3162.12ms -step:230/500 train_loss:2.5356 grad_norm:0.2965 train_time:727281ms step_avg:3162.09ms -step:240/500 train_loss:2.5368 grad_norm:0.2684 train_time:758901ms step_avg:3162.09ms -step:250/500 train_loss:2.5806 grad_norm:0.3130 train_time:790509ms step_avg:3162.03ms -step:250/500 val_loss:2.5228 val_bpb:1.4941 train_time:790540ms step_avg:3162.16ms h_norms=['18876.3', '17455.2', '16574.6', '16150.7', '16451.4', '17948.8', '17544.9', '17340.5', '17260.6', '17849.1', '18128.4', '17863.6', '17701.0', '17631.6', '18226.5', '18294.9', '18057.8', '17883.5', '17791.2', '18311.4'] growth=['0.919', '0.925', '0.950', '0.974', '1.019', '1.091', '0.977', '0.988', '0.995', '1.034', '1.075', '0.985', '0.991', '0.996', '1.034', '1.041', '0.987', '0.990', '0.995', '1.029'] -step:260/500 train_loss:2.5372 grad_norm:0.2644 train_time:822155ms step_avg:3162.14ms -step:270/500 train_loss:2.4938 grad_norm:0.2812 train_time:853783ms step_avg:3162.16ms -step:280/500 train_loss:2.4306 grad_norm:0.2003 train_time:885408ms step_avg:3162.17ms -step:290/500 train_loss:2.4765 grad_norm:0.3092 
train_time:917016ms step_avg:3162.12ms -step:300/500 train_loss:2.4326 grad_norm:0.1681 train_time:948617ms step_avg:3162.06ms -step:300/500 val_loss:2.4489 val_bpb:1.4504 train_time:948649ms step_avg:3162.16ms h_norms=['20956.8', '19310.5', '18350.0', '17752.1', '17947.3', '19637.2', '19181.5', '18935.7', '18738.5', '19319.1', '19714.4', '19430.9', '19185.8', '18962.7', '19445.1', '19787.3', '19502.4', '19205.8', '18928.7', '19244.6'] growth=['0.908', '0.921', '0.950', '0.967', '1.011', '1.094', '0.977', '0.987', '0.990', '1.031', '1.080', '0.986', '0.987', '0.988', '1.025', '1.049', '0.986', '0.985', '0.986', '1.017'] -step:310/500 train_loss:2.3587 grad_norm:0.2276 train_time:980220ms step_avg:3162.00ms -step:320/500 train_loss:2.4347 grad_norm:0.2859 train_time:1011827ms step_avg:3161.96ms -step:330/500 train_loss:2.4707 grad_norm:0.2418 train_time:1043434ms step_avg:3161.92ms -step:340/500 train_loss:2.3754 grad_norm:0.1686 train_time:1075069ms step_avg:3161.97ms -step:350/500 train_loss:2.4888 grad_norm:0.2348 train_time:1106822ms step_avg:3162.35ms -step:350/500 val_loss:2.4056 val_bpb:1.4248 train_time:1106854ms step_avg:3162.44ms h_norms=['23133.0', '21145.9', '20229.4', '19385.2', '19483.8', '21466.3', '20951.7', '20846.7', '20425.8', '21015.3', '21540.5', '21179.0', '20979.7', '20461.2', '20830.5', '21487.0', '21095.2', '20800.2', '20191.7', '20302.1'] growth=['0.900', '0.914', '0.957', '0.958', '1.005', '1.102', '0.976', '0.995', '0.980', '1.029', '1.088', '0.983', '0.991', '0.975', '1.018', '1.059', '0.982', '0.986', '0.971', '1.005'] -step:360/500 train_loss:2.2591 grad_norm:0.2411 train_time:1138444ms step_avg:3162.34ms -step:370/500 train_loss:2.4545 grad_norm:0.1909 train_time:1170061ms step_avg:3162.33ms -step:380/500 train_loss:2.3976 grad_norm:0.2061 train_time:1201677ms step_avg:3162.31ms -step:390/500 train_loss:2.3586 grad_norm:0.1674 train_time:1233312ms step_avg:3162.34ms -step:400/500 train_loss:2.4052 grad_norm:0.1943 train_time:1264905ms step_avg:3162.26ms -step:400/500 val_loss:2.3698 val_bpb:1.4036 train_time:1264937ms step_avg:3162.34ms h_norms=['25393.6', '23052.1', '22258.0', '21100.4', '21137.0', '23185.3', '22585.4', '22799.7', '22117.3', '22854.6', '23335.3', '22912.0', '22955.3', '22121.1', '22550.5', '23327.2', '22852.6', '22741.1', '21795.4', '21892.2'] growth=['0.894', '0.908', '0.966', '0.948', '1.002', '1.097', '0.974', '1.009', '0.970', '1.033', '1.088', '0.982', '1.002', '0.964', '1.019', '1.063', '0.980', '0.995', '0.958', '1.004'] -step:410/500 train_loss:2.3671 grad_norm:0.1937 train_time:1296526ms step_avg:3162.26ms -step:420/500 train_loss:2.4047 grad_norm:0.1777 train_time:1328148ms step_avg:3162.26ms -step:430/500 train_loss:2.3188 grad_norm:0.1700 train_time:1359758ms step_avg:3162.23ms -step:440/500 train_loss:2.4078 grad_norm:0.1629 train_time:1391376ms step_avg:3162.22ms -step:450/500 train_loss:2.2324 grad_norm:0.1648 train_time:1422989ms step_avg:3162.20ms -step:450/500 val_loss:2.3380 val_bpb:1.3847 train_time:1423021ms step_avg:3162.27ms h_norms=['28599.1', '25678.3', '24727.3', '23273.6', '23246.6', '25483.9', '24735.6', '24912.8', '23956.1', '24918.5', '25533.7', '24997.5', '24992.8', '23857.5', '24365.7', '25454.0', '24832.7', '24642.9', '23373.0', '23408.4'] growth=['0.878', '0.898', '0.963', '0.941', '0.999', '1.096', '0.971', '1.007', '0.962', '1.040', '1.094', '0.979', '1.000', '0.955', '1.021', '1.072', '0.976', '0.992', '0.948', '1.002'] -step:460/500 train_loss:2.3718 grad_norm:0.1999 train_time:1454626ms 
step_avg:3162.23ms
-step:470/500 train_loss:2.3160 grad_norm:0.1574 train_time:1486238ms step_avg:3162.21ms
-step:480/500 train_loss:2.2435 grad_norm:0.1638 train_time:1517858ms step_avg:3162.20ms
-step:490/500 train_loss:2.2685 grad_norm:0.1750 train_time:1549486ms step_avg:3162.22ms
-step:500/500 train_loss:2.2886 grad_norm:0.1629 train_time:1581107ms step_avg:3162.21ms
-step:500/500 val_loss:2.3129 val_bpb:1.3698 train_time:1581139ms step_avg:3162.28ms h_norms=['31084.9', '27716.0', '27061.8', '25055.7', '24710.5', '27086.1', '26368.3', '27059.3', '25774.6', '26889.8', '27399.1', '26809.2', '27111.8', '25485.0', '25899.8', '27258.6', '26544.8', '26595.7', '24796.2', '24675.7'] growth=['0.870', '0.892', '0.976', '0.926', '0.986', '1.096', '0.973', '1.026', '0.953', '1.043', '1.098', '0.978', '1.011', '0.940', '1.016', '1.079', '0.974', '1.002', '0.932', '0.995']
-peak memory allocated: 66518 MiB reserved: 67422 MiB
-ema:applying EMA weights
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_163311-3jt79ap8/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_163311-3jt79ap8/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_163311-3jt79ap8/files/requirements.txt
+++ /dev/null
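For reference, the `ema:applying EMA weights` step above swaps a shadow parameter copy in before the final eval; the post-EMA regression in the run-154455 log earlier is the kind of mismatch the `DIAGNOSTIC post_ema` line is meant to catch. A minimal sketch, with the decay value illustrative rather than taken from these scripts:

```python
# Hedged sketch of EMA maintenance and application; decay is illustrative.
import torch

class EMA:
    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        self.shadow = {k: v.detach().clone()
                       for k, v in model.state_dict().items()
                       if v.dtype.is_floating_point}

    @torch.no_grad()
    def update(self, model: torch.nn.Module) -> None:
        # shadow <- decay * shadow + (1 - decay) * current weights
        for k, v in model.state_dict().items():
            if k in self.shadow:
                self.shadow[k].lerp_(v, 1.0 - self.decay)

    def apply_to(self, model: torch.nn.Module) -> None:
        # the "ema:applying EMA weights" moment: load the shadow copy for eval
        model.load_state_dict(self.shadow, strict=False)
```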
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_171548-0nylbkj0/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_171548-0nylbkj0/files/output.log
deleted file mode 100644
index 57d9b7b561..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_171548-0nylbkj0/files/output.log
+++ /dev/null
@@ -1,35 +0,0 @@
-wandb:initialized
-warmup_step:1/20
-warmup_step:2/20
-warmup_step:3/20
-warmup_step:4/20
-warmup_step:5/20
-warmup_step:6/20
-warmup_step:7/20
-warmup_step:8/20
-warmup_step:9/20
-warmup_step:10/20
-warmup_step:11/20
-warmup_step:12/20
-warmup_step:13/20
-warmup_step:14/20
-warmup_step:15/20
-warmup_step:16/20
-warmup_step:17/20
-warmup_step:18/20
-warmup_step:19/20
-warmup_step:20/20
-step:0/500 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.02ms h_norms=['14832.4', '14167.9', '13609.8', '13150.6', '12748.8', '12545.2', '12303.4', '12097.1', '11923.4', '11759.5', '12183.2', '12024.0', '11892.6', '11786.0', '11682.4', '11960.2', '11876.8', '11812.8', '11767.3', '11718.7'] growth=['0.949', '0.955', '0.961', '0.966', '0.969', '0.984', '0.981', '0.983', '0.986', '0.986', '0.992', '0.987', '0.989', '0.991', '0.991', '1.000', '0.993', '0.995', '0.996', '0.996']
-step:1/500 train_loss:6.9310 grad_norm:0.3717 train_time:3140ms step_avg:3140.38ms
-step:2/500 train_loss:8.3473 grad_norm:3.5331 train_time:6270ms step_avg:3134.76ms
-step:3/500 train_loss:7.5115 grad_norm:1.8521 train_time:9429ms step_avg:3142.91ms
-step:4/500 train_loss:7.5611 grad_norm:1.8995 train_time:12585ms step_avg:3146.24ms
-step:5/500 train_loss:7.3178 grad_norm:1.9091 train_time:15741ms step_avg:3148.20ms
-step:6/500 train_loss:7.0749 grad_norm:1.6999 train_time:18894ms step_avg:3149.03ms
-step:7/500 train_loss:6.9522 grad_norm:2.0634 train_time:22046ms step_avg:3149.47ms
-step:8/500 train_loss:6.9027 grad_norm:1.4293 train_time:25201ms step_avg:3150.09ms
-step:9/500 train_loss:6.5406 grad_norm:1.0086 train_time:28361ms step_avg:3151.26ms
-step:10/500 train_loss:6.1498 grad_norm:0.9873 train_time:31520ms step_avg:3152.04ms
-step:20/500 train_loss:4.7838 grad_norm:1.1062 train_time:63114ms step_avg:3155.70ms
-step:30/500 train_loss:4.1864 grad_norm:0.9834 train_time:94733ms step_avg:3157.75ms
-step:40/500 train_loss:3.8592 grad_norm:1.1661 train_time:126430ms step_avg:3160.76ms
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_171548-0nylbkj0/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_171548-0nylbkj0/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_171548-0nylbkj0/files/requirements.txt
+++ /dev/null
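The step-0 `growth` values in these later runs sit near 1.0, consistent with a roughly norm-preserving core at init, which is what the $\lVert J \rVert^k$ framing asks for. A rough, hedged way to check that directly is power iteration with `jvp` on one core pass. `core` here is a stand-in, and this estimates the dominant eigenvalue magnitude of a square Jacobian rather than the full operator norm:

```python
# Rough per-pass gain estimate for the core's local Jacobian via power
# iteration with jacobian-vector products. Illustrative, not from the record.
import torch
from torch.autograd.functional import jvp

def jacobian_gain(core, h: torch.Tensor, iters: int = 10) -> float:
    v = torch.randn_like(h)
    v = v / v.norm()
    gain = 0.0
    for _ in range(iters):
        _, jv = jvp(core, (h,), (v,))
        gain = jv.norm().item()
        v = jv / jv.norm().clamp(min=1e-12)
    return gain  # gain > 1 implies perturbations amplify across passes

core = torch.nn.Linear(512, 512)  # stand-in for one shared-core pass
print(jacobian_gain(core, torch.randn(512)))
```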
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_172142-fes18y77/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_172142-fes18y77/files/output.log
deleted file mode 100644
index 2a4260a40a..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_172142-fes18y77/files/output.log
+++ /dev/null
@@ -1,42 +0,0 @@
-wandb:initialized
-warmup_step:1/20
-warmup_step:2/20
-warmup_step:3/20
-warmup_step:4/20
-warmup_step:5/20
-warmup_step:6/20
-warmup_step:7/20
-warmup_step:8/20
-warmup_step:9/20
-warmup_step:10/20
-warmup_step:11/20
-warmup_step:12/20
-warmup_step:13/20
-warmup_step:14/20
-warmup_step:15/20
-warmup_step:16/20
-warmup_step:17/20
-warmup_step:18/20
-warmup_step:19/20
-warmup_step:20/20
-step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.01ms h_norms=['14111.1', '13526.4', '13079.9', '12686.8', '12345.1', '12169.5', '12016.5', '11895.0', '11762.2', '11626.3', '11910.0', '11834.3', '11775.0', '11694.1', '11606.3', '11768.3', '11743.2', '11726.3', '11686.2', '11633.5'] growth=['0.946', '0.959', '0.967', '0.970', '0.973', '0.986', '0.987', '0.990', '0.989', '0.988', '0.993', '0.994', '0.995', '0.993', '0.992', '0.999', '0.998', '0.999', '0.997', '0.995']
-step:1/500 train_loss:6.9303 grad_norm:0.3807 train_time:3132ms step_avg:3132.23ms
-step:2/500 train_loss:8.2518 grad_norm:3.3887 train_time:6252ms step_avg:3125.87ms
-step:3/500 train_loss:7.4857 grad_norm:1.6765 train_time:9403ms step_avg:3134.45ms
-step:4/500 train_loss:7.7545 grad_norm:2.0593 train_time:12554ms step_avg:3138.50ms
-step:5/500 train_loss:7.4669 grad_norm:2.0969 train_time:15705ms step_avg:3141.04ms
-step:6/500 train_loss:7.1116 grad_norm:1.8286 train_time:18855ms step_avg:3142.45ms
-step:7/500 train_loss:6.8770 grad_norm:2.4120 train_time:22007ms step_avg:3143.88ms
-step:8/500 train_loss:6.7917 grad_norm:1.6092 train_time:25159ms step_avg:3144.88ms
-step:9/500 train_loss:6.5438 grad_norm:1.2773 train_time:28308ms
step_avg:3145.33ms
-step:10/500 train_loss:6.1776 grad_norm:1.1355 train_time:31457ms step_avg:3145.72ms
-step:20/500 train_loss:4.8363 grad_norm:1.5152 train_time:62951ms step_avg:3147.57ms
-step:30/500 train_loss:4.1657 grad_norm:0.7915 train_time:94460ms step_avg:3148.68ms
-step:40/500 train_loss:3.8635 grad_norm:0.8288 train_time:126098ms step_avg:3152.46ms
-step:50/500 train_loss:3.6893 grad_norm:1.0199 train_time:157622ms step_avg:3152.43ms
-step:50/500 val_loss:3.6358 val_bpb:2.1533 train_time:157653ms step_avg:3153.06ms h_norms=['12813.5', '11251.6', '10224.6', '9522.0', '9100.5', '9110.7', '9077.5', '9115.4', '9128.8', '9187.3', '9069.4', '9101.9', '9187.7', '9232.6', '9313.7', '9094.9', '9161.1', '9273.7', '9332.8', '9426.5'] growth=['0.852', '0.878', '0.909', '0.931', '0.956', '1.001', '0.996', '1.004', '1.001', '1.006', '1.013', '1.004', '1.009', '1.005', '1.009', '1.020', '1.007', '1.012', '1.006', '1.010']
-step:60/500 train_loss:3.5102 grad_norm:0.6809 train_time:189149ms step_avg:3152.49ms
-step:70/500 train_loss:3.3973 grad_norm:0.6864 train_time:220677ms step_avg:3152.52ms
-step:80/500 train_loss:3.3157 grad_norm:0.5593 train_time:252203ms step_avg:3152.54ms
-step:90/500 train_loss:3.1562 grad_norm:0.4835 train_time:283735ms step_avg:3152.61ms
-step:100/500 train_loss:3.1161 grad_norm:0.5429 train_time:315255ms step_avg:3152.55ms
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_172142-fes18y77/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_172142-fes18y77/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_172142-fes18y77/files/requirements.txt
+++ /dev/null
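Since several of these runs differ mainly in their feedback/QAT settings, comparing their `val_bpb` trajectories is the quickest read. A small hedged helper for pulling curves out of output.log dumps in the `key:value` format shown above (the helper name is illustrative):

```python
# Extract (step, val_bpb) pairs from a wandb output.log dump in the format
# used by these runs, e.g. "step:50/500 val_loss:3.6358 val_bpb:2.1533".
import re

def val_bpb_curve(log_text: str) -> list[tuple[int, float]]:
    pat = re.compile(r"step:(\d+)/\d+ val_loss:[\d.]+ val_bpb:([\d.]+)")
    return [(int(step), float(bpb)) for step, bpb in pat.findall(log_text)]
```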
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_173159-f48k3ztp/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_173159-f48k3ztp/files/output.log
deleted file mode 100644
index 95bd7f1c09..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_173159-f48k3ztp/files/output.log
+++ /dev/null
@@ -1,42 +0,0 @@
-wandb:initialized
-warmup_step:1/20
-warmup_step:2/20
-warmup_step:3/20
-warmup_step:4/20
-warmup_step:5/20
-warmup_step:6/20
-warmup_step:7/20
-warmup_step:8/20
-warmup_step:9/20
-warmup_step:10/20
-warmup_step:11/20
-warmup_step:12/20
-warmup_step:13/20
-warmup_step:14/20
-warmup_step:15/20
-warmup_step:16/20
-warmup_step:17/20
-warmup_step:18/20
-warmup_step:19/20
-warmup_step:20/20
-step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.02ms h_norms=['14329.1', '13723.6', '13269.1', '12851.2', '12501.0', '12349.8', '12190.5', '12066.6', '11911.7', '11770.1', '12059.5', '11984.8', '11930.4', '11833.9', '11738.3', '11921.0', '11898.5', '11878.8', '11820.1', '11770.2'] growth=['0.946', '0.958', '0.967', '0.969', '0.973', '0.988', '0.987', '0.990', '0.987', '0.988', '0.994', '0.994', '0.995', '0.992', '0.992', '1.001', '0.998', '0.998', '0.995', '0.996']
-step:1/500 train_loss:6.9303 grad_norm:0.3807 train_time:3140ms step_avg:3139.82ms
-step:2/500 train_loss:8.2452 grad_norm:3.3679 train_time:6268ms step_avg:3133.81ms
-step:3/500 train_loss:7.4836 grad_norm:1.6553 train_time:9427ms step_avg:3142.31ms
-step:4/500 train_loss:7.7829 grad_norm:2.0735 train_time:12585ms step_avg:3146.17ms
-step:5/500 train_loss:7.5027 grad_norm:2.1035 train_time:15742ms step_avg:3148.49ms
-step:6/500 train_loss:7.1378 grad_norm:1.8533 train_time:18903ms step_avg:3150.44ms
-step:7/500 train_loss:6.8951 grad_norm:2.4164 train_time:22063ms step_avg:3151.85ms
-step:8/500 train_loss:6.7800 grad_norm:1.6064 train_time:25224ms step_avg:3153.04ms
-step:9/500 train_loss:6.5383 grad_norm:1.2870 train_time:28387ms step_avg:3154.06ms
-step:10/500 train_loss:6.1716 grad_norm:1.1190 train_time:31546ms step_avg:3154.63ms
-step:20/500 train_loss:4.8024 grad_norm:1.3364 train_time:63138ms step_avg:3156.91ms
-step:30/500 train_loss:4.1822 grad_norm:2.0484 train_time:94744ms step_avg:3158.14ms
-step:40/500 train_loss:3.8452 grad_norm:0.8175 train_time:126488ms step_avg:3162.19ms
-step:50/500 train_loss:3.6888 grad_norm:0.9309 train_time:158100ms step_avg:3162.00ms
-step:50/500 val_loss:3.6447 val_bpb:2.1586 train_time:158132ms step_avg:3162.64ms h_norms=['12971.7', '11306.1', '10212.0', '9466.3', '9016.7', '8957.7', '8905.6', '8930.6', '8942.4', '9004.0', '8900.7', '8919.4', '8995.0', '9040.1', '9125.4', '8916.2', '8974.5', '9075.4', '9139.3', '9237.6'] growth=['0.842', '0.872', '0.903', '0.927', '0.953', '0.993', '0.994', '1.003', '1.001', '1.007', '1.007', '1.002', '1.008', '1.005', '1.009', '1.015', '1.007', '1.011', '1.007', '1.011']
-step:60/500 train_loss:3.5046 grad_norm:0.9289 train_time:189719ms step_avg:3161.98ms
-step:70/500 train_loss:3.3859 grad_norm:0.5209 train_time:221302ms step_avg:3161.45ms
-step:80/500 train_loss:3.3231 grad_norm:0.4777 train_time:252892ms step_avg:3161.15ms
-step:90/500 train_loss:3.1589 grad_norm:0.4236 train_time:284505ms step_avg:3161.17ms
-step:100/500 train_loss:3.1087 grad_norm:0.3903 train_time:316117ms step_avg:3161.17ms
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174230-5r5mblr1/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174230-5r5mblr1/files/output.log
deleted file mode 100644
index 6b4a192299..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174230-5r5mblr1/files/output.log
+++ /dev/null
@@ -1,32 +0,0 @@
-wandb:initialized
-warmup_step:1/20
-warmup_step:2/20
-warmup_step:3/20
-warmup_step:4/20
-warmup_step:5/20
-warmup_step:6/20
-warmup_step:7/20
-warmup_step:8/20
-warmup_step:9/20
-warmup_step:10/20
-warmup_step:11/20
-warmup_step:12/20
-warmup_step:13/20
-warmup_step:14/20
-warmup_step:15/20
-warmup_step:16/20
-warmup_step:17/20
-warmup_step:18/20
-warmup_step:19/20
-warmup_step:20/20
-step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.02ms h_norms=['71486.6', '109852.4', '112898.0', '103194.9', '101669.1', '103905.5', '128262.2', '125516.1', '152206.8', '148436.4', '117193.6', '119566.9', '200935.0', '175293.6', '156000.0', '116143.9', '120580.6', '121930.9', '128013.4', '136703.7'] growth=['1.595', '1.537', '1.028', '0.914', '0.985', '1.022', '1.234', '0.979', '1.213', '0.975', '1.121', '1.020', '1.681', '0.872', '0.890', '1.062', '1.038', '1.011', '1.050', '1.068']
-step:1/500 train_loss:6.9303 grad_norm:0.3807 train_time:3141ms step_avg:3140.54ms
-step:2/500 train_loss:8.1781 grad_norm:3.1870 train_time:6272ms step_avg:3136.02ms
-step:3/500 train_loss:10.9302 grad_norm:11.9253 train_time:9425ms step_avg:3141.53ms
-step:4/500 train_loss:7.6575 grad_norm:2.1410 train_time:12578ms step_avg:3144.58ms
-step:5/500 train_loss:7.0804 grad_norm:1.6717 train_time:15735ms step_avg:3146.99ms
-step:6/500 train_loss:6.9353 grad_norm:1.4697 train_time:18890ms step_avg:3148.35ms
-step:7/500 train_loss:6.8485 grad_norm:2.6919 train_time:22046ms step_avg:3149.40ms
-step:8/500 train_loss:6.7499 grad_norm:1.6549 train_time:25202ms step_avg:3150.24ms
-step:9/500 train_loss:6.8418 grad_norm:1.7199 train_time:28357ms step_avg:3150.83ms
-step:10/500 train_loss:6.5955 grad_norm:3.6129 train_time:31518ms step_avg:3151.83ms
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174616-zd1j5kg5/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174616-zd1j5kg5/files/output.log
deleted file mode 100644
index 359bec767f..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174616-zd1j5kg5/files/output.log
+++ /dev/null
@@ -1,23 +0,0 @@
-wandb:initialized
-warmup_step:1/20
-warmup_step:2/20
-warmup_step:3/20
-warmup_step:4/20
-warmup_step:5/20
-warmup_step:6/20
-warmup_step:7/20
-warmup_step:8/20
-warmup_step:9/20
-warmup_step:10/20
-warmup_step:11/20
-warmup_step:12/20
-warmup_step:13/20
-warmup_step:14/20
-warmup_step:15/20
-warmup_step:16/20
-warmup_step:17/20
-warmup_step:18/20
-warmup_step:19/20
-warmup_step:20/20
-step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.01ms h_norms=['49006.3', '70788.1', '67619.4', '66895.7', '80691.9', '85686.5', '91540.4', '100243.9', '98369.8', '96924.4', '115874.1', '105822.4', '350994.9', '231898.9', '189787.3', '83816.5', '104820.2', '91576.3', '75437.8', '74316.6'] growth=['1.006', '1.444', '0.955', '0.989', '1.206', '1.062', '1.068', '1.095', '0.981', '0.985', '1.418', '0.913', '3.317', '0.661', '0.818', '0.916', '1.251', '0.874', '0.824', '0.985']
-step:1/500 train_loss:6.9303 grad_norm:0.3807 train_time:3140ms step_avg:3139.77ms
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174902-ponrb7vw/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174902-ponrb7vw/files/output.log
deleted file mode 100644
index 9d6fc84aca..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_174902-ponrb7vw/files/output.log
+++ /dev/null
@@ -1,42 +0,0 @@
-wandb:initialized
-warmup_step:1/20
-warmup_step:2/20
-warmup_step:3/20
-warmup_step:4/20
-warmup_step:5/20
-warmup_step:6/20
-warmup_step:7/20
-warmup_step:8/20
-warmup_step:9/20
-warmup_step:10/20
-warmup_step:11/20
-warmup_step:12/20
-warmup_step:13/20
-warmup_step:14/20
-warmup_step:15/20
-warmup_step:16/20
-warmup_step:17/20
-warmup_step:18/20
-warmup_step:19/20
-warmup_step:20/20
-step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.02ms h_norms=['101044.9', '100874.6', '129352.4', '167648.2', '181140.5', '180204.0', '175799.0', '197658.3', '355913.8', '317039.1', '212913.5', '192786.2', '329348.0', '289972.6', '272664.0', '228358.1', '225461.7', '218889.9', '214625.2', '213371.5'] growth=['2.311', '0.998', '1.282', '1.296', '1.080', '0.995', '0.976', '1.124', '1.801', '0.891', '1.072', '0.905', '1.708', '0.880', '0.940', '1.089', '0.987', '0.971', '0.981', '0.994']
-step:1/500 train_loss:6.9303 grad_norm:0.3807 train_time:3142ms step_avg:3142.50ms
-step:2/500 train_loss:8.1781 grad_norm:3.1870 train_time:6271ms step_avg:3135.66ms
-step:3/500 train_loss:10.9302 grad_norm:11.9253 train_time:9428ms step_avg:3142.66ms
-step:4/500 train_loss:7.6576 grad_norm:2.1410 train_time:12584ms step_avg:3146.09ms
-step:5/500 train_loss:7.0801 grad_norm:1.6701 train_time:15745ms step_avg:3148.90ms
-step:6/500 train_loss:6.9355 grad_norm:1.4723 train_time:18906ms step_avg:3151.05ms
-step:7/500 train_loss:6.8501 grad_norm:2.6989 train_time:22064ms step_avg:3152.00ms
-step:8/500 train_loss:6.7493 grad_norm:1.6422 train_time:25222ms step_avg:3152.79ms
-step:9/500 train_loss:6.8287 grad_norm:1.6899 train_time:28380ms step_avg:3153.31ms
-step:10/500 train_loss:6.6041 grad_norm:3.7299 train_time:31539ms step_avg:3153.89ms
-step:20/500 train_loss:6.9737 grad_norm:6.8887 train_time:63117ms step_avg:3155.83ms
-step:30/500 train_loss:4.9900 grad_norm:1.1052 train_time:94692ms step_avg:3156.39ms
-step:40/500 train_loss:4.3631 grad_norm:0.4267 train_time:126432ms step_avg:3160.80ms
-step:50/500 train_loss:4.0261 grad_norm:0.4910 train_time:158038ms step_avg:3160.76ms
-step:50/500 val_loss:3.9805 val_bpb:2.3575 train_time:158070ms step_avg:3161.40ms h_norms=['90033.2', '84674.3', '83484.3', '85619.1', '82273.7', '100121.1', '118020.5', '132644.2', '141924.9', '144315.5', '114997.4', '136485.1', '157285.0', '167945.2', '171560.8', '128904.4', '152052.9', '175277.8', '187742.6', '192007.2'] growth=['0.938', '0.940', '0.986', '1.026', '0.961', '1.217', '1.179', '1.124', '1.070', '1.017', '1.236', '1.187', '1.152', '1.068', '1.022', '1.242', '1.180', '1.153', '1.071', '1.023']
-step:60/500 train_loss:3.7728 grad_norm:0.4561 train_time:189599ms step_avg:3159.98ms
-step:70/500 train_loss:3.6609 grad_norm:0.5540 train_time:221200ms step_avg:3159.99ms
-step:80/500 train_loss:3.5918 grad_norm:0.4286 train_time:252796ms step_avg:3159.95ms
-step:90/500 train_loss:3.4451 grad_norm:0.3377 train_time:284393ms step_avg:3159.92ms
-step:100/500 train_loss:3.3839 grad_norm:0.4333 train_time:315997ms step_avg:3159.97ms
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_175915-2o340uez/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_175915-2o340uez/files/output.log
deleted file mode 100644
index 3d99758f31..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_175915-2o340uez/files/output.log
+++ /dev/null
@@ -1,36 +0,0 @@
-wandb:initialized
-warmup_step:1/20
-warmup_step:2/20
-warmup_step:3/20
-warmup_step:4/20
-warmup_step:5/20
-warmup_step:6/20
-warmup_step:7/20
-warmup_step:8/20
-warmup_step:9/20
-warmup_step:10/20
-warmup_step:11/20
-warmup_step:12/20
-warmup_step:13/20
-warmup_step:14/20
-warmup_step:15/20
-warmup_step:16/20
-warmup_step:17/20
-warmup_step:18/20
-warmup_step:19/20
-warmup_step:20/20
-step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.01ms
-step:1/500 train_loss:6.9303 grad_norm:0.3807 train_time:3138ms step_avg:3138.10ms
-step:2/500 train_loss:6.6659 grad_norm:0.8083 train_time:6266ms step_avg:3133.06ms
-step:3/500 train_loss:6.0874 grad_norm:0.8065 train_time:9426ms step_avg:3141.98ms
-step:4/500 train_loss:5.8500 grad_norm:0.4762 train_time:12583ms step_avg:3145.81ms
-step:5/500 train_loss:5.8868 grad_norm:1.3528 train_time:15741ms step_avg:3148.16ms
-step:6/500 train_loss:5.8887 grad_norm:1.1622 train_time:18899ms step_avg:3149.82ms
-step:7/500 train_loss:5.8843 grad_norm:1.2053 train_time:22057ms step_avg:3150.96ms
-step:8/500 train_loss:5.8714 grad_norm:1.3344 train_time:25217ms step_avg:3152.11ms
-step:9/500 train_loss:5.8215 grad_norm:1.0506 train_time:28378ms step_avg:3153.13ms
-step:10/500 train_loss:5.6859 grad_norm:0.9420 train_time:31538ms step_avg:3153.85ms
-step:20/500 train_loss:5.5241 grad_norm:1.6334 train_time:63125ms step_avg:3156.24ms
-step:30/500 train_loss:4.5729 grad_norm:1.5870 train_time:94746ms step_avg:3158.20ms
-step:40/500 train_loss:4.2207 grad_norm:2.9475 train_time:126497ms step_avg:3162.43ms
-step:50/500 train_loss:3.8353 grad_norm:0.7305 train_time:158105ms step_avg:3162.11ms
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_180447-w7yechln/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_180447-w7yechln/files/output.log
deleted file mode 100644
index 995bbc6d62..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_180447-w7yechln/files/output.log
+++ /dev/null
@@ -1,48 +0,0 @@
-wandb:initialized
-warmup_step:1/20
-warmup_step:2/20
-warmup_step:3/20
-warmup_step:4/20
-warmup_step:5/20
-warmup_step:6/20
-warmup_step:7/20
-warmup_step:8/20
-warmup_step:9/20
-warmup_step:10/20
-warmup_step:11/20
-warmup_step:12/20
-warmup_step:13/20
-warmup_step:14/20
-warmup_step:15/20
-warmup_step:16/20
-warmup_step:17/20
-warmup_step:18/20
-warmup_step:19/20
-warmup_step:20/20
-step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.02ms
-step:1/500 train_loss:6.9303 grad_norm:0.3807 train_time:3131ms step_avg:3130.65ms
-step:2/500 train_loss:6.0630 grad_norm:0.8852 train_time:6251ms step_avg:3125.47ms
-step:3/500 train_loss:6.2767 grad_norm:1.1566 train_time:9401ms step_avg:3133.57ms
-step:4/500 train_loss:6.8246 grad_norm:2.3769 train_time:12551ms step_avg:3137.76ms
-step:5/500 train_loss:7.1034 grad_norm:2.4893 train_time:15703ms step_avg:3140.51ms
-step:6/500 train_loss:6.7987 grad_norm:1.9131 train_time:18852ms step_avg:3142.07ms
-step:7/500 train_loss:6.8808 grad_norm:2.6979 train_time:22000ms step_avg:3142.92ms
-step:8/500 train_loss:6.8119 grad_norm:2.5149 train_time:25150ms step_avg:3143.70ms
-step:9/500 train_loss:6.6692 grad_norm:1.6300 train_time:28301ms step_avg:3144.53ms
-step:10/500 train_loss:6.4778 grad_norm:2.4469 train_time:31451ms step_avg:3145.07ms
-step:20/500 train_loss:5.0602 grad_norm:1.3291 train_time:62943ms step_avg:3147.14ms
-step:30/500 train_loss:4.8216 grad_norm:2.0738 train_time:94481ms step_avg:3149.37ms
-step:40/500 train_loss:4.1574 grad_norm:1.0588 train_time:126107ms step_avg:3152.67ms
-step:50/500 train_loss:3.8998 grad_norm:0.9799 train_time:157602ms step_avg:3152.05ms
-step:50/500 val_loss:3.8491 val_bpb:2.2797 train_time:157634ms step_avg:3152.68ms h_norms=['62087.4', '62948.8', '63328.1', '64415.2', '65555.9', '88174.0', '99297.2', '117964.5', '117972.9', '117057.8', '94021.8', '103414.3', '113567.6', '119372.5', '124603.3', '83857.0', '106503.6', '121953.5', '119428.6', '122676.6'] growth=['0.982', '1.014', '1.006', '1.017', '1.018', '1.345', '1.126', '1.188', '1.000', '0.992', '1.392', '1.100', '1.098', '1.051', '1.044', '1.165', '1.270', '1.145', '0.979', '1.027']
-step:60/500 train_loss:3.6849 grad_norm:0.5692 train_time:189133ms step_avg:3152.22ms
-step:70/500 train_loss:3.5871 grad_norm:0.6241 train_time:220645ms step_avg:3152.08ms
-step:80/500 train_loss:3.5228 grad_norm:0.4259 train_time:252153ms step_avg:3151.92ms
-step:90/500 train_loss:3.3905 grad_norm:0.5294 train_time:283679ms step_avg:3151.99ms
-step:100/500 train_loss:3.3403 grad_norm:0.4869 train_time:315174ms step_avg:3151.74ms
-step:100/500 val_loss:3.3095 val_bpb:1.9601 train_time:315205ms step_avg:3152.05ms h_norms=['57455.8', '58322.8', '59194.5', '60695.2', '61908.6', '74000.3', '96013.6', '104985.8', '108845.2', '110765.5', '83814.2', '95809.5', '103009.6', '108539.2', '113703.6', '75220.7', '86779.0', '96119.3', '103205.4', '107897.3'] growth=['0.983', '1.015', '1.015', '1.025', '1.020', '1.195', '1.297', '1.093', '1.037', '1.018', '1.346', '1.143', '1.075', '1.054', '1.048', '1.182', '1.154', '1.108', '1.074', '1.045']
-step:110/500 train_loss:3.2594 grad_norm:0.5430 train_time:346619ms step_avg:3151.08ms
-step:120/500 train_loss:3.1627 grad_norm:0.3443 train_time:378074ms step_avg:3150.62ms
-step:130/500 train_loss:3.0993 grad_norm:0.4013 train_time:409594ms step_avg:3150.72ms
-step:140/500 train_loss:3.0509 grad_norm:0.4336 train_time:441106ms step_avg:3150.76ms
-step:150/500 train_loss:2.9906 grad_norm:0.4297 train_time:472653ms step_avg:3151.02ms
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_181843-pwdgzvp6/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_181843-pwdgzvp6/files/output.log
deleted file mode 100644
index 0031d895ac..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_181843-pwdgzvp6/files/output.log
+++ /dev/null
@@ -1,35 +0,0 @@
-wandb:initialized
-warmup_step:1/20
-warmup_step:2/20
-warmup_step:3/20
-warmup_step:4/20
-warmup_step:5/20
-warmup_step:6/20
-warmup_step:7/20
-warmup_step:8/20
-warmup_step:9/20
-warmup_step:10/20
-warmup_step:11/20
-warmup_step:12/20
-warmup_step:13/20
-warmup_step:14/20
-warmup_step:15/20
-warmup_step:16/20
-warmup_step:17/20
-warmup_step:18/20
-warmup_step:19/20
-warmup_step:20/20
-step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.02ms jpw:0.1000
-step:1/500 train_loss:6.9303 grad_norm:0.3807 train_time:3138ms step_avg:3137.92ms
-step:2/500 train_loss:6.0630 grad_norm:0.8852 train_time:6265ms step_avg:3132.44ms
-step:3/500 train_loss:6.2767 grad_norm:1.1566 train_time:9418ms step_avg:3139.33ms
-step:4/500 train_loss:6.8246 grad_norm:2.3705 train_time:12572ms step_avg:3142.95ms
-step:5/500 train_loss:7.1027 grad_norm:2.4915 train_time:15728ms step_avg:3145.56ms
-step:6/500 train_loss:7.8286 grad_norm:8.6216 train_time:18882ms step_avg:3146.95ms
-step:7/500 train_loss:6.9764 grad_norm:2.0638 train_time:22040ms step_avg:3148.60ms
-step:8/500 train_loss:6.6684 grad_norm:1.9308 train_time:25196ms step_avg:3149.51ms
-step:9/500 train_loss:6.6037 grad_norm:1.6760 train_time:28351ms step_avg:3150.11ms
-step:10/500 train_loss:6.3408 grad_norm:1.7243 train_time:31507ms step_avg:3150.71ms
-step:20/500 train_loss:5.1670 grad_norm:1.0415 train_time:63051ms step_avg:3152.55ms
-step:30/500 train_loss:4.5147 grad_norm:1.1903 train_time:94615ms step_avg:3153.85ms
-step:40/500 train_loss:4.0341 grad_norm:0.6994 train_time:126304ms step_avg:3157.60ms
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_182338-0jlwabms/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_182338-0jlwabms/files/output.log
deleted file mode 100644
index da423b9e56..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_182338-0jlwabms/files/output.log
+++ /dev/null
@@ -1,39 +0,0 @@
-wandb:initialized
-warmup_step:1/20
-warmup_step:2/20
-warmup_step:3/20
-warmup_step:4/20
-warmup_step:5/20
-warmup_step:6/20
-warmup_step:7/20
-warmup_step:8/20
-warmup_step:9/20
-warmup_step:10/20
-warmup_step:11/20
-warmup_step:12/20
-warmup_step:13/20
-warmup_step:14/20
-warmup_step:15/20
-warmup_step:16/20
-warmup_step:17/20
-warmup_step:18/20
-warmup_step:19/20
-warmup_step:20/20
-step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.03ms jpw:0.1000
-step:1/500 train_loss:6.9303 grad_norm:0.3807 train_time:3134ms step_avg:3133.83ms
-step:2/500 train_loss:8.1781 grad_norm:3.1870 train_time:6255ms step_avg:3127.28ms
-step:3/500 train_loss:38.9994 grad_norm:116.1021 train_time:9401ms step_avg:3133.70ms
-step:4/500 train_loss:11.0719 grad_norm:18.4144 train_time:12551ms step_avg:3137.63ms
-step:5/500 train_loss:7.4167 grad_norm:1.3480 train_time:15699ms step_avg:3139.71ms
-step:6/500 train_loss:7.3220 grad_norm:4.6318 train_time:18851ms step_avg:3141.88ms
-step:7/500 train_loss:7.3726 grad_norm:2.9103 train_time:22002ms step_avg:3143.20ms
-step:8/500 train_loss:7.0895 grad_norm:2.4760 train_time:25155ms step_avg:3144.38ms
-step:9/500 train_loss:6.7142 grad_norm:1.5844 train_time:28307ms step_avg:3145.17ms
-step:10/500 train_loss:6.5562 grad_norm:2.4313 train_time:31456ms step_avg:3145.64ms
-step:20/500 train_loss:5.8714 grad_norm:3.6692 train_time:62966ms step_avg:3148.32ms
-step:30/500 train_loss:5.0120 grad_norm:1.4222 train_time:94460ms step_avg:3148.67ms
-step:40/500 train_loss:4.3942 grad_norm:0.4195 train_time:126054ms step_avg:3151.34ms
-step:50/500 train_loss:4.0610 grad_norm:0.4135 train_time:157574ms step_avg:3151.47ms
-step:50/500 val_loss:4.0211 val_bpb:2.3816 train_time:157605ms step_avg:3152.10ms h_norms=['106271.3', '105043.7', '106111.3', '108041.9', '104153.4', '127767.0', '319009.2', '273895.3', '225243.8', '193857.2', '141638.6', '148276.8', '156801.2', '166727.3', '166524.3', '147292.3', '167621.6', '181512.6', '192726.1', '193725.8'] growth=['0.971', '0.988', '1.010', '1.018', '0.964', '1.227', '2.497', '0.859', '0.822', '0.861', '1.310', '1.047', '1.057', '1.063', '0.999', '1.237', '1.138', '1.083', '1.062', '1.005'] jpw:0.5590
-step:60/500 train_loss:3.8035 grad_norm:0.5494 train_time:189124ms step_avg:3152.07ms
-step:70/500 train_loss:3.6830 grad_norm:0.5418 train_time:220632ms step_avg:3151.88ms
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_183144-z6wj6zap/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_183144-z6wj6zap/files/output.log
deleted file mode 100644
index 3ec7213eb5..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_183144-z6wj6zap/files/output.log
+++ /dev/null
@@ -1,39 +0,0 @@
-wandb:initialized
-warmup_step:1/20
-warmup_step:2/20
-warmup_step:3/20
-warmup_step:4/20
-warmup_step:5/20
-warmup_step:6/20
-warmup_step:7/20
-warmup_step:8/20
-warmup_step:9/20
-warmup_step:10/20
-warmup_step:11/20
-warmup_step:12/20
-warmup_step:13/20
-warmup_step:14/20
-warmup_step:15/20
-warmup_step:16/20
-warmup_step:17/20
-warmup_step:18/20
-warmup_step:19/20
-warmup_step:20/20
-step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.01ms jpw:0.1000
-step:1/500 train_loss:6.9303 grad_norm:0.3807 train_time:3128ms step_avg:3128.41ms
-step:2/500 train_loss:8.2222 grad_norm:3.3072 train_time:6251ms step_avg:3125.58ms
-step:3/500 train_loss:33.5787 grad_norm:96.8257 train_time:9398ms step_avg:3132.71ms
-step:4/500 train_loss:9.6490 grad_norm:12.2426 train_time:12549ms step_avg:3137.20ms
-step:5/500 train_loss:7.2469 grad_norm:1.2742 train_time:15703ms step_avg:3140.50ms
-step:6/500 train_loss:7.2557 grad_norm:4.1964 train_time:18853ms step_avg:3142.10ms
-step:7/500 train_loss:7.1683 grad_norm:2.3011 train_time:22004ms step_avg:3143.39ms
-step:8/500 train_loss:6.8964 grad_norm:2.7614 train_time:25156ms step_avg:3144.44ms
-step:9/500 train_loss:6.7751 grad_norm:3.1754 train_time:28305ms step_avg:3145.02ms
-step:10/500 train_loss:6.3857 grad_norm:1.2630 train_time:31455ms step_avg:3145.47ms
-step:20/500 train_loss:5.6141 grad_norm:1.3015 train_time:62945ms step_avg:3147.24ms
-step:30/500 train_loss:5.6210 grad_norm:9.5553 train_time:94425ms step_avg:3147.51ms
-step:40/500 train_loss:4.6846 grad_norm:0.5640 train_time:126016ms step_avg:3150.39ms
-step:50/500 train_loss:4.1659 grad_norm:0.2352 train_time:157487ms step_avg:3149.75ms
-step:50/500 val_loss:4.1176 val_bpb:2.4387 train_time:157519ms step_avg:3150.38ms h_norms=['335866.4', '282945.0', '272903.9', '222662.3', '156368.6', '961668.8', '881032.8', '713078.4', '516760.4', '397430.0', '153059.2', '147951.4', '140618.3', '126310.0', '114506.3', '125878.2', '191353.2', '171239.0', '146885.7', '127148.9'] growth=['1.930', '0.842', '0.965', '0.816', '0.702', '6.150', '0.916', '0.809', '0.725', '0.769', '0.966', '0.967', '0.950', '0.898', '0.907', '1.026', '1.520', '0.895', '0.858', '0.866'] jpw:0.5590
-step:60/500 train_loss:3.8723 grad_norm:0.4775 train_time:189007ms step_avg:3150.12ms
-step:70/500 train_loss:3.7416 grad_norm:0.5037 train_time:220553ms step_avg:3150.76ms
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_184007-n9zy31jn/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_184007-n9zy31jn/files/output.log
deleted file mode 100644
index 9e72f53613..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_184007-n9zy31jn/files/output.log
+++ /dev/null
@@ -1,100 +0,0 @@
-wandb:initialized
-warmup_step:1/20
-warmup_step:2/20
-warmup_step:3/20
-warmup_step:4/20
-warmup_step:5/20
-warmup_step:6/20
-warmup_step:7/20
-warmup_step:8/20
-warmup_step:9/20
-warmup_step:10/20
-warmup_step:11/20
-warmup_step:12/20
-warmup_step:13/20
-warmup_step:14/20
-warmup_step:15/20
-warmup_step:16/20
-warmup_step:17/20
-warmup_step:18/20
-warmup_step:19/20
-warmup_step:20/20
-step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.02ms jpw:0.1000
-step:1/500 train_loss:6.9303 grad_norm:0.3807 train_time:3131ms step_avg:3131.00ms
-step:2/500 train_loss:8.2530 grad_norm:3.3922 train_time:6252ms step_avg:3126.10ms
-step:3/500 train_loss:8.0060 grad_norm:9.9122 train_time:9401ms step_avg:3133.59ms
-step:4/500 train_loss:7.1538 grad_norm:1.2770 train_time:12549ms step_avg:3137.34ms
-step:5/500 train_loss:7.1510 grad_norm:1.7610 train_time:15698ms step_avg:3139.56ms
-step:6/500 train_loss:6.9298 grad_norm:2.0765 train_time:18847ms step_avg:3141.20ms
-step:7/500 train_loss:6.9181 grad_norm:3.2331 train_time:21996ms step_avg:3142.27ms
-step:8/500 train_loss:6.7962 grad_norm:1.6115 train_time:25145ms step_avg:3143.09ms
-step:9/500 train_loss:6.4521 grad_norm:1.1930 train_time:28294ms step_avg:3143.83ms
-step:10/500 train_loss:6.1161 grad_norm:1.3597 train_time:31450ms step_avg:3144.97ms
-step:20/500 train_loss:4.7664 grad_norm:2.1164 train_time:62972ms step_avg:3148.59ms
-step:30/500 train_loss:4.3513 grad_norm:2.1193 train_time:94486ms step_avg:3149.53ms
-step:40/500 train_loss:4.3373 grad_norm:6.4797 train_time:126119ms step_avg:3152.97ms
-step:50/500 train_loss:3.9627 grad_norm:0.7730 train_time:157627ms step_avg:3152.54ms
-step:50/500 val_loss:3.9200 val_bpb:2.3216 train_time:157659ms step_avg:3153.17ms h_norms=['64274.7', '65654.7', '65790.9', '63923.0', '61167.5', '71576.8', '76431.3', '80533.1', '84488.5', '86665.3', '87025.9', '725127.4', '632645.4', '544538.6', '445896.0', '80419.9', '91824.9', '110608.9', '131367.6', '145242.4'] growth=['0.989', '1.021', '1.002', '0.972', '0.957', '1.170', '1.068', '1.054', '1.049', '1.026', '1.357', '8.332', '0.872', '0.861', '0.819', '1.180', '1.142', '1.205', '1.188', '1.106'] jpw:2.5990
-step:60/500 train_loss:3.7054 grad_norm:0.9363 train_time:189133ms step_avg:3152.22ms
-step:70/500 train_loss:3.5955 grad_norm:0.4140 train_time:220643ms step_avg:3152.04ms
-step:80/500 train_loss:3.5108 grad_norm:0.4226 train_time:252172ms step_avg:3152.16ms
-step:90/500 train_loss:3.3388 grad_norm:0.4857 train_time:283690ms step_avg:3152.12ms
-step:100/500 train_loss:3.2920 grad_norm:0.4328 train_time:315205ms step_avg:3152.05ms
-step:100/500 val_loss:3.2529 val_bpb:1.9266 train_time:315237ms step_avg:3152.37ms h_norms=['68496.9', '69533.5', '70329.0', '70151.5', '67282.9', '82184.2', '97588.3', '111284.9', '122709.0', '127061.8', '114348.1', '154423.4', '158482.6', '160652.5', '158037.5', '95279.7', '112085.8', '124794.6', '134699.2', '137489.2'] growth=['1.024', '1.015', '1.011', '0.997', '0.959', '1.221', '1.187', '1.140', '1.103', '1.035', '1.651', '1.350', '1.026', '1.014', '0.984', '1.374', '1.176', '1.113', '1.079', '1.021'] jpw:0.1490
-step:110/500 train_loss:3.1973 grad_norm:0.4638 train_time:346673ms step_avg:3151.57ms
-step:120/500 train_loss:3.0966 grad_norm:0.3586 train_time:378163ms step_avg:3151.36ms
-step:130/500 train_loss:3.0353 grad_norm:0.4942 train_time:409701ms step_avg:3151.55ms
-step:140/500 train_loss:2.9792 grad_norm:0.3505 train_time:441225ms step_avg:3151.60ms
-step:150/500 train_loss:2.9324 grad_norm:0.3410 train_time:472737ms step_avg:3151.58ms
-step:150/500 val_loss:2.9123 val_bpb:1.7248 train_time:472768ms step_avg:3151.79ms h_norms=['54591.3', '54936.0', '54614.2', '52686.0', '48419.6', '56665.0', '62607.1', '67063.4', '69061.8', '67313.2', '78936.0', '89378.7', '89174.9', '84854.0', '79911.0', '75692.4', '81948.0', '80201.0', '77479.5', '72017.6'] growth=['1.016', '1.006', '0.994', '0.965', '0.919', '1.170', '1.105', '1.071', '1.030', '0.975', '1.623', '1.132', '0.998', '0.952', '0.942', '1.556', '1.083', '0.979', '0.966', '0.930'] jpw:0.1000
-step:160/500 train_loss:2.9131 grad_norm:0.3417 train_time:504245ms step_avg:3151.53ms
-step:170/500 train_loss:2.8519 grad_norm:0.3201 train_time:535772ms step_avg:3151.60ms
-step:180/500 train_loss:2.7615 grad_norm:0.3215 train_time:567320ms step_avg:3151.78ms
-step:190/500 train_loss:2.7781 grad_norm:0.3936 train_time:598853ms step_avg:3151.86ms
-step:200/500 train_loss:2.6955 grad_norm:0.3496 train_time:630517ms step_avg:3152.58ms
-step:200/500 val_loss:2.7432 val_bpb:1.6247 train_time:630549ms step_avg:3152.74ms h_norms=['46320.0', '46179.5', '45761.4', '44508.9', '40976.6', '47797.8', '51896.9', '53164.9', '52566.1', '49366.9', '45993.8', '52287.0', '57769.8', '55335.3', '50418.6', '61436.7', '59953.2', '55890.7', '54343.5', '50288.3'] growth=['0.998', '0.997', '0.991', '0.973', '0.921', '1.166', '1.086', '1.024', '0.989', '0.939', '1.122', '1.137', '1.105', '0.958', '0.911', '1.498', '0.976', '0.932', '0.972', '0.925'] jpw:0.1000
-step:210/500 train_loss:2.6793 grad_norm:0.2864 train_time:662071ms step_avg:3152.72ms
-step:220/500 train_loss:2.7259 grad_norm:0.2815 train_time:693632ms step_avg:3152.87ms
-step:230/500 train_loss:2.6538 grad_norm:0.2559 train_time:725190ms step_avg:3153.00ms
-step:240/500 train_loss:2.6575 grad_norm:0.3120 train_time:756724ms step_avg:3153.02ms
-step:250/500 train_loss:2.7087 grad_norm:0.4401 train_time:788283ms step_avg:3153.13ms
-step:250/500 val_loss:2.6326 val_bpb:1.5592 train_time:788315ms step_avg:3153.26ms h_norms=['43543.6', '43248.6', '42211.9', '41586.9', '38254.6', '45794.9', '47618.7', '47301.2', '46865.0', '43024.5', '41461.4', '45151.9', '46339.2', '43777.7', '39702.6', '44273.0', '48162.3', '46945.5', '44420.7', '40421.7'] growth=['0.993', '0.993', '0.976', '0.985', '0.920', '1.197', '1.040', '0.993', '0.991', '0.918', '1.086', '1.089', '1.026', '0.945', '0.907', '1.160', '1.088', '0.975', '0.946', '0.910'] jpw:0.1000
-step:260/500 train_loss:2.6426 grad_norm:0.3115 train_time:819839ms step_avg:3153.23ms
-step:270/500 train_loss:2.6021 grad_norm:0.3289 train_time:851372ms step_avg:3153.23ms
-step:280/500 train_loss:2.5391 grad_norm:0.2397 train_time:882901ms step_avg:3153.22ms
-step:290/500 train_loss:2.5774 grad_norm:0.2781 train_time:914431ms step_avg:3153.21ms
-step:300/500 train_loss:2.5344 grad_norm:0.2146 train_time:945978ms step_avg:3153.26ms
-step:300/500 val_loss:2.5490 val_bpb:1.5097 train_time:946010ms step_avg:3153.37ms h_norms=['43232.1', '42432.1', '41223.9', '40796.4', '37886.6', '45413.0', '46563.8', '45871.2', '46402.4', '43360.7', '41003.4', '46803.1', '44047.2', '41094.8', '37367.9', '56900.2', '52918.1', '50843.4', '45060.9', '39790.1'] growth=['0.974', '0.981', '0.972', '0.990', '0.929', '1.199', '1.025', '0.985', '1.012', '0.934', '1.082', '1.141', '0.941', '0.933', '0.909', '1.499', '0.930', '0.961', '0.886', '0.883'] jpw:0.1000
-step:310/500 train_loss:2.4523 grad_norm:0.2264 train_time:977542ms step_avg:3153.36ms
-step:320/500 train_loss:2.5183 grad_norm:0.2105 train_time:1009085ms step_avg:3153.39ms
-step:330/500 train_loss:2.5571 grad_norm:0.2283 train_time:1040628ms step_avg:3153.42ms
-step:340/500 train_loss:2.4750 grad_norm:0.4910 train_time:1072163ms step_avg:3153.42ms
-step:350/500 train_loss:2.5743 grad_norm:0.2184 train_time:1103847ms step_avg:3153.85ms
-step:350/500 val_loss:2.4914 val_bpb:1.4755 train_time:1103879ms step_avg:3153.94ms h_norms=['44224.9', '42616.1', '40801.7', '40665.7', '38003.5', '45954.9', '46072.2', '44968.3', '46243.3', '43608.8', '39877.4', '46404.4', '44214.0', '40903.7', '37660.7', '46128.9', '44751.9', '42821.9', '41310.8', '37880.0'] growth=['0.963', '0.964', '0.957', '0.997', '0.935', '1.209', '1.003', '0.976', '1.028', '0.943', '1.050', '1.164', '0.953', '0.925', '0.921', '1.213', '0.970', '0.957', '0.965', '0.917'] jpw:0.1000
-step:360/500 train_loss:2.3401 grad_norm:0.2802 train_time:1135398ms step_avg:3153.88ms
-step:370/500 train_loss:2.5430 grad_norm:0.2968 train_time:1166946ms step_avg:3153.91ms
-step:380/500 train_loss:2.4805 grad_norm:0.2258 train_time:1198510ms step_avg:3153.97ms
-step:390/500 train_loss:2.4340 grad_norm:0.1956 train_time:1230116ms step_avg:3154.14ms
-step:400/500 train_loss:2.4815 grad_norm:0.1874 train_time:1261679ms step_avg:3154.20ms
-step:400/500 val_loss:2.4425 val_bpb:1.4466 train_time:1261711ms step_avg:3154.28ms h_norms=['46265.4', '44746.1', '42831.1', '42585.7', '40307.7', '47408.2', '47175.3', '46076.6', '47453.2', '45874.0', '42031.8', '46244.6', '48766.2', '44593.0', '39205.9', '48067.7', '47603.8', '45564.2', '42074.1', '38804.8'] growth=['0.948', '0.967', '0.957', '0.994', '0.947', '1.176', '0.995', '0.977', '1.030', '0.967', '1.041', '1.100', '1.055', '0.914', '0.879', '1.187', '0.990', '0.957', '0.923', '0.922'] jpw:0.1000
-step:410/500 train_loss:2.4447 grad_norm:0.2270 train_time:1293242ms step_avg:3154.25ms
-step:420/500 train_loss:2.4766 grad_norm:0.1941 train_time:1324799ms step_avg:3154.28ms
-step:430/500 train_loss:2.3966 grad_norm:0.2439 train_time:1356341ms step_avg:3154.28ms
-step:440/500 train_loss:2.4855 grad_norm:0.2573 train_time:1387879ms step_avg:3154.27ms
-step:450/500 train_loss:2.3007 grad_norm:0.5815 train_time:1419425ms step_avg:3154.28ms
-step:450/500 val_loss:2.4005 val_bpb:1.4217 train_time:1419456ms step_avg:3154.35ms h_norms=['49465.2', '47383.0', '45019.3', '44070.9', '41581.5', '48767.1', '48045.7', '46626.6', '47351.6', '45564.7', '73151.9', '1831187.8', '1431135.6', '1073570.8', '836909.8', '68914.1', '62393.5', '62367.8', '57867.6', '52301.5'] growth=['0.938', '0.958', '0.950', '0.979', '0.944', '1.173', '0.985', '0.970', '1.016', '0.962', '1.777', '25.033', '0.782', '0.750', '0.780', '1.670', '0.905', '1.000', '0.928', '0.904'] jpw:0.1000
-step:460/500 train_loss:2.4303 grad_norm:0.1559 train_time:1450982ms step_avg:3154.31ms
-step:470/500 train_loss:2.3816 grad_norm:0.1371 train_time:1482497ms step_avg:3154.25ms
-step:480/500 train_loss:2.3382 grad_norm:0.2851 train_time:1514014ms step_avg:3154.20ms
-step:490/500 train_loss:2.3456 grad_norm:0.4312 train_time:1545525ms step_avg:3154.13ms
-step:500/500 train_loss:2.3579 grad_norm:0.2399 train_time:1577067ms step_avg:3154.13ms
-step:500/500 val_loss:2.3751 val_bpb:1.4067 train_time:1577098ms step_avg:3154.20ms h_norms=['54182.7', '51741.9', '48776.3', '47916.3', '45263.7', '52401.3', '51762.4', '50104.6', '51713.6', '50582.7', '434613.8', '385628.6', '300553.7', '315624.6', '281217.0', '176020.4', '267042.6', '201712.7', '153802.4', '124577.2'] growth=['0.928', '0.955', '0.943', '0.982', '0.945', '1.158', '0.988', '0.968', '1.032', '0.978', '9.613', '0.887', '0.779', '1.050', '0.891', '3.893', '1.517', '0.755', '0.762', '0.810'] jpw:0.1000
-peak memory allocated: 66527 MiB reserved: 67428 MiB
-ema:applying EMA weights
-DIAGNOSTIC post_ema val_loss:2.7827 val_bpb:1.6481 eval_time:85106ms
-Serialized model: 110942659 bytes
-Code size: 103206 bytes
-Serialized model int6+lzma: 10736384 bytes
-Total submission size int6+lzma: 10839590 bytes
-final_int6_roundtrip val_loss:2.8519 val_bpb:1.6890 eval_time:84507ms
-final_int6_roundtrip_exact val_loss:2.85187392 val_bpb:1.68904037
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_184007-n9zy31jn/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_184007-n9zy31jn/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_184007-n9zy31jn/files/requirements.txt
+++ /dev/null
@@ -1,101 +0,0 @@
-mdurl==0.1.2
-nvidia-cudnn-cu13==9.19.0.56
-aiohttp==3.13.3
-nvidia-cufile==1.15.1.6
-charset-normalizer==3.4.6
-Jinja2==3.1.6
-hf-xet==1.4.2
-nvidia-cuda-nvrtc-cu12==12.8.93
-typer==0.24.1
-attrs==26.1.0
-certifi==2026.2.25
-triton==3.6.0
-nvidia-nccl-cu12==2.29.7
-wheel==0.46.3
-nvidia-nvtx-cu12==12.8.90
-gitdb==4.0.12
-dill==0.4.1
-nvidia-cuda-cupti==13.0.85
-tqdm==4.67.3
-pandas==3.0.1
-PyYAML==6.0.3
-annotated-types==0.7.0
-annotated-doc==0.0.4
-nvidia-nccl-cu13==2.28.9
-nvidia-cufft-cu12==11.3.3.83
-nvidia-cuda-nvrtc==13.0.88
-nvidia-cudnn-cu12==9.20.0.48
-httpx==0.28.1
-packaging==26.0
-einops==0.8.2
-xxhash==3.6.0
-huggingface_hub==1.8.0
-Pygments==2.19.2
-markdown-it-py==4.0.0
-pydantic_core==2.41.5
-nvidia-cusparse-cu12==12.5.8.93
-cuda-toolkit==13.0.2
-rich==14.3.3
-six==1.17.0
-python-dateutil==2.9.0.post0
-nvidia-cusolver==12.0.4.66
-nvidia-nvshmem-cu13==3.4.5
-setuptools==81.0.0
-pyarrow==23.0.1
-typing_extensions==4.15.0
-MarkupSafe==3.0.3
-smmap==5.0.3
-filelock==3.25.2
-nvidia-nvtx==13.0.85
-multiprocess==0.70.19
-networkx==3.6.1
-pydantic==2.12.5
-nvidia-nvshmem-cu12==3.4.5
-nvidia-cublas-cu12==12.8.4.1
-anyio==4.13.0
-nvidia-cufft==12.0.0.61
-cuda-pathfinder==1.5.0
-mpmath==1.3.0
-cuda-bindings==13.2.0
-propcache==0.4.1
-yarl==1.23.0
-ninja==1.13.0
-typing-inspection==0.4.2
-idna==3.11
-h11==0.16.0
-urllib3==2.6.3
-multidict==6.7.1
-aiosignal==1.4.0
-nvidia-nvjitlink-cu12==12.8.93
-nvidia-cusparse==12.6.3.3
-aiohappyeyeballs==2.6.1
-psutil==7.2.2
-wandb==0.25.1
-protobuf==6.33.6
-click==8.3.1
-nvidia-cufile-cu12==1.13.1.3
-httpcore==1.0.9
-sentencepiece==0.2.1
-fsspec==2026.2.0
-nvidia-curand-cu12==10.3.9.90
-nvidia-curand==10.4.0.35
-GitPython==3.1.46
-pip==26.0.1
-platformdirs==4.9.4
-nvidia-cublas==13.1.0.3
-nvidia-cuda-cupti-cu12==12.8.90
-flash_attn==2.8.3
-nvidia-cusolver-cu12==11.7.3.90
-sympy==1.14.0
-torch==2.11.0
-numpy==2.4.3
-nvidia-cuda-runtime-cu12==12.8.90
-nvidia-cusparselt-cu13==0.8.0
-frozenlist==1.8.0
-sentry-sdk==2.56.0
-requests==2.33.0
-nvidia-cuda-runtime==13.0.96
-nvidia-nvjitlink==13.0.88
-nvidia-cusparselt-cu12==0.7.1
-shellingham==1.5.4
-datasets==4.8.4
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-8n0ize2o/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-8n0ize2o/files/output.log
deleted file mode 100644
index ee502872af..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-8n0ize2o/files/output.log
+++ /dev/null
@@ -1,50 +0,0 @@
-wandb:initialized
-Traceback (most recent call last):
-  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2155, in <module>
-    main()
-  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1815, in main
-    warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer)
-                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__
-    return super().__call__(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
-    return self._call_impl(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
-    return forward_call(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper
-    return fn(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
-    return self._call_impl(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
-    return forward_call(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1061, in forward
-    x, h_core_in, h_core_out, pass_penalty = self._forward_hidden(input_ids, feedback_fn, stabilizer)
-                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1033, in _forward_hidden
-    x, raw_v = self.blocks[j](x, x0, q_w, k_w, v_w, out_w, up_w, down_w,
-               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
-    return self._call_impl(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
-    return forward_call(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 786, in forward
-    x_out = x_out + self.mlp_scale.to(dtype=x_out.dtype)[None, None, :] * self.mlp(self.mlp_norm(x_out) * self.ln_scale_factor, up_w, down_w)
-            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
-    return self._call_impl(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
-    return forward_call(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 747, in forward
-    x = F.leaky_relu(F.linear(x, up_w.to(x.dtype)), negative_slope=0.5)
-        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 215.94 MiB is free. Including non-PyTorch memory, this process has 54.45 GiB memory in use. Process 386552 has 65.25 GiB memory in use. Process 386553 has 19.88 GiB memory in use. Of the allocated memory 53.67 GiB is allocated by PyTorch, and 120.86 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
-See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf)
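All three crashed workers in this record end with the same allocator hint. If it is used, `PYTORCH_CUDA_ALLOC_CONF` has to be in the environment before the first CUDA allocation; a minimal sketch at the top of a launcher, assuming the run scripts do not already export it in shell:

```python
# The OOM messages suggest PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.
# It only takes effect if set before the first CUDA allocation, so set it
# before importing torch (sketch; the shell scripts may export it instead).
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # noqa: E402  (imported after the env var on purpose)
assert torch.cuda.is_available()
```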
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-8n0ize2o/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-8n0ize2o/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-9ekyp2ua/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-9ekyp2ua/files/output.log
deleted file mode 100644
index 0670c5c62f..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-9ekyp2ua/files/output.log
+++ /dev/null
@@ -1,14 +0,0 @@
-wandb:initialized
-Traceback (most recent call last):
-  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2155, in <module>
-    main()
-  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1816, in main
-    (warmup_loss * grad_scale).backward()
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_tensor.py", line 631, in backward
-    torch.autograd.backward(
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/__init__.py", line 381, in backward
-    _engine_run_backward(
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/graph.py", line 869, in _engine_run_backward
-    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 576.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 509.94 MiB is free. Process 386554 has 54.45 GiB memory in use. Including non-PyTorch memory, this process has 64.96 GiB memory in use. Process 386553 has 19.88 GiB memory in use. Of the allocated memory 64.13 GiB is allocated by PyTorch, and 110.78 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf)
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-9ekyp2ua/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-9ekyp2ua/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-jsf1k0pv/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-jsf1k0pv/files/output.log
deleted file mode 100644
index 42fa78428c..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-jsf1k0pv/files/output.log
+++ /dev/null
@@ -1,50 +0,0 @@
-wandb:initialized
-Traceback (most recent call last):
-  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2155, in <module>
-    main()
-  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1815, in main
-    warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer)
-                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__
-    return super().__call__(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
-    return self._call_impl(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
-    return forward_call(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper
-    return fn(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
-    return self._call_impl(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
-    return forward_call(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1061, in forward
-    x, h_core_in, h_core_out, pass_penalty = self._forward_hidden(input_ids, feedback_fn, stabilizer)
-                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1033, in _forward_hidden
-    x, raw_v = self.blocks[j](x, x0, q_w, k_w, v_w, out_w, up_w, down_w,
-               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
-    return self._call_impl(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
-    return forward_call(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 786, in forward
-    x_out = x_out + self.mlp_scale.to(dtype=x_out.dtype)[None, None, :] * self.mlp(self.mlp_norm(x_out) * self.ln_scale_factor, up_w, down_w)
-            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
-    return self._call_impl(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
-    return forward_call(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 748, in forward
-    return F.linear(x.square(), down_w.to(x.dtype))
-           ^^^^^^^^^^
-torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 576.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 195.94 MiB is free. Process 386554 has 54.46 GiB memory in use. Process 386552 has 65.25 GiB memory in use. Including non-PyTorch memory, this process has 19.88 GiB memory in use. Of the allocated memory 19.11 GiB is allocated by PyTorch, and 100.95 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf)
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-jsf1k0pv/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_193530-jsf1k0pv/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_194337-b8eb1lhl/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_194337-b8eb1lhl/files/output.log
deleted file mode 100644
index f109d80e95..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_194337-b8eb1lhl/files/output.log
+++ /dev/null
@@ -1,63 +0,0 @@
-wandb:initialized
-warmup_step:1/20
-warmup_step:2/20
-warmup_step:3/20
-warmup_step:4/20
-warmup_step:5/20
-warmup_step:6/20
-warmup_step:7/20
-warmup_step:8/20
-warmup_step:9/20
-warmup_step:10/20
-warmup_step:11/20
-warmup_step:12/20
-warmup_step:13/20
-warmup_step:14/20
-warmup_step:15/20
-warmup_step:16/20
-warmup_step:17/20
-warmup_step:18/20
-warmup_step:19/20
-warmup_step:20/20
-step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.02ms jpw:0.1000
-step:1/500 train_loss:6.9303 grad_norm:0.3807 train_time:3150ms step_avg:3149.64ms
-step:2/500 train_loss:8.2530 grad_norm:3.3922 train_time:6286ms step_avg:3143.18ms
-step:3/500 train_loss:7.4761 grad_norm:1.7568 train_time:9456ms step_avg:3151.99ms
-step:4/500 train_loss:7.5765 grad_norm:1.7785 train_time:12627ms step_avg:3156.70ms
-step:5/500 train_loss:7.3284 grad_norm:1.8550 train_time:15795ms step_avg:3159.04ms
-step:6/500 train_loss:7.0817 grad_norm:1.4558 train_time:18966ms step_avg:3160.97ms
-step:7/500 train_loss:6.9079 grad_norm:2.0922 train_time:22139ms step_avg:3162.66ms
-step:8/500 train_loss:6.9275 grad_norm:2.0590 train_time:25309ms step_avg:3163.66ms
-step:9/500 train_loss:6.6772 grad_norm:1.6278 train_time:28478ms step_avg:3164.21ms
-step:10/500 train_loss:6.2414 grad_norm:1.7465 train_time:31650ms step_avg:3164.98ms
-step:20/500 train_loss:5.1363 grad_norm:2.7138 train_time:63335ms step_avg:3166.75ms
-step:30/500 train_loss:4.3551 grad_norm:1.3891 train_time:95058ms step_avg:3168.60ms
-step:40/500 train_loss:4.0787 grad_norm:1.7074 train_time:126907ms step_avg:3172.68ms
-step:50/500 train_loss:3.8860 grad_norm:0.8342 train_time:158631ms step_avg:3172.61ms
-step:50/500 val_loss:3.8436 val_bpb:2.2764 train_time:158662ms step_avg:3173.24ms h_norms=['29399.1', '30127.6', '31287.8', '32221.2', '34106.9', '39952.1', '45442.3', '51958.3', '58827.5', '67490.0', '55560.2', '64617.7', '74357.9', '86351.5', '99810.2', '82827.4', '103212.4', '115017.4', '128838.7', '149489.9'] growth=['1.084', '1.025', '1.039', '1.030', '1.059', '1.171', '1.137', '1.143', '1.132', '1.147', '1.186', '1.163', '1.151', '1.161', '1.156', '1.254', '1.246', '1.114', '1.120', '1.160'] jpw:1.0690
-step:60/500 train_loss:3.6867 grad_norm:0.7324 train_time:190352ms step_avg:3172.53ms
-step:70/500 train_loss:3.5703 grad_norm:1.0982 train_time:222085ms step_avg:3172.64ms
-step:80/500 train_loss:3.5100 grad_norm:0.6625 train_time:253815ms step_avg:3172.68ms
-step:90/500 train_loss:3.3401 grad_norm:0.6580 train_time:285565ms step_avg:3172.94ms
-step:100/500 train_loss:3.2941 grad_norm:0.6615 train_time:317291ms step_avg:3172.91ms
-step:100/500 val_loss:3.2441 val_bpb:1.9214 train_time:317323ms step_avg:3173.23ms h_norms=['47705.2', '50092.6', '51794.0', '52615.4', '53976.0', '64508.0', '74412.4', '84893.8', '95510.2', '106712.2', '85551.0', '100970.7', '116826.7', '133686.2', '150967.2', '117625.8', '139019.5', '162155.8', '186225.0', '211766.1'] growth=['1.052', '1.050', '1.034', '1.016', '1.026', '1.195', '1.154', '1.141', '1.125', '1.117', '1.208', '1.180', '1.157', '1.144', '1.129', '1.213', '1.182', '1.166', '1.148', '1.137'] jpw:0.1190
-step:110/500 train_loss:3.1981 grad_norm:0.6386 train_time:349008ms step_avg:3172.80ms
-step:120/500 train_loss:3.1035 grad_norm:0.5124 train_time:380716ms step_avg:3172.63ms
-step:130/500 train_loss:3.0258 grad_norm:0.3921 train_time:412455ms step_avg:3172.73ms
-step:140/500 train_loss:2.9851 grad_norm:0.4252 train_time:444162ms step_avg:3172.58ms
-step:150/500 train_loss:2.9393 grad_norm:0.5980 train_time:475876ms step_avg:3172.51ms
-step:150/500 val_loss:2.9206 val_bpb:1.7298 train_time:475907ms step_avg:3172.71ms h_norms=['44309.3', '46766.3', '47516.9', '47905.3', '47774.2', '56567.2', '63533.7', '70064.7', '76357.1', '81758.3', '68481.7', '78087.2', '86586.4', '95168.0', '103378.6', '83962.7', '95447.8', '106783.7', '118504.9', '130044.9'] growth=['1.061', '1.055', '1.016', '1.008', '0.997', '1.184', '1.123', '1.103', '1.090', '1.071', '1.183', '1.140', '1.109', '1.099', '1.086', '1.172', '1.137', '1.119', '1.110', '1.097'] jpw:0.1000
-step:160/500 train_loss:2.9259 grad_norm:0.4479 train_time:507599ms step_avg:3172.49ms
-step:170/500 train_loss:2.8777 grad_norm:0.4693 train_time:539353ms step_avg:3172.66ms
-step:180/500 train_loss:2.7824 grad_norm:0.3977 train_time:571098ms step_avg:3172.77ms
-step:190/500 train_loss:2.8183 grad_norm:0.5028 train_time:602819ms step_avg:3172.73ms
-step:200/500 train_loss:2.7302 grad_norm:0.4966 train_time:634675ms step_avg:3173.37ms
-step:200/500 val_loss:2.7679 val_bpb:1.6393 train_time:634706ms step_avg:3173.53ms h_norms=['43898.3', '46876.7', '47639.6', '47540.1', '46729.0', '54471.5', '60239.5', '64693.1', '68195.4', '70570.1', '61019.1', '68002.8', '73186.1', '77789.0', '81406.4', '69222.6', '76278.3', '82542.8', '88466.1', '93667.9'] growth=['1.112', '1.068', '1.016', '0.998', '0.983', '1.166', '1.106', '1.074', '1.054', '1.035', '1.147', '1.114', '1.076', '1.063', '1.047', '1.137', '1.102', '1.082', '1.072', '1.059'] jpw:0.1000
-step:210/500 train_loss:2.7159 grad_norm:0.3988 train_time:666398ms step_avg:3173.33ms
-step:220/500 train_loss:2.7636 grad_norm:0.3531 train_time:698119ms step_avg:3173.27ms
-step:230/500 train_loss:2.6959 grad_norm:0.3534 train_time:729840ms step_avg:3173.22ms
-step:240/500 train_loss:2.6946 grad_norm:0.3099 train_time:761572ms step_avg:3173.21ms
-step:250/500 train_loss:2.7419 grad_norm:0.3689 train_time:793293ms step_avg:3173.17ms
-step:250/500 val_loss:2.6831 val_bpb:1.5891 train_time:793325ms step_avg:3173.30ms h_norms=['43230.1', '46896.1', '47276.5', '45457.8', '43084.9', '50881.0', '55210.7', '57425.2', '57111.8', '55992.8', '52524.6', '56946.6', '59050.3', '59130.6', '58425.0', '54273.5', '57840.7', '60095.9', '60604.2', '60888.6'] growth=['1.135', '1.085', '1.008', '0.962', '0.948', '1.181', '1.085', '1.040', '0.995', '0.980', '1.143', '1.084', '1.037', '1.001', '0.988', '1.113', '1.066', '1.039', '1.008', '1.005'] jpw:0.1000
-step:260/500 train_loss:2.6850 grad_norm:0.3416 train_time:825063ms step_avg:3173.32ms
-step:270/500 train_loss:2.6478 grad_norm:0.3872 train_time:856842ms step_avg:3173.49ms
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_194337-b8eb1lhl/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_194337-b8eb1lhl/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_200826-h4wnno7e/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_200826-h4wnno7e/files/output.log
deleted file mode 100644
index d61a863916..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_200826-h4wnno7e/files/output.log
+++ /dev/null
@@ -1,65 +0,0 @@
-wandb:initialized
-warmup_step:1/20
-warmup_step:2/20
-warmup_step:3/20
-warmup_step:4/20
-warmup_step:5/20
-warmup_step:6/20
-warmup_step:7/20
-warmup_step:8/20
-warmup_step:9/20
-warmup_step:10/20
-warmup_step:11/20
-warmup_step:12/20
-warmup_step:13/20
-warmup_step:14/20
-warmup_step:15/20
-warmup_step:16/20
-warmup_step:17/20
-warmup_step:18/20
-warmup_step:19/20
-warmup_step:20/20
-step:0/500 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.02ms jpw:0.1000
-step:1/500 train_loss:6.9303 train_time:3121ms step_avg:3121.37ms
-step:2/500 train_loss:8.2964 train_time:6284ms step_avg:3141.84ms
-step:3/500 train_loss:7.7322 train_time:9457ms step_avg:3152.21ms
-step:4/500 train_loss:8.5580 train_time:12628ms step_avg:3156.95ms
-step:5/500 train_loss:8.4686 train_time:15797ms step_avg:3159.39ms
-step:6/500 train_loss:7.7993 train_time:18963ms step_avg:3160.57ms
-step:7/500 train_loss:7.2392 train_time:22130ms step_avg:3161.45ms
-step:8/500 train_loss:7.0090 train_time:25296ms step_avg:3162.02ms
-step:9/500 train_loss:6.5969 train_time:28465ms step_avg:3162.73ms
-step:10/500 train_loss:6.4712 train_time:31634ms step_avg:3163.42ms
-step:20/500 train_loss:5.4373 train_time:63316ms step_avg:3165.82ms
-step:30/500 train_loss:4.7425 train_time:95013ms step_avg:3167.10ms
-step:40/500 train_loss:4.5571 train_time:126854ms step_avg:3171.35ms
-step:50/500 train_loss:4.2765 train_time:158552ms step_avg:3171.04ms
-step:50/500 val_loss:4.2310 val_bpb:2.5059 train_time:158584ms step_avg:3171.67ms h_norms=['128020.9', '142329.3', '150817.1', '164904.4', '172878.4', '221093.7', '304101.2', '290425.7', '279737.3', '288668.9', '254141.1', '302318.5', '342082.8', '372346.5', '408459.7', '357174.3', '431074.6', '494662.8', '539688.3', '601738.6'] growth=['1.137', '1.112', '1.060', '1.093', '1.048', '1.279', '1.375', '0.955', '0.963', '1.032', '1.207', '1.190', '1.132', '1.088', '1.097', '1.197', '1.207', '1.148', '1.091', '1.115'] jpw:0.1000
-step:60/500 train_loss:4.0440 train_time:190235ms step_avg:3170.59ms
-step:70/500 train_loss:3.9054 train_time:221921ms step_avg:3170.31ms
-step:80/500 train_loss:3.8129 train_time:253616ms step_avg:3170.20ms
-step:90/500 train_loss:3.6730 train_time:285312ms step_avg:3170.13ms
-step:100/500 train_loss:3.6064 train_time:317006ms step_avg:3170.06ms
-step:100/500 val_loss:3.5712 val_bpb:2.1151 train_time:317038ms step_avg:3170.38ms h_norms=['217506.4', '238701.1', '250839.5', '253256.4', '255121.4', '309055.8', '367379.8', '398821.4', '411101.1', '432634.1', '369051.3', '439129.6', '489535.2', '523037.9', '563631.0', '484894.3', '584870.6', '666437.0', '722848.2', '791578.9'] growth=['1.119', '1.097', '1.051', '1.010', '1.007', '1.211', '1.189', '1.086', '1.031', '1.052', '1.215', '1.190', '1.115', '1.068', '1.078', '1.215', '1.206', '1.139', '1.085', '1.095'] jpw:0.1000
-step:110/500 train_loss:3.5100 train_time:348728ms step_avg:3170.25ms
-step:120/500 train_loss:3.4055 train_time:380452ms step_avg:3170.43ms
-step:130/500 train_loss:3.3291 train_time:412209ms step_avg:3170.84ms
-step:140/500 train_loss:3.2646 train_time:443910ms step_avg:3170.78ms
-step:150/500 train_loss:3.1985 train_time:475595ms step_avg:3170.63ms
-step:150/500 val_loss:3.1731 val_bpb:1.8793 train_time:475626ms step_avg:3170.84ms h_norms=['202425.6', '219581.3', '229199.7', '231251.1', '232042.3', '280260.3', '331272.2', '359679.4', '373077.9', '392866.2', '327880.4', '388136.0', '431546.2', '460585.2', '494839.2', '422156.4', '505718.2', '574404.4', '622526.9', '679339.4'] growth=['1.104', '1.085', '1.044', '1.009', '1.003', '1.208', '1.182', '1.086', '1.037', '1.053', '1.208', '1.184', '1.112', '1.067', '1.074', '1.212', '1.198', '1.136', '1.084', '1.091'] jpw:0.1000
-step:160/500 train_loss:3.1467 train_time:507288ms step_avg:3170.55ms
-step:170/500 train_loss:3.0741 train_time:538985ms step_avg:3170.50ms
-step:180/500 train_loss:2.9611 train_time:570664ms step_avg:3170.36ms
-step:190/500 train_loss:2.9609 train_time:602355ms step_avg:3170.29ms
-step:200/500 train_loss:2.8727 train_time:634178ms step_avg:3170.89ms
-step:200/500 val_loss:2.9169 val_bpb:1.7275 train_time:634210ms step_avg:3171.05ms h_norms=['195918.4', '211776.8', '222459.3', '225583.9', '227291.6', '278823.8', '321587.3', '350774.8', '365721.4', '384108.5', '319066.1', '375610.7', '414454.6', '441410.4', '472513.9', '402015.0', '479212.5', '542916.8', '583835.1', '633030.1'] growth=['1.112', '1.081', '1.050', '1.014', '1.008', '1.227', '1.153', '1.091', '1.043', '1.050', '1.204', '1.177', '1.103', '1.065', '1.070', '1.204', '1.192', '1.133', '1.075', '1.084'] jpw:0.1000
-step:210/500 train_loss:2.8522 train_time:665890ms step_avg:3170.90ms
-step:220/500 train_loss:2.9033 train_time:697577ms step_avg:3170.81ms
-step:230/500 train_loss:2.8202 train_time:729281ms step_avg:3170.79ms
-step:240/500 train_loss:2.8185 train_time:760982ms step_avg:3170.76ms
-step:250/500 train_loss:2.8500 train_time:792694ms step_avg:3170.78ms
-step:250/500 val_loss:2.7950 val_bpb:1.6554 train_time:792726ms step_avg:3170.90ms h_norms=['187987.2', '202979.7', '214368.9', '220031.3', '223903.4', '284093.7', '324382.2', '353471.8', '371760.4', '389398.9', '318921.8', '370811.2', '408036.2', '436269.2', '464205.0', '395210.2', '465609.5', '525084.6', '560371.8', '601627.7'] growth=['1.132', '1.080', '1.056', '1.026', '1.018', '1.269', '1.142', '1.090', '1.052', '1.047', '1.203', '1.163', '1.100', '1.069', '1.064', '1.198', '1.178', '1.128', '1.067', '1.074'] jpw:0.1000
-step:260/500 train_loss:2.7996 train_time:824394ms step_avg:3170.75ms
-step:270/500 train_loss:2.7542 train_time:856111ms step_avg:3170.78ms
-step:280/500 train_loss:2.6927 train_time:887831ms step_avg:3170.82ms
-step:290/500 train_loss:2.7306 train_time:919538ms step_avg:3170.82ms
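In the run above the per-pass `growth` entries sit persistently above 1.0 (often 1.1-1.2), which compounds across recurrence passes into the six-figure `h_norms`. A sketch of how such a diagnostic can be computed, assuming a list of hidden states captured after each recurrent pass (the training script's exact reference state for the first ratio may differ):

```python
# Sketch of the `h_norms`/`growth` diagnostic printed at each val step:
# record the hidden-state norm after every recurrent pass and the ratio to
# the previous pass. Sustained growth > 1.0 compounds geometrically and
# predicts blow-ups like the ones logged at step 450+ elsewhere in this record.
import torch

@torch.no_grad()
def norm_growth(h_states: list[torch.Tensor]) -> tuple[list[float], list[float]]:
    norms = [h.float().norm().item() for h in h_states]
    growth = [b / max(a, 1e-8) for a, b in zip(norms[:-1], norms[1:])]
    return norms, growth
```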
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_200826-h4wnno7e/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_200826-h4wnno7e/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_203403-fzb1y9o8/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_203403-fzb1y9o8/files/output.log
deleted file mode 100644
index ec78344062..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_203403-fzb1y9o8/files/output.log
+++ /dev/null
@@ -1,88 +0,0 @@
-wandb:initialized
-warmup_step:1/20
-warmup_step:2/20
-warmup_step:3/20
-warmup_step:4/20
-warmup_step:5/20
-warmup_step:6/20
-warmup_step:7/20
-warmup_step:8/20
-warmup_step:9/20
-warmup_step:10/20
-warmup_step:11/20
-warmup_step:12/20
-warmup_step:13/20
-warmup_step:14/20
-warmup_step:15/20
-warmup_step:16/20
-warmup_step:17/20
-warmup_step:18/20
-warmup_step:19/20
-warmup_step:20/20
-step:0/20000 val_loss:6.9292 val_bpb:4.1038 train_time:0ms step_avg:0.01ms
-step:1/20000 train_loss:6.9303 grad_norm:0.3807 train_time:79129ms step_avg:79129.16ms
-late_qat:enabled step:1 scale:0.0351 core_quant:on
-step:2/20000 train_loss:8.3329 grad_norm:3.6210 train_time:164629ms step_avg:82314.47ms
-step:3/20000 train_loss:8.2999 grad_norm:3.5783 train_time:250620ms step_avg:83539.89ms
-step:4/20000 train_loss:8.2004 grad_norm:3.5109 train_time:332890ms step_avg:83222.59ms
-W0326 20:44:27.328000 404831 torch/_dynamo/convert_frame.py:1743] [0/8] torch._dynamo hit config.recompile_limit (8)
-W0326 20:44:27.328000 404831 torch/_dynamo/convert_frame.py:1743] [0/8] function: 'forward' (/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py:1053)
-W0326 20:44:27.328000 404831 torch/_dynamo/convert_frame.py:1743] [0/8] last reason: 0/7: self._lora_step_mul == 0.002  # s = self._lora_scale * getattr(self, '_lora_step_mul', 1.0)  # records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py:1024 in _forward_hidden
-W0326 20:44:27.328000 404831 torch/_dynamo/convert_frame.py:1743] [0/8] User stack trace:
-W0326 20:44:27.328000 404831 torch/_dynamo/convert_frame.py:1743] [0/8]   File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1055, in forward
-W0326 20:44:27.328000 404831 torch/_dynamo/convert_frame.py:1743] [0/8]     x, h_core_in, h_core_out = self._forward_hidden(input_ids, feedback_fn, stabilizer)
-W0326 20:44:27.328000 404831 torch/_dynamo/convert_frame.py:1743] [0/8]   File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1024, in _forward_hidden
-W0326 20:44:27.328000 404831 torch/_dynamo/convert_frame.py:1743] [0/8]     s = self._lora_scale * getattr(self, '_lora_step_mul', 1.0)
-W0326 20:44:27.328000 404831 torch/_dynamo/convert_frame.py:1743] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
-W0326 20:44:27.328000 404831 torch/_dynamo/convert_frame.py:1743] [0/8] To diagnose recompilation issues, see https://docs.pytorch.org/docs/main/user_guide/torch_compiler/compile/programming_model.recompilation.html
-Traceback (most recent call last):
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1773, in _compile
-    raise_unimplemented_cache_limit_exceeded()
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1757, in raise_unimplemented_cache_limit_exceeded
-    unimplemented(
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/exc.py", line 634, in unimplemented
-    raise Unsupported(msg, gb_type, skip_frame)
-torch._dynamo.exc.Unsupported: Dynamo recompile limit exceeded
-  Explanation: Dynamo attempted to recompile the code object too many times, exceeding the recompile_limit cache size limit (currently set to 8). Excessive recompilations can degrade performance due to the compilation overhead of each recompilation.
-  Hint: To monitor recompilations, enable TORCH_LOGS=recompiles. If recompilations are expected, consider increasing torch._dynamo.config.recompile_limit to an appropriate value.
-  Hint: See https://docs.pytorch.org/docs/main/user_guide/torch_compiler/compile/programming_model.recompilation.html for tips on dealing with recompilations.
-
-  Developer debug context: Limit type: recompile_limit
-
-  For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0039.html
-
-The above exception was the direct cause of the following exception:
-
-Traceback (most recent call last):
-  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2145, in <module>
-    main()
-  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1902, in main
-    loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__
-    return super().__call__(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
-    return self._call_impl(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
-    return forward_call(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper
-    return fn(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
-    return self._call_impl(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
-    return forward_call(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 2316, in __call__
-    result = self._torchdynamo_orig_backend(
-             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 729, in __call__
-    result = _compile(
-             ^^^^^^^^^
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1780, in _compile
-    raise FailOnRecompileLimitHit(
-torch._dynamo.exc.FailOnRecompileLimitHit: Hard failure due to fullgraph=True
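The guard in the `last reason` line (`self._lora_step_mul == 0.002`) is Dynamo specializing on a Python float attribute that changes every step, so under `fullgraph=True` each new value costs a full recompile until the limit trips. A sketch of the standard fix, assuming the attribute is only read inside `forward`: keep it as a 0-dim tensor buffer and mutate it in place, so it enters the graph as data rather than as a guard constant.

```python
# Sketch: avoid per-value Dynamo guards on a step-dependent scalar by storing
# it as a tensor buffer instead of a Python float attribute. The class and
# method names here are illustrative, not the training script's exact API.
import torch
import torch.nn as nn

class CoreWithLora(nn.Module):
    def __init__(self):
        super().__init__()
        # Buffer, not a float: compiled code reads it as a tensor input.
        self.register_buffer("_lora_step_mul", torch.tensor(1.0))

    def set_lora_step_mul(self, value: float) -> None:
        self._lora_step_mul.fill_(value)  # in-place update, no new guard value

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self._lora_step_mul  # traced as a tensor op
```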
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_203403-fzb1y9o8/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_203403-fzb1y9o8/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_205139-3z8g4kez/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_205139-3z8g4kez/files/output.log
deleted file mode 100644
index c8f9580ede..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_205139-3z8g4kez/files/output.log
+++ /dev/null
@@ -1,186 +0,0 @@
-wandb:initialized
-warmup_step:1/20
-warmup_step:2/20
-warmup_step:3/20
-warmup_step:4/20
-warmup_step:5/20
-warmup_step:6/20
-warmup_step:7/20
-warmup_step:8/20
-warmup_step:9/20
-warmup_step:10/20
-warmup_step:11/20
-warmup_step:12/20
-warmup_step:13/20
-warmup_step:14/20
-warmup_step:15/20
-warmup_step:16/20
-warmup_step:17/20
-warmup_step:18/20
-warmup_step:19/20
-warmup_step:20/20
-step:0/20000 val_loss:6.9291 val_bpb:4.1038 train_time:0ms step_avg:0.01ms
-step:1/20000 train_loss:6.9303 grad_norm:0.3807 train_time:1305ms step_avg:1305.20ms
-step:2/20000 train_loss:8.2624 grad_norm:3.4023 train_time:2653ms step_avg:1326.56ms
-step:3/20000 train_loss:7.4846 grad_norm:1.6936 train_time:4015ms step_avg:1338.21ms
-step:4/20000 train_loss:7.7154 grad_norm:1.9838 train_time:5385ms step_avg:1346.32ms
-step:5/20000 train_loss:7.4456 grad_norm:2.1207 train_time:6747ms step_avg:1349.42ms
-step:6/20000 train_loss:7.0896 grad_norm:1.7550 train_time:8108ms step_avg:1351.41ms
-step:7/20000 train_loss:6.8569 grad_norm:2.3306 train_time:9470ms step_avg:1352.92ms
-step:8/20000 train_loss:6.7973 grad_norm:1.6453 train_time:10833ms step_avg:1354.11ms
-step:9/20000 train_loss:6.5582 grad_norm:1.2844 train_time:12199ms step_avg:1355.49ms
-step:10/20000 train_loss:6.2034 grad_norm:1.2514 train_time:13569ms step_avg:1356.86ms
-step:50/20000 train_loss:3.6856 grad_norm:0.8303 train_time:68670ms step_avg:1373.39ms
-step:100/20000 train_loss:3.1157 grad_norm:0.4168 train_time:137708ms step_avg:1377.08ms
-step:150/20000 train_loss:2.7810 grad_norm:0.3675 train_time:206699ms step_avg:1377.99ms
-step:200/20000 train_loss:2.5703 grad_norm:0.3384 train_time:275646ms step_avg:1378.23ms
-step:250/20000 train_loss:2.5789 grad_norm:0.2913 train_time:344574ms step_avg:1378.30ms
-step:300/20000 train_loss:2.4357 grad_norm:0.2132 train_time:413532ms step_avg:1378.44ms
-step:350/20000 train_loss:2.4866 grad_norm:0.2055 train_time:483575ms step_avg:1381.64ms
-step:400/20000 train_loss:2.4144 grad_norm:0.2487 train_time:552589ms step_avg:1381.47ms
-step:450/20000 train_loss:2.2333 grad_norm:0.1527 train_time:621533ms step_avg:1381.18ms
-step:500/20000 train_loss:2.2865 grad_norm:0.1511 train_time:690504ms step_avg:1381.01ms
-step:500/20000 val_loss:2.3117 val_bpb:1.3691 train_time:690515ms step_avg:1381.03ms
-step:550/20000 train_loss:2.3480 grad_norm:0.1354 train_time:760462ms step_avg:1382.66ms
-step:600/20000 train_loss:2.2536 grad_norm:0.2002 train_time:829477ms step_avg:1382.46ms
-step:650/20000 train_loss:2.2306 grad_norm:0.1194 train_time:898608ms step_avg:1382.47ms
-step:700/20000 train_loss:2.3041 grad_norm:0.1617 train_time:967715ms step_avg:1382.45ms
-step:750/20000 train_loss:2.2754 grad_norm:0.1308 train_time:1036878ms step_avg:1382.50ms
-step:800/20000 train_loss:2.2542 grad_norm:0.1211 train_time:1106111ms step_avg:1382.64ms
-step:850/20000 train_loss:2.1786 grad_norm:0.0688 train_time:1175361ms step_avg:1382.78ms
-step:900/20000 train_loss:2.0929 grad_norm:0.0751 train_time:1244692ms step_avg:1382.99ms
-step:950/20000 train_loss:2.2961 grad_norm:0.1601 train_time:1314910ms step_avg:1384.12ms
-step:1000/20000 train_loss:2.2261 grad_norm:0.0833 train_time:1384163ms step_avg:1384.16ms
-step:1000/20000 val_loss:2.1725 val_bpb:1.2867 train_time:1384175ms step_avg:1384.17ms
-step:1050/20000 train_loss:2.1507 grad_norm:0.1497 train_time:1453425ms step_avg:1384.21ms
-step:1100/20000 train_loss:2.1755 grad_norm:0.0691 train_time:1522714ms step_avg:1384.29ms
-step:1150/20000 train_loss:2.1286 grad_norm:0.0721 train_time:1592948ms step_avg:1385.17ms
-step:1200/20000 train_loss:2.1760 grad_norm:0.0782 train_time:1662222ms step_avg:1385.19ms
-step:1250/20000 train_loss:2.2002 grad_norm:0.0675 train_time:1731489ms step_avg:1385.19ms
-step:1300/20000 train_loss:2.1691 grad_norm:0.0856 train_time:1800790ms step_avg:1385.22ms
-step:1350/20000 train_loss:2.1443 grad_norm:0.0825 train_time:1870108ms step_avg:1385.27ms
-step:1400/20000 train_loss:2.1563 grad_norm:0.0820 train_time:1939417ms step_avg:1385.30ms
-step:1450/20000 train_loss:2.1534 grad_norm:0.0778 train_time:2008708ms step_avg:1385.32ms
-step:1500/20000 train_loss:2.1264 grad_norm:0.1636 train_time:2077976ms step_avg:1385.32ms
-step:1500/20000 val_loss:2.1131 val_bpb:1.2515 train_time:2077988ms step_avg:1385.33ms
-step:1550/20000 train_loss:2.0937 grad_norm:0.0739 train_time:2148175ms step_avg:1385.92ms
-step:1600/20000 train_loss:2.1730 grad_norm:0.0742 train_time:2217401ms step_avg:1385.88ms
-step:1650/20000 train_loss:1.9579 grad_norm:0.0866 train_time:2286669ms step_avg:1385.86ms
-step:1700/20000 train_loss:2.0866 grad_norm:0.0640 train_time:2356000ms step_avg:1385.88ms
-step:1750/20000 train_loss:2.0575 grad_norm:0.0784 train_time:2425290ms step_avg:1385.88ms
-step:1800/20000 train_loss:2.0953 grad_norm:0.0589 train_time:2495476ms step_avg:1386.38ms
-step:1850/20000 train_loss:2.1090 grad_norm:0.1099 train_time:2564723ms step_avg:1386.34ms
-step:1900/20000 train_loss:2.0553 grad_norm:0.0538 train_time:2633991ms step_avg:1386.31ms
-step:1950/20000 train_loss:2.0417 grad_norm:0.0691 train_time:2703295ms step_avg:1386.31ms
-step:2000/20000 train_loss:2.2933 grad_norm:0.0634 train_time:2772559ms step_avg:1386.28ms
-step:2000/20000 val_loss:2.0736 val_bpb:1.2281 train_time:2772571ms step_avg:1386.29ms
-step:2050/20000 train_loss:2.0610 grad_norm:0.0643 train_time:2841839ms step_avg:1386.26ms
-step:2100/20000 train_loss:2.0352 grad_norm:0.0542 train_time:2911097ms step_avg:1386.24ms
-step:2150/20000 train_loss:2.0150 grad_norm:0.0748 train_time:2980382ms step_avg:1386.22ms
-step:2200/20000 train_loss:2.1675 grad_norm:0.0647 train_time:3050764ms step_avg:1386.71ms
-step:2250/20000 train_loss:2.0588 grad_norm:0.0651 train_time:3120289ms step_avg:1386.80ms
-step:2300/20000 train_loss:2.0371 grad_norm:0.0742 train_time:3189823ms step_avg:1386.88ms
-step:2350/20000 train_loss:1.9911 grad_norm:0.0819 train_time:3259324ms step_avg:1386.95ms
-step:2400/20000 train_loss:2.1049 grad_norm:0.0508 train_time:3329685ms step_avg:1387.37ms
-step:2450/20000 train_loss:2.0658 grad_norm:0.0537 train_time:3398968ms step_avg:1387.33ms
-step:2500/20000 train_loss:2.0210 grad_norm:0.0627 train_time:3468256ms step_avg:1387.30ms
-step:2500/20000 val_loss:2.0271 val_bpb:1.2005 train_time:3468267ms step_avg:1387.31ms
-step:2550/20000 train_loss:2.0220 grad_norm:0.0459 train_time:3537589ms step_avg:1387.29ms
-step:2600/20000 train_loss:1.9997 grad_norm:0.0445 train_time:3606904ms step_avg:1387.27ms
-step:2650/20000 train_loss:2.0041 grad_norm:0.0439 train_time:3676209ms step_avg:1387.25ms
-step:2700/20000 train_loss:2.0259 grad_norm:0.0450 train_time:3745493ms step_avg:1387.22ms
-step:2750/20000 train_loss:2.0067 grad_norm:0.0443 train_time:3814755ms step_avg:1387.18ms
-step:2800/20000 train_loss:2.0409 grad_norm:0.0486 train_time:3884987ms step_avg:1387.50ms
-step:2850/20000 train_loss:1.9897 grad_norm:0.0474 train_time:3954266ms step_avg:1387.46ms
-step:2900/20000 train_loss:2.0047 grad_norm:0.0898 train_time:4023547ms step_avg:1387.43ms
-step:2950/20000 train_loss:2.0428 grad_norm:0.0410 train_time:4092882ms step_avg:1387.42ms
-step:3000/20000 train_loss:1.9290 grad_norm:0.0543 train_time:4163141ms step_avg:1387.71ms
-step:3000/20000 val_loss:1.9832 val_bpb:1.1746 train_time:4163153ms step_avg:1387.72ms
-step:3050/20000 train_loss:1.9349 grad_norm:0.0457 train_time:4232654ms step_avg:1387.76ms
-step:3100/20000 train_loss:1.9977 grad_norm:0.0425 train_time:4302197ms step_avg:1387.81ms
-swa:start step:3150
-step:3150/20000 train_loss:2.0068 grad_norm:0.0401 train_time:4371737ms step_avg:1387.85ms
-step:3200/20000 train_loss:1.9809 grad_norm:0.0467 train_time:4441129ms step_avg:1387.85ms
-late_qat:enabled step:3204 scale:0.1497 core_quant:on
-step:3250/20000 train_loss:1.9488 grad_norm:0.0413 train_time:4597242ms step_avg:1414.54ms
-step:3300/20000 train_loss:1.9274 grad_norm:0.0417 train_time:4666636ms step_avg:1414.13ms
-step:3350/20000 train_loss:1.9650 grad_norm:0.0352 train_time:4736129ms step_avg:1413.77ms
-step:3396/20000 val_loss:1.9539 val_bpb:1.1572 train_time:4800035ms step_avg:1413.44ms
-stopping_early: wallclock_cap train_time:4800035ms step:3396/20000
-peak memory allocated: 50639 MiB reserved: 50682 MiB
-ema:applying EMA weights
-DIAGNOSTIC post_ema val_loss:1.9508 val_bpb:1.1554 eval_time:33632ms
-Serialized model: 110942659 bytes
-Code size: 102570 bytes
-Serialized model int6+lzma: 17439360 bytes
-Total submission size int6+lzma: 17541930 bytes
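This run lands at 17,541,930 bytes against the 16 MB (16,777,216-byte) track limit, discovered only after the full training run. A cheap pre-flight sketch (`pack_int6` is an illustrative stand-in for the packing in quant.py, not its actual API):

```python
# Sketch: check the compressed submission size against the track budget
# before spending eval time. Mirrors the `Total submission size int6+lzma`
# line above: lzma-compressed packed weights plus code size.
import lzma

LIMIT = 16 * 1024 * 1024  # 16,777,216 bytes

def submission_size(packed_weights: bytes, code: bytes) -> int:
    compressed = lzma.compress(packed_weights, preset=9 | lzma.PRESET_EXTREME)
    return len(compressed) + len(code)

# usage sketch: assert submission_size(pack_int6(model), code_bytes) <= LIMIT
```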
-W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] function: 'forward' (/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py:1054)
-W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] last reason: 0/7: self._modules['blocks']._modules['0']._modules['attn']._modules['rotary']._cos_cached is None  # self._cos_cached is None  # records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py:590 in forward
-W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] User stack trace:
-W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8]   File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1056, in forward
-W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8]     x, h_core_in, h_core_out = self._forward_hidden(input_ids, feedback_fn, stabilizer)
-W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8]   File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1004, in _forward_hidden
-W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8]     x, raw_v = self.blocks[i](x, x0, q_w, k_w, v_w, out_w, up_w, down_w,
-W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8]   File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 783, in forward
-W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8]     attn_out, raw_v = self.attn(self.attn_norm(x_in) * self.ln_scale_factor, q_w, k_w, v_w, out_w, v_embed=v_embed, v0=v0)
-W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8]   File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 678, in forward
-W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8]     cos, sin = self.rotary(seqlen, x.device, q.dtype)
-W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8]   File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 590, in forward
-W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8]     self._cos_cached is None
-W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
-W0326 22:22:14.551000 411475 torch/_dynamo/convert_frame.py:1743] [0/8] To diagnose recompilation issues, see https://docs.pytorch.org/docs/main/user_guide/torch_compiler/compile/programming_model.recompilation.html
-Traceback (most recent call last):
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1773, in _compile
-    raise_unimplemented_cache_limit_exceeded()
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1757, in raise_unimplemented_cache_limit_exceeded
-    unimplemented(
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/exc.py", line 634, in unimplemented
-    raise Unsupported(msg, gb_type, skip_frame)
-torch._dynamo.exc.Unsupported: Dynamo recompile limit exceeded
-  Explanation: Dynamo attempted to recompile the code object too many times, exceeding the recompile_limit cache size limit (currently set to 8). Excessive recompilations can degrade performance due to the compilation overhead of each recompilation.
-  Hint: To monitor recompilations, enable TORCH_LOGS=recompiles. If recompilations are expected, consider increasing torch._dynamo.config.recompile_limit to an appropriate value.
-  Hint: See https://docs.pytorch.org/docs/main/user_guide/torch_compiler/compile/programming_model.recompilation.html for tips on dealing with recompilations.
-
-  Developer debug context: Limit type: recompile_limit
-
-  For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0039.html
-
-The above exception was the direct cause of the following exception:
-
-Traceback (most recent call last):
-  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2146, in <module>
-    main()
-  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2084, in main
-    q_val_loss, q_val_bpb = eval_val(
-  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 359, in eval_val
-    batch_loss = model(x, y).detach()
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__
-    return super().__call__(*args, **kwargs)
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
-    return self._call_impl(*args, **kwargs)
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
-    return forward_call(*args, **kwargs)
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper
-    return fn(*args, **kwargs)
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
-    return self._call_impl(*args, **kwargs)
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
-    return forward_call(*args, **kwargs)
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 2316, in __call__
-    result = self._torchdynamo_orig_backend(
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 729, in __call__
-    result = _compile(
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1780, in _compile
-    raise FailOnRecompileLimitHit(
-torch._dynamo.exc.FailOnRecompileLimitHit: Hard failure due to fullgraph=True
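The recompile loop in the failure above is visible in the guard reason: `self._cos_cached is None` flips once the rotary cache is populated, and under `fullgraph=True` the eighth recompile is fatal. A minimal sketch of the usual remedy is to build the cos/sin cache eagerly so no lazy-init guard exists; the `RotaryCache` name, its signature, and `max_seqlen` are illustrative here, not the module actually used in `train_gpt_recurrent.py`:

```python
import torch
from torch import Tensor, nn


class RotaryCache(nn.Module):
    """Hypothetical rotary cos/sin cache built once in __init__, so
    torch.compile never sees a `cache is None` branch to re-guard on."""

    def __init__(self, head_dim: int, max_seqlen: int, base: float = 10000.0):
        super().__init__()
        inv_freq = base ** (-torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)
        t = torch.arange(max_seqlen, dtype=torch.float32)
        freqs = torch.outer(t, inv_freq)  # (max_seqlen, head_dim // 2)
        # Buffers follow .to()/.cuda() moves but are not trained or checkpointed.
        self.register_buffer("cos_cached", freqs.cos(), persistent=False)
        self.register_buffer("sin_cached", freqs.sin(), persistent=False)

    def forward(self, seqlen: int, dtype: torch.dtype) -> tuple[Tensor, Tensor]:
        # Pure slicing, no data-dependent control flow, hence no recompiles.
        return self.cos_cached[:seqlen].to(dtype), self.sin_cached[:seqlen].to(dtype)
```

Precomputing for the longest context used in training and eval costs a few hundred kilobytes of GPU memory and should let the quantized-eval pass above reuse a single compiled graph instead of dying at recompile 8.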
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_205139-3z8g4kez/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_205139-3z8g4kez/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-ngp8wevn/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-ngp8wevn/files/output.log
deleted file mode 100644
index 62f3920e45..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-ngp8wevn/files/output.log
+++ /dev/null
@@ -1,41 +0,0 @@
-wandb:initialized
-Traceback (most recent call last):
"/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 317, in apply - return user_fn(self, *args) - ^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2748, in backward - return impl_fn() - ^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2734, in impl_fn - out = CompiledFunction._backward_impl(ctx, all_args) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2918, in _backward_impl - out = call_func_at_runtime_with_args( - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args - out = normalize_as_list(f(args)) - ^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn - return fn(*args, **kwargs) - ^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__ - return self.current_callable(inputs) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run - out = model(new_inputs) - ^^^^^^^^^^^^^^^^^ - File "/tmp/torchinductor_nesta/wm/cwmtq2g54vzhxzvsc4odvxx7srlroi2tosqcftjhmyw2c637ogrq.py", line 12255, in call - buf54 = empty_strided_cuda((48, 2048, 512), (1048576, 512, 1), torch.bfloat16) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 71.88 MiB is free. Process 438880 has 48.42 GiB memory in use. Including non-PyTorch memory, this process has 48.42 GiB memory in use. Process 438888 has 42.87 GiB memory in use. Of the allocated memory 47.76 GiB is allocated by PyTorch, and 3.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-ngp8wevn/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-ngp8wevn/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-w0ibl7rl/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-w0ibl7rl/files/output.log
deleted file mode 100644
index c5ed5853d1..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-w0ibl7rl/files/output.log
+++ /dev/null
@@ -1,41 +0,0 @@
-wandb:initialized
-Traceback (most recent call last):
-  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2148, in <module>
-    main()
-  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1814, in main
-    (warmup_loss * grad_scale).backward()
-torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 71.88 MiB is free. Including non-PyTorch memory, this process has 48.42 GiB memory in use. Process 438887 has 48.42 GiB memory in use. Process 438888 has 42.87 GiB memory in use. Of the allocated memory 47.76 GiB is allocated by PyTorch, and 3.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf)
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-w0ibl7rl/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-w0ibl7rl/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/output.log
deleted file mode 100644
index 8daabbc0b7..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/output.log
+++ /dev/null
@@ -1,109 +0,0 @@
-wandb:initialized
-Traceback (most recent call last):
-  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2148, in <module>
-    main()
-  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1813, in main
-    warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer)
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__
-    return super().__call__(*args, **kwargs)
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
-    return self._call_impl(*args, **kwargs)
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
-    return forward_call(*args, **kwargs)
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper
-    return fn(*args, **kwargs)
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
-    return self._call_impl(*args, **kwargs)
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
-    return forward_call(*args, **kwargs)
-  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1056, in forward
-    def forward(self, input_ids: Tensor, target_ids: Tensor,
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn
-    return fn(*args, **kwargs)
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1200, in forward
-    return compiled_fn(full_args)
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 566, in runtime_wrapper
-    all_outs = call_func_at_runtime_with_args(
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args
-    out = normalize_as_list(f(args))
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 105, in g
-    return f(*args)
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 596, in apply
-    return super().apply(*args, **kwargs)  # type: ignore[misc]
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2503, in forward
-    fw_outs = call_func_at_runtime_with_args(
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args
-    out = normalize_as_list(f(args))
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 783, in wrapper
-    return compiled_fn(runtime_args)
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1011, in inner_fn
-    outs = compiled_fn(args)
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__
-    return self.current_callable(inputs)
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run
-    out = model(new_inputs)
-  File "/tmp/torchinductor_nesta/3d/c3dzegwqai5cgqo2lurmxd3pfex6kuqjwm3ooueib5swt7sgm5kg.py", line 9307, in call
-    buf866 = torch.ops.flash_attn_3._flash_attn_forward.default(buf865, buf864, reinterpret_tensor(buf850, (48, 2048, 4, 64), (524288, 256, 64, 1), 0), None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 0.125, True, window_size_left=-1, window_size_right=-1, attention_chunk=0, softcap=0.0, rotary_interleaved=True, scheduler_metadata=None, num_splits=1, pack_gqa=None, sm_margin=0)
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_ops.py", line 865, in __call__
-    return self._op(*args, **kwargs)
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_library/autograd.py", line 112, in autograd_impl
-    result = forward_no_grad(*args, Metadata(keyset, keyword_only_args))
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_library/autograd.py", line 41, in forward_no_grad
-    result = op.redispatch(keyset & _C._after_autograd_keyset, *args, **kwargs)
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_ops.py", line 872, in redispatch
-    return self._handle.redispatch_boxed(keyset, *args, **kwargs)  # type: ignore[return-value]
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_compile.py", line 54, in inner
-    return disable_fn(*args, **kwargs)
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn
-    return fn(*args, **kwargs)
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 409, in __torch_dispatch__
-    res = func(*args, **kwargs)
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_ops.py", line 865, in __call__
-    return self._op(*args, **kwargs)
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_library/custom_ops.py", line 347, in backend_impl
-    result = self._backend_fns[device_type](*args, **kwargs)
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_compile.py", line 54, in inner
-    return disable_fn(*args, **kwargs)
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn
-    return fn(*args, **kwargs)
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_library/custom_ops.py", line 382, in wrapped_fn
-    return fn(*args, **kwargs)
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/flash_attention_hopper/flash_attn_interface.py", line 93, in _flash_attn_forward
-    out, softmax_lse, out_accum, softmax_lse_accum = flash_attn_3_cuda.fwd(
-  File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_ops.py", line 1269, in __call__
-    return self._op(*args, **kwargs)
-RuntimeError: torch_call_dispatcher( "aten::new_empty", "", stack.data(), TORCH_ABI_VERSION) API call failed at /root/.venv/lib/python3.12/site-packages/torch/include/torch/csrc/stable/ops.h, line 579
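A hedged reading of the `RuntimeError` above: the `flash_attn_3` wheel was built against a different torch stable ABI than the installed torch 2.11.0 (note the `/root/.venv` build-time path embedded in the error versus the `/home/nesta` runtime venv). A startup smoke test makes this class of mismatch fail in seconds rather than mid-run; the import path below is taken from the traceback, while the `flash_attn_func` entry point and its signature are assumptions about the flash_attention_hopper wheel, not verified API:

```python
import torch

print("torch:", torch.__version__, "cuda:", torch.version.cuda)

# Import path copied from the traceback; the public entry point is assumed.
from flash_attention_hopper import flash_attn_interface

# Tiny GQA-shaped inputs (8 query heads over 4 KV heads, head_dim 64),
# matching the model's attention layout, just to exercise the kernel once.
q = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 128, 4, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 128, 4, 64, device="cuda", dtype=torch.bfloat16)
out = flash_attn_interface.flash_attn_func(q, k, v, causal=True)
print("flash_attn_3 smoke test ok:", out[0].shape if isinstance(out, tuple) else out.shape)
```

Running a check like this right after process startup, before compilation and warmup, would surface the ABI failure before any training time is spent.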
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/output.log
deleted file mode 100644
index bbe357f37b..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/output.log
+++ /dev/null
@@ -1,41 +0,0 @@
-wandb:initialized
-Traceback (most recent call last):
-  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2148, in <module>
-    main()
-  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1814, in main
-    (warmup_loss * grad_scale).backward()
-torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 71.88 MiB is free. Including non-PyTorch memory, this process has 48.42 GiB memory in use. Process 443507 has 42.87 GiB memory in use. Process 443517 has 48.42 GiB memory in use. Of the allocated memory 47.76 GiB is allocated by PyTorch, and 3.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf)
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/output.log
deleted file mode 100644
index 7c33dc22e1..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/output.log
+++ /dev/null
@@ -1,41 +0,0 @@
-wandb:initialized
-Traceback (most recent call last):
"/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2148, in - main() - File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1814, in main - (warmup_loss * grad_scale).backward() - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_tensor.py", line 631, in backward - torch.autograd.backward( - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/__init__.py", line 381, in backward - _engine_run_backward( - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/graph.py", line 869, in _engine_run_backward - return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 317, in apply - return user_fn(self, *args) - ^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2748, in backward - return impl_fn() - ^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2734, in impl_fn - out = CompiledFunction._backward_impl(ctx, all_args) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2918, in _backward_impl - out = call_func_at_runtime_with_args( - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args - out = normalize_as_list(f(args)) - ^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn - return fn(*args, **kwargs) - ^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__ - return self.current_callable(inputs) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run - out = model(new_inputs) - ^^^^^^^^^^^^^^^^^ - File "/tmp/torchinductor_nesta/wm/cwmtq2g54vzhxzvsc4odvxx7srlroi2tosqcftjhmyw2c637ogrq.py", line 12255, in call - buf54 = empty_strided_cuda((48, 2048, 512), (1048576, 512, 1), torch.bfloat16) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 71.88 MiB is free. Process 443506 has 48.42 GiB memory in use. Process 443507 has 42.87 GiB memory in use. Including non-PyTorch memory, this process has 48.42 GiB memory in use. Of the allocated memory 47.76 GiB is allocated by PyTorch, and 3.38 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/output.log
deleted file mode 100644
index 8daabbc0b7..0000000000
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/output.log
+++ /dev/null
@@ -1,109 +0,0 @@
-wandb:initialized
-Traceback (most recent call last):
-  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2148, in <module>
-    main()
-  File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1813, in main
-    warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer)
-RuntimeError: torch_call_dispatcher( "aten::new_empty", "", stack.data(), TORCH_ABI_VERSION) API call failed at /root/.venv/lib/python3.12/site-packages/torch/include/torch/csrc/stable/ops.h, line 579
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/requirements.txt
deleted file mode 100644
index e3d59eea39..0000000000
a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/output.log deleted file mode 100644 index af28a40e2a..0000000000 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/output.log +++ /dev/null @@ -1,319 +0,0 @@ -wandb:initialized -warmup_step:1/20 -warmup_step:2/20 -warmup_step:3/20 -warmup_step:4/20 -warmup_step:5/20 -warmup_step:6/20 -warmup_step:7/20 -warmup_step:8/20 -warmup_step:9/20 -warmup_step:10/20 -warmup_step:11/20 -warmup_step:12/20 -warmup_step:13/20 -warmup_step:14/20 -warmup_step:15/20 -warmup_step:16/20 -warmup_step:17/20 -warmup_step:18/20 -warmup_step:19/20 -warmup_step:20/20 -step:0/20000 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms -step:1/20000 train_loss:6.9310 grad_norm:0.3717 train_time:1291ms step_avg:1291.09ms -step:2/20000 train_loss:8.3536 grad_norm:3.5393 train_time:2598ms step_avg:1298.81ms -step:3/20000 train_loss:7.5089 grad_norm:1.8069 train_time:3954ms step_avg:1318.13ms -step:4/20000 train_loss:7.5822 grad_norm:1.8725 train_time:5317ms step_avg:1329.29ms -step:5/20000 train_loss:7.3524 grad_norm:1.8843 train_time:6673ms step_avg:1334.64ms -step:6/20000 train_loss:7.0868 grad_norm:1.7131 train_time:8028ms step_avg:1338.00ms -step:7/20000 train_loss:6.9401 grad_norm:2.0897 train_time:9384ms step_avg:1340.63ms -step:8/20000 train_loss:6.8952 grad_norm:1.4534 train_time:10745ms step_avg:1343.15ms -step:9/20000 train_loss:6.5431 grad_norm:1.0222 train_time:12102ms step_avg:1344.70ms -step:10/20000 train_loss:6.1427 grad_norm:0.9715 train_time:13466ms step_avg:1346.55ms -step:50/20000 train_loss:3.6903 grad_norm:0.9422 train_time:68054ms step_avg:1361.07ms -step:100/20000 train_loss:3.1184 grad_norm:0.5410 train_time:136293ms step_avg:1362.93ms -step:150/20000 train_loss:2.7752 grad_norm:0.3613 train_time:205070ms step_avg:1367.13ms -step:200/20000 train_loss:2.5614 grad_norm:0.2693 train_time:273305ms step_avg:1366.53ms -step:250/20000 train_loss:2.5709 grad_norm:0.2522 train_time:341556ms step_avg:1366.22ms -step:300/20000 train_loss:2.4364 grad_norm:0.2295 train_time:409825ms step_avg:1366.08ms -step:350/20000 train_loss:2.4859 grad_norm:0.2104 train_time:478072ms step_avg:1365.92ms -step:400/20000 train_loss:2.3988 grad_norm:0.1555 train_time:546341ms step_avg:1365.85ms -step:450/20000 train_loss:2.2317 grad_norm:0.1958 train_time:614614ms step_avg:1365.81ms -step:500/20000 train_loss:2.2898 grad_norm:0.1775 train_time:682900ms step_avg:1365.80ms -step:500/20000 val_loss:2.3130 val_bpb:1.3699 train_time:682945ms step_avg:1365.89ms -step:550/20000 train_loss:2.3492 grad_norm:0.1559 train_time:751209ms step_avg:1365.83ms -step:600/20000 train_loss:2.2513 grad_norm:0.1438 train_time:819544ms step_avg:1365.91ms -step:650/20000 train_loss:2.2323 grad_norm:0.1536 train_time:888368ms step_avg:1366.72ms -step:700/20000 train_loss:2.3026 grad_norm:0.1020 train_time:956783ms step_avg:1366.83ms -step:750/20000 train_loss:2.2750 grad_norm:0.1105 train_time:1025183ms step_avg:1366.91ms -step:800/20000 train_loss:2.2546 grad_norm:0.1031 train_time:1093599ms step_avg:1367.00ms -step:850/20000 train_loss:2.1799 grad_norm:0.0737 train_time:1162084ms step_avg:1367.16ms -step:900/20000 train_loss:2.0960 grad_norm:0.0817 train_time:1230597ms step_avg:1367.33ms -step:950/20000 train_loss:2.2968 grad_norm:0.0953 train_time:1299094ms 
step_avg:1367.47ms -step:1000/20000 train_loss:2.2247 grad_norm:0.0713 train_time:1367589ms step_avg:1367.59ms -step:1000/20000 val_loss:2.1722 val_bpb:1.2865 train_time:1367633ms step_avg:1367.63ms -step:1050/20000 train_loss:2.1500 grad_norm:0.1469 train_time:1436112ms step_avg:1367.73ms -step:1100/20000 train_loss:2.1744 grad_norm:0.0794 train_time:1504991ms step_avg:1368.17ms -step:1150/20000 train_loss:2.1290 grad_norm:0.0672 train_time:1573762ms step_avg:1368.49ms -step:1200/20000 train_loss:2.1756 grad_norm:0.0636 train_time:1642514ms step_avg:1368.76ms -step:1250/20000 train_loss:2.1991 grad_norm:0.0599 train_time:1711283ms step_avg:1369.03ms -step:1300/20000 train_loss:2.1695 grad_norm:0.1132 train_time:1780070ms step_avg:1369.28ms -step:1350/20000 train_loss:2.1436 grad_norm:0.1200 train_time:1848866ms step_avg:1369.53ms -step:1400/20000 train_loss:2.1553 grad_norm:0.0700 train_time:1917654ms step_avg:1369.75ms -step:1450/20000 train_loss:2.1501 grad_norm:0.0631 train_time:1986442ms step_avg:1369.96ms -step:1500/20000 train_loss:2.1193 grad_norm:0.0733 train_time:2055220ms step_avg:1370.15ms -step:1500/20000 val_loss:2.1071 val_bpb:1.2479 train_time:2055264ms step_avg:1370.18ms -step:1550/20000 train_loss:2.0928 grad_norm:0.0758 train_time:2124013ms step_avg:1370.33ms -step:1600/20000 train_loss:2.1722 grad_norm:0.0814 train_time:2193129ms step_avg:1370.71ms -step:1650/20000 train_loss:1.9557 grad_norm:0.0655 train_time:2261915ms step_avg:1370.86ms -step:1700/20000 train_loss:2.0848 grad_norm:0.0634 train_time:2330710ms step_avg:1371.01ms -step:1750/20000 train_loss:2.0562 grad_norm:0.0759 train_time:2399493ms step_avg:1371.14ms -step:1800/20000 train_loss:2.0964 grad_norm:0.0645 train_time:2468259ms step_avg:1371.26ms -step:1850/20000 train_loss:2.1107 grad_norm:0.0831 train_time:2537046ms step_avg:1371.38ms -step:1900/20000 train_loss:2.0580 grad_norm:0.0648 train_time:2605824ms step_avg:1371.49ms -step:1950/20000 train_loss:2.0431 grad_norm:0.0981 train_time:2674651ms step_avg:1371.62ms -step:2000/20000 train_loss:2.2944 grad_norm:0.0838 train_time:2743419ms step_avg:1371.71ms -step:2000/20000 val_loss:2.0763 val_bpb:1.2297 train_time:2743463ms step_avg:1371.73ms -step:2050/20000 train_loss:2.0607 grad_norm:0.1013 train_time:2812501ms step_avg:1371.95ms -step:2100/20000 train_loss:2.0358 grad_norm:0.0558 train_time:2881257ms step_avg:1372.03ms -step:2150/20000 train_loss:2.0142 grad_norm:0.0526 train_time:2950035ms step_avg:1372.11ms -step:2200/20000 train_loss:2.1668 grad_norm:0.0614 train_time:3018808ms step_avg:1372.19ms -step:2250/20000 train_loss:2.0604 grad_norm:0.0644 train_time:3087562ms step_avg:1372.25ms -step:2300/20000 train_loss:2.0377 grad_norm:0.1123 train_time:3156291ms step_avg:1372.30ms -step:2350/20000 train_loss:1.9923 grad_norm:0.0511 train_time:3225042ms step_avg:1372.36ms -step:2400/20000 train_loss:2.1062 grad_norm:0.0682 train_time:3293804ms step_avg:1372.42ms -step:2450/20000 train_loss:2.0650 grad_norm:0.0639 train_time:3362565ms step_avg:1372.48ms -step:2500/20000 train_loss:2.0208 grad_norm:0.0580 train_time:3431320ms step_avg:1372.53ms -step:2500/20000 val_loss:2.0279 val_bpb:1.2010 train_time:3431364ms step_avg:1372.55ms -step:2550/20000 train_loss:2.0211 grad_norm:0.0558 train_time:3500393ms step_avg:1372.70ms -step:2600/20000 train_loss:2.0001 grad_norm:0.0479 train_time:3569165ms step_avg:1372.76ms -step:2650/20000 train_loss:2.0040 grad_norm:0.0582 train_time:3637929ms step_avg:1372.80ms -step:2700/20000 train_loss:2.0265 grad_norm:0.0542 
train_time:3706703ms step_avg:1372.85ms -step:2750/20000 train_loss:2.0077 grad_norm:0.0457 train_time:3775459ms step_avg:1372.89ms -step:2800/20000 train_loss:2.0415 grad_norm:0.0569 train_time:3844241ms step_avg:1372.94ms -step:2850/20000 train_loss:1.9900 grad_norm:0.0487 train_time:3913011ms step_avg:1372.99ms -step:2900/20000 train_loss:2.0045 grad_norm:0.0438 train_time:3981769ms step_avg:1373.02ms -step:2950/20000 train_loss:2.0440 grad_norm:0.0447 train_time:4050513ms step_avg:1373.06ms -step:3000/20000 train_loss:1.9316 grad_norm:0.0567 train_time:4119545ms step_avg:1373.18ms -step:3000/20000 val_loss:1.9838 val_bpb:1.1749 train_time:4119590ms step_avg:1373.20ms -step:3050/20000 train_loss:1.9372 grad_norm:0.0506 train_time:4188300ms step_avg:1373.21ms -step:3100/20000 train_loss:1.9990 grad_norm:0.0465 train_time:4257075ms step_avg:1373.25ms -step:3150/20000 train_loss:2.0077 grad_norm:0.0401 train_time:4325837ms step_avg:1373.28ms -swa:start step:3200 -step:3200/20000 train_loss:1.9812 grad_norm:0.0445 train_time:4394566ms step_avg:1373.30ms -late_qat:enabled step:3241 scale:0.1495 core_quant:on -step:3250/20000 train_loss:1.9531 grad_norm:0.0567 train_time:4519079ms step_avg:1390.49ms -step:3300/20000 train_loss:1.9296 grad_norm:0.0386 train_time:4587540ms step_avg:1390.16ms -step:3350/20000 train_loss:1.9653 grad_norm:0.0394 train_time:4655858ms step_avg:1389.81ms -step:3400/20000 train_loss:2.0099 grad_norm:0.0483 train_time:4724204ms step_avg:1389.47ms -step:3450/20000 train_loss:1.9637 grad_norm:0.0369 train_time:4792535ms step_avg:1389.14ms -step:3456/20000 val_loss:1.9505 val_bpb:1.1552 train_time:4800814ms step_avg:1389.12ms -stopping_early: wallclock_cap train_time:4800814ms step:3456/20000 -peak memory allocated: 50545 MiB reserved: 50594 MiB -ema:applying EMA weights -DIAGNOSTIC post_ema val_loss:1.9472 val_bpb:1.1532 eval_time:32839ms -Serialized model: 106023671 bytes -Code size: 102633 bytes -Serialized model int6+lzma: 16373548 bytes -Total submission size int6+lzma: 16476181 bytes -final_int6_roundtrip val_loss:1.9574 val_bpb:1.1593 eval_time:39862ms -final_int6_roundtrip_exact val_loss:1.95735252 val_bpb:1.15925441 -final_int6_sliding_window val_loss:1.9164 val_bpb:1.1350 stride:64 eval_time:1105486ms -final_int6_sliding_window_exact val_loss:1.91642779 val_bpb:1.13501949 -final_int8_zlib_roundtrip_exact val_loss:1.91642779 val_bpb:1.13501949 -ttt_sliding:start chunks=1893 chunk_tokens=32768 total_windows=969088 stride=64 ttt_lr=0.002 ttt_epochs=3 freeze_blocks=2 -ttt_sliding:params unfrozen=26923088 frozen=4112 - ttt_chunk [1/1893] bpb=1.226275 time=1.9s - ttt_chunk [11/1893] bpb=1.128206 time=20.5s - ttt_chunk [21/1893] bpb=1.137378 time=39.0s - ttt_chunk [31/1893] bpb=1.142175 time=57.6s - ttt_chunk [41/1893] bpb=1.138228 time=76.1s - ttt_chunk [51/1893] bpb=1.139877 time=94.6s - ttt_chunk [61/1893] bpb=1.143695 time=113.2s - ttt_chunk [71/1893] bpb=1.141806 time=131.7s - ttt_chunk [81/1893] bpb=1.138175 time=150.2s - ttt_chunk [91/1893] bpb=1.137107 time=168.8s - ttt_chunk [101/1893] bpb=1.138115 time=187.3s - ttt_chunk [111/1893] bpb=1.138295 time=205.9s - ttt_chunk [121/1893] bpb=1.134671 time=224.4s - ttt_chunk [131/1893] bpb=1.133939 time=242.9s - ttt_chunk [141/1893] bpb=1.132766 time=261.5s - ttt_chunk [151/1893] bpb=1.132980 time=280.0s - ttt_chunk [161/1893] bpb=1.133800 time=298.6s - ttt_chunk [171/1893] bpb=1.135874 time=317.1s - ttt_chunk [181/1893] bpb=1.135884 time=335.6s - ttt_chunk [191/1893] bpb=1.138340 time=354.2s - ttt_chunk [201/1893] 
bpb=1.137866 time=372.7s - ttt_chunk [211/1893] bpb=1.136957 time=391.2s - ttt_chunk [221/1893] bpb=1.137842 time=409.8s - ttt_chunk [231/1893] bpb=1.137565 time=428.3s - ttt_chunk [241/1893] bpb=1.137849 time=446.8s - ttt_chunk [251/1893] bpb=1.137360 time=465.4s - ttt_chunk [261/1893] bpb=1.136692 time=483.9s - ttt_chunk [271/1893] bpb=1.135780 time=502.5s - ttt_chunk [281/1893] bpb=1.137389 time=521.0s - ttt_chunk [291/1893] bpb=1.137018 time=539.5s - ttt_chunk [301/1893] bpb=1.137918 time=558.1s - ttt_chunk [311/1893] bpb=1.138001 time=576.6s - ttt_chunk [321/1893] bpb=1.138708 time=595.1s - ttt_chunk [331/1893] bpb=1.138179 time=613.7s - ttt_chunk [341/1893] bpb=1.137832 time=632.2s - ttt_chunk [351/1893] bpb=1.138543 time=650.8s - ttt_chunk [361/1893] bpb=1.139301 time=669.3s - ttt_chunk [371/1893] bpb=1.139185 time=687.8s - ttt_chunk [381/1893] bpb=1.138924 time=706.4s - ttt_chunk [391/1893] bpb=1.139607 time=724.9s - ttt_chunk [401/1893] bpb=1.139172 time=743.4s - ttt_chunk [411/1893] bpb=1.138218 time=762.0s - ttt_chunk [421/1893] bpb=1.138334 time=780.5s - ttt_chunk [431/1893] bpb=1.138777 time=799.1s - ttt_chunk [441/1893] bpb=1.138161 time=817.6s - ttt_chunk [451/1893] bpb=1.138301 time=836.1s - ttt_chunk [461/1893] bpb=1.138190 time=854.7s - ttt_chunk [471/1893] bpb=1.137746 time=873.2s - ttt_chunk [481/1893] bpb=1.137597 time=891.8s - ttt_chunk [491/1893] bpb=1.137722 time=910.3s - ttt_chunk [501/1893] bpb=1.137492 time=928.8s - ttt_chunk [511/1893] bpb=1.137017 time=947.4s - ttt_chunk [521/1893] bpb=1.136714 time=965.9s - ttt_chunk [531/1893] bpb=1.137443 time=984.5s - ttt_chunk [541/1893] bpb=1.137557 time=1003.0s - ttt_chunk [551/1893] bpb=1.137019 time=1021.5s - ttt_chunk [561/1893] bpb=1.136885 time=1040.1s - ttt_chunk [571/1893] bpb=1.136621 time=1058.6s - ttt_chunk [581/1893] bpb=1.136257 time=1077.2s - ttt_chunk [591/1893] bpb=1.135719 time=1095.7s - ttt_chunk [601/1893] bpb=1.135711 time=1114.2s - ttt_chunk [611/1893] bpb=1.135386 time=1132.8s - ttt_chunk [621/1893] bpb=1.135235 time=1151.3s - ttt_chunk [631/1893] bpb=1.134973 time=1169.9s - ttt_chunk [641/1893] bpb=1.134519 time=1188.4s - ttt_chunk [651/1893] bpb=1.134057 time=1206.9s - ttt_chunk [661/1893] bpb=1.133947 time=1225.5s - ttt_chunk [671/1893] bpb=1.133482 time=1244.0s - ttt_chunk [681/1893] bpb=1.132918 time=1262.6s - ttt_chunk [691/1893] bpb=1.132994 time=1281.1s - ttt_chunk [701/1893] bpb=1.132163 time=1299.6s - ttt_chunk [711/1893] bpb=1.132176 time=1318.2s - ttt_chunk [721/1893] bpb=1.132090 time=1336.7s - ttt_chunk [731/1893] bpb=1.132331 time=1355.2s - ttt_chunk [741/1893] bpb=1.132205 time=1373.8s - ttt_chunk [751/1893] bpb=1.131884 time=1392.3s - ttt_chunk [761/1893] bpb=1.132028 time=1410.8s - ttt_chunk [771/1893] bpb=1.131860 time=1429.4s - ttt_chunk [781/1893] bpb=1.132024 time=1447.9s - ttt_chunk [791/1893] bpb=1.131869 time=1466.4s - ttt_chunk [801/1893] bpb=1.131804 time=1485.0s - ttt_chunk [811/1893] bpb=1.131817 time=1503.5s - ttt_chunk [821/1893] bpb=1.131702 time=1522.1s - ttt_chunk [831/1893] bpb=1.131418 time=1540.6s - ttt_chunk [841/1893] bpb=1.131180 time=1559.1s - ttt_chunk [851/1893] bpb=1.131241 time=1577.7s - ttt_chunk [861/1893] bpb=1.131312 time=1596.2s - ttt_chunk [871/1893] bpb=1.131521 time=1614.7s - ttt_chunk [881/1893] bpb=1.131519 time=1633.3s - ttt_chunk [891/1893] bpb=1.130978 time=1651.8s - ttt_chunk [901/1893] bpb=1.130995 time=1670.3s - ttt_chunk [911/1893] bpb=1.130849 time=1688.9s - ttt_chunk [921/1893] bpb=1.130984 time=1707.4s - ttt_chunk [931/1893] 
bpb=1.130928 time=1726.0s - ttt_chunk [941/1893] bpb=1.131129 time=1744.5s - ttt_chunk [951/1893] bpb=1.131431 time=1763.0s - ttt_chunk [961/1893] bpb=1.131741 time=1781.6s - ttt_chunk [971/1893] bpb=1.132107 time=1800.1s - ttt_chunk [981/1893] bpb=1.132319 time=1818.6s - ttt_chunk [991/1893] bpb=1.132236 time=1837.2s - ttt_chunk [1001/1893] bpb=1.132567 time=1855.7s - ttt_chunk [1011/1893] bpb=1.132723 time=1874.3s - ttt_chunk [1021/1893] bpb=1.133011 time=1892.8s - ttt_chunk [1031/1893] bpb=1.133400 time=1911.3s - ttt_chunk [1041/1893] bpb=1.133897 time=1929.9s - ttt_chunk [1051/1893] bpb=1.133756 time=1948.4s - ttt_chunk [1061/1893] bpb=1.133865 time=1967.0s - ttt_chunk [1071/1893] bpb=1.134029 time=1985.5s - ttt_chunk [1081/1893] bpb=1.134076 time=2004.1s - ttt_chunk [1091/1893] bpb=1.134326 time=2022.7s - ttt_chunk [1101/1893] bpb=1.134469 time=2041.2s - ttt_chunk [1111/1893] bpb=1.134274 time=2059.8s - ttt_chunk [1121/1893] bpb=1.134049 time=2078.3s - ttt_chunk [1131/1893] bpb=1.133943 time=2096.9s - ttt_chunk [1141/1893] bpb=1.133705 time=2115.4s - ttt_chunk [1151/1893] bpb=1.133733 time=2134.0s - ttt_chunk [1161/1893] bpb=1.133569 time=2152.5s - ttt_chunk [1171/1893] bpb=1.133389 time=2171.1s - ttt_chunk [1181/1893] bpb=1.133164 time=2189.6s - ttt_chunk [1191/1893] bpb=1.133317 time=2208.2s - ttt_chunk [1201/1893] bpb=1.133519 time=2226.8s - ttt_chunk [1211/1893] bpb=1.133117 time=2245.3s - ttt_chunk [1221/1893] bpb=1.133455 time=2263.9s - ttt_chunk [1231/1893] bpb=1.133394 time=2282.4s - ttt_chunk [1241/1893] bpb=1.133104 time=2300.9s - ttt_chunk [1251/1893] bpb=1.132567 time=2319.5s - ttt_chunk [1261/1893] bpb=1.132300 time=2338.0s - ttt_chunk [1271/1893] bpb=1.132047 time=2356.6s - ttt_chunk [1281/1893] bpb=1.131738 time=2375.1s - ttt_chunk [1291/1893] bpb=1.131494 time=2393.7s - ttt_chunk [1301/1893] bpb=1.131443 time=2412.2s - ttt_chunk [1311/1893] bpb=1.131173 time=2430.7s - ttt_chunk [1321/1893] bpb=1.130872 time=2449.3s - ttt_chunk [1331/1893] bpb=1.130632 time=2467.8s - ttt_chunk [1341/1893] bpb=1.130505 time=2486.4s - ttt_chunk [1351/1893] bpb=1.130352 time=2504.9s - ttt_chunk [1361/1893] bpb=1.130484 time=2523.5s - ttt_chunk [1371/1893] bpb=1.130705 time=2542.0s - ttt_chunk [1381/1893] bpb=1.130910 time=2560.5s - ttt_chunk [1391/1893] bpb=1.130695 time=2579.1s - ttt_chunk [1401/1893] bpb=1.130724 time=2597.6s - ttt_chunk [1411/1893] bpb=1.130831 time=2616.2s - ttt_chunk [1421/1893] bpb=1.130815 time=2634.7s - ttt_chunk [1431/1893] bpb=1.130791 time=2653.3s - ttt_chunk [1441/1893] bpb=1.131256 time=2671.8s - ttt_chunk [1451/1893] bpb=1.131119 time=2691.1s - ttt_chunk [1461/1893] bpb=1.131048 time=2709.6s - ttt_chunk [1471/1893] bpb=1.131643 time=2728.2s - ttt_chunk [1481/1893] bpb=1.131517 time=2746.7s - ttt_chunk [1491/1893] bpb=1.131890 time=2765.3s - ttt_chunk [1501/1893] bpb=1.131872 time=2783.8s - ttt_chunk [1511/1893] bpb=1.131833 time=2802.3s - ttt_chunk [1521/1893] bpb=1.131945 time=2820.9s - ttt_chunk [1531/1893] bpb=1.132160 time=2839.4s - ttt_chunk [1541/1893] bpb=1.132230 time=2858.0s - ttt_chunk [1551/1893] bpb=1.132470 time=2876.5s - ttt_chunk [1561/1893] bpb=1.132554 time=2895.1s - ttt_chunk [1571/1893] bpb=1.132686 time=2913.6s - ttt_chunk [1581/1893] bpb=1.132836 time=2932.1s - ttt_chunk [1591/1893] bpb=1.132902 time=2950.7s - ttt_chunk [1601/1893] bpb=1.133020 time=2969.2s - ttt_chunk [1611/1893] bpb=1.133281 time=2987.8s - ttt_chunk [1621/1893] bpb=1.133141 time=3006.3s - ttt_chunk [1631/1893] bpb=1.133187 time=3024.8s - ttt_chunk [1641/1893] 
bpb=1.133212 time=3043.4s - ttt_chunk [1651/1893] bpb=1.133269 time=3061.9s - ttt_chunk [1661/1893] bpb=1.133410 time=3080.5s - ttt_chunk [1671/1893] bpb=1.133595 time=3099.0s - ttt_chunk [1681/1893] bpb=1.133686 time=3117.5s - ttt_chunk [1691/1893] bpb=1.133787 time=3136.1s - ttt_chunk [1701/1893] bpb=1.133884 time=3154.6s - ttt_chunk [1711/1893] bpb=1.133862 time=3173.2s - ttt_chunk [1721/1893] bpb=1.133701 time=3191.7s - ttt_chunk [1731/1893] bpb=1.133797 time=3210.2s - ttt_chunk [1741/1893] bpb=1.133534 time=3228.8s - ttt_chunk [1751/1893] bpb=1.133407 time=3247.3s - ttt_chunk [1761/1893] bpb=1.133444 time=3265.9s - ttt_chunk [1771/1893] bpb=1.133395 time=3284.4s - ttt_chunk [1781/1893] bpb=1.133298 time=3303.0s - ttt_chunk [1791/1893] bpb=1.132959 time=3321.5s - ttt_chunk [1801/1893] bpb=1.132941 time=3340.0s - ttt_chunk [1811/1893] bpb=1.132795 time=3358.6s - ttt_chunk [1821/1893] bpb=1.132853 time=3377.1s - ttt_chunk [1831/1893] bpb=1.132699 time=3395.7s - ttt_chunk [1841/1893] bpb=1.132738 time=3414.2s - ttt_chunk [1851/1893] bpb=1.132559 time=3432.7s - ttt_chunk [1861/1893] bpb=1.132478 time=3451.3s - ttt_chunk [1871/1893] bpb=1.132413 time=3469.8s - ttt_chunk [1881/1893] bpb=1.132170 time=3488.4s - ttt_chunk [1891/1893] bpb=1.132153 time=3506.9s - ttt_chunk [1893/1893] bpb=1.132184 time=3509.9s -ttt_sliding:done val_loss=1.911640 val_bpb=1.132184 elapsed=3510.0s -legal_ttt val_loss:1.9116 val_bpb:1.1322 eval_time:3510399ms -legal_ttt_exact val_loss:1.91163996 val_bpb:1.13218386 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/requirements.txt deleted file mode 100644 index e3d59eea39..0000000000 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/requirements.txt +++ /dev/null diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb/files/output.log deleted file mode 100644 index 804e8ee9f1..0000000000 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb/files/output.log +++ /dev/null @@ -1,67 +0,0 @@ -wandb:initialized -Traceback (most recent call last): - File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2148, in <module> - main() - File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1813, in main - warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__ - return super().__call__(*args, **kwargs) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl - return self._call_impl(*args, **kwargs) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl - return forward_call(*args, **kwargs) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper - return fn(*args, **kwargs) - ^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl - return self._call_impl(*args, **kwargs) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl - return forward_call(*args, **kwargs) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1056, in forward - def forward(self, input_ids: Tensor, target_ids: Tensor, - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn - return fn(*args, **kwargs) - ^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1200, in forward - return compiled_fn(full_args) - ^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 566, in runtime_wrapper - all_outs = call_func_at_runtime_with_args( - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File
"/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args - out = normalize_as_list(f(args)) - ^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 105, in g - return f(*args) - ^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 596, in apply - return super().apply(*args, **kwargs) # type: ignore[misc] - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2503, in forward - fw_outs = call_func_at_runtime_with_args( - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args - out = normalize_as_list(f(args)) - ^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 783, in wrapper - return compiled_fn(runtime_args) - ^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1011, in inner_fn - outs = compiled_fn(args) - ^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__ - return self.current_callable(inputs) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run - out = model(new_inputs) - ^^^^^^^^^^^^^^^^^ - File "/tmp/torchinductor_nesta/3d/c3dzegwqai5cgqo2lurmxd3pfex6kuqjwm3ooueib5swt7sgm5kg.py", line 7110, in call - buf5 = empty_strided_cuda((98304, 512), (512, 1), torch.bfloat16) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 18.69 MiB is free. Process 448538 has 758.00 MiB memory in use. Including non-PyTorch memory, this process has 758.00 MiB memory in use. Process 448534 has 51.16 GiB memory in use. Process 448547 has 38.69 GiB memory in use. Process 448537 has 48.42 GiB memory in use. Of the allocated memory 127.50 MiB is allocated by PyTorch, and 18.50 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf) diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb/files/requirements.txt deleted file mode 100644 index e3d59eea39..0000000000 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb/files/requirements.txt +++ /dev/null diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a/files/output.log deleted file mode 100644 index 90f71e3c3b..0000000000 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a/files/output.log +++ /dev/null @@ -1,67 +0,0 @@ -wandb:initialized -Traceback (most recent call last): - File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2148, in <module> - main() - File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1813, in main - warmup_loss = model(x, y,
feedback_fn=feedback_fn, stabilizer=stabilizer) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__ - return super().__call__(*args, **kwargs) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl - return self._call_impl(*args, **kwargs) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl - return forward_call(*args, **kwargs) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper - return fn(*args, **kwargs) - ^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl - return self._call_impl(*args, **kwargs) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl - return forward_call(*args, **kwargs) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1056, in forward - def forward(self, input_ids: Tensor, target_ids: Tensor, - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn - return fn(*args, **kwargs) - ^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1200, in forward - return compiled_fn(full_args) - ^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 566, in runtime_wrapper - all_outs = call_func_at_runtime_with_args( - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args - out = normalize_as_list(f(args)) - ^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 105, in g - return f(*args) - ^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 596, in apply - return super().apply(*args, **kwargs) # type: ignore[misc] - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2503, in forward - fw_outs = call_func_at_runtime_with_args( - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args - out = normalize_as_list(f(args)) - ^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 783, in wrapper - return compiled_fn(runtime_args) - ^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1011, in inner_fn - outs = compiled_fn(args) - ^^^^^^^^^^^^^^^^^ - File 
"/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__ - return self.current_callable(inputs) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run - out = model(new_inputs) - ^^^^^^^^^^^^^^^^^ - File "/tmp/torchinductor_nesta/3d/c3dzegwqai5cgqo2lurmxd3pfex6kuqjwm3ooueib5swt7sgm5kg.py", line 7110, in call - buf5 = empty_strided_cuda((98304, 512), (512, 1), torch.bfloat16) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 20.69 MiB is free. Including non-PyTorch memory, this process has 758.00 MiB memory in use. Process 448539 has 756.00 MiB memory in use. Process 448534 has 51.16 GiB memory in use. Process 448547 has 38.69 GiB memory in use. Process 448537 has 48.42 GiB memory in use. Of the allocated memory 127.50 MiB is allocated by PyTorch, and 18.50 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf) diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a/files/requirements.txt deleted file mode 100644 index e3d59eea39..0000000000 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a/files/requirements.txt +++ /dev/null @@ -1,101 +0,0 @@ -mdurl==0.1.2 -nvidia-cudnn-cu13==9.19.0.56 -aiohttp==3.13.3 -nvidia-cufile==1.15.1.6 -charset-normalizer==3.4.6 -Jinja2==3.1.6 -hf-xet==1.4.2 -nvidia-cuda-nvrtc-cu12==12.8.93 -typer==0.24.1 -attrs==26.1.0 -certifi==2026.2.25 -triton==3.6.0 -nvidia-nccl-cu12==2.29.7 -wheel==0.46.3 -nvidia-nvtx-cu12==12.8.90 -gitdb==4.0.12 -dill==0.4.1 -nvidia-cuda-cupti==13.0.85 -tqdm==4.67.3 -pandas==3.0.1 -PyYAML==6.0.3 -annotated-types==0.7.0 -annotated-doc==0.0.4 -nvidia-nccl-cu13==2.28.9 -nvidia-cufft-cu12==11.3.3.83 -nvidia-cuda-nvrtc==13.0.88 -nvidia-cudnn-cu12==9.20.0.48 -httpx==0.28.1 -packaging==26.0 -einops==0.8.2 -xxhash==3.6.0 -huggingface_hub==1.8.0 -Pygments==2.19.2 -markdown-it-py==4.0.0 -pydantic_core==2.41.5 -nvidia-cusparse-cu12==12.5.8.93 -cuda-toolkit==13.0.2 -rich==14.3.3 -six==1.17.0 -python-dateutil==2.9.0.post0 -nvidia-cusolver==12.0.4.66 -nvidia-nvshmem-cu13==3.4.5 -setuptools==81.0.0 -pyarrow==23.0.1 -typing_extensions==4.15.0 -MarkupSafe==3.0.3 -smmap==5.0.3 -filelock==3.25.2 -nvidia-nvtx==13.0.85 -multiprocess==0.70.19 -networkx==3.6.1 -pydantic==2.12.5 -nvidia-nvshmem-cu12==3.4.5 -nvidia-cublas-cu12==12.8.4.1 -anyio==4.13.0 -nvidia-cufft==12.0.0.61 -cuda-pathfinder==1.5.0 -mpmath==1.3.0 -cuda-bindings==13.2.0 -propcache==0.4.1 -yarl==1.23.0 -ninja==1.13.0 -typing-inspection==0.4.2 -idna==3.11 -h11==0.16.0 -urllib3==2.6.3 -multidict==6.7.1 -aiosignal==1.4.0 -nvidia-nvjitlink-cu12==12.8.93 -nvidia-cusparse==12.6.3.3 -aiohappyeyeballs==2.6.1 -psutil==7.2.2 -wandb==0.25.1 -protobuf==6.33.6 -click==8.3.1 -nvidia-cufile-cu12==1.13.1.3 -httpcore==1.0.9 -sentencepiece==0.2.1 -fsspec==2026.2.0 -nvidia-curand-cu12==10.3.9.90 -nvidia-curand==10.4.0.35 -GitPython==3.1.46 -pip==26.0.1 
-platformdirs==4.9.4 -nvidia-cublas==13.1.0.3 -nvidia-cuda-cupti-cu12==12.8.90 -flash_attn==2.8.3 -nvidia-cusolver-cu12==11.7.3.90 -sympy==1.14.0 -torch==2.11.0 -numpy==2.4.3 -nvidia-cuda-runtime-cu12==12.8.90 -nvidia-cusparselt-cu13==0.8.0 -frozenlist==1.8.0 -sentry-sdk==2.56.0 -requests==2.33.0 -nvidia-cuda-runtime==13.0.96 -nvidia-nvjitlink==13.0.88 -nvidia-cusparselt-cu12==0.7.1 -shellingham==1.5.4 -datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/output.log deleted file mode 100644 index 9a78b745cc..0000000000 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/output.log +++ /dev/null @@ -1,67 +0,0 @@ -wandb:initialized -Traceback (most recent call last): - File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2148, in - main() - File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1813, in main - warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__ - return super().__call__(*args, **kwargs) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl - return self._call_impl(*args, **kwargs) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl - return forward_call(*args, **kwargs) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper - return fn(*args, **kwargs) - ^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl - return self._call_impl(*args, **kwargs) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl - return forward_call(*args, **kwargs) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1056, in forward - def forward(self, input_ids: Tensor, target_ids: Tensor, - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn - return fn(*args, **kwargs) - ^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1200, in forward - return compiled_fn(full_args) - ^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 566, in runtime_wrapper - all_outs = call_func_at_runtime_with_args( - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args - out = normalize_as_list(f(args)) - ^^^^^^^ - File 
"/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 105, in g - return f(*args) - ^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 596, in apply - return super().apply(*args, **kwargs) # type: ignore[misc] - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2503, in forward - fw_outs = call_func_at_runtime_with_args( - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args - out = normalize_as_list(f(args)) - ^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 783, in wrapper - return compiled_fn(runtime_args) - ^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1011, in inner_fn - outs = compiled_fn(args) - ^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__ - return self.current_callable(inputs) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run - out = model(new_inputs) - ^^^^^^^^^^^^^^^^^ - File "/tmp/torchinductor_nesta/3d/c3dzegwqai5cgqo2lurmxd3pfex6kuqjwm3ooueib5swt7sgm5kg.py", line 9077, in call - buf776 = empty_strided_cuda((48, 2048, 512), (1048576, 512, 1), torch.bfloat16) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 24.69 MiB is free. Process 448538 has 754.00 MiB memory in use. Process 448539 has 756.00 MiB memory in use. Process 448534 has 51.16 GiB memory in use. Including non-PyTorch memory, this process has 38.69 GiB memory in use. Process 448537 has 48.42 GiB memory in use. Of the allocated memory 38.01 GiB is allocated by PyTorch, and 15.38 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf) diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/requirements.txt deleted file mode 100644 index e3d59eea39..0000000000 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/requirements.txt +++ /dev/null diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/output.log deleted file mode 100644 index ca77ba8372..0000000000 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/output.log +++ /dev/null @@ -1,41 +0,0 @@ -wandb:initialized -Traceback (most recent call last): - File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2148, in <module> - main() - File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1814, in main - (warmup_loss * 
grad_scale).backward() - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_tensor.py", line 631, in backward - torch.autograd.backward( - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/__init__.py", line 381, in backward - _engine_run_backward( - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/graph.py", line 869, in _engine_run_backward - return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 317, in apply - return user_fn(self, *args) - ^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2748, in backward - return impl_fn() - ^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2734, in impl_fn - out = CompiledFunction._backward_impl(ctx, all_args) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2918, in _backward_impl - out = call_func_at_runtime_with_args( - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args - out = normalize_as_list(f(args)) - ^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn - return fn(*args, **kwargs) - ^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__ - return self.current_callable(inputs) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run - out = model(new_inputs) - ^^^^^^^^^^^^^^^^^ - File "/tmp/torchinductor_nesta/wm/cwmtq2g54vzhxzvsc4odvxx7srlroi2tosqcftjhmyw2c637ogrq.py", line 12255, in call - buf54 = empty_strided_cuda((48, 2048, 512), (1048576, 512, 1), torch.bfloat16) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 18.69 MiB is free. Process 448538 has 758.00 MiB memory in use. Process 448539 has 758.00 MiB memory in use. Process 448534 has 51.16 GiB memory in use. Process 448547 has 38.69 GiB memory in use. Including non-PyTorch memory, this process has 48.42 GiB memory in use. Of the allocated memory 47.76 GiB is allocated by PyTorch, and 3.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf) diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/requirements.txt deleted file mode 100644 index e3d59eea39..0000000000 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/requirements.txt +++ /dev/null diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/output.log b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/output.log deleted file mode 100644 index 50a741a513..0000000000 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/output.log +++ /dev/null @@ -1,379 +0,0 @@ -wandb:initialized -warmup_step:1/20 -warmup_step:2/20 -warmup_step:3/20 -warmup_step:4/20 -warmup_step:5/20 -warmup_step:6/20 -warmup_step:7/20 -warmup_step:8/20 -warmup_step:9/20 -warmup_step:10/20 -warmup_step:11/20 -warmup_step:12/20 -warmup_step:13/20 -warmup_step:14/20 -warmup_step:15/20 -warmup_step:16/20 -warmup_step:17/20 -warmup_step:18/20 -warmup_step:19/20
-warmup_step:20/20 -step:0/20000 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms -step:1/20000 train_loss:6.9310 grad_norm:0.3715 train_time:739ms step_avg:739.22ms -step:2/20000 train_loss:8.4759 grad_norm:3.5698 train_time:1441ms step_avg:720.50ms -step:3/20000 train_loss:7.5787 grad_norm:2.0259 train_time:2207ms step_avg:735.61ms -step:4/20000 train_loss:7.3563 grad_norm:1.4981 train_time:2972ms step_avg:742.93ms -step:5/20000 train_loss:7.1725 grad_norm:1.6464 train_time:3729ms step_avg:745.72ms -step:6/20000 train_loss:7.1055 grad_norm:1.5402 train_time:4489ms step_avg:748.20ms -step:7/20000 train_loss:7.0940 grad_norm:1.7366 train_time:5253ms step_avg:750.42ms -step:8/20000 train_loss:6.9891 grad_norm:1.2439 train_time:6012ms step_avg:751.51ms -step:9/20000 train_loss:6.6063 grad_norm:0.9271 train_time:6770ms step_avg:752.25ms -step:10/20000 train_loss:6.2335 grad_norm:0.8702 train_time:7539ms step_avg:753.87ms -step:50/20000 train_loss:3.7232 grad_norm:0.7000 train_time:38194ms step_avg:763.87ms -step:100/20000 train_loss:3.2045 grad_norm:0.8824 train_time:76601ms step_avg:766.01ms -step:150/20000 train_loss:2.8496 grad_norm:0.3729 train_time:114956ms step_avg:766.37ms -step:200/20000 train_loss:2.6184 grad_norm:0.3442 train_time:153346ms step_avg:766.73ms -step:250/20000 train_loss:2.6059 grad_norm:0.2754 train_time:191710ms step_avg:766.84ms -step:300/20000 train_loss:2.4690 grad_norm:0.3187 train_time:230112ms step_avg:767.04ms -step:350/20000 train_loss:2.5041 grad_norm:0.1730 train_time:268510ms step_avg:767.17ms -step:400/20000 train_loss:2.4246 grad_norm:0.2139 train_time:306906ms step_avg:767.26ms -step:450/20000 train_loss:2.2470 grad_norm:0.1624 train_time:345332ms step_avg:767.40ms -step:500/20000 train_loss:2.3033 grad_norm:0.1807 train_time:383744ms step_avg:767.49ms -step:500/20000 val_loss:2.3232 val_bpb:1.3759 train_time:383791ms step_avg:767.58ms -step:550/20000 train_loss:2.3638 grad_norm:0.1696 train_time:422169ms step_avg:767.58ms -step:600/20000 train_loss:2.2655 grad_norm:0.1474 train_time:460616ms step_avg:767.69ms -step:650/20000 train_loss:2.2434 grad_norm:0.1741 train_time:499080ms step_avg:767.82ms -step:700/20000 train_loss:2.3153 grad_norm:0.1303 train_time:537548ms step_avg:767.93ms -step:750/20000 train_loss:2.2883 grad_norm:0.0978 train_time:576022ms step_avg:768.03ms -step:800/20000 train_loss:2.2638 grad_norm:0.0930 train_time:614511ms step_avg:768.14ms -step:850/20000 train_loss:2.1916 grad_norm:0.0668 train_time:653031ms step_avg:768.27ms -step:900/20000 train_loss:2.1043 grad_norm:0.0706 train_time:691568ms step_avg:768.41ms -step:950/20000 train_loss:2.3072 grad_norm:0.0614 train_time:730098ms step_avg:768.52ms -step:1000/20000 train_loss:2.2373 grad_norm:0.0756 train_time:768630ms step_avg:768.63ms -step:1000/20000 val_loss:2.1827 val_bpb:1.2927 train_time:768677ms step_avg:768.68ms -step:1050/20000 train_loss:2.1577 grad_norm:0.0518 train_time:807163ms step_avg:768.73ms -step:1100/20000 train_loss:2.1862 grad_norm:0.1172 train_time:845684ms step_avg:768.80ms -step:1150/20000 train_loss:2.1385 grad_norm:0.0562 train_time:884226ms step_avg:768.89ms -step:1200/20000 train_loss:2.1870 grad_norm:0.1121 train_time:922748ms step_avg:768.96ms -step:1250/20000 train_loss:2.2093 grad_norm:0.0998 train_time:961291ms step_avg:769.03ms -step:1300/20000 train_loss:2.1788 grad_norm:0.0968 train_time:999829ms step_avg:769.10ms -step:1350/20000 train_loss:2.1574 grad_norm:0.1471 train_time:1038366ms step_avg:769.16ms -step:1400/20000 
train_loss:2.1680 grad_norm:0.0608 train_time:1076900ms step_avg:769.21ms -step:1450/20000 train_loss:2.1623 grad_norm:0.0881 train_time:1115450ms step_avg:769.28ms -step:1500/20000 train_loss:2.1325 grad_norm:0.0715 train_time:1153976ms step_avg:769.32ms -step:1500/20000 val_loss:2.1196 val_bpb:1.2554 train_time:1154023ms step_avg:769.35ms -step:1550/20000 train_loss:2.1034 grad_norm:0.0555 train_time:1192498ms step_avg:769.35ms -step:1600/20000 train_loss:2.1841 grad_norm:0.0690 train_time:1231045ms step_avg:769.40ms -step:1650/20000 train_loss:1.9679 grad_norm:0.0623 train_time:1269584ms step_avg:769.44ms -step:1700/20000 train_loss:2.0979 grad_norm:0.1024 train_time:1308158ms step_avg:769.50ms -step:1750/20000 train_loss:2.0668 grad_norm:0.0916 train_time:1346716ms step_avg:769.55ms -step:1800/20000 train_loss:2.1070 grad_norm:0.0947 train_time:1385264ms step_avg:769.59ms -step:1850/20000 train_loss:2.1232 grad_norm:0.0632 train_time:1423808ms step_avg:769.63ms -step:1900/20000 train_loss:2.0734 grad_norm:0.0745 train_time:1462341ms step_avg:769.65ms -step:1950/20000 train_loss:2.0600 grad_norm:0.1123 train_time:1500897ms step_avg:769.69ms -step:2000/20000 train_loss:2.3179 grad_norm:0.0744 train_time:1539440ms step_avg:769.72ms -step:2000/20000 val_loss:2.0970 val_bpb:1.2420 train_time:1539487ms step_avg:769.74ms -step:2050/20000 train_loss:2.0834 grad_norm:0.0576 train_time:1577995ms step_avg:769.75ms -step:2100/20000 train_loss:2.0619 grad_norm:0.0535 train_time:1616543ms step_avg:769.78ms -step:2150/20000 train_loss:2.0442 grad_norm:0.0773 train_time:1655080ms step_avg:769.80ms -step:2200/20000 train_loss:2.2001 grad_norm:0.0771 train_time:1693618ms step_avg:769.83ms -step:2250/20000 train_loss:2.0929 grad_norm:0.0580 train_time:1732173ms step_avg:769.85ms -step:2300/20000 train_loss:2.0744 grad_norm:0.0679 train_time:1770710ms step_avg:769.87ms -step:2350/20000 train_loss:2.0331 grad_norm:0.1111 train_time:1809258ms step_avg:769.90ms -step:2400/20000 train_loss:2.1487 grad_norm:0.0711 train_time:1847814ms step_avg:769.92ms -step:2450/20000 train_loss:2.1108 grad_norm:0.0571 train_time:1886353ms step_avg:769.94ms -step:2500/20000 train_loss:2.0727 grad_norm:0.0554 train_time:1924887ms step_avg:769.95ms -step:2500/20000 val_loss:2.0778 val_bpb:1.2306 train_time:1924935ms step_avg:769.97ms -step:2550/20000 train_loss:2.0748 grad_norm:0.0686 train_time:1963451ms step_avg:769.98ms -step:2600/20000 train_loss:2.0557 grad_norm:0.0604 train_time:2001997ms step_avg:770.00ms -step:2650/20000 train_loss:2.0616 grad_norm:0.0686 train_time:2040534ms step_avg:770.01ms -step:2700/20000 train_loss:2.0896 grad_norm:0.1260 train_time:2079065ms step_avg:770.02ms -step:2750/20000 train_loss:2.0721 grad_norm:0.0683 train_time:2117609ms step_avg:770.04ms -step:2800/20000 train_loss:2.1120 grad_norm:0.0617 train_time:2156137ms step_avg:770.05ms -step:2850/20000 train_loss:2.0672 grad_norm:0.0618 train_time:2194684ms step_avg:770.06ms -step:2900/20000 train_loss:2.0771 grad_norm:0.0606 train_time:2233233ms step_avg:770.08ms -step:2950/20000 train_loss:2.1240 grad_norm:0.0555 train_time:2271768ms step_avg:770.09ms -step:3000/20000 train_loss:2.0159 grad_norm:0.1451 train_time:2310313ms step_avg:770.10ms -step:3000/20000 val_loss:2.0668 val_bpb:1.2241 train_time:2310360ms step_avg:770.12ms -step:3050/20000 train_loss:2.0182 grad_norm:0.0595 train_time:2348849ms step_avg:770.11ms -step:3100/20000 train_loss:2.0905 grad_norm:0.1292 train_time:2387390ms step_avg:770.13ms -step:3150/20000 train_loss:2.1075 
grad_norm:0.0625 train_time:2425926ms step_avg:770.14ms -step:3200/20000 train_loss:2.0823 grad_norm:0.0562 train_time:2464469ms step_avg:770.15ms -step:3250/20000 train_loss:2.0517 grad_norm:0.0675 train_time:2502996ms step_avg:770.15ms -step:3300/20000 train_loss:2.0328 grad_norm:0.0879 train_time:2541545ms step_avg:770.17ms -step:3350/20000 train_loss:2.0720 grad_norm:0.0532 train_time:2580093ms step_avg:770.18ms -step:3400/20000 train_loss:2.1303 grad_norm:0.1521 train_time:2618607ms step_avg:770.18ms -step:3450/20000 train_loss:2.0795 grad_norm:0.0880 train_time:2657152ms step_avg:770.19ms -step:3500/20000 train_loss:2.0592 grad_norm:0.0568 train_time:2695692ms step_avg:770.20ms -step:3500/20000 val_loss:2.0566 val_bpb:1.2180 train_time:2695739ms step_avg:770.21ms -step:3550/20000 train_loss:2.0263 grad_norm:0.1106 train_time:2734239ms step_avg:770.21ms -step:3600/20000 train_loss:2.0242 grad_norm:0.0546 train_time:2772777ms step_avg:770.22ms -step:3650/20000 train_loss:2.0434 grad_norm:0.0792 train_time:2811292ms step_avg:770.22ms -step:3700/20000 train_loss:2.0376 grad_norm:0.0698 train_time:2849835ms step_avg:770.23ms -step:3750/20000 train_loss:2.0472 grad_norm:0.0748 train_time:2888385ms step_avg:770.24ms -step:3800/20000 train_loss:2.0429 grad_norm:0.0871 train_time:2926943ms step_avg:770.25ms -step:3850/20000 train_loss:2.0756 grad_norm:0.0799 train_time:2965493ms step_avg:770.26ms -step:3900/20000 train_loss:2.0692 grad_norm:0.0643 train_time:3004035ms step_avg:770.27ms -step:3950/20000 train_loss:2.0267 grad_norm:0.0667 train_time:3042573ms step_avg:770.27ms -step:4000/20000 train_loss:2.0478 grad_norm:0.1092 train_time:3081111ms step_avg:770.28ms -step:4000/20000 val_loss:2.0501 val_bpb:1.2142 train_time:3081158ms step_avg:770.29ms -step:4050/20000 train_loss:2.0524 grad_norm:0.0595 train_time:3119653ms step_avg:770.28ms -step:4100/20000 train_loss:1.9268 grad_norm:0.0611 train_time:3158208ms step_avg:770.29ms -step:4150/20000 train_loss:2.0567 grad_norm:0.0683 train_time:3196731ms step_avg:770.30ms -step:4200/20000 train_loss:2.0971 grad_norm:0.0701 train_time:3235281ms step_avg:770.30ms -step:4250/20000 train_loss:2.0465 grad_norm:0.0670 train_time:3273811ms step_avg:770.31ms -step:4300/20000 train_loss:2.0333 grad_norm:0.0543 train_time:3312336ms step_avg:770.31ms -step:4350/20000 train_loss:2.0258 grad_norm:0.0616 train_time:3350887ms step_avg:770.32ms -step:4400/20000 train_loss:2.0344 grad_norm:0.0614 train_time:3389423ms step_avg:770.32ms -step:4450/20000 train_loss:2.0645 grad_norm:0.0506 train_time:3427944ms step_avg:770.32ms -step:4500/20000 train_loss:2.0679 grad_norm:0.0550 train_time:3466466ms step_avg:770.33ms -step:4500/20000 val_loss:2.0466 val_bpb:1.2121 train_time:3466512ms step_avg:770.34ms -step:4550/20000 train_loss:2.0316 grad_norm:0.0607 train_time:3505002ms step_avg:770.33ms -step:4600/20000 train_loss:1.9490 grad_norm:0.0543 train_time:3543530ms step_avg:770.33ms -step:4650/20000 train_loss:2.0275 grad_norm:0.0590 train_time:3582052ms step_avg:770.33ms -step:4700/20000 train_loss:2.0597 grad_norm:0.1022 train_time:3620609ms step_avg:770.34ms -step:4750/20000 train_loss:2.0241 grad_norm:0.1000 train_time:3659144ms step_avg:770.35ms -step:4800/20000 train_loss:2.0349 grad_norm:0.0580 train_time:3697672ms step_avg:770.35ms -step:4850/20000 train_loss:2.0473 grad_norm:0.0532 train_time:3736199ms step_avg:770.35ms -step:4900/20000 train_loss:2.0297 grad_norm:0.0634 train_time:3774723ms step_avg:770.35ms -step:4950/20000 train_loss:1.9799 grad_norm:0.0504 
train_time:3813238ms step_avg:770.35ms -step:5000/20000 train_loss:2.0735 grad_norm:0.0549 train_time:3851805ms step_avg:770.36ms -step:5000/20000 val_loss:2.0184 val_bpb:1.1954 train_time:3851852ms step_avg:770.37ms -step:5050/20000 train_loss:1.9940 grad_norm:0.0666 train_time:3890332ms step_avg:770.36ms -step:5100/20000 train_loss:1.9998 grad_norm:0.0478 train_time:3928885ms step_avg:770.37ms -step:5150/20000 train_loss:2.0985 grad_norm:0.0906 train_time:3967424ms step_avg:770.37ms -step:5200/20000 train_loss:2.0041 grad_norm:0.0450 train_time:4005964ms step_avg:770.38ms -step:5250/20000 train_loss:1.9757 grad_norm:0.0451 train_time:4044484ms step_avg:770.38ms -step:5300/20000 train_loss:1.9557 grad_norm:0.0544 train_time:4083005ms step_avg:770.38ms -step:5350/20000 train_loss:1.9972 grad_norm:0.0399 train_time:4121527ms step_avg:770.38ms -step:5400/20000 train_loss:2.0035 grad_norm:0.0433 train_time:4160051ms step_avg:770.38ms -step:5450/20000 train_loss:2.0130 grad_norm:0.0411 train_time:4198561ms step_avg:770.38ms -step:5500/20000 train_loss:2.0100 grad_norm:0.0376 train_time:4237081ms step_avg:770.38ms -step:5500/20000 val_loss:1.9818 val_bpb:1.1737 train_time:4237128ms step_avg:770.39ms -step:5550/20000 train_loss:1.9694 grad_norm:0.0464 train_time:4275608ms step_avg:770.38ms -step:5600/20000 train_loss:1.9396 grad_norm:0.0419 train_time:4314139ms step_avg:770.38ms -step:5650/20000 train_loss:2.0040 grad_norm:0.0377 train_time:4352662ms step_avg:770.38ms -step:5700/20000 train_loss:1.9579 grad_norm:0.0492 train_time:4391196ms step_avg:770.39ms -step:5750/20000 train_loss:1.9341 grad_norm:0.0370 train_time:4429712ms step_avg:770.38ms -step:5800/20000 train_loss:1.8494 grad_norm:0.0476 train_time:4468236ms step_avg:770.39ms -step:5850/20000 train_loss:1.8418 grad_norm:0.0404 train_time:4506763ms step_avg:770.39ms -swa:start step:5900 -step:5900/20000 train_loss:1.9171 grad_norm:0.0399 train_time:4545281ms step_avg:770.39ms -step:5950/20000 train_loss:1.9844 grad_norm:0.0432 train_time:4583947ms step_avg:770.41ms -late_qat:enabled step:5976 scale:0.1496 core_quant:on -step:6000/20000 train_loss:1.9423 grad_norm:0.0376 train_time:4656702ms step_avg:776.12ms -step:6000/20000 val_loss:1.9425 val_bpb:1.1505 train_time:4656808ms step_avg:776.13ms -step:6050/20000 train_loss:1.9037 grad_norm:0.0447 train_time:4695045ms step_avg:776.04ms -step:6100/20000 train_loss:1.9130 grad_norm:0.0341 train_time:4733411ms step_avg:775.97ms -step:6150/20000 train_loss:1.9282 grad_norm:0.0331 train_time:4771676ms step_avg:775.88ms -step:6188/20000 val_loss:1.9310 val_bpb:1.1437 train_time:4800792ms step_avg:775.82ms -stopping_early: wallclock_cap train_time:4800792ms step:6188/20000 -peak memory allocated: 28656 MiB reserved: 28704 MiB -ema:applying EMA weights -DIAGNOSTIC post_ema val_loss:1.9264 val_bpb:1.1409 eval_time:18187ms -Serialized model: 106025719 bytes -Code size: 105268 bytes -Serialized model int6+lzma: 16459152 bytes -Total submission size int6+lzma: 16564420 bytes -final_int6_roundtrip val_loss:1.9355 val_bpb:1.1463 eval_time:36060ms -final_int6_roundtrip_exact val_loss:1.93549859 val_bpb:1.14631129 -final_int6_sliding_window val_loss:1.8956 val_bpb:1.1227 stride:64 eval_time:643423ms -final_int6_sliding_window_exact val_loss:1.89556561 val_bpb:1.12266370 -final_int8_zlib_roundtrip_exact val_loss:1.89556561 val_bpb:1.12266370 -ttt_sliding:start chunks=1893 chunk_tokens=32768 total_windows=969088 stride=64 ttt_lr=0.002 ttt_epochs=3 freeze_blocks=2 -ttt_sliding:params unfrozen=26923598 
frozen=4112 - ttt_chunk [1/1893] bpb=1.213507 time=1.2s - ttt_chunk [11/1893] bpb=1.115005 time=11.9s - ttt_chunk [21/1893] bpb=1.124115 time=22.5s - ttt_chunk [31/1893] bpb=1.129630 time=33.2s - ttt_chunk [41/1893] bpb=1.125661 time=43.9s - ttt_chunk [51/1893] bpb=1.127271 time=54.6s - ttt_chunk [61/1893] bpb=1.131076 time=65.2s - ttt_chunk [71/1893] bpb=1.129445 time=75.9s - ttt_chunk [81/1893] bpb=1.125956 time=86.6s - ttt_chunk [91/1893] bpb=1.125065 time=97.3s - ttt_chunk [101/1893] bpb=1.126017 time=107.9s - ttt_chunk [111/1893] bpb=1.126115 time=118.6s - ttt_chunk [121/1893] bpb=1.122503 time=129.3s - ttt_chunk [131/1893] bpb=1.121744 time=139.9s - ttt_chunk [141/1893] bpb=1.120618 time=150.6s - ttt_chunk [151/1893] bpb=1.120790 time=161.3s - ttt_chunk [161/1893] bpb=1.121623 time=172.0s - ttt_chunk [171/1893] bpb=1.123693 time=182.6s - ttt_chunk [181/1893] bpb=1.123772 time=193.3s - ttt_chunk [191/1893] bpb=1.126204 time=204.0s - ttt_chunk [201/1893] bpb=1.125769 time=214.6s - ttt_chunk [211/1893] bpb=1.124798 time=225.3s - ttt_chunk [221/1893] bpb=1.125694 time=236.0s - ttt_chunk [231/1893] bpb=1.125431 time=246.7s - ttt_chunk [241/1893] bpb=1.125671 time=257.3s - ttt_chunk [251/1893] bpb=1.125230 time=268.0s - ttt_chunk [261/1893] bpb=1.124541 time=278.7s - ttt_chunk [271/1893] bpb=1.123621 time=289.3s - ttt_chunk [281/1893] bpb=1.125199 time=300.0s - ttt_chunk [291/1893] bpb=1.124816 time=310.7s - ttt_chunk [301/1893] bpb=1.125690 time=321.3s - ttt_chunk [311/1893] bpb=1.125780 time=332.0s - ttt_chunk [321/1893] bpb=1.126532 time=342.7s - ttt_chunk [331/1893] bpb=1.125992 time=353.4s - ttt_chunk [341/1893] bpb=1.125619 time=364.0s - ttt_chunk [351/1893] bpb=1.126331 time=374.7s - ttt_chunk [361/1893] bpb=1.127106 time=385.4s - ttt_chunk [371/1893] bpb=1.126994 time=396.0s - ttt_chunk [381/1893] bpb=1.126771 time=406.7s - ttt_chunk [391/1893] bpb=1.127473 time=417.4s - ttt_chunk [401/1893] bpb=1.127019 time=428.0s - ttt_chunk [411/1893] bpb=1.126072 time=438.7s - ttt_chunk [421/1893] bpb=1.126211 time=449.4s - ttt_chunk [431/1893] bpb=1.126627 time=460.1s - ttt_chunk [441/1893] bpb=1.126026 time=470.7s - ttt_chunk [451/1893] bpb=1.126178 time=481.4s - ttt_chunk [461/1893] bpb=1.126064 time=492.1s - ttt_chunk [471/1893] bpb=1.125649 time=502.7s - ttt_chunk [481/1893] bpb=1.125465 time=513.4s - ttt_chunk [491/1893] bpb=1.125641 time=524.1s - ttt_chunk [501/1893] bpb=1.125403 time=534.8s - ttt_chunk [511/1893] bpb=1.124922 time=545.4s - ttt_chunk [521/1893] bpb=1.124606 time=556.1s - ttt_chunk [531/1893] bpb=1.125329 time=566.8s - ttt_chunk [541/1893] bpb=1.125457 time=577.5s - ttt_chunk [551/1893] bpb=1.124939 time=588.1s - ttt_chunk [561/1893] bpb=1.124808 time=598.8s - ttt_chunk [571/1893] bpb=1.124539 time=609.5s - ttt_chunk [581/1893] bpb=1.124189 time=620.1s - ttt_chunk [591/1893] bpb=1.123646 time=630.8s - ttt_chunk [601/1893] bpb=1.123663 time=641.5s - ttt_chunk [611/1893] bpb=1.123360 time=652.2s - ttt_chunk [621/1893] bpb=1.123207 time=662.8s - ttt_chunk [631/1893] bpb=1.122962 time=673.5s - ttt_chunk [641/1893] bpb=1.122517 time=684.2s - ttt_chunk [651/1893] bpb=1.122074 time=694.8s - ttt_chunk [661/1893] bpb=1.121970 time=705.5s - ttt_chunk [671/1893] bpb=1.121500 time=716.2s - ttt_chunk [681/1893] bpb=1.120954 time=726.9s - ttt_chunk [691/1893] bpb=1.121049 time=737.5s - ttt_chunk [701/1893] bpb=1.120237 time=748.2s - ttt_chunk [711/1893] bpb=1.120250 time=758.9s - ttt_chunk [721/1893] bpb=1.120159 time=769.5s - ttt_chunk [731/1893] bpb=1.120390 time=780.2s - ttt_chunk 
[741/1893] bpb=1.120277 time=790.9s - ttt_chunk [751/1893] bpb=1.119980 time=801.6s - ttt_chunk [761/1893] bpb=1.120120 time=812.2s - ttt_chunk [771/1893] bpb=1.119952 time=822.9s - ttt_chunk [781/1893] bpb=1.120123 time=833.6s - ttt_chunk [791/1893] bpb=1.119975 time=844.2s - ttt_chunk [801/1893] bpb=1.119907 time=854.9s - ttt_chunk [811/1893] bpb=1.119919 time=865.6s - ttt_chunk [821/1893] bpb=1.119806 time=876.3s - ttt_chunk [831/1893] bpb=1.119523 time=886.9s - ttt_chunk [841/1893] bpb=1.119272 time=897.6s - ttt_chunk [851/1893] bpb=1.119326 time=908.3s - ttt_chunk [861/1893] bpb=1.119397 time=919.0s - ttt_chunk [871/1893] bpb=1.119597 time=929.6s - ttt_chunk [881/1893] bpb=1.119595 time=940.3s - ttt_chunk [891/1893] bpb=1.119057 time=951.0s - ttt_chunk [901/1893] bpb=1.119066 time=961.6s - ttt_chunk [911/1893] bpb=1.118915 time=972.3s - ttt_chunk [921/1893] bpb=1.119057 time=983.0s - ttt_chunk [931/1893] bpb=1.119000 time=993.7s - ttt_chunk [941/1893] bpb=1.119211 time=1004.3s - ttt_chunk [951/1893] bpb=1.119510 time=1015.0s - ttt_chunk [961/1893] bpb=1.119814 time=1025.7s - ttt_chunk [971/1893] bpb=1.120172 time=1036.4s - ttt_chunk [981/1893] bpb=1.120386 time=1047.0s - ttt_chunk [991/1893] bpb=1.120296 time=1057.7s - ttt_chunk [1001/1893] bpb=1.120622 time=1068.4s - ttt_chunk [1011/1893] bpb=1.120769 time=1079.0s - ttt_chunk [1021/1893] bpb=1.121058 time=1089.7s - ttt_chunk [1031/1893] bpb=1.121451 time=1100.4s - ttt_chunk [1041/1893] bpb=1.121962 time=1111.0s - ttt_chunk [1051/1893] bpb=1.121834 time=1121.7s - ttt_chunk [1061/1893] bpb=1.121931 time=1132.4s - ttt_chunk [1071/1893] bpb=1.122074 time=1143.1s - ttt_chunk [1081/1893] bpb=1.122119 time=1153.7s - ttt_chunk [1091/1893] bpb=1.122380 time=1164.4s - ttt_chunk [1101/1893] bpb=1.122531 time=1175.1s - ttt_chunk [1111/1893] bpb=1.122350 time=1185.7s - ttt_chunk [1121/1893] bpb=1.122124 time=1196.4s - ttt_chunk [1131/1893] bpb=1.122020 time=1207.1s - ttt_chunk [1141/1893] bpb=1.121778 time=1217.7s - ttt_chunk [1151/1893] bpb=1.121805 time=1228.4s - ttt_chunk [1161/1893] bpb=1.121646 time=1239.1s - ttt_chunk [1171/1893] bpb=1.121470 time=1249.7s - ttt_chunk [1181/1893] bpb=1.121248 time=1260.4s - ttt_chunk [1191/1893] bpb=1.121404 time=1271.1s - ttt_chunk [1201/1893] bpb=1.121610 time=1281.8s - ttt_chunk [1211/1893] bpb=1.121209 time=1292.4s - ttt_chunk [1221/1893] bpb=1.121550 time=1303.1s - ttt_chunk [1231/1893] bpb=1.121493 time=1313.8s - ttt_chunk [1241/1893] bpb=1.121200 time=1324.4s - ttt_chunk [1251/1893] bpb=1.120654 time=1335.1s - ttt_chunk [1261/1893] bpb=1.120400 time=1345.8s - ttt_chunk [1271/1893] bpb=1.120154 time=1356.4s - ttt_chunk [1281/1893] bpb=1.119845 time=1367.1s - ttt_chunk [1291/1893] bpb=1.119603 time=1377.8s - ttt_chunk [1301/1893] bpb=1.119559 time=1388.4s - ttt_chunk [1311/1893] bpb=1.119294 time=1399.1s - ttt_chunk [1321/1893] bpb=1.119006 time=1409.8s - ttt_chunk [1331/1893] bpb=1.118778 time=1420.5s - ttt_chunk [1341/1893] bpb=1.118650 time=1431.1s - ttt_chunk [1351/1893] bpb=1.118499 time=1441.8s - ttt_chunk [1361/1893] bpb=1.118620 time=1452.5s - ttt_chunk [1371/1893] bpb=1.118833 time=1463.1s - ttt_chunk [1381/1893] bpb=1.119039 time=1473.8s - ttt_chunk [1391/1893] bpb=1.118831 time=1484.5s - ttt_chunk [1401/1893] bpb=1.118873 time=1495.1s - ttt_chunk [1411/1893] bpb=1.118989 time=1505.8s - ttt_chunk [1421/1893] bpb=1.118980 time=1516.5s - ttt_chunk [1431/1893] bpb=1.118955 time=1527.1s - ttt_chunk [1441/1893] bpb=1.119428 time=1537.8s - ttt_chunk [1451/1893] bpb=1.119298 time=1548.5s - ttt_chunk 
[1461/1893] bpb=1.119224 time=1559.2s - ttt_chunk [1471/1893] bpb=1.119815 time=1569.8s - ttt_chunk [1481/1893] bpb=1.119692 time=1580.5s - ttt_chunk [1491/1893] bpb=1.120064 time=1591.2s - ttt_chunk [1501/1893] bpb=1.120046 time=1601.9s - ttt_chunk [1511/1893] bpb=1.119995 time=1612.5s - ttt_chunk [1521/1893] bpb=1.120109 time=1623.2s - ttt_chunk [1531/1893] bpb=1.120314 time=1633.9s - ttt_chunk [1541/1893] bpb=1.120389 time=1644.6s - ttt_chunk [1551/1893] bpb=1.120624 time=1655.2s - ttt_chunk [1561/1893] bpb=1.120711 time=1665.9s - ttt_chunk [1571/1893] bpb=1.120856 time=1676.6s - ttt_chunk [1581/1893] bpb=1.121011 time=1687.2s - ttt_chunk [1591/1893] bpb=1.121070 time=1697.9s - ttt_chunk [1601/1893] bpb=1.121194 time=1708.6s - ttt_chunk [1611/1893] bpb=1.121454 time=1719.3s - ttt_chunk [1621/1893] bpb=1.121318 time=1729.9s - ttt_chunk [1631/1893] bpb=1.121361 time=1740.6s - ttt_chunk [1641/1893] bpb=1.121385 time=1751.3s - ttt_chunk [1651/1893] bpb=1.121438 time=1762.0s - ttt_chunk [1661/1893] bpb=1.121577 time=1772.6s - ttt_chunk [1671/1893] bpb=1.121754 time=1783.3s - ttt_chunk [1681/1893] bpb=1.121845 time=1794.0s - ttt_chunk [1691/1893] bpb=1.121951 time=1804.6s - ttt_chunk [1701/1893] bpb=1.122049 time=1815.3s - ttt_chunk [1711/1893] bpb=1.122031 time=1826.0s - ttt_chunk [1721/1893] bpb=1.121864 time=1836.7s - ttt_chunk [1731/1893] bpb=1.121961 time=1847.3s - ttt_chunk [1741/1893] bpb=1.121701 time=1858.0s - ttt_chunk [1751/1893] bpb=1.121579 time=1868.7s - ttt_chunk [1761/1893] bpb=1.121622 time=1879.3s - ttt_chunk [1771/1893] bpb=1.121568 time=1890.0s - ttt_chunk [1781/1893] bpb=1.121464 time=1900.7s - ttt_chunk [1791/1893] bpb=1.121129 time=1911.3s - ttt_chunk [1801/1893] bpb=1.121118 time=1922.0s - ttt_chunk [1811/1893] bpb=1.120975 time=1932.7s - ttt_chunk [1821/1893] bpb=1.121035 time=1943.3s - ttt_chunk [1831/1893] bpb=1.120887 time=1954.0s - ttt_chunk [1841/1893] bpb=1.120931 time=1964.7s - ttt_chunk [1851/1893] bpb=1.120761 time=1975.4s - ttt_chunk [1861/1893] bpb=1.120682 time=1986.0s - ttt_chunk [1871/1893] bpb=1.120616 time=1996.7s - ttt_chunk [1881/1893] bpb=1.120376 time=2007.4s - ttt_chunk [1891/1893] bpb=1.120360 time=2018.0s - ttt_chunk [1893/1893] bpb=1.120391 time=2019.8s -ttt_sliding:done val_loss=1.891728 val_bpb=1.120391 elapsed=2019.8s -legal_ttt val_loss:1.8917 val_bpb:1.1204 eval_time:2020286ms -legal_ttt_exact val_loss:1.89172798 val_bpb:1.12039083 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/requirements.txt b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/requirements.txt deleted file mode 100644 index e3d59eea39..0000000000 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/requirements.txt +++ /dev/null @@ -1,101 +0,0 @@ -mdurl==0.1.2 -nvidia-cudnn-cu13==9.19.0.56 -aiohttp==3.13.3 -nvidia-cufile==1.15.1.6 -charset-normalizer==3.4.6 -Jinja2==3.1.6 -hf-xet==1.4.2 -nvidia-cuda-nvrtc-cu12==12.8.93 -typer==0.24.1 -attrs==26.1.0 -certifi==2026.2.25 -triton==3.6.0 -nvidia-nccl-cu12==2.29.7 -wheel==0.46.3 -nvidia-nvtx-cu12==12.8.90 -gitdb==4.0.12 -dill==0.4.1 -nvidia-cuda-cupti==13.0.85 -tqdm==4.67.3 -pandas==3.0.1 -PyYAML==6.0.3 -annotated-types==0.7.0 -annotated-doc==0.0.4 -nvidia-nccl-cu13==2.28.9 -nvidia-cufft-cu12==11.3.3.83 -nvidia-cuda-nvrtc==13.0.88 -nvidia-cudnn-cu12==9.20.0.48 -httpx==0.28.1 -packaging==26.0 -einops==0.8.2 -xxhash==3.6.0 -huggingface_hub==1.8.0 
-Pygments==2.19.2 -markdown-it-py==4.0.0 -pydantic_core==2.41.5 -nvidia-cusparse-cu12==12.5.8.93 -cuda-toolkit==13.0.2 -rich==14.3.3 -six==1.17.0 -python-dateutil==2.9.0.post0 -nvidia-cusolver==12.0.4.66 -nvidia-nvshmem-cu13==3.4.5 -setuptools==81.0.0 -pyarrow==23.0.1 -typing_extensions==4.15.0 -MarkupSafe==3.0.3 -smmap==5.0.3 -filelock==3.25.2 -nvidia-nvtx==13.0.85 -multiprocess==0.70.19 -networkx==3.6.1 -pydantic==2.12.5 -nvidia-nvshmem-cu12==3.4.5 -nvidia-cublas-cu12==12.8.4.1 -anyio==4.13.0 -nvidia-cufft==12.0.0.61 -cuda-pathfinder==1.5.0 -mpmath==1.3.0 -cuda-bindings==13.2.0 -propcache==0.4.1 -yarl==1.23.0 -ninja==1.13.0 -typing-inspection==0.4.2 -idna==3.11 -h11==0.16.0 -urllib3==2.6.3 -multidict==6.7.1 -aiosignal==1.4.0 -nvidia-nvjitlink-cu12==12.8.93 -nvidia-cusparse==12.6.3.3 -aiohappyeyeballs==2.6.1 -psutil==7.2.2 -wandb==0.25.1 -protobuf==6.33.6 -click==8.3.1 -nvidia-cufile-cu12==1.13.1.3 -httpcore==1.0.9 -sentencepiece==0.2.1 -fsspec==2026.2.0 -nvidia-curand-cu12==10.3.9.90 -nvidia-curand==10.4.0.35 -GitPython==3.1.46 -pip==26.0.1 -platformdirs==4.9.4 -nvidia-cublas==13.1.0.3 -nvidia-cuda-cupti-cu12==12.8.90 -flash_attn==2.8.3 -nvidia-cusolver-cu12==11.7.3.90 -sympy==1.14.0 -torch==2.11.0 -numpy==2.4.3 -nvidia-cuda-runtime-cu12==12.8.90 -nvidia-cusparselt-cu13==0.8.0 -frozenlist==1.8.0 -sentry-sdk==2.56.0 -requests==2.33.0 -nvidia-cuda-runtime==13.0.96 -nvidia-nvjitlink==13.0.88 -nvidia-cusparselt-cu12==0.7.1 -shellingham==1.5.4 -datasets==4.8.4 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/ablation_no_rmsnorm.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/ablation_no_rmsnorm.sh similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/ablation_no_rmsnorm.sh rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/ablation_no_rmsnorm.sh diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/eval_ttt_passes.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/eval_ttt_passes.sh similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/eval_ttt_passes.sh rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/eval_ttt_passes.sh diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/feedback.py b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/feedback.py new file mode 100644 index 0000000000..dd34a36c62 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/feedback.py @@ -0,0 +1,138 @@ +"""Error feedback modules for recurrent quantization correction. + +Implements low-rank residual approximation and correction operators +to compensate for quantization error amplification in recurrent passes. + + e_k = U (V^T h_k) -- low-rank residual approx. 
+    c_k = D_k(e_k)               -- correction operator
+    h_{k+1} = f_{W_q}(h_k + c_k) -- corrected recurrent update
+"""
+from __future__ import annotations
+import math
+import torch
+import torch.nn as nn
+from torch import Tensor
+
+
+class LowRankResidual(nn.Module):
+    """e_k = U (V^T h_k) with U, V in R^{d x r}."""
+
+    def __init__(self, dim: int, rank: int = 2):
+        super().__init__()
+        # LoRA-style asymmetric init: random V, zero U. e_k starts at zero, but
+        # both factors receive gradient; zero-initializing both factors makes
+        # dL/dU and dL/dV identically zero, so the module could never train.
+        self.V = nn.Parameter(torch.randn(dim, rank) / math.sqrt(dim))
+        self.U = nn.Parameter(torch.zeros(dim, rank))
+
+    def forward(self, h: Tensor) -> Tensor:
+        return (h @ self.V) @ self.U.T
+
+
+class DiagonalFeedback(nn.Module):
+    """c_k = d odot e_k."""
+
+    def __init__(self, dim: int, init_ones: bool = True):
+        super().__init__()
+        # Default d = 1: with d = 0 and e_k = 0 at init, neither d nor the
+        # residual factors behind e_k would ever see a nonzero gradient.
+        # c_k still starts at zero because e_k does.
+        init_val = torch.ones(dim) if init_ones else torch.zeros(dim)
+        self.d = nn.Parameter(init_val)
+
+    def forward(self, e: Tensor) -> Tensor:
+        return self.d.to(dtype=e.dtype) * e
+
+
+class LowRankFeedback(nn.Module):
+    """c_k = U_D (V_D^T e_k) with U_D, V_D in R^{d x r}."""
+
+    def __init__(self, dim: int, rank: int = 2):
+        super().__init__()
+        # Both factors random here: e_k is zero at init (see LowRankResidual),
+        # so c_k starts at zero, while a nonzero U_D V_D^T lets gradient reach
+        # the residual path through e_k.
+        self.V_D = nn.Parameter(torch.randn(dim, rank) / math.sqrt(dim))
+        self.U_D = nn.Parameter(torch.randn(dim, rank) / math.sqrt(dim))
+
+    def forward(self, e: Tensor) -> Tensor:
+        return (e @ self.V_D) @ self.U_D.T
+
+
+class AffineJunction(nn.Module):
+    """c_k^{aff} = gamma_k odot h_k + beta_k."""
+
+    def __init__(self, dim: int):
+        super().__init__()
+        self.gamma = nn.Parameter(torch.ones(dim))
+        self.beta = nn.Parameter(torch.zeros(dim))
+
+    def forward(self, h: Tensor) -> Tensor:
+        return self.gamma.to(dtype=h.dtype) * h + self.beta.to(dtype=h.dtype)
+
+
+class ErrorFeedbackModule(nn.Module):
+    """Combined error-feedback path: residual -> correction -> (optional junction).
+
+    Supports shared or per-pass correction operators. Correction is inactive
+    on pass 0 (the first recurrence pass sees no prior quantization residual).
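+
+    Illustrative usage (a sketch; shapes are exact, values depend on training):
+
+        >>> fb = ErrorFeedbackModule(dim=512, rank=2, feedback_mode="diagonal")
+        >>> h = torch.randn(2, 16, 512)
+        >>> fb(h, pass_idx=0).abs().sum().item()  # pass 0: correction masked to zero
+        0.0
+        >>> fb(h, pass_idx=1).shape               # c_1 = d * (U (V^T h_1))
+        torch.Size([2, 16, 512])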
+ + Args: + dim: model hidden dimension + rank: rank for low-rank components + feedback_mode: 'identity' | 'diagonal' | 'low_rank' + per_pass: separate correction per pass if True + num_passes: number of recurrence passes (K) + affine_junction: add an affine junction path + """ + + def __init__( + self, + dim: int, + rank: int = 2, + feedback_mode: str = "diagonal", + per_pass: bool = False, + num_passes: int = 3, + affine_junction: bool = False, + ): + super().__init__() + self.feedback_mode = feedback_mode + self.per_pass = per_pass + self.num_passes = num_passes + + self.residual = LowRankResidual(dim, rank) + + if feedback_mode == "identity": + self.correction: nn.Module | nn.ModuleList | None = None + elif feedback_mode == "diagonal": + if per_pass: + self.correction = nn.ModuleList( + [DiagonalFeedback(dim) for _ in range(num_passes)] + ) + else: + self.correction = DiagonalFeedback(dim) + elif feedback_mode == "low_rank": + if per_pass: + self.correction = nn.ModuleList( + [LowRankFeedback(dim, rank) for _ in range(num_passes)] + ) + else: + self.correction = LowRankFeedback(dim, rank) + else: + raise ValueError(f"Unknown feedback_mode: {feedback_mode}") + + self.junction: AffineJunction | None = ( + AffineJunction(dim) if affine_junction else None + ) + + def forward(self, h: Tensor, pass_idx: int) -> Tensor: + """Return correction tensor (zeros on pass 0).""" + e = self.residual(h) + if self.correction is None: + c = e + elif self.per_pass: + c = self.correction[pass_idx](e) + else: + c = self.correction(e) + if self.junction is not None: + c = c + self.junction(h) + mask = torch.tensor(1.0 if pass_idx > 0 else 0.0, device=h.device, dtype=h.dtype) + return c * mask + + def extra_repr(self) -> str: + return (f"mode={self.feedback_mode}, per_pass={self.per_pass}, " + f"passes={self.num_passes}") + + def param_count(self) -> int: + return sum(p.numel() for p in self.parameters()) diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/grid_search.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/grid_search.sh similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/grid_search.sh rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/grid_search.sh diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/lora-fix-plan.md b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/lora-fix-plan.md similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/lora-fix-plan.md rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/lora-fix-plan.md diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/recurrence-fixes.md b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/recurrence-fixes.md similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/recurrence-fixes.md rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/recurrence-fixes.md diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_2pass_3core.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/run_2pass_3core.sh similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_2pass_3core.sh rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/run_2pass_3core.sh diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_3pass.sh 
b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/run_3pass.sh similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_3pass.sh rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/run_3pass.sh diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_4pass_qat.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/run_4pass_qat.sh similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_4pass_qat.sh rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/run_4pass_qat.sh diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_4pass_test.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/run_4pass_test.sh similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_4pass_test.sh rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/run_4pass_test.sh diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_4pass_ttt.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/run_4pass_ttt.sh similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_4pass_ttt.sh rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/run_4pass_ttt.sh diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_baseline_4pass.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/run_baseline_4pass.sh similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_baseline_4pass.sh rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/run_baseline_4pass.sh diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_full_1gpu.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/run_full_1gpu.sh similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_full_1gpu.sh rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/run_full_1gpu.sh diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_full_4pass.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/run_full_4pass.sh similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_full_4pass.sh rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/run_full_4pass.sh diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_lora_test.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/run_lora_test.sh similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_lora_test.sh rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/run_lora_test.sh diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_lora_test_r8.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/run_lora_test_r8.sh similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_lora_test_r8.sh rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/run_lora_test_r8.sh diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/smoke_passes.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/smoke_passes.sh similarity index 100% rename from 
records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/smoke_passes.sh rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/smoke_passes.sh diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/smoke_test.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/smoke_test.sh similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/smoke_test.sh rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/smoke_test.sh diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/stability.py b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/stability.py new file mode 100644 index 0000000000..a02c831638 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/stability.py @@ -0,0 +1,108 @@ +"""Stability monitoring and control for recurrent passes. + +Provides per-pass diagnostics, hidden-state clipping, learnable residual +scaling, and a cheap Jacobian proxy regulariser. +""" +from __future__ import annotations +import torch +import torch.nn as nn +from torch import Tensor +from dataclasses import dataclass, field + + +@dataclass +class PassDiagnostics: + h_norms: list[float] = field(default_factory=list) + delta_norms: list[float] = field(default_factory=list) + error_norms: list[float] = field(default_factory=list) + correction_norms: list[float] = field(default_factory=list) + growth_ratios: list[float] = field(default_factory=list) + + def reset(self): + for lst in (self.h_norms, self.delta_norms, self.error_norms, + self.correction_norms, self.growth_ratios): + lst.clear() + + def summary(self) -> dict[str, list[float]]: + return { + "h_norms": list(self.h_norms), + "delta_norms": list(self.delta_norms), + "error_norms": list(self.error_norms), + "correction_norms": list(self.correction_norms), + "growth_ratios": list(self.growth_ratios), + } + + +class RecurrentStabilizer: + """Manages stability diagnostics and optional controls for recurrence.""" + + def __init__( + self, + clip_hidden: bool = False, + clip_value: float = 10.0, + clip_mode: str = "value", + jacobian_proxy_weight: float = 0.0, + eps: float = 1e-6, + ): + self.clip_hidden = clip_hidden + self.clip_value = clip_value + self.clip_mode = clip_mode + self.jacobian_proxy_weight = jacobian_proxy_weight + self.eps = eps + self.diagnostics = PassDiagnostics() + + def clip(self, h: Tensor) -> Tensor: + if not self.clip_hidden: + return h + if self.clip_mode == "value": + return torch.clamp(h, -self.clip_value, self.clip_value) + norm = h.norm(dim=-1, keepdim=True) + scale = torch.clamp(self.clip_value / (norm + self.eps), max=1.0) + return h * scale + + def record_pass( + self, + h_prev: Tensor, + h_next: Tensor, + error: Tensor | None = None, + correction: Tensor | None = None, + ): + with torch.no_grad(): + h_pn = h_prev.float().norm().item() + h_nn = h_next.float().norm().item() + self.diagnostics.h_norms.append(h_nn) + self.diagnostics.delta_norms.append( + (h_next - h_prev).float().norm().item() + ) + self.diagnostics.growth_ratios.append(h_nn / (h_pn + self.eps)) + if error is not None: + self.diagnostics.error_norms.append(error.float().norm().item()) + if correction is not None: + self.diagnostics.correction_norms.append( + correction.float().norm().item() + ) + + def jacobian_proxy_loss(self, h_in: Tensor, h_out: Tensor) -> Tensor: + """Finite-difference proxy for Jacobian spectral norm.""" + if self.jacobian_proxy_weight <= 0: + return 
h_in.new_zeros(()) + delta = h_out - h_in + ratio = delta.norm() / (h_in.norm() + self.eps) + return self.jacobian_proxy_weight * torch.relu(ratio - 1.0).square() + + def reset(self): + self.diagnostics.reset() + + +class ResidualScale(nn.Module): + """Learnable per-pass residual scaling: + h_{k+1} = h_k + alpha_k * F(h_k + c_k)""" + + def __init__(self, num_passes: int, init_value: float = 1.0): + super().__init__() + self.scales = nn.Parameter( + torch.full((num_passes,), init_value, dtype=torch.float32) + ) + + def forward(self, residual: Tensor, pass_idx: int) -> Tensor: + return self.scales[pass_idx].to(dtype=residual.dtype) * residual diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/sweep_passes.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/sweep_passes.sh similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/sweep_passes.sh rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/sweep_passes.sh diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/train_gpt_recurrent.py b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/train_gpt_recurrent.py new file mode 100644 index 0000000000..a82ec73612 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/train_gpt_recurrent.py @@ -0,0 +1,2194 @@ +from __future__ import annotations +import copy +import glob +import io +import lzma +import math +import os +import random +import subprocess +import sys +import time +import uuid +import zlib +from pathlib import Path +try: + import zstandard + _COMPRESSOR = "zstd" +except ImportError: + _COMPRESSOR = "zlib" +import numpy as np +import sentencepiece as spm +import torch +import torch._dynamo +torch._dynamo.config.recompile_limit = 32 +import torch.distributed as dist +import torch.nn.functional as F +from torch import Tensor, nn +from torch.nn.parallel import DistributedDataParallel as DDP +_gpu_mem_frac = float(os.environ.get("CUDA_MEM_FRACTION", "0")) +if _gpu_mem_frac > 0: + torch.cuda.set_per_process_memory_fraction(_gpu_mem_frac, 0) +from flash_attn_interface import flash_attn_func as flash_attn_3_func +import argparse +from feedback import ErrorFeedbackModule +from stability import RecurrentStabilizer, ResidualScale +try: + import wandb as _wandb +except ImportError: + _wandb = None +class Hyperparameters: + data_path = os.environ.get("DATA_PATH", "./data/datasets/fineweb10B_sp1024") + train_files = os.path.join(data_path, "fineweb_train_*.bin") + val_files = os.path.join(data_path, "fineweb_val_*.bin") + tokenizer_path = os.environ.get("TOKENIZER_PATH", "./data/tokenizers/fineweb_1024_bpe.model") + run_id = os.environ.get("RUN_ID", str(uuid.uuid4())) + seed = int(os.environ.get("SEED", 1337)) + val_batch_size = int(os.environ.get("VAL_BATCH_SIZE", 524_288)) + val_loss_every = int(os.environ.get("VAL_LOSS_EVERY", 4000)) + train_log_every = int(os.environ.get("TRAIN_LOG_EVERY", 500)) + iterations = int(os.environ.get("ITERATIONS", 20000)) + warmdown_iters = int(os.environ.get("WARMDOWN_ITERS", 3500)) + warmup_steps = int(os.environ.get("WARMUP_STEPS", 20)) + train_batch_tokens = int(os.environ.get("TRAIN_BATCH_TOKENS", 786_432)) + train_seq_len = int(os.environ.get("TRAIN_SEQ_LEN", 2048)) + eval_seq_len = int(os.environ.get("EVAL_SEQ_LEN", 2048)) + max_wallclock_seconds = float(os.environ.get("MAX_WALLCLOCK_SECONDS", 600.0)) + qk_gain_init = float(os.environ.get("QK_GAIN_INIT", 1.5)) + vocab_size = 
int(os.environ.get("VOCAB_SIZE", 1024)) + num_layers = int(os.environ.get("NUM_LAYERS", 11)) + num_kv_heads = int(os.environ.get("NUM_KV_HEADS", 4)) + model_dim = int(os.environ.get("MODEL_DIM", 512)) + num_heads = int(os.environ.get("NUM_HEADS", 8)) + mlp_mult = float(os.environ.get("MLP_MULT", 3.0)) + tie_embeddings = bool(int(os.environ.get("TIE_EMBEDDINGS", "1"))) + rope_base = float(os.environ.get("ROPE_BASE", 10000.0)) + logit_softcap = float(os.environ.get("LOGIT_SOFTCAP", 30.0)) + embed_lr = float(os.environ.get("EMBED_LR", 0.6)) + head_lr = float(os.environ.get("HEAD_LR", 0.008)) + tied_embed_lr = float(os.environ.get("TIED_EMBED_LR", 0.035)) + tied_embed_init_std = float(os.environ.get("TIED_EMBED_INIT_STD", 0.005)) + matrix_lr = float(os.environ.get("MATRIX_LR", 0.025)) + scalar_lr = float(os.environ.get("SCALAR_LR", 0.025)) + muon_momentum = float(os.environ.get("MUON_MOMENTUM", 0.99)) + muon_backend_steps = int(os.environ.get("MUON_BACKEND_STEPS", 5)) + muon_momentum_warmup_start = float(os.environ.get("MUON_MOMENTUM_WARMUP_START", 0.92)) + muon_momentum_warmup_steps = int(os.environ.get("MUON_MOMENTUM_WARMUP_STEPS", 1500)) + beta1 = float(os.environ.get("BETA1", 0.9)) + beta2 = float(os.environ.get("BETA2", 0.95)) + adam_eps = float(os.environ.get("ADAM_EPS", 1e-8)) + grad_clip_norm = float(os.environ.get("GRAD_CLIP_NORM", 0.3)) + eval_stride = int(os.environ.get("EVAL_STRIDE", 64)) + mtp_num_heads = int(os.environ.get("MTP_NUM_HEADS", 0)) + mtp_loss_weight = float(os.environ.get("MTP_LOSS_WEIGHT", 0.2)) + muon_beta2 = float(os.environ.get("MUON_BETA2", 0.95)) + swa_enabled = bool(int(os.environ.get("SWA_ENABLED", "1"))) + swa_every = int(os.environ.get("SWA_EVERY", 50)) + lawa_enabled = bool(int(os.environ.get("LAWA_ENABLED", "0"))) + lawa_k = int(os.environ.get("LAWA_K", 10)) + lawa_freq = int(os.environ.get("LAWA_FREQ", 100)) + muon_wd = float(os.environ.get("MUON_WD", 0.04)) + adam_wd = float(os.environ.get("ADAM_WD", 0.04)) + qat_enabled = bool(int(os.environ.get("QAT_ENABLED", "0"))) + bigram_vocab_size = int(os.environ.get("BIGRAM_VOCAB_SIZE", 2048)) + bigram_dim = int(os.environ.get("BIGRAM_DIM", 128)) + xsa_last_n = int(os.environ.get("XSA_LAST_N", 4)) + rope_dims = int(os.environ.get("ROPE_DIMS", 16)) + ln_scale = bool(int(os.environ.get("LN_SCALE", "1"))) + dtg_enabled = bool(int(os.environ.get("DTG_ENABLED", "0"))) + late_qat_threshold = float(os.environ.get("LATE_QAT_THRESHOLD", 0.15)) + ve_enabled = bool(int(os.environ.get("VE_ENABLED", "1"))) + ve_dim = int(os.environ.get("VE_DIM", 128)) + ve_layers = os.environ.get("VE_LAYERS", "9,10") + gated_attention = bool(int(os.environ.get("GATED_ATTENTION", "0"))) + value_residual = bool(int(os.environ.get("VALUE_RESIDUAL", "0"))) + ttt_enabled = bool(int(os.environ.get("TTT_ENABLED", "0"))) + ttt_lr = float(os.environ.get("TTT_LR", 0.002)) + ttt_epochs = int(os.environ.get("TTT_EPOCHS", 3)) + ttt_chunk_tokens = int(os.environ.get("TTT_CHUNK_TOKENS", 32768)) + ttt_freeze_blocks = int(os.environ.get("TTT_FREEZE_BLOCKS", 2)) + ttt_momentum = float(os.environ.get("TTT_MOMENTUM", 0.9)) + ttt_batch_seqs = int(os.environ.get("TTT_BATCH_SEQS", 32)) + ttt_grad_clip = float(os.environ.get("TTT_GRAD_CLIP", 1.0)) + # recurrence + core_start = int(os.environ.get("CORE_START", 3)) + core_end = int(os.environ.get("CORE_END", 8)) + num_passes = int(os.environ.get("NUM_PASSES", 1)) + core_quant_bits = int(os.environ.get("CORE_QUANT_BITS", 6)) + core_quant_enabled = bool(int(os.environ.get("CORE_QUANT_ENABLED", "0"))) + lora_rank = 
int(os.environ.get("LORA_RANK", 0))
+
+# --- Batched Newton-Schulz orthogonalization ---
+
+def zeropower_via_newtonschulz5(G: Tensor, steps: int = 5, eps: float = 1e-7) -> Tensor:
+    """Batched Newton-Schulz orthogonalization. G: (B,M,N) or (M,N)."""
+    a, b, c = (3.4445, -4.7750, 2.0315)
+    was_2d = G.ndim == 2
+    if was_2d:
+        G = G.unsqueeze(0)
+    X = G.bfloat16()
+    transposed = X.size(-2) > X.size(-1)
+    if transposed:
+        X = X.mT
+    X = X / (X.norm(dim=(-2, -1), keepdim=True) + eps)
+    for _ in range(steps):
+        A = X @ X.mT
+        B = b * A + c * (A @ A)
+        X = a * X + B @ X
+    if transposed:
+        X = X.mT
+    if was_2d:
+        X = X.squeeze(0)
+    return X
+
+# --- Parallel Muon optimizer ---
+
+class Muon(torch.optim.Optimizer):
+    """Parallel Muon: post-backward reduce-scatter -> local NS5 -> all-gather.
+
+    No DDP for bank params. After backward, this optimizer:
+    1. Launches async reduce-scatter for all banks (biggest first)
+    2. Returns control so Adam can step on small params while RS is in-flight
+    3. Waits for each RS, runs local NS5 on the shard, launches async all-gather
+    4. Each all-gather overlaps with next bank's NS5
+    """
+    def __init__(self, params, lr: float, momentum: float, backend_steps: int,
+                 nesterov: bool = True, weight_decay: float = 0.0):
+        super().__init__(
+            params,
+            dict(lr=lr, momentum=momentum, backend_steps=backend_steps,
+                 nesterov=nesterov, weight_decay=weight_decay),
+        )
+        self._built = False
+
+    def _build(self):
+        self._distributed = dist.is_available() and dist.is_initialized()
+        self._world_size = dist.get_world_size() if self._distributed else 1
+        self._rank = dist.get_rank() if self._distributed else 0
+        ws = self._world_size
+
+        self._bank_meta = []
+        for group in self.param_groups:
+            for p in group["params"]:
+                B = p.shape[0]
+                padded_B = ((B + ws - 1) // ws) * ws
+                shard_B = padded_B // ws
+                tail = p.shape[1:]
+                dev = p.device
+                self._bank_meta.append({
+                    'p': p,
+                    'B': B,
+                    'padded_grad': torch.zeros(padded_B, *tail, device=dev, dtype=torch.bfloat16),
+                    'shard': torch.zeros(shard_B, *tail, device=dev, dtype=torch.bfloat16),
+                    'shard_mom': torch.zeros(shard_B, *tail, device=dev, dtype=torch.bfloat16),
+                    'full_update': torch.zeros(padded_B, *tail, device=dev, dtype=torch.bfloat16),
+                    'scale': max(1, p.shape[-2] / p.shape[-1]) ** 0.5,
+                })
+        # Sort by size descending -- launch biggest reduce-scatters first
+        self._bank_meta.sort(key=lambda m: -m['p'].numel())
+        self._built = True
+
+    def launch_reduce_scatters(self):
+        """Phase 1: launch async reduce-scatter for all banks. Call right after backward."""
+        if not self._built:
+            self._build()
+        if not self._distributed:
+            return
+        self._rs_futures = []
+        for m in self._bank_meta:
+            p = m['p']
+            if p.grad is None:
+                self._rs_futures.append(None)
+                continue
+            pg = m['padded_grad']
+            pg[:m['B']].copy_(p.grad.bfloat16())
+            if pg.shape[0] > m['B']:
+                pg[m['B']:].zero_()
+            fut = dist.reduce_scatter_tensor(m['shard'], pg, op=dist.ReduceOp.AVG, async_op=True)
+            self._rs_futures.append(fut)
+
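+    # Intended call pattern per training step, per the class docstring
+    # (the `muon` / `adam` variable names are illustrative):
+    #
+    #     loss.backward()
+    #     muon.launch_reduce_scatters()  # phase 1: async reduce-scatter, biggest banks first
+    #     adam.step()                    # phase 2: small params while RS is in flight
+    #     muon.step()                    # phase 3: shard-local NS5 + async all-gather
+    @torch.no_grad()
+    def step(self, closure=None):
+        """Phase 3: wait for RS, local NS5, all-gather.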
Call AFTER Adam steps.""" + loss = None + if closure is not None: + with torch.enable_grad(): + loss = closure() + + if not self._built: + self._build() + + for group in self.param_groups: + lr = group["lr"] + momentum = group["momentum"] + backend_steps = group["backend_steps"] + nesterov = group["nesterov"] + wd = group.get("weight_decay", 0.0) + + prev_ag_handle = None + prev_m = None + + sharded = self._distributed and hasattr(self, '_rs_futures') + + for i, m in enumerate(self._bank_meta): + p = m['p'] + if p.grad is None: + continue + + if prev_ag_handle is not None: + prev_ag_handle.wait() + pp = prev_m['p'] + upd = prev_m['full_update'][:prev_m['B']] + if wd > 0.0: + pp.data.mul_(1.0 - lr * wd) + pp.add_(upd.to(dtype=pp.dtype), alpha=-lr * prev_m['scale']) + + if sharded and self._rs_futures[i] is not None: + self._rs_futures[i].wait() + g = m['shard'] + buf = m['shard_mom'] + else: + g = p.grad.bfloat16() + state = self.state[p] + if "momentum_buffer" not in state: + state["momentum_buffer"] = torch.zeros_like(g) + buf = state["momentum_buffer"] + + buf.mul_(momentum).add_(g) + if nesterov: + update = g.add(buf, alpha=momentum) + else: + update = buf + + update = zeropower_via_newtonschulz5(update, steps=backend_steps) + + if sharded: + prev_ag_handle = dist.all_gather_into_tensor( + m['full_update'], update, async_op=True) + prev_m = m + else: + if wd > 0.0: + p.data.mul_(1.0 - lr * wd) + p.add_(update.to(dtype=p.dtype), alpha=-lr * m['scale']) + + if prev_ag_handle is not None: + prev_ag_handle.wait() + pp = prev_m['p'] + upd = prev_m['full_update'][:prev_m['B']] + if wd > 0.0: + pp.data.mul_(1.0 - lr * wd) + pp.add_(upd.to(dtype=pp.dtype), alpha=-lr * prev_m['scale']) + + if hasattr(self, '_rs_futures'): + del self._rs_futures + + return loss + +# --- Tokenizer evaluation helpers --- + +def build_sentencepiece_luts( + sp: spm.SentencePieceProcessor, vocab_size: int, device: torch.device +) -> tuple[Tensor, Tensor, Tensor]: + sp_vocab_size = int(sp.vocab_size()) + table_size = max(sp_vocab_size, vocab_size) + base_bytes_np = np.zeros((table_size,), dtype=np.int16) + has_leading_space_np = np.zeros((table_size,), dtype=np.bool_) + is_boundary_token_np = np.ones((table_size,), dtype=np.bool_) + for token_id in range(sp_vocab_size): + if sp.is_control(token_id) or sp.is_unknown(token_id) or sp.is_unused(token_id): + continue + is_boundary_token_np[token_id] = False + if sp.is_byte(token_id): + base_bytes_np[token_id] = 1 + continue + piece = sp.id_to_piece(token_id) + if piece.startswith("\u2581"): + has_leading_space_np[token_id] = True + piece = piece[1:] + base_bytes_np[token_id] = len(piece.encode("utf-8")) + return ( + torch.tensor(base_bytes_np, dtype=torch.int16, device=device), + torch.tensor(has_leading_space_np, dtype=torch.bool, device=device), + torch.tensor(is_boundary_token_np, dtype=torch.bool, device=device), + ) +def load_validation_tokens(pattern: str, seq_len: int) -> Tensor: + files = [Path(p) for p in sorted(glob.glob(pattern))] + if not files: + raise FileNotFoundError(f"No files found for pattern: {pattern}") + tokens = torch.cat([load_data_shard(file) for file in files]).contiguous() + usable = ((tokens.numel() - 1) // seq_len) * seq_len + if usable <= 0: + raise ValueError(f"Validation split is too short for TRAIN_SEQ_LEN={seq_len}") + return tokens[: usable + 1] +def eval_val( + args: Hyperparameters, + model: nn.Module, + rank: int, + world_size: int, + device: torch.device, + grad_accum_steps: int, + val_tokens: Tensor, + base_bytes_lut: Tensor, + 
has_leading_space_lut: Tensor, + is_boundary_token_lut: Tensor, + eval_seq_len: int | None = None, +) -> tuple[float, float]: + seq_len = eval_seq_len or args.train_seq_len + local_batch_tokens = args.val_batch_size // (world_size * grad_accum_steps) + if local_batch_tokens < seq_len: + raise ValueError( + "VAL_BATCH_SIZE must provide at least one sequence per rank; " + f"got VAL_BATCH_SIZE={args.val_batch_size}, WORLD_SIZE={world_size}, " + f"GRAD_ACCUM_STEPS={grad_accum_steps}, seq_len={seq_len}" + ) + local_batch_seqs = local_batch_tokens // seq_len + total_seqs = (val_tokens.numel() - 1) // seq_len + seq_start = (total_seqs * rank) // world_size + seq_end = (total_seqs * (rank + 1)) // world_size + val_loss_sum = torch.zeros((), device=device, dtype=torch.float64) + val_token_count = torch.zeros((), device=device, dtype=torch.float64) + val_byte_count = torch.zeros((), device=device, dtype=torch.float64) + model.eval() + with torch.inference_mode(): + for batch_seq_start in range(seq_start, seq_end, local_batch_seqs): + batch_seq_end = min(batch_seq_start + local_batch_seqs, seq_end) + raw_start = batch_seq_start * seq_len + raw_end = batch_seq_end * seq_len + 1 + local = val_tokens[raw_start:raw_end].to(device=device, dtype=torch.int64, non_blocking=True) + x = local[:-1].reshape(-1, seq_len) + y = local[1:].reshape(-1, seq_len) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + batch_loss = model(x, y).detach() + batch_token_count = float(y.numel()) + val_loss_sum += batch_loss.to(torch.float64) * batch_token_count + val_token_count += batch_token_count + prev_ids = x.reshape(-1) + tgt_ids = y.reshape(-1) + token_bytes = base_bytes_lut[tgt_ids].to(dtype=torch.int16) + token_bytes += (has_leading_space_lut[tgt_ids] & ~is_boundary_token_lut[prev_ids]).to(dtype=torch.int16) + val_byte_count += token_bytes.to(torch.float64).sum() + if dist.is_available() and dist.is_initialized(): + dist.all_reduce(val_loss_sum, op=dist.ReduceOp.SUM) + dist.all_reduce(val_token_count, op=dist.ReduceOp.SUM) + dist.all_reduce(val_byte_count, op=dist.ReduceOp.SUM) + val_loss = val_loss_sum / val_token_count + bits_per_token = val_loss.item() / math.log(2.0) + tokens_per_byte = val_token_count.item() / val_byte_count.item() + model.train() + return float(val_loss.item()), float(bits_per_token * tokens_per_byte) + +# --- Quantization helpers --- + +CONTROL_TENSOR_NAME_PATTERNS = tuple( + pattern + for pattern in os.environ.get( + "CONTROL_TENSOR_NAME_PATTERNS", + "attn_scale,attn_scales,mlp_scale,mlp_scales,resid_mix,resid_mixes,q_gain,skip_weight,skip_weights,smear,dtg_gate,ve_layer_scales,ve_shared.scale,attn_gate,vr_lambda", + ).split(",") + if pattern +) +INT8_KEEP_FLOAT_FP32_NAME_PATTERNS = tuple( + pattern + for pattern in os.environ.get( + "INT8_KEEP_FLOAT_FP32_NAME_PATTERNS", + ",".join(CONTROL_TENSOR_NAME_PATTERNS), + ).split(",") + if pattern +) +INT8_KEEP_FLOAT_MAX_NUMEL = 65_536 +INT8_KEEP_FLOAT_STORE_DTYPE = torch.float16 +INT8_PER_ROW_SCALE_DTYPE = torch.float16 +INT8_CLIP_PERCENTILE = 99.99984 +INT8_CLIP_Q = INT8_CLIP_PERCENTILE / 100.0 +def tensor_nbytes(t: Tensor) -> int: + return int(t.numel()) * int(t.element_size()) +def keep_float_tensor(name: str, t: Tensor, passthrough_orig_dtypes: dict[str, str]) -> Tensor: + if any(pattern in name for pattern in INT8_KEEP_FLOAT_FP32_NAME_PATTERNS): + return t.float().contiguous() + if t.dtype in {torch.float32, torch.bfloat16}: + passthrough_orig_dtypes[name] = str(t.dtype).removeprefix("torch.") + return 
t.to(dtype=INT8_KEEP_FLOAT_STORE_DTYPE).contiguous()
+    return t
+def quantize_float_tensor(t: Tensor) -> tuple[Tensor, Tensor]:
+    t32 = t.float()
+    if t32.ndim == 2:
+        clip_abs = (
+            torch.quantile(t32.abs(), INT8_CLIP_Q, dim=1)
+            if t32.numel()
+            else torch.empty((t32.shape[0],), dtype=torch.float32)
+        )
+        clipped = torch.maximum(torch.minimum(t32, clip_abs[:, None]), -clip_abs[:, None])
+        scale = (clip_abs / 127.0).clamp_min(1.0 / 127.0)
+        q = torch.clamp(torch.round(clipped / scale[:, None]), -127, 127).to(torch.int8).contiguous()
+        return q, scale.to(dtype=INT8_PER_ROW_SCALE_DTYPE).contiguous()
+    clip_abs = float(torch.quantile(t32.abs().flatten(), INT8_CLIP_Q).item()) if t32.numel() else 0.0
+    scale = torch.tensor(clip_abs / 127.0 if clip_abs > 0 else 1.0, dtype=torch.float32)
+    q = torch.clamp(torch.round(torch.clamp(t32, -clip_abs, clip_abs) / scale), -127, 127).to(torch.int8).contiguous()
+    return q, scale
+def quantize_state_dict_int8(state_dict: dict[str, Tensor]):
+    quantized: dict[str, Tensor] = {}
+    scales: dict[str, Tensor] = {}
+    dtypes: dict[str, str] = {}
+    passthrough: dict[str, Tensor] = {}
+    passthrough_orig_dtypes: dict[str, str] = {}
+    qmeta: dict[str, dict[str, object]] = {}
+    stats = dict.fromkeys(
+        ("param_count", "num_tensors", "num_float_tensors", "num_nonfloat_tensors", "baseline_tensor_bytes", "int8_payload_bytes"),
+        0,
+    )
+    for name, tensor in state_dict.items():
+        t = tensor.detach().to("cpu").contiguous()
+        stats["param_count"] += int(t.numel())
+        stats["num_tensors"] += 1
+        stats["baseline_tensor_bytes"] += tensor_nbytes(t)
+        if not t.is_floating_point():
+            stats["num_nonfloat_tensors"] += 1
+            passthrough[name] = t
+            stats["int8_payload_bytes"] += tensor_nbytes(t)
+            continue
+        if t.numel() <= INT8_KEEP_FLOAT_MAX_NUMEL:
+            kept = keep_float_tensor(name, t, passthrough_orig_dtypes)
+            passthrough[name] = kept
+            stats["int8_payload_bytes"] += tensor_nbytes(kept)
+            continue
+        stats["num_float_tensors"] += 1
+        q, s = quantize_float_tensor(t)
+        if s.ndim > 0:
+            qmeta[name] = {"scheme": "per_row", "axis": 0}
+        quantized[name] = q
+        scales[name] = s
+        dtypes[name] = str(t.dtype).removeprefix("torch.")
+        stats["int8_payload_bytes"] += tensor_nbytes(q) + tensor_nbytes(s)
+    obj: dict[str, object] = {
+        "__quant_format__": "int8_clean_per_row_v1",
+        "quantized": quantized,
+        "scales": scales,
+        "dtypes": dtypes,
+        "passthrough": passthrough,
+    }
+    if qmeta:
+        obj["qmeta"] = qmeta
+    if passthrough_orig_dtypes:
+        obj["passthrough_orig_dtypes"] = passthrough_orig_dtypes
+    return obj, stats
+def dequantize_state_dict_int8(obj: dict[str, object]) -> dict[str, Tensor]:
+    out: dict[str, Tensor] = {}
+    qmeta = obj.get("qmeta", {})
+    passthrough_orig_dtypes = obj.get("passthrough_orig_dtypes", {})
+    for name, q in obj["quantized"].items():
+        dtype = getattr(torch, obj["dtypes"][name])
+        s = obj["scales"][name]
+        if qmeta.get(name, {}).get("scheme") == "per_row" or s.ndim > 0:
+            s = s.to(dtype=torch.float32)
+            out[name] = (q.float() * s.view(q.shape[0], *([1] * (q.ndim - 1)))).to(dtype=dtype).contiguous()
+        else:
+            scale = float(s.item())
+            out[name] = (q.float() * scale).to(dtype=dtype).contiguous()
+    for name, t in obj["passthrough"].items():
+        out_t = t.detach().to("cpu").contiguous()
+        orig_dtype = passthrough_orig_dtypes.get(name)
+        if isinstance(orig_dtype, str):
+            out_t = out_t.to(dtype=getattr(torch, orig_dtype)).contiguous()
+        out[name] = out_t
+    return out
+
+# --- Data loading ---
+
+def load_data_shard(file: Path) -> Tensor:
+    header_bytes = 256 * np.dtype("<i4").itemsize
+    # Shard layout assumed here: 256 little-endian int32s of header
+    # (magic, version, num_tokens, padding) followed by uint16 token ids.
+    with file.open("rb") as f:
+        header = np.frombuffer(f.read(header_bytes), dtype="<i4")
+        num_tokens = int(header[2])
+        tokens = np.frombuffer(f.read(num_tokens * np.dtype("<u2").itemsize), dtype="<u2")
+    return torch.from_numpy(tokens.astype(np.int32))
+
+class TokenStream:
+    def __init__(self, pattern: str):
+        self.files = [Path(p) for p in sorted(glob.glob(pattern))]
+        if not self.files:
+            raise FileNotFoundError(f"No files found for pattern: {pattern}")
+        self.file_idx = 0
+        self.tokens = load_data_shard(self.files[self.file_idx])
+        self.pos = 0
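+
+    # The stream treats the shard list as a ring: take() spans shard boundaries
+    # and wraps. E.g., with two hypothetical 100-token shards, take(150) crosses
+    # into the second shard, and a following take(60) consumes its remaining 50
+    # tokens and wraps back to the first shard for the last 10.
+    def _advance_file(self) -> None: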
self.file_idx = (self.file_idx + 1) % len(self.files) + self.tokens = load_data_shard(self.files[self.file_idx]) + self.pos = 0 + def take(self, n: int) -> Tensor: + chunks: list[Tensor] = [] + remaining = n + while remaining > 0: + avail = self.tokens.numel() - self.pos + if avail <= 0: + self._advance_file() + continue + k = min(remaining, avail) + chunks.append(self.tokens[self.pos : self.pos + k]) + self.pos += k + remaining -= k + return chunks[0] if len(chunks) == 1 else torch.cat(chunks) +class DistributedTokenLoader: + def __init__(self, pattern: str, rank: int, world_size: int, device: torch.device): + self.rank = rank + self.world_size = world_size + self.device = device + self.stream = TokenStream(pattern) + def next_batch(self, global_tokens: int, seq_len: int, grad_accum_steps: int) -> tuple[Tensor, Tensor]: + local_tokens = global_tokens // (self.world_size * grad_accum_steps) + per_rank_span = local_tokens + 1 + chunk = self.stream.take(per_rank_span * self.world_size) + start = self.rank * per_rank_span + local = chunk[start : start + per_rank_span].to(dtype=torch.int64) + x = local[:-1].reshape(-1, seq_len) + y = local[1:].reshape(-1, seq_len) + return x.to(self.device, non_blocking=True), y.to(self.device, non_blocking=True) + +# --- Transformer modules --- + +class RMSNorm(nn.Module): + def __init__(self, eps: float | None = None): + super().__init__() + self.eps = eps + def forward(self, x: Tensor) -> Tensor: + return F.rms_norm(x, (x.size(-1),), eps=self.eps) +class CastedLinear(nn.Linear): + _qat_enabled: bool = False + def forward(self, x: Tensor) -> Tensor: + w = self.weight.to(x.dtype) + if CastedLinear._qat_enabled and self.training and w.ndim == 2: + with torch.no_grad(): + w32 = self.weight.float() + row_max = w32.abs().amax(dim=1) + scale = (row_max / 31.0).clamp_min(1.0 / 31.0) + w_q = (torch.clamp(torch.round(w32 / scale[:, None]), -32, 31) * scale[:, None]).to(x.dtype) + w = w + (w_q - w).detach() + bias = self.bias.to(x.dtype) if self.bias is not None else None + return F.linear(x, w, bias) +def restore_low_dim_params_to_fp32(module: nn.Module) -> None: + with torch.no_grad(): + for name, param in module.named_parameters(): + if (param.ndim < 2 or any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS)) and param.dtype != torch.float32: + param.data = param.data.float() +class Rotary(nn.Module): + def __init__(self, dim: int, base: float = 10000.0, train_seq_len: int = 1024, rope_dims: int = 0): + super().__init__() + self.dim = dim + self.base = base + self.train_seq_len = train_seq_len + self.rope_dims = rope_dims if rope_dims > 0 else dim + inv_freq = 1.0 / (base ** (torch.arange(0, self.rope_dims, 2, dtype=torch.float32) / self.rope_dims)) + self.register_buffer("inv_freq", inv_freq, persistent=False) + self._seq_len_cached = 0 + self._cos_cached: Tensor | None = None + self._sin_cached: Tensor | None = None + def forward(self, seq_len: int, device: torch.device, dtype: torch.dtype) -> tuple[Tensor, Tensor]: + if ( + self._cos_cached is None + or self._sin_cached is None + or self._seq_len_cached != seq_len + or self._cos_cached.device != device + ): + rd = self.rope_dims + if seq_len > self.train_seq_len: + scale = seq_len / self.train_seq_len + new_base = self.base * (scale ** (rd / (rd - 2))) + inv_freq = 1.0 / (new_base ** (torch.arange(0, rd, 2, dtype=torch.float32, device=device) / rd)) + else: + inv_freq = self.inv_freq.to(device) + t = torch.arange(seq_len, device=device, dtype=inv_freq.dtype) + freqs = torch.outer(t, inv_freq) + 
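+            # freqs: [seq_len, rope_dims // 2]; cached below as [1, T, 1, rd // 2]
+            # so cos/sin broadcast over batch and heads in apply_rotary_emb.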
self._cos_cached = freqs.cos()[None, :, None, :] + self._sin_cached = freqs.sin()[None, :, None, :] + self._seq_len_cached = seq_len + return self._cos_cached.to(dtype=dtype), self._sin_cached.to(dtype=dtype) +def apply_rotary_emb(x: Tensor, cos: Tensor, sin: Tensor, rope_dims: int = 0) -> Tensor: + if rope_dims > 0 and rope_dims < x.size(-1): + x_rope, x_pass = x[..., :rope_dims], x[..., rope_dims:] + half = rope_dims // 2 + x1, x2 = x_rope[..., :half], x_rope[..., half:] + x_rope = torch.cat((x1 * cos + x2 * sin, x1 * (-sin) + x2 * cos), dim=-1) + return torch.cat((x_rope, x_pass), dim=-1) + half = x.size(-1) // 2 + x1, x2 = x[..., :half], x[..., half:] + return torch.cat((x1 * cos + x2 * sin, x1 * (-sin) + x2 * cos), dim=-1) + +class CausalSelfAttention(nn.Module): + def __init__( + self, + dim: int, + num_heads: int, + num_kv_heads: int, + rope_base: float, + qk_gain_init: float, + gated_attention: bool = False, + value_residual: bool = False, + ): + super().__init__() + if dim % num_heads != 0: + raise ValueError("model_dim must be divisible by num_heads") + if num_heads % num_kv_heads != 0: + raise ValueError("num_heads must be divisible by num_kv_heads") + self.num_heads = num_heads + self.num_kv_heads = num_kv_heads + self.head_dim = dim // num_heads + if self.head_dim % 2 != 0: + raise ValueError("head_dim must be even for RoPE") + # No CastedLinear -- weights come from banks + self.q_gain = nn.Parameter(torch.full((num_heads,), qk_gain_init, dtype=torch.float32)) + self.rope_dims = 0 # set by GPT.__init__ for partial RoPE + self.rotary = Rotary(self.head_dim, base=rope_base, train_seq_len=1024) + self.use_xsa = False # set by GPT.__init__ for deep layers only + # Gated attention and value residual (non-banked small params) + self.gated_attention = gated_attention + if gated_attention: + self.attn_gate = nn.Linear(dim, num_heads, bias=True) + nn.init.zeros_(self.attn_gate.weight) + nn.init.constant_(self.attn_gate.bias, 4.0) + self.value_residual = value_residual + if value_residual: + self.vr_lambda = nn.Parameter(torch.tensor([0.5, 0.5], dtype=torch.float32)) + def _xsa_efficient(self, y: Tensor, v: Tensor) -> Tensor: + """Efficient XSA: subtract self-value projection via GQA-aware reshape (no repeat_interleave). + y: [B, T, H, D], v: [B, T, Hkv, D]. 
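+        For each KV group this projects out the component of the attention output
+        that lies along its own normalized value vector (y - (y·v̂)v̂).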
H must be divisible by Hkv.""" + B, T, H, D = y.shape + Hkv = v.size(-2) + group = H // Hkv + y_g = y.reshape(B, T, Hkv, group, D) # [B, T, Hkv, group, D] + vn = F.normalize(v, dim=-1).unsqueeze(-2) # [B, T, Hkv, 1, D] -- broadcast ready + proj = (y_g * vn).sum(dim=-1, keepdim=True) * vn + return (y_g - proj).reshape(B, T, H, D) + def forward(self, x: Tensor, q_w: Tensor, k_w: Tensor, v_w: Tensor, out_w: Tensor, v_embed: Tensor | None = None, v0: Tensor | None = None) -> tuple[Tensor, Tensor | None]: + bsz, seqlen, dim = x.shape + q = F.linear(x, q_w.to(x.dtype)).reshape(bsz, seqlen, self.num_heads, self.head_dim) + k = F.linear(x, k_w.to(x.dtype)).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim) + v = F.linear(x, v_w.to(x.dtype)) + if v_embed is not None: + v = v + v_embed + v = v.reshape(bsz, seqlen, self.num_kv_heads, self.head_dim) + raw_v = v if self.value_residual else None + if self.value_residual and v0 is not None: + lam = self.vr_lambda.to(dtype=v.dtype) + v = lam[0] * v0 + lam[1] * v + q = F.rms_norm(q, (q.size(-1),)) + k = F.rms_norm(k, (k.size(-1),)) + cos, sin = self.rotary(seqlen, x.device, q.dtype) + q = apply_rotary_emb(q, cos, sin, self.rope_dims) + k = apply_rotary_emb(k, cos, sin, self.rope_dims) + q = q * self.q_gain.to(dtype=q.dtype)[None, None, :, None] + y = flash_attn_3_func(q, k, v, causal=True) + if self.use_xsa: + y = self._xsa_efficient(y, v) + if self.gated_attention: + # gate shape: (bsz, seqlen, num_heads) -> (bsz, seqlen, num_heads, 1) for B,T,H,D layout + gate = torch.sigmoid(self.attn_gate(x)).unsqueeze(-1) + y = y * gate + y = y.reshape(bsz, seqlen, dim) + return F.linear(y, out_w.to(x.dtype)), raw_v + +class SmearGate(nn.Module): + def __init__(self, dim: int): + super().__init__() + self.gate = nn.Parameter(torch.zeros(dim, dtype=torch.float32)) + def forward(self, x: Tensor) -> Tensor: + g = torch.sigmoid(self.gate.to(dtype=x.dtype))[None, None, :] + x_prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1) + return (1 - g) * x + g * x_prev + +class BigramHashEmbedding(nn.Module): + def __init__(self, bigram_vocab_size: int, bigram_dim: int, model_dim: int): + super().__init__() + self.bigram_vocab_size = bigram_vocab_size + self.embed = nn.Embedding(bigram_vocab_size, bigram_dim) + nn.init.zeros_(self.embed.weight) + self.proj = CastedLinear(bigram_dim, model_dim, bias=False) if bigram_dim != model_dim else None + if self.proj is not None: + nn.init.zeros_(self.proj.weight) + self.scale = nn.Parameter(torch.tensor(0.05, dtype=torch.float32)) + def bigram_hash(self, tokens: Tensor) -> Tensor: + t = tokens.to(torch.int32) + mod = self.bigram_vocab_size - 1 + out = torch.empty_like(t) + out[..., 0] = mod + out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod + return out.long() + def forward(self, token_ids: Tensor) -> Tensor: + h = self.embed(self.bigram_hash(token_ids)) + if self.proj is not None: + h = self.proj(h) + return h * self.scale.to(dtype=h.dtype) + +class ValueEmbedding(nn.Module): + """Reinject token identity into attention values at specific layers. 
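+    Note that the projection target is just the third constructor argument; GPT
+    passes the KV value dimension here so the result can be added to v before
+    the per-head reshape.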
+ Each table maps vocab tokens to a low-dim embedding, projected to model_dim.""" + def __init__(self, vocab_size: int, ve_dim: int, model_dim: int): + super().__init__() + self.embed = nn.Embedding(vocab_size, ve_dim) + nn.init.normal_(self.embed.weight, std=0.01) + self.proj = CastedLinear(ve_dim, model_dim, bias=False) if ve_dim != model_dim else None + if self.proj is not None: + nn.init.zeros_(self.proj.weight) + self.scale = nn.Parameter(torch.tensor(0.1, dtype=torch.float32)) + def forward(self, token_ids: Tensor) -> Tensor: + h = self.embed(token_ids) + if self.proj is not None: + h = self.proj(h) + return h * self.scale.to(dtype=h.dtype) + +class MLP(nn.Module): + def __init__(self, dim: int, mlp_mult: int): + super().__init__() + # No CastedLinear -- weights come from banks + def forward(self, x: Tensor, up_w: Tensor, down_w: Tensor) -> Tensor: + x = F.leaky_relu(F.linear(x, up_w.to(x.dtype)), negative_slope=0.5) + return F.linear(x.square(), down_w.to(x.dtype)) + +class Block(nn.Module): + def __init__( + self, + dim: int, + num_heads: int, + num_kv_heads: int, + mlp_mult: int, + rope_base: float, + qk_gain_init: float, + layer_idx: int = 0, + ln_scale: bool = False, + dtg: bool = False, + gated_attention: bool = False, + value_residual: bool = False, + ): + super().__init__() + self.attn_norm = RMSNorm() + self.mlp_norm = RMSNorm() + self.attn = CausalSelfAttention(dim, num_heads, num_kv_heads, rope_base, qk_gain_init, + gated_attention=gated_attention, value_residual=value_residual) + self.mlp = MLP(dim, mlp_mult) + self.attn_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32)) + self.mlp_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32)) + self.resid_mix = nn.Parameter(torch.stack((torch.ones(dim), torch.zeros(dim))).float()) + self.ln_scale_factor = 1.0 / math.sqrt(layer_idx + 1) if ln_scale else 1.0 + if dtg: + self.dtg_gate = nn.Linear(dim, 1, bias=True) + nn.init.zeros_(self.dtg_gate.weight) + nn.init.constant_(self.dtg_gate.bias, 2.0) + else: + self.dtg_gate = None + def forward(self, x: Tensor, x0: Tensor, q_w: Tensor, k_w: Tensor, v_w: Tensor, out_w: Tensor, up_w: Tensor, down_w: Tensor, v_embed: Tensor | None = None, v0: Tensor | None = None) -> tuple[Tensor, Tensor | None]: + mix = self.resid_mix.to(dtype=x.dtype) + x_in = mix[0][None, None, :] * x + mix[1][None, None, :] * x0 + attn_out, raw_v = self.attn(self.attn_norm(x_in) * self.ln_scale_factor, q_w, k_w, v_w, out_w, v_embed=v_embed, v0=v0) + x_out = x_in + self.attn_scale.to(dtype=x_in.dtype)[None, None, :] * attn_out + x_out = x_out + self.mlp_scale.to(dtype=x_out.dtype)[None, None, :] * self.mlp(self.mlp_norm(x_out) * self.ln_scale_factor, up_w, down_w) + if self.dtg_gate is not None: + gate = torch.sigmoid(self.dtg_gate(x_in.detach())) + x_out = x_in + gate * (x_out - x_in) + return x_out, raw_v + +def _fake_quantize(w: Tensor, bits: int = 6) -> Tensor: + clip_range = (1 << (bits - 1)) - 1 + w32 = w.float() + if w32.ndim >= 2: + row_max = w32.abs().amax(dim=-1) + scale = (row_max / clip_range).clamp_min(1.0 / clip_range) + dims = (slice(None),) * (w32.ndim - 1) + (None,) + w_q = (torch.clamp(torch.round(w32 / scale[dims]), -clip_range, clip_range) * scale[dims]).to(w.dtype) + else: + amax = w32.abs().max() + scale = (amax / clip_range).clamp_min(1.0 / clip_range) + w_q = (torch.clamp(torch.round(w32 / scale), -clip_range, clip_range) * scale).to(w.dtype) + return w + (w_q - w).detach() + +class GPT(nn.Module): + def __init__( + self, + vocab_size: int, + num_layers: int, + model_dim: int, + 
num_heads: int, + num_kv_heads: int, + mlp_mult: int, + tie_embeddings: bool, + tied_embed_init_std: float, + logit_softcap: float, + rope_base: float, + qk_gain_init: float, + mtp_num_heads: int = 0, + mtp_loss_weight: float = 0.1, + bigram_vocab_size: int = 0, + bigram_dim: int = 128, + xsa_last_n: int = 0, + rope_dims: int = 0, + ln_scale: bool = False, + dtg: bool = False, + ve_enabled: bool = False, + ve_dim: int = 128, + ve_layers: str = "9,10", + gated_attention: bool = False, + value_residual: bool = False, + core_start: int = 3, + core_end: int = 8, + num_passes: int = 1, + core_quant_bits: int = 6, + core_quant_enabled: bool = False, + residual_scale: nn.Module | None = None, + interpass_rmsnorm: bool = True, + lora_rank: int = 0, + ): + super().__init__() + self._ve_target_dim = num_kv_heads * (model_dim // num_heads) + if logit_softcap <= 0.0: + raise ValueError(f"logit_softcap must be positive, got {logit_softcap}") + self.tie_embeddings = tie_embeddings + self.tied_embed_init_std = tied_embed_init_std + self.logit_softcap = logit_softcap + self.value_residual = value_residual + self.mtp_num_heads = mtp_num_heads + self.mtp_loss_weight = mtp_loss_weight + self.core_start = core_start + self.core_end = min(core_end, num_layers) + self.interpass_rmsnorm = interpass_rmsnorm + self.num_passes = num_passes + self.core_quant_bits = core_quant_bits + self.core_quant_enabled = core_quant_enabled + self.num_stem = core_start + self.num_core = self.core_end - core_start + self.num_tail = num_layers - self.core_end + self.residual_scale = residual_scale + self.lora_rank = lora_rank + self.tok_emb = nn.Embedding(vocab_size, model_dim) + self.bigram = BigramHashEmbedding(bigram_vocab_size, bigram_dim, model_dim) if bigram_vocab_size > 0 else None + self.smear = SmearGate(model_dim) + self.num_skip_weights = min(self.num_stem, self.num_tail) + self.skip_weights = nn.Parameter(torch.ones(self.num_skip_weights, model_dim, dtype=torch.float32)) + # Parameter banks: contiguous 3D tensors for batched optimizer + head_dim = model_dim // num_heads + kv_dim = num_kv_heads * head_dim + mlp_dim = int(mlp_mult * model_dim) + self.num_layers = num_layers + self.qo_bank = nn.Parameter(torch.empty(2 * num_layers, model_dim, model_dim)) + self.kv_bank = nn.Parameter(torch.empty(2 * num_layers, kv_dim, model_dim)) + self.mlp_up_bank = nn.Parameter(torch.empty(num_layers, mlp_dim, model_dim)) + self.mlp_down_bank = nn.Parameter(torch.empty(num_layers, model_dim, mlp_dim)) + # Per-pass LoRA adapters for recurrent core (scaled B @ A added to bank weights) + self._lora_scale = 1.0 / math.sqrt(lora_rank) if lora_rank > 0 else 1.0 + self.register_buffer('_lora_step_mul', torch.ones((), dtype=torch.float32), persistent=False) + if lora_rank > 0 and self.num_core > 0 and num_passes > 1: + nc, np_, r = self.num_core, num_passes, lora_rank + for wname, in_d, out_d in [ + ("q", model_dim, model_dim), ("out", model_dim, model_dim), + ("k", model_dim, kv_dim), ("v", model_dim, kv_dim), + ("up", model_dim, mlp_dim), ("down", mlp_dim, model_dim), + ]: + A = nn.Parameter(torch.empty(np_, nc, r, in_d)) + nn.init.normal_(A, mean=0.0, std=0.01) + B = nn.Parameter(torch.zeros(np_, nc, out_d, r)) + setattr(self, f"lora_A_{wname}", A) + setattr(self, f"lora_B_{wname}", B) + self.blocks = nn.ModuleList( + [ + Block( + model_dim, + num_heads, + num_kv_heads, + mlp_mult, + rope_base, + qk_gain_init, + layer_idx=i, + ln_scale=ln_scale, + dtg=dtg, + gated_attention=gated_attention, + value_residual=value_residual, + ) + for i in 
range(num_layers) + ] + ) + if rope_dims > 0: + head_dim = model_dim // num_heads + for block in self.blocks: + block.attn.rope_dims = rope_dims + block.attn.rotary = Rotary(head_dim, base=rope_base, train_seq_len=1024, rope_dims=rope_dims) + self.ve_layer_indices = [int(x) for x in ve_layers.split(",") if x.strip()] if ve_enabled else [] + kv_dim_ve = self._ve_target_dim + if self.ve_layer_indices: + self.ve_shared = ValueEmbedding(vocab_size, ve_dim, kv_dim_ve) + self.ve_layer_scales = nn.ParameterList( + [nn.Parameter(torch.ones(1, dtype=torch.float32)) for _ in self.ve_layer_indices] + ) + else: + self.ve_shared = None + self.ve_layer_scales = nn.ParameterList() + self.value_embeds = nn.ModuleList() # keep empty for compat + self.final_norm = RMSNorm() + self.lm_head = None if tie_embeddings else CastedLinear(model_dim, vocab_size, bias=False) + if self.lm_head is not None: + self.lm_head._zero_init = True + self.mtp_heads = nn.ModuleList( + [CastedLinear(model_dim, vocab_size, bias=False) for _ in range(mtp_num_heads)] + ) + for head in self.mtp_heads: + head._zero_init = True + if xsa_last_n > 0: + for i in range(max(0, num_layers - xsa_last_n), num_layers): + if i < core_start or i >= self.core_end: + self.blocks[i].attn.use_xsa = True + self._init_weights() + def _init_weights(self) -> None: + if self.tie_embeddings: + nn.init.normal_(self.tok_emb.weight, mean=0.0, std=self.tied_embed_init_std) + n = self.num_layers + proj_scale = 1.0 / math.sqrt(2 * n) + # Init banks: orthogonal, with proj layers scaled down and out/down zero-init + for i in range(n): + nn.init.orthogonal_(self.qo_bank.data[i], gain=1.0) # Q + nn.init.zeros_(self.qo_bank.data[n + i]) # Out (zero init) + nn.init.orthogonal_(self.kv_bank.data[i], gain=1.0) # K + nn.init.orthogonal_(self.kv_bank.data[n + i], gain=1.0) # V + nn.init.orthogonal_(self.mlp_up_bank.data[i], gain=1.0) # MLP up + nn.init.zeros_(self.mlp_down_bank.data[i]) # MLP down (zero init) + # Scale proj layers (out_proj and mlp_down are "proj" layers) + self.qo_bank.data[n + i].mul_(proj_scale) + self.mlp_down_bank.data[i].mul_(proj_scale) + # Init remaining nn.Linear modules (bigram proj, mtp heads, lm_head) + for name, module in self.named_modules(): + if isinstance(module, nn.Linear): + if getattr(module, "_zero_init", False): + nn.init.zeros_(module.weight) + elif module.weight.ndim == 2 and module.weight.shape[0] >= 64 and module.weight.shape[1] >= 64: + nn.init.orthogonal_(module.weight, gain=1.0) + def _get_ve(self, layer_idx: int, input_ids: Tensor, ve_cache: dict | None = None) -> Tensor | None: + """Get value embedding for a specific layer using shared table + per-layer scale.""" + if self.ve_shared is None or layer_idx not in self.ve_layer_indices: + return None + if ve_cache is not None and 've' not in ve_cache: + ve_cache['ve'] = self.ve_shared(input_ids) + ve_base = ve_cache['ve'] if ve_cache is not None else self.ve_shared(input_ids) + ve_idx = self.ve_layer_indices.index(layer_idx) + return ve_base * self.ve_layer_scales[ve_idx].to(dtype=ve_base.dtype) + def _get_bank_weights(self, bi: int) -> tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]: + n = self.num_layers + q_w = self.qo_bank[bi] + out_w = self.qo_bank[n + bi] + k_w = self.kv_bank[bi] + v_w = self.kv_bank[n + bi] + up_w = self.mlp_up_bank[bi] + down_w = self.mlp_down_bank[bi] + if self.core_quant_enabled and self.training and self.core_start <= bi < self.core_end: + q_w = _fake_quantize(q_w, self.core_quant_bits) + out_w = _fake_quantize(out_w, self.core_quant_bits) + k_w 
= _fake_quantize(k_w, self.core_quant_bits) + v_w = _fake_quantize(v_w, self.core_quant_bits) + up_w = _fake_quantize(up_w, self.core_quant_bits) + down_w = _fake_quantize(down_w, self.core_quant_bits) + return q_w, k_w, v_w, out_w, up_w, down_w + + def _forward_hidden(self, input_ids: Tensor, feedback_fn=None, + stabilizer=None) -> tuple[Tensor, Tensor, Tensor]: + n = self.num_layers + x = self.tok_emb(input_ids) + if self.bigram is not None: + x = x + self.bigram(input_ids) + x = F.rms_norm(x, (x.size(-1),)) + x = self.smear(x) + x0 = x + v0 = None + skips: list[Tensor] = [] + ve_cache: dict = {} + # --- STEM --- + for i in range(self.core_start): + ve = self._get_ve(i, input_ids, ve_cache) + q_w, k_w, v_w, out_w, up_w, down_w = self._get_bank_weights(i) + x, raw_v = self.blocks[i](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, + v_embed=ve, v0=v0) + if v0 is None and raw_v is not None: + v0 = raw_v + skips.append(x) + # --- RECURRENT CORE (Fixes 1, 2, 5) --- + h_core_in = x + for k in range(self.num_passes): + if k > 0 and self.interpass_rmsnorm: + x = F.rms_norm(x, (x.size(-1),)) + if feedback_fn is not None: + x = x + feedback_fn(x, k) + if stabilizer is not None: + x = stabilizer.clip(x) + x_before_pass = x + for j in range(self.core_start, self.core_end): + h_prev = x + ve = self._get_ve(j, input_ids, ve_cache) + q_w, k_w, v_w, out_w, up_w, down_w = self._get_bank_weights(j) + if self.lora_rank > 0: + ci = j - self.core_start + s = self._lora_scale * self._lora_step_mul + q_w = q_w + s * (self.lora_B_q[k, ci] @ self.lora_A_q[k, ci]) + k_w = k_w + s * (self.lora_B_k[k, ci] @ self.lora_A_k[k, ci]) + v_w = v_w + s * (self.lora_B_v[k, ci] @ self.lora_A_v[k, ci]) + out_w = out_w + s * (self.lora_B_out[k, ci] @ self.lora_A_out[k, ci]) + up_w = up_w + s * (self.lora_B_up[k, ci] @ self.lora_A_up[k, ci]) + down_w = down_w + s * (self.lora_B_down[k, ci] @ self.lora_A_down[k, ci]) + x, raw_v = self.blocks[j](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, + v_embed=ve, v0=v0) + if v0 is None and raw_v is not None: + v0 = raw_v + if stabilizer is not None and self.training and not torch.compiler.is_compiling(): + stabilizer.record_pass(h_prev, x) + if self.residual_scale is not None and k > 0: + delta = x - x_before_pass + x = x_before_pass + self.residual_scale(delta, k) + h_core_out = x + # --- TAIL --- + for i in range(self.core_end, n): + ti = i - self.core_end + if ti < len(skips): + x = x + self.skip_weights[ti].to(dtype=x.dtype)[None, None, :] * skips.pop() + ve = self._get_ve(i, input_ids, ve_cache) + q_w, k_w, v_w, out_w, up_w, down_w = self._get_bank_weights(i) + x, _ = self.blocks[i](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, + v_embed=ve, v0=v0) + x = self.final_norm(x) + return x, h_core_in, h_core_out + + def forward(self, input_ids: Tensor, target_ids: Tensor, + feedback_fn=None, stabilizer=None) -> Tensor: + x, h_core_in, h_core_out = self._forward_hidden(input_ids, feedback_fn, stabilizer) + x_flat = x.reshape(-1, x.size(-1)) + targets = target_ids.reshape(-1) + if self.tie_embeddings: + logits_proj = F.linear(x_flat, self.tok_emb.weight) + else: + if self.lm_head is None: + raise RuntimeError("lm_head is required when tie_embeddings=False") + logits_proj = self.lm_head(x_flat) + logits = self.logit_softcap * torch.tanh(logits_proj / self.logit_softcap) + main_loss = F.cross_entropy(logits.float(), targets, reduction="mean") + if self.training and self.mtp_num_heads > 0 and self.mtp_loss_weight > 0.0: + _, seqlen, dim = x.shape + mtp_loss_sum = x.new_zeros(()) + mtp_loss_count = 0 + for 
k, mtp_head in enumerate(self.mtp_heads): + valid_t = seqlen - (k + 1) + if valid_t <= 0: + continue + mtp_hidden = x[:, :valid_t, :].reshape(-1, dim) + mtp_targets = target_ids[:, k + 1 :].reshape(-1) + mtp_logits_proj = mtp_head(mtp_hidden) + mtp_logits = self.logit_softcap * torch.tanh(mtp_logits_proj / self.logit_softcap) + mtp_loss_sum = mtp_loss_sum + F.cross_entropy(mtp_logits.float(), mtp_targets, reduction="mean") + mtp_loss_count += 1 + if mtp_loss_count > 0: + main_loss = main_loss + self.mtp_loss_weight * (mtp_loss_sum / mtp_loss_count) + if stabilizer is not None and stabilizer.jacobian_proxy_weight > 0: + main_loss = main_loss + stabilizer.jacobian_proxy_loss(h_core_in, h_core_out) + return main_loss + + def forward_logits(self, input_ids: Tensor, + feedback_fn=None, stabilizer=None) -> Tensor: + """Return logits (bsz, seq_len, vocab) without computing loss.""" + x, _, _ = self._forward_hidden(input_ids, feedback_fn, stabilizer) + if self.tie_embeddings: + logits_proj = F.linear(x, self.tok_emb.weight) + else: + logits_proj = self.lm_head(x) + return self.logit_softcap * torch.tanh(logits_proj / self.logit_softcap) + +# --- Sliding window evaluation --- + +def eval_val_sliding( + args: Hyperparameters, + base_model: nn.Module, + rank: int, + world_size: int, + device: torch.device, + val_tokens: Tensor, + base_bytes_lut: Tensor, + has_leading_space_lut: Tensor, + is_boundary_token_lut: Tensor, + stride: int, + batch_seqs: int = 32, + eval_seq_len: int | None = None, +) -> tuple[float, float]: + """Sliding window evaluation: each token scored with maximum context.""" + seq_len = eval_seq_len or args.train_seq_len + total_tokens = val_tokens.numel() - 1 + window_starts = [ws for ws in range(0, total_tokens, stride) + if min(ws + seq_len, total_tokens) - ws >= 1] + total_windows = len(window_starts) + my_s = (total_windows * rank) // world_size + my_e = (total_windows * (rank + 1)) // world_size + my_windows = window_starts[my_s:my_e] + loss_sum = torch.zeros((), device=device, dtype=torch.float64) + token_count = torch.zeros((), device=device, dtype=torch.float64) + byte_count = torch.zeros((), device=device, dtype=torch.float64) + base_model.eval() + compiled_logits = torch.compile(base_model.forward_logits, dynamic=False, fullgraph=True) + with torch.inference_mode(): + for bi in range(0, len(my_windows), batch_seqs): + batch_ws = my_windows[bi:bi + batch_seqs] + bsz = len(batch_ws) + x_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device) + y_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device) + wlens: list[int] = [] + for i, ws in enumerate(batch_ws): + end = min(ws + seq_len, total_tokens) + wlen = end - ws + wlens.append(wlen) + chunk = val_tokens[ws:end + 1].to(dtype=torch.int64, device=device) + x_batch[i, :wlen] = chunk[:-1] + y_batch[i, :wlen] = chunk[1:] + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + logits = compiled_logits(x_batch) + nll = F.cross_entropy( + logits.reshape(-1, logits.size(-1)).float(), + y_batch.reshape(-1), + reduction="none", + ).reshape(bsz, seq_len) + for i, ws in enumerate(batch_ws): + wlen = wlens[i] + s = 0 if ws == 0 else max(wlen - stride, 0) + scored_nll = nll[i, s:wlen].to(torch.float64) + loss_sum += scored_nll.sum() + token_count += float(wlen - s) + tgt = y_batch[i, s:wlen] + prev = x_batch[i, s:wlen] + tb = base_bytes_lut[tgt].to(torch.float64) + tb += (has_leading_space_lut[tgt] & ~is_boundary_token_lut[prev]).to(torch.float64) + byte_count += tb.sum() + if dist.is_available() and 
dist.is_initialized(): + dist.all_reduce(loss_sum, op=dist.ReduceOp.SUM) + dist.all_reduce(token_count, op=dist.ReduceOp.SUM) + dist.all_reduce(byte_count, op=dist.ReduceOp.SUM) + val_loss = (loss_sum / token_count).item() + bits_per_token = val_loss / math.log(2.0) + tokens_per_byte = token_count.item() / byte_count.item() + base_model.train() + return val_loss, bits_per_token * tokens_per_byte + + +def eval_val_sliding_ttt( + args: Hyperparameters, base_model: nn.Module, rank: int, world_size: int, + device: torch.device, val_tokens: Tensor, base_bytes_lut: Tensor, + has_leading_space_lut: Tensor, is_boundary_token_lut: Tensor, + stride: int, batch_seqs: int = 32, log0=print, +) -> tuple[float, float]: + """Legal score-first TTT (PR #461 recipe): score each chunk with sliding windows, + then train on it. Every token scored BEFORE any update that could use it.""" + seq_len = args.train_seq_len + total_tokens = val_tokens.numel() - 1 + ttt_chunk = args.ttt_chunk_tokens + + # Pre-compute all window starts + window_starts = [ws for ws in range(0, total_tokens, stride) + if min(ws + seq_len, total_tokens) - ws >= stride or ws == 0] + + # Assign each window to a chunk based on the first token it scores + num_chunks = (total_tokens + ttt_chunk - 1) // ttt_chunk + chunk_windows: list[list[int]] = [[] for _ in range(num_chunks)] + for ws in window_starts: + end = min(ws + seq_len, total_tokens) + wlen = end - ws + s = 0 if ws == 0 else max(wlen - stride, 0) + scored_start = ws + s + ci = min(scored_start // ttt_chunk, num_chunks - 1) + chunk_windows[ci].append(ws) + + log0(f"ttt_sliding:start chunks={num_chunks} chunk_tokens={ttt_chunk} " + f"total_windows={len(window_starts)} stride={stride} " + f"ttt_lr={args.ttt_lr} ttt_epochs={args.ttt_epochs} " + f"freeze_blocks={args.ttt_freeze_blocks}") + + loss_sum = torch.zeros((), device=device, dtype=torch.float64) + token_count = torch.zeros((), device=device, dtype=torch.float64) + byte_count = torch.zeros((), device=device, dtype=torch.float64) + + # Freeze first N blocks + frozen_block_ids = set(range(min(args.ttt_freeze_blocks, len(base_model.blocks)))) + ttt_params = [] + for name, p in base_model.named_parameters(): + freeze = False + for bi in frozen_block_ids: + if f"blocks.{bi}." 
in name: + freeze = True + break + if freeze: + p.requires_grad_(False) + else: + p.requires_grad_(True) + ttt_params.append(p) + + log0(f"ttt_sliding:params unfrozen={sum(p.numel() for p in ttt_params)} " + f"frozen={sum(p.numel() for p in base_model.parameters() if not p.requires_grad)}") + + optimizer = torch.optim.SGD(ttt_params, lr=args.ttt_lr, momentum=args.ttt_momentum) + t0 = time.perf_counter() + + for ci in range(num_chunks): + windows = chunk_windows[ci] + if not windows: + continue + chunk_start = ci * ttt_chunk + chunk_end = min((ci + 1) * ttt_chunk, total_tokens) + + # --- Phase 1: SCORE this chunk's windows (inference_mode) --- + my_s = (len(windows) * rank) // world_size + my_e = (len(windows) * (rank + 1)) // world_size + my_windows = windows[my_s:my_e] + + base_model.eval() + with torch.inference_mode(): + for bi in range(0, len(my_windows), batch_seqs): + batch_ws = my_windows[bi:bi + batch_seqs] + bsz = len(batch_ws) + x_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device) + y_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device) + wlens: list[int] = [] + for i, ws in enumerate(batch_ws): + end = min(ws + seq_len, total_tokens) + wlen = end - ws + wlens.append(wlen) + chunk_tok = val_tokens[ws:end + 1].to(dtype=torch.int64, device=device) + x_batch[i, :wlen] = chunk_tok[:-1] + y_batch[i, :wlen] = chunk_tok[1:] + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + logits = base_model.forward_logits(x_batch) + nll = F.cross_entropy( + logits.reshape(-1, logits.size(-1)).float(), + y_batch.reshape(-1), reduction="none", + ).reshape(bsz, seq_len) + for i, ws in enumerate(batch_ws): + wlen = wlens[i] + s = 0 if ws == 0 else max(wlen - stride, 0) + scored_nll = nll[i, s:wlen].to(torch.float64) + loss_sum += scored_nll.sum() + token_count += float(wlen - s) + tgt, prev = y_batch[i, s:wlen], x_batch[i, s:wlen] + tb = base_bytes_lut[tgt].to(torch.float64) + tb += (has_leading_space_lut[tgt] & ~is_boundary_token_lut[prev]).to(torch.float64) + byte_count += tb.sum() + + # --- Phase 2: TRAIN on this chunk (already scored = legal) --- + is_last_chunk = (ci == num_chunks - 1) + if not is_last_chunk and args.ttt_epochs > 0: + base_model.train() + chunk_seqs = (chunk_end - chunk_start) // seq_len + if chunk_seqs > 0: + cos_lr = args.ttt_lr * 0.5 * (1.0 + math.cos(math.pi * ci / max(num_chunks - 1, 1))) + for pg in optimizer.param_groups: + pg['lr'] = cos_lr + my_seq_s = (chunk_seqs * rank) // world_size + my_seq_e = (chunk_seqs * (rank + 1)) // world_size + my_chunk_seqs = my_seq_e - my_seq_s + for _ep in range(args.ttt_epochs): + for bs in range(0, my_chunk_seqs, args.ttt_batch_seqs): + be = min(bs + args.ttt_batch_seqs, my_chunk_seqs) + actual_bs = my_seq_s + bs + start_tok = chunk_start + actual_bs * seq_len + end_tok = chunk_start + (my_seq_s + be) * seq_len + 1 + if end_tok > val_tokens.numel(): + continue + local = val_tokens[start_tok:end_tok].to(device=device, dtype=torch.int64) + x = local[:-1].reshape(-1, seq_len) + y = local[1:].reshape(-1, seq_len) + optimizer.zero_grad(set_to_none=True) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + loss = base_model(x, y) + loss.backward() + if world_size > 1: + for p in ttt_params: + if p.grad is not None: + dist.all_reduce(p.grad, op=dist.ReduceOp.AVG) + torch.nn.utils.clip_grad_norm_(ttt_params, args.ttt_grad_clip) + optimizer.step() + + if rank == 0 and (ci % 10 == 0 or ci == num_chunks - 1): + elapsed = time.perf_counter() - t0 + rl = loss_sum.item() / 
max(token_count.item(), 1) + rbpb = rl / math.log(2.0) * (token_count.item() / max(byte_count.item(), 1)) if token_count.item() > 0 else 0.0 + log0(f" ttt_chunk [{ci+1}/{num_chunks}] bpb={rbpb:.6f} time={elapsed:.1f}s") + + if dist.is_available() and dist.is_initialized(): + dist.all_reduce(loss_sum, op=dist.ReduceOp.SUM) + dist.all_reduce(token_count, op=dist.ReduceOp.SUM) + dist.all_reduce(byte_count, op=dist.ReduceOp.SUM) + + val_loss = (loss_sum / token_count).item() + val_bpb = val_loss / math.log(2.0) * (token_count.item() / byte_count.item()) + + for p in base_model.parameters(): + p.requires_grad_(True) + base_model.eval() + + log0(f"ttt_sliding:done val_loss={val_loss:.6f} val_bpb={val_bpb:.6f} " + f"elapsed={time.perf_counter() - t0:.1f}s") + return val_loss, val_bpb + + +# --- GPTQ-lite int6 quantization --- + +def _classify_param(name: str) -> str: + if "tok_emb" in name or "lm_head" in name: + return "embed" + if ".mlp." in name: + return "mlp" + if ".attn." in name or (".proj." in name and ".mlp." not in name): + return "attn" + return "other" +def quantize_int6_per_row(t: Tensor, clip_range: int = 31) -> tuple[Tensor, Tensor]: + t32 = t.float() + if t32.ndim == 2: + best_q, best_s, best_err = None, None, float('inf') + for pct in [0.9990, 0.9995, 0.9999, 0.99999, 1.0]: + if pct < 1.0: + row_clip = torch.quantile(t32.abs(), pct, dim=1) + else: + row_clip = t32.abs().amax(dim=1) + s = (row_clip / clip_range).clamp_min(1.0 / clip_range).to(torch.float16) + q = torch.clamp(torch.round(t32 / s.float()[:, None]), -clip_range, clip_range).to(torch.int8) + recon = q.float() * s.float()[:, None] + err = (t32 - recon).pow(2).mean().item() + if err < best_err: + best_q, best_s, best_err = q, s, err + return best_q, best_s + amax = t32.abs().max().item() + scale = torch.tensor(amax / clip_range if amax > 0 else 1.0, dtype=torch.float16) + q = torch.clamp(torch.round(t32 / scale.float()), -clip_range, clip_range).to(torch.int8) + return q, scale + +def _unbank_state_dict(sd: dict[str, Tensor], num_layers: int) -> dict[str, Tensor]: + """Convert 3D bank tensors into individual 2D tensors with standard names.""" + out: dict[str, Tensor] = {} + n = num_layers + for name, tensor in sd.items(): + if name == "qo_bank": + for i in range(n): + out[f"blocks.{i}.attn.c_q.weight"] = tensor[i] + out[f"blocks.{i}.attn.proj.weight"] = tensor[n + i] + elif name == "kv_bank": + for i in range(n): + out[f"blocks.{i}.attn.c_k.weight"] = tensor[i] + out[f"blocks.{i}.attn.c_v.weight"] = tensor[n + i] + elif name == "mlp_up_bank": + for i in range(n): + out[f"blocks.{i}.mlp.fc.weight"] = tensor[i] + elif name == "mlp_down_bank": + for i in range(n): + out[f"blocks.{i}.mlp.proj.weight"] = tensor[i] + else: + out[name] = tensor + return out + +def _rebank_state_dict(sd: dict[str, Tensor], num_layers: int, template_sd: dict[str, Tensor]) -> dict[str, Tensor]: + """Convert individual 2D tensors back into 3D bank tensors.""" + out: dict[str, Tensor] = {} + n = num_layers + # Reconstruct banks from individual weight keys + qo_slices = [None] * (2 * n) + kv_slices = [None] * (2 * n) + up_slices = [None] * n + down_slices = [None] * n + consumed = set() + for i in range(n): + qk = f"blocks.{i}.attn.c_q.weight" + if qk in sd: + qo_slices[i] = sd[qk] + consumed.add(qk) + ok = f"blocks.{i}.attn.proj.weight" + if ok in sd: + qo_slices[n + i] = sd[ok] + consumed.add(ok) + kk = f"blocks.{i}.attn.c_k.weight" + if kk in sd: + kv_slices[i] = sd[kk] + consumed.add(kk) + vk = f"blocks.{i}.attn.c_v.weight" + if vk in sd: + 
kv_slices[n + i] = sd[vk] + consumed.add(vk) + fk = f"blocks.{i}.mlp.fc.weight" + if fk in sd: + up_slices[i] = sd[fk] + consumed.add(fk) + dk = f"blocks.{i}.mlp.proj.weight" + if dk in sd: + down_slices[i] = sd[dk] + consumed.add(dk) + out["qo_bank"] = torch.stack(qo_slices).to(dtype=template_sd["qo_bank"].dtype) + out["kv_bank"] = torch.stack(kv_slices).to(dtype=template_sd["kv_bank"].dtype) + out["mlp_up_bank"] = torch.stack(up_slices).to(dtype=template_sd["mlp_up_bank"].dtype) + out["mlp_down_bank"] = torch.stack(down_slices).to(dtype=template_sd["mlp_down_bank"].dtype) + for name, tensor in sd.items(): + if name not in consumed: + out[name] = tensor + return out + +def mixed_quantize_int6(state_dict: dict[str, Tensor], int6_cats: set[str]): + num_layers_total = max( + (int(k.split(".")[1]) for k in state_dict if k.startswith("blocks.")), + default=0, + ) + 1 + late_k_layers = set(range(num_layers_total - 2, num_layers_total)) + result: dict[str, Tensor] = {} + meta: dict[str, object] = {} + for name, tensor in state_dict.items(): + t = tensor.detach().cpu().contiguous() + cat = _classify_param(name) + if not t.is_floating_point() or t.numel() <= 65536: + result[name] = t.to(torch.float16) if t.is_floating_point() else t + meta[name] = "passthrough" + continue + if any(p in name for p in CONTROL_TENSOR_NAME_PATTERNS): + result[name] = t.float() + meta[name] = "passthrough_ctrl" + continue + if cat in int6_cats and t.ndim >= 1: + q, s = quantize_int6_per_row(t) + result[name + ".q"] = q + result[name + ".scale"] = s + meta[name] = {"type": "int6"} + else: + q, s = quantize_float_tensor(t) + result[name + ".q"] = q + result[name + ".scale"] = s + meta[name] = {"type": "int8"} + return result, meta +def dequantize_mixed_int6(result: dict[str, Tensor], meta: dict[str, object], + template_sd: dict[str, Tensor]) -> dict[str, Tensor]: + out: dict[str, Tensor] = {} + for name, orig in template_sd.items(): + info = meta.get(name) + if info is None: + continue + orig_dtype = orig.dtype + if info in ("passthrough", "passthrough_ctrl", "passthrough_fp16"): + t = result[name] + if t.dtype == torch.float16 and orig_dtype in (torch.float32, torch.bfloat16): + t = t.to(orig_dtype) + out[name] = t + continue + q, s = result[name + ".q"], result[name + ".scale"] + if s.ndim > 0: + out[name] = (q.float() * s.float().view(q.shape[0], *([1] * (q.ndim - 1)))).to(orig_dtype) + else: + out[name] = (q.float() * float(s.item())).to(orig_dtype) + return out + +# --- Training --- + +def parse_args() -> argparse.Namespace: + parser = argparse.ArgumentParser(description="Recurrent SOTA with stabilization") + g = parser.add_argument_group("feedback") + g.add_argument("--feedback-rank", type=int, default=2) + g.add_argument("--feedback-mode", type=str, default="diagonal", + choices=["identity", "diagonal", "low_rank", "none"]) + g.add_argument("--per-pass-feedback", action="store_true") + g.add_argument("--affine-junction", action="store_true") + g = parser.add_argument_group("stability") + g.add_argument("--clip-hidden", action="store_true") + g.add_argument("--clip-value", type=float, default=10.0) + g.add_argument("--residual-scale-init", type=float, default=0.5) + g.add_argument("--jacobian-proxy-weight", type=float, default=0.01) + g.add_argument("--no-interpass-rmsnorm", action="store_true") + g.add_argument("--lora-rank", type=int, default=0) + g.add_argument("--lora-warmup-steps", type=int, default=0, + help="Linearly ramp LoRA scale from 0 to 1 over this many steps.") + g = 
parser.add_argument_group("eval-only") + g.add_argument("--eval-only-passes", type=int, default=None, + help="Skip training; load final_model.pt and run TTT eval with this many passes.") + g.add_argument("--eval-only-checkpoint", type=str, default="final_model.pt", + help="Checkpoint path for --eval-only-passes mode.") + return parser.parse_args() + +def main() -> None: + cli = parse_args() + code = Path(__file__).read_text(encoding="utf-8") + args = Hyperparameters() + distributed = "RANK" in os.environ and "WORLD_SIZE" in os.environ + rank = int(os.environ.get("RANK", "0")) + world_size = int(os.environ.get("WORLD_SIZE", "1")) + local_rank = int(os.environ.get("LOCAL_RANK", "0")) + if world_size <= 0: + raise ValueError(f"WORLD_SIZE must be positive, got {world_size}") + if 8 % world_size != 0: + raise ValueError(f"WORLD_SIZE={world_size} must divide 8 so grad_accum_steps stays integral") + grad_accum_steps = 8 // world_size + grad_scale = 1.0 / grad_accum_steps + if not torch.cuda.is_available(): + raise RuntimeError("CUDA is required") + device = torch.device("cuda", local_rank) + torch.cuda.set_device(device) + if distributed: + dist.init_process_group(backend="nccl", device_id=device) + dist.barrier() + master_process = rank == 0 + torch.backends.cuda.matmul.allow_tf32 = True + torch.backends.cudnn.allow_tf32 = True + from torch.backends.cuda import enable_cudnn_sdp, enable_flash_sdp, enable_math_sdp, enable_mem_efficient_sdp + enable_cudnn_sdp(False) + enable_flash_sdp(True) + enable_mem_efficient_sdp(False) + enable_math_sdp(False) + logfile = None + if master_process: + os.makedirs("logs", exist_ok=True) + logfile = f"logs/{args.run_id}.txt" + print(logfile) + def log0(msg: str, console: bool = True) -> None: + if not master_process: + return + if console: + print(msg) + if logfile is not None: + with open(logfile, "a", encoding="utf-8") as f: + print(msg, file=f) + log0(code, console=False) + log0("=" * 100, console=False) + log0(f"Running Python {sys.version}", console=False) + log0(f"Running PyTorch {torch.__version__}", console=False) + log0( + subprocess.run(["nvidia-smi"], stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, check=False).stdout, + console=False, + ) + log0("=" * 100, console=False) + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + torch.cuda.manual_seed_all(args.seed) + if not args.tokenizer_path.endswith(".model"): + raise ValueError(f"Script only setup for SentencePiece .model file: {args.tokenizer_path}") + sp = spm.SentencePieceProcessor(model_file=args.tokenizer_path) + if int(sp.vocab_size()) != args.vocab_size: + raise ValueError( + f"VOCAB_SIZE={args.vocab_size} does not match tokenizer vocab_size={int(sp.vocab_size())}" + ) + dataset_dir = Path(args.data_path).resolve() + actual_train_files = len(list(dataset_dir.glob("fineweb_train_*.bin"))) + effective_eval_seq_len = args.eval_seq_len if args.eval_seq_len > 0 else args.train_seq_len + val_seq_len = max(args.train_seq_len, effective_eval_seq_len) + val_tokens = load_validation_tokens(args.val_files, val_seq_len) + base_bytes_lut, has_leading_space_lut, is_boundary_token_lut = build_sentencepiece_luts( + sp, args.vocab_size, device + ) + log0(f"val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path={args.tokenizer_path}") + log0(f"train_loader:dataset:{dataset_dir.name} train_shards:{actual_train_files}") + log0(f"val_loader:shards pattern={args.val_files} tokens:{val_tokens.numel() - 1}") + CastedLinear._qat_enabled = args.qat_enabled + base_model = 
GPT( + vocab_size=args.vocab_size, + num_layers=args.num_layers, + model_dim=args.model_dim, + num_heads=args.num_heads, + num_kv_heads=args.num_kv_heads, + mlp_mult=args.mlp_mult, + tie_embeddings=args.tie_embeddings, + tied_embed_init_std=args.tied_embed_init_std, + logit_softcap=args.logit_softcap, + rope_base=args.rope_base, + qk_gain_init=args.qk_gain_init, + mtp_num_heads=args.mtp_num_heads, + mtp_loss_weight=args.mtp_loss_weight, + bigram_vocab_size=args.bigram_vocab_size, + bigram_dim=args.bigram_dim, + xsa_last_n=args.xsa_last_n, + rope_dims=args.rope_dims, + ln_scale=args.ln_scale, + dtg=args.dtg_enabled, + ve_enabled=args.ve_enabled, + ve_dim=args.ve_dim, + ve_layers=args.ve_layers, + gated_attention=args.gated_attention, + value_residual=args.value_residual, + core_start=args.core_start, + core_end=args.core_end, + num_passes=args.num_passes, + core_quant_bits=args.core_quant_bits, + core_quant_enabled=args.core_quant_enabled, + residual_scale=None, + interpass_rmsnorm=not cli.no_interpass_rmsnorm, + lora_rank=cli.lora_rank or args.lora_rank, + ).to(device).bfloat16() + # Banks stay FP32 (like CastedLinear weights), cast to BF16 in forward + base_model.qo_bank.data = base_model.qo_bank.data.float() + base_model.kv_bank.data = base_model.kv_bank.data.float() + base_model.mlp_up_bank.data = base_model.mlp_up_bank.data.float() + base_model.mlp_down_bank.data = base_model.mlp_down_bank.data.float() + for module in base_model.modules(): + if isinstance(module, CastedLinear): + module.float() + restore_low_dim_params_to_fp32(base_model) + # --- feedback / stabilizer --- + feedback = None + feedback_fn = None + stabilizer = None + residual_scale = None + extra_scalar_params: list[nn.Parameter] = [] + if cli.feedback_mode != "none" and args.num_passes > 1: + feedback = ErrorFeedbackModule( + dim=args.model_dim, rank=cli.feedback_rank, + feedback_mode=cli.feedback_mode, + per_pass=cli.per_pass_feedback, + num_passes=args.num_passes, + affine_junction=cli.affine_junction, + ).to(device).bfloat16() + restore_low_dim_params_to_fp32(feedback) + extra_scalar_params.extend(feedback.parameters()) + def feedback_fn(h, pass_idx): + return feedback(h, pass_idx) + log0(f"feedback: mode={cli.feedback_mode} rank={cli.feedback_rank} " + f"per_pass={cli.per_pass_feedback} params={sum(p.numel() for p in feedback.parameters())}") + if args.num_passes > 1: + stabilizer = RecurrentStabilizer( + clip_hidden=cli.clip_hidden, clip_value=cli.clip_value, + jacobian_proxy_weight=cli.jacobian_proxy_weight) + if cli.residual_scale_init != 1.0: + residual_scale = ResidualScale(args.num_passes, cli.residual_scale_init).to(device) + base_model.residual_scale = residual_scale + extra_scalar_params.extend(residual_scale.parameters()) + lora_params: list[nn.Parameter] = [] + if base_model.lora_rank > 0: + lora_params = [p for n, p in base_model.named_parameters() if "lora_" in n] + for p in lora_params: + p.data = p.data.float() + log0(f"lora: rank={base_model.lora_rank} params={sum(p.numel() for p in lora_params)}") + log0(f"recurrence: core_start={args.core_start} core_end={args.core_end} " + f"num_passes={args.num_passes} stem={base_model.num_stem} " + f"core={base_model.num_core} tail={base_model.num_tail}") + + # --- Eval-only mode: load checkpoint, override passes, run TTT, exit --- + if cli.eval_only_passes is not None: + ckpt_path = cli.eval_only_checkpoint + log0(f"eval_only: loading checkpoint {ckpt_path}") + ckpt_sd = torch.load(ckpt_path, map_location=device, weights_only=True) + 
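+        # ResidualScale is padded/trimmed below when the pass count changes, but the
+        # per-pass LoRA banks are shaped [num_passes, ...] at construction, so raising
+        # --eval-only-passes above the trained pass count assumes lora_rank == 0.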
base_model.load_state_dict(ckpt_sd, strict=True) + base_model.qo_bank.data = base_model.qo_bank.data.float() + base_model.kv_bank.data = base_model.kv_bank.data.float() + base_model.mlp_up_bank.data = base_model.mlp_up_bank.data.float() + base_model.mlp_down_bank.data = base_model.mlp_down_bank.data.float() + for module in base_model.modules(): + if isinstance(module, CastedLinear): + module.float() + restore_low_dim_params_to_fp32(base_model) + target_passes = cli.eval_only_passes + trained_passes = base_model.num_passes + log0(f"eval_only: overriding num_passes {trained_passes} -> {target_passes}") + base_model.num_passes = target_passes + if base_model.residual_scale is not None: + old_scales = base_model.residual_scale.scales.data + if target_passes != old_scales.shape[0]: + new_scales = torch.full((target_passes,), cli.residual_scale_init, + dtype=torch.float32, device=old_scales.device) + copy_len = min(target_passes, old_scales.shape[0]) + new_scales[:copy_len] = old_scales[:copy_len] + base_model.residual_scale.scales = nn.Parameter(new_scales) + log0(f"eval_only: ResidualScale padded/trimmed {old_scales.shape[0]} -> {target_passes}") + base_model.eval() + log0(f"eval_only: running TTT with {target_passes} passes") + ttt_loss, ttt_bpb = eval_val_sliding_ttt( + args, base_model, rank, world_size, device, + val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut, + stride=args.eval_stride, log0=log0, + ) + log0(f"legal_ttt val_loss:{ttt_loss:.4f} val_bpb:{ttt_bpb:.4f}") + log0(f"legal_ttt_exact val_loss:{ttt_loss:.8f} val_bpb:{ttt_bpb:.8f}") + if distributed: + dist.destroy_process_group() + return + + # No DDP -- Parallel Muon handles bank grad communication via reduce-scatter, + # and non-bank grads are manually all-reduced before Adam steps. 
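+    # (3-phase step, in the training loop below: launch async reduce-scatter on the
+    # banks, overlap the replicated-param all-reduce and Adam steps with it, then
+    # finish Muon with local Newton-Schulz and an all-gather.)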
+ compiled_model = torch.compile(base_model, dynamic=False, fullgraph=True) + model = compiled_model + + # Optimizer split: + # - 4 parameter banks -> Muon (batched Newton-Schulz) + # - token embedding -> Adam + # - scalars/control tensors -> Adam + # - bigram proj, mtp heads, VE proj -> Adam (small matrix params not worth banking) + matrix_params = [ + base_model.qo_bank, base_model.kv_bank, + base_model.mlp_up_bank, base_model.mlp_down_bank, + ] + block_named_params = list(base_model.blocks.named_parameters()) + scalar_params = [ + p + for name, p in block_named_params + if p.ndim < 2 or any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS) + ] + if base_model.skip_weights.numel() > 0: + scalar_params.append(base_model.skip_weights) + scalar_params.append(base_model.smear.gate) + if base_model.bigram is not None: + scalar_params.append(base_model.bigram.scale) + token_lr = args.tied_embed_lr if args.tie_embeddings else args.embed_lr + tok_params = [{"params": [base_model.tok_emb.weight], "lr": token_lr, "base_lr": token_lr}] + if base_model.bigram is not None: + tok_params.append({"params": [base_model.bigram.embed.weight], "lr": token_lr, "base_lr": token_lr}) + if base_model.bigram.proj is not None: + scalar_params.append(base_model.bigram.proj.weight) + if base_model.ve_shared is not None: + tok_params.append({"params": [base_model.ve_shared.embed.weight], "lr": token_lr, "base_lr": token_lr}) + if base_model.ve_shared.proj is not None: + scalar_params.append(base_model.ve_shared.proj.weight) + scalar_params.append(base_model.ve_shared.scale) + for s in base_model.ve_layer_scales: + scalar_params.append(s) + optimizer_tok = torch.optim.AdamW( + tok_params, + betas=(args.beta1, args.beta2), + eps=args.adam_eps, + weight_decay=args.adam_wd, + fused=True, + ) + optimizer_muon = Muon( + matrix_params, + lr=args.matrix_lr, + momentum=args.muon_momentum, + backend_steps=args.muon_backend_steps, + weight_decay=args.muon_wd, + ) + for group in optimizer_muon.param_groups: + group["base_lr"] = args.matrix_lr + scalar_params.extend(extra_scalar_params) + optimizer_scalar = torch.optim.AdamW( + [{"params": scalar_params, "lr": args.scalar_lr, "base_lr": args.scalar_lr}], + betas=(args.beta1, args.beta2), + eps=args.adam_eps, + weight_decay=args.adam_wd, + fused=True, + ) + # Non-bank params that need manual all-reduce (replicated across GPUs) + replicated_params = list(optimizer_tok.param_groups[0]["params"]) + for pg in optimizer_tok.param_groups[1:]: + replicated_params.extend(pg["params"]) + replicated_params.extend(scalar_params) + + optimizer_head = None + if base_model.lm_head is not None: + optimizer_head = torch.optim.Adam( + [{"params": [base_model.lm_head.weight], "lr": args.head_lr, "base_lr": args.head_lr}], + betas=(args.beta1, args.beta2), + eps=args.adam_eps, + fused=True, + ) + replicated_params.append(base_model.lm_head.weight) + optimizer_lora = None + if lora_params: + lora_lr = args.scalar_lr * 0.1 + optimizer_lora = torch.optim.AdamW( + [{"params": lora_params, "lr": lora_lr, "base_lr": lora_lr}], + betas=(args.beta1, args.beta2), + eps=args.adam_eps, + weight_decay=args.adam_wd, + fused=True, + ) + replicated_params.extend(lora_params) + log0(f"lora_optimizer: lr={lora_lr} (scalar_lr * 0.1)") + optimizers: list[torch.optim.Optimizer] = [optimizer_tok, optimizer_muon, optimizer_scalar] + if optimizer_head is not None: + optimizers.append(optimizer_head) + if optimizer_lora is not None: + optimizers.append(optimizer_lora) + n_params = sum(p.numel() for p in 
base_model.parameters()) + mtp_params = sum(p.numel() for p in base_model.mtp_heads.parameters()) + log0(f"model_params:{n_params}") + log0(f"mtp_num_heads:{args.mtp_num_heads} mtp_loss_weight:{args.mtp_loss_weight} mtp_params:{mtp_params}") + xsa_layers = [i for i, b in enumerate(base_model.blocks) if b.attn.use_xsa] + log0(f"XSA:last_{args.xsa_last_n} active_layers:{xsa_layers}") + log0(f"world_size:{world_size} grad_accum_steps:{grad_accum_steps}") + log0("sdp_backends:cudnn=False flash=True mem_efficient=False math=False") + log0(f"attention_mode:gqa num_heads:{args.num_heads} num_kv_heads:{args.num_kv_heads}") + log0( + f"tie_embeddings:{args.tie_embeddings} embed_lr:{token_lr} " + f"head_lr:{args.head_lr if base_model.lm_head is not None else 0.0} " + f"matrix_lr:{args.matrix_lr} scalar_lr:{args.scalar_lr}" + ) + log0( + f"train_batch_tokens:{args.train_batch_tokens} train_seq_len:{args.train_seq_len} " + f"iterations:{args.iterations} warmup_steps:{args.warmup_steps} " + f"max_wallclock_seconds:{args.max_wallclock_seconds:.3f}" + ) + log0(f"seed:{args.seed}") + use_wandb = _wandb is not None and rank == 0 and os.environ.get("WANDB_DISABLED", "0") != "1" + if use_wandb: + _wandb.init( + project=os.environ.get("WANDB_PROJECT", "parameter-golf"), + name=os.environ.get("WANDB_NAME", f"recurrent_p{args.num_passes}_s{args.seed}"), + config={ + "num_layers": args.num_layers, "model_dim": args.model_dim, + "num_passes": args.num_passes, "core_start": args.core_start, + "core_end": args.core_end, "seed": args.seed, + "train_batch_tokens": args.train_batch_tokens, + "train_seq_len": args.train_seq_len, "iterations": args.iterations, + "matrix_lr": args.matrix_lr, "scalar_lr": args.scalar_lr, + "feedback_mode": cli.feedback_mode, "feedback_rank": cli.feedback_rank, + "jacobian_proxy_weight": cli.jacobian_proxy_weight, + "residual_scale_init": cli.residual_scale_init, + "interpass_rmsnorm": not cli.no_interpass_rmsnorm, + "n_params": sum(p.numel() for p in base_model.parameters()), + }, + reinit=True, + ) + log0("wandb:initialized") + train_loader = DistributedTokenLoader(args.train_files, rank, world_size, device) + def zero_grad_all() -> None: + for opt in optimizers: + opt.zero_grad(set_to_none=True) + max_wallclock_ms = 1000.0 * args.max_wallclock_seconds if args.max_wallclock_seconds > 0 else None + def lr_mul(step: int, elapsed_ms: float) -> float: + if args.warmdown_iters <= 0: + return 1.0 + if max_wallclock_ms is None: + warmdown_start = max(args.iterations - args.warmdown_iters, 0) + return max((args.iterations - step) / max(args.warmdown_iters, 1), 0.0) if warmdown_start <= step < args.iterations else 1.0 + step_ms = elapsed_ms / max(step, 1) + warmdown_ms = args.warmdown_iters * step_ms + remaining_ms = max(max_wallclock_ms - elapsed_ms, 0.0) + return remaining_ms / max(warmdown_ms, 1e-9) if remaining_ms <= warmdown_ms else 1.0 + if args.warmup_steps > 0: + initial_model_state = {name: tensor.detach().cpu().clone() for name, tensor in base_model.state_dict().items()} + initial_optimizer_states = [copy.deepcopy(opt.state_dict()) for opt in optimizers] + model.train() + for warmup_step in range(args.warmup_steps): + zero_grad_all() + for micro_step in range(grad_accum_steps): + x, y = train_loader.next_batch(args.train_batch_tokens, args.train_seq_len, grad_accum_steps) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer) + (warmup_loss * grad_scale).backward() + if distributed: + for 
p in base_model.parameters(): + if p.grad is not None: + dist.all_reduce(p.grad, op=dist.ReduceOp.AVG) + if feedback is not None: + for p in feedback.parameters(): + if p.grad is not None: + dist.all_reduce(p.grad, op=dist.ReduceOp.AVG) + for opt in optimizers: + opt.step() + zero_grad_all() + if args.warmup_steps <= 20 or (warmup_step + 1) % 10 == 0 or warmup_step + 1 == args.warmup_steps: + log0(f"warmup_step:{warmup_step + 1}/{args.warmup_steps}") + base_model.load_state_dict(initial_model_state, strict=True) + for opt, state in zip(optimizers, initial_optimizer_states, strict=True): + opt.load_state_dict(state) + zero_grad_all() + train_loader = DistributedTokenLoader(args.train_files, rank, world_size, device) + swa_state: dict[str, Tensor] | None = None + swa_count = 0 + from collections import deque + lawa_queue: deque[dict[str, Tensor]] = deque(maxlen=args.lawa_k) + _all_state = dict(base_model.state_dict()) + if feedback is not None: + for k, v in feedback.state_dict().items(): + _all_state[f"_fb.{k}"] = v + ema_state = {name: t.detach().float().clone() for name, t in _all_state.items()} + ema_decay = 0.997 + training_time_ms = 0.0 + stop_after_step: int | None = None + torch.cuda.synchronize() + t0 = time.perf_counter() + step = 0 + while True: + last_step = step == args.iterations or (stop_after_step is not None and step >= stop_after_step) + should_validate = last_step or (args.val_loss_every > 0 and step % args.val_loss_every == 0) + if should_validate: + torch.cuda.synchronize() + training_time_ms += 1000.0 * (time.perf_counter() - t0) + val_loss, val_bpb = eval_val( + args, + model, + rank, + world_size, + device, + grad_accum_steps, + val_tokens, + base_bytes_lut, + has_leading_space_lut, + is_boundary_token_lut, + ) + diag_str = "" + if stabilizer is not None and stabilizer.diagnostics.h_norms: + hn = [f"{v:.1f}" for v in stabilizer.diagnostics.h_norms[-args.num_passes*base_model.num_core:]] + gr = [f"{v:.3f}" for v in stabilizer.diagnostics.growth_ratios[-args.num_passes*base_model.num_core:]] + diag_str = f" h_norms={hn} growth={gr}" + stabilizer.reset() + log0( + f"step:{step}/{args.iterations} val_loss:{val_loss:.4f} val_bpb:{val_bpb:.4f} " + f"train_time:{training_time_ms:.0f}ms step_avg:{training_time_ms / max(step, 1):.2f}ms" + f"{diag_str}" + ) + if use_wandb: + wb_data = {"val_loss": val_loss, "val_bpb": val_bpb} + if stabilizer is not None and stabilizer.diagnostics.growth_ratios: + wb_data["max_growth"] = max(stabilizer.diagnostics.growth_ratios) + wb_data["mean_growth"] = sum(stabilizer.diagnostics.growth_ratios) / len(stabilizer.diagnostics.growth_ratios) + _wandb.log(wb_data, step=step) + torch.cuda.synchronize() + t0 = time.perf_counter() + if last_step: + if stop_after_step is not None and step < args.iterations: + log0( + f"stopping_early: wallclock_cap train_time:{training_time_ms:.0f}ms " + f"step:{step}/{args.iterations}" + ) + break + elapsed_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0) + scale = lr_mul(step, elapsed_ms) + if args.late_qat_threshold > 0 and step > 100 and scale < args.late_qat_threshold and not CastedLinear._qat_enabled: + CastedLinear._qat_enabled = True + base_model.core_quant_enabled = True + log0(f"late_qat:enabled step:{step} scale:{scale:.4f} core_quant:on") + if base_model.lora_rank > 0 and cli.lora_warmup_steps > 0: + base_model._lora_step_mul.fill_(min(step / cli.lora_warmup_steps, 1.0)) + zero_grad_all() + train_loss = torch.zeros((), device=device) + for micro_step in range(grad_accum_steps): + x, y = 
+    while True:
+        last_step = step == args.iterations or (stop_after_step is not None and step >= stop_after_step)
+        should_validate = last_step or (args.val_loss_every > 0 and step % args.val_loss_every == 0)
+        if should_validate:
+            torch.cuda.synchronize()
+            training_time_ms += 1000.0 * (time.perf_counter() - t0)
+            val_loss, val_bpb = eval_val(
+                args,
+                model,
+                rank,
+                world_size,
+                device,
+                grad_accum_steps,
+                val_tokens,
+                base_bytes_lut,
+                has_leading_space_lut,
+                is_boundary_token_lut,
+            )
+            diag_str = ""
+            if stabilizer is not None and stabilizer.diagnostics.h_norms:
+                hn = [f"{v:.1f}" for v in stabilizer.diagnostics.h_norms[-args.num_passes * base_model.num_core:]]
+                gr = [f"{v:.3f}" for v in stabilizer.diagnostics.growth_ratios[-args.num_passes * base_model.num_core:]]
+                diag_str = f" h_norms={hn} growth={gr}"
+                stabilizer.reset()
+            log0(
+                f"step:{step}/{args.iterations} val_loss:{val_loss:.4f} val_bpb:{val_bpb:.4f} "
+                f"train_time:{training_time_ms:.0f}ms step_avg:{training_time_ms / max(step, 1):.2f}ms"
+                f"{diag_str}"
+            )
+            if use_wandb:
+                wb_data = {"val_loss": val_loss, "val_bpb": val_bpb}
+                if stabilizer is not None and stabilizer.diagnostics.growth_ratios:
+                    wb_data["max_growth"] = max(stabilizer.diagnostics.growth_ratios)
+                    wb_data["mean_growth"] = sum(stabilizer.diagnostics.growth_ratios) / len(stabilizer.diagnostics.growth_ratios)
+                _wandb.log(wb_data, step=step)
+            torch.cuda.synchronize()
+            t0 = time.perf_counter()
+        if last_step:
+            if stop_after_step is not None and step < args.iterations:
+                log0(
+                    f"stopping_early: wallclock_cap train_time:{training_time_ms:.0f}ms "
+                    f"step:{step}/{args.iterations}"
+                )
+            break
+        elapsed_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0)
+        scale = lr_mul(step, elapsed_ms)
+        if args.late_qat_threshold > 0 and step > 100 and scale < args.late_qat_threshold and not CastedLinear._qat_enabled:
+            CastedLinear._qat_enabled = True
+            base_model.core_quant_enabled = True
+            log0(f"late_qat:enabled step:{step} scale:{scale:.4f} core_quant:on")
+        if base_model.lora_rank > 0 and cli.lora_warmup_steps > 0:
+            base_model._lora_step_mul.fill_(min(step / cli.lora_warmup_steps, 1.0))
+        zero_grad_all()
+        train_loss = torch.zeros((), device=device)
+        for micro_step in range(grad_accum_steps):
+            x, y = train_loader.next_batch(args.train_batch_tokens, args.train_seq_len, grad_accum_steps)
+            with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True):
+                loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer)
+            train_loss += loss.detach()
+            (loss * grad_scale).backward()
+        train_loss /= grad_accum_steps
+        frac = min(step / args.muon_momentum_warmup_steps, 1.0) if args.muon_momentum_warmup_steps > 0 else 1.0
+        muon_momentum = (1 - frac) * args.muon_momentum_warmup_start + frac * args.muon_momentum
+        for group in optimizer_muon.param_groups:
+            group["momentum"] = muon_momentum
+        for opt in optimizers:
+            for group in opt.param_groups:
+                group["lr"] = group["base_lr"] * scale
+        grad_norm = None
+        if args.grad_clip_norm > 0:
+            grad_norm = torch.nn.utils.clip_grad_norm_(base_model.parameters(), args.grad_clip_norm)
+        # === 3-phase overlapped optimizer step ===
+        # Phase 1: Launch async reduce-scatter for banks (biggest first)
+        optimizer_muon.launch_reduce_scatters()
+        # Phase 2: All-reduce non-bank grads + step Adam (while bank RS is in-flight)
+        if distributed:
+            for p in replicated_params:
+                if p.grad is not None:
+                    dist.all_reduce(p.grad, op=dist.ReduceOp.AVG)
+        optimizer_tok.step()
+        optimizer_scalar.step()
+        if optimizer_head is not None:
+            optimizer_head.step()
+        if optimizer_lora is not None:
+            optimizer_lora.step()
+        # Phase 3: Wait for RS, local NS5, all-gather (banks processed last)
+        optimizer_muon.step()
+        zero_grad_all()
+        # EMA update
+        with torch.no_grad():
+            _cur = dict(base_model.state_dict())
+            if feedback is not None:
+                for k, v in feedback.state_dict().items():
+                    _cur[f"_fb.{k}"] = v
+            for name, t in _cur.items():
+                ema_state[name].mul_(ema_decay).add_(t.detach().float(), alpha=1.0 - ema_decay)
+        step += 1
+        approx_training_time_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0)
+        if args.swa_enabled and scale < 0.2 and step % args.swa_every == 0:
+            if swa_state is None:
+                swa_state = {name: t.detach().cpu().clone() for name, t in base_model.state_dict().items()}
+                swa_count = 1
+                log0(f"swa:start step:{step}")
+            else:
+                for name, t in base_model.state_dict().items():
+                    swa_state[name] += t.detach().cpu()
+                swa_count += 1
+        if args.lawa_enabled and step % args.lawa_freq == 0:
+            lawa_queue.append({name: t.detach().cpu().clone() for name, t in base_model.state_dict().items()})
+        should_log_train = (
+            args.train_log_every > 0
+            and (step <= 10 or step % args.train_log_every == 0 or stop_after_step is not None)
+        )
+        if should_log_train:
+            tl = train_loss.item()
+            gn_str = f" grad_norm:{grad_norm:.4f}" if grad_norm is not None else ""
+            log0(
+                f"step:{step}/{args.iterations} train_loss:{tl:.4f}{gn_str} "
+                f"train_time:{approx_training_time_ms:.0f}ms step_avg:{approx_training_time_ms / step:.2f}ms"
+            )
+            if use_wandb:
+                wlog = {"train_loss": tl, "step_avg_ms": approx_training_time_ms / step, "lr_scale": scale}
+                if grad_norm is not None:
+                    wlog["grad_norm"] = float(grad_norm)
+                _wandb.log(wlog, step=step)
+        reached_cap = max_wallclock_ms is not None and approx_training_time_ms >= max_wallclock_ms
+        if distributed and max_wallclock_ms is not None:
+            reached_cap_tensor = torch.tensor(int(reached_cap), device=device)
+            dist.all_reduce(reached_cap_tensor, op=dist.ReduceOp.MAX)
+            reached_cap = bool(reached_cap_tensor.item())
+        if stop_after_step is None and reached_cap:
+            stop_after_step = step
+    log0(
+        f"peak memory allocated: {torch.cuda.max_memory_allocated() // 1024 // 1024} MiB "
+        f"reserved: {torch.cuda.max_memory_reserved() // 1024 // 1024} MiB"
+    )
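The three-phase step hides bank communication behind the small-parameter updates. `launch_reduce_scatters` and the bank-aware `step` are methods of this record's Muon optimizer; the sketch below shows only the generic overlap pattern, using plain async all-reduce for illustration rather than those internals:

```python
# Generic overlap pattern: start the big collective asynchronously, do
# independent work (small all-reduces + Adam steps), then wait before the
# update that consumes the reduced gradients.
import torch.distributed as dist

def overlapped_step(bank_grads, other_grads, adam_step, muon_step):
    # Phase 1: launch async averaging of the large bank gradients.
    handles = [dist.all_reduce(g, op=dist.ReduceOp.AVG, async_op=True)
               for g in bank_grads]
    # Phase 2: reduce the small replicated grads and step Adam while the
    # bank communication is still in flight.
    for g in other_grads:
        dist.all_reduce(g, op=dist.ReduceOp.AVG)
    adam_step()
    # Phase 3: wait for the bank reduction, then run the bank update.
    for h in handles:
        h.wait()
    muon_step()
```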
1024} MiB" + ) + # Apply weight averaging + if args.lawa_enabled and len(lawa_queue) > 1: + log0(f"lawa:applying LAWA averaging k={len(lawa_queue)}") + current_state = base_model.state_dict() + avg_state = {name: torch.zeros(t.shape, dtype=torch.float32, device='cpu') for name, t in current_state.items()} + for snap in lawa_queue: + for name in avg_state: + avg_state[name] += snap[name].float() + for name in avg_state: + avg_state[name] /= len(lawa_queue) + avg_state[name] = avg_state[name].to(dtype=current_state[name].dtype) + base_model.load_state_dict(avg_state, strict=True) + else: + log0("ema:applying EMA weights") + current_state = base_model.state_dict() + model_ema = {k: v for k, v in ema_state.items() if not k.startswith("_fb.")} + avg_state = {name: model_ema[name].to(dtype=current_state[name].dtype) for name in current_state} + base_model.load_state_dict(avg_state, strict=True) + if feedback is not None: + fb_ema = {k.removeprefix("_fb."): v for k, v in ema_state.items() if k.startswith("_fb.")} + fb_state = feedback.state_dict() + fb_avg = {k: fb_ema[k].to(dtype=fb_state[k].dtype) for k in fb_state} + feedback.load_state_dict(fb_avg, strict=True) + torch.cuda.synchronize() + t_diag = time.perf_counter() + diag_val_loss, diag_val_bpb = eval_val( + args, compiled_model, rank, world_size, device, grad_accum_steps, + val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut, + ) + torch.cuda.synchronize() + log0( + f"DIAGNOSTIC post_ema val_loss:{diag_val_loss:.4f} val_bpb:{diag_val_bpb:.4f} " + f"eval_time:{1000.0 * (time.perf_counter() - t_diag):.0f}ms" + ) + full_state_dict = base_model.state_dict() + export_sd = {k: v for k, v in full_state_dict.items() if "mtp_heads" not in k} + excluded_mtp = sum(int(t.numel()) for k, t in full_state_dict.items() if "mtp_heads" in k) + if excluded_mtp > 0: + log0(f"export_excluding_mtp_params:{excluded_mtp}") + if master_process: + torch.save(export_sd, "final_model.pt") + model_bytes = os.path.getsize("final_model.pt") + code_bytes = len(code.encode("utf-8")) + log0(f"Serialized model: {model_bytes} bytes") + log0(f"Code size: {code_bytes} bytes") + # Unbank 3D tensors into individual 2D tensors for quantization + sd_cpu = {k: v.detach().cpu() for k, v in export_sd.items()} + unbanked_sd = _unbank_state_dict(sd_cpu, args.num_layers) + quant_result, quant_meta = mixed_quantize_int6(unbanked_sd, {"mlp", "attn"}) + quant_buf = io.BytesIO() + torch.save({"w": quant_result, "m": quant_meta}, quant_buf) + quant_raw = quant_buf.getvalue() + quant_blob = lzma.compress(quant_raw, preset=6) + if master_process: + with open("final_model.int6.ptz", "wb") as f: + f.write(quant_blob) + quant_file_bytes = len(quant_blob) + code_bytes = len(code.encode("utf-8")) + log0(f"Serialized model int6+lzma: {quant_file_bytes} bytes") + log0(f"Total submission size int6+lzma: {quant_file_bytes + code_bytes} bytes") + if distributed: + dist.barrier() + with open("final_model.int6.ptz", "rb") as f: + quant_blob_disk = f.read() + quant_state = torch.load( + io.BytesIO(lzma.decompress(quant_blob_disk)), + map_location="cpu", + ) + deq_unbanked = dequantize_mixed_int6(quant_state["w"], quant_state["m"], unbanked_sd) + # Re-bank the dequantized tensors + deq_state = _rebank_state_dict(deq_unbanked, args.num_layers, sd_cpu) + eval_model = GPT( + vocab_size=args.vocab_size, num_layers=args.num_layers, model_dim=args.model_dim, + num_heads=args.num_heads, num_kv_heads=args.num_kv_heads, mlp_mult=args.mlp_mult, + tie_embeddings=args.tie_embeddings, 
+    eval_model = GPT(
+        vocab_size=args.vocab_size, num_layers=args.num_layers, model_dim=args.model_dim,
+        num_heads=args.num_heads, num_kv_heads=args.num_kv_heads, mlp_mult=args.mlp_mult,
+        tie_embeddings=args.tie_embeddings, tied_embed_init_std=args.tied_embed_init_std,
+        logit_softcap=args.logit_softcap, rope_base=args.rope_base, qk_gain_init=args.qk_gain_init,
+        mtp_num_heads=0, mtp_loss_weight=0.0,
+        bigram_vocab_size=args.bigram_vocab_size, bigram_dim=args.bigram_dim,
+        xsa_last_n=args.xsa_last_n,
+        rope_dims=args.rope_dims, ln_scale=args.ln_scale, dtg=args.dtg_enabled,
+        ve_enabled=args.ve_enabled, ve_dim=args.ve_dim, ve_layers=args.ve_layers,
+        gated_attention=args.gated_attention, value_residual=args.value_residual,
+        core_start=args.core_start, core_end=args.core_end,
+        num_passes=args.num_passes,
+        interpass_rmsnorm=not cli.no_interpass_rmsnorm,
+        lora_rank=cli.lora_rank or args.lora_rank,
+    ).to(device).bfloat16()
+    if residual_scale is not None:
+        eval_rs = ResidualScale(args.num_passes, cli.residual_scale_init).to(device)
+        eval_model.residual_scale = eval_rs
+    eval_model.qo_bank.data = eval_model.qo_bank.data.float()
+    eval_model.kv_bank.data = eval_model.kv_bank.data.float()
+    eval_model.mlp_up_bank.data = eval_model.mlp_up_bank.data.float()
+    eval_model.mlp_down_bank.data = eval_model.mlp_down_bank.data.float()
+    for m in eval_model.modules():
+        if isinstance(m, CastedLinear):
+            m.float()
+    restore_low_dim_params_to_fp32(eval_model)
+    eval_model.load_state_dict(deq_state, strict=True)
+    compiled_eval = torch.compile(eval_model, dynamic=False, fullgraph=True)
+    torch.cuda.synchronize()
+    t_qeval = time.perf_counter()
+    q_val_loss, q_val_bpb = eval_val(
+        args, compiled_eval, rank, world_size, device, grad_accum_steps,
+        val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut,
+        eval_seq_len=effective_eval_seq_len,
+    )
+    torch.cuda.synchronize()
+    log0(
+        f"final_int6_roundtrip val_loss:{q_val_loss:.4f} val_bpb:{q_val_bpb:.4f} "
+        f"eval_time:{1000.0 * (time.perf_counter() - t_qeval):.0f}ms"
+    )
+    log0(f"final_int6_roundtrip_exact val_loss:{q_val_loss:.8f} val_bpb:{q_val_bpb:.8f}")
+    sw_seq_len = effective_eval_seq_len
+    if args.eval_stride > 0 and args.eval_stride < sw_seq_len:
+        torch.cuda.synchronize()
+        t_slide = time.perf_counter()
+        sw_val_loss, sw_val_bpb = eval_val_sliding(
+            args, eval_model, rank, world_size, device,
+            val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut,
+            stride=args.eval_stride,
+            eval_seq_len=sw_seq_len,
+        )
+        torch.cuda.synchronize()
+        log0(
+            f"final_int6_sliding_window val_loss:{sw_val_loss:.4f} val_bpb:{sw_val_bpb:.4f} "
+            f"stride:{args.eval_stride} eval_time:{1000.0 * (time.perf_counter() - t_slide):.0f}ms"
+        )
+        log0(f"final_int6_sliding_window_exact val_loss:{sw_val_loss:.8f} val_bpb:{sw_val_bpb:.8f}")
+        # The int6 sliding-window numbers are re-emitted under the legacy
+        # "final_int8_zlib_roundtrip_exact" tag, presumably so existing log
+        # parsers keep working; the values are identical by construction.
+        log0(f"final_int8_zlib_roundtrip_exact val_loss:{sw_val_loss:.8f} val_bpb:{sw_val_bpb:.8f}")
+    if args.eval_stride > 0 and args.eval_stride != 64 and 64 < sw_seq_len:
+        torch.cuda.synchronize()
+        t_slide64 = time.perf_counter()
+        sw64_val_loss, sw64_val_bpb = eval_val_sliding(
+            args, eval_model, rank, world_size, device,
+            val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut,
+            stride=64,
+            eval_seq_len=sw_seq_len,
+        )
+        torch.cuda.synchronize()
+        log0(
+            f"final_int6_sliding_window_s64 val_loss:{sw64_val_loss:.4f} val_bpb:{sw64_val_bpb:.4f} "
+            f"stride:64 eval_time:{1000.0 * (time.perf_counter() - t_slide64):.0f}ms"
+        )
+        log0(f"final_int6_sliding_window_s64_exact val_loss:{sw64_val_loss:.8f} val_bpb:{sw64_val_bpb:.8f}")
+        log0(f"final_int8_zlib_roundtrip_exact val_loss:{sw64_val_loss:.8f} val_bpb:{sw64_val_bpb:.8f}")
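`eval_val_sliding` is defined in train_utils_recurrent.py; the sketch below (with a hypothetical `logprob_fn` standing in for a model forward) shows the strided-window accounting it implies: every window re-reads the full context, but only positions not covered by an earlier window contribute fresh loss terms.

```python
import torch

def sliding_window_nll(logprob_fn, tokens: torch.Tensor, seq_len: int, stride: int) -> float:
    """tokens: 1-D LongTensor; logprob_fn(window) -> per-position NLL of window[1:]."""
    nll_sum, n_scored, prev_end = 0.0, 0, 0
    for begin in range(0, tokens.numel() - 1, stride):
        end = min(begin + seq_len, tokens.numel())
        nll = logprob_fn(tokens[begin:end])      # shape: (end - begin - 1,)
        fresh = end - max(prev_end, begin + 1)   # positions not scored before
        nll_sum += nll[-fresh:].sum().item()
        n_scored += fresh
        prev_end = end
        if end == tokens.numel():
            break
    return nll_sum / n_scored  # mean NLL per token; rescale by bytes/ln(2) for BPB
```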
+    # Legal score-first TTT (PR #461 recipe)
+    if args.ttt_enabled:
+        torch.cuda.synchronize()
+        t_ttt = time.perf_counter()
+        ttt_loss, ttt_bpb = eval_val_sliding_ttt(
+            args, eval_model, rank, world_size, device,
+            val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut,
+            stride=args.eval_stride, log0=log0,
+        )
+        torch.cuda.synchronize()
+        log0(f"legal_ttt val_loss:{ttt_loss:.4f} val_bpb:{ttt_bpb:.4f} "
+             f"eval_time:{1000.0 * (time.perf_counter() - t_ttt):.0f}ms")
+        log0(f"legal_ttt_exact val_loss:{ttt_loss:.8f} val_bpb:{ttt_bpb:.8f}")
+    if use_wandb:
+        _wandb.finish()
+    if distributed:
+        dist.destroy_process_group()
+if __name__ == "__main__":
+    main()
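For orientation, the legality constraint inside `eval_val_sliding_ttt` reduces to an ordering rule: every chunk is scored with the current weights before the model is allowed to adapt on it. A minimal sketch of that ordering (placeholder lr/momentum, not the record's tuned values; `model(x, y)` assumed to return the mean NLL):

```python
import torch

def score_first_ttt(model, chunks, lr=0.02, momentum=0.9, epochs=3):
    # Score-first TTT: no evaluated token has ever been trained on, because
    # scoring strictly precedes adaptation within each chunk.
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    total_nll, total_tokens = 0.0, 0
    for x, y in chunks:                 # e.g. 32K-token spans, in document order
        with torch.no_grad():           # 1) score with the current weights
            total_nll += model(x, y).item() * y.numel()
            total_tokens += y.numel()
        for _ in range(epochs):         # 2) only then adapt on that same chunk
            opt.zero_grad(set_to_none=True)
            model(x, y).backward()
            opt.step()
    return total_nll / total_tokens
```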
[dozens of rename-only diff stanzas (similarity index 100%) omitted: committed wandb run logs (latest-run, config.yaml, wandb-metadata.json, wandb-summary.json) moved from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/ to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/]
records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/config.yaml rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_223030-z24l5l1s/files/config.yaml diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_223030-z24l5l1s/files/wandb-metadata.json similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/wandb-metadata.json rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_223030-z24l5l1s/files/wandb-metadata.json diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_223030-z24l5l1s/files/wandb-summary.json similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_223030-z24l5l1s/files/wandb-summary.json rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_223030-z24l5l1s/files/wandb-summary.json diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_224439-a7ap9e0a/files/config.yaml similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/config.yaml rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_224439-a7ap9e0a/files/config.yaml diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_224439-a7ap9e0a/files/wandb-metadata.json similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/wandb-metadata.json rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_224439-a7ap9e0a/files/wandb-metadata.json diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_224439-a7ap9e0a/files/wandb-summary.json similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-a7ap9e0a/files/wandb-summary.json rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_224439-a7ap9e0a/files/wandb-summary.json diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_224439-i5qgsj3x/files/config.yaml similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/config.yaml rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_224439-i5qgsj3x/files/config.yaml diff --git 
a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_224439-i5qgsj3x/files/wandb-metadata.json similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/wandb-metadata.json rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_224439-i5qgsj3x/files/wandb-metadata.json diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_224439-i5qgsj3x/files/wandb-summary.json similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224439-i5qgsj3x/files/wandb-summary.json rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_224439-i5qgsj3x/files/wandb-summary.json diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_224440-chtxlxg3/files/config.yaml similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/config.yaml rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_224440-chtxlxg3/files/config.yaml diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_224440-chtxlxg3/files/wandb-metadata.json similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/wandb-metadata.json rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_224440-chtxlxg3/files/wandb-metadata.json diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_224440-chtxlxg3/files/wandb-summary.json similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_224440-chtxlxg3/files/wandb-summary.json rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_224440-chtxlxg3/files/wandb-summary.json diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_230156-qltwebo4/files/config.yaml similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/config.yaml rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_230156-qltwebo4/files/config.yaml diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_230156-qltwebo4/files/wandb-metadata.json similarity index 100% rename from 
records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/wandb-metadata.json rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_230156-qltwebo4/files/wandb-metadata.json diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_230156-qltwebo4/files/wandb-summary.json similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4/files/wandb-summary.json rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_230156-qltwebo4/files/wandb-summary.json diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_230158-43bipylb/files/config.yaml similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb/files/config.yaml rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_230158-43bipylb/files/config.yaml diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_230158-43bipylb/files/wandb-metadata.json similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb/files/wandb-metadata.json rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_230158-43bipylb/files/wandb-metadata.json diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_230158-43bipylb/files/wandb-summary.json similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb/files/wandb-summary.json rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_230158-43bipylb/files/wandb-summary.json diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_230158-fsi4c82a/files/config.yaml similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a/files/config.yaml rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_230158-fsi4c82a/files/config.yaml diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_230158-fsi4c82a/files/wandb-metadata.json similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a/files/wandb-metadata.json rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_230158-fsi4c82a/files/wandb-metadata.json diff --git 
a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_230158-fsi4c82a/files/wandb-summary.json similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a/files/wandb-summary.json rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_230158-fsi4c82a/files/wandb-summary.json diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_230158-jkh80zal/files/config.yaml similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/config.yaml rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_230158-jkh80zal/files/config.yaml diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_230158-jkh80zal/files/wandb-metadata.json similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/wandb-metadata.json rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_230158-jkh80zal/files/wandb-metadata.json diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_230158-jkh80zal/files/wandb-summary.json similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal/files/wandb-summary.json rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_230158-jkh80zal/files/wandb-summary.json diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_230158-zcabiozu/files/config.yaml similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/config.yaml rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_230158-zcabiozu/files/config.yaml diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_230158-zcabiozu/files/wandb-metadata.json similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/wandb-metadata.json rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_230158-zcabiozu/files/wandb-metadata.json diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_230158-zcabiozu/files/wandb-summary.json similarity index 100% rename from 
records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu/files/wandb-summary.json rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260326_230158-zcabiozu/files/wandb-summary.json diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260327_080959-p8sqkbqa/files/config.yaml similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/config.yaml rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260327_080959-p8sqkbqa/files/config.yaml diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260327_080959-p8sqkbqa/files/wandb-metadata.json similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/wandb-metadata.json rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260327_080959-p8sqkbqa/files/wandb-metadata.json diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260327_080959-p8sqkbqa/files/wandb-summary.json similarity index 100% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260327_080959-p8sqkbqa/files/wandb-summary.json rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback_BACKUP/wandb/run-20260327_080959-p8sqkbqa/files/wandb-summary.json From efd2b591906cdc556b015233394ff3cff1bbd298 Mon Sep 17 00:00:00 2001 From: nesta Date: Fri, 27 Mar 2026 14:52:24 +0000 Subject: [PATCH 08/23] Add *.pt, *.ptz, *.wandb to .gitignore Prevent large model checkpoints and wandb binaries from being tracked. Made-with: Cursor --- .gitignore | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/.gitignore b/.gitignore index 8272195c85..a6f2bcabc9 100644 --- a/.gitignore +++ b/.gitignore @@ -10,4 +10,7 @@ data/docs_selected.jsonl .venv logs/ *.log -*.txt \ No newline at end of file +*.txt +*.pt +*.ptz +*.wandb \ No newline at end of file From 186ef4d885267adbf3bdd25a048f34180b89c9fc Mon Sep 17 00:00:00 2001 From: nesta Date: Sun, 29 Mar 2026 19:28:45 +0000 Subject: [PATCH 09/23] it works.... 
but memory is slightly too high :( --- .../run_submission.sh | 12 ++- .../train_gpt_recurrent.py | 102 ++++++++---------- .../wandb/latest-run | 1 + .../files/wandb-metadata.json | 51 +++++++++ 4 files changed, 106 insertions(+), 60 deletions(-) create mode 120000 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/latest-run create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260329_174929-189wwan5/files/wandb-metadata.json diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_submission.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_submission.sh index ebc722608d..9350d4b4a3 100755 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_submission.sh +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_submission.sh @@ -15,7 +15,7 @@ export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True export DATA_PATH="../../../data/datasets/fineweb10B_sp1024" export TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model" -# --- Architecture (matches SOTA PR #549 / PR #414 stack) --- +# --- Architecture (11 layers, matching baseline capacity) --- export NUM_LAYERS=11 export MODEL_DIM=512 export NUM_HEADS=8 @@ -28,13 +28,13 @@ export VE_ENABLED=1 export VE_DIM=128 export VE_LAYERS="9,10" -# --- Training schedule (matches SOTA 8xH100 settings) --- -export ITERATIONS=9000 +# --- Training schedule (tuned for 11-layer 2-pass model @ ~112ms/step on 8xH100) --- +export ITERATIONS=5200 export MAX_WALLCLOCK_SECONDS=600 export VAL_LOSS_EVERY=500 export TRAIN_LOG_EVERY=50 export WARMUP_STEPS=20 -export WARMDOWN_ITERS=3500 +export WARMDOWN_ITERS=2000 export TRAIN_BATCH_TOKENS=786432 export TRAIN_SEQ_LEN=2048 export EVAL_SEQ_LEN=2048 @@ -69,9 +69,11 @@ export TTT_GRAD_CLIP=1.0 # --- Recurrence (our contribution) --- export CORE_START=4 export CORE_END=7 -export NUM_PASSES=2 +export NUM_PASSES=1 export EVAL_PASSES=4 export CORE_QUANT_ENABLED=0 +# Progressive: 1-pass until step 4500, then ramp 2->3->4 +export PASSES_SCHEDULE="0:1,4500:2,5500:3,6000:4" # --- W&B --- export WANDB_PROJECT="parameter-golf" diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py index 88aeda6f06..2bd13d920e 100644 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py @@ -117,6 +117,8 @@ class Hyperparameters: core_quant_bits = int(os.environ.get("CORE_QUANT_BITS", 6)) core_quant_enabled = bool(int(os.environ.get("CORE_QUANT_ENABLED", "0"))) eval_passes = int(os.environ.get("EVAL_PASSES", 0)) + # Progressive passes schedule: comma-separated "step:passes" pairs, e.g. "0:1,4500:2,5500:3,6000:4" + passes_schedule_str = os.environ.get("PASSES_SCHEDULE", "") # --- Batched Newton-Schulz orthogonalization --- @@ -1311,6 +1313,14 @@ def _classify_param(name: str) -> str: if ".attn." in name or (".proj." in name and ".mlp."
not in name): return "attn" return "other" + +def _extract_layer_idx(name: str) -> int | None: + if not name.startswith("blocks."): + return None + parts = name.split(".") + if len(parts) >= 2 and parts[1].isdigit(): + return int(parts[1]) + return None def quantize_int6_per_row(t: Tensor, clip_range: int = 31) -> tuple[Tensor, Tensor]: t32 = t.float() if t32.ndim == 2: @@ -1399,7 +1409,8 @@ def _rebank_state_dict(sd: dict[str, Tensor], num_layers: int, template_sd: dict out[name] = tensor return out -def mixed_quantize_int6(state_dict: dict[str, Tensor], int6_cats: set[str]): +def mixed_quantize_int6(state_dict: dict[str, Tensor], int6_cats: set[str], + core_start: int = -1, core_end: int = -1): num_layers_total = max( (int(k.split(".")[1]) for k in state_dict if k.startswith("blocks.")), default=0, @@ -1590,12 +1601,22 @@ def log0(msg: str, console: bool = True) -> None: stabilizer = None residual_scale = None extra_scalar_params: list[nn.Parameter] = [] - if cli.feedback_mode != "none" and args.num_passes > 1: + # Parse progressive passes schedule + passes_schedule: list[tuple[int, int]] = [] + if args.passes_schedule_str: + for entry in args.passes_schedule_str.split(","): + s, p = entry.strip().split(":") + passes_schedule.append((int(s), int(p))) + passes_schedule.sort(key=lambda x: x[0]) + max_passes = max((p for _, p in passes_schedule), default=args.num_passes) + max_passes = max(max_passes, args.eval_passes if args.eval_passes > 0 else args.num_passes) + needs_recurrence = max_passes > 1 + if cli.feedback_mode != "none" and needs_recurrence: feedback = ErrorFeedbackModule( dim=args.model_dim, rank=cli.feedback_rank, feedback_mode=cli.feedback_mode, per_pass=cli.per_pass_feedback, - num_passes=args.num_passes, + num_passes=max_passes, affine_junction=cli.affine_junction, ).to(device).bfloat16() restore_low_dim_params_to_fp32(feedback) @@ -1604,17 +1625,18 @@ def feedback_fn(h, pass_idx): return feedback(h, pass_idx) log0(f"feedback: mode={cli.feedback_mode} rank={cli.feedback_rank} " f"per_pass={cli.per_pass_feedback} params={sum(p.numel() for p in feedback.parameters())}") - if args.num_passes > 1: + if needs_recurrence: stabilizer = RecurrentStabilizer( clip_hidden=cli.clip_hidden, clip_value=cli.clip_value, jacobian_proxy_weight=cli.jacobian_proxy_weight) if cli.residual_scale_init != 1.0: - residual_scale = ResidualScale(args.num_passes, cli.residual_scale_init).to(device) + residual_scale = ResidualScale(max_passes, cli.residual_scale_init).to(device) base_model.residual_scale = residual_scale extra_scalar_params.extend(residual_scale.parameters()) + sched_str = f" schedule={passes_schedule}" if passes_schedule else "" log0(f"recurrence: core_start={args.core_start} core_end={args.core_end} " - f"num_passes={args.num_passes} stem={base_model.num_stem} " - f"core={base_model.num_core} tail={base_model.num_tail}") + f"num_passes={args.num_passes} max_passes={max_passes} stem={base_model.num_stem} " + f"core={base_model.num_core} tail={base_model.num_tail}{sched_str}") # No DDP -- Parallel Muon handles bank grad communication via reduce-scatter, # and non-bank grads are manually all-reduced before Adam steps. 
@@ -1755,8 +1777,13 @@ def lr_mul(step: int, elapsed_ms: float) -> float: if args.warmup_steps > 0: initial_model_state = {name: tensor.detach().cpu().clone() for name, tensor in base_model.state_dict().items()} initial_optimizer_states = [copy.deepcopy(opt.state_dict()) for opt in optimizers] + _precompile_passes = sorted(set(p for _, p in passes_schedule) - {args.num_passes}) if passes_schedule else [] + _precompile_start = args.warmup_steps - len(_precompile_passes) model.train() for warmup_step in range(args.warmup_steps): + if _precompile_passes and warmup_step >= _precompile_start: + _pc_idx = warmup_step - _precompile_start + base_model.num_passes = _precompile_passes[_pc_idx] zero_grad_all() for micro_step in range(grad_accum_steps): x, y = train_loader.next_batch(args.train_batch_tokens, args.train_seq_len, grad_accum_steps) @@ -1776,6 +1803,9 @@ def lr_mul(step: int, elapsed_ms: float) -> float: zero_grad_all() if args.warmup_steps <= 20 or (warmup_step + 1) % 10 == 0 or warmup_step + 1 == args.warmup_steps: log0(f"warmup_step:{warmup_step + 1}/{args.warmup_steps}") + base_model.num_passes = args.num_passes + if stabilizer is not None: + stabilizer.reset() base_model.load_state_dict(initial_model_state, strict=True) for opt, state in zip(optimizers, initial_optimizer_states, strict=True): opt.load_state_dict(state) @@ -1842,6 +1872,14 @@ def lr_mul(step: int, elapsed_ms: float) -> float: break elapsed_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0) scale = lr_mul(step, elapsed_ms) + if passes_schedule: + target_passes = args.num_passes + for threshold_step, p in passes_schedule: + if step >= threshold_step: + target_passes = p + if target_passes != base_model.num_passes: + base_model.num_passes = target_passes + log0(f"progressive_passes: step:{step} num_passes:{target_passes}") if args.late_qat_threshold > 0 and step > 100 and scale < args.late_qat_threshold and not CastedLinear._qat_enabled: CastedLinear._qat_enabled = True base_model.core_quant_enabled = True @@ -1985,6 +2023,7 @@ def lr_mul(step: int, elapsed_ms: float) -> float: copy_len = min(eval_num_passes, old_s.shape[0]) new_s[:copy_len] = old_s[:copy_len] base_model.residual_scale.scales = nn.Parameter(new_s) + export_sd = {k: v for k, v in base_model.state_dict().items() if "mtp_heads" not in k} # Unbank 3D tensors into individual 2D tensors for quantization sd_cpu = {k: v.detach().cpu() for k, v in export_sd.items()} unbanked_sd = _unbank_state_dict(sd_cpu, args.num_layers) @@ -2038,54 +2077,7 @@ def lr_mul(step: int, elapsed_ms: float) -> float: m.float() restore_low_dim_params_to_fp32(eval_model) eval_model.load_state_dict(deq_state, strict=True) - compiled_eval = torch.compile(eval_model, dynamic=False, fullgraph=True) - torch.cuda.synchronize() - t_qeval = time.perf_counter() - q_val_loss, q_val_bpb = eval_val( - args, compiled_eval, rank, world_size, device, grad_accum_steps, - val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut, - eval_seq_len=effective_eval_seq_len, - ) - torch.cuda.synchronize() - log0( - f"final_int6_roundtrip val_loss:{q_val_loss:.4f} val_bpb:{q_val_bpb:.4f} " - f"eval_time:{1000.0 * (time.perf_counter() - t_qeval):.0f}ms" - ) - log0(f"final_int6_roundtrip_exact val_loss:{q_val_loss:.8f} val_bpb:{q_val_bpb:.8f}") - sw_seq_len = effective_eval_seq_len - if args.eval_stride > 0 and args.eval_stride < sw_seq_len: - torch.cuda.synchronize() - t_slide = time.perf_counter() - sw_val_loss, sw_val_bpb = eval_val_sliding( - args, eval_model, rank, world_size, device, 
- val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut, - stride=args.eval_stride, - eval_seq_len=sw_seq_len, - ) - torch.cuda.synchronize() - log0( - f"final_int6_sliding_window val_loss:{sw_val_loss:.4f} val_bpb:{sw_val_bpb:.4f} " - f"stride:{args.eval_stride} eval_time:{1000.0 * (time.perf_counter() - t_slide):.0f}ms" - ) - log0(f"final_int6_sliding_window_exact val_loss:{sw_val_loss:.8f} val_bpb:{sw_val_bpb:.8f}") - log0(f"final_int8_zlib_roundtrip_exact val_loss:{sw_val_loss:.8f} val_bpb:{sw_val_bpb:.8f}") - if args.eval_stride > 0 and args.eval_stride != 64 and 64 < sw_seq_len: - torch.cuda.synchronize() - t_slide64 = time.perf_counter() - sw64_val_loss, sw64_val_bpb = eval_val_sliding( - args, eval_model, rank, world_size, device, - val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut, - stride=64, - eval_seq_len=sw_seq_len, - ) - torch.cuda.synchronize() - log0( - f"final_int6_sliding_window_s64 val_loss:{sw64_val_loss:.4f} val_bpb:{sw64_val_bpb:.4f} " - f"stride:64 eval_time:{1000.0 * (time.perf_counter() - t_slide64):.0f}ms" - ) - log0(f"final_int6_sliding_window_s64_exact val_loss:{sw64_val_loss:.8f} val_bpb:{sw64_val_bpb:.8f}") - log0(f"final_int8_zlib_roundtrip_exact val_loss:{sw64_val_loss:.8f} val_bpb:{sw64_val_bpb:.8f}") - # Legal score-first TTT (PR #461 recipe) + # Legal score-first TTT (PR #461 recipe) -- skip intermediate evals to maximize TTT time budget if args.ttt_enabled: torch.cuda.synchronize() t_ttt = time.perf_counter() diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/latest-run b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/latest-run new file mode 120000 index 0000000000..2046f8dbb2 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/latest-run @@ -0,0 +1 @@ +run-20260329_174929-189wwan5 \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260329_174929-189wwan5/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260329_174929-189wwan5/files/wandb-metadata.json new file mode 100644 index 0000000000..6d773a9706 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260329_174929-189wwan5/files/wandb-metadata.json @@ -0,0 +1,51 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-29T17:49:29.115638Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "efd2b591906cdc556b015233394ff3cff1bbd298" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "42232123392" + } + }, + "memory": { + "total": 
"211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "vk6xis8pi7ne999ea8bb8o017j7p4iw9" +} \ No newline at end of file From 0375751244eeb7a472968ecab738e82207af1242 Mon Sep 17 00:00:00 2001 From: nesta Date: Sun, 29 Mar 2026 20:14:02 +0000 Subject: [PATCH 10/23] great performance of 1.114 --- .../files/config.yaml | 98 +++++++++++++++++++ .../files/wandb-summary.json | 1 + 2 files changed, 99 insertions(+) create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260329_174929-189wwan5/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260329_174929-189wwan5/files/wandb-summary.json diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260329_174929-189wwan5/files/config.yaml b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260329_174929-189wwan5/files/config.yaml new file mode 100644 index 0000000000..ce9007023d --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260329_174929-189wwan5/files/config.yaml @@ -0,0 +1,98 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + vk6xis8pi7ne999ea8bb8o017j7p4iw9: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.1" + - --no-interpass-rmsnorm + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "42232123392" + email: nesta.midavaine@prosus.com + executable: /home/nesta/parameter-golf/.venv/bin/python + git: + commit: efd2b591906cdc556b015233394ff3cff1bbd298 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + python: CPython 3.12.3 + root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback + startedAt: "2026-03-29T17:49:29.115638Z" + writerId: vk6xis8pi7ne999ea8bb8o017j7p4iw9 + m: [] + python_version: 3.12.3 + t: + "1": + - 1 + "2": + - 1 + "3": + - 2 + - 13 + - 16 + - 61 + "4": 3.12.3 + "5": 0.25.1 + "10": + - 20 + "12": 0.25.1 + "13": linux-x86_64 +core_end: + value: 7 +core_start: + value: 4 +feedback_mode: + value: diagonal +feedback_rank: + value: 2 +interpass_rmsnorm: + value: false +iterations: + value: 6500 +jacobian_proxy_weight: + value: 0.1 +matrix_lr: + value: 0.025 +model_dim: + value: 512 +n_params: + value: 26927712 +num_layers: + value: 11 +num_passes: + value: 1 +residual_scale_init: + value: 0.5 +scalar_lr: + value: 0.025 +seed: + value: 1337 +train_batch_tokens: + value: 786432 +train_seq_len: + value: 2048 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260329_174929-189wwan5/files/wandb-summary.json 
b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260329_174929-189wwan5/files/wandb-summary.json new file mode 100644 index 0000000000..151be01452 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260329_174929-189wwan5/files/wandb-summary.json @@ -0,0 +1 @@ +{"_timestamp":1.774811522902973e+09,"_step":6500,"val_loss":1.9179250494584494,"val_bpb":1.1359032457871592,"lr_scale":0.0004,"grad_norm":0.03739667311310768,"_runtime":7802.380040465,"_wandb":{"runtime":7802},"train_loss":1.9827053546905518,"step_avg_ms":697.5618967327575} \ No newline at end of file From caa6e4d787aa6c7e21bd120c08e7ea12775ed0fd Mon Sep 17 00:00:00 2001 From: nesta Date: Sun, 29 Mar 2026 21:18:17 +0000 Subject: [PATCH 11/23] changes to try to reduce 0.222 mb --- .../agent.md | 130 ++++++++++++++++++ .../run_earlyqat.sh | 91 ++++++++++++ .../run_nofeedback.sh | 93 +++++++++++++ .../run_submission.sh | 6 +- .../train_gpt_recurrent.py | 4 + .../wandb/latest-run | 2 +- .../files/wandb-metadata.json | 51 +++++++ 7 files changed, 373 insertions(+), 4 deletions(-) create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/agent.md create mode 100755 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_earlyqat.sh create mode 100755 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_nofeedback.sh create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260329_204103-e5ujh1d7/files/wandb-metadata.json diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/agent.md b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/agent.md new file mode 100644 index 0000000000..509d153d04 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/agent.md @@ -0,0 +1,130 @@ +# Agent Context: Recurrent Depth Experiments + +## Goal + +Beat the current SOTA of **1.1194 bpb** on the Parameter Golf 10-min / 8xH100 / 16MB track. The significance threshold is **0.005 nats** improvement (~1.1164 bpb or lower). + +## Architecture + +11-layer transformer with depth recurrence: layers 4-6 are the "core" block, reused multiple times. Progressive training ramps passes from 1→2→3→4 during training, then evaluates with 4 passes. ResidualScale (learnable per-pass scalars) and Jacobian proxy loss keep recurrence contractive. + +Key modules: +- `train_gpt_recurrent.py` — main training/eval script (~100KB) +- `feedback.py` — ErrorFeedbackModule (diagonal, rank 2, 2560 params) +- `stability.py` — RecurrentStabilizer + ResidualScale + +## Current Best Result (1-GPU, progressive_1to4) + +- **TTT bpb: 1.1147** (1.8820 nats) — beats SOTA by 0.009 nats +- Pre-TTT (quantized, 4-pass): 1.1526 bpb +- Artifact size: **16,222,054 bytes — OVER the 16,000,000 limit by 222KB** +- Model compressed (int6+lzma preset 6): 16,122,576 bytes +- Code: 99,478 bytes +- Log: `logs/progressive_1to4.txt` + +## The Size Problem + +The 16MB limit is **decimal 16,000,000 bytes** (confirmed in repo README.md line 171: "The cap is decimal 16MB, i.e. 16,000,000 total bytes, not 16 MiB / 16,777,216 bytes"). + +The compressed model ALONE (16.1MB) exceeds the limit. LZMA preset 7/8/9 tested — they make it **worse** (the data is already at LZMA's sweet spot at preset 6). The size overshoot is due to weight entropy from progressive training producing less compressible weight distributions compared to the baseline. + +The SOTA baseline (same architecture, non-recurrent) compressed to 15.99MB. 
Our model has identical parameter count (26,927,712) but the progressive training changes weight distributions. + +## Three Run Scripts (all 8xH100, 600s wallclock) + +### 1. `run_submission.sh` — Winning config +- LATE_QAT_THRESHOLD=0.15 (~200 QAT steps on 8 GPUs) +- feedback-mode diagonal +- Best bpb but artifact oversized + +### 2. `run_earlyqat.sh` — Earlier QAT for smaller artifact +- LATE_QAT_THRESHOLD=0.25 (~400 QAT steps on 8 GPUs) +- feedback-mode diagonal +- Hypothesis: more QAT steps = lower weight entropy = better compression + +### 3. `run_nofeedback.sh` — No feedback + early QAT +- LATE_QAT_THRESHOLD=0.25 +- feedback-mode none +- Hypothesis: feedback module was NEVER used at eval/TTT time (bug — `eval_val_sliding_ttt` never passes `feedback_fn` to forward calls). So the model trained WITH feedback corrections it won't have at eval. Removing feedback from training should make the model learn to be stable without corrections, potentially improving eval quality AND removing 2560 unused training params from the optimizer. + +## Key Training Config (shared across all runs) + +``` +ITERATIONS=6500 # wallclock cap stops it at ~6100-6200 steps on 8 GPUs +WARMDOWN_ITERS=2500 # time-based when wallclock active +PASSES_SCHEDULE="0:1,4500:2,5500:3,6000:4" +NUM_PASSES=1 # initial passes +EVAL_PASSES=4 # override at eval time +CORE_START=4 CORE_END=7 # layers 4-6 are the recurrent core +TRAIN_BATCH_TOKENS=786432 # matches 8-GPU effective batch +SWA_ENABLED=1 SWA_EVERY=50 +--residual-scale-init 0.5 +--jacobian-proxy-weight 0.1 +--no-interpass-rmsnorm +``` + +## lr_mul is Time-Based on 8 GPUs + +When `MAX_WALLCLOCK_SECONDS=600` is set, the `lr_mul` function switches from step-based to time-based warmdown. `WARMDOWN_ITERS` controls the warmdown duration as `warmdown_iters * step_avg_ms` in real time. SWA triggers at `scale < 0.2`, Late QAT at `scale < threshold`. These auto-adapt to step speed. + +Estimated 8-GPU timeline: +- Steps 0-4499: 1-pass, ~87ms/step +- Steps 4500-5499: 2-pass, ~116ms/step +- Steps 5500-5999: 3-pass, ~140ms/step +- Steps 6000+: 4-pass, ~178ms/step +- QAT 0.25 triggers ~step 5750, QAT 0.15 ~step 5950 +- SWA starts ~step 5800 +- Training ends ~step 6150 (wallclock cap) + +## Pre-compilation of Progressive Passes + +`torch.compile` traces are cached during warmup. The last N warmup steps cycle through each pass count variant so all compiled graphs are ready before the timed training loop. No recompilation overhead during training. + +## The Feedback Bug + +`eval_val_sliding_ttt` calls `base_model.forward_logits(x_batch)` (scoring) and `base_model(x, y)` (TTT training) WITHOUT passing `feedback_fn`. Both `forward()` and `forward_logits()` accept `feedback_fn=None` as default. The `_forward_hidden` method applies feedback at line 1001-1002: + +```python +if feedback_fn is not None: + x = x + feedback_fn(x, k) +``` + +So during eval/TTT, this is always skipped. The model was trained expecting corrections between passes but never gets them at inference. This is why `run_nofeedback.sh` exists as an experiment. + +The feedback weights ARE maintained through EMA (lines 1987-1991) and exist in memory, but they are NOT in the exported artifact (`export_sd` only contains `base_model.state_dict()` minus mtp_heads). + +## Quantization Pipeline + +1. EMA weights loaded into base_model +2. `num_passes` overridden from training value to `EVAL_PASSES=4` +3. ResidualScale padded for extra passes (init 0.5 for new passes) +4. 
`export_sd` captured (re-captured AFTER ResidualScale resize — this was a critical bug fix) +5. State dict unbanked (3D parameter banks → individual 2D weight matrices) +6. `mixed_quantize_int6` quantizes weights: int6 for mlp/attn categories, fp16/fp32 for small params +7. `torch.save` → `lzma.compress(preset=6)` → `final_model.int6.ptz` +8. Decompressed and loaded into fresh `eval_model` for TTT + +## Files on Disk + +- `final_model.pt` — uncompressed model (106MB), can be re-quantized offline +- `final_model.int6.ptz` — compressed artifact (16.1MB), from the progressive_1to4 run +- `logs/progressive_1to4.txt` — the winning run log (1.1147 bpb) +- `logs/full5600.txt`, `full5600_v2.txt` — earlier constant 2-pass runs +- `logs/full_11L_int8core.txt` — failed 11L int8-core experiment +- `logs/smoke_mixedprec*.txt` — early mixed-precision smoke tests +- `2026-03-26_RecurrentSOTA_Feedback_BACKUP/` — backup of the folder before progressive changes + +## What to Watch For in New Runs + +1. **Artifact size** — must be under 16,000,000 bytes total (model + code). The key line is `Total submission size int6+lzma: XXXXX bytes`. +2. **TTT bpb** — must beat 1.1194 by at least 0.005 nats. Look for `legal_ttt_exact`. +3. **QAT step count** — look for `late_qat:enabled step:XXXX`. More QAT steps = better compression but potentially worse loss. +4. **SWA start** — look for `swa:start step:XXXX`. +5. **Wallclock stop** — look for `stopping_early: wallclock_cap`. + +## Competition Submission Format + +- All counted code must live in a single `train_gpt.py` script (per README.md). Currently we have 3 files — feedback.py and stability.py should be inlined before final submission. +- Need 3 seeds for statistical significance (p < 0.01). +- `submission.json` needs to be filled with 3-seed mean bpb and bytes_total. +- The PR goes to https://github.com/nestamidavaine/parameter-golf (fork of openai/parameter-golf). 
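
## Appendix: Illustrative Sketches

The sections above reference several mechanisms whose full code lives in `train_gpt.py`, `feedback.py`, and `stability.py`. The short, self-contained sketches below restate them for quick reference; helper names (`parse_schedule`, `current_passes`, and similar) are illustrative, not the script's own, and details may differ from the real implementations.

First, the `PASSES_SCHEDULE` semantics from the Key Training Config: a comma-separated list of `step:passes` pairs, where the last threshold at or below the current step wins.

```python
def parse_schedule(s: str) -> list[tuple[int, int]]:
    """Parse '0:1,4500:2,5500:3,6000:4' into sorted (step, passes) pairs."""
    pairs = []
    for entry in s.split(","):
        step, passes = entry.strip().split(":")
        pairs.append((int(step), int(passes)))
    return sorted(pairs)

def current_passes(schedule: list[tuple[int, int]], step: int, default: int = 1) -> int:
    """Last threshold <= step wins; before the first threshold, use default."""
    passes = default
    for threshold, p in schedule:
        if step >= threshold:
            passes = p
    return passes

sched = parse_schedule("0:1,4500:2,5500:3,6000:4")
assert current_passes(sched, 5200) == 2  # between the 4500 and 5500 thresholds
assert current_passes(sched, 6100) == 4  # final ramp stage
```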
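The time-based warmdown from "lr_mul is Time-Based on 8 GPUs" can be pictured as below. This assumes a linear ramp to zero over the final `warmdown_iters * step_avg_ms` milliseconds of the wallclock budget; the exact shape of `lr_mul` is not visible in this diff, so treat this as an approximation of the behavior, not the script's formula.

```python
def lr_mul_sketch(elapsed_ms: float, step_avg_ms: float,
                  max_wallclock_ms: float = 600_000.0,
                  warmdown_iters: int = 2500) -> float:
    """Assumed linear time-based warmdown: full LR until the final warmdown
    window opens, then decay to 0 as the wallclock cap approaches."""
    warmdown_ms = warmdown_iters * step_avg_ms
    remaining_ms = max_wallclock_ms - elapsed_ms
    if remaining_ms >= warmdown_ms:
        return 1.0
    return max(remaining_ms / warmdown_ms, 0.0)

# The triggers described above key off this scale:
#   scale < 0.25 -> early QAT, scale < 0.20 -> SWA start, scale < 0.15 -> late QAT
```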
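The contraction machinery (ResidualScale plus the Jacobian proxy loss) reduces to a few lines. This is a hedged re-derivation from the update rule and penalty described above, not a copy of `stability.py`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualScaleSketch(nn.Module):
    """Learnable per-pass scalar alpha_k: h_{k+1} = h_k + alpha_k * delta."""
    def __init__(self, num_passes: int, init: float = 0.5):
        super().__init__()
        self.scales = nn.Parameter(torch.full((num_passes,), init))

    def forward(self, h: torch.Tensor, delta: torch.Tensor, pass_idx: int) -> torch.Tensor:
        return h + self.scales[pass_idx] * delta

def jacobian_proxy_loss(h: torch.Tensor, delta: torch.Tensor, weight: float = 0.1) -> torch.Tensor:
    """weight * relu(||delta|| / ||h|| - 1)^2: zero while the core update is
    contractive, quadratic once a pass grows the hidden state."""
    ratio = delta.norm() / h.norm().clamp_min(1e-8)
    return weight * F.relu(ratio - 1.0).square()
```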
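Step 6 of the quantization pipeline uses per-row int6 quantization. A sketch consistent with the `quantize_int6_per_row(t, clip_range=31)` signature in `train_gpt.py`, assuming symmetric absmax scaling (the real function may round or group rows differently):

```python
import torch

def quantize_int6_per_row_sketch(t: torch.Tensor, clip_range: int = 31):
    """Symmetric per-row quantization to [-31, 31]; values ride in an int8 container."""
    t32 = t.float()
    scale = t32.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / clip_range
    q = torch.round(t32 / scale).clamp_(-clip_range, clip_range).to(torch.int8)
    return q, scale

def dequantize_sketch(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(512, 1536)
q, s = quantize_int6_per_row_sketch(w)
# Round-trip error is bounded by half a quantization step per row.
assert (dequantize_sketch(q, s) - w).abs().max() <= 0.5 * s.max() + 1e-6
```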
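Finally, the preset-6-vs-7/8/9 comparison from "The Size Problem" can be reproduced offline against the saved `final_model.pt` weights. `quantized_state_dict` below is a hypothetical variable standing in for the already-quantized export:

```python
import io
import lzma
import torch

def serialized_bytes(state_dict) -> bytes:
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    return buf.getvalue()

def probe_presets(blob: bytes, presets=(6, 7, 8, 9)) -> dict[int, int]:
    """Compressed size per LZMA preset; on this artifact, 7-9 came out larger."""
    return {p: len(lzma.compress(blob, preset=p)) for p in presets}

# Usage:
#   sizes = probe_presets(serialized_bytes(quantized_state_dict))
#   print(sizes)  # e.g. {6: ..., 7: ..., 8: ..., 9: ...}
```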
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_earlyqat.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_earlyqat.sh new file mode 100755 index 0000000000..f57425b4d8 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_earlyqat.sh @@ -0,0 +1,91 @@ +#!/bin/bash +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +cd "$SCRIPT_DIR" + +if [ -f /home/nesta/parameter-golf/.env ]; then + set -a; source /home/nesta/parameter-golf/.env; set +a +fi + +export PYTHONUNBUFFERED=1 +export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True + +# --- Data paths --- +export DATA_PATH="../../../data/datasets/fineweb10B_sp1024" +export TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model" + +# --- Architecture (11 layers, matching baseline capacity) --- +export NUM_LAYERS=11 +export MODEL_DIM=512 +export NUM_HEADS=8 +export NUM_KV_HEADS=4 +export BIGRAM_VOCAB_SIZE=1536 +export XSA_LAST_N=4 +export ROPE_DIMS=16 +export LN_SCALE=1 +export VE_ENABLED=1 +export VE_DIM=128 +export VE_LAYERS="9,10" + +# --- Training schedule (progressive 1->4 passes, wallclock-capped at 600s on 8xH100) --- +export ITERATIONS=6500 +export MAX_WALLCLOCK_SECONDS=600 +export VAL_LOSS_EVERY=500 +export TRAIN_LOG_EVERY=50 +export WARMUP_STEPS=20 +export WARMDOWN_ITERS=2500 +export TRAIN_BATCH_TOKENS=786432 +export TRAIN_SEQ_LEN=2048 +export EVAL_SEQ_LEN=2048 +export EVAL_STRIDE=64 + +# --- Optimizer (matches SOTA) --- +export MATRIX_LR=0.025 +export SCALAR_LR=0.025 +export TIED_EMBED_LR=0.035 +export MUON_MOMENTUM=0.99 +export MUON_MOMENTUM_WARMUP_START=0.92 +export MUON_MOMENTUM_WARMUP_STEPS=1500 +export MUON_WD=0.04 +export ADAM_WD=0.04 +export GRAD_CLIP_NORM=0.3 + +# --- Weight averaging & quantization --- +# EARLY QAT: threshold 0.25 (vs 0.15 in winning config) to reduce weight entropy +export SWA_ENABLED=1 +export SWA_EVERY=50 +export LATE_QAT_THRESHOLD=0.25 + +# --- TTT (matches SOTA, freeze_blocks=0) --- +export TTT_ENABLED=1 +export TTT_LR=0.002 +export TTT_EPOCHS=3 +export TTT_CHUNK_TOKENS=32768 +export TTT_FREEZE_BLOCKS=0 +export TTT_MOMENTUM=0.9 +export TTT_BATCH_SEQS=32 +export TTT_GRAD_CLIP=1.0 + +# --- Recurrence (our contribution) --- +export CORE_START=4 +export CORE_END=7 +export NUM_PASSES=1 +export EVAL_PASSES=4 +export CORE_QUANT_ENABLED=0 +# Progressive: 1-pass until step 4500, then ramp 2->3->4 +export PASSES_SCHEDULE="0:1,4500:2,5500:3,6000:4" + +# --- W&B --- +export WANDB_PROJECT="parameter-golf" + +export SEED=1337 +export WANDB_NAME="earlyqat_025" +export RUN_ID="earlyqat_025" + +torchrun --standalone --nproc_per_node=8 train_gpt_recurrent.py \ + --feedback-mode diagonal --feedback-rank 2 \ + --residual-scale-init 0.5 \ + --jacobian-proxy-weight 0.1 \ + --no-interpass-rmsnorm \ + 2>&1 | tee logs/earlyqat_025.txt diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_nofeedback.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_nofeedback.sh new file mode 100755 index 0000000000..3c5d54e432 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_nofeedback.sh @@ -0,0 +1,93 @@ +#!/bin/bash +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +cd "$SCRIPT_DIR" + +if [ -f /home/nesta/parameter-golf/.env ]; then + set -a; source /home/nesta/parameter-golf/.env; set +a +fi + +export PYTHONUNBUFFERED=1 +export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True + +# --- Data paths --- +export DATA_PATH="../../../data/datasets/fineweb10B_sp1024" 
+export TOKENIZER_PATH="../../../data/tokenizers/fineweb_1024_bpe.model" + +# --- Architecture (11 layers, matching baseline capacity) --- +export NUM_LAYERS=11 +export MODEL_DIM=512 +export NUM_HEADS=8 +export NUM_KV_HEADS=4 +export BIGRAM_VOCAB_SIZE=1536 +export XSA_LAST_N=4 +export ROPE_DIMS=16 +export LN_SCALE=1 +export VE_ENABLED=1 +export VE_DIM=128 +export VE_LAYERS="9,10" + +# --- Training schedule (progressive 1->4 passes, wallclock-capped at 600s on 8xH100) --- +export ITERATIONS=6500 +export MAX_WALLCLOCK_SECONDS=600 +export VAL_LOSS_EVERY=500 +export TRAIN_LOG_EVERY=50 +export WARMUP_STEPS=20 +export WARMDOWN_ITERS=2500 +export TRAIN_BATCH_TOKENS=786432 +export TRAIN_SEQ_LEN=2048 +export EVAL_SEQ_LEN=2048 +export EVAL_STRIDE=64 + +# --- Optimizer (matches SOTA) --- +export MATRIX_LR=0.025 +export SCALAR_LR=0.025 +export TIED_EMBED_LR=0.035 +export MUON_MOMENTUM=0.99 +export MUON_MOMENTUM_WARMUP_START=0.92 +export MUON_MOMENTUM_WARMUP_STEPS=1500 +export MUON_WD=0.04 +export ADAM_WD=0.04 +export GRAD_CLIP_NORM=0.3 + +# --- Weight averaging & quantization --- +# EARLY QAT: threshold 0.25 + NO FEEDBACK MODULE +# Feedback was never used at eval/TTT time (bug), so removing it from training +# means the model learns to be stable without corrections it won't have at eval +export SWA_ENABLED=1 +export SWA_EVERY=50 +export LATE_QAT_THRESHOLD=0.25 + +# --- TTT (matches SOTA, freeze_blocks=0) --- +export TTT_ENABLED=1 +export TTT_LR=0.002 +export TTT_EPOCHS=3 +export TTT_CHUNK_TOKENS=32768 +export TTT_FREEZE_BLOCKS=0 +export TTT_MOMENTUM=0.9 +export TTT_BATCH_SEQS=32 +export TTT_GRAD_CLIP=1.0 + +# --- Recurrence (our contribution) --- +export CORE_START=4 +export CORE_END=7 +export NUM_PASSES=1 +export EVAL_PASSES=4 +export CORE_QUANT_ENABLED=0 +# Progressive: 1-pass until step 4500, then ramp 2->3->4 +export PASSES_SCHEDULE="0:1,4500:2,5500:3,6000:4" + +# --- W&B --- +export WANDB_PROJECT="parameter-golf" + +export SEED=1337 +export WANDB_NAME="nofeedback_earlyqat" +export RUN_ID="nofeedback_earlyqat" + +torchrun --standalone --nproc_per_node=8 train_gpt_recurrent.py \ + --feedback-mode none \ + --residual-scale-init 0.5 \ + --jacobian-proxy-weight 0.1 \ + --no-interpass-rmsnorm \ + 2>&1 | tee logs/nofeedback_earlyqat.txt diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_submission.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_submission.sh index 9350d4b4a3..d4b8447954 100755 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_submission.sh +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_submission.sh @@ -28,13 +28,13 @@ export VE_ENABLED=1 export VE_DIM=128 export VE_LAYERS="9,10" -# --- Training schedule (tuned for 11-layer 2-pass model @ ~112ms/step on 8xH100) --- -export ITERATIONS=5200 +# --- Training schedule (progressive 1->4 passes, wallclock-capped at 600s on 8xH100) --- +export ITERATIONS=6500 export MAX_WALLCLOCK_SECONDS=600 export VAL_LOSS_EVERY=500 export TRAIN_LOG_EVERY=50 export WARMUP_STEPS=20 -export WARMDOWN_ITERS=2000 +export WARMDOWN_ITERS=2500 export TRAIN_BATCH_TOKENS=786432 export TRAIN_SEQ_LEN=2048 export EVAL_SEQ_LEN=2048 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py index 2bd13d920e..20817b3ddb 100644 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py +++ 
b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py @@ -2037,6 +2037,10 @@ def lr_mul(step: int, elapsed_ms: float) -> float: f.write(quant_blob) quant_file_bytes = len(quant_blob) code_bytes = len(code.encode("utf-8")) + for extra in ["feedback.py", "stability.py"]: + p = Path(__file__).parent / extra + if p.exists(): + code_bytes += p.stat().st_size log0(f"Serialized model int6+lzma: {quant_file_bytes} bytes") log0(f"Total submission size int6+lzma: {quant_file_bytes + code_bytes} bytes") if distributed: diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/latest-run b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/latest-run index 2046f8dbb2..9f686dc6c6 120000 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/latest-run +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/latest-run @@ -1 +1 @@ -run-20260329_174929-189wwan5 \ No newline at end of file +run-20260329_204103-e5ujh1d7 \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260329_204103-e5ujh1d7/files/wandb-metadata.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260329_204103-e5ujh1d7/files/wandb-metadata.json new file mode 100644 index 0000000000..976c1119a0 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260329_204103-e5ujh1d7/files/wandb-metadata.json @@ -0,0 +1,51 @@ +{ + "os": "Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39", + "python": "CPython 3.12.3", + "startedAt": "2026-03-29T20:41:03.814826Z", + "args": [ + "--feedback-mode", + "diagonal", + "--feedback-rank", + "2", + "--residual-scale-init", + "0.5", + "--jacobian-proxy-weight", + "0.1", + "--no-interpass-rmsnorm" + ], + "program": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePath": "records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", + "codePathLocal": "train_gpt_recurrent.py", + "git": { + "remote": "https://github.com/nestamidavaine/parameter-golf.git", + "commit": "0375751244eeb7a472968ecab738e82207af1242" + }, + "email": "nesta.midavaine@prosus.com", + "root": "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback", + "host": "computeinstance-e00c09e8zde17qbk32", + "executable": "/home/nesta/parameter-golf/.venv/bin/python", + "cpu_count": 8, + "cpu_count_logical": 16, + "gpu": "NVIDIA H200", + "gpu_count": 1, + "disk": { + "/": { + "total": "1330227675136", + "used": "42672181248" + } + }, + "memory": { + "total": "211069919232" + }, + "gpu_nvidia": [ + { + "name": "NVIDIA H200", + "memoryTotal": "150754820096", + "cudaCores": 16896, + "architecture": "Hopper", + "uuid": "GPU-e312faf2-f704-c38a-00a2-ba4137b99846" + } + ], + "cudaVersion": "13.0", + "writerId": "4meixsdgz5im6n6dg04xzn4o4hrbs3lg" +} \ No newline at end of file From a981c414302079d4b51aad01418c0640d0a52106 Mon Sep 17 00:00:00 2001 From: nesta Date: Sun, 29 Mar 2026 21:34:51 +0000 Subject: [PATCH 12/23] clean up submission format --- .../DIFF_VS_BASELINE.md | 120 +++++++++++ .../README.md | 2 +- .../agent.md | 10 +- .../run_earlyqat.sh | 2 +- .../run_nofeedback.sh | 2 +- .../run_submission.sh | 2 +- .../{train_gpt_recurrent.py => train_gpt.py} | 196 +++++++++++++++++- 7 files changed, 319 insertions(+), 15 deletions(-) create mode 100644 
records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/DIFF_VS_BASELINE.md rename records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/{train_gpt_recurrent.py => train_gpt.py} (93%) diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/DIFF_VS_BASELINE.md b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/DIFF_VS_BASELINE.md new file mode 100644 index 0000000000..22dfe1c92c --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/DIFF_VS_BASELINE.md @@ -0,0 +1,120 @@ +# Differences vs Baseline (LeakyReLU + Legal TTT + Parallel Muon) + +Baseline: `records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/train_gpt.py` +Recurrent: `records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt.py` + +Everything not listed below is identical (tokenizer, data loading, optimizer, SWA, EMA, TTT algorithm, quantization scheme, attention, MLP, block structure, embeddings). + +--- + +## Architecture + +### Layer split changed +- **Baseline**: encoder (first half) / decoder (second half) with skip connections between them +- **Recurrent**: stem (layers 0..core_start-1) / core (layers core_start..core_end-1, looped N times) / tail (layers core_end..num_layers-1) with skip connections between stem and tail + +### New: Recurrent core with progressive passes +- Core layers are run `num_passes` times per forward pass +- `PASSES_SCHEDULE` ramps passes during training: `"0:1,4500:2,5500:3,6000:4"` +- `EVAL_PASSES=4` overrides pass count at eval time (train cheap, eval deep) + +### New: ResidualScale (inlined from stability.py) +- Learnable per-pass scalar `alpha_k` contracts residual: `h_{k+1} = h_k + alpha_k * delta` +- Init 0.5, learned during training +- Prevents hidden state magnitude growth across passes + +### New: ErrorFeedbackModule (inlined from feedback.py) +- Low-rank residual approximation + diagonal correction between passes +- 2560 params (rank=2, dim=512) +- Inactive on pass 0, active on subsequent passes +- **Known bug**: never passed to eval/TTT forward calls, so corrections are absent at inference + +### New: Jacobian proxy loss +- Regularization: `lambda * relu(||delta||/||h|| - 1)^2` with lambda=0.1 +- Penalizes hidden state growth ratio > 1.0, enforcing contractive dynamics +- Only during training, only on core block + +### XSA skips core layers +- Baseline: XSA on last N layers unconditionally +- Recurrent: XSA on last N layers BUT skips core layers (4-6) since they run multiple times + +### New: `_fake_quantize` for core weights +- STE-based fake int6 quantization applied to core bank weights during training +- Starts disabled (`CORE_QUANT_ENABLED=0`), auto-enabled when late QAT triggers +- Separate from baseline's `CastedLinear` QAT since core weights come from parameter banks + +### New: `_forward_hidden` method +- Shared implementation for both `forward()` and `forward_logits()` +- Returns `(x, h_core_in, h_core_out)` for Jacobian proxy loss +- Both accept `feedback_fn=None, stabilizer=None` kwargs + +--- + +## Training Loop + +### Progressive passes schedule +- Each step checks schedule and dynamically updates `base_model.num_passes` +- Pre-compilation: last N warmup steps cycle through each pass count variant to cache `torch.compile` graphs (zero recompilation overhead during training) + +### Forward pass signature +- **Baseline**: `loss = model(x, y)` +- **Recurrent**: `loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer)` + +### Late QAT extended +- Adds `step > 100` guard +- Also enables 
`base_model.core_quant_enabled = True` for bank weight fake quantization + +### EMA includes feedback module +- Feedback weights stored with `_fb.` prefix in EMA state +- Separated back out when applying EMA (model gets model keys, feedback gets `_fb.` keys) + +### WandB integration (new) +- Conditional logging of train_loss, val_loss, val_bpb, grad_norm, step_avg_ms, lr_scale, growth ratios + +### Stability diagnostics +- Per-step recording of h_norms, growth_ratios from stabilizer +- Logged at validation steps, then reset + +--- + +## Post-Training / Evaluation + +### Intermediate evaluations removed +- **Baseline**: runs int6 roundtrip eval, sliding window eval (both strides), then TTT +- **Recurrent**: skips all intermediate evals, goes straight to TTT to maximize time budget + +### Eval passes override +- After export, `num_passes` changed from training value to `EVAL_PASSES` +- `ResidualScale.scales` padded for extra passes (init 0.5 for new entries) +- `export_sd` re-captured after resize + +### No torch.compile on eval model +- **Baseline**: compiles eval model before quantized evaluation +- **Recurrent**: uses eval model directly (no compilation before TTT) + +--- + +## New CLI Arguments + +| Argument | Default | Purpose | +|---|---|---| +| `--feedback-mode` | diagonal | identity/diagonal/low_rank/none | +| `--feedback-rank` | 2 | Rank for low-rank components | +| `--per-pass-feedback` | False | Separate correction per pass | +| `--residual-scale-init` | 0.5 | Init value for per-pass scaling | +| `--jacobian-proxy-weight` | 0.01 | Jacobian proxy regularization weight | +| `--no-interpass-rmsnorm` | False | Disable RMSNorm between passes | +| `--clip-hidden` | False | Enable hidden-state clipping | +| `--clip-value` | 10.0 | Clipping threshold | + +## New Hyperparameters (env vars) + +| Variable | Default | Purpose | +|---|---|---| +| `CORE_START` | 3 | First layer of recurrent core | +| `CORE_END` | 8 | Last layer (exclusive) of recurrent core | +| `NUM_PASSES` | 1 | Initial number of recurrent passes | +| `EVAL_PASSES` | 0 | Override pass count for evaluation (0=use NUM_PASSES) | +| `PASSES_SCHEDULE` | "" | Progressive schedule, e.g. 
"0:1,4500:2,5500:3,6000:4" | +| `CORE_QUANT_BITS` | 6 | Bit-width for core fake quantization | +| `CORE_QUANT_ENABLED` | 0 | Initial state (auto-enabled by late QAT) | diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/README.md b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/README.md index 5244d3f503..7eef173d47 100644 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/README.md +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/README.md @@ -188,7 +188,7 @@ MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3500 \ ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \ CORE_START=4 CORE_END=7 NUM_PASSES=2 EVAL_PASSES=4 \ SEED=1337 \ -torchrun --standalone --nproc_per_node=8 train_gpt_recurrent.py \ +torchrun --standalone --nproc_per_node=8 train_gpt.py \ --feedback-mode diagonal --feedback-rank 2 \ --residual-scale-init 0.5 \ --jacobian-proxy-weight 0.1 \ diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/agent.md b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/agent.md index 509d153d04..ac903adf14 100644 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/agent.md +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/agent.md @@ -8,10 +8,10 @@ Beat the current SOTA of **1.1194 bpb** on the Parameter Golf 10-min / 8xH100 / 11-layer transformer with depth recurrence: layers 4-6 are the "core" block, reused multiple times. Progressive training ramps passes from 1→2→3→4 during training, then evaluates with 4 passes. ResidualScale (learnable per-pass scalars) and Jacobian proxy loss keep recurrence contractive. -Key modules: -- `train_gpt_recurrent.py` — main training/eval script (~100KB) -- `feedback.py` — ErrorFeedbackModule (diagonal, rank 2, 2560 params) -- `stability.py` — RecurrentStabilizer + ResidualScale +Key modules (all inlined into single file per competition rules): +- `train_gpt.py` — main training/eval script with ErrorFeedbackModule, RecurrentStabilizer, and ResidualScale inlined at the top +- `feedback.py` — original source (kept for reference, no longer imported) +- `stability.py` — original source (kept for reference, no longer imported) ## Current Best Result (1-GPU, progressive_1to4) @@ -124,7 +124,7 @@ The feedback weights ARE maintained through EMA (lines 1987-1991) and exist in m ## Competition Submission Format -- All counted code must live in a single `train_gpt.py` script (per README.md). Currently we have 3 files — feedback.py and stability.py should be inlined before final submission. +- All counted code must live in a single `train_gpt.py` script (per README.md). feedback.py and stability.py have been inlined into train_gpt.py. - Need 3 seeds for statistical significance (p < 0.01). - `submission.json` needs to be filled with 3-seed mean bpb and bytes_total. - The PR goes to https://github.com/nestamidavaine/parameter-golf (fork of openai/parameter-golf). 
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_earlyqat.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_earlyqat.sh index f57425b4d8..d6d62ab867 100755 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_earlyqat.sh +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_earlyqat.sh @@ -83,7 +83,7 @@ export SEED=1337 export WANDB_NAME="earlyqat_025" export RUN_ID="earlyqat_025" -torchrun --standalone --nproc_per_node=8 train_gpt_recurrent.py \ +torchrun --standalone --nproc_per_node=8 train_gpt.py \ --feedback-mode diagonal --feedback-rank 2 \ --residual-scale-init 0.5 \ --jacobian-proxy-weight 0.1 \ diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_nofeedback.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_nofeedback.sh index 3c5d54e432..764ae84504 100755 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_nofeedback.sh +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_nofeedback.sh @@ -85,7 +85,7 @@ export SEED=1337 export WANDB_NAME="nofeedback_earlyqat" export RUN_ID="nofeedback_earlyqat" -torchrun --standalone --nproc_per_node=8 train_gpt_recurrent.py \ +torchrun --standalone --nproc_per_node=8 train_gpt.py \ --feedback-mode none \ --residual-scale-init 0.5 \ --jacobian-proxy-weight 0.1 \ diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_submission.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_submission.sh index d4b8447954..2ca6611288 100755 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_submission.sh +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_submission.sh @@ -90,7 +90,7 @@ for SEED in 1337 42 2025; do echo "" echo "=== SEED=${SEED} started $(date) ===" - torchrun --standalone --nproc_per_node=8 train_gpt_recurrent.py \ + torchrun --standalone --nproc_per_node=8 train_gpt.py \ --feedback-mode diagonal --feedback-rank 2 \ --residual-scale-init 0.5 \ --jacobian-proxy-weight 0.1 \ diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt.py similarity index 93% rename from records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py rename to records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt.py index 20817b3ddb..3e78c2c6ee 100644 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt.py @@ -29,10 +29,198 @@ _gpu_mem_frac = float(os.environ.get("CUDA_MEM_FRACTION", "0")) if _gpu_mem_frac > 0: torch.cuda.set_per_process_memory_fraction(_gpu_mem_frac, 0) +from dataclasses import dataclass, field from flash_attn_interface import flash_attn_func as flash_attn_3_func import argparse -from feedback import ErrorFeedbackModule -from stability import RecurrentStabilizer, ResidualScale + +# ── Stability monitoring and control for recurrent passes ── + +@dataclass +class PassDiagnostics: + h_norms: list[float] = field(default_factory=list) + delta_norms: list[float] = field(default_factory=list) + error_norms: list[float] = field(default_factory=list) + correction_norms: list[float] = field(default_factory=list) + growth_ratios: list[float] = field(default_factory=list) + + def reset(self): + for lst in (self.h_norms, self.delta_norms, self.error_norms, + self.correction_norms, 
self.growth_ratios): + lst.clear() + + def summary(self) -> dict[str, list[float]]: + return { + "h_norms": list(self.h_norms), + "delta_norms": list(self.delta_norms), + "error_norms": list(self.error_norms), + "correction_norms": list(self.correction_norms), + "growth_ratios": list(self.growth_ratios), + } + + +class RecurrentStabilizer: + def __init__( + self, + clip_hidden: bool = False, + clip_value: float = 10.0, + clip_mode: str = "value", + jacobian_proxy_weight: float = 0.0, + eps: float = 1e-6, + ): + self.clip_hidden = clip_hidden + self.clip_value = clip_value + self.clip_mode = clip_mode + self.jacobian_proxy_weight = jacobian_proxy_weight + self.eps = eps + self.diagnostics = PassDiagnostics() + + def clip(self, h: Tensor) -> Tensor: + if not self.clip_hidden: + return h + if self.clip_mode == "value": + return torch.clamp(h, -self.clip_value, self.clip_value) + norm = h.norm(dim=-1, keepdim=True) + scale = torch.clamp(self.clip_value / (norm + self.eps), max=1.0) + return h * scale + + def record_pass( + self, + h_prev: Tensor, + h_next: Tensor, + error: Tensor | None = None, + correction: Tensor | None = None, + ): + with torch.no_grad(): + h_pn = h_prev.float().norm().item() + h_nn = h_next.float().norm().item() + self.diagnostics.h_norms.append(h_nn) + self.diagnostics.delta_norms.append( + (h_next - h_prev).float().norm().item() + ) + self.diagnostics.growth_ratios.append(h_nn / (h_pn + self.eps)) + if error is not None: + self.diagnostics.error_norms.append(error.float().norm().item()) + if correction is not None: + self.diagnostics.correction_norms.append( + correction.float().norm().item() + ) + + def jacobian_proxy_loss(self, h_in: Tensor, h_out: Tensor) -> Tensor: + if self.jacobian_proxy_weight <= 0: + return h_in.new_zeros(()) + delta = h_out - h_in + ratio = delta.norm() / (h_in.norm() + self.eps) + return self.jacobian_proxy_weight * torch.relu(ratio - 1.0).square() + + def reset(self): + self.diagnostics.reset() + + +class ResidualScale(nn.Module): + def __init__(self, num_passes: int, init_value: float = 1.0): + super().__init__() + self.scales = nn.Parameter( + torch.full((num_passes,), init_value, dtype=torch.float32) + ) + + def forward(self, residual: Tensor, pass_idx: int) -> Tensor: + return self.scales[pass_idx].to(dtype=residual.dtype) * residual + + +# ── Error feedback modules for recurrent quantization correction ── + +class LowRankResidual(nn.Module): + def __init__(self, dim: int, rank: int = 2): + super().__init__() + self.V = nn.Parameter(torch.zeros(dim, rank)) + self.U = nn.Parameter(torch.zeros(dim, rank)) + + def forward(self, h: Tensor) -> Tensor: + return (h @ self.V) @ self.U.T + + +class DiagonalFeedback(nn.Module): + def __init__(self, dim: int, init_ones: bool = False): + super().__init__() + init_val = torch.ones(dim) if init_ones else torch.zeros(dim) + self.d = nn.Parameter(init_val) + + def forward(self, e: Tensor) -> Tensor: + return self.d.to(dtype=e.dtype) * e + + +class LowRankFeedback(nn.Module): + def __init__(self, dim: int, rank: int = 2): + super().__init__() + self.V_D = nn.Parameter(torch.zeros(dim, rank)) + self.U_D = nn.Parameter(torch.zeros(dim, rank)) + + def forward(self, e: Tensor) -> Tensor: + return (e @ self.V_D) @ self.U_D.T + + +class AffineJunction(nn.Module): + def __init__(self, dim: int): + super().__init__() + self.gamma = nn.Parameter(torch.ones(dim)) + self.beta = nn.Parameter(torch.zeros(dim)) + + def forward(self, h: Tensor) -> Tensor: + return self.gamma.to(dtype=h.dtype) * h + 
self.beta.to(dtype=h.dtype) + + +class ErrorFeedbackModule(nn.Module): + def __init__( + self, + dim: int, + rank: int = 2, + feedback_mode: str = "diagonal", + per_pass: bool = False, + num_passes: int = 3, + affine_junction: bool = False, + ): + super().__init__() + self.feedback_mode = feedback_mode + self.per_pass = per_pass + self.num_passes = num_passes + self.residual = LowRankResidual(dim, rank) + if feedback_mode == "identity": + self.correction: nn.Module | nn.ModuleList | None = None + elif feedback_mode == "diagonal": + if per_pass: + self.correction = nn.ModuleList( + [DiagonalFeedback(dim) for _ in range(num_passes)] + ) + else: + self.correction = DiagonalFeedback(dim) + elif feedback_mode == "low_rank": + if per_pass: + self.correction = nn.ModuleList( + [LowRankFeedback(dim, rank) for _ in range(num_passes)] + ) + else: + self.correction = LowRankFeedback(dim, rank) + else: + raise ValueError(f"Unknown feedback_mode: {feedback_mode}") + self.junction: AffineJunction | None = ( + AffineJunction(dim) if affine_junction else None + ) + + def forward(self, h: Tensor, pass_idx: int) -> Tensor: + e = self.residual(h) + if self.correction is None: + c = e + elif self.per_pass: + c = self.correction[pass_idx](e) + else: + c = self.correction(e) + if self.junction is not None: + c = c + self.junction(h) + mask = torch.tensor(1.0 if pass_idx > 0 else 0.0, device=h.device, dtype=h.dtype) + return c * mask + + def param_count(self) -> int: + return sum(p.numel() for p in self.parameters()) try: import wandb as _wandb except ImportError: @@ -2037,10 +2225,6 @@ def lr_mul(step: int, elapsed_ms: float) -> float: f.write(quant_blob) quant_file_bytes = len(quant_blob) code_bytes = len(code.encode("utf-8")) - for extra in ["feedback.py", "stability.py"]: - p = Path(__file__).parent / extra - if p.exists(): - code_bytes += p.stat().st_size log0(f"Serialized model int6+lzma: {quant_file_bytes} bytes") log0(f"Total submission size int6+lzma: {quant_file_bytes + code_bytes} bytes") if distributed: From fca62ae22428f0b51425d91d8cce79b4cd54ada7 Mon Sep 17 00:00:00 2001 From: nestamidavaine Date: Sun, 29 Mar 2026 23:07:08 +0000 Subject: [PATCH 13/23] changes --- .../run_earlyqat.sh | 18 +++----- .../run_nofeedback.sh | 6 +-- .../train_gpt.py | 45 +++++++------------ 3 files changed, 25 insertions(+), 44 deletions(-) diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_earlyqat.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_earlyqat.sh index d6d62ab867..c2c2ad3114 100755 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_earlyqat.sh +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_earlyqat.sh @@ -9,7 +9,7 @@ if [ -f /home/nesta/parameter-golf/.env ]; then fi export PYTHONUNBUFFERED=1 -export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True +export PYTORCH_ALLOC_CONF=expandable_segments:True # --- Data paths --- export DATA_PATH="../../../data/datasets/fineweb10B_sp1024" @@ -29,12 +29,12 @@ export VE_DIM=128 export VE_LAYERS="9,10" # --- Training schedule (progressive 1->4 passes, wallclock-capped at 600s on 8xH100) --- -export ITERATIONS=6500 +export ITERATIONS=9000 export MAX_WALLCLOCK_SECONDS=600 export VAL_LOSS_EVERY=500 export TRAIN_LOG_EVERY=50 -export WARMUP_STEPS=20 -export WARMDOWN_ITERS=2500 +export WARMUP_STEPS=24 +export WARMDOWN_ITERS=3500 export TRAIN_BATCH_TOKENS=786432 export TRAIN_SEQ_LEN=2048 export EVAL_SEQ_LEN=2048 @@ -55,7 +55,7 @@ export GRAD_CLIP_NORM=0.3 # EARLY QAT: threshold 0.25 (vs 0.15 in 
winning config) to reduce weight entropy export SWA_ENABLED=1 export SWA_EVERY=50 -export LATE_QAT_THRESHOLD=0.25 +export LATE_QAT_THRESHOLD=0.30 # --- TTT (matches SOTA, freeze_blocks=0) --- export TTT_ENABLED=1 @@ -76,16 +76,12 @@ export CORE_QUANT_ENABLED=0 # Progressive: 1-pass until step 4500, then ramp 2->3->4 export PASSES_SCHEDULE="0:1,4500:2,5500:3,6000:4" -# --- W&B --- -export WANDB_PROJECT="parameter-golf" - export SEED=1337 -export WANDB_NAME="earlyqat_025" -export RUN_ID="earlyqat_025" +export RUN_ID="earlyqat_v2" torchrun --standalone --nproc_per_node=8 train_gpt.py \ --feedback-mode diagonal --feedback-rank 2 \ --residual-scale-init 0.5 \ --jacobian-proxy-weight 0.1 \ --no-interpass-rmsnorm \ - 2>&1 | tee logs/earlyqat_025.txt + 2>&1 | tee logs/earlyqat_v2.txt diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_nofeedback.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_nofeedback.sh index 764ae84504..e0c61b0024 100755 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_nofeedback.sh +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_nofeedback.sh @@ -9,7 +9,7 @@ if [ -f /home/nesta/parameter-golf/.env ]; then fi export PYTHONUNBUFFERED=1 -export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True +export PYTORCH_ALLOC_CONF=expandable_segments:True # --- Data paths --- export DATA_PATH="../../../data/datasets/fineweb10B_sp1024" @@ -78,11 +78,7 @@ export CORE_QUANT_ENABLED=0 # Progressive: 1-pass until step 4500, then ramp 2->3->4 export PASSES_SCHEDULE="0:1,4500:2,5500:3,6000:4" -# --- W&B --- -export WANDB_PROJECT="parameter-golf" - export SEED=1337 -export WANDB_NAME="nofeedback_earlyqat" export RUN_ID="nofeedback_earlyqat" torchrun --standalone --nproc_per_node=8 train_gpt.py \ diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt.py b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt.py index 3e78c2c6ee..2bda2cf69f 100644 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt.py +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt.py @@ -221,10 +221,7 @@ def forward(self, h: Tensor, pass_idx: int) -> Tensor: def param_count(self) -> int: return sum(p.numel() for p in self.parameters()) -try: - import wandb as _wandb -except ImportError: - _wandb = None +_wandb = None class Hyperparameters: data_path = os.environ.get("DATA_PATH", "./data/datasets/fineweb10B_sp1024") train_files = os.path.join(data_path, "fineweb_train_*.bin") @@ -1926,27 +1923,7 @@ def feedback_fn(h, pass_idx): f"max_wallclock_seconds:{args.max_wallclock_seconds:.3f}" ) log0(f"seed:{args.seed}") - use_wandb = _wandb is not None and rank == 0 and os.environ.get("WANDB_DISABLED", "0") != "1" - if use_wandb: - _wandb.init( - project=os.environ.get("WANDB_PROJECT", "parameter-golf"), - name=os.environ.get("WANDB_NAME", f"recurrent_p{args.num_passes}_s{args.seed}"), - config={ - "num_layers": args.num_layers, "model_dim": args.model_dim, - "num_passes": args.num_passes, "core_start": args.core_start, - "core_end": args.core_end, "seed": args.seed, - "train_batch_tokens": args.train_batch_tokens, - "train_seq_len": args.train_seq_len, "iterations": args.iterations, - "matrix_lr": args.matrix_lr, "scalar_lr": args.scalar_lr, - "feedback_mode": cli.feedback_mode, "feedback_rank": cli.feedback_rank, - "jacobian_proxy_weight": cli.jacobian_proxy_weight, - "residual_scale_init": cli.residual_scale_init, - "interpass_rmsnorm": not 
cli.no_interpass_rmsnorm, - "n_params": sum(p.numel() for p in base_model.parameters()), - }, - reinit=True, - ) - log0("wandb:initialized") + use_wandb = False train_loader = DistributedTokenLoader(args.train_files, rank, world_size, device) def zero_grad_all() -> None: for opt in optimizers: @@ -1966,12 +1943,22 @@ def lr_mul(step: int, elapsed_ms: float) -> float: initial_model_state = {name: tensor.detach().cpu().clone() for name, tensor in base_model.state_dict().items()} initial_optimizer_states = [copy.deepcopy(opt.state_dict()) for opt in optimizers] _precompile_passes = sorted(set(p for _, p in passes_schedule) - {args.num_passes}) if passes_schedule else [] - _precompile_start = args.warmup_steps - len(_precompile_passes) + _qat_precompile_passes = _precompile_passes[-2:] if len(_precompile_passes) >= 2 else _precompile_passes[:] + _total_precompile = len(_precompile_passes) + len(_qat_precompile_passes) + _precompile_start = args.warmup_steps - _total_precompile model.train() for warmup_step in range(args.warmup_steps): - if _precompile_passes and warmup_step >= _precompile_start: + if warmup_step >= _precompile_start: _pc_idx = warmup_step - _precompile_start - base_model.num_passes = _precompile_passes[_pc_idx] + if _pc_idx < len(_precompile_passes): + base_model.num_passes = _precompile_passes[_pc_idx] + CastedLinear._qat_enabled = False + base_model.core_quant_enabled = False + else: + _qat_idx = _pc_idx - len(_precompile_passes) + base_model.num_passes = _qat_precompile_passes[_qat_idx] + CastedLinear._qat_enabled = True + base_model.core_quant_enabled = True zero_grad_all() for micro_step in range(grad_accum_steps): x, y = train_loader.next_batch(args.train_batch_tokens, args.train_seq_len, grad_accum_steps) @@ -1992,6 +1979,8 @@ def lr_mul(step: int, elapsed_ms: float) -> float: if args.warmup_steps <= 20 or (warmup_step + 1) % 10 == 0 or warmup_step + 1 == args.warmup_steps: log0(f"warmup_step:{warmup_step + 1}/{args.warmup_steps}") base_model.num_passes = args.num_passes + CastedLinear._qat_enabled = args.qat_enabled + base_model.core_quant_enabled = args.core_quant_enabled if stabilizer is not None: stabilizer.reset() base_model.load_state_dict(initial_model_state, strict=True) From e1764c3dd32120bcfd41c257c6574d76ede02bef Mon Sep 17 00:00:00 2001 From: nestamidavaine Date: Sun, 29 Mar 2026 23:59:24 +0000 Subject: [PATCH 14/23] Strip dead features (MTP, DTG, LAWA, bigram, VE, gated_attn, value_residual) to reduce code+model size - Removed unused modules and their code paths from train_gpt.py (106KB -> 78KB) - Set BIGRAM_VOCAB_SIZE=0, VE_ENABLED=0 to save parameters - Updated run_earlyqat.sh for v3: TTT_CHUNK_TOKENS=49152, RUN_ID=earlyqat_v3 - Added train_gpt_old.py backup and run_ttt_only.py helper Made-with: Cursor --- .../run_earlyqat.sh | 10 +- .../run_ttt_only.py | 90 + .../train_gpt.py | 3997 ++++++++--------- .../train_gpt_old.py | 2275 ++++++++++ 4 files changed, 4181 insertions(+), 2191 deletions(-) create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_ttt_only.py create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_old.py diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_earlyqat.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_earlyqat.sh index c2c2ad3114..ff6b0847d7 100755 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_earlyqat.sh +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_earlyqat.sh @@ -20,11 +20,11 @@ 
 export NUM_LAYERS=11
 export MODEL_DIM=512
 export NUM_HEADS=8
 export NUM_KV_HEADS=4
-export BIGRAM_VOCAB_SIZE=1536
+export BIGRAM_VOCAB_SIZE=0
 export XSA_LAST_N=4
 export ROPE_DIMS=16
 export LN_SCALE=1
-export VE_ENABLED=1
+export VE_ENABLED=0
 export VE_DIM=128
 export VE_LAYERS="9,10"
@@ -61,7 +61,7 @@ export LATE_QAT_THRESHOLD=0.30
 export TTT_ENABLED=1
 export TTT_LR=0.002
 export TTT_EPOCHS=3
-export TTT_CHUNK_TOKENS=32768
+export TTT_CHUNK_TOKENS=49152
 export TTT_FREEZE_BLOCKS=0
 export TTT_MOMENTUM=0.9
 export TTT_BATCH_SEQS=32
 export TTT_GRAD_CLIP=1.0
@@ -77,11 +77,11 @@ export CORE_QUANT_ENABLED=0
 export PASSES_SCHEDULE="0:1,4500:2,5500:3,6000:4"
 
 export SEED=1337
-export RUN_ID="earlyqat_v2"
+export RUN_ID="earlyqat_v3"
 
 torchrun --standalone --nproc_per_node=8 train_gpt.py \
 --feedback-mode diagonal --feedback-rank 2 \
 --residual-scale-init 0.5 \
 --jacobian-proxy-weight 0.1 \
 --no-interpass-rmsnorm \
-  2>&1 | tee logs/earlyqat_v2.txt
+  2>&1 | tee logs/earlyqat_v3.txt
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_ttt_only.py b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_ttt_only.py
new file mode 100644
index 0000000000..85888d2a03
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_ttt_only.py
@@ -0,0 +1,88 @@
+"""Standalone TTT eval: loads quantized model and runs TTT with configurable epochs."""
+import os, sys, time, math, io, lzma, torch, torch.distributed as dist
+os.environ.setdefault("PYTHONUNBUFFERED", "1")
+
+sys.path.insert(0, os.path.dirname(__file__))
+from train_gpt import (
+    Hyperparameters, GPT, ResidualScale, CastedLinear,
+    dequantize_mixed_int6, _rebank_state_dict, _unbank_state_dict,
+    load_validation_tokens, build_sentencepiece_luts,
+    eval_val_sliding_ttt, restore_low_dim_params_to_fp32,
+)
+import sentencepiece as spm
+
+def main():
+    args = Hyperparameters()
+    distributed = "RANK" in os.environ
+    rank = int(os.environ.get("RANK", "0"))
+    world_size = int(os.environ.get("WORLD_SIZE", "1"))
+    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
+    device = torch.device("cuda", local_rank)
+    torch.cuda.set_device(device)
+    if distributed:
+        dist.init_process_group(backend="nccl", device_id=device)
+        dist.barrier()
+    torch.backends.cuda.matmul.allow_tf32 = True
+    torch.backends.cudnn.allow_tf32 = True
+    from torch.backends.cuda import enable_flash_sdp, enable_cudnn_sdp, enable_mem_efficient_sdp, enable_math_sdp
+    enable_cudnn_sdp(False); enable_flash_sdp(True); enable_mem_efficient_sdp(False); enable_math_sdp(False)
+
+    def log0(msg):
+        if rank == 0: print(msg)
+
+    sp = spm.SentencePieceProcessor(model_file=args.tokenizer_path)
+    val_tokens = load_validation_tokens(args.val_files, args.train_seq_len)
+    base_bytes_lut, has_leading_space_lut, is_boundary_token_lut = build_sentencepiece_luts(sp, args.vocab_size, device)
+
+    eval_passes = int(os.environ.get("EVAL_PASSES", "4"))
+    residual_scale_init = float(os.environ.get("RESIDUAL_SCALE_INIT", "0.5"))
+
+    with open("final_model.int6.ptz", "rb") as f:
+        quant_blob = f.read()
+    quant_state = torch.load(io.BytesIO(lzma.decompress(quant_blob)), map_location="cpu")
+
+    sd_cpu = torch.load("final_model.pt", map_location="cpu")
+    unbanked_template = _unbank_state_dict(sd_cpu, args.num_layers)
+    deq_unbanked = dequantize_mixed_int6(quant_state["w"], quant_state["m"], unbanked_template)
+    deq_state = _rebank_state_dict(deq_unbanked, args.num_layers, sd_cpu)
+
+    eval_model = GPT(
+        vocab_size=args.vocab_size, num_layers=args.num_layers, model_dim=args.model_dim,
+        num_heads=args.num_heads, num_kv_heads=args.num_kv_heads, mlp_mult=args.mlp_mult,
+        tie_embeddings=args.tie_embeddings, tied_embed_init_std=args.tied_embed_init_std,
+        logit_softcap=args.logit_softcap, rope_base=args.rope_base, qk_gain_init=args.qk_gain_init,
+        xsa_last_n=args.xsa_last_n, rope_dims=args.rope_dims, ln_scale=args.ln_scale,  # bigram/VE kwargs dropped: those attrs were stripped from Hyperparameters in this patch
+        core_start=args.core_start, core_end=args.core_end,
+        num_passes=eval_passes, interpass_rmsnorm=False,
+    ).to(device).bfloat16()
+
+    eval_rs = ResidualScale(eval_passes, residual_scale_init).to(device)
+    eval_model.residual_scale = eval_rs
+    eval_model.qo_bank.data = eval_model.qo_bank.data.float()
+    eval_model.kv_bank.data = eval_model.kv_bank.data.float()
+    eval_model.mlp_up_bank.data = eval_model.mlp_up_bank.data.float()
+    eval_model.mlp_down_bank.data = eval_model.mlp_down_bank.data.float()
+    for m in eval_model.modules():
+        if isinstance(m, CastedLinear):
+            m.float()
+    restore_low_dim_params_to_fp32(eval_model)
+    eval_model.load_state_dict(deq_state, strict=True)
+
+    eval_model = torch.compile(eval_model, dynamic=False, fullgraph=True)  # keep the returned compiled module (the bare call discarded it)
+
+    log0(f"TTT_EPOCHS={args.ttt_epochs} EVAL_PASSES={eval_passes}")
+    t0 = time.perf_counter()
+    ttt_loss, ttt_bpb = eval_val_sliding_ttt(
+        args, eval_model, rank, world_size, device,
+        val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut,
+        stride=args.eval_stride, log0=log0,
+    )
+    elapsed = time.perf_counter() - t0
+    log0(f"legal_ttt val_loss:{ttt_loss:.4f} val_bpb:{ttt_bpb:.4f} eval_time:{elapsed*1000:.0f}ms")
+    log0(f"legal_ttt_exact val_loss:{ttt_loss:.8f} val_bpb:{ttt_bpb:.8f}")
+
+    if distributed:
+        dist.destroy_process_group()
+
+if __name__ == "__main__":
+    main()
diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt.py b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt.py
index 2bda2cf69f..c7d002313d 100644
--- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt.py
+++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt.py
@@ -13,10 +13,10 @@
 import zlib
 from pathlib import Path
 try:
-    import zstandard
-    _COMPRESSOR = "zstd"
+ import zstandard
+ _COMPRESSOR = "zstd"
 except ImportError:
-    _COMPRESSOR = "zlib"
+ _COMPRESSOR = "zlib"
 import numpy as np
 import sentencepiece as spm
 import torch
@@ -28,559 +28,474 @@
 from torch.nn.parallel import DistributedDataParallel as DDP
 _gpu_mem_frac = float(os.environ.get("CUDA_MEM_FRACTION", "0"))
 if _gpu_mem_frac > 0:
-    torch.cuda.set_per_process_memory_fraction(_gpu_mem_frac, 0)
+ torch.cuda.set_per_process_memory_fraction(_gpu_mem_frac, 0)
 from dataclasses import dataclass, field
 from flash_attn_interface import flash_attn_func as flash_attn_3_func
 import argparse
-
-# ── Stability monitoring and control for recurrent passes ──
-
 @dataclass
 class PassDiagnostics:
-    h_norms: list[float] = field(default_factory=list)
-    delta_norms: list[float] = field(default_factory=list)
-    error_norms: list[float] = field(default_factory=list)
-    correction_norms: list[float] = field(default_factory=list)
-    growth_ratios: list[float] = field(default_factory=list)
-
-    def reset(self):
-        for lst in (self.h_norms, self.delta_norms, self.error_norms,
-                    self.correction_norms, self.growth_ratios):
-            lst.clear()
-
-    def summary(self) -> dict[str, list[float]]:
-        return {
-            "h_norms": list(self.h_norms),
-            "delta_norms": list(self.delta_norms),
-            "error_norms": list(self.error_norms),
-            "correction_norms": list(self.correction_norms),
-            "growth_ratios": list(self.growth_ratios),
-        }
-
-
+ h_norms: list[float] = field(default_factory=list)
+ delta_norms: list[float] = field(default_factory=list)
+ error_norms: list[float] = field(default_factory=list)
+ correction_norms: list[float] = field(default_factory=list)
+ growth_ratios: list[float] = field(default_factory=list)
+ def reset(self):
+  for lst in (self.h_norms, self.delta_norms, self.error_norms,
+   self.correction_norms, self.growth_ratios):
+   lst.clear()
+ def summary(self) -> dict[str, list[float]]:
+  return {
+   "h_norms": list(self.h_norms),
+   "delta_norms": list(self.delta_norms),
+   
"error_norms": list(self.error_norms), - "correction_norms": list(self.correction_norms), - "growth_ratios": list(self.growth_ratios), - } - - + h_norms: list[float] = field(default_factory=list) + delta_norms: list[float] = field(default_factory=list) + error_norms: list[float] = field(default_factory=list) + correction_norms: list[float] = field(default_factory=list) + growth_ratios: list[float] = field(default_factory=list) + def reset(self): + for lst in (self.h_norms, self.delta_norms, self.error_norms, + self.correction_norms, self.growth_ratios): + lst.clear() + def summary(self) -> dict[str, list[float]]: + return { + "h_norms": list(self.h_norms), + "delta_norms": list(self.delta_norms), + "error_norms": list(self.error_norms), + "correction_norms": list(self.correction_norms), + "growth_ratios": list(self.growth_ratios), + } class RecurrentStabilizer: - def __init__( - self, - clip_hidden: bool = False, - clip_value: float = 10.0, - clip_mode: str = "value", - jacobian_proxy_weight: float = 0.0, - eps: float = 1e-6, - ): - self.clip_hidden = clip_hidden - self.clip_value = clip_value - self.clip_mode = clip_mode - self.jacobian_proxy_weight = jacobian_proxy_weight - self.eps = eps - self.diagnostics = PassDiagnostics() - - def clip(self, h: Tensor) -> Tensor: - if not self.clip_hidden: - return h - if self.clip_mode == "value": - return torch.clamp(h, -self.clip_value, self.clip_value) - norm = h.norm(dim=-1, keepdim=True) - scale = torch.clamp(self.clip_value / (norm + self.eps), max=1.0) - return h * scale - - def record_pass( - self, - h_prev: Tensor, - h_next: Tensor, - error: Tensor | None = None, - correction: Tensor | None = None, - ): - with torch.no_grad(): - h_pn = h_prev.float().norm().item() - h_nn = h_next.float().norm().item() - self.diagnostics.h_norms.append(h_nn) - self.diagnostics.delta_norms.append( - (h_next - h_prev).float().norm().item() - ) - self.diagnostics.growth_ratios.append(h_nn / (h_pn + self.eps)) - if error is not None: - self.diagnostics.error_norms.append(error.float().norm().item()) - if correction is not None: - self.diagnostics.correction_norms.append( - correction.float().norm().item() - ) - - def jacobian_proxy_loss(self, h_in: Tensor, h_out: Tensor) -> Tensor: - if self.jacobian_proxy_weight <= 0: - return h_in.new_zeros(()) - delta = h_out - h_in - ratio = delta.norm() / (h_in.norm() + self.eps) - return self.jacobian_proxy_weight * torch.relu(ratio - 1.0).square() - - def reset(self): - self.diagnostics.reset() - - + def __init__( + self, + clip_hidden: bool = False, + clip_value: float = 10.0, + clip_mode: str = "value", + jacobian_proxy_weight: float = 0.0, + eps: float = 1e-6, + ): + self.clip_hidden = clip_hidden + self.clip_value = clip_value + self.clip_mode = clip_mode + self.jacobian_proxy_weight = jacobian_proxy_weight + self.eps = eps + self.diagnostics = PassDiagnostics() + def clip(self, h: Tensor) -> Tensor: + if not self.clip_hidden: + return h + if self.clip_mode == "value": + return torch.clamp(h, -self.clip_value, self.clip_value) + norm = h.norm(dim=-1, keepdim=True) + scale = torch.clamp(self.clip_value / (norm + self.eps), max=1.0) + return h * scale + def record_pass( + self, + h_prev: Tensor, + h_next: Tensor, + error: Tensor | None = None, + correction: Tensor | None = None, + ): + with torch.no_grad(): + h_pn = h_prev.float().norm().item() + h_nn = h_next.float().norm().item() + self.diagnostics.h_norms.append(h_nn) + self.diagnostics.delta_norms.append( + (h_next - h_prev).float().norm().item() + ) + 
self.diagnostics.growth_ratios.append(h_nn / (h_pn + self.eps)) + if error is not None: + self.diagnostics.error_norms.append(error.float().norm().item()) + if correction is not None: + self.diagnostics.correction_norms.append( + correction.float().norm().item() + ) + def jacobian_proxy_loss(self, h_in: Tensor, h_out: Tensor) -> Tensor: + if self.jacobian_proxy_weight <= 0: + return h_in.new_zeros(()) + delta = h_out - h_in + ratio = delta.norm() / (h_in.norm() + self.eps) + return self.jacobian_proxy_weight * torch.relu(ratio - 1.0).square() + def reset(self): + self.diagnostics.reset() class ResidualScale(nn.Module): - def __init__(self, num_passes: int, init_value: float = 1.0): - super().__init__() - self.scales = nn.Parameter( - torch.full((num_passes,), init_value, dtype=torch.float32) - ) - - def forward(self, residual: Tensor, pass_idx: int) -> Tensor: - return self.scales[pass_idx].to(dtype=residual.dtype) * residual - - -# ── Error feedback modules for recurrent quantization correction ── - + def __init__(self, num_passes: int, init_value: float = 1.0): + super().__init__() + self.scales = nn.Parameter( + torch.full((num_passes,), init_value, dtype=torch.float32) + ) + def forward(self, residual: Tensor, pass_idx: int) -> Tensor: + return self.scales[pass_idx].to(dtype=residual.dtype) * residual class LowRankResidual(nn.Module): - def __init__(self, dim: int, rank: int = 2): - super().__init__() - self.V = nn.Parameter(torch.zeros(dim, rank)) - self.U = nn.Parameter(torch.zeros(dim, rank)) - - def forward(self, h: Tensor) -> Tensor: - return (h @ self.V) @ self.U.T - - + def __init__(self, dim: int, rank: int = 2): + super().__init__() + self.V = nn.Parameter(torch.zeros(dim, rank)) + self.U = nn.Parameter(torch.zeros(dim, rank)) + def forward(self, h: Tensor) -> Tensor: + return (h @ self.V) @ self.U.T class DiagonalFeedback(nn.Module): - def __init__(self, dim: int, init_ones: bool = False): - super().__init__() - init_val = torch.ones(dim) if init_ones else torch.zeros(dim) - self.d = nn.Parameter(init_val) - - def forward(self, e: Tensor) -> Tensor: - return self.d.to(dtype=e.dtype) * e - - + def __init__(self, dim: int, init_ones: bool = False): + super().__init__() + init_val = torch.ones(dim) if init_ones else torch.zeros(dim) + self.d = nn.Parameter(init_val) + def forward(self, e: Tensor) -> Tensor: + return self.d.to(dtype=e.dtype) * e class LowRankFeedback(nn.Module): - def __init__(self, dim: int, rank: int = 2): - super().__init__() - self.V_D = nn.Parameter(torch.zeros(dim, rank)) - self.U_D = nn.Parameter(torch.zeros(dim, rank)) - - def forward(self, e: Tensor) -> Tensor: - return (e @ self.V_D) @ self.U_D.T - - + def __init__(self, dim: int, rank: int = 2): + super().__init__() + self.V_D = nn.Parameter(torch.zeros(dim, rank)) + self.U_D = nn.Parameter(torch.zeros(dim, rank)) + def forward(self, e: Tensor) -> Tensor: + return (e @ self.V_D) @ self.U_D.T class AffineJunction(nn.Module): - def __init__(self, dim: int): - super().__init__() - self.gamma = nn.Parameter(torch.ones(dim)) - self.beta = nn.Parameter(torch.zeros(dim)) - - def forward(self, h: Tensor) -> Tensor: - return self.gamma.to(dtype=h.dtype) * h + self.beta.to(dtype=h.dtype) - - + def __init__(self, dim: int): + super().__init__() + self.gamma = nn.Parameter(torch.ones(dim)) + self.beta = nn.Parameter(torch.zeros(dim)) + def forward(self, h: Tensor) -> Tensor: + return self.gamma.to(dtype=h.dtype) * h + self.beta.to(dtype=h.dtype) class ErrorFeedbackModule(nn.Module): - def __init__( - self, - dim: 
int, - rank: int = 2, - feedback_mode: str = "diagonal", - per_pass: bool = False, - num_passes: int = 3, - affine_junction: bool = False, - ): - super().__init__() - self.feedback_mode = feedback_mode - self.per_pass = per_pass - self.num_passes = num_passes - self.residual = LowRankResidual(dim, rank) - if feedback_mode == "identity": - self.correction: nn.Module | nn.ModuleList | None = None - elif feedback_mode == "diagonal": - if per_pass: - self.correction = nn.ModuleList( - [DiagonalFeedback(dim) for _ in range(num_passes)] - ) - else: - self.correction = DiagonalFeedback(dim) - elif feedback_mode == "low_rank": - if per_pass: - self.correction = nn.ModuleList( - [LowRankFeedback(dim, rank) for _ in range(num_passes)] - ) - else: - self.correction = LowRankFeedback(dim, rank) - else: - raise ValueError(f"Unknown feedback_mode: {feedback_mode}") - self.junction: AffineJunction | None = ( - AffineJunction(dim) if affine_junction else None - ) - - def forward(self, h: Tensor, pass_idx: int) -> Tensor: - e = self.residual(h) - if self.correction is None: - c = e - elif self.per_pass: - c = self.correction[pass_idx](e) - else: - c = self.correction(e) - if self.junction is not None: - c = c + self.junction(h) - mask = torch.tensor(1.0 if pass_idx > 0 else 0.0, device=h.device, dtype=h.dtype) - return c * mask - - def param_count(self) -> int: - return sum(p.numel() for p in self.parameters()) -_wandb = None + def __init__( + self, + dim: int, + rank: int = 2, + feedback_mode: str = "diagonal", + per_pass: bool = False, + num_passes: int = 3, + affine_junction: bool = False, + ): + super().__init__() + self.feedback_mode = feedback_mode + self.per_pass = per_pass + self.num_passes = num_passes + self.residual = LowRankResidual(dim, rank) + if feedback_mode == "identity": + self.correction: nn.Module | nn.ModuleList | None = None + elif feedback_mode == "diagonal": + if per_pass: + self.correction = nn.ModuleList( + [DiagonalFeedback(dim) for _ in range(num_passes)] + ) + else: + self.correction = DiagonalFeedback(dim) + elif feedback_mode == "low_rank": + if per_pass: + self.correction = nn.ModuleList( + [LowRankFeedback(dim, rank) for _ in range(num_passes)] + ) + else: + self.correction = LowRankFeedback(dim, rank) + else: + raise ValueError(f"Unknown feedback_mode: {feedback_mode}") + self.junction: AffineJunction | None = ( + AffineJunction(dim) if affine_junction else None + ) + def forward(self, h: Tensor, pass_idx: int) -> Tensor: + e = self.residual(h) + if self.correction is None: + c = e + elif self.per_pass: + c = self.correction[pass_idx](e) + else: + c = self.correction(e) + if self.junction is not None: + c = c + self.junction(h) + mask = torch.tensor(1.0 if pass_idx > 0 else 0.0, device=h.device, dtype=h.dtype) + return c * mask + def param_count(self) -> int: + return sum(p.numel() for p in self.parameters()) class Hyperparameters: - data_path = os.environ.get("DATA_PATH", "./data/datasets/fineweb10B_sp1024") - train_files = os.path.join(data_path, "fineweb_train_*.bin") - val_files = os.path.join(data_path, "fineweb_val_*.bin") - tokenizer_path = os.environ.get("TOKENIZER_PATH", "./data/tokenizers/fineweb_1024_bpe.model") - run_id = os.environ.get("RUN_ID", str(uuid.uuid4())) - seed = int(os.environ.get("SEED", 1337)) - val_batch_size = int(os.environ.get("VAL_BATCH_SIZE", 524_288)) - val_loss_every = int(os.environ.get("VAL_LOSS_EVERY", 4000)) - train_log_every = int(os.environ.get("TRAIN_LOG_EVERY", 500)) - iterations = int(os.environ.get("ITERATIONS", 20000)) - 
warmdown_iters = int(os.environ.get("WARMDOWN_ITERS", 3500)) - warmup_steps = int(os.environ.get("WARMUP_STEPS", 20)) - train_batch_tokens = int(os.environ.get("TRAIN_BATCH_TOKENS", 786_432)) - train_seq_len = int(os.environ.get("TRAIN_SEQ_LEN", 2048)) - eval_seq_len = int(os.environ.get("EVAL_SEQ_LEN", 2048)) - max_wallclock_seconds = float(os.environ.get("MAX_WALLCLOCK_SECONDS", 600.0)) - qk_gain_init = float(os.environ.get("QK_GAIN_INIT", 1.5)) - vocab_size = int(os.environ.get("VOCAB_SIZE", 1024)) - num_layers = int(os.environ.get("NUM_LAYERS", 11)) - num_kv_heads = int(os.environ.get("NUM_KV_HEADS", 4)) - model_dim = int(os.environ.get("MODEL_DIM", 512)) - num_heads = int(os.environ.get("NUM_HEADS", 8)) - mlp_mult = float(os.environ.get("MLP_MULT", 3.0)) - tie_embeddings = bool(int(os.environ.get("TIE_EMBEDDINGS", "1"))) - rope_base = float(os.environ.get("ROPE_BASE", 10000.0)) - logit_softcap = float(os.environ.get("LOGIT_SOFTCAP", 30.0)) - embed_lr = float(os.environ.get("EMBED_LR", 0.6)) - head_lr = float(os.environ.get("HEAD_LR", 0.008)) - tied_embed_lr = float(os.environ.get("TIED_EMBED_LR", 0.035)) - tied_embed_init_std = float(os.environ.get("TIED_EMBED_INIT_STD", 0.005)) - matrix_lr = float(os.environ.get("MATRIX_LR", 0.025)) - scalar_lr = float(os.environ.get("SCALAR_LR", 0.025)) - muon_momentum = float(os.environ.get("MUON_MOMENTUM", 0.99)) - muon_backend_steps = int(os.environ.get("MUON_BACKEND_STEPS", 5)) - muon_momentum_warmup_start = float(os.environ.get("MUON_MOMENTUM_WARMUP_START", 0.92)) - muon_momentum_warmup_steps = int(os.environ.get("MUON_MOMENTUM_WARMUP_STEPS", 1500)) - beta1 = float(os.environ.get("BETA1", 0.9)) - beta2 = float(os.environ.get("BETA2", 0.95)) - adam_eps = float(os.environ.get("ADAM_EPS", 1e-8)) - grad_clip_norm = float(os.environ.get("GRAD_CLIP_NORM", 0.3)) - eval_stride = int(os.environ.get("EVAL_STRIDE", 64)) - mtp_num_heads = int(os.environ.get("MTP_NUM_HEADS", 0)) - mtp_loss_weight = float(os.environ.get("MTP_LOSS_WEIGHT", 0.2)) - muon_beta2 = float(os.environ.get("MUON_BETA2", 0.95)) - swa_enabled = bool(int(os.environ.get("SWA_ENABLED", "1"))) - swa_every = int(os.environ.get("SWA_EVERY", 50)) - lawa_enabled = bool(int(os.environ.get("LAWA_ENABLED", "0"))) - lawa_k = int(os.environ.get("LAWA_K", 10)) - lawa_freq = int(os.environ.get("LAWA_FREQ", 100)) - muon_wd = float(os.environ.get("MUON_WD", 0.04)) - adam_wd = float(os.environ.get("ADAM_WD", 0.04)) - qat_enabled = bool(int(os.environ.get("QAT_ENABLED", "0"))) - bigram_vocab_size = int(os.environ.get("BIGRAM_VOCAB_SIZE", 2048)) - bigram_dim = int(os.environ.get("BIGRAM_DIM", 128)) - xsa_last_n = int(os.environ.get("XSA_LAST_N", 4)) - rope_dims = int(os.environ.get("ROPE_DIMS", 16)) - ln_scale = bool(int(os.environ.get("LN_SCALE", "1"))) - dtg_enabled = bool(int(os.environ.get("DTG_ENABLED", "0"))) - late_qat_threshold = float(os.environ.get("LATE_QAT_THRESHOLD", 0.15)) - ve_enabled = bool(int(os.environ.get("VE_ENABLED", "1"))) - ve_dim = int(os.environ.get("VE_DIM", 128)) - ve_layers = os.environ.get("VE_LAYERS", "9,10") - gated_attention = bool(int(os.environ.get("GATED_ATTENTION", "0"))) - value_residual = bool(int(os.environ.get("VALUE_RESIDUAL", "0"))) - ttt_enabled = bool(int(os.environ.get("TTT_ENABLED", "0"))) - ttt_lr = float(os.environ.get("TTT_LR", 0.002)) - ttt_epochs = int(os.environ.get("TTT_EPOCHS", 3)) - ttt_chunk_tokens = int(os.environ.get("TTT_CHUNK_TOKENS", 32768)) - ttt_freeze_blocks = int(os.environ.get("TTT_FREEZE_BLOCKS", 2)) - ttt_momentum = 
float(os.environ.get("TTT_MOMENTUM", 0.9)) - ttt_batch_seqs = int(os.environ.get("TTT_BATCH_SEQS", 32)) - ttt_grad_clip = float(os.environ.get("TTT_GRAD_CLIP", 1.0)) - # recurrence - core_start = int(os.environ.get("CORE_START", 3)) - core_end = int(os.environ.get("CORE_END", 8)) - num_passes = int(os.environ.get("NUM_PASSES", 1)) - core_quant_bits = int(os.environ.get("CORE_QUANT_BITS", 6)) - core_quant_enabled = bool(int(os.environ.get("CORE_QUANT_ENABLED", "0"))) - eval_passes = int(os.environ.get("EVAL_PASSES", 0)) - # Progressive passes schedule: comma-separated "step:passes" pairs, e.g. "0:1,4500:2,5500:3,6000:4" - passes_schedule_str = os.environ.get("PASSES_SCHEDULE", "") - -# --- Batched Newton-Schulz orthogonalization --- - + data_path = os.environ.get("DATA_PATH", "./data/datasets/fineweb10B_sp1024") + train_files = os.path.join(data_path, "fineweb_train_*.bin") + val_files = os.path.join(data_path, "fineweb_val_*.bin") + tokenizer_path = os.environ.get("TOKENIZER_PATH", "./data/tokenizers/fineweb_1024_bpe.model") + run_id = os.environ.get("RUN_ID", str(uuid.uuid4())) + seed = int(os.environ.get("SEED", 1337)) + val_batch_size = int(os.environ.get("VAL_BATCH_SIZE", 524_288)) + val_loss_every = int(os.environ.get("VAL_LOSS_EVERY", 4000)) + train_log_every = int(os.environ.get("TRAIN_LOG_EVERY", 500)) + iterations = int(os.environ.get("ITERATIONS", 20000)) + warmdown_iters = int(os.environ.get("WARMDOWN_ITERS", 3500)) + warmup_steps = int(os.environ.get("WARMUP_STEPS", 20)) + train_batch_tokens = int(os.environ.get("TRAIN_BATCH_TOKENS", 786_432)) + train_seq_len = int(os.environ.get("TRAIN_SEQ_LEN", 2048)) + eval_seq_len = int(os.environ.get("EVAL_SEQ_LEN", 2048)) + max_wallclock_seconds = float(os.environ.get("MAX_WALLCLOCK_SECONDS", 600.0)) + qk_gain_init = float(os.environ.get("QK_GAIN_INIT", 1.5)) + vocab_size = int(os.environ.get("VOCAB_SIZE", 1024)) + num_layers = int(os.environ.get("NUM_LAYERS", 11)) + num_kv_heads = int(os.environ.get("NUM_KV_HEADS", 4)) + model_dim = int(os.environ.get("MODEL_DIM", 512)) + num_heads = int(os.environ.get("NUM_HEADS", 8)) + mlp_mult = float(os.environ.get("MLP_MULT", 3.0)) + tie_embeddings = bool(int(os.environ.get("TIE_EMBEDDINGS", "1"))) + rope_base = float(os.environ.get("ROPE_BASE", 10000.0)) + logit_softcap = float(os.environ.get("LOGIT_SOFTCAP", 30.0)) + embed_lr = float(os.environ.get("EMBED_LR", 0.6)) + head_lr = float(os.environ.get("HEAD_LR", 0.008)) + tied_embed_lr = float(os.environ.get("TIED_EMBED_LR", 0.035)) + tied_embed_init_std = float(os.environ.get("TIED_EMBED_INIT_STD", 0.005)) + matrix_lr = float(os.environ.get("MATRIX_LR", 0.025)) + scalar_lr = float(os.environ.get("SCALAR_LR", 0.025)) + muon_momentum = float(os.environ.get("MUON_MOMENTUM", 0.99)) + muon_backend_steps = int(os.environ.get("MUON_BACKEND_STEPS", 5)) + muon_momentum_warmup_start = float(os.environ.get("MUON_MOMENTUM_WARMUP_START", 0.92)) + muon_momentum_warmup_steps = int(os.environ.get("MUON_MOMENTUM_WARMUP_STEPS", 1500)) + beta1 = float(os.environ.get("BETA1", 0.9)) + beta2 = float(os.environ.get("BETA2", 0.95)) + adam_eps = float(os.environ.get("ADAM_EPS", 1e-8)) + grad_clip_norm = float(os.environ.get("GRAD_CLIP_NORM", 0.3)) + eval_stride = int(os.environ.get("EVAL_STRIDE", 64)) + muon_beta2 = float(os.environ.get("MUON_BETA2", 0.95)) + swa_enabled = bool(int(os.environ.get("SWA_ENABLED", "1"))) + swa_every = int(os.environ.get("SWA_EVERY", 50)) + muon_wd = float(os.environ.get("MUON_WD", 0.04)) + adam_wd = float(os.environ.get("ADAM_WD", 0.04)) + 
qat_enabled = bool(int(os.environ.get("QAT_ENABLED", "0"))) + xsa_last_n = int(os.environ.get("XSA_LAST_N", 4)) + rope_dims = int(os.environ.get("ROPE_DIMS", 16)) + ln_scale = bool(int(os.environ.get("LN_SCALE", "1"))) + late_qat_threshold = float(os.environ.get("LATE_QAT_THRESHOLD", 0.15)) + ttt_enabled = bool(int(os.environ.get("TTT_ENABLED", "0"))) + ttt_lr = float(os.environ.get("TTT_LR", 0.002)) + ttt_epochs = int(os.environ.get("TTT_EPOCHS", 3)) + ttt_chunk_tokens = int(os.environ.get("TTT_CHUNK_TOKENS", 32768)) + ttt_freeze_blocks = int(os.environ.get("TTT_FREEZE_BLOCKS", 2)) + ttt_momentum = float(os.environ.get("TTT_MOMENTUM", 0.9)) + ttt_batch_seqs = int(os.environ.get("TTT_BATCH_SEQS", 32)) + ttt_grad_clip = float(os.environ.get("TTT_GRAD_CLIP", 1.0)) + core_start = int(os.environ.get("CORE_START", 3)) + core_end = int(os.environ.get("CORE_END", 8)) + num_passes = int(os.environ.get("NUM_PASSES", 1)) + core_quant_bits = int(os.environ.get("CORE_QUANT_BITS", 6)) + core_quant_enabled = bool(int(os.environ.get("CORE_QUANT_ENABLED", "0"))) + eval_passes = int(os.environ.get("EVAL_PASSES", 0)) + passes_schedule_str = os.environ.get("PASSES_SCHEDULE", "") def zeropower_via_newtonschulz5(G: Tensor, steps: int = 5, eps: float = 1e-7) -> Tensor: - """Batched Newton-Schulz orthogonalization. G: (B,M,N) or (M,N).""" - a, b, c = (3.4445, -4.7750, 2.0315) - was_2d = G.ndim == 2 - if was_2d: - G = G.unsqueeze(0) - X = G.bfloat16() - transposed = X.size(-2) > X.size(-1) - if transposed: - X = X.mT - X = X / (X.norm(dim=(-2, -1), keepdim=True) + eps) - for _ in range(steps): - A = X @ X.mT - B = b * A + c * (A @ A) - X = a * X + B @ X - if transposed: - X = X.mT - if was_2d: - X = X.squeeze(0) - return X - -# --- Parallel Muon optimizer --- - + """""" + a, b, c = (3.4445, -4.7750, 2.0315) + was_2d = G.ndim == 2 + if was_2d: + G = G.unsqueeze(0) + X = G.bfloat16() + transposed = X.size(-2) > X.size(-1) + if transposed: + X = X.mT + X = X / (X.norm(dim=(-2, -1), keepdim=True) + eps) + for _ in range(steps): + A = X @ X.mT + B = b * A + c * (A @ A) + X = a * X + B @ X + if transposed: + X = X.mT + if was_2d: + X = X.squeeze(0) + return X class Muon(torch.optim.Optimizer): - """Parallel Muon: post-backward reduce-scatter -> local NS5 -> all-gather. - - No DDP for bank params. After backward, this optimizer: - 1. Launches async reduce-scatter for all banks (biggest first) - 2. Returns control so Adam can step on small params while RS is in-flight - 3. Waits for each RS, runs local NS5 on the shard, launches async all-gather - 4. 
Each all-gather overlaps with next bank's NS5 - """ - def __init__(self, params, lr: float, momentum: float, backend_steps: int, - nesterov: bool = True, weight_decay: float = 0.0): - super().__init__( - params, - dict(lr=lr, momentum=momentum, backend_steps=backend_steps, - nesterov=nesterov, weight_decay=weight_decay), - ) - self._built = False - - def _build(self): - self._distributed = dist.is_available() and dist.is_initialized() - self._world_size = dist.get_world_size() if self._distributed else 1 - self._rank = dist.get_rank() if self._distributed else 0 - ws = self._world_size - - self._bank_meta = [] - for group in self.param_groups: - for p in group["params"]: - B = p.shape[0] - padded_B = ((B + ws - 1) // ws) * ws - shard_B = padded_B // ws - tail = p.shape[1:] - dev = p.device - self._bank_meta.append({ - 'p': p, - 'B': B, - 'padded_grad': torch.zeros(padded_B, *tail, device=dev, dtype=torch.bfloat16), - 'shard': torch.zeros(shard_B, *tail, device=dev, dtype=torch.bfloat16), - 'shard_mom': torch.zeros(shard_B, *tail, device=dev, dtype=torch.bfloat16), - 'full_update': torch.zeros(padded_B, *tail, device=dev, dtype=torch.bfloat16), - 'scale': max(1, p.shape[-2] / p.shape[-1]) ** 0.5, - }) - # Sort by size descending -- launch biggest reduce-scatters first - self._bank_meta.sort(key=lambda m: -m['p'].numel()) - self._built = True - - def launch_reduce_scatters(self): - """Phase 1: launch async reduce-scatter for all banks. Call right after backward.""" - if not self._built: - self._build() - if not self._distributed: - return - self._rs_futures = [] - for m in self._bank_meta: - p = m['p'] - if p.grad is None: - self._rs_futures.append(None) - continue - pg = m['padded_grad'] - pg[:m['B']].copy_(p.grad.bfloat16()) - if pg.shape[0] > m['B']: - pg[m['B']:].zero_() - fut = dist.reduce_scatter_tensor(m['shard'], pg, op=dist.ReduceOp.AVG, async_op=True) - self._rs_futures.append(fut) - - @torch.no_grad() - def step(self, closure=None): - """Phase 3: wait for RS, local NS5, all-gather. 
Call AFTER Adam steps.""" - loss = None - if closure is not None: - with torch.enable_grad(): - loss = closure() - - if not self._built: - self._build() - - for group in self.param_groups: - lr = group["lr"] - momentum = group["momentum"] - backend_steps = group["backend_steps"] - nesterov = group["nesterov"] - wd = group.get("weight_decay", 0.0) - - prev_ag_handle = None - prev_m = None - - sharded = self._distributed and hasattr(self, '_rs_futures') - - for i, m in enumerate(self._bank_meta): - p = m['p'] - if p.grad is None: - continue - - if prev_ag_handle is not None: - prev_ag_handle.wait() - pp = prev_m['p'] - upd = prev_m['full_update'][:prev_m['B']] - if wd > 0.0: - pp.data.mul_(1.0 - lr * wd) - pp.add_(upd.to(dtype=pp.dtype), alpha=-lr * prev_m['scale']) - - if sharded and self._rs_futures[i] is not None: - self._rs_futures[i].wait() - g = m['shard'] - buf = m['shard_mom'] - else: - g = p.grad.bfloat16() - state = self.state[p] - if "momentum_buffer" not in state: - state["momentum_buffer"] = torch.zeros_like(g) - buf = state["momentum_buffer"] - - buf.mul_(momentum).add_(g) - if nesterov: - update = g.add(buf, alpha=momentum) - else: - update = buf - - update = zeropower_via_newtonschulz5(update, steps=backend_steps) - - if sharded: - prev_ag_handle = dist.all_gather_into_tensor( - m['full_update'], update, async_op=True) - prev_m = m - else: - if wd > 0.0: - p.data.mul_(1.0 - lr * wd) - p.add_(update.to(dtype=p.dtype), alpha=-lr * m['scale']) - - if prev_ag_handle is not None: - prev_ag_handle.wait() - pp = prev_m['p'] - upd = prev_m['full_update'][:prev_m['B']] - if wd > 0.0: - pp.data.mul_(1.0 - lr * wd) - pp.add_(upd.to(dtype=pp.dtype), alpha=-lr * prev_m['scale']) - - if hasattr(self, '_rs_futures'): - del self._rs_futures - - return loss - -# --- Tokenizer evaluation helpers --- - + """""" + def __init__(self, params, lr: float, momentum: float, backend_steps: int, + nesterov: bool = True, weight_decay: float = 0.0): + super().__init__( + params, + dict(lr=lr, momentum=momentum, backend_steps=backend_steps, + nesterov=nesterov, weight_decay=weight_decay), + ) + self._built = False + def _build(self): + self._distributed = dist.is_available() and dist.is_initialized() + self._world_size = dist.get_world_size() if self._distributed else 1 + self._rank = dist.get_rank() if self._distributed else 0 + ws = self._world_size + self._bank_meta = [] + for group in self.param_groups: + for p in group["params"]: + B = p.shape[0] + padded_B = ((B + ws - 1) // ws) * ws + shard_B = padded_B // ws + tail = p.shape[1:] + dev = p.device + self._bank_meta.append({ + 'p': p, + 'B': B, + 'padded_grad': torch.zeros(padded_B, *tail, device=dev, dtype=torch.bfloat16), + 'shard': torch.zeros(shard_B, *tail, device=dev, dtype=torch.bfloat16), + 'shard_mom': torch.zeros(shard_B, *tail, device=dev, dtype=torch.bfloat16), + 'full_update': torch.zeros(padded_B, *tail, device=dev, dtype=torch.bfloat16), + 'scale': max(1, p.shape[-2] / p.shape[-1]) ** 0.5, + }) + self._bank_meta.sort(key=lambda m: -m['p'].numel()) + self._built = True + def launch_reduce_scatters(self): + """""" + if not self._built: + self._build() + if not self._distributed: + return + self._rs_futures = [] + for m in self._bank_meta: + p = m['p'] + if p.grad is None: + self._rs_futures.append(None) + continue + pg = m['padded_grad'] + pg[:m['B']].copy_(p.grad.bfloat16()) + if pg.shape[0] > m['B']: + pg[m['B']:].zero_() + fut = dist.reduce_scatter_tensor(m['shard'], pg, op=dist.ReduceOp.AVG, async_op=True) + 
self._rs_futures.append(fut) + @torch.no_grad() + def step(self, closure=None): + """Phase 3: wait for RS, local NS5, all-gather. Call AFTER Adam steps.""" + loss = None + if closure is not None: + with torch.enable_grad(): + loss = closure() + if not self._built: + self._build() + for group in self.param_groups: + lr = group["lr"] + momentum = group["momentum"] + backend_steps = group["backend_steps"] + nesterov = group["nesterov"] + wd = group.get("weight_decay", 0.0) + prev_ag_handle = None + prev_m = None + sharded = self._distributed and hasattr(self, '_rs_futures') + for i, m in enumerate(self._bank_meta): + p = m['p'] + if p.grad is None: + continue + if prev_ag_handle is not None: + prev_ag_handle.wait() + pp = prev_m['p'] + upd = prev_m['full_update'][:prev_m['B']] + if wd > 0.0: + pp.data.mul_(1.0 - lr * wd) + pp.add_(upd.to(dtype=pp.dtype), alpha=-lr * prev_m['scale']) + if sharded and self._rs_futures[i] is not None: + self._rs_futures[i].wait() + g = m['shard'] + buf = m['shard_mom'] + else: + g = p.grad.bfloat16() + state = self.state[p] + if "momentum_buffer" not in state: + state["momentum_buffer"] = torch.zeros_like(g) + buf = state["momentum_buffer"] + buf.mul_(momentum).add_(g) + if nesterov: + update = g.add(buf, alpha=momentum) + else: + update = buf + update = zeropower_via_newtonschulz5(update, steps=backend_steps) + if sharded: + prev_ag_handle = dist.all_gather_into_tensor( + m['full_update'], update, async_op=True) + prev_m = m + else: + if wd > 0.0: + p.data.mul_(1.0 - lr * wd) + p.add_(update.to(dtype=p.dtype), alpha=-lr * m['scale']) + if prev_ag_handle is not None: + prev_ag_handle.wait() + pp = prev_m['p'] + upd = prev_m['full_update'][:prev_m['B']] + if wd > 0.0: + pp.data.mul_(1.0 - lr * wd) + pp.add_(upd.to(dtype=pp.dtype), alpha=-lr * prev_m['scale']) + if hasattr(self, '_rs_futures'): + del self._rs_futures + return loss def build_sentencepiece_luts( - sp: spm.SentencePieceProcessor, vocab_size: int, device: torch.device + sp: spm.SentencePieceProcessor, vocab_size: int, device: torch.device ) -> tuple[Tensor, Tensor, Tensor]: - sp_vocab_size = int(sp.vocab_size()) - table_size = max(sp_vocab_size, vocab_size) - base_bytes_np = np.zeros((table_size,), dtype=np.int16) - has_leading_space_np = np.zeros((table_size,), dtype=np.bool_) - is_boundary_token_np = np.ones((table_size,), dtype=np.bool_) - for token_id in range(sp_vocab_size): - if sp.is_control(token_id) or sp.is_unknown(token_id) or sp.is_unused(token_id): - continue - is_boundary_token_np[token_id] = False - if sp.is_byte(token_id): - base_bytes_np[token_id] = 1 - continue - piece = sp.id_to_piece(token_id) - if piece.startswith("\u2581"): - has_leading_space_np[token_id] = True - piece = piece[1:] - base_bytes_np[token_id] = len(piece.encode("utf-8")) - return ( - torch.tensor(base_bytes_np, dtype=torch.int16, device=device), - torch.tensor(has_leading_space_np, dtype=torch.bool, device=device), - torch.tensor(is_boundary_token_np, dtype=torch.bool, device=device), - ) + sp_vocab_size = int(sp.vocab_size()) + table_size = max(sp_vocab_size, vocab_size) + base_bytes_np = np.zeros((table_size,), dtype=np.int16) + has_leading_space_np = np.zeros((table_size,), dtype=np.bool_) + is_boundary_token_np = np.ones((table_size,), dtype=np.bool_) + for token_id in range(sp_vocab_size): + if sp.is_control(token_id) or sp.is_unknown(token_id) or sp.is_unused(token_id): + continue + is_boundary_token_np[token_id] = False + if sp.is_byte(token_id): + base_bytes_np[token_id] = 1 + continue + piece = sp.id_to_piece(token_id) + if piece.startswith("\u2581"): + 
has_leading_space_np[token_id] = True + piece = piece[1:] + base_bytes_np[token_id] = len(piece.encode("utf-8")) + return ( + torch.tensor(base_bytes_np, dtype=torch.int16, device=device), + torch.tensor(has_leading_space_np, dtype=torch.bool, device=device), + torch.tensor(is_boundary_token_np, dtype=torch.bool, device=device), + ) def load_validation_tokens(pattern: str, seq_len: int) -> Tensor: - files = [Path(p) for p in sorted(glob.glob(pattern))] - if not files: - raise FileNotFoundError(f"No files found for pattern: {pattern}") - tokens = torch.cat([load_data_shard(file) for file in files]).contiguous() - usable = ((tokens.numel() - 1) // seq_len) * seq_len - if usable <= 0: - raise ValueError(f"Validation split is too short for TRAIN_SEQ_LEN={seq_len}") - return tokens[: usable + 1] + files = [Path(p) for p in sorted(glob.glob(pattern))] + if not files: + raise FileNotFoundError(f"No files found for pattern: {pattern}") + tokens = torch.cat([load_data_shard(file) for file in files]).contiguous() + usable = ((tokens.numel() - 1) // seq_len) * seq_len + if usable <= 0: + raise ValueError(f"Validation split is too short for TRAIN_SEQ_LEN={seq_len}") + return tokens[: usable + 1] def eval_val( - args: Hyperparameters, - model: nn.Module, - rank: int, - world_size: int, - device: torch.device, - grad_accum_steps: int, - val_tokens: Tensor, - base_bytes_lut: Tensor, - has_leading_space_lut: Tensor, - is_boundary_token_lut: Tensor, - eval_seq_len: int | None = None, + args: Hyperparameters, + model: nn.Module, + rank: int, + world_size: int, + device: torch.device, + grad_accum_steps: int, + val_tokens: Tensor, + base_bytes_lut: Tensor, + has_leading_space_lut: Tensor, + is_boundary_token_lut: Tensor, + eval_seq_len: int | None = None, ) -> tuple[float, float]: - seq_len = eval_seq_len or args.train_seq_len - local_batch_tokens = args.val_batch_size // (world_size * grad_accum_steps) - if local_batch_tokens < seq_len: - raise ValueError( - "VAL_BATCH_SIZE must provide at least one sequence per rank; " - f"got VAL_BATCH_SIZE={args.val_batch_size}, WORLD_SIZE={world_size}, " - f"GRAD_ACCUM_STEPS={grad_accum_steps}, seq_len={seq_len}" - ) - local_batch_seqs = local_batch_tokens // seq_len - total_seqs = (val_tokens.numel() - 1) // seq_len - seq_start = (total_seqs * rank) // world_size - seq_end = (total_seqs * (rank + 1)) // world_size - val_loss_sum = torch.zeros((), device=device, dtype=torch.float64) - val_token_count = torch.zeros((), device=device, dtype=torch.float64) - val_byte_count = torch.zeros((), device=device, dtype=torch.float64) - model.eval() - with torch.inference_mode(): - for batch_seq_start in range(seq_start, seq_end, local_batch_seqs): - batch_seq_end = min(batch_seq_start + local_batch_seqs, seq_end) - raw_start = batch_seq_start * seq_len - raw_end = batch_seq_end * seq_len + 1 - local = val_tokens[raw_start:raw_end].to(device=device, dtype=torch.int64, non_blocking=True) - x = local[:-1].reshape(-1, seq_len) - y = local[1:].reshape(-1, seq_len) - with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): - batch_loss = model(x, y).detach() - batch_token_count = float(y.numel()) - val_loss_sum += batch_loss.to(torch.float64) * batch_token_count - val_token_count += batch_token_count - prev_ids = x.reshape(-1) - tgt_ids = y.reshape(-1) - token_bytes = base_bytes_lut[tgt_ids].to(dtype=torch.int16) - token_bytes += (has_leading_space_lut[tgt_ids] & ~is_boundary_token_lut[prev_ids]).to(dtype=torch.int16) - val_byte_count += 
token_bytes.to(torch.float64).sum() - if dist.is_available() and dist.is_initialized(): - dist.all_reduce(val_loss_sum, op=dist.ReduceOp.SUM) - dist.all_reduce(val_token_count, op=dist.ReduceOp.SUM) - dist.all_reduce(val_byte_count, op=dist.ReduceOp.SUM) - val_loss = val_loss_sum / val_token_count - bits_per_token = val_loss.item() / math.log(2.0) - tokens_per_byte = val_token_count.item() / val_byte_count.item() - model.train() - return float(val_loss.item()), float(bits_per_token * tokens_per_byte) - -# --- Quantization helpers --- - + seq_len = eval_seq_len or args.train_seq_len + local_batch_tokens = args.val_batch_size // (world_size * grad_accum_steps) + if local_batch_tokens < seq_len: + raise ValueError( + "VAL_BATCH_SIZE must provide at least one sequence per rank; " + f"got VAL_BATCH_SIZE={args.val_batch_size}, WORLD_SIZE={world_size}, " + f"GRAD_ACCUM_STEPS={grad_accum_steps}, seq_len={seq_len}" + ) + local_batch_seqs = local_batch_tokens // seq_len + total_seqs = (val_tokens.numel() - 1) // seq_len + seq_start = (total_seqs * rank) // world_size + seq_end = (total_seqs * (rank + 1)) // world_size + val_loss_sum = torch.zeros((), device=device, dtype=torch.float64) + val_token_count = torch.zeros((), device=device, dtype=torch.float64) + val_byte_count = torch.zeros((), device=device, dtype=torch.float64) + model.eval() + with torch.inference_mode(): + for batch_seq_start in range(seq_start, seq_end, local_batch_seqs): + batch_seq_end = min(batch_seq_start + local_batch_seqs, seq_end) + raw_start = batch_seq_start * seq_len + raw_end = batch_seq_end * seq_len + 1 + local = val_tokens[raw_start:raw_end].to(device=device, dtype=torch.int64, non_blocking=True) + x = local[:-1].reshape(-1, seq_len) + y = local[1:].reshape(-1, seq_len) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + batch_loss = model(x, y).detach() + batch_token_count = float(y.numel()) + val_loss_sum += batch_loss.to(torch.float64) * batch_token_count + val_token_count += batch_token_count + prev_ids = x.reshape(-1) + tgt_ids = y.reshape(-1) + token_bytes = base_bytes_lut[tgt_ids].to(dtype=torch.int16) + token_bytes += (has_leading_space_lut[tgt_ids] & ~is_boundary_token_lut[prev_ids]).to(dtype=torch.int16) + val_byte_count += token_bytes.to(torch.float64).sum() + if dist.is_available() and dist.is_initialized(): + dist.all_reduce(val_loss_sum, op=dist.ReduceOp.SUM) + dist.all_reduce(val_token_count, op=dist.ReduceOp.SUM) + dist.all_reduce(val_byte_count, op=dist.ReduceOp.SUM) + val_loss = val_loss_sum / val_token_count + bits_per_token = val_loss.item() / math.log(2.0) + tokens_per_byte = val_token_count.item() / val_byte_count.item() + model.train() + return float(val_loss.item()), float(bits_per_token * tokens_per_byte) CONTROL_TENSOR_NAME_PATTERNS = tuple( - pattern - for pattern in os.environ.get( - "CONTROL_TENSOR_NAME_PATTERNS", - "attn_scale,attn_scales,mlp_scale,mlp_scales,resid_mix,resid_mixes,q_gain,skip_weight,skip_weights,smear,dtg_gate,ve_layer_scales,ve_shared.scale,attn_gate,vr_lambda", - ).split(",") - if pattern + pattern + for pattern in os.environ.get( + "CONTROL_TENSOR_NAME_PATTERNS", + "attn_scale,attn_scales,mlp_scale,mlp_scales,resid_mix,resid_mixes,q_gain,skip_weight,skip_weights,smear", + ).split(",") + if pattern ) INT8_KEEP_FLOAT_FP32_NAME_PATTERNS = tuple( - pattern - for pattern in os.environ.get( - "INT8_KEEP_FLOAT_FP32_NAME_PATTERNS", - ",".join(CONTROL_TENSOR_NAME_PATTERNS), - ).split(",") - if pattern + pattern + for pattern in os.environ.get( + 
"INT8_KEEP_FLOAT_FP32_NAME_PATTERNS", + ",".join(CONTROL_TENSOR_NAME_PATTERNS), + ).split(",") + if pattern ) INT8_KEEP_FLOAT_MAX_NUMEL = 65_536 INT8_KEEP_FLOAT_STORE_DTYPE = torch.float16 @@ -588,1688 +503,1398 @@ def eval_val( INT8_CLIP_PERCENTILE = 99.99984 INT8_CLIP_Q = INT8_CLIP_PERCENTILE / 100.0 def tensor_nbytes(t: Tensor) -> int: - return int(t.numel()) * int(t.element_size()) + return int(t.numel()) * int(t.element_size()) def keep_float_tensor(name: str, t: Tensor, passthrough_orig_dtypes: dict[str, str]) -> Tensor: - if any(pattern in name for pattern in INT8_KEEP_FLOAT_FP32_NAME_PATTERNS): - return t.float().contiguous() - if t.dtype in {torch.float32, torch.bfloat16}: - passthrough_orig_dtypes[name] = str(t.dtype).removeprefix("torch.") - return t.to(dtype=INT8_KEEP_FLOAT_STORE_DTYPE).contiguous() - return t + if any(pattern in name for pattern in INT8_KEEP_FLOAT_FP32_NAME_PATTERNS): + return t.float().contiguous() + if t.dtype in {torch.float32, torch.bfloat16}: + passthrough_orig_dtypes[name] = str(t.dtype).removeprefix("torch.") + return t.to(dtype=INT8_KEEP_FLOAT_STORE_DTYPE).contiguous() + return t def quantize_float_tensor(t: Tensor) -> tuple[Tensor, Tensor]: - t32 = t.float() - if t32.ndim == 2: - clip_abs = ( - torch.quantile(t32.abs(), INT8_CLIP_Q, dim=1) - if t32.numel() - else torch.empty((t32.shape[0],), dtype=torch.float32) - ) - clipped = torch.maximum(torch.minimum(t32, clip_abs[:, None]), -clip_abs[:, None]) - scale = (clip_abs / 127.0).clamp_min(1.0 / 127.0) - q = torch.clamp(torch.round(clipped / scale[:, None]), -127, 127).to(torch.int8).contiguous() - return q, scale.to(dtype=INT8_PER_ROW_SCALE_DTYPE).contiguous() - clip_abs = float(torch.quantile(t32.abs().flatten(), INT8_CLIP_Q).item()) if t32.numel() else 0.0 - scale = torch.tensor(clip_abs / 127.0 if clip_abs > 0 else 1.0, dtype=torch.float32) - q = torch.clamp(torch.round(torch.clamp(t32, -clip_abs, clip_abs) / scale), -127, 127).to(torch.int8).contiguous() - return q, scale + t32 = t.float() + if t32.ndim == 2: + clip_abs = ( + torch.quantile(t32.abs(), INT8_CLIP_Q, dim=1) + if t32.numel() + else torch.empty((t32.shape[0],), dtype=torch.float32) + ) + clipped = torch.maximum(torch.minimum(t32, clip_abs[:, None]), -clip_abs[:, None]) + scale = (clip_abs / 127.0).clamp_min(1.0 / 127.0) + q = torch.clamp(torch.round(clipped / scale[:, None]), -127, 127).to(torch.int8).contiguous() + return q, scale.to(dtype=INT8_PER_ROW_SCALE_DTYPE).contiguous() + clip_abs = float(torch.quantile(t32.abs().flatten(), INT8_CLIP_Q).item()) if t32.numel() else 0.0 + scale = torch.tensor(clip_abs / 127.0 if clip_abs > 0 else 1.0, dtype=torch.float32) + q = torch.clamp(torch.round(torch.clamp(t32, -clip_abs, clip_abs) / scale), -127, 127).to(torch.int8).contiguous() + return q, scale def quantize_state_dict_int8(state_dict: dict[str, Tensor]): - quantized: dict[str, Tensor] = {} - scales: dict[str, Tensor] = {} - dtypes: dict[str, str] = {} - passthrough: dict[str, Tensor] = {} - passthrough_orig_dtypes: dict[str, str] = {} - qmeta: dict[str, dict[str, object]] = {} - stats = dict.fromkeys( - ("param_count", "num_tensors", "num_float_tensors", "num_nonfloat_tensors", "baseline_tensor_bytes", "int8_payload_bytes"), - 0, - ) - for name, tensor in state_dict.items(): - t = tensor.detach().to("cpu").contiguous() - stats["param_count"] += int(t.numel()) - stats["num_tensors"] += 1 - stats["baseline_tensor_bytes"] += tensor_nbytes(t) - if not t.is_floating_point(): - stats["num_nonfloat_tensors"] += 1 - passthrough[name] = t - 
stats["int8_payload_bytes"] += tensor_nbytes(t) - continue - if t.numel() <= INT8_KEEP_FLOAT_MAX_NUMEL: - kept = keep_float_tensor(name, t, passthrough_orig_dtypes) - passthrough[name] = kept - stats["int8_payload_bytes"] += tensor_nbytes(kept) - continue - stats["num_float_tensors"] += 1 - q, s = quantize_float_tensor(t) - if s.ndim > 0: - qmeta[name] = {"scheme": "per_row", "axis": 0} - quantized[name] = q - scales[name] = s - dtypes[name] = str(t.dtype).removeprefix("torch.") - stats["int8_payload_bytes"] += tensor_nbytes(q) + tensor_nbytes(s) - obj: dict[str, object] = { - "__quant_format__": "int8_clean_per_row_v1", - "quantized": quantized, - "scales": scales, - "dtypes": dtypes, - "passthrough": passthrough, - } - if qmeta: - obj["qmeta"] = qmeta - if passthrough_orig_dtypes: - obj["passthrough_orig_dtypes"] = passthrough_orig_dtypes - return obj, stats + quantized: dict[str, Tensor] = {} + scales: dict[str, Tensor] = {} + dtypes: dict[str, str] = {} + passthrough: dict[str, Tensor] = {} + passthrough_orig_dtypes: dict[str, str] = {} + qmeta: dict[str, dict[str, object]] = {} + stats = dict.fromkeys( + ("param_count", "num_tensors", "num_float_tensors", "num_nonfloat_tensors", "baseline_tensor_bytes", "int8_payload_bytes"), + 0, + ) + for name, tensor in state_dict.items(): + t = tensor.detach().to("cpu").contiguous() + stats["param_count"] += int(t.numel()) + stats["num_tensors"] += 1 + stats["baseline_tensor_bytes"] += tensor_nbytes(t) + if not t.is_floating_point(): + stats["num_nonfloat_tensors"] += 1 + passthrough[name] = t + stats["int8_payload_bytes"] += tensor_nbytes(t) + continue + if t.numel() <= INT8_KEEP_FLOAT_MAX_NUMEL: + kept = keep_float_tensor(name, t, passthrough_orig_dtypes) + passthrough[name] = kept + stats["int8_payload_bytes"] += tensor_nbytes(kept) + continue + stats["num_float_tensors"] += 1 + q, s = quantize_float_tensor(t) + if s.ndim > 0: + qmeta[name] = {"scheme": "per_row", "axis": 0} + quantized[name] = q + scales[name] = s + dtypes[name] = str(t.dtype).removeprefix("torch.") + stats["int8_payload_bytes"] += tensor_nbytes(q) + tensor_nbytes(s) + obj: dict[str, object] = { + "__quant_format__": "int8_clean_per_row_v1", + "quantized": quantized, + "scales": scales, + "dtypes": dtypes, + "passthrough": passthrough, + } + if qmeta: + obj["qmeta"] = qmeta + if passthrough_orig_dtypes: + obj["passthrough_orig_dtypes"] = passthrough_orig_dtypes + return obj, stats def dequantize_state_dict_int8(obj: dict[str, object]) -> dict[str, Tensor]: - out: dict[str, Tensor] = {} - qmeta = obj.get("qmeta", {}) - passthrough_orig_dtypes = obj.get("passthrough_orig_dtypes", {}) - for name, q in obj["quantized"].items(): - dtype = getattr(torch, obj["dtypes"][name]) - s = obj["scales"][name] - if qmeta.get(name, {}).get("scheme") == "per_row" or s.ndim > 0: - s = s.to(dtype=torch.float32) - out[name] = (q.float() * s.view(q.shape[0], *([1] * (q.ndim - 1)))).to(dtype=dtype).contiguous() - else: - scale = float(s.item()) - out[name] = (q.float() * scale).to(dtype=dtype).contiguous() - for name, t in obj["passthrough"].items(): - out_t = t.detach().to("cpu").contiguous() - orig_dtype = passthrough_orig_dtypes.get(name) - if isinstance(orig_dtype, str): - out_t = out_t.to(dtype=getattr(torch, orig_dtype)).contiguous() - out[name] = out_t - return out - -# --- Data loading --- - + out: dict[str, Tensor] = {} + qmeta = obj.get("qmeta", {}) + passthrough_orig_dtypes = obj.get("passthrough_orig_dtypes", {}) + for name, q in obj["quantized"].items(): + dtype = getattr(torch, 
obj["dtypes"][name]) + s = obj["scales"][name] + if qmeta.get(name, {}).get("scheme") == "per_row" or s.ndim > 0: + s = s.to(dtype=torch.float32) + out[name] = (q.float() * s.view(q.shape[0], *([1] * (q.ndim - 1)))).to(dtype=dtype).contiguous() + else: + scale = float(s.item()) + out[name] = (q.float() * scale).to(dtype=dtype).contiguous() + for name, t in obj["passthrough"].items(): + out_t = t.detach().to("cpu").contiguous() + orig_dtype = passthrough_orig_dtypes.get(name) + if isinstance(orig_dtype, str): + out_t = out_t.to(dtype=getattr(torch, orig_dtype)).contiguous() + out[name] = out_t + return out def load_data_shard(file: Path) -> Tensor: - header_bytes = 256 * np.dtype(" None: - self.file_idx = (self.file_idx + 1) % len(self.files) - self.tokens = load_data_shard(self.files[self.file_idx]) - self.pos = 0 - def take(self, n: int) -> Tensor: - chunks: list[Tensor] = [] - remaining = n - while remaining > 0: - avail = self.tokens.numel() - self.pos - if avail <= 0: - self._advance_file() - continue - k = min(remaining, avail) - chunks.append(self.tokens[self.pos : self.pos + k]) - self.pos += k - remaining -= k - return chunks[0] if len(chunks) == 1 else torch.cat(chunks) + def __init__(self, pattern: str): + self.files = [Path(p) for p in sorted(glob.glob(pattern))] + if not self.files: + raise FileNotFoundError(f"No files found for pattern: {pattern}") + self.file_idx = 0 + self.tokens = load_data_shard(self.files[0]) + self.pos = 0 + def _advance_file(self) -> None: + self.file_idx = (self.file_idx + 1) % len(self.files) + self.tokens = load_data_shard(self.files[self.file_idx]) + self.pos = 0 + def take(self, n: int) -> Tensor: + chunks: list[Tensor] = [] + remaining = n + while remaining > 0: + avail = self.tokens.numel() - self.pos + if avail <= 0: + self._advance_file() + continue + k = min(remaining, avail) + chunks.append(self.tokens[self.pos : self.pos + k]) + self.pos += k + remaining -= k + return chunks[0] if len(chunks) == 1 else torch.cat(chunks) class DistributedTokenLoader: - def __init__(self, pattern: str, rank: int, world_size: int, device: torch.device): - self.rank = rank - self.world_size = world_size - self.device = device - self.stream = TokenStream(pattern) - def next_batch(self, global_tokens: int, seq_len: int, grad_accum_steps: int) -> tuple[Tensor, Tensor]: - local_tokens = global_tokens // (self.world_size * grad_accum_steps) - per_rank_span = local_tokens + 1 - chunk = self.stream.take(per_rank_span * self.world_size) - start = self.rank * per_rank_span - local = chunk[start : start + per_rank_span].to(dtype=torch.int64) - x = local[:-1].reshape(-1, seq_len) - y = local[1:].reshape(-1, seq_len) - return x.to(self.device, non_blocking=True), y.to(self.device, non_blocking=True) - -# --- Transformer modules --- - + def __init__(self, pattern: str, rank: int, world_size: int, device: torch.device): + self.rank = rank + self.world_size = world_size + self.device = device + self.stream = TokenStream(pattern) + def next_batch(self, global_tokens: int, seq_len: int, grad_accum_steps: int) -> tuple[Tensor, Tensor]: + local_tokens = global_tokens // (self.world_size * grad_accum_steps) + per_rank_span = local_tokens + 1 + chunk = self.stream.take(per_rank_span * self.world_size) + start = self.rank * per_rank_span + local = chunk[start : start + per_rank_span].to(dtype=torch.int64) + x = local[:-1].reshape(-1, seq_len) + y = local[1:].reshape(-1, seq_len) + return x.to(self.device, non_blocking=True), y.to(self.device, non_blocking=True) class 
RMSNorm(nn.Module): - def __init__(self, eps: float | None = None): - super().__init__() - self.eps = eps - def forward(self, x: Tensor) -> Tensor: - return F.rms_norm(x, (x.size(-1),), eps=self.eps) + def __init__(self, eps: float | None = None): + super().__init__() + self.eps = eps + def forward(self, x: Tensor) -> Tensor: + return F.rms_norm(x, (x.size(-1),), eps=self.eps) class CastedLinear(nn.Linear): - _qat_enabled: bool = False - def forward(self, x: Tensor) -> Tensor: - w = self.weight.to(x.dtype) - if CastedLinear._qat_enabled and self.training and w.ndim == 2: - with torch.no_grad(): - w32 = self.weight.float() - row_max = w32.abs().amax(dim=1) - scale = (row_max / 31.0).clamp_min(1.0 / 31.0) - w_q = (torch.clamp(torch.round(w32 / scale[:, None]), -32, 31) * scale[:, None]).to(x.dtype) - w = w + (w_q - w).detach() - bias = self.bias.to(x.dtype) if self.bias is not None else None - return F.linear(x, w, bias) + _qat_enabled: bool = False + def forward(self, x: Tensor) -> Tensor: + w = self.weight.to(x.dtype) + if CastedLinear._qat_enabled and self.training and w.ndim == 2: + with torch.no_grad(): + w32 = self.weight.float() + row_max = w32.abs().amax(dim=1) + scale = (row_max / 31.0).clamp_min(1.0 / 31.0) + w_q = (torch.clamp(torch.round(w32 / scale[:, None]), -32, 31) * scale[:, None]).to(x.dtype) + w = w + (w_q - w).detach() + bias = self.bias.to(x.dtype) if self.bias is not None else None + return F.linear(x, w, bias) def restore_low_dim_params_to_fp32(module: nn.Module) -> None: - with torch.no_grad(): - for name, param in module.named_parameters(): - if (param.ndim < 2 or any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS)) and param.dtype != torch.float32: - param.data = param.data.float() + with torch.no_grad(): + for name, param in module.named_parameters(): + if (param.ndim < 2 or any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS)) and param.dtype != torch.float32: + param.data = param.data.float() class Rotary(nn.Module): - def __init__(self, dim: int, base: float = 10000.0, train_seq_len: int = 1024, rope_dims: int = 0): - super().__init__() - self.dim = dim - self.base = base - self.train_seq_len = train_seq_len - self.rope_dims = rope_dims if rope_dims > 0 else dim - inv_freq = 1.0 / (base ** (torch.arange(0, self.rope_dims, 2, dtype=torch.float32) / self.rope_dims)) - self.register_buffer("inv_freq", inv_freq, persistent=False) - self._seq_len_cached = 0 - self._cos_cached: Tensor | None = None - self._sin_cached: Tensor | None = None - def forward(self, seq_len: int, device: torch.device, dtype: torch.dtype) -> tuple[Tensor, Tensor]: - if ( - self._cos_cached is None - or self._sin_cached is None - or self._seq_len_cached != seq_len - or self._cos_cached.device != device - ): - rd = self.rope_dims - if seq_len > self.train_seq_len: - scale = seq_len / self.train_seq_len - new_base = self.base * (scale ** (rd / (rd - 2))) - inv_freq = 1.0 / (new_base ** (torch.arange(0, rd, 2, dtype=torch.float32, device=device) / rd)) - else: - inv_freq = self.inv_freq.to(device) - t = torch.arange(seq_len, device=device, dtype=inv_freq.dtype) - freqs = torch.outer(t, inv_freq) - self._cos_cached = freqs.cos()[None, :, None, :] - self._sin_cached = freqs.sin()[None, :, None, :] - self._seq_len_cached = seq_len - return self._cos_cached.to(dtype=dtype), self._sin_cached.to(dtype=dtype) + def __init__(self, dim: int, base: float = 10000.0, train_seq_len: int = 1024, rope_dims: int = 0): + super().__init__() + self.dim = dim + self.base = base + 
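# Standalone sketch of the straight-through estimator (STE) used by
# CastedLinear's QAT path above: the forward pass sees int6-style quantized
# weights while the backward pass treats quantization as identity, so the
# underlying float weights still receive gradients. Toy shapes only.
import torch

w = torch.randn(8, 16, requires_grad=True)
with torch.no_grad():
    scale = (w.abs().amax(dim=1) / 31.0).clamp_min(1.0 / 31.0)
    w_q = torch.clamp(torch.round(w / scale[:, None]), -32, 31) * scale[:, None]
w_ste = w + (w_q - w).detach()  # value == w_q, gradient == identity w.r.t. w
w_ste.sum().backward()
print(torch.allclose(w.grad, torch.ones_like(w)))  # True: gradient flows unblocked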
self.train_seq_len = train_seq_len + self.rope_dims = rope_dims if rope_dims > 0 else dim + inv_freq = 1.0 / (base ** (torch.arange(0, self.rope_dims, 2, dtype=torch.float32) / self.rope_dims)) + self.register_buffer("inv_freq", inv_freq, persistent=False) + self._seq_len_cached = 0 + self._cos_cached: Tensor | None = None + self._sin_cached: Tensor | None = None + def forward(self, seq_len: int, device: torch.device, dtype: torch.dtype) -> tuple[Tensor, Tensor]: + if ( + self._cos_cached is None + or self._sin_cached is None + or self._seq_len_cached != seq_len + or self._cos_cached.device != device + ): + rd = self.rope_dims + if seq_len > self.train_seq_len: + scale = seq_len / self.train_seq_len + new_base = self.base * (scale ** (rd / (rd - 2))) + inv_freq = 1.0 / (new_base ** (torch.arange(0, rd, 2, dtype=torch.float32, device=device) / rd)) + else: + inv_freq = self.inv_freq.to(device) + t = torch.arange(seq_len, device=device, dtype=inv_freq.dtype) + freqs = torch.outer(t, inv_freq) + self._cos_cached = freqs.cos()[None, :, None, :] + self._sin_cached = freqs.sin()[None, :, None, :] + self._seq_len_cached = seq_len + return self._cos_cached.to(dtype=dtype), self._sin_cached.to(dtype=dtype) def apply_rotary_emb(x: Tensor, cos: Tensor, sin: Tensor, rope_dims: int = 0) -> Tensor: - if rope_dims > 0 and rope_dims < x.size(-1): - x_rope, x_pass = x[..., :rope_dims], x[..., rope_dims:] - half = rope_dims // 2 - x1, x2 = x_rope[..., :half], x_rope[..., half:] - x_rope = torch.cat((x1 * cos + x2 * sin, x1 * (-sin) + x2 * cos), dim=-1) - return torch.cat((x_rope, x_pass), dim=-1) - half = x.size(-1) // 2 - x1, x2 = x[..., :half], x[..., half:] - return torch.cat((x1 * cos + x2 * sin, x1 * (-sin) + x2 * cos), dim=-1) - + if rope_dims > 0 and rope_dims < x.size(-1): + x_rope, x_pass = x[..., :rope_dims], x[..., rope_dims:] + half = rope_dims // 2 + x1, x2 = x_rope[..., :half], x_rope[..., half:] + x_rope = torch.cat((x1 * cos + x2 * sin, x1 * (-sin) + x2 * cos), dim=-1) + return torch.cat((x_rope, x_pass), dim=-1) + half = x.size(-1) // 2 + x1, x2 = x[..., :half], x[..., half:] + return torch.cat((x1 * cos + x2 * sin, x1 * (-sin) + x2 * cos), dim=-1) class CausalSelfAttention(nn.Module): - def __init__( - self, - dim: int, - num_heads: int, - num_kv_heads: int, - rope_base: float, - qk_gain_init: float, - gated_attention: bool = False, - value_residual: bool = False, - ): - super().__init__() - if dim % num_heads != 0: - raise ValueError("model_dim must be divisible by num_heads") - if num_heads % num_kv_heads != 0: - raise ValueError("num_heads must be divisible by num_kv_heads") - self.num_heads = num_heads - self.num_kv_heads = num_kv_heads - self.head_dim = dim // num_heads - if self.head_dim % 2 != 0: - raise ValueError("head_dim must be even for RoPE") - # No CastedLinear -- weights come from banks - self.q_gain = nn.Parameter(torch.full((num_heads,), qk_gain_init, dtype=torch.float32)) - self.rope_dims = 0 # set by GPT.__init__ for partial RoPE - self.rotary = Rotary(self.head_dim, base=rope_base, train_seq_len=1024) - self.use_xsa = False # set by GPT.__init__ for deep layers only - # Gated attention and value residual (non-banked small params) - self.gated_attention = gated_attention - if gated_attention: - self.attn_gate = nn.Linear(dim, num_heads, bias=True) - nn.init.zeros_(self.attn_gate.weight) - nn.init.constant_(self.attn_gate.bias, 4.0) - self.value_residual = value_residual - if value_residual: - self.vr_lambda = nn.Parameter(torch.tensor([0.5, 0.5], dtype=torch.float32)) - 
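# Standalone sketch of partial RoPE as in apply_rotary_emb above: only the
# first rope_dims of each head dimension are rotated; the remaining dims pass
# through unchanged. Shapes mirror the Rotary module; values are toy.
import torch

def partial_rope(x, cos, sin, rope_dims):
    x_rope, x_pass = x[..., :rope_dims], x[..., rope_dims:]
    half = rope_dims // 2
    x1, x2 = x_rope[..., :half], x_rope[..., half:]
    rotated = torch.cat((x1 * cos + x2 * sin, x1 * (-sin) + x2 * cos), dim=-1)
    return torch.cat((rotated, x_pass), dim=-1)

T, H, D, rd = 6, 2, 64, 16
inv_freq = 1.0 / (10000.0 ** (torch.arange(0, rd, 2, dtype=torch.float32) / rd))
freqs = torch.outer(torch.arange(T, dtype=torch.float32), inv_freq)
cos, sin = freqs.cos()[None, :, None, :], freqs.sin()[None, :, None, :]
x = torch.randn(1, T, H, D)
y = partial_rope(x, cos, sin, rd)
print(torch.equal(y[..., rd:], x[..., rd:]))  # True: pass-through dims untouched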
def _xsa_efficient(self, y: Tensor, v: Tensor) -> Tensor: - """Efficient XSA: subtract self-value projection via GQA-aware reshape (no repeat_interleave). - y: [B, T, H, D], v: [B, T, Hkv, D]. H must be divisible by Hkv.""" - B, T, H, D = y.shape - Hkv = v.size(-2) - group = H // Hkv - y_g = y.reshape(B, T, Hkv, group, D) # [B, T, Hkv, group, D] - vn = F.normalize(v, dim=-1).unsqueeze(-2) # [B, T, Hkv, 1, D] -- broadcast ready - proj = (y_g * vn).sum(dim=-1, keepdim=True) * vn - return (y_g - proj).reshape(B, T, H, D) - def forward(self, x: Tensor, q_w: Tensor, k_w: Tensor, v_w: Tensor, out_w: Tensor, v_embed: Tensor | None = None, v0: Tensor | None = None) -> tuple[Tensor, Tensor | None]: - bsz, seqlen, dim = x.shape - q = F.linear(x, q_w.to(x.dtype)).reshape(bsz, seqlen, self.num_heads, self.head_dim) - k = F.linear(x, k_w.to(x.dtype)).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim) - v = F.linear(x, v_w.to(x.dtype)) - if v_embed is not None: - v = v + v_embed - v = v.reshape(bsz, seqlen, self.num_kv_heads, self.head_dim) - raw_v = v if self.value_residual else None - if self.value_residual and v0 is not None: - lam = self.vr_lambda.to(dtype=v.dtype) - v = lam[0] * v0 + lam[1] * v - q = F.rms_norm(q, (q.size(-1),)) - k = F.rms_norm(k, (k.size(-1),)) - cos, sin = self.rotary(seqlen, x.device, q.dtype) - q = apply_rotary_emb(q, cos, sin, self.rope_dims) - k = apply_rotary_emb(k, cos, sin, self.rope_dims) - q = q * self.q_gain.to(dtype=q.dtype)[None, None, :, None] - y = flash_attn_3_func(q, k, v, causal=True) - if self.use_xsa: - y = self._xsa_efficient(y, v) - if self.gated_attention: - # gate shape: (bsz, seqlen, num_heads) -> (bsz, seqlen, num_heads, 1) for B,T,H,D layout - gate = torch.sigmoid(self.attn_gate(x)).unsqueeze(-1) - y = y * gate - y = y.reshape(bsz, seqlen, dim) - return F.linear(y, out_w.to(x.dtype)), raw_v - + def __init__( + self, + dim: int, + num_heads: int, + num_kv_heads: int, + rope_base: float, + qk_gain_init: float, + ): + super().__init__() + if dim % num_heads != 0: + raise ValueError("model_dim must be divisible by num_heads") + if num_heads % num_kv_heads != 0: + raise ValueError("num_heads must be divisible by num_kv_heads") + self.num_heads = num_heads + self.num_kv_heads = num_kv_heads + self.head_dim = dim // num_heads + if self.head_dim % 2 != 0: + raise ValueError("head_dim must be even for RoPE") + self.q_gain = nn.Parameter(torch.full((num_heads,), qk_gain_init, dtype=torch.float32)) + self.rope_dims = 0 # set by GPT.__init__ for partial RoPE + self.rotary = Rotary(self.head_dim, base=rope_base, train_seq_len=1024) + self.use_xsa = False # set by GPT.__init__ for deep layers only + def _xsa_efficient(self, y: Tensor, v: Tensor) -> Tensor: + """Efficient XSA: subtract self-value projection via GQA-aware reshape (no repeat_interleave). y: [B, T, H, D], v: [B, T, Hkv, D]. H must be divisible by Hkv.""" + B, T, H, D = y.shape + Hkv = v.size(-2) + group = H // Hkv + y_g = y.reshape(B, T, Hkv, group, D) # [B, T, Hkv, group, D] + vn = F.normalize(v, dim=-1).unsqueeze(-2) # [B, T, Hkv, 1, D] -- broadcast ready + proj = (y_g * vn).sum(dim=-1, keepdim=True) * vn + return (y_g - proj).reshape(B, T, H, D) + def forward(self, x: Tensor, q_w: Tensor, k_w: Tensor, v_w: Tensor, out_w: Tensor) -> tuple[Tensor, Tensor | None]: + bsz, seqlen, dim = x.shape + q = F.linear(x, q_w.to(x.dtype)).reshape(bsz, seqlen, self.num_heads, self.head_dim) + k = F.linear(x, k_w.to(x.dtype)).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim) + v = F.linear(x, v_w.to(x.dtype)) + v = v.reshape(bsz, seqlen, self.num_kv_heads, self.head_dim) + raw_v = None + q = F.rms_norm(q, (q.size(-1),)) + k = F.rms_norm(k, (k.size(-1),)) + cos, 
sin = self.rotary(seqlen, x.device, q.dtype) + q = apply_rotary_emb(q, cos, sin, self.rope_dims) + k = apply_rotary_emb(k, cos, sin, self.rope_dims) + q = q * self.q_gain.to(dtype=q.dtype)[None, None, :, None] + y = flash_attn_3_func(q, k, v, causal=True) + if self.use_xsa: + y = self._xsa_efficient(y, v) + y = y.reshape(bsz, seqlen, dim) + return F.linear(y, out_w.to(x.dtype)), raw_v class SmearGate(nn.Module): - def __init__(self, dim: int): - super().__init__() - self.gate = nn.Parameter(torch.zeros(dim, dtype=torch.float32)) - def forward(self, x: Tensor) -> Tensor: - g = torch.sigmoid(self.gate.to(dtype=x.dtype))[None, None, :] - x_prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1) - return (1 - g) * x + g * x_prev - -class BigramHashEmbedding(nn.Module): - def __init__(self, bigram_vocab_size: int, bigram_dim: int, model_dim: int): - super().__init__() - self.bigram_vocab_size = bigram_vocab_size - self.embed = nn.Embedding(bigram_vocab_size, bigram_dim) - nn.init.zeros_(self.embed.weight) - self.proj = CastedLinear(bigram_dim, model_dim, bias=False) if bigram_dim != model_dim else None - if self.proj is not None: - nn.init.zeros_(self.proj.weight) - self.scale = nn.Parameter(torch.tensor(0.05, dtype=torch.float32)) - def bigram_hash(self, tokens: Tensor) -> Tensor: - t = tokens.to(torch.int32) - mod = self.bigram_vocab_size - 1 - out = torch.empty_like(t) - out[..., 0] = mod - out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod - return out.long() - def forward(self, token_ids: Tensor) -> Tensor: - h = self.embed(self.bigram_hash(token_ids)) - if self.proj is not None: - h = self.proj(h) - return h * self.scale.to(dtype=h.dtype) - -class ValueEmbedding(nn.Module): - """Reinject token identity into attention values at specific layers. 
- Each table maps vocab tokens to a low-dim embedding, projected to model_dim.""" - def __init__(self, vocab_size: int, ve_dim: int, model_dim: int): - super().__init__() - self.embed = nn.Embedding(vocab_size, ve_dim) - nn.init.normal_(self.embed.weight, std=0.01) - self.proj = CastedLinear(ve_dim, model_dim, bias=False) if ve_dim != model_dim else None - if self.proj is not None: - nn.init.zeros_(self.proj.weight) - self.scale = nn.Parameter(torch.tensor(0.1, dtype=torch.float32)) - def forward(self, token_ids: Tensor) -> Tensor: - h = self.embed(token_ids) - if self.proj is not None: - h = self.proj(h) - return h * self.scale.to(dtype=h.dtype) - + def __init__(self, dim: int): + super().__init__() + self.gate = nn.Parameter(torch.zeros(dim, dtype=torch.float32)) + def forward(self, x: Tensor) -> Tensor: + g = torch.sigmoid(self.gate.to(dtype=x.dtype))[None, None, :] + x_prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1) + return (1 - g) * x + g * x_prev class MLP(nn.Module): - def __init__(self, dim: int, mlp_mult: int): - super().__init__() - # No CastedLinear -- weights come from banks - def forward(self, x: Tensor, up_w: Tensor, down_w: Tensor) -> Tensor: - x = F.leaky_relu(F.linear(x, up_w.to(x.dtype)), negative_slope=0.5) - return F.linear(x.square(), down_w.to(x.dtype)) - + def __init__(self, dim: int, mlp_mult: int): + super().__init__() + def forward(self, x: Tensor, up_w: Tensor, down_w: Tensor) -> Tensor: + x = F.leaky_relu(F.linear(x, up_w.to(x.dtype)), negative_slope=0.5) + return F.linear(x.square(), down_w.to(x.dtype)) class Block(nn.Module): - def __init__( - self, - dim: int, - num_heads: int, - num_kv_heads: int, - mlp_mult: int, - rope_base: float, - qk_gain_init: float, - layer_idx: int = 0, - ln_scale: bool = False, - dtg: bool = False, - gated_attention: bool = False, - value_residual: bool = False, - ): - super().__init__() - self.attn_norm = RMSNorm() - self.mlp_norm = RMSNorm() - self.attn = CausalSelfAttention(dim, num_heads, num_kv_heads, rope_base, qk_gain_init, - gated_attention=gated_attention, value_residual=value_residual) - self.mlp = MLP(dim, mlp_mult) - self.attn_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32)) - self.mlp_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32)) - self.resid_mix = nn.Parameter(torch.stack((torch.ones(dim), torch.zeros(dim))).float()) - self.ln_scale_factor = 1.0 / math.sqrt(layer_idx + 1) if ln_scale else 1.0 - if dtg: - self.dtg_gate = nn.Linear(dim, 1, bias=True) - nn.init.zeros_(self.dtg_gate.weight) - nn.init.constant_(self.dtg_gate.bias, 2.0) - else: - self.dtg_gate = None - def forward(self, x: Tensor, x0: Tensor, q_w: Tensor, k_w: Tensor, v_w: Tensor, out_w: Tensor, up_w: Tensor, down_w: Tensor, v_embed: Tensor | None = None, v0: Tensor | None = None) -> tuple[Tensor, Tensor | None]: - mix = self.resid_mix.to(dtype=x.dtype) - x_in = mix[0][None, None, :] * x + mix[1][None, None, :] * x0 - attn_out, raw_v = self.attn(self.attn_norm(x_in) * self.ln_scale_factor, q_w, k_w, v_w, out_w, v_embed=v_embed, v0=v0) - x_out = x_in + self.attn_scale.to(dtype=x_in.dtype)[None, None, :] * attn_out - x_out = x_out + self.mlp_scale.to(dtype=x_out.dtype)[None, None, :] * self.mlp(self.mlp_norm(x_out) * self.ln_scale_factor, up_w, down_w) - if self.dtg_gate is not None: - gate = torch.sigmoid(self.dtg_gate(x_in.detach())) - x_out = x_in + gate * (x_out - x_in) - return x_out, raw_v - + def __init__( + self, + dim: int, + num_heads: int, + num_kv_heads: int, + mlp_mult: int, + rope_base: float, + 
qk_gain_init: float, + layer_idx: int = 0, + ln_scale: bool = False, + ): + super().__init__() + self.attn_norm = RMSNorm() + self.mlp_norm = RMSNorm() + self.attn = CausalSelfAttention(dim, num_heads, num_kv_heads, rope_base, qk_gain_init, +) + self.mlp = MLP(dim, mlp_mult) + self.attn_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32)) + self.mlp_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32)) + self.resid_mix = nn.Parameter(torch.stack((torch.ones(dim), torch.zeros(dim))).float()) + self.ln_scale_factor = 1.0 / math.sqrt(layer_idx + 1) if ln_scale else 1.0 + def forward(self, x: Tensor, x0: Tensor, q_w: Tensor, k_w: Tensor, v_w: Tensor, out_w: Tensor, up_w: Tensor, down_w: Tensor) -> tuple[Tensor, Tensor | None]: + mix = self.resid_mix.to(dtype=x.dtype) + x_in = mix[0][None, None, :] * x + mix[1][None, None, :] * x0 + attn_out, _ = self.attn(self.attn_norm(x_in) * self.ln_scale_factor, q_w, k_w, v_w, out_w) + x_out = x_in + self.attn_scale.to(dtype=x_in.dtype)[None, None, :] * attn_out + x_out = x_out + self.mlp_scale.to(dtype=x_out.dtype)[None, None, :] * self.mlp(self.mlp_norm(x_out) * self.ln_scale_factor, up_w, down_w) + return x_out, None def _fake_quantize(w: Tensor, bits: int = 6) -> Tensor: - clip_range = (1 << (bits - 1)) - 1 - w32 = w.float() - if w32.ndim >= 2: - row_max = w32.abs().amax(dim=-1) - scale = (row_max / clip_range).clamp_min(1.0 / clip_range) - dims = (slice(None),) * (w32.ndim - 1) + (None,) - w_q = (torch.clamp(torch.round(w32 / scale[dims]), -clip_range, clip_range) * scale[dims]).to(w.dtype) - else: - amax = w32.abs().max() - scale = (amax / clip_range).clamp_min(1.0 / clip_range) - w_q = (torch.clamp(torch.round(w32 / scale), -clip_range, clip_range) * scale).to(w.dtype) - return w + (w_q - w).detach() - + clip_range = (1 << (bits - 1)) - 1 + w32 = w.float() + if w32.ndim >= 2: + row_max = w32.abs().amax(dim=-1) + scale = (row_max / clip_range).clamp_min(1.0 / clip_range) + dims = (slice(None),) * (w32.ndim - 1) + (None,) + w_q = (torch.clamp(torch.round(w32 / scale[dims]), -clip_range, clip_range) * scale[dims]).to(w.dtype) + else: + amax = w32.abs().max() + scale = (amax / clip_range).clamp_min(1.0 / clip_range) + w_q = (torch.clamp(torch.round(w32 / scale), -clip_range, clip_range) * scale).to(w.dtype) + return w + (w_q - w).detach() class GPT(nn.Module): - def __init__( - self, - vocab_size: int, - num_layers: int, - model_dim: int, - num_heads: int, - num_kv_heads: int, - mlp_mult: int, - tie_embeddings: bool, - tied_embed_init_std: float, - logit_softcap: float, - rope_base: float, - qk_gain_init: float, - mtp_num_heads: int = 0, - mtp_loss_weight: float = 0.1, - bigram_vocab_size: int = 0, - bigram_dim: int = 128, - xsa_last_n: int = 0, - rope_dims: int = 0, - ln_scale: bool = False, - dtg: bool = False, - ve_enabled: bool = False, - ve_dim: int = 128, - ve_layers: str = "9,10", - gated_attention: bool = False, - value_residual: bool = False, - core_start: int = 3, - core_end: int = 8, - num_passes: int = 1, - core_quant_bits: int = 6, - core_quant_enabled: bool = False, - residual_scale: nn.Module | None = None, - interpass_rmsnorm: bool = True, - ): - super().__init__() - self._ve_target_dim = num_kv_heads * (model_dim // num_heads) - if logit_softcap <= 0.0: - raise ValueError(f"logit_softcap must be positive, got {logit_softcap}") - self.tie_embeddings = tie_embeddings - self.tied_embed_init_std = tied_embed_init_std - self.logit_softcap = logit_softcap - self.value_residual = value_residual - self.mtp_num_heads = mtp_num_heads - 
self.mtp_loss_weight = mtp_loss_weight - self.core_start = core_start - self.core_end = min(core_end, num_layers) - self.interpass_rmsnorm = interpass_rmsnorm - self.num_passes = num_passes - self.core_quant_bits = core_quant_bits - self.core_quant_enabled = core_quant_enabled - self.num_stem = core_start - self.num_core = self.core_end - core_start - self.num_tail = num_layers - self.core_end - self.residual_scale = residual_scale - self.tok_emb = nn.Embedding(vocab_size, model_dim) - self.bigram = BigramHashEmbedding(bigram_vocab_size, bigram_dim, model_dim) if bigram_vocab_size > 0 else None - self.smear = SmearGate(model_dim) - self.num_skip_weights = min(self.num_stem, self.num_tail) - self.skip_weights = nn.Parameter(torch.ones(self.num_skip_weights, model_dim, dtype=torch.float32)) - # Parameter banks: contiguous 3D tensors for batched optimizer - head_dim = model_dim // num_heads - kv_dim = num_kv_heads * head_dim - mlp_dim = int(mlp_mult * model_dim) - self.num_layers = num_layers - self.qo_bank = nn.Parameter(torch.empty(2 * num_layers, model_dim, model_dim)) - self.kv_bank = nn.Parameter(torch.empty(2 * num_layers, kv_dim, model_dim)) - self.mlp_up_bank = nn.Parameter(torch.empty(num_layers, mlp_dim, model_dim)) - self.mlp_down_bank = nn.Parameter(torch.empty(num_layers, model_dim, mlp_dim)) - self.blocks = nn.ModuleList( - [ - Block( - model_dim, - num_heads, - num_kv_heads, - mlp_mult, - rope_base, - qk_gain_init, - layer_idx=i, - ln_scale=ln_scale, - dtg=dtg, - gated_attention=gated_attention, - value_residual=value_residual, - ) - for i in range(num_layers) - ] - ) - if rope_dims > 0: - head_dim = model_dim // num_heads - for block in self.blocks: - block.attn.rope_dims = rope_dims - block.attn.rotary = Rotary(head_dim, base=rope_base, train_seq_len=1024, rope_dims=rope_dims) - self.ve_layer_indices = [int(x) for x in ve_layers.split(",") if x.strip()] if ve_enabled else [] - kv_dim_ve = self._ve_target_dim - if self.ve_layer_indices: - self.ve_shared = ValueEmbedding(vocab_size, ve_dim, kv_dim_ve) - self.ve_layer_scales = nn.ParameterList( - [nn.Parameter(torch.ones(1, dtype=torch.float32)) for _ in self.ve_layer_indices] - ) - else: - self.ve_shared = None - self.ve_layer_scales = nn.ParameterList() - self.value_embeds = nn.ModuleList() # keep empty for compat - self.final_norm = RMSNorm() - self.lm_head = None if tie_embeddings else CastedLinear(model_dim, vocab_size, bias=False) - if self.lm_head is not None: - self.lm_head._zero_init = True - self.mtp_heads = nn.ModuleList( - [CastedLinear(model_dim, vocab_size, bias=False) for _ in range(mtp_num_heads)] - ) - for head in self.mtp_heads: - head._zero_init = True - if xsa_last_n > 0: - for i in range(max(0, num_layers - xsa_last_n), num_layers): - if i < core_start or i >= self.core_end: - self.blocks[i].attn.use_xsa = True - self._init_weights() - def _init_weights(self) -> None: - if self.tie_embeddings: - nn.init.normal_(self.tok_emb.weight, mean=0.0, std=self.tied_embed_init_std) - n = self.num_layers - proj_scale = 1.0 / math.sqrt(2 * n) - # Init banks: orthogonal, with proj layers scaled down and out/down zero-init - for i in range(n): - nn.init.orthogonal_(self.qo_bank.data[i], gain=1.0) # Q - nn.init.zeros_(self.qo_bank.data[n + i]) # Out (zero init) - nn.init.orthogonal_(self.kv_bank.data[i], gain=1.0) # K - nn.init.orthogonal_(self.kv_bank.data[n + i], gain=1.0) # V - nn.init.orthogonal_(self.mlp_up_bank.data[i], gain=1.0) # MLP up - nn.init.zeros_(self.mlp_down_bank.data[i]) # MLP down (zero init) - # Scale 
proj layers (out_proj and mlp_down are "proj" layers) - self.qo_bank.data[n + i].mul_(proj_scale) - self.mlp_down_bank.data[i].mul_(proj_scale) - # Init remaining nn.Linear modules (bigram proj, mtp heads, lm_head) - for name, module in self.named_modules(): - if isinstance(module, nn.Linear): - if getattr(module, "_zero_init", False): - nn.init.zeros_(module.weight) - elif module.weight.ndim == 2 and module.weight.shape[0] >= 64 and module.weight.shape[1] >= 64: - nn.init.orthogonal_(module.weight, gain=1.0) - def _get_ve(self, layer_idx: int, input_ids: Tensor, ve_cache: dict | None = None) -> Tensor | None: - """Get value embedding for a specific layer using shared table + per-layer scale.""" - if self.ve_shared is None or layer_idx not in self.ve_layer_indices: - return None - if ve_cache is not None and 've' not in ve_cache: - ve_cache['ve'] = self.ve_shared(input_ids) - ve_base = ve_cache['ve'] if ve_cache is not None else self.ve_shared(input_ids) - ve_idx = self.ve_layer_indices.index(layer_idx) - return ve_base * self.ve_layer_scales[ve_idx].to(dtype=ve_base.dtype) - def _get_bank_weights(self, bi: int) -> tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]: - n = self.num_layers - q_w = self.qo_bank[bi] - out_w = self.qo_bank[n + bi] - k_w = self.kv_bank[bi] - v_w = self.kv_bank[n + bi] - up_w = self.mlp_up_bank[bi] - down_w = self.mlp_down_bank[bi] - if self.core_quant_enabled and self.training and self.core_start <= bi < self.core_end: - q_w = _fake_quantize(q_w, self.core_quant_bits) - out_w = _fake_quantize(out_w, self.core_quant_bits) - k_w = _fake_quantize(k_w, self.core_quant_bits) - v_w = _fake_quantize(v_w, self.core_quant_bits) - up_w = _fake_quantize(up_w, self.core_quant_bits) - down_w = _fake_quantize(down_w, self.core_quant_bits) - return q_w, k_w, v_w, out_w, up_w, down_w - - def _forward_hidden(self, input_ids: Tensor, feedback_fn=None, - stabilizer=None) -> tuple[Tensor, Tensor, Tensor]: - n = self.num_layers - x = self.tok_emb(input_ids) - if self.bigram is not None: - x = x + self.bigram(input_ids) - x = F.rms_norm(x, (x.size(-1),)) - x = self.smear(x) - x0 = x - v0 = None - skips: list[Tensor] = [] - ve_cache: dict = {} - # --- STEM --- - for i in range(self.core_start): - ve = self._get_ve(i, input_ids, ve_cache) - q_w, k_w, v_w, out_w, up_w, down_w = self._get_bank_weights(i) - x, raw_v = self.blocks[i](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, - v_embed=ve, v0=v0) - if v0 is None and raw_v is not None: - v0 = raw_v - skips.append(x) - # --- RECURRENT CORE (Fixes 1, 2, 5) --- - h_core_in = x - for k in range(self.num_passes): - if k > 0 and self.interpass_rmsnorm: - x = F.rms_norm(x, (x.size(-1),)) - if feedback_fn is not None: - x = x + feedback_fn(x, k) - if stabilizer is not None: - x = stabilizer.clip(x) - x_before_pass = x - for j in range(self.core_start, self.core_end): - h_prev = x - ve = self._get_ve(j, input_ids, ve_cache) - q_w, k_w, v_w, out_w, up_w, down_w = self._get_bank_weights(j) - x, raw_v = self.blocks[j](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, - v_embed=ve, v0=v0) - if v0 is None and raw_v is not None: - v0 = raw_v - if stabilizer is not None and self.training and not torch.compiler.is_compiling(): - stabilizer.record_pass(h_prev, x) - if self.residual_scale is not None and k > 0: - delta = x - x_before_pass - x = x_before_pass + self.residual_scale(delta, k) - h_core_out = x - # --- TAIL --- - for i in range(self.core_end, n): - ti = i - self.core_end - if ti < len(skips): - x = x + self.skip_weights[ti].to(dtype=x.dtype)[None, 
None, :] * skips.pop() - ve = self._get_ve(i, input_ids, ve_cache) - q_w, k_w, v_w, out_w, up_w, down_w = self._get_bank_weights(i) - x, _ = self.blocks[i](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, - v_embed=ve, v0=v0) - x = self.final_norm(x) - return x, h_core_in, h_core_out - - def forward(self, input_ids: Tensor, target_ids: Tensor, - feedback_fn=None, stabilizer=None) -> Tensor: - x, h_core_in, h_core_out = self._forward_hidden(input_ids, feedback_fn, stabilizer) - x_flat = x.reshape(-1, x.size(-1)) - targets = target_ids.reshape(-1) - if self.tie_embeddings: - logits_proj = F.linear(x_flat, self.tok_emb.weight) - else: - if self.lm_head is None: - raise RuntimeError("lm_head is required when tie_embeddings=False") - logits_proj = self.lm_head(x_flat) - logits = self.logit_softcap * torch.tanh(logits_proj / self.logit_softcap) - main_loss = F.cross_entropy(logits.float(), targets, reduction="mean") - if self.training and self.mtp_num_heads > 0 and self.mtp_loss_weight > 0.0: - _, seqlen, dim = x.shape - mtp_loss_sum = x.new_zeros(()) - mtp_loss_count = 0 - for k, mtp_head in enumerate(self.mtp_heads): - valid_t = seqlen - (k + 1) - if valid_t <= 0: - continue - mtp_hidden = x[:, :valid_t, :].reshape(-1, dim) - mtp_targets = target_ids[:, k + 1 :].reshape(-1) - mtp_logits_proj = mtp_head(mtp_hidden) - mtp_logits = self.logit_softcap * torch.tanh(mtp_logits_proj / self.logit_softcap) - mtp_loss_sum = mtp_loss_sum + F.cross_entropy(mtp_logits.float(), mtp_targets, reduction="mean") - mtp_loss_count += 1 - if mtp_loss_count > 0: - main_loss = main_loss + self.mtp_loss_weight * (mtp_loss_sum / mtp_loss_count) - if stabilizer is not None and stabilizer.jacobian_proxy_weight > 0: - main_loss = main_loss + stabilizer.jacobian_proxy_loss(h_core_in, h_core_out) - return main_loss - - def forward_logits(self, input_ids: Tensor, - feedback_fn=None, stabilizer=None) -> Tensor: - """Return logits (bsz, seq_len, vocab) without computing loss.""" - x, _, _ = self._forward_hidden(input_ids, feedback_fn, stabilizer) - if self.tie_embeddings: - logits_proj = F.linear(x, self.tok_emb.weight) - else: - logits_proj = self.lm_head(x) - return self.logit_softcap * torch.tanh(logits_proj / self.logit_softcap) - -# --- Sliding window evaluation --- - + def __init__( + self, + vocab_size: int, + num_layers: int, + model_dim: int, + num_heads: int, + num_kv_heads: int, + mlp_mult: int, + tie_embeddings: bool, + tied_embed_init_std: float, + logit_softcap: float, + rope_base: float, + qk_gain_init: float, + xsa_last_n: int = 0, + rope_dims: int = 0, + ln_scale: bool = False, + core_start: int = 3, + core_end: int = 8, + num_passes: int = 1, + core_quant_bits: int = 6, + core_quant_enabled: bool = False, + residual_scale: nn.Module | None = None, + interpass_rmsnorm: bool = True, + ): + super().__init__() + self._ve_target_dim = num_kv_heads * (model_dim // num_heads) + if logit_softcap <= 0.0: + raise ValueError(f"logit_softcap must be positive, got {logit_softcap}") + self.tie_embeddings = tie_embeddings + self.tied_embed_init_std = tied_embed_init_std + self.logit_softcap = logit_softcap + self.core_start = core_start + self.core_end = min(core_end, num_layers) + self.interpass_rmsnorm = interpass_rmsnorm + self.num_passes = num_passes + self.core_quant_bits = core_quant_bits + self.core_quant_enabled = core_quant_enabled + self.num_stem = core_start + self.num_core = self.core_end - core_start + self.num_tail = num_layers - self.core_end + self.residual_scale = residual_scale + self.tok_emb = 
nn.Embedding(vocab_size, model_dim) + self.bigram = None + self.smear = SmearGate(model_dim) + self.num_skip_weights = min(self.num_stem, self.num_tail) + self.skip_weights = nn.Parameter(torch.ones(self.num_skip_weights, model_dim, dtype=torch.float32)) + head_dim = model_dim // num_heads + kv_dim = num_kv_heads * head_dim + mlp_dim = int(mlp_mult * model_dim) + self.num_layers = num_layers + self.qo_bank = nn.Parameter(torch.empty(2 * num_layers, model_dim, model_dim)) + self.kv_bank = nn.Parameter(torch.empty(2 * num_layers, kv_dim, model_dim)) + self.mlp_up_bank = nn.Parameter(torch.empty(num_layers, mlp_dim, model_dim)) + self.mlp_down_bank = nn.Parameter(torch.empty(num_layers, model_dim, mlp_dim)) + self.blocks = nn.ModuleList( + [ + Block( + model_dim, + num_heads, + num_kv_heads, + mlp_mult, + rope_base, + qk_gain_init, + layer_idx=i, + ln_scale=ln_scale, + ) + for i in range(num_layers) + ] + ) + if rope_dims > 0: + head_dim = model_dim // num_heads + for block in self.blocks: + block.attn.rope_dims = rope_dims + block.attn.rotary = Rotary(head_dim, base=rope_base, train_seq_len=1024, rope_dims=rope_dims) + self.ve_shared = None + self.ve_layer_indices = [] + self.ve_layer_scales = nn.ParameterList() + self.value_embeds = nn.ModuleList() + self.final_norm = RMSNorm() + self.lm_head = None if tie_embeddings else CastedLinear(model_dim, vocab_size, bias=False) + if self.lm_head is not None: + self.lm_head._zero_init = True + self.mtp_heads = nn.ModuleList() + if xsa_last_n > 0: + for i in range(max(0, num_layers - xsa_last_n), num_layers): + if i < core_start or i >= self.core_end: + self.blocks[i].attn.use_xsa = True + self._init_weights() + def _init_weights(self) -> None: + if self.tie_embeddings: + nn.init.normal_(self.tok_emb.weight, mean=0.0, std=self.tied_embed_init_std) + n = self.num_layers + proj_scale = 1.0 / math.sqrt(2 * n) + for i in range(n): + nn.init.orthogonal_(self.qo_bank.data[i], gain=1.0) # Q + nn.init.zeros_(self.qo_bank.data[n + i]) # Out (zero init) + nn.init.orthogonal_(self.kv_bank.data[i], gain=1.0) # K + nn.init.orthogonal_(self.kv_bank.data[n + i], gain=1.0) # V + nn.init.orthogonal_(self.mlp_up_bank.data[i], gain=1.0) # MLP up + nn.init.zeros_(self.mlp_down_bank.data[i]) # MLP down (zero init) + self.qo_bank.data[n + i].mul_(proj_scale) + self.mlp_down_bank.data[i].mul_(proj_scale) + for name, module in self.named_modules(): + if isinstance(module, nn.Linear): + if getattr(module, "_zero_init", False): + nn.init.zeros_(module.weight) + elif module.weight.ndim == 2 and module.weight.shape[0] >= 64 and module.weight.shape[1] >= 64: + nn.init.orthogonal_(module.weight, gain=1.0) + def _get_bank_weights(self, bi: int) -> tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]: + n = self.num_layers + q_w = self.qo_bank[bi] + out_w = self.qo_bank[n + bi] + k_w = self.kv_bank[bi] + v_w = self.kv_bank[n + bi] + up_w = self.mlp_up_bank[bi] + down_w = self.mlp_down_bank[bi] + if self.core_quant_enabled and self.training and self.core_start <= bi < self.core_end: + q_w = _fake_quantize(q_w, self.core_quant_bits) + out_w = _fake_quantize(out_w, self.core_quant_bits) + k_w = _fake_quantize(k_w, self.core_quant_bits) + v_w = _fake_quantize(v_w, self.core_quant_bits) + up_w = _fake_quantize(up_w, self.core_quant_bits) + down_w = _fake_quantize(down_w, self.core_quant_bits) + return q_w, k_w, v_w, out_w, up_w, down_w + def _forward_hidden(self, input_ids: Tensor, feedback_fn=None, + stabilizer=None) -> tuple[Tensor, Tensor, Tensor]: + n = self.num_layers + x = 
self.tok_emb(input_ids) + x = F.rms_norm(x, (x.size(-1),)) + x = self.smear(x) + x0 = x + skips: list[Tensor] = [] + for i in range(self.core_start): + q_w, k_w, v_w, out_w, up_w, down_w = self._get_bank_weights(i) + x, _ = self.blocks[i](x, x0, q_w, k_w, v_w, out_w, up_w, down_w) + skips.append(x) + h_core_in = x + for k in range(self.num_passes): + if k > 0 and self.interpass_rmsnorm: + x = F.rms_norm(x, (x.size(-1),)) + if feedback_fn is not None: + x = x + feedback_fn(x, k) + if stabilizer is not None: + x = stabilizer.clip(x) + x_before_pass = x + for j in range(self.core_start, self.core_end): + h_prev = x + q_w, k_w, v_w, out_w, up_w, down_w = self._get_bank_weights(j) + x, _ = self.blocks[j](x, x0, q_w, k_w, v_w, out_w, up_w, down_w) + if stabilizer is not None and self.training and not torch.compiler.is_compiling(): + stabilizer.record_pass(h_prev, x) + if self.residual_scale is not None and k > 0: + delta = x - x_before_pass + x = x_before_pass + self.residual_scale(delta, k) + h_core_out = x + for i in range(self.core_end, n): + ti = i - self.core_end + if ti < len(skips): + x = x + self.skip_weights[ti].to(dtype=x.dtype)[None, None, :] * skips.pop() + q_w, k_w, v_w, out_w, up_w, down_w = self._get_bank_weights(i) + x, _ = self.blocks[i](x, x0, q_w, k_w, v_w, out_w, up_w, down_w) + x = self.final_norm(x) + return x, h_core_in, h_core_out + def forward(self, input_ids: Tensor, target_ids: Tensor, + feedback_fn=None, stabilizer=None) -> Tensor: + x, h_core_in, h_core_out = self._forward_hidden(input_ids, feedback_fn, stabilizer) + x_flat = x.reshape(-1, x.size(-1)) + targets = target_ids.reshape(-1) + if self.tie_embeddings: + logits_proj = F.linear(x_flat, self.tok_emb.weight) + else: + if self.lm_head is None: + raise RuntimeError("lm_head is required when tie_embeddings=False") + logits_proj = self.lm_head(x_flat) + logits = self.logit_softcap * torch.tanh(logits_proj / self.logit_softcap) + main_loss = F.cross_entropy(logits.float(), targets, reduction="mean") + if stabilizer is not None and stabilizer.jacobian_proxy_weight > 0: + main_loss = main_loss + stabilizer.jacobian_proxy_loss(h_core_in, h_core_out) + return main_loss + def forward_logits(self, input_ids: Tensor, + feedback_fn=None, stabilizer=None) -> Tensor: + """Return logits (bsz, seq_len, vocab) without computing loss.""" + x, _, _ = self._forward_hidden(input_ids, feedback_fn, stabilizer) + if self.tie_embeddings: + logits_proj = F.linear(x, self.tok_emb.weight) + else: + logits_proj = self.lm_head(x) + return self.logit_softcap * torch.tanh(logits_proj / self.logit_softcap) def eval_val_sliding( - args: Hyperparameters, - base_model: nn.Module, - rank: int, - world_size: int, - device: torch.device, - val_tokens: Tensor, - base_bytes_lut: Tensor, - has_leading_space_lut: Tensor, - is_boundary_token_lut: Tensor, - stride: int, - batch_seqs: int = 32, - eval_seq_len: int | None = None, + args: Hyperparameters, + base_model: nn.Module, + rank: int, + world_size: int, + device: torch.device, + val_tokens: Tensor, + base_bytes_lut: Tensor, + has_leading_space_lut: Tensor, + is_boundary_token_lut: Tensor, + stride: int, + batch_seqs: int = 32, + eval_seq_len: int | None = None, ) -> tuple[float, float]: - """Sliding window evaluation: each token scored with maximum context.""" - seq_len = eval_seq_len or args.train_seq_len - total_tokens = val_tokens.numel() - 1 - window_starts = [ws for ws in range(0, total_tokens, stride) - if min(ws + seq_len, total_tokens) - ws >= 1] - total_windows = len(window_starts) - my_s = (total_windows * rank) // world_size - my_e = (total_windows 
def eval_val_sliding(
-        args: Hyperparameters,
-        base_model: nn.Module,
-        rank: int,
-        world_size: int,
-        device: torch.device,
-        val_tokens: Tensor,
-        base_bytes_lut: Tensor,
-        has_leading_space_lut: Tensor,
-        is_boundary_token_lut: Tensor,
-        stride: int,
-        batch_seqs: int = 32,
-        eval_seq_len: int | None = None,
+    args: Hyperparameters,
+    base_model: nn.Module,
+    rank: int,
+    world_size: int,
+    device: torch.device,
+    val_tokens: Tensor,
+    base_bytes_lut: Tensor,
+    has_leading_space_lut: Tensor,
+    is_boundary_token_lut: Tensor,
+    stride: int,
+    batch_seqs: int = 32,
+    eval_seq_len: int | None = None,
) -> tuple[float, float]:
-    """Sliding window evaluation: each token scored with maximum context."""
-    seq_len = eval_seq_len or args.train_seq_len
-    total_tokens = val_tokens.numel() - 1
-    window_starts = [ws for ws in range(0, total_tokens, stride)
-                     if min(ws + seq_len, total_tokens) - ws >= 1]
-    total_windows = len(window_starts)
-    my_s = (total_windows * rank) // world_size
-    my_e = (total_windows * (rank + 1)) // world_size
-    my_windows = window_starts[my_s:my_e]
-    loss_sum = torch.zeros((), device=device, dtype=torch.float64)
-    token_count = torch.zeros((), device=device, dtype=torch.float64)
-    byte_count = torch.zeros((), device=device, dtype=torch.float64)
-    base_model.eval()
-    compiled_logits = torch.compile(base_model.forward_logits, dynamic=False, fullgraph=True)
-    with torch.inference_mode():
-        for bi in range(0, len(my_windows), batch_seqs):
-            batch_ws = my_windows[bi:bi + batch_seqs]
-            bsz = len(batch_ws)
-            x_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device)
-            y_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device)
-            wlens: list[int] = []
-            for i, ws in enumerate(batch_ws):
-                end = min(ws + seq_len, total_tokens)
-                wlen = end - ws
-                wlens.append(wlen)
-                chunk = val_tokens[ws:end + 1].to(dtype=torch.int64, device=device)
-                x_batch[i, :wlen] = chunk[:-1]
-                y_batch[i, :wlen] = chunk[1:]
-            with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
-                logits = compiled_logits(x_batch)
-            nll = F.cross_entropy(
-                logits.reshape(-1, logits.size(-1)).float(),
-                y_batch.reshape(-1),
-                reduction="none",
-            ).reshape(bsz, seq_len)
-            for i, ws in enumerate(batch_ws):
-                wlen = wlens[i]
-                s = 0 if ws == 0 else max(wlen - stride, 0)
-                scored_nll = nll[i, s:wlen].to(torch.float64)
-                loss_sum += scored_nll.sum()
-                token_count += float(wlen - s)
-                tgt = y_batch[i, s:wlen]
-                prev = x_batch[i, s:wlen]
-                tb = base_bytes_lut[tgt].to(torch.float64)
-                tb += (has_leading_space_lut[tgt] & ~is_boundary_token_lut[prev]).to(torch.float64)
-                byte_count += tb.sum()
-    if dist.is_available() and dist.is_initialized():
-        dist.all_reduce(loss_sum, op=dist.ReduceOp.SUM)
-        dist.all_reduce(token_count, op=dist.ReduceOp.SUM)
-        dist.all_reduce(byte_count, op=dist.ReduceOp.SUM)
-    val_loss = (loss_sum / token_count).item()
-    bits_per_token = val_loss / math.log(2.0)
-    tokens_per_byte = token_count.item() / byte_count.item()
-    base_model.train()
-    return val_loss, bits_per_token * tokens_per_byte
-
-
+    """Sliding window evaluation: each token scored with maximum context."""
+    seq_len = eval_seq_len or args.train_seq_len
+    total_tokens = val_tokens.numel() - 1
+    window_starts = [ws for ws in range(0, total_tokens, stride)
+                     if min(ws + seq_len, total_tokens) - ws >= 1]
+    total_windows = len(window_starts)
+    my_s = (total_windows * rank) // world_size
+    my_e = (total_windows * (rank + 1)) // world_size
+    my_windows = window_starts[my_s:my_e]
+    loss_sum = torch.zeros((), device=device, dtype=torch.float64)
+    token_count = torch.zeros((), device=device, dtype=torch.float64)
+    byte_count = torch.zeros((), device=device, dtype=torch.float64)
+    base_model.eval()
+    compiled_logits = torch.compile(base_model.forward_logits, dynamic=False, fullgraph=True)
+    with torch.inference_mode():
+        for bi in range(0, len(my_windows), batch_seqs):
+            batch_ws = my_windows[bi:bi + batch_seqs]
+            bsz = len(batch_ws)
+            x_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device)
+            y_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device)
+            wlens: list[int] = []
+            for i, ws in enumerate(batch_ws):
+                end = min(ws + seq_len, total_tokens)
+                wlen = end - ws
+                wlens.append(wlen)
+                chunk = val_tokens[ws:end + 1].to(dtype=torch.int64, device=device)
+                x_batch[i, :wlen] = chunk[:-1]
+                y_batch[i, :wlen] = chunk[1:]
+            with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
+                logits = compiled_logits(x_batch)
+            nll = F.cross_entropy(
+                logits.reshape(-1, logits.size(-1)).float(),
+                y_batch.reshape(-1),
+                reduction="none",
+            ).reshape(bsz, seq_len)
+            for i, ws in enumerate(batch_ws):
+                wlen = wlens[i]
+                s = 0 if ws == 0 else max(wlen - stride, 0)
+                scored_nll = nll[i, s:wlen].to(torch.float64)
+                loss_sum += scored_nll.sum()
+                token_count += float(wlen - s)
+                tgt = y_batch[i, s:wlen]
+                prev = x_batch[i, s:wlen]
+                tb = base_bytes_lut[tgt].to(torch.float64)
+                tb += (has_leading_space_lut[tgt] & ~is_boundary_token_lut[prev]).to(torch.float64)
+                byte_count += tb.sum()
+    if dist.is_available() and dist.is_initialized():
+        dist.all_reduce(loss_sum, op=dist.ReduceOp.SUM)
+        dist.all_reduce(token_count, op=dist.ReduceOp.SUM)
+        dist.all_reduce(byte_count, op=dist.ReduceOp.SUM)
+    val_loss = (loss_sum / token_count).item()
+    bits_per_token = val_loss / math.log(2.0)
+    tokens_per_byte = token_count.item() / byte_count.item()
+    base_model.train()
+    return val_loss, bits_per_token * tokens_per_byte
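+# Restatement of the BPB conversion above as a standalone helper (illustrative,
+# not called anywhere): mean token NLL in nats -> bits per token, rescaled by
+# tokens-per-byte to give bits per UTF-8 byte.
+def nll_to_bpb(mean_nll_nats: float, tokens: float, data_bytes: float) -> float:
+    bits_per_token = mean_nll_nats / math.log(2.0)
+    return bits_per_token * (tokens / data_bytes)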
def eval_val_sliding_ttt(
-        args: Hyperparameters, base_model: nn.Module, rank: int, world_size: int,
-        device: torch.device, val_tokens: Tensor, base_bytes_lut: Tensor,
-        has_leading_space_lut: Tensor, is_boundary_token_lut: Tensor,
-        stride: int, batch_seqs: int = 32, log0=print,
+    args: Hyperparameters, base_model: nn.Module, rank: int, world_size: int,
+    device: torch.device, val_tokens: Tensor, base_bytes_lut: Tensor,
+    has_leading_space_lut: Tensor, is_boundary_token_lut: Tensor,
+    stride: int, batch_seqs: int = 32, log0=print,
) -> tuple[float, float]:
-    """Legal score-first TTT (PR #461 recipe): score each chunk with sliding windows,
-    then train on it. Every token scored BEFORE any update that could use it."""
-    seq_len = args.train_seq_len
-    total_tokens = val_tokens.numel() - 1
-    ttt_chunk = args.ttt_chunk_tokens
-
-    # Pre-compute all window starts
-    window_starts = [ws for ws in range(0, total_tokens, stride)
-                     if min(ws + seq_len, total_tokens) - ws >= stride or ws == 0]
-
-    # Assign each window to a chunk based on the first token it scores
-    num_chunks = (total_tokens + ttt_chunk - 1) // ttt_chunk
-    chunk_windows: list[list[int]] = [[] for _ in range(num_chunks)]
-    for ws in window_starts:
-        end = min(ws + seq_len, total_tokens)
-        wlen = end - ws
-        s = 0 if ws == 0 else max(wlen - stride, 0)
-        scored_start = ws + s
-        ci = min(scored_start // ttt_chunk, num_chunks - 1)
-        chunk_windows[ci].append(ws)
-
-    log0(f"ttt_sliding:start chunks={num_chunks} chunk_tokens={ttt_chunk} "
-         f"total_windows={len(window_starts)} stride={stride} "
-         f"ttt_lr={args.ttt_lr} ttt_epochs={args.ttt_epochs} "
-         f"freeze_blocks={args.ttt_freeze_blocks}")
-
-    loss_sum = torch.zeros((), device=device, dtype=torch.float64)
-    token_count = torch.zeros((), device=device, dtype=torch.float64)
-    byte_count = torch.zeros((), device=device, dtype=torch.float64)
-
-    # Freeze first N blocks
-    frozen_block_ids = set(range(min(args.ttt_freeze_blocks, len(base_model.blocks))))
-    ttt_params = []
-    for name, p in base_model.named_parameters():
-        freeze = False
-        for bi in frozen_block_ids:
-            if f"blocks.{bi}." in name:
-                freeze = True
-                break
-        if freeze:
-            p.requires_grad_(False)
-        else:
-            p.requires_grad_(True)
-            ttt_params.append(p)
-
-    log0(f"ttt_sliding:params unfrozen={sum(p.numel() for p in ttt_params)} "
-         f"frozen={sum(p.numel() for p in base_model.parameters() if not p.requires_grad)}")
-
-    optimizer = torch.optim.SGD(ttt_params, lr=args.ttt_lr, momentum=args.ttt_momentum)
-    t0 = time.perf_counter()
-
-    for ci in range(num_chunks):
-        windows = chunk_windows[ci]
-        if not windows:
-            continue
-        chunk_start = ci * ttt_chunk
-        chunk_end = min((ci + 1) * ttt_chunk, total_tokens)
-
-        # --- Phase 1: SCORE this chunk's windows (inference_mode) ---
-        my_s = (len(windows) * rank) // world_size
-        my_e = (len(windows) * (rank + 1)) // world_size
-        my_windows = windows[my_s:my_e]
-
-        base_model.eval()
-        with torch.inference_mode():
-            for bi in range(0, len(my_windows), batch_seqs):
-                batch_ws = my_windows[bi:bi + batch_seqs]
-                bsz = len(batch_ws)
-                x_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device)
-                y_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device)
-                wlens: list[int] = []
-                for i, ws in enumerate(batch_ws):
-                    end = min(ws + seq_len, total_tokens)
-                    wlen = end - ws
-                    wlens.append(wlen)
-                    chunk_tok = val_tokens[ws:end + 1].to(dtype=torch.int64, device=device)
-                    x_batch[i, :wlen] = chunk_tok[:-1]
-                    y_batch[i, :wlen] = chunk_tok[1:]
-                with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
-                    logits = base_model.forward_logits(x_batch)
-                nll = F.cross_entropy(
-                    logits.reshape(-1, logits.size(-1)).float(),
-                    y_batch.reshape(-1), reduction="none",
-                ).reshape(bsz, seq_len)
-                for i, ws in enumerate(batch_ws):
-                    wlen = wlens[i]
-                    s = 0 if ws == 0 else max(wlen - stride, 0)
-                    scored_nll = nll[i, s:wlen].to(torch.float64)
-                    loss_sum += scored_nll.sum()
-                    token_count += float(wlen - s)
-                    tgt, prev = y_batch[i, s:wlen], x_batch[i, s:wlen]
-                    tb = base_bytes_lut[tgt].to(torch.float64)
-                    tb += (has_leading_space_lut[tgt] & ~is_boundary_token_lut[prev]).to(torch.float64)
-                    byte_count += tb.sum()
-
-        # --- Phase 2: TRAIN on this chunk (already scored = legal) ---
-        is_last_chunk = (ci == num_chunks - 1)
-        if not is_last_chunk and args.ttt_epochs > 0:
-            base_model.train()
-            chunk_seqs = (chunk_end - chunk_start) // seq_len
-            if chunk_seqs > 0:
-                cos_lr = args.ttt_lr * 0.5 * (1.0 + math.cos(math.pi * ci / max(num_chunks - 1, 1)))
-                for pg in optimizer.param_groups:
-                    pg['lr'] = cos_lr
-                my_seq_s = (chunk_seqs * rank) // world_size
-                my_seq_e = (chunk_seqs * (rank + 1)) // world_size
-                my_chunk_seqs = my_seq_e - my_seq_s
-                for _ep in range(args.ttt_epochs):
-                    for bs in range(0, my_chunk_seqs, args.ttt_batch_seqs):
-                        be = min(bs + args.ttt_batch_seqs, my_chunk_seqs)
-                        actual_bs = my_seq_s + bs
-                        start_tok = chunk_start + actual_bs * seq_len
-                        end_tok = chunk_start + (my_seq_s + be) * seq_len + 1
-                        if end_tok > val_tokens.numel():
-                            continue
-                        local = val_tokens[start_tok:end_tok].to(device=device, dtype=torch.int64)
-                        x = local[:-1].reshape(-1, seq_len)
-                        y = local[1:].reshape(-1, seq_len)
-                        optimizer.zero_grad(set_to_none=True)
-                        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
-                            loss = base_model(x, y)
-                        loss.backward()
-                        if world_size > 1:
-                            for p in ttt_params:
-                                if p.grad is not None:
-                                    dist.all_reduce(p.grad, op=dist.ReduceOp.AVG)
-                        torch.nn.utils.clip_grad_norm_(ttt_params, args.ttt_grad_clip)
-                        optimizer.step()
-
-        if rank == 0 and (ci % 10 == 0 or ci == num_chunks - 1):
-            elapsed = time.perf_counter() - t0
-            rl = loss_sum.item() / max(token_count.item(), 1)
-            rbpb = rl / math.log(2.0) * (token_count.item() / max(byte_count.item(), 1)) if token_count.item() > 0 else 0.0
-            log0(f"  ttt_chunk [{ci+1}/{num_chunks}] bpb={rbpb:.6f} time={elapsed:.1f}s")
-
-    if dist.is_available() and dist.is_initialized():
-        dist.all_reduce(loss_sum, op=dist.ReduceOp.SUM)
-        dist.all_reduce(token_count, op=dist.ReduceOp.SUM)
-        dist.all_reduce(byte_count, op=dist.ReduceOp.SUM)
-
-    val_loss = (loss_sum / token_count).item()
-    val_bpb = val_loss / math.log(2.0) * (token_count.item() / byte_count.item())
-
-    for p in base_model.parameters():
-        p.requires_grad_(True)
-    base_model.eval()
-
-    log0(f"ttt_sliding:done val_loss={val_loss:.6f} val_bpb={val_bpb:.6f} "
-         f"elapsed={time.perf_counter() - t0:.1f}s")
-    return val_loss, val_bpb
-
-
-# --- GPTQ-lite int6 quantization ---
-
+    """Legal score-first TTT (PR #461 recipe): score each chunk with sliding windows,
+    then train on it. Every token scored BEFORE any update that could use it."""
+    seq_len = args.train_seq_len
+    total_tokens = val_tokens.numel() - 1
+    ttt_chunk = args.ttt_chunk_tokens
+    window_starts = [ws for ws in range(0, total_tokens, stride)
+                     if min(ws + seq_len, total_tokens) - ws >= stride or ws == 0]
+    num_chunks = (total_tokens + ttt_chunk - 1) // ttt_chunk
+    chunk_windows: list[list[int]] = [[] for _ in range(num_chunks)]
+    for ws in window_starts:
+        end = min(ws + seq_len, total_tokens)
+        wlen = end - ws
+        s = 0 if ws == 0 else max(wlen - stride, 0)
+        scored_start = ws + s
+        ci = min(scored_start // ttt_chunk, num_chunks - 1)
+        chunk_windows[ci].append(ws)
+    log0(f"ttt_sliding:start chunks={num_chunks} chunk_tokens={ttt_chunk} "
+         f"total_windows={len(window_starts)} stride={stride} "
+         f"ttt_lr={args.ttt_lr} ttt_epochs={args.ttt_epochs} "
+         f"freeze_blocks={args.ttt_freeze_blocks}")
+    loss_sum = torch.zeros((), device=device, dtype=torch.float64)
+    token_count = torch.zeros((), device=device, dtype=torch.float64)
+    byte_count = torch.zeros((), device=device, dtype=torch.float64)
+    frozen_block_ids = set(range(min(args.ttt_freeze_blocks, len(base_model.blocks))))
+    ttt_params = []
+    for name, p in base_model.named_parameters():
+        freeze = False
+        for bi in frozen_block_ids:
+            if f"blocks.{bi}." in name:
+                freeze = True
+                break
+        if freeze:
+            p.requires_grad_(False)
+        else:
+            p.requires_grad_(True)
+            ttt_params.append(p)
+    log0(f"ttt_sliding:params unfrozen={sum(p.numel() for p in ttt_params)} "
+         f"frozen={sum(p.numel() for p in base_model.parameters() if not p.requires_grad)}")
+    optimizer = torch.optim.SGD(ttt_params, lr=args.ttt_lr, momentum=args.ttt_momentum)
+    t0 = time.perf_counter()
+    for ci in range(num_chunks):
+        windows = chunk_windows[ci]
+        if not windows:
+            continue
+        chunk_start = ci * ttt_chunk
+        chunk_end = min((ci + 1) * ttt_chunk, total_tokens)
+        my_s = (len(windows) * rank) // world_size
+        my_e = (len(windows) * (rank + 1)) // world_size
+        my_windows = windows[my_s:my_e]
+        base_model.eval()
+        with torch.inference_mode():
+            for bi in range(0, len(my_windows), batch_seqs):
+                batch_ws = my_windows[bi:bi + batch_seqs]
+                bsz = len(batch_ws)
+                x_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device)
+                y_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device)
+                wlens: list[int] = []
+                for i, ws in enumerate(batch_ws):
+                    end = min(ws + seq_len, total_tokens)
+                    wlen = end - ws
+                    wlens.append(wlen)
+                    chunk_tok = val_tokens[ws:end + 1].to(dtype=torch.int64, device=device)
+                    x_batch[i, :wlen] = chunk_tok[:-1]
+                    y_batch[i, :wlen] = chunk_tok[1:]
+                with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
+                    logits = base_model.forward_logits(x_batch)
+                nll = F.cross_entropy(
+                    logits.reshape(-1, logits.size(-1)).float(),
+                    y_batch.reshape(-1), reduction="none",
+                ).reshape(bsz, seq_len)
+                for i, ws in enumerate(batch_ws):
+                    wlen = wlens[i]
+                    s = 0 if ws == 0 else max(wlen - stride, 0)
+                    scored_nll = nll[i, s:wlen].to(torch.float64)
+                    loss_sum += scored_nll.sum()
+                    token_count += float(wlen - s)
+                    tgt, prev = y_batch[i, s:wlen], x_batch[i, s:wlen]
+                    tb = base_bytes_lut[tgt].to(torch.float64)
+                    tb += (has_leading_space_lut[tgt] & ~is_boundary_token_lut[prev]).to(torch.float64)
+                    byte_count += tb.sum()
+        is_last_chunk = (ci == num_chunks - 1)
+        if not is_last_chunk and args.ttt_epochs > 0:
+            base_model.train()
+            chunk_seqs = (chunk_end - chunk_start) // seq_len
+            if chunk_seqs > 0:
+                cos_lr = args.ttt_lr * 0.5 * (1.0 + math.cos(math.pi * ci / max(num_chunks - 1, 1)))
+                for pg in optimizer.param_groups:
+                    pg['lr'] = cos_lr
+                my_seq_s = (chunk_seqs * rank) // world_size
+                my_seq_e = (chunk_seqs * (rank + 1)) // world_size
+                my_chunk_seqs = my_seq_e - my_seq_s
+                for _ep in range(args.ttt_epochs):
+                    for bs in range(0, my_chunk_seqs, args.ttt_batch_seqs):
+                        be = min(bs + args.ttt_batch_seqs, my_chunk_seqs)
+                        actual_bs = my_seq_s + bs
+                        start_tok = chunk_start + actual_bs * seq_len
+                        end_tok = chunk_start + (my_seq_s + be) * seq_len + 1
+                        if end_tok > val_tokens.numel():
+                            continue
+                        local = val_tokens[start_tok:end_tok].to(device=device, dtype=torch.int64)
+                        x = local[:-1].reshape(-1, seq_len)
+                        y = local[1:].reshape(-1, seq_len)
+                        optimizer.zero_grad(set_to_none=True)
+                        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
+                            loss = base_model(x, y)
+                        loss.backward()
+                        if world_size > 1:
+                            for p in ttt_params:
+                                if p.grad is not None:
+                                    dist.all_reduce(p.grad, op=dist.ReduceOp.AVG)
+                        torch.nn.utils.clip_grad_norm_(ttt_params, args.ttt_grad_clip)
+                        optimizer.step()
+        if rank == 0 and (ci % 10 == 0 or ci == num_chunks - 1):
+            elapsed = time.perf_counter() - t0
+            rl = loss_sum.item() / max(token_count.item(), 1)
+            rbpb = rl / math.log(2.0) * (token_count.item() / max(byte_count.item(), 1)) if token_count.item() > 0 else 0.0
+            log0(f"  ttt_chunk [{ci+1}/{num_chunks}] bpb={rbpb:.6f} time={elapsed:.1f}s")
+    if dist.is_available() and dist.is_initialized():
+        dist.all_reduce(loss_sum, op=dist.ReduceOp.SUM)
+        dist.all_reduce(token_count, op=dist.ReduceOp.SUM)
+        dist.all_reduce(byte_count, op=dist.ReduceOp.SUM)
+    val_loss = (loss_sum / token_count).item()
+    val_bpb = val_loss / math.log(2.0) * (token_count.item() / byte_count.item())
+    for p in base_model.parameters():
+        p.requires_grad_(True)
+    base_model.eval()
+    log0(f"ttt_sliding:done val_loss={val_loss:.6f} val_bpb={val_bpb:.6f} "
+         f"elapsed={time.perf_counter() - t0:.1f}s")
+    return val_loss, val_bpb
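+# The per-chunk cosine schedule used above, factored out for clarity
+# (illustrative helper, not wired into the loop): lr sweeps from ttt_lr at
+# chunk 0 down to ~0 at the final trained chunk.
+def chunk_cosine_lr(ttt_lr: float, ci: int, num_chunks: int) -> float:
+    return ttt_lr * 0.5 * (1.0 + math.cos(math.pi * ci / max(num_chunks - 1, 1)))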
def _classify_param(name: str) -> str:
-    if "tok_emb" in name or "lm_head" in name:
-        return "embed"
-    if ".mlp." in name:
-        return "mlp"
-    if ".attn." in name or (".proj." in name and ".mlp." not in name):
-        return "attn"
-    return "other"
-
+    if "tok_emb" in name or "lm_head" in name:
+        return "embed"
+    if ".mlp." in name:
+        return "mlp"
+    if ".attn." in name or (".proj." in name and ".mlp." not in name):
+        return "attn"
+    return "other"
def _extract_layer_idx(name: str) -> int | None:
-    if not name.startswith("blocks."):
-        return None
-    parts = name.split(".")
-    if len(parts) >= 2 and parts[1].isdigit():
-        return int(parts[1])
-    return None
+    if not name.startswith("blocks."):
+        return None
+    parts = name.split(".")
+    if len(parts) >= 2 and parts[1].isdigit():
+        return int(parts[1])
+    return None
def quantize_int6_per_row(t: Tensor, clip_range: int = 31) -> tuple[Tensor, Tensor]:
-    t32 = t.float()
-    if t32.ndim == 2:
-        best_q, best_s, best_err = None, None, float('inf')
-        for pct in [0.9990, 0.9995, 0.9999, 0.99999, 1.0]:
-            if pct < 1.0:
-                row_clip = torch.quantile(t32.abs(), pct, dim=1)
-            else:
-                row_clip = t32.abs().amax(dim=1)
-            s = (row_clip / clip_range).clamp_min(1.0 / clip_range).to(torch.float16)
-            q = torch.clamp(torch.round(t32 / s.float()[:, None]), -clip_range, clip_range).to(torch.int8)
-            recon = q.float() * s.float()[:, None]
-            err = (t32 - recon).pow(2).mean().item()
-            if err < best_err:
-                best_q, best_s, best_err = q, s, err
-        return best_q, best_s
-    amax = t32.abs().max().item()
-    scale = torch.tensor(amax / clip_range if amax > 0 else 1.0, dtype=torch.float16)
-    q = torch.clamp(torch.round(t32 / scale.float()), -clip_range, clip_range).to(torch.int8)
-    return q, scale
-
+    t32 = t.float()
+    if t32.ndim == 2:
+        best_q, best_s, best_err = None, None, float('inf')
+        for pct in [0.9990, 0.9995, 0.9999, 0.99999, 1.0]:
+            if pct < 1.0:
+                row_clip = torch.quantile(t32.abs(), pct, dim=1)
+            else:
+                row_clip = t32.abs().amax(dim=1)
+            s = (row_clip / clip_range).clamp_min(1.0 / clip_range).to(torch.float16)
+            q = torch.clamp(torch.round(t32 / s.float()[:, None]), -clip_range, clip_range).to(torch.int8)
+            recon = q.float() * s.float()[:, None]
+            err = (t32 - recon).pow(2).mean().item()
+            if err < best_err:
+                best_q, best_s, best_err = q, s, err
+        return best_q, best_s
+    amax = t32.abs().max().item()
+    scale = torch.tensor(amax / clip_range if amax > 0 else 1.0, dtype=torch.float16)
+    q = torch.clamp(torch.round(t32 / scale.float()), -clip_range, clip_range).to(torch.int8)
+    return q, scale
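+# Usage sketch for the per-row int6 quantizer (illustrative, 2D weights only):
+# quantize, reconstruct, and report the relative error that the recurrent
+# core would see amplified across passes.
+def int6_roundtrip_rel_err(w: Tensor) -> float:
+    q, s = quantize_int6_per_row(w)
+    recon = q.float() * s.float()[:, None]
+    return ((w.float() - recon).norm() / w.float().norm()).item()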
def _unbank_state_dict(sd: dict[str, Tensor], num_layers: int) -> dict[str, Tensor]:
-    """Convert 3D bank tensors into individual 2D tensors with standard names."""
-    out: dict[str, Tensor] = {}
-    n = num_layers
-    for name, tensor in sd.items():
-        if name == "qo_bank":
-            for i in range(n):
-                out[f"blocks.{i}.attn.c_q.weight"] = tensor[i]
-                out[f"blocks.{i}.attn.proj.weight"] = tensor[n + i]
-        elif name == "kv_bank":
-            for i in range(n):
-                out[f"blocks.{i}.attn.c_k.weight"] = tensor[i]
-                out[f"blocks.{i}.attn.c_v.weight"] = tensor[n + i]
-        elif name == "mlp_up_bank":
-            for i in range(n):
-                out[f"blocks.{i}.mlp.fc.weight"] = tensor[i]
-        elif name == "mlp_down_bank":
-            for i in range(n):
-                out[f"blocks.{i}.mlp.proj.weight"] = tensor[i]
-        else:
-            out[name] = tensor
-    return out
-
+    """Convert 3D bank tensors into individual 2D tensors with standard names."""
+    out: dict[str, Tensor] = {}
+    n = num_layers
+    for name, tensor in sd.items():
+        if name == "qo_bank":
+            for i in range(n):
+                out[f"blocks.{i}.attn.c_q.weight"] = tensor[i]
+                out[f"blocks.{i}.attn.proj.weight"] = tensor[n + i]
+        elif name == "kv_bank":
+            for i in range(n):
+                out[f"blocks.{i}.attn.c_k.weight"] = tensor[i]
+                out[f"blocks.{i}.attn.c_v.weight"] = tensor[n + i]
+        elif name == "mlp_up_bank":
+            for i in range(n):
+                out[f"blocks.{i}.mlp.fc.weight"] = tensor[i]
+        elif name == "mlp_down_bank":
+            for i in range(n):
+                out[f"blocks.{i}.mlp.proj.weight"] = tensor[i]
+        else:
+            out[name] = tensor
+    return out
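+# Sanity-check sketch (illustrative, not part of the export path): unbank
+# followed by rebank should be the identity on a banked state dict, which is
+# what lets export quantize per 2D row and still reload into the 3D banks.
+def _banks_roundtrip_ok(sd: dict[str, Tensor], num_layers: int) -> bool:
+    rt = _rebank_state_dict(_unbank_state_dict(sd, num_layers), num_layers, sd)
+    return all(torch.equal(rt[k], sd[k]) for k in sd)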
def _rebank_state_dict(sd: dict[str, Tensor], num_layers: int, template_sd: dict[str, Tensor]) -> dict[str, Tensor]:
-    """Convert individual 2D tensors back into 3D bank tensors."""
-    out: dict[str, Tensor] = {}
-    n = num_layers
-    # Reconstruct banks from individual weight keys
-    qo_slices = [None] * (2 * n)
-    kv_slices = [None] * (2 * n)
-    up_slices = [None] * n
-    down_slices = [None] * n
-    consumed = set()
-    for i in range(n):
-        qk = f"blocks.{i}.attn.c_q.weight"
-        if qk in sd:
-            qo_slices[i] = sd[qk]
-            consumed.add(qk)
-        ok = f"blocks.{i}.attn.proj.weight"
-        if ok in sd:
-            qo_slices[n + i] = sd[ok]
-            consumed.add(ok)
-        kk = f"blocks.{i}.attn.c_k.weight"
-        if kk in sd:
-            kv_slices[i] = sd[kk]
-            consumed.add(kk)
-        vk = f"blocks.{i}.attn.c_v.weight"
-        if vk in sd:
-            kv_slices[n + i] = sd[vk]
-            consumed.add(vk)
-        fk = f"blocks.{i}.mlp.fc.weight"
-        if fk in sd:
-            up_slices[i] = sd[fk]
-            consumed.add(fk)
-        dk = f"blocks.{i}.mlp.proj.weight"
-        if dk in sd:
-            down_slices[i] = sd[dk]
-            consumed.add(dk)
-    out["qo_bank"] = torch.stack(qo_slices).to(dtype=template_sd["qo_bank"].dtype)
-    out["kv_bank"] = torch.stack(kv_slices).to(dtype=template_sd["kv_bank"].dtype)
-    out["mlp_up_bank"] = torch.stack(up_slices).to(dtype=template_sd["mlp_up_bank"].dtype)
-    out["mlp_down_bank"] = torch.stack(down_slices).to(dtype=template_sd["mlp_down_bank"].dtype)
-    for name, tensor in sd.items():
-        if name not in consumed:
-            out[name] = tensor
-    return out
-
+    """Convert individual 2D tensors back into 3D bank tensors."""
+    out: dict[str, Tensor] = {}
+    n = num_layers
+    qo_slices = [None] * (2 * n)
+    kv_slices = [None] * (2 * n)
+    up_slices = [None] * n
+    down_slices = [None] * n
+    consumed = set()
+    for i in range(n):
+        qk = f"blocks.{i}.attn.c_q.weight"
+        if qk in sd:
+            qo_slices[i] = sd[qk]
+            consumed.add(qk)
+        ok = f"blocks.{i}.attn.proj.weight"
+        if ok in sd:
+            qo_slices[n + i] = sd[ok]
+            consumed.add(ok)
+        kk = f"blocks.{i}.attn.c_k.weight"
+        if kk in sd:
+            kv_slices[i] = sd[kk]
+            consumed.add(kk)
+        vk = f"blocks.{i}.attn.c_v.weight"
+        if vk in sd:
+            kv_slices[n + i] = sd[vk]
+            consumed.add(vk)
+        fk = f"blocks.{i}.mlp.fc.weight"
+        if fk in sd:
+            up_slices[i] = sd[fk]
+            consumed.add(fk)
+        dk = f"blocks.{i}.mlp.proj.weight"
+        if dk in sd:
+            down_slices[i] = sd[dk]
+            consumed.add(dk)
+    out["qo_bank"] = torch.stack(qo_slices).to(dtype=template_sd["qo_bank"].dtype)
+    out["kv_bank"] = torch.stack(kv_slices).to(dtype=template_sd["kv_bank"].dtype)
+    out["mlp_up_bank"] = torch.stack(up_slices).to(dtype=template_sd["mlp_up_bank"].dtype)
+    out["mlp_down_bank"] = torch.stack(down_slices).to(dtype=template_sd["mlp_down_bank"].dtype)
+    for name, tensor in sd.items():
+        if name not in consumed:
+            out[name] = tensor
+    return out
def mixed_quantize_int6(state_dict: dict[str, Tensor], int6_cats: set[str],
-                        core_start: int = -1, core_end: int = -1):
+        core_start: int = -1, core_end: int = -1):
-    num_layers_total = max(
-        (int(k.split(".")[1]) for k in state_dict if k.startswith("blocks.")),
-        default=0,
-    ) + 1
-    late_k_layers = set(range(num_layers_total - 2, num_layers_total))
-    result: dict[str, Tensor] = {}
-    meta: dict[str, object] = {}
-    for name, tensor in state_dict.items():
-        t = tensor.detach().cpu().contiguous()
-        cat = _classify_param(name)
-        if not t.is_floating_point() or t.numel() <= 65536:
-            result[name] = t.to(torch.float16) if t.is_floating_point() else t
-            meta[name] = "passthrough"
-            continue
-        if any(p in name for p in CONTROL_TENSOR_NAME_PATTERNS):
-            result[name] = t.float()
-            meta[name] = "passthrough_ctrl"
-            continue
-        if cat in int6_cats and t.ndim >= 1:
-            q, s = quantize_int6_per_row(t)
-            result[name + ".q"] = q
-            result[name + ".scale"] = s
-            meta[name] = {"type": "int6"}
-        else:
-            q, s = quantize_float_tensor(t)
-            result[name + ".q"] = q
-            result[name + ".scale"] = s
-            meta[name] = {"type": "int8"}
-    return result, meta
+    num_layers_total = max(
+        (int(k.split(".")[1]) for k in state_dict if k.startswith("blocks.")),
+        default=0,
+    ) + 1
+    late_k_layers = set(range(num_layers_total - 2, num_layers_total))
+    result: dict[str, Tensor] = {}
+    meta: dict[str, object] = {}
+    for name, tensor in state_dict.items():
+        t = tensor.detach().cpu().contiguous()
+        cat = _classify_param(name)
+        if not t.is_floating_point() or t.numel() <= 65536:
+            result[name] = t.to(torch.float16) if t.is_floating_point() else t
+            meta[name] = "passthrough"
+            continue
+        if any(p in name for p in CONTROL_TENSOR_NAME_PATTERNS):
+            result[name] = t.float()
+            meta[name] = "passthrough_ctrl"
+            continue
+        if cat in int6_cats and t.ndim >= 1:
+            q, s = quantize_int6_per_row(t)
+            result[name + ".q"] = q
+            result[name + ".scale"] = s
+            meta[name] = {"type": "int6"}
+        else:
+            q, s = quantize_float_tensor(t)
+            result[name + ".q"] = q
+            result[name + ".scale"] = s
+            meta[name] = {"type": "int8"}
+    return result, meta
def dequantize_mixed_int6(result: dict[str, Tensor], meta: dict[str, object],
-                          template_sd: dict[str, Tensor]) -> dict[str, Tensor]:
-    out: dict[str, Tensor] = {}
-    for name, orig in template_sd.items():
-        info = meta.get(name)
-        if info is None:
-            continue
-        orig_dtype = orig.dtype
-        if info in ("passthrough", "passthrough_ctrl", "passthrough_fp16"):
-            t = result[name]
-            if t.dtype == torch.float16 and orig_dtype in (torch.float32, torch.bfloat16):
-                t = t.to(orig_dtype)
-            out[name] = t
-            continue
-        q, s = result[name + ".q"], result[name + ".scale"]
-        if s.ndim > 0:
-            out[name] = (q.float() * s.float().view(q.shape[0], *([1] * (q.ndim - 1)))).to(orig_dtype)
-        else:
-            out[name] = (q.float() * float(s.item())).to(orig_dtype)
-    return out
-
-# --- Training ---
-
+        template_sd: dict[str, Tensor]) -> dict[str, Tensor]:
+    out: dict[str, Tensor] = {}
+    for name, orig in template_sd.items():
+        info = meta.get(name)
+        if info is None:
+            continue
+        orig_dtype = orig.dtype
+        if info in ("passthrough", "passthrough_ctrl", "passthrough_fp16"):
+            t = result[name]
+            if t.dtype == torch.float16 and orig_dtype in (torch.float32, torch.bfloat16):
+                t = t.to(orig_dtype)
+            out[name] = t
+            continue
+        q, s = result[name + ".q"], result[name + ".scale"]
+        if s.ndim > 0:
+            out[name] = (q.float() * s.float().view(q.shape[0], *([1] * (q.ndim - 1)))).to(orig_dtype)
+        else:
+            out[name] = (q.float() * float(s.item())).to(orig_dtype)
+    return out
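+# End-to-end sketch of the export path (mirrors the flow in main(); the
+# helpers are real, this wrapper itself is illustrative and unused):
+def export_int6_lzma(sd_cpu: dict[str, Tensor], num_layers: int) -> bytes:
+    unbanked = _unbank_state_dict(sd_cpu, num_layers)
+    packed, meta = mixed_quantize_int6(unbanked, {"mlp", "attn"})
+    buf = io.BytesIO()
+    torch.save({"w": packed, "m": meta}, buf)
+    return lzma.compress(buf.getvalue(), preset=6)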
def parse_args() -> argparse.Namespace:
-    parser = argparse.ArgumentParser(description="Recurrent SOTA with stabilization")
-    g = parser.add_argument_group("feedback")
-    g.add_argument("--feedback-rank", type=int, default=2)
-    g.add_argument("--feedback-mode", type=str, default="diagonal",
-                   choices=["identity", "diagonal", "low_rank", "none"])
-    g.add_argument("--per-pass-feedback", action="store_true")
-    g.add_argument("--affine-junction", action="store_true")
-    g = parser.add_argument_group("stability")
-    g.add_argument("--clip-hidden", action="store_true")
-    g.add_argument("--clip-value", type=float, default=10.0)
-    g.add_argument("--residual-scale-init", type=float, default=0.5)
-    g.add_argument("--jacobian-proxy-weight", type=float, default=0.01)
-    g.add_argument("--no-interpass-rmsnorm", action="store_true")
-    return parser.parse_args()
-
+    parser = argparse.ArgumentParser(description="Recurrent SOTA with stabilization")
+    g = parser.add_argument_group("feedback")
+    g.add_argument("--feedback-rank", type=int, default=2)
+    g.add_argument("--feedback-mode", type=str, default="diagonal",
+                   choices=["identity", "diagonal", "low_rank", "none"])
+    g.add_argument("--per-pass-feedback", action="store_true")
+    g.add_argument("--affine-junction", action="store_true")
+    g = parser.add_argument_group("stability")
+    g.add_argument("--clip-hidden", action="store_true")
+    g.add_argument("--clip-value", type=float, default=10.0)
+    g.add_argument("--residual-scale-init", type=float, default=0.5)
+    g.add_argument("--jacobian-proxy-weight", type=float, default=0.01)
+    g.add_argument("--no-interpass-rmsnorm", action="store_true")
+    return parser.parse_args()
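+# Example invocation (illustrative; the flags are the ones defined above, the
+# torchrun layout is assumed from the record's run scripts):
+#   torchrun --nproc_per_node=1 train_bestbase_recurrent_qat.py \
+#       --feedback-mode diagonal --feedback-rank 2 --clip-hidden \
+#       --jacobian-proxy-weight 0.01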
def main() -> None:
-    cli = parse_args()
-    code = Path(__file__).read_text(encoding="utf-8")
-    args = Hyperparameters()
-    distributed = "RANK" in os.environ and "WORLD_SIZE" in os.environ
-    rank = int(os.environ.get("RANK", "0"))
-    world_size = int(os.environ.get("WORLD_SIZE", "1"))
-    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
-    if world_size <= 0:
-        raise ValueError(f"WORLD_SIZE must be positive, got {world_size}")
-    if 8 % world_size != 0:
-        raise ValueError(f"WORLD_SIZE={world_size} must divide 8 so grad_accum_steps stays integral")
-    grad_accum_steps = 8 // world_size
-    grad_scale = 1.0 / grad_accum_steps
-    if not torch.cuda.is_available():
-        raise RuntimeError("CUDA is required")
-    device = torch.device("cuda", local_rank)
-    torch.cuda.set_device(device)
-    if distributed:
-        dist.init_process_group(backend="nccl", device_id=device)
-        dist.barrier()
-    master_process = rank == 0
-    torch.backends.cuda.matmul.allow_tf32 = True
-    torch.backends.cudnn.allow_tf32 = True
-    from torch.backends.cuda import enable_cudnn_sdp, enable_flash_sdp, enable_math_sdp, enable_mem_efficient_sdp
-    enable_cudnn_sdp(False)
-    enable_flash_sdp(True)
-    enable_mem_efficient_sdp(False)
-    enable_math_sdp(False)
-    logfile = None
-    if master_process:
-        os.makedirs("logs", exist_ok=True)
-        logfile = f"logs/{args.run_id}.txt"
-        print(logfile)
-    def log0(msg: str, console: bool = True) -> None:
-        if not master_process:
-            return
-        if console:
-            print(msg)
-        if logfile is not None:
-            with open(logfile, "a", encoding="utf-8") as f:
-                print(msg, file=f)
-    log0(code, console=False)
-    log0("=" * 100, console=False)
-    log0(f"Running Python {sys.version}", console=False)
-    log0(f"Running PyTorch {torch.__version__}", console=False)
-    log0(
-        subprocess.run(["nvidia-smi"], stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, check=False).stdout,
-        console=False,
-    )
-    log0("=" * 100, console=False)
-    random.seed(args.seed)
-    np.random.seed(args.seed)
-    torch.manual_seed(args.seed)
-    torch.cuda.manual_seed_all(args.seed)
-    if not args.tokenizer_path.endswith(".model"):
-        raise ValueError(f"Script only setup for SentencePiece .model file: {args.tokenizer_path}")
-    sp = spm.SentencePieceProcessor(model_file=args.tokenizer_path)
-    if int(sp.vocab_size()) != args.vocab_size:
-        raise ValueError(
-            f"VOCAB_SIZE={args.vocab_size} does not match tokenizer vocab_size={int(sp.vocab_size())}"
-        )
-    dataset_dir = Path(args.data_path).resolve()
-    actual_train_files = len(list(dataset_dir.glob("fineweb_train_*.bin")))
-    effective_eval_seq_len = args.eval_seq_len if args.eval_seq_len > 0 else args.train_seq_len
-    val_seq_len = max(args.train_seq_len, effective_eval_seq_len)
-    val_tokens = load_validation_tokens(args.val_files, val_seq_len)
-    base_bytes_lut, has_leading_space_lut, is_boundary_token_lut = build_sentencepiece_luts(
-        sp, args.vocab_size, device
-    )
-    log0(f"val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path={args.tokenizer_path}")
-    log0(f"train_loader:dataset:{dataset_dir.name} train_shards:{actual_train_files}")
-    log0(f"val_loader:shards pattern={args.val_files} tokens:{val_tokens.numel() - 1}")
-    CastedLinear._qat_enabled = args.qat_enabled
-    base_model = GPT(
-        vocab_size=args.vocab_size,
-        num_layers=args.num_layers,
-        model_dim=args.model_dim,
-        num_heads=args.num_heads,
-        num_kv_heads=args.num_kv_heads,
-        mlp_mult=args.mlp_mult,
-        tie_embeddings=args.tie_embeddings,
-        tied_embed_init_std=args.tied_embed_init_std,
-        logit_softcap=args.logit_softcap,
-        rope_base=args.rope_base,
-        qk_gain_init=args.qk_gain_init,
-        mtp_num_heads=args.mtp_num_heads,
-        mtp_loss_weight=args.mtp_loss_weight,
-        bigram_vocab_size=args.bigram_vocab_size,
-        bigram_dim=args.bigram_dim,
-        xsa_last_n=args.xsa_last_n,
-        rope_dims=args.rope_dims,
-        ln_scale=args.ln_scale,
-        dtg=args.dtg_enabled,
-        ve_enabled=args.ve_enabled,
-        ve_dim=args.ve_dim,
-        ve_layers=args.ve_layers,
-        gated_attention=args.gated_attention,
-        value_residual=args.value_residual,
-        core_start=args.core_start,
-        core_end=args.core_end,
-        num_passes=args.num_passes,
-        core_quant_bits=args.core_quant_bits,
-        core_quant_enabled=args.core_quant_enabled,
-        residual_scale=None,
-        interpass_rmsnorm=not cli.no_interpass_rmsnorm,
-    ).to(device).bfloat16()
-    # Banks stay FP32 (like CastedLinear weights), cast to BF16 in forward
-    base_model.qo_bank.data = base_model.qo_bank.data.float()
-    base_model.kv_bank.data = base_model.kv_bank.data.float()
-    base_model.mlp_up_bank.data = base_model.mlp_up_bank.data.float()
-    base_model.mlp_down_bank.data = base_model.mlp_down_bank.data.float()
-    for module in base_model.modules():
-        if isinstance(module, CastedLinear):
-            module.float()
-    restore_low_dim_params_to_fp32(base_model)
-    # --- feedback / stabilizer ---
-    feedback = None
-    feedback_fn = None
-    stabilizer = None
-    residual_scale = None
-    extra_scalar_params: list[nn.Parameter] = []
-    # Parse progressive passes schedule
-    passes_schedule: list[tuple[int, int]] = []
-    if args.passes_schedule_str:
-        for entry in args.passes_schedule_str.split(","):
-            s, p = entry.strip().split(":")
-            passes_schedule.append((int(s), int(p)))
-        passes_schedule.sort(key=lambda x: x[0])
-    max_passes = max((p for _, p in passes_schedule), default=args.num_passes)
-    max_passes = max(max_passes, args.eval_passes if args.eval_passes > 0 else args.num_passes)
-    needs_recurrence = max_passes > 1
-    if cli.feedback_mode != "none" and needs_recurrence:
-        feedback = ErrorFeedbackModule(
-            dim=args.model_dim, rank=cli.feedback_rank,
-            feedback_mode=cli.feedback_mode,
-            per_pass=cli.per_pass_feedback,
-            num_passes=max_passes,
-            affine_junction=cli.affine_junction,
-        ).to(device).bfloat16()
-        restore_low_dim_params_to_fp32(feedback)
-        extra_scalar_params.extend(feedback.parameters())
-        def feedback_fn(h, pass_idx):
-            return feedback(h, pass_idx)
-        log0(f"feedback: mode={cli.feedback_mode} rank={cli.feedback_rank} "
-             f"per_pass={cli.per_pass_feedback} params={sum(p.numel() for p in feedback.parameters())}")
-    if needs_recurrence:
-        stabilizer = RecurrentStabilizer(
-            clip_hidden=cli.clip_hidden, clip_value=cli.clip_value,
-            jacobian_proxy_weight=cli.jacobian_proxy_weight)
-    if cli.residual_scale_init != 1.0:
-        residual_scale = ResidualScale(max_passes, cli.residual_scale_init).to(device)
-        base_model.residual_scale = residual_scale
-        extra_scalar_params.extend(residual_scale.parameters())
-    sched_str = f" schedule={passes_schedule}" if passes_schedule else ""
-    log0(f"recurrence: core_start={args.core_start} core_end={args.core_end} "
-         f"num_passes={args.num_passes} max_passes={max_passes} stem={base_model.num_stem} "
-         f"core={base_model.num_core} tail={base_model.num_tail}{sched_str}")
-
-    # No DDP -- Parallel Muon handles bank grad communication via reduce-scatter,
-    # and non-bank grads are manually all-reduced before Adam steps.
-    compiled_model = torch.compile(base_model, dynamic=False, fullgraph=True)
-    model = compiled_model
-
-    # Optimizer split:
-    #   - 4 parameter banks -> Muon (batched Newton-Schulz)
-    #   - token embedding -> Adam
-    #   - scalars/control tensors -> Adam
-    #   - bigram proj, mtp heads, VE proj -> Adam (small matrix params not worth banking)
-    matrix_params = [
-        base_model.qo_bank, base_model.kv_bank,
-        base_model.mlp_up_bank, base_model.mlp_down_bank,
-    ]
-    block_named_params = list(base_model.blocks.named_parameters())
-    scalar_params = [
-        p
-        for name, p in block_named_params
-        if p.ndim < 2 or any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS)
-    ]
-    if base_model.skip_weights.numel() > 0:
-        scalar_params.append(base_model.skip_weights)
-    scalar_params.append(base_model.smear.gate)
-    if base_model.bigram is not None:
-        scalar_params.append(base_model.bigram.scale)
-    token_lr = args.tied_embed_lr if args.tie_embeddings else args.embed_lr
-    tok_params = [{"params": [base_model.tok_emb.weight], "lr": token_lr, "base_lr": token_lr}]
-    if base_model.bigram is not None:
-        tok_params.append({"params": [base_model.bigram.embed.weight], "lr": token_lr, "base_lr": token_lr})
-        if base_model.bigram.proj is not None:
-            scalar_params.append(base_model.bigram.proj.weight)
-    if base_model.ve_shared is not None:
-        tok_params.append({"params": [base_model.ve_shared.embed.weight], "lr": token_lr, "base_lr": token_lr})
-        if base_model.ve_shared.proj is not None:
-            scalar_params.append(base_model.ve_shared.proj.weight)
-        scalar_params.append(base_model.ve_shared.scale)
-        for s in base_model.ve_layer_scales:
-            scalar_params.append(s)
-    optimizer_tok = torch.optim.AdamW(
-        tok_params,
-        betas=(args.beta1, args.beta2),
-        eps=args.adam_eps,
-        weight_decay=args.adam_wd,
-        fused=True,
-    )
-    optimizer_muon = Muon(
-        matrix_params,
-        lr=args.matrix_lr,
-        momentum=args.muon_momentum,
-        backend_steps=args.muon_backend_steps,
-        weight_decay=args.muon_wd,
-    )
-    for group in optimizer_muon.param_groups:
-        group["base_lr"] = args.matrix_lr
-    scalar_params.extend(extra_scalar_params)
-    optimizer_scalar = torch.optim.AdamW(
-        [{"params": scalar_params, "lr": args.scalar_lr, "base_lr": args.scalar_lr}],
-        betas=(args.beta1, args.beta2),
-        eps=args.adam_eps,
-        weight_decay=args.adam_wd,
-        fused=True,
-    )
-    # Non-bank params that need manual all-reduce (replicated across GPUs)
-    replicated_params = list(optimizer_tok.param_groups[0]["params"])
-    for pg in optimizer_tok.param_groups[1:]:
-        replicated_params.extend(pg["params"])
-    replicated_params.extend(scalar_params)
-
-    optimizer_head = None
-    if base_model.lm_head is not None:
-        optimizer_head = torch.optim.Adam(
-            [{"params": [base_model.lm_head.weight], "lr": args.head_lr, "base_lr": args.head_lr}],
-            betas=(args.beta1, args.beta2),
-            eps=args.adam_eps,
-            fused=True,
-        )
-        replicated_params.append(base_model.lm_head.weight)
-    optimizers: list[torch.optim.Optimizer] = [optimizer_tok, optimizer_muon, optimizer_scalar]
-    if optimizer_head is not None:
-        optimizers.append(optimizer_head)
-    n_params = sum(p.numel() for p in base_model.parameters())
-    mtp_params = sum(p.numel() for p in base_model.mtp_heads.parameters())
-    log0(f"model_params:{n_params}")
-    log0(f"mtp_num_heads:{args.mtp_num_heads} mtp_loss_weight:{args.mtp_loss_weight} mtp_params:{mtp_params}")
-    xsa_layers = [i for i, b in enumerate(base_model.blocks) if b.attn.use_xsa]
-    log0(f"XSA:last_{args.xsa_last_n} active_layers:{xsa_layers}")
-    log0(f"world_size:{world_size} grad_accum_steps:{grad_accum_steps}")
-    log0("sdp_backends:cudnn=False flash=True mem_efficient=False math=False")
-    log0(f"attention_mode:gqa num_heads:{args.num_heads} num_kv_heads:{args.num_kv_heads}")
-    log0(
-        f"tie_embeddings:{args.tie_embeddings} embed_lr:{token_lr} "
-        f"head_lr:{args.head_lr if base_model.lm_head is not None else 0.0} "
-        f"matrix_lr:{args.matrix_lr} scalar_lr:{args.scalar_lr}"
-    )
-    log0(
-        f"train_batch_tokens:{args.train_batch_tokens} train_seq_len:{args.train_seq_len} "
-        f"iterations:{args.iterations} warmup_steps:{args.warmup_steps} "
-        f"max_wallclock_seconds:{args.max_wallclock_seconds:.3f}"
-    )
-    log0(f"seed:{args.seed}")
-    use_wandb = False
-    train_loader = DistributedTokenLoader(args.train_files, rank, world_size, device)
-    def zero_grad_all() -> None:
-        for opt in optimizers:
-            opt.zero_grad(set_to_none=True)
-    max_wallclock_ms = 1000.0 * args.max_wallclock_seconds if args.max_wallclock_seconds > 0 else None
-    def lr_mul(step: int, elapsed_ms: float) -> float:
-        if args.warmdown_iters <= 0:
-            return 1.0
-        if max_wallclock_ms is None:
-            warmdown_start = max(args.iterations - args.warmdown_iters, 0)
-            return max((args.iterations - step) / max(args.warmdown_iters, 1), 0.0) if warmdown_start <= step < args.iterations else 1.0
-        step_ms = elapsed_ms / max(step, 1)
-        warmdown_ms = args.warmdown_iters * step_ms
-        remaining_ms = max(max_wallclock_ms - elapsed_ms, 0.0)
-        return remaining_ms / max(warmdown_ms, 1e-9) if remaining_ms <= warmdown_ms else 1.0
-    if args.warmup_steps > 0:
-        initial_model_state = {name: tensor.detach().cpu().clone() for name, tensor in base_model.state_dict().items()}
-        initial_optimizer_states = [copy.deepcopy(opt.state_dict()) for opt in optimizers]
-        _precompile_passes = sorted(set(p for _, p in passes_schedule) - {args.num_passes}) if passes_schedule else []
-        _qat_precompile_passes = _precompile_passes[-2:] if len(_precompile_passes) >= 2 else _precompile_passes[:]
-        _total_precompile = len(_precompile_passes) + len(_qat_precompile_passes)
-        _precompile_start = args.warmup_steps - _total_precompile
-        model.train()
-        for warmup_step in range(args.warmup_steps):
-            if warmup_step >= _precompile_start:
-                _pc_idx = warmup_step - _precompile_start
-                if _pc_idx < len(_precompile_passes):
-                    base_model.num_passes = _precompile_passes[_pc_idx]
-                    CastedLinear._qat_enabled = False
-                    base_model.core_quant_enabled = False
-                else:
-                    _qat_idx = _pc_idx - len(_precompile_passes)
-                    base_model.num_passes = _qat_precompile_passes[_qat_idx]
-                    CastedLinear._qat_enabled = True
-                    base_model.core_quant_enabled = True
-            zero_grad_all()
-            for micro_step in range(grad_accum_steps):
-                x, y = train_loader.next_batch(args.train_batch_tokens, args.train_seq_len, grad_accum_steps)
-                with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True):
-                    warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer)
-                (warmup_loss * grad_scale).backward()
-            if distributed:
-                for p in base_model.parameters():
-                    if p.grad is not None:
-                        dist.all_reduce(p.grad, op=dist.ReduceOp.AVG)
-                if feedback is not None:
-                    for p in feedback.parameters():
-                        if p.grad is not None:
-                            dist.all_reduce(p.grad, op=dist.ReduceOp.AVG)
-            for opt in optimizers:
-                opt.step()
-            zero_grad_all()
-            if args.warmup_steps <= 20 or (warmup_step + 1) % 10 == 0 or warmup_step + 1 == args.warmup_steps:
-                log0(f"warmup_step:{warmup_step + 1}/{args.warmup_steps}")
-        base_model.num_passes = args.num_passes
-        CastedLinear._qat_enabled = args.qat_enabled
-        base_model.core_quant_enabled = args.core_quant_enabled
-        if stabilizer is not None:
-            stabilizer.reset()
-        base_model.load_state_dict(initial_model_state, strict=True)
-        for opt, state in zip(optimizers, initial_optimizer_states, strict=True):
-            opt.load_state_dict(state)
-        zero_grad_all()
-        train_loader = DistributedTokenLoader(args.train_files, rank, world_size, device)
-    swa_state: dict[str, Tensor] | None = None
-    swa_count = 0
-    from collections import deque
-    lawa_queue: deque[dict[str, Tensor]] = deque(maxlen=args.lawa_k)
-    _all_state = dict(base_model.state_dict())
-    if feedback is not None:
-        for k, v in feedback.state_dict().items():
-            _all_state[f"_fb.{k}"] = v
-    ema_state = {name: t.detach().float().clone() for name, t in _all_state.items()}
-    ema_decay = 0.997
-    training_time_ms = 0.0
-    stop_after_step: int | None = None
-    torch.cuda.synchronize()
-    t0 = time.perf_counter()
-    step = 0
-    while True:
-        last_step = step == args.iterations or (stop_after_step is not None and step >= stop_after_step)
-        should_validate = last_step or (args.val_loss_every > 0 and step % args.val_loss_every == 0)
-        if should_validate:
-            torch.cuda.synchronize()
-            training_time_ms += 1000.0 * (time.perf_counter() - t0)
-            val_loss, val_bpb = eval_val(
-                args,
-                model,
-                rank,
-                world_size,
-                device,
-                grad_accum_steps,
-                val_tokens,
-                base_bytes_lut,
-                has_leading_space_lut,
-                is_boundary_token_lut,
-            )
-            diag_str = ""
-            if stabilizer is not None and stabilizer.diagnostics.h_norms:
-                hn = [f"{v:.1f}" for v in stabilizer.diagnostics.h_norms[-args.num_passes*base_model.num_core:]]
-                gr = [f"{v:.3f}" for v in stabilizer.diagnostics.growth_ratios[-args.num_passes*base_model.num_core:]]
-                diag_str = f" h_norms={hn} growth={gr}"
-                stabilizer.reset()
-            log0(
-                f"step:{step}/{args.iterations} val_loss:{val_loss:.4f} val_bpb:{val_bpb:.4f} "
-                f"train_time:{training_time_ms:.0f}ms step_avg:{training_time_ms / max(step, 1):.2f}ms"
-                f"{diag_str}"
-            )
-            if use_wandb:
-                wb_data = {"val_loss": val_loss, "val_bpb": val_bpb}
{"val_loss": val_loss, "val_bpb": val_bpb} - if stabilizer is not None and stabilizer.diagnostics.growth_ratios: - wb_data["max_growth"] = max(stabilizer.diagnostics.growth_ratios) - wb_data["mean_growth"] = sum(stabilizer.diagnostics.growth_ratios) / len(stabilizer.diagnostics.growth_ratios) - _wandb.log(wb_data, step=step) - torch.cuda.synchronize() - t0 = time.perf_counter() - if last_step: - if stop_after_step is not None and step < args.iterations: - log0( - f"stopping_early: wallclock_cap train_time:{training_time_ms:.0f}ms " - f"step:{step}/{args.iterations}" - ) - break - elapsed_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0) - scale = lr_mul(step, elapsed_ms) - if passes_schedule: - target_passes = args.num_passes - for threshold_step, p in passes_schedule: - if step >= threshold_step: - target_passes = p - if target_passes != base_model.num_passes: - base_model.num_passes = target_passes - log0(f"progressive_passes: step:{step} num_passes:{target_passes}") - if args.late_qat_threshold > 0 and step > 100 and scale < args.late_qat_threshold and not CastedLinear._qat_enabled: - CastedLinear._qat_enabled = True - base_model.core_quant_enabled = True - log0(f"late_qat:enabled step:{step} scale:{scale:.4f} core_quant:on") - zero_grad_all() - train_loss = torch.zeros((), device=device) - for micro_step in range(grad_accum_steps): - x, y = train_loader.next_batch(args.train_batch_tokens, args.train_seq_len, grad_accum_steps) - with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): - loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer) - train_loss += loss.detach() - (loss * grad_scale).backward() - train_loss /= grad_accum_steps - frac = min(step / args.muon_momentum_warmup_steps, 1.0) if args.muon_momentum_warmup_steps > 0 else 1.0 - muon_momentum = (1 - frac) * args.muon_momentum_warmup_start + frac * args.muon_momentum - for group in optimizer_muon.param_groups: - group["momentum"] = muon_momentum - for opt in optimizers: - for group in opt.param_groups: - group["lr"] = group["base_lr"] * scale - grad_norm = None - if args.grad_clip_norm > 0: - grad_norm = torch.nn.utils.clip_grad_norm_(base_model.parameters(), args.grad_clip_norm) - # === 3-phase overlapped optimizer step === - # Phase 1: Launch async reduce-scatter for banks (biggest first) - optimizer_muon.launch_reduce_scatters() - # Phase 2: All-reduce non-bank grads + step Adam (while bank RS is in-flight) - if distributed: - for p in replicated_params: - if p.grad is not None: - dist.all_reduce(p.grad, op=dist.ReduceOp.AVG) - optimizer_tok.step() - optimizer_scalar.step() - if optimizer_head is not None: - optimizer_head.step() - # Phase 3: Wait for RS, local NS5, all-gather (banks processed last) - optimizer_muon.step() - zero_grad_all() - # EMA update - with torch.no_grad(): - _cur = dict(base_model.state_dict()) - if feedback is not None: - for k, v in feedback.state_dict().items(): - _cur[f"_fb.{k}"] = v - for name, t in _cur.items(): - ema_state[name].mul_(ema_decay).add_(t.detach().float(), alpha=1.0 - ema_decay) - step += 1 - approx_training_time_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0) - if args.swa_enabled and scale < 0.2 and step % args.swa_every == 0: - if swa_state is None: - swa_state = {name: t.detach().cpu().clone() for name, t in base_model.state_dict().items()} - swa_count = 1 - log0(f"swa:start step:{step}") - else: - for name, t in base_model.state_dict().items(): - swa_state[name] += t.detach().cpu() - swa_count += 1 - if args.lawa_enabled 
-            lawa_queue.append({name: t.detach().cpu().clone() for name, t in base_model.state_dict().items()})
-        should_log_train = (
-            args.train_log_every > 0
-            and (step <= 10 or step % args.train_log_every == 0 or stop_after_step is not None)
-        )
-        if should_log_train:
-            tl = train_loss.item()
-            gn_str = f" grad_norm:{grad_norm:.4f}" if grad_norm is not None else ""
-            log0(
-                f"step:{step}/{args.iterations} train_loss:{tl:.4f}{gn_str} "
-                f"train_time:{approx_training_time_ms:.0f}ms step_avg:{approx_training_time_ms / step:.2f}ms"
-            )
-            if use_wandb:
-                wlog = {"train_loss": tl, "step_avg_ms": approx_training_time_ms / step, "lr_scale": scale}
-                if grad_norm is not None:
-                    wlog["grad_norm"] = float(grad_norm)
-                _wandb.log(wlog, step=step)
-        reached_cap = max_wallclock_ms is not None and approx_training_time_ms >= max_wallclock_ms
-        if distributed and max_wallclock_ms is not None:
-            reached_cap_tensor = torch.tensor(int(reached_cap), device=device)
-            dist.all_reduce(reached_cap_tensor, op=dist.ReduceOp.MAX)
-            reached_cap = bool(reached_cap_tensor.item())
-        if stop_after_step is None and reached_cap:
-            stop_after_step = step
-    log0(
-        f"peak memory allocated: {torch.cuda.max_memory_allocated() // 1024 // 1024} MiB "
-        f"reserved: {torch.cuda.max_memory_reserved() // 1024 // 1024} MiB"
-    )
-    # Apply weight averaging
-    if args.lawa_enabled and len(lawa_queue) > 1:
-        log0(f"lawa:applying LAWA averaging k={len(lawa_queue)}")
-        current_state = base_model.state_dict()
-        avg_state = {name: torch.zeros(t.shape, dtype=torch.float32, device='cpu') for name, t in current_state.items()}
-        for snap in lawa_queue:
-            for name in avg_state:
-                avg_state[name] += snap[name].float()
-        for name in avg_state:
-            avg_state[name] /= len(lawa_queue)
-            avg_state[name] = avg_state[name].to(dtype=current_state[name].dtype)
-        base_model.load_state_dict(avg_state, strict=True)
-    else:
-        log0("ema:applying EMA weights")
-        current_state = base_model.state_dict()
-        model_ema = {k: v for k, v in ema_state.items() if not k.startswith("_fb.")}
-        avg_state = {name: model_ema[name].to(dtype=current_state[name].dtype) for name in current_state}
-        base_model.load_state_dict(avg_state, strict=True)
-        if feedback is not None:
-            fb_ema = {k.removeprefix("_fb."): v for k, v in ema_state.items() if k.startswith("_fb.")}
-            fb_state = feedback.state_dict()
-            fb_avg = {k: fb_ema[k].to(dtype=fb_state[k].dtype) for k in fb_state}
-            feedback.load_state_dict(fb_avg, strict=True)
-    torch.cuda.synchronize()
-    t_diag = time.perf_counter()
-    diag_val_loss, diag_val_bpb = eval_val(
-        args, compiled_model, rank, world_size, device, grad_accum_steps,
-        val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut,
-    )
-    torch.cuda.synchronize()
-    log0(
-        f"DIAGNOSTIC post_ema val_loss:{diag_val_loss:.4f} val_bpb:{diag_val_bpb:.4f} "
-        f"eval_time:{1000.0 * (time.perf_counter() - t_diag):.0f}ms"
-    )
-    full_state_dict = base_model.state_dict()
-    export_sd = {k: v for k, v in full_state_dict.items() if "mtp_heads" not in k}
-    excluded_mtp = sum(int(t.numel()) for k, t in full_state_dict.items() if "mtp_heads" in k)
-    if excluded_mtp > 0:
-        log0(f"export_excluding_mtp_params:{excluded_mtp}")
-    if master_process:
-        torch.save(export_sd, "final_model.pt")
-        model_bytes = os.path.getsize("final_model.pt")
-        code_bytes = len(code.encode("utf-8"))
-        log0(f"Serialized model: {model_bytes} bytes")
-        log0(f"Code size: {code_bytes} bytes")
-    # Override passes for eval phase (train cheap, eval deep)
-    eval_num_passes = args.eval_passes if args.eval_passes > 0 else args.num_passes
-    if eval_num_passes != args.num_passes:
-        log0(f"eval_override: num_passes {args.num_passes} -> {eval_num_passes}")
-        base_model.num_passes = eval_num_passes
-        if base_model.residual_scale is not None:
-            old_s = base_model.residual_scale.scales.data
-            new_s = torch.full((eval_num_passes,), cli.residual_scale_init,
-                               dtype=torch.float32, device=old_s.device)
-            copy_len = min(eval_num_passes, old_s.shape[0])
-            new_s[:copy_len] = old_s[:copy_len]
-            base_model.residual_scale.scales = nn.Parameter(new_s)
-        export_sd = {k: v for k, v in base_model.state_dict().items() if "mtp_heads" not in k}
-    # Unbank 3D tensors into individual 2D tensors for quantization
-    sd_cpu = {k: v.detach().cpu() for k, v in export_sd.items()}
-    unbanked_sd = _unbank_state_dict(sd_cpu, args.num_layers)
-    quant_result, quant_meta = mixed_quantize_int6(unbanked_sd, {"mlp", "attn"})
-    quant_buf = io.BytesIO()
-    torch.save({"w": quant_result, "m": quant_meta}, quant_buf)
-    quant_raw = quant_buf.getvalue()
-    quant_blob = lzma.compress(quant_raw, preset=6)
-    if master_process:
-        with open("final_model.int6.ptz", "wb") as f:
-            f.write(quant_blob)
-        quant_file_bytes = len(quant_blob)
-        code_bytes = len(code.encode("utf-8"))
-        log0(f"Serialized model int6+lzma: {quant_file_bytes} bytes")
-        log0(f"Total submission size int6+lzma: {quant_file_bytes + code_bytes} bytes")
-    if distributed:
-        dist.barrier()
-    with open("final_model.int6.ptz", "rb") as f:
-        quant_blob_disk = f.read()
-    quant_state = torch.load(
-        io.BytesIO(lzma.decompress(quant_blob_disk)),
-        map_location="cpu",
-    )
-    deq_unbanked = dequantize_mixed_int6(quant_state["w"], quant_state["m"], unbanked_sd)
-    # Re-bank the dequantized tensors
-    deq_state = _rebank_state_dict(deq_unbanked, args.num_layers, sd_cpu)
-    eval_model = GPT(
-        vocab_size=args.vocab_size, num_layers=args.num_layers, model_dim=args.model_dim,
-        num_heads=args.num_heads, num_kv_heads=args.num_kv_heads, mlp_mult=args.mlp_mult,
-        tie_embeddings=args.tie_embeddings, tied_embed_init_std=args.tied_embed_init_std,
-        logit_softcap=args.logit_softcap, rope_base=args.rope_base, qk_gain_init=args.qk_gain_init,
-        mtp_num_heads=0, mtp_loss_weight=0.0,
-        bigram_vocab_size=args.bigram_vocab_size, bigram_dim=args.bigram_dim,
-        xsa_last_n=args.xsa_last_n,
-        rope_dims=args.rope_dims, ln_scale=args.ln_scale, dtg=args.dtg_enabled,
-        ve_enabled=args.ve_enabled, ve_dim=args.ve_dim, ve_layers=args.ve_layers,
-        gated_attention=args.gated_attention, value_residual=args.value_residual,
-        core_start=args.core_start, core_end=args.core_end,
-        num_passes=eval_num_passes,
-        interpass_rmsnorm=not cli.no_interpass_rmsnorm,
-    ).to(device).bfloat16()
-    if residual_scale is not None:
-        eval_rs = ResidualScale(eval_num_passes, cli.residual_scale_init).to(device)
-        eval_model.residual_scale = eval_rs
-    eval_model.qo_bank.data = eval_model.qo_bank.data.float()
-    eval_model.kv_bank.data = eval_model.kv_bank.data.float()
-    eval_model.mlp_up_bank.data = eval_model.mlp_up_bank.data.float()
-    eval_model.mlp_down_bank.data = eval_model.mlp_down_bank.data.float()
-    for m in eval_model.modules():
-        if isinstance(m, CastedLinear):
-            m.float()
-    restore_low_dim_params_to_fp32(eval_model)
-    eval_model.load_state_dict(deq_state, strict=True)
-    # Legal score-first TTT (PR #461 recipe) -- skip intermediate evals to maximize TTT time budget
-    if args.ttt_enabled:
-        torch.cuda.synchronize()
-        t_ttt = time.perf_counter()
-        ttt_loss, ttt_bpb = eval_val_sliding_ttt(
-            args, eval_model, rank, world_size, device,
-            val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut,
-            stride=args.eval_stride, log0=log0,
-        )
-        torch.cuda.synchronize()
-        log0(f"legal_ttt val_loss:{ttt_loss:.4f} val_bpb:{ttt_bpb:.4f} "
-             f"eval_time:{1000.0 * (time.perf_counter() - t_ttt):.0f}ms")
-        log0(f"legal_ttt_exact val_loss:{ttt_loss:.8f} val_bpb:{ttt_bpb:.8f}")
-    if use_wandb:
-        _wandb.finish()
-    if distributed:
-        dist.destroy_process_group()
+    cli = parse_args()
+    code = Path(__file__).read_text(encoding="utf-8")
+    args = Hyperparameters()
+    distributed = "RANK" in os.environ and "WORLD_SIZE" in os.environ
+    rank = int(os.environ.get("RANK", "0"))
+    world_size = int(os.environ.get("WORLD_SIZE", "1"))
+    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
+    if world_size <= 0:
+        raise ValueError(f"WORLD_SIZE must be positive, got {world_size}")
+    if 8 % world_size != 0:
+        raise ValueError(f"WORLD_SIZE={world_size} must divide 8 so grad_accum_steps stays integral")
+    grad_accum_steps = 8 // world_size
+    grad_scale = 1.0 / grad_accum_steps
+    if not torch.cuda.is_available():
+        raise RuntimeError("CUDA is required")
+    device = torch.device("cuda", local_rank)
+    torch.cuda.set_device(device)
+    if distributed:
+        dist.init_process_group(backend="nccl", device_id=device)
+        dist.barrier()
+    master_process = rank == 0
+    torch.backends.cuda.matmul.allow_tf32 = True
+    torch.backends.cudnn.allow_tf32 = True
+    from torch.backends.cuda import enable_cudnn_sdp, enable_flash_sdp, enable_math_sdp, enable_mem_efficient_sdp
+    enable_cudnn_sdp(False)
+    enable_flash_sdp(True)
+    enable_mem_efficient_sdp(False)
+    enable_math_sdp(False)
+    logfile = None
+    if master_process:
+        os.makedirs("logs", exist_ok=True)
+        logfile = f"logs/{args.run_id}.txt"
+        print(logfile)
+    def log0(msg: str, console: bool = True) -> None:
+        if not master_process:
+            return
+        if console:
+            print(msg)
+        if logfile is not None:
+            with open(logfile, "a", encoding="utf-8") as f:
+                print(msg, file=f)
+    log0(code, console=False)
+    log0("=" * 100, console=False)
+    log0(f"Running Python {sys.version}", console=False)
+    log0(f"Running PyTorch {torch.__version__}", console=False)
+    log0(
+        subprocess.run(["nvidia-smi"], stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, check=False).stdout,
+        console=False,
+    )
+    log0("=" * 100, console=False)
+    random.seed(args.seed)
+    np.random.seed(args.seed)
+    torch.manual_seed(args.seed)
+    torch.cuda.manual_seed_all(args.seed)
+    if not args.tokenizer_path.endswith(".model"):
+        raise ValueError(f"Script only setup for SentencePiece .model file: {args.tokenizer_path}")
+    sp = spm.SentencePieceProcessor(model_file=args.tokenizer_path)
+    if int(sp.vocab_size()) != args.vocab_size:
+        raise ValueError(
+            f"VOCAB_SIZE={args.vocab_size} does not match tokenizer vocab_size={int(sp.vocab_size())}"
+        )
+    dataset_dir = Path(args.data_path).resolve()
+    actual_train_files = len(list(dataset_dir.glob("fineweb_train_*.bin")))
+    effective_eval_seq_len = args.eval_seq_len if args.eval_seq_len > 0 else args.train_seq_len
+    val_seq_len = max(args.train_seq_len, effective_eval_seq_len)
+    val_tokens = load_validation_tokens(args.val_files, val_seq_len)
+    base_bytes_lut, has_leading_space_lut, is_boundary_token_lut = build_sentencepiece_luts(
+        sp, args.vocab_size, device
+    )
+    log0(f"val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path={args.tokenizer_path}")
+    log0(f"train_loader:dataset:{dataset_dir.name} train_shards:{actual_train_files}")
+    log0(f"val_loader:shards pattern={args.val_files} tokens:{val_tokens.numel() - 1}")
tokens:{val_tokens.numel() - 1}") + CastedLinear._qat_enabled = args.qat_enabled + base_model = GPT( + vocab_size=args.vocab_size, + num_layers=args.num_layers, + model_dim=args.model_dim, + num_heads=args.num_heads, + num_kv_heads=args.num_kv_heads, + mlp_mult=args.mlp_mult, + tie_embeddings=args.tie_embeddings, + tied_embed_init_std=args.tied_embed_init_std, + logit_softcap=args.logit_softcap, + rope_base=args.rope_base, + qk_gain_init=args.qk_gain_init, + xsa_last_n=args.xsa_last_n, + rope_dims=args.rope_dims, + ln_scale=args.ln_scale, + core_start=args.core_start, + core_end=args.core_end, + num_passes=args.num_passes, + core_quant_bits=args.core_quant_bits, + core_quant_enabled=args.core_quant_enabled, + residual_scale=None, + interpass_rmsnorm=not cli.no_interpass_rmsnorm, + ).to(device).bfloat16() + base_model.qo_bank.data = base_model.qo_bank.data.float() + base_model.kv_bank.data = base_model.kv_bank.data.float() + base_model.mlp_up_bank.data = base_model.mlp_up_bank.data.float() + base_model.mlp_down_bank.data = base_model.mlp_down_bank.data.float() + for module in base_model.modules(): + if isinstance(module, CastedLinear): + module.float() + restore_low_dim_params_to_fp32(base_model) + feedback = None + feedback_fn = None + stabilizer = None + residual_scale = None + extra_scalar_params: list[nn.Parameter] = [] + passes_schedule: list[tuple[int, int]] = [] + if args.passes_schedule_str: + for entry in args.passes_schedule_str.split(","): + s, p = entry.strip().split(":") + passes_schedule.append((int(s), int(p))) + passes_schedule.sort(key=lambda x: x[0]) + max_passes = max((p for _, p in passes_schedule), default=args.num_passes) + max_passes = max(max_passes, args.eval_passes if args.eval_passes > 0 else args.num_passes) + needs_recurrence = max_passes > 1 + if cli.feedback_mode != "none" and needs_recurrence: + feedback = ErrorFeedbackModule( + dim=args.model_dim, rank=cli.feedback_rank, + feedback_mode=cli.feedback_mode, + per_pass=cli.per_pass_feedback, + num_passes=max_passes, + affine_junction=cli.affine_junction, + ).to(device).bfloat16() + restore_low_dim_params_to_fp32(feedback) + extra_scalar_params.extend(feedback.parameters()) + def feedback_fn(h, pass_idx): + return feedback(h, pass_idx) + log0(f"feedback: mode={cli.feedback_mode} rank={cli.feedback_rank} " + f"per_pass={cli.per_pass_feedback} params={sum(p.numel() for p in feedback.parameters())}") + if needs_recurrence: + stabilizer = RecurrentStabilizer( + clip_hidden=cli.clip_hidden, clip_value=cli.clip_value, + jacobian_proxy_weight=cli.jacobian_proxy_weight) + if cli.residual_scale_init != 1.0: + residual_scale = ResidualScale(max_passes, cli.residual_scale_init).to(device) + base_model.residual_scale = residual_scale + extra_scalar_params.extend(residual_scale.parameters()) + sched_str = f" schedule={passes_schedule}" if passes_schedule else "" + log0(f"recurrence: core_start={args.core_start} core_end={args.core_end} " + f"num_passes={args.num_passes} max_passes={max_passes} stem={base_model.num_stem} " + f"core={base_model.num_core} tail={base_model.num_tail}{sched_str}") + compiled_model = torch.compile(base_model, dynamic=False, fullgraph=True) + model = compiled_model + matrix_params = [ + base_model.qo_bank, base_model.kv_bank, + base_model.mlp_up_bank, base_model.mlp_down_bank, + ] + block_named_params = list(base_model.blocks.named_parameters()) + scalar_params = [ + p + for name, p in block_named_params + if p.ndim < 2 or any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS) + ] + if 
base_model.skip_weights.numel() > 0: + scalar_params.append(base_model.skip_weights) + scalar_params.append(base_model.smear.gate) + token_lr = args.tied_embed_lr if args.tie_embeddings else args.embed_lr + tok_params = [{"params": [base_model.tok_emb.weight], "lr": token_lr, "base_lr": token_lr}] + optimizer_tok = torch.optim.AdamW( + tok_params, + betas=(args.beta1, args.beta2), + eps=args.adam_eps, + weight_decay=args.adam_wd, + fused=True, + ) + optimizer_muon = Muon( + matrix_params, + lr=args.matrix_lr, + momentum=args.muon_momentum, + backend_steps=args.muon_backend_steps, + weight_decay=args.muon_wd, + ) + for group in optimizer_muon.param_groups: + group["base_lr"] = args.matrix_lr + scalar_params.extend(extra_scalar_params) + optimizer_scalar = torch.optim.AdamW( + [{"params": scalar_params, "lr": args.scalar_lr, "base_lr": args.scalar_lr}], + betas=(args.beta1, args.beta2), + eps=args.adam_eps, + weight_decay=args.adam_wd, + fused=True, + ) + replicated_params = list(optimizer_tok.param_groups[0]["params"]) + for pg in optimizer_tok.param_groups[1:]: + replicated_params.extend(pg["params"]) + replicated_params.extend(scalar_params) + optimizer_head = None + if base_model.lm_head is not None: + optimizer_head = torch.optim.Adam( + [{"params": [base_model.lm_head.weight], "lr": args.head_lr, "base_lr": args.head_lr}], + betas=(args.beta1, args.beta2), + eps=args.adam_eps, + fused=True, + ) + replicated_params.append(base_model.lm_head.weight) + optimizers: list[torch.optim.Optimizer] = [optimizer_tok, optimizer_muon, optimizer_scalar] + if optimizer_head is not None: + optimizers.append(optimizer_head) + n_params = sum(p.numel() for p in base_model.parameters()) + log0(f"model_params:{n_params}") + xsa_layers = [i for i, b in enumerate(base_model.blocks) if b.attn.use_xsa] + log0(f"XSA:last_{args.xsa_last_n} active_layers:{xsa_layers}") + log0(f"world_size:{world_size} grad_accum_steps:{grad_accum_steps}") + log0("sdp_backends:cudnn=False flash=True mem_efficient=False math=False") + log0(f"attention_mode:gqa num_heads:{args.num_heads} num_kv_heads:{args.num_kv_heads}") + log0( + f"tie_embeddings:{args.tie_embeddings} embed_lr:{token_lr} " + f"head_lr:{args.head_lr if base_model.lm_head is not None else 0.0} " + f"matrix_lr:{args.matrix_lr} scalar_lr:{args.scalar_lr}" + ) + log0( + f"train_batch_tokens:{args.train_batch_tokens} train_seq_len:{args.train_seq_len} " + f"iterations:{args.iterations} warmup_steps:{args.warmup_steps} " + f"max_wallclock_seconds:{args.max_wallclock_seconds:.3f}" + ) + log0(f"seed:{args.seed}") + train_loader = DistributedTokenLoader(args.train_files, rank, world_size, device) + def zero_grad_all() -> None: + for opt in optimizers: + opt.zero_grad(set_to_none=True) + max_wallclock_ms = 1000.0 * args.max_wallclock_seconds if args.max_wallclock_seconds > 0 else None + def lr_mul(step: int, elapsed_ms: float) -> float: + if args.warmdown_iters <= 0: + return 1.0 + if max_wallclock_ms is None: + warmdown_start = max(args.iterations - args.warmdown_iters, 0) + return max((args.iterations - step) / max(args.warmdown_iters, 1), 0.0) if warmdown_start <= step < args.iterations else 1.0 + step_ms = elapsed_ms / max(step, 1) + warmdown_ms = args.warmdown_iters * step_ms + remaining_ms = max(max_wallclock_ms - elapsed_ms, 0.0) + return remaining_ms / max(warmdown_ms, 1e-9) if remaining_ms <= warmdown_ms else 1.0 + if args.warmup_steps > 0: + initial_model_state = {name: tensor.detach().cpu().clone() for name, tensor in base_model.state_dict().items()} + 
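# Deep copies pair with initial_model_state above: the warmup loop below runs real optimizer steps only to compile each pass-count/QAT specialization, and all side effects are rolled back before timed training starts. +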
initial_optimizer_states = [copy.deepcopy(opt.state_dict()) for opt in optimizers] + _precompile_passes = sorted(set(p for _, p in passes_schedule) - {args.num_passes}) if passes_schedule else [] + _qat_precompile_passes = _precompile_passes[-2:] if len(_precompile_passes) >= 2 else _precompile_passes[:] + _total_precompile = len(_precompile_passes) + len(_qat_precompile_passes) + _precompile_start = args.warmup_steps - _total_precompile + model.train() + for warmup_step in range(args.warmup_steps): + if warmup_step >= _precompile_start: + _pc_idx = warmup_step - _precompile_start + if _pc_idx < len(_precompile_passes): + base_model.num_passes = _precompile_passes[_pc_idx] + CastedLinear._qat_enabled = False + base_model.core_quant_enabled = False + else: + _qat_idx = _pc_idx - len(_precompile_passes) + base_model.num_passes = _qat_precompile_passes[_qat_idx] + CastedLinear._qat_enabled = True + base_model.core_quant_enabled = True + zero_grad_all() + for micro_step in range(grad_accum_steps): + x, y = train_loader.next_batch(args.train_batch_tokens, args.train_seq_len, grad_accum_steps) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer) + (warmup_loss * grad_scale).backward() + if distributed: + for p in base_model.parameters(): + if p.grad is not None: + dist.all_reduce(p.grad, op=dist.ReduceOp.AVG) + if feedback is not None: + for p in feedback.parameters(): + if p.grad is not None: + dist.all_reduce(p.grad, op=dist.ReduceOp.AVG) + for opt in optimizers: + opt.step() + zero_grad_all() + if args.warmup_steps <= 20 or (warmup_step + 1) % 10 == 0 or warmup_step + 1 == args.warmup_steps: + log0(f"warmup_step:{warmup_step + 1}/{args.warmup_steps}") + base_model.num_passes = args.num_passes + CastedLinear._qat_enabled = args.qat_enabled + base_model.core_quant_enabled = args.core_quant_enabled + if stabilizer is not None: + stabilizer.reset() + base_model.load_state_dict(initial_model_state, strict=True) + for opt, state in zip(optimizers, initial_optimizer_states, strict=True): + opt.load_state_dict(state) + zero_grad_all() + train_loader = DistributedTokenLoader(args.train_files, rank, world_size, device) + swa_state: dict[str, Tensor] | None = None + swa_count = 0 + _all_state = dict(base_model.state_dict()) + if feedback is not None: + for k, v in feedback.state_dict().items(): + _all_state[f"_fb.{k}"] = v + ema_state = {name: t.detach().float().clone() for name, t in _all_state.items()} + ema_decay = 0.997 + training_time_ms = 0.0 + stop_after_step: int | None = None + torch.cuda.synchronize() + t0 = time.perf_counter() + step = 0 + while True: + last_step = step == args.iterations or (stop_after_step is not None and step >= stop_after_step) + should_validate = last_step or (args.val_loss_every > 0 and step % args.val_loss_every == 0) + if should_validate: + torch.cuda.synchronize() + training_time_ms += 1000.0 * (time.perf_counter() - t0) + val_loss, val_bpb = eval_val( + args, + model, + rank, + world_size, + device, + grad_accum_steps, + val_tokens, + base_bytes_lut, + has_leading_space_lut, + is_boundary_token_lut, + ) + diag_str = "" + if stabilizer is not None and stabilizer.diagnostics.h_norms: + hn = [f"{v:.1f}" for v in stabilizer.diagnostics.h_norms[-args.num_passes*base_model.num_core:]] + gr = [f"{v:.3f}" for v in stabilizer.diagnostics.growth_ratios[-args.num_passes*base_model.num_core:]] + diag_str = f" h_norms={hn} growth={gr}" + stabilizer.reset() + log0( + 
f"step:{step}/{args.iterations} val_loss:{val_loss:.4f} val_bpb:{val_bpb:.4f} " + f"train_time:{training_time_ms:.0f}ms step_avg:{training_time_ms / max(step, 1):.2f}ms" + f"{diag_str}" + ) + torch.cuda.synchronize() + t0 = time.perf_counter() + if last_step: + if stop_after_step is not None and step < args.iterations: + log0( + f"stopping_early: wallclock_cap train_time:{training_time_ms:.0f}ms " + f"step:{step}/{args.iterations}" + ) + break + elapsed_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0) + scale = lr_mul(step, elapsed_ms) + if passes_schedule: + target_passes = args.num_passes + for threshold_step, p in passes_schedule: + if step >= threshold_step: + target_passes = p + if target_passes != base_model.num_passes: + base_model.num_passes = target_passes + log0(f"progressive_passes: step:{step} num_passes:{target_passes}") + if args.late_qat_threshold > 0 and step > 100 and scale < args.late_qat_threshold and not CastedLinear._qat_enabled: + CastedLinear._qat_enabled = True + base_model.core_quant_enabled = True + log0(f"late_qat:enabled step:{step} scale:{scale:.4f} core_quant:on") + zero_grad_all() + train_loss = torch.zeros((), device=device) + for micro_step in range(grad_accum_steps): + x, y = train_loader.next_batch(args.train_batch_tokens, args.train_seq_len, grad_accum_steps) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer) + train_loss += loss.detach() + (loss * grad_scale).backward() + train_loss /= grad_accum_steps + frac = min(step / args.muon_momentum_warmup_steps, 1.0) if args.muon_momentum_warmup_steps > 0 else 1.0 + muon_momentum = (1 - frac) * args.muon_momentum_warmup_start + frac * args.muon_momentum + for group in optimizer_muon.param_groups: + group["momentum"] = muon_momentum + for opt in optimizers: + for group in opt.param_groups: + group["lr"] = group["base_lr"] * scale + grad_norm = None + if args.grad_clip_norm > 0: + grad_norm = torch.nn.utils.clip_grad_norm_(base_model.parameters(), args.grad_clip_norm) + optimizer_muon.launch_reduce_scatters() + if distributed: + for p in replicated_params: + if p.grad is not None: + dist.all_reduce(p.grad, op=dist.ReduceOp.AVG) + optimizer_tok.step() + optimizer_scalar.step() + if optimizer_head is not None: + optimizer_head.step() + optimizer_muon.step() + zero_grad_all() + with torch.no_grad(): + _cur = dict(base_model.state_dict()) + if feedback is not None: + for k, v in feedback.state_dict().items(): + _cur[f"_fb.{k}"] = v + for name, t in _cur.items(): + ema_state[name].mul_(ema_decay).add_(t.detach().float(), alpha=1.0 - ema_decay) + step += 1 + approx_training_time_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0) + if args.swa_enabled and scale < 0.2 and step % args.swa_every == 0: + if swa_state is None: + swa_state = {name: t.detach().cpu().clone() for name, t in base_model.state_dict().items()} + swa_count = 1 + log0(f"swa:start step:{step}") + else: + for name, t in base_model.state_dict().items(): + swa_state[name] += t.detach().cpu() + swa_count += 1 + should_log_train = ( + args.train_log_every > 0 + and (step <= 10 or step % args.train_log_every == 0 or stop_after_step is not None) + ) + if should_log_train: + tl = train_loss.item() + gn_str = f" grad_norm:{grad_norm:.4f}" if grad_norm is not None else "" + log0( + f"step:{step}/{args.iterations} train_loss:{tl:.4f}{gn_str} " + f"train_time:{approx_training_time_ms:.0f}ms step_avg:{approx_training_time_ms / step:.2f}ms" + ) + 
reached_cap = max_wallclock_ms is not None and approx_training_time_ms >= max_wallclock_ms + if distributed and max_wallclock_ms is not None: + reached_cap_tensor = torch.tensor(int(reached_cap), device=device) + dist.all_reduce(reached_cap_tensor, op=dist.ReduceOp.MAX) + reached_cap = bool(reached_cap_tensor.item()) + if stop_after_step is None and reached_cap: + stop_after_step = step + log0( + f"peak memory allocated: {torch.cuda.max_memory_allocated() // 1024 // 1024} MiB " + f"reserved: {torch.cuda.max_memory_reserved() // 1024 // 1024} MiB" + ) + log0("ema:applying EMA weights") + current_state = base_model.state_dict() + model_ema = {k: v for k, v in ema_state.items() if not k.startswith("_fb.")} + avg_state = {name: model_ema[name].to(dtype=current_state[name].dtype) for name in current_state} + base_model.load_state_dict(avg_state, strict=True) + if feedback is not None: + fb_ema = {k.removeprefix("_fb."): v for k, v in ema_state.items() if k.startswith("_fb.")} + fb_state = feedback.state_dict() + fb_avg = {k: fb_ema[k].to(dtype=fb_state[k].dtype) for k in fb_state} + feedback.load_state_dict(fb_avg, strict=True) + torch.cuda.synchronize() + t_diag = time.perf_counter() + diag_val_loss, diag_val_bpb = eval_val( + args, compiled_model, rank, world_size, device, grad_accum_steps, + val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut, + ) + torch.cuda.synchronize() + log0( + f"DIAGNOSTIC post_ema val_loss:{diag_val_loss:.4f} val_bpb:{diag_val_bpb:.4f} " + f"eval_time:{1000.0 * (time.perf_counter() - t_diag):.0f}ms" + ) + full_state_dict = base_model.state_dict() + export_sd = full_state_dict + if master_process: + torch.save(export_sd, "final_model.pt") + model_bytes = os.path.getsize("final_model.pt") + code_bytes = len(code.encode("utf-8")) + log0(f"Serialized model: {model_bytes} bytes") + log0(f"Code size: {code_bytes} bytes") + eval_num_passes = args.eval_passes if args.eval_passes > 0 else args.num_passes + if eval_num_passes != args.num_passes: + log0(f"eval_override: num_passes {args.num_passes} -> {eval_num_passes}") + base_model.num_passes = eval_num_passes + if base_model.residual_scale is not None: + old_s = base_model.residual_scale.scales.data + new_s = torch.full((eval_num_passes,), cli.residual_scale_init, + dtype=torch.float32, device=old_s.device) + copy_len = min(eval_num_passes, old_s.shape[0]) + new_s[:copy_len] = old_s[:copy_len] + base_model.residual_scale.scales = nn.Parameter(new_s) + export_sd = base_model.state_dict() + sd_cpu = {k: v.detach().cpu() for k, v in export_sd.items()} + unbanked_sd = _unbank_state_dict(sd_cpu, args.num_layers) + quant_result, quant_meta = mixed_quantize_int6(unbanked_sd, {"mlp", "attn"}) + quant_buf = io.BytesIO() + torch.save({"w": quant_result, "m": quant_meta}, quant_buf) + quant_raw = quant_buf.getvalue() + quant_blob = lzma.compress(quant_raw, preset=6) + if master_process: + with open("final_model.int6.ptz", "wb") as f: + f.write(quant_blob) + quant_file_bytes = len(quant_blob) + code_bytes = len(code.encode("utf-8")) + log0(f"Serialized model int6+lzma: {quant_file_bytes} bytes") + log0(f"Total submission size int6+lzma: {quant_file_bytes + code_bytes} bytes") + if distributed: + dist.barrier() + with open("final_model.int6.ptz", "rb") as f: + quant_blob_disk = f.read() + quant_state = torch.load( + io.BytesIO(lzma.decompress(quant_blob_disk)), + map_location="cpu", + ) + deq_unbanked = dequantize_mixed_int6(quant_state["w"], quant_state["m"], unbanked_sd) + deq_state = 
_rebank_state_dict(deq_unbanked, args.num_layers, sd_cpu) + eval_model = GPT( + vocab_size=args.vocab_size, num_layers=args.num_layers, model_dim=args.model_dim, + num_heads=args.num_heads, num_kv_heads=args.num_kv_heads, mlp_mult=args.mlp_mult, + tie_embeddings=args.tie_embeddings, tied_embed_init_std=args.tied_embed_init_std, + logit_softcap=args.logit_softcap, rope_base=args.rope_base, qk_gain_init=args.qk_gain_init, + xsa_last_n=args.xsa_last_n, + rope_dims=args.rope_dims, ln_scale=args.ln_scale, core_start=args.core_start, core_end=args.core_end, + num_passes=eval_num_passes, + interpass_rmsnorm=not cli.no_interpass_rmsnorm, + ).to(device).bfloat16() + if residual_scale is not None: + eval_rs = ResidualScale(eval_num_passes, cli.residual_scale_init).to(device) + eval_model.residual_scale = eval_rs + eval_model.qo_bank.data = eval_model.qo_bank.data.float() + eval_model.kv_bank.data = eval_model.kv_bank.data.float() + eval_model.mlp_up_bank.data = eval_model.mlp_up_bank.data.float() + eval_model.mlp_down_bank.data = eval_model.mlp_down_bank.data.float() + for m in eval_model.modules(): + if isinstance(m, CastedLinear): + m.float() + restore_low_dim_params_to_fp32(eval_model) + eval_model.load_state_dict(deq_state, strict=True) + if args.ttt_enabled: + torch.cuda.synchronize() + t_ttt = time.perf_counter() + ttt_loss, ttt_bpb = eval_val_sliding_ttt( + args, eval_model, rank, world_size, device, + val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut, + stride=args.eval_stride, log0=log0, + ) + torch.cuda.synchronize() + log0(f"legal_ttt val_loss:{ttt_loss:.4f} val_bpb:{ttt_bpb:.4f} " + f"eval_time:{1000.0 * (time.perf_counter() - t_ttt):.0f}ms") + log0(f"legal_ttt_exact val_loss:{ttt_loss:.8f} val_bpb:{ttt_bpb:.8f}") + if distributed: + dist.destroy_process_group() if __name__ == "__main__": - main() + main() \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_old.py b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_old.py new file mode 100644 index 0000000000..2bda2cf69f --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_old.py @@ -0,0 +1,2275 @@ +from __future__ import annotations +import copy +import glob +import io +import lzma +import math +import os +import random +import subprocess +import sys +import time +import uuid +import zlib +from pathlib import Path +try: + import zstandard + _COMPRESSOR = "zstd" +except ImportError: + _COMPRESSOR = "zlib" +import numpy as np +import sentencepiece as spm +import torch +import torch._dynamo +torch._dynamo.config.recompile_limit = 32 +import torch.distributed as dist +import torch.nn.functional as F +from torch import Tensor, nn +from torch.nn.parallel import DistributedDataParallel as DDP +_gpu_mem_frac = float(os.environ.get("CUDA_MEM_FRACTION", "0")) +if _gpu_mem_frac > 0: + torch.cuda.set_per_process_memory_fraction(_gpu_mem_frac, 0) +from dataclasses import dataclass, field +from flash_attn_interface import flash_attn_func as flash_attn_3_func +import argparse + +# ── Stability monitoring and control for recurrent passes ── + +@dataclass +class PassDiagnostics: + h_norms: list[float] = field(default_factory=list) + delta_norms: list[float] = field(default_factory=list) + error_norms: list[float] = field(default_factory=list) + correction_norms: list[float] = field(default_factory=list) + growth_ratios: list[float] = field(default_factory=list) + + def reset(self): + for lst in (self.h_norms, 
self.delta_norms, self.error_norms, + self.correction_norms, self.growth_ratios): + lst.clear() + + def summary(self) -> dict[str, list[float]]: + return { + "h_norms": list(self.h_norms), + "delta_norms": list(self.delta_norms), + "error_norms": list(self.error_norms), + "correction_norms": list(self.correction_norms), + "growth_ratios": list(self.growth_ratios), + } + + +class RecurrentStabilizer: + def __init__( + self, + clip_hidden: bool = False, + clip_value: float = 10.0, + clip_mode: str = "value", + jacobian_proxy_weight: float = 0.0, + eps: float = 1e-6, + ): + self.clip_hidden = clip_hidden + self.clip_value = clip_value + self.clip_mode = clip_mode + self.jacobian_proxy_weight = jacobian_proxy_weight + self.eps = eps + self.diagnostics = PassDiagnostics() + + def clip(self, h: Tensor) -> Tensor: + if not self.clip_hidden: + return h + if self.clip_mode == "value": + return torch.clamp(h, -self.clip_value, self.clip_value) + norm = h.norm(dim=-1, keepdim=True) + scale = torch.clamp(self.clip_value / (norm + self.eps), max=1.0) + return h * scale + + def record_pass( + self, + h_prev: Tensor, + h_next: Tensor, + error: Tensor | None = None, + correction: Tensor | None = None, + ): + with torch.no_grad(): + h_pn = h_prev.float().norm().item() + h_nn = h_next.float().norm().item() + self.diagnostics.h_norms.append(h_nn) + self.diagnostics.delta_norms.append( + (h_next - h_prev).float().norm().item() + ) + self.diagnostics.growth_ratios.append(h_nn / (h_pn + self.eps)) + if error is not None: + self.diagnostics.error_norms.append(error.float().norm().item()) + if correction is not None: + self.diagnostics.correction_norms.append( + correction.float().norm().item() + ) + + def jacobian_proxy_loss(self, h_in: Tensor, h_out: Tensor) -> Tensor: + if self.jacobian_proxy_weight <= 0: + return h_in.new_zeros(()) + delta = h_out - h_in + ratio = delta.norm() / (h_in.norm() + self.eps) + return self.jacobian_proxy_weight * torch.relu(ratio - 1.0).square() + + def reset(self): + self.diagnostics.reset() + + +class ResidualScale(nn.Module): + def __init__(self, num_passes: int, init_value: float = 1.0): + super().__init__() + self.scales = nn.Parameter( + torch.full((num_passes,), init_value, dtype=torch.float32) + ) + + def forward(self, residual: Tensor, pass_idx: int) -> Tensor: + return self.scales[pass_idx].to(dtype=residual.dtype) * residual + + +# ── Error feedback modules for recurrent quantization correction ── + +class LowRankResidual(nn.Module): + def __init__(self, dim: int, rank: int = 2): + super().__init__() + self.V = nn.Parameter(torch.zeros(dim, rank)) + self.U = nn.Parameter(torch.zeros(dim, rank)) + + def forward(self, h: Tensor) -> Tensor: + return (h @ self.V) @ self.U.T + + +class DiagonalFeedback(nn.Module): + def __init__(self, dim: int, init_ones: bool = False): + super().__init__() + init_val = torch.ones(dim) if init_ones else torch.zeros(dim) + self.d = nn.Parameter(init_val) + + def forward(self, e: Tensor) -> Tensor: + return self.d.to(dtype=e.dtype) * e + + +class LowRankFeedback(nn.Module): + def __init__(self, dim: int, rank: int = 2): + super().__init__() + self.V_D = nn.Parameter(torch.zeros(dim, rank)) + self.U_D = nn.Parameter(torch.zeros(dim, rank)) + + def forward(self, e: Tensor) -> Tensor: + return (e @ self.V_D) @ self.U_D.T + + +class AffineJunction(nn.Module): + def __init__(self, dim: int): + super().__init__() + self.gamma = nn.Parameter(torch.ones(dim)) + self.beta = nn.Parameter(torch.zeros(dim)) + + def forward(self, h: Tensor) -> Tensor: + 
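# Per-channel affine recalibration at the recurrence junction; gamma/beta stay fp32 and are cast to the activation dtype each call. +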
return self.gamma.to(dtype=h.dtype) * h + self.beta.to(dtype=h.dtype) + + +class ErrorFeedbackModule(nn.Module): + def __init__( + self, + dim: int, + rank: int = 2, + feedback_mode: str = "diagonal", + per_pass: bool = False, + num_passes: int = 3, + affine_junction: bool = False, + ): + super().__init__() + self.feedback_mode = feedback_mode + self.per_pass = per_pass + self.num_passes = num_passes + self.residual = LowRankResidual(dim, rank) + if feedback_mode == "identity": + self.correction: nn.Module | nn.ModuleList | None = None + elif feedback_mode == "diagonal": + if per_pass: + self.correction = nn.ModuleList( + [DiagonalFeedback(dim) for _ in range(num_passes)] + ) + else: + self.correction = DiagonalFeedback(dim) + elif feedback_mode == "low_rank": + if per_pass: + self.correction = nn.ModuleList( + [LowRankFeedback(dim, rank) for _ in range(num_passes)] + ) + else: + self.correction = LowRankFeedback(dim, rank) + else: + raise ValueError(f"Unknown feedback_mode: {feedback_mode}") + self.junction: AffineJunction | None = ( + AffineJunction(dim) if affine_junction else None + ) + + def forward(self, h: Tensor, pass_idx: int) -> Tensor: + e = self.residual(h) + if self.correction is None: + c = e + elif self.per_pass: + c = self.correction[pass_idx](e) + else: + c = self.correction(e) + if self.junction is not None: + c = c + self.junction(h) + mask = torch.tensor(1.0 if pass_idx > 0 else 0.0, device=h.device, dtype=h.dtype) + return c * mask + + def param_count(self) -> int: + return sum(p.numel() for p in self.parameters()) +_wandb = None +class Hyperparameters: + data_path = os.environ.get("DATA_PATH", "./data/datasets/fineweb10B_sp1024") + train_files = os.path.join(data_path, "fineweb_train_*.bin") + val_files = os.path.join(data_path, "fineweb_val_*.bin") + tokenizer_path = os.environ.get("TOKENIZER_PATH", "./data/tokenizers/fineweb_1024_bpe.model") + run_id = os.environ.get("RUN_ID", str(uuid.uuid4())) + seed = int(os.environ.get("SEED", 1337)) + val_batch_size = int(os.environ.get("VAL_BATCH_SIZE", 524_288)) + val_loss_every = int(os.environ.get("VAL_LOSS_EVERY", 4000)) + train_log_every = int(os.environ.get("TRAIN_LOG_EVERY", 500)) + iterations = int(os.environ.get("ITERATIONS", 20000)) + warmdown_iters = int(os.environ.get("WARMDOWN_ITERS", 3500)) + warmup_steps = int(os.environ.get("WARMUP_STEPS", 20)) + train_batch_tokens = int(os.environ.get("TRAIN_BATCH_TOKENS", 786_432)) + train_seq_len = int(os.environ.get("TRAIN_SEQ_LEN", 2048)) + eval_seq_len = int(os.environ.get("EVAL_SEQ_LEN", 2048)) + max_wallclock_seconds = float(os.environ.get("MAX_WALLCLOCK_SECONDS", 600.0)) + qk_gain_init = float(os.environ.get("QK_GAIN_INIT", 1.5)) + vocab_size = int(os.environ.get("VOCAB_SIZE", 1024)) + num_layers = int(os.environ.get("NUM_LAYERS", 11)) + num_kv_heads = int(os.environ.get("NUM_KV_HEADS", 4)) + model_dim = int(os.environ.get("MODEL_DIM", 512)) + num_heads = int(os.environ.get("NUM_HEADS", 8)) + mlp_mult = float(os.environ.get("MLP_MULT", 3.0)) + tie_embeddings = bool(int(os.environ.get("TIE_EMBEDDINGS", "1"))) + rope_base = float(os.environ.get("ROPE_BASE", 10000.0)) + logit_softcap = float(os.environ.get("LOGIT_SOFTCAP", 30.0)) + embed_lr = float(os.environ.get("EMBED_LR", 0.6)) + head_lr = float(os.environ.get("HEAD_LR", 0.008)) + tied_embed_lr = float(os.environ.get("TIED_EMBED_LR", 0.035)) + tied_embed_init_std = float(os.environ.get("TIED_EMBED_INIT_STD", 0.005)) + matrix_lr = float(os.environ.get("MATRIX_LR", 0.025)) + scalar_lr = float(os.environ.get("SCALAR_LR", 
0.025)) + muon_momentum = float(os.environ.get("MUON_MOMENTUM", 0.99)) + muon_backend_steps = int(os.environ.get("MUON_BACKEND_STEPS", 5)) + muon_momentum_warmup_start = float(os.environ.get("MUON_MOMENTUM_WARMUP_START", 0.92)) + muon_momentum_warmup_steps = int(os.environ.get("MUON_MOMENTUM_WARMUP_STEPS", 1500)) + beta1 = float(os.environ.get("BETA1", 0.9)) + beta2 = float(os.environ.get("BETA2", 0.95)) + adam_eps = float(os.environ.get("ADAM_EPS", 1e-8)) + grad_clip_norm = float(os.environ.get("GRAD_CLIP_NORM", 0.3)) + eval_stride = int(os.environ.get("EVAL_STRIDE", 64)) + mtp_num_heads = int(os.environ.get("MTP_NUM_HEADS", 0)) + mtp_loss_weight = float(os.environ.get("MTP_LOSS_WEIGHT", 0.2)) + muon_beta2 = float(os.environ.get("MUON_BETA2", 0.95)) + swa_enabled = bool(int(os.environ.get("SWA_ENABLED", "1"))) + swa_every = int(os.environ.get("SWA_EVERY", 50)) + lawa_enabled = bool(int(os.environ.get("LAWA_ENABLED", "0"))) + lawa_k = int(os.environ.get("LAWA_K", 10)) + lawa_freq = int(os.environ.get("LAWA_FREQ", 100)) + muon_wd = float(os.environ.get("MUON_WD", 0.04)) + adam_wd = float(os.environ.get("ADAM_WD", 0.04)) + qat_enabled = bool(int(os.environ.get("QAT_ENABLED", "0"))) + bigram_vocab_size = int(os.environ.get("BIGRAM_VOCAB_SIZE", 2048)) + bigram_dim = int(os.environ.get("BIGRAM_DIM", 128)) + xsa_last_n = int(os.environ.get("XSA_LAST_N", 4)) + rope_dims = int(os.environ.get("ROPE_DIMS", 16)) + ln_scale = bool(int(os.environ.get("LN_SCALE", "1"))) + dtg_enabled = bool(int(os.environ.get("DTG_ENABLED", "0"))) + late_qat_threshold = float(os.environ.get("LATE_QAT_THRESHOLD", 0.15)) + ve_enabled = bool(int(os.environ.get("VE_ENABLED", "1"))) + ve_dim = int(os.environ.get("VE_DIM", 128)) + ve_layers = os.environ.get("VE_LAYERS", "9,10") + gated_attention = bool(int(os.environ.get("GATED_ATTENTION", "0"))) + value_residual = bool(int(os.environ.get("VALUE_RESIDUAL", "0"))) + ttt_enabled = bool(int(os.environ.get("TTT_ENABLED", "0"))) + ttt_lr = float(os.environ.get("TTT_LR", 0.002)) + ttt_epochs = int(os.environ.get("TTT_EPOCHS", 3)) + ttt_chunk_tokens = int(os.environ.get("TTT_CHUNK_TOKENS", 32768)) + ttt_freeze_blocks = int(os.environ.get("TTT_FREEZE_BLOCKS", 2)) + ttt_momentum = float(os.environ.get("TTT_MOMENTUM", 0.9)) + ttt_batch_seqs = int(os.environ.get("TTT_BATCH_SEQS", 32)) + ttt_grad_clip = float(os.environ.get("TTT_GRAD_CLIP", 1.0)) + # recurrence + core_start = int(os.environ.get("CORE_START", 3)) + core_end = int(os.environ.get("CORE_END", 8)) + num_passes = int(os.environ.get("NUM_PASSES", 1)) + core_quant_bits = int(os.environ.get("CORE_QUANT_BITS", 6)) + core_quant_enabled = bool(int(os.environ.get("CORE_QUANT_ENABLED", "0"))) + eval_passes = int(os.environ.get("EVAL_PASSES", 0)) + # Progressive passes schedule: comma-separated "step:passes" pairs, e.g. "0:1,4500:2,5500:3,6000:4" + passes_schedule_str = os.environ.get("PASSES_SCHEDULE", "") + +# --- Batched Newton-Schulz orthogonalization --- + +def zeropower_via_newtonschulz5(G: Tensor, steps: int = 5, eps: float = 1e-7) -> Tensor: + """Batched Newton-Schulz orthogonalization. 
G: (B,M,N) or (M,N).""" + a, b, c = (3.4445, -4.7750, 2.0315) + was_2d = G.ndim == 2 + if was_2d: + G = G.unsqueeze(0) + X = G.bfloat16() + transposed = X.size(-2) > X.size(-1) + if transposed: + X = X.mT + X = X / (X.norm(dim=(-2, -1), keepdim=True) + eps) + for _ in range(steps): + A = X @ X.mT + B = b * A + c * (A @ A) + X = a * X + B @ X + if transposed: + X = X.mT + if was_2d: + X = X.squeeze(0) + return X + +# --- Parallel Muon optimizer --- + +class Muon(torch.optim.Optimizer): + """Parallel Muon: post-backward reduce-scatter -> local NS5 -> all-gather. + + No DDP for bank params. After backward, this optimizer: + 1. Launches async reduce-scatter for all banks (biggest first) + 2. Returns control so Adam can step on small params while RS is in-flight + 3. Waits for each RS, runs local NS5 on the shard, launches async all-gather + 4. Each all-gather overlaps with next bank's NS5 + """ + def __init__(self, params, lr: float, momentum: float, backend_steps: int, + nesterov: bool = True, weight_decay: float = 0.0): + super().__init__( + params, + dict(lr=lr, momentum=momentum, backend_steps=backend_steps, + nesterov=nesterov, weight_decay=weight_decay), + ) + self._built = False + + def _build(self): + self._distributed = dist.is_available() and dist.is_initialized() + self._world_size = dist.get_world_size() if self._distributed else 1 + self._rank = dist.get_rank() if self._distributed else 0 + ws = self._world_size + + self._bank_meta = [] + for group in self.param_groups: + for p in group["params"]: + B = p.shape[0] + padded_B = ((B + ws - 1) // ws) * ws + shard_B = padded_B // ws + tail = p.shape[1:] + dev = p.device + self._bank_meta.append({ + 'p': p, + 'B': B, + 'padded_grad': torch.zeros(padded_B, *tail, device=dev, dtype=torch.bfloat16), + 'shard': torch.zeros(shard_B, *tail, device=dev, dtype=torch.bfloat16), + 'shard_mom': torch.zeros(shard_B, *tail, device=dev, dtype=torch.bfloat16), + 'full_update': torch.zeros(padded_B, *tail, device=dev, dtype=torch.bfloat16), + 'scale': max(1, p.shape[-2] / p.shape[-1]) ** 0.5, + }) + # Sort by size descending -- launch biggest reduce-scatters first + self._bank_meta.sort(key=lambda m: -m['p'].numel()) + self._built = True + + def launch_reduce_scatters(self): + """Phase 1: launch async reduce-scatter for all banks. Call right after backward.""" + if not self._built: + self._build() + if not self._distributed: + return + self._rs_futures = [] + for m in self._bank_meta: + p = m['p'] + if p.grad is None: + self._rs_futures.append(None) + continue + pg = m['padded_grad'] + pg[:m['B']].copy_(p.grad.bfloat16()) + if pg.shape[0] > m['B']: + pg[m['B']:].zero_() + fut = dist.reduce_scatter_tensor(m['shard'], pg, op=dist.ReduceOp.AVG, async_op=True) + self._rs_futures.append(fut) + + @torch.no_grad() + def step(self, closure=None): + """Phase 3: wait for RS, local NS5, all-gather. 
Call AFTER Adam steps.""" + loss = None + if closure is not None: + with torch.enable_grad(): + loss = closure() + + if not self._built: + self._build() + + for group in self.param_groups: + lr = group["lr"] + momentum = group["momentum"] + backend_steps = group["backend_steps"] + nesterov = group["nesterov"] + wd = group.get("weight_decay", 0.0) + + prev_ag_handle = None + prev_m = None + + sharded = self._distributed and hasattr(self, '_rs_futures') + + for i, m in enumerate(self._bank_meta): + p = m['p'] + if p.grad is None: + continue + + if prev_ag_handle is not None: + prev_ag_handle.wait() + pp = prev_m['p'] + upd = prev_m['full_update'][:prev_m['B']] + if wd > 0.0: + pp.data.mul_(1.0 - lr * wd) + pp.add_(upd.to(dtype=pp.dtype), alpha=-lr * prev_m['scale']) + + if sharded and self._rs_futures[i] is not None: + self._rs_futures[i].wait() + g = m['shard'] + buf = m['shard_mom'] + else: + g = p.grad.bfloat16() + state = self.state[p] + if "momentum_buffer" not in state: + state["momentum_buffer"] = torch.zeros_like(g) + buf = state["momentum_buffer"] + + buf.mul_(momentum).add_(g) + if nesterov: + update = g.add(buf, alpha=momentum) + else: + update = buf + + update = zeropower_via_newtonschulz5(update, steps=backend_steps) + + if sharded: + prev_ag_handle = dist.all_gather_into_tensor( + m['full_update'], update, async_op=True) + prev_m = m + else: + if wd > 0.0: + p.data.mul_(1.0 - lr * wd) + p.add_(update.to(dtype=p.dtype), alpha=-lr * m['scale']) + + if prev_ag_handle is not None: + prev_ag_handle.wait() + pp = prev_m['p'] + upd = prev_m['full_update'][:prev_m['B']] + if wd > 0.0: + pp.data.mul_(1.0 - lr * wd) + pp.add_(upd.to(dtype=pp.dtype), alpha=-lr * prev_m['scale']) + + if hasattr(self, '_rs_futures'): + del self._rs_futures + + return loss + +# --- Tokenizer evaluation helpers --- + +def build_sentencepiece_luts( + sp: spm.SentencePieceProcessor, vocab_size: int, device: torch.device +) -> tuple[Tensor, Tensor, Tensor]: + sp_vocab_size = int(sp.vocab_size()) + table_size = max(sp_vocab_size, vocab_size) + base_bytes_np = np.zeros((table_size,), dtype=np.int16) + has_leading_space_np = np.zeros((table_size,), dtype=np.bool_) + is_boundary_token_np = np.ones((table_size,), dtype=np.bool_) + for token_id in range(sp_vocab_size): + if sp.is_control(token_id) or sp.is_unknown(token_id) or sp.is_unused(token_id): + continue + is_boundary_token_np[token_id] = False + if sp.is_byte(token_id): + base_bytes_np[token_id] = 1 + continue + piece = sp.id_to_piece(token_id) + if piece.startswith("\u2581"): + has_leading_space_np[token_id] = True + piece = piece[1:] + base_bytes_np[token_id] = len(piece.encode("utf-8")) + return ( + torch.tensor(base_bytes_np, dtype=torch.int16, device=device), + torch.tensor(has_leading_space_np, dtype=torch.bool, device=device), + torch.tensor(is_boundary_token_np, dtype=torch.bool, device=device), + ) +def load_validation_tokens(pattern: str, seq_len: int) -> Tensor: + files = [Path(p) for p in sorted(glob.glob(pattern))] + if not files: + raise FileNotFoundError(f"No files found for pattern: {pattern}") + tokens = torch.cat([load_data_shard(file) for file in files]).contiguous() + usable = ((tokens.numel() - 1) // seq_len) * seq_len + if usable <= 0: + raise ValueError(f"Validation split is too short for TRAIN_SEQ_LEN={seq_len}") + return tokens[: usable + 1] +def eval_val( + args: Hyperparameters, + model: nn.Module, + rank: int, + world_size: int, + device: torch.device, + grad_accum_steps: int, + val_tokens: Tensor, + base_bytes_lut: Tensor, + 
has_leading_space_lut: Tensor, + is_boundary_token_lut: Tensor, + eval_seq_len: int | None = None, +) -> tuple[float, float]: + seq_len = eval_seq_len or args.train_seq_len + local_batch_tokens = args.val_batch_size // (world_size * grad_accum_steps) + if local_batch_tokens < seq_len: + raise ValueError( + "VAL_BATCH_SIZE must provide at least one sequence per rank; " + f"got VAL_BATCH_SIZE={args.val_batch_size}, WORLD_SIZE={world_size}, " + f"GRAD_ACCUM_STEPS={grad_accum_steps}, seq_len={seq_len}" + ) + local_batch_seqs = local_batch_tokens // seq_len + total_seqs = (val_tokens.numel() - 1) // seq_len + seq_start = (total_seqs * rank) // world_size + seq_end = (total_seqs * (rank + 1)) // world_size + val_loss_sum = torch.zeros((), device=device, dtype=torch.float64) + val_token_count = torch.zeros((), device=device, dtype=torch.float64) + val_byte_count = torch.zeros((), device=device, dtype=torch.float64) + model.eval() + with torch.inference_mode(): + for batch_seq_start in range(seq_start, seq_end, local_batch_seqs): + batch_seq_end = min(batch_seq_start + local_batch_seqs, seq_end) + raw_start = batch_seq_start * seq_len + raw_end = batch_seq_end * seq_len + 1 + local = val_tokens[raw_start:raw_end].to(device=device, dtype=torch.int64, non_blocking=True) + x = local[:-1].reshape(-1, seq_len) + y = local[1:].reshape(-1, seq_len) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + batch_loss = model(x, y).detach() + batch_token_count = float(y.numel()) + val_loss_sum += batch_loss.to(torch.float64) * batch_token_count + val_token_count += batch_token_count + prev_ids = x.reshape(-1) + tgt_ids = y.reshape(-1) + token_bytes = base_bytes_lut[tgt_ids].to(dtype=torch.int16) + token_bytes += (has_leading_space_lut[tgt_ids] & ~is_boundary_token_lut[prev_ids]).to(dtype=torch.int16) + val_byte_count += token_bytes.to(torch.float64).sum() + if dist.is_available() and dist.is_initialized(): + dist.all_reduce(val_loss_sum, op=dist.ReduceOp.SUM) + dist.all_reduce(val_token_count, op=dist.ReduceOp.SUM) + dist.all_reduce(val_byte_count, op=dist.ReduceOp.SUM) + val_loss = val_loss_sum / val_token_count + bits_per_token = val_loss.item() / math.log(2.0) + tokens_per_byte = val_token_count.item() / val_byte_count.item() + model.train() + return float(val_loss.item()), float(bits_per_token * tokens_per_byte) + +# --- Quantization helpers --- + +CONTROL_TENSOR_NAME_PATTERNS = tuple( + pattern + for pattern in os.environ.get( + "CONTROL_TENSOR_NAME_PATTERNS", + "attn_scale,attn_scales,mlp_scale,mlp_scales,resid_mix,resid_mixes,q_gain,skip_weight,skip_weights,smear,dtg_gate,ve_layer_scales,ve_shared.scale,attn_gate,vr_lambda", + ).split(",") + if pattern +) +INT8_KEEP_FLOAT_FP32_NAME_PATTERNS = tuple( + pattern + for pattern in os.environ.get( + "INT8_KEEP_FLOAT_FP32_NAME_PATTERNS", + ",".join(CONTROL_TENSOR_NAME_PATTERNS), + ).split(",") + if pattern +) +INT8_KEEP_FLOAT_MAX_NUMEL = 65_536 +INT8_KEEP_FLOAT_STORE_DTYPE = torch.float16 +INT8_PER_ROW_SCALE_DTYPE = torch.float16 +INT8_CLIP_PERCENTILE = 99.99984 +INT8_CLIP_Q = INT8_CLIP_PERCENTILE / 100.0 +def tensor_nbytes(t: Tensor) -> int: + return int(t.numel()) * int(t.element_size()) +def keep_float_tensor(name: str, t: Tensor, passthrough_orig_dtypes: dict[str, str]) -> Tensor: + if any(pattern in name for pattern in INT8_KEEP_FLOAT_FP32_NAME_PATTERNS): + return t.float().contiguous() + if t.dtype in {torch.float32, torch.bfloat16}: + passthrough_orig_dtypes[name] = str(t.dtype).removeprefix("torch.") + return 
t.to(dtype=INT8_KEEP_FLOAT_STORE_DTYPE).contiguous() + return t +def quantize_float_tensor(t: Tensor) -> tuple[Tensor, Tensor]: + t32 = t.float() + if t32.ndim == 2: + clip_abs = ( + torch.quantile(t32.abs(), INT8_CLIP_Q, dim=1) + if t32.numel() + else torch.empty((t32.shape[0],), dtype=torch.float32) + ) + clipped = torch.maximum(torch.minimum(t32, clip_abs[:, None]), -clip_abs[:, None]) + scale = (clip_abs / 127.0).clamp_min(1.0 / 127.0) + q = torch.clamp(torch.round(clipped / scale[:, None]), -127, 127).to(torch.int8).contiguous() + return q, scale.to(dtype=INT8_PER_ROW_SCALE_DTYPE).contiguous() + clip_abs = float(torch.quantile(t32.abs().flatten(), INT8_CLIP_Q).item()) if t32.numel() else 0.0 + scale = torch.tensor(clip_abs / 127.0 if clip_abs > 0 else 1.0, dtype=torch.float32) + q = torch.clamp(torch.round(torch.clamp(t32, -clip_abs, clip_abs) / scale), -127, 127).to(torch.int8).contiguous() + return q, scale +def quantize_state_dict_int8(state_dict: dict[str, Tensor]): + quantized: dict[str, Tensor] = {} + scales: dict[str, Tensor] = {} + dtypes: dict[str, str] = {} + passthrough: dict[str, Tensor] = {} + passthrough_orig_dtypes: dict[str, str] = {} + qmeta: dict[str, dict[str, object]] = {} + stats = dict.fromkeys( + ("param_count", "num_tensors", "num_float_tensors", "num_nonfloat_tensors", "baseline_tensor_bytes", "int8_payload_bytes"), + 0, + ) + for name, tensor in state_dict.items(): + t = tensor.detach().to("cpu").contiguous() + stats["param_count"] += int(t.numel()) + stats["num_tensors"] += 1 + stats["baseline_tensor_bytes"] += tensor_nbytes(t) + if not t.is_floating_point(): + stats["num_nonfloat_tensors"] += 1 + passthrough[name] = t + stats["int8_payload_bytes"] += tensor_nbytes(t) + continue + if t.numel() <= INT8_KEEP_FLOAT_MAX_NUMEL: + kept = keep_float_tensor(name, t, passthrough_orig_dtypes) + passthrough[name] = kept + stats["int8_payload_bytes"] += tensor_nbytes(kept) + continue + stats["num_float_tensors"] += 1 + q, s = quantize_float_tensor(t) + if s.ndim > 0: + qmeta[name] = {"scheme": "per_row", "axis": 0} + quantized[name] = q + scales[name] = s + dtypes[name] = str(t.dtype).removeprefix("torch.") + stats["int8_payload_bytes"] += tensor_nbytes(q) + tensor_nbytes(s) + obj: dict[str, object] = { + "__quant_format__": "int8_clean_per_row_v1", + "quantized": quantized, + "scales": scales, + "dtypes": dtypes, + "passthrough": passthrough, + } + if qmeta: + obj["qmeta"] = qmeta + if passthrough_orig_dtypes: + obj["passthrough_orig_dtypes"] = passthrough_orig_dtypes + return obj, stats +def dequantize_state_dict_int8(obj: dict[str, object]) -> dict[str, Tensor]: + out: dict[str, Tensor] = {} + qmeta = obj.get("qmeta", {}) + passthrough_orig_dtypes = obj.get("passthrough_orig_dtypes", {}) + for name, q in obj["quantized"].items(): + dtype = getattr(torch, obj["dtypes"][name]) + s = obj["scales"][name] + if qmeta.get(name, {}).get("scheme") == "per_row" or s.ndim > 0: + s = s.to(dtype=torch.float32) + out[name] = (q.float() * s.view(q.shape[0], *([1] * (q.ndim - 1)))).to(dtype=dtype).contiguous() + else: + scale = float(s.item()) + out[name] = (q.float() * scale).to(dtype=dtype).contiguous() + for name, t in obj["passthrough"].items(): + out_t = t.detach().to("cpu").contiguous() + orig_dtype = passthrough_orig_dtypes.get(name) + if isinstance(orig_dtype, str): + out_t = out_t.to(dtype=getattr(torch, orig_dtype)).contiguous() + out[name] = out_t + return out + +# --- Data loading --- + +def load_data_shard(file: Path) -> Tensor: + # fineweb .bin shard layout: 256 little-endian int32 header words (word 2 holds the token count) followed by uint16 token ids + header_bytes = 256 * np.dtype("<i4").itemsize + with file.open("rb") as f: + header = np.frombuffer(f.read(header_bytes), dtype="<i4") + num_tokens = int(header[2]) + tokens = np.frombuffer(f.read(2 * num_tokens), dtype="<u2") + if tokens.size != num_tokens: + raise ValueError(f"Shard {file} is truncated: expected {num_tokens} tokens, got {tokens.size}") + return torch.from_numpy(tokens.astype(np.int32)) +class TokenStream: + def __init__(self, pattern: str): + self.files = [Path(p) for p in sorted(glob.glob(pattern))] + if not self.files: + raise FileNotFoundError(f"No files found for pattern: {pattern}") + self.file_idx = 0 + self.tokens = load_data_shard(self.files[0]) + self.pos = 0 + def _advance_file(self) -> None: +
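# Cycle to the next shard, wrapping around the shard list, and reset the read position. +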
self.file_idx = (self.file_idx + 1) % len(self.files) + self.tokens = load_data_shard(self.files[self.file_idx]) + self.pos = 0 + def take(self, n: int) -> Tensor: + chunks: list[Tensor] = [] + remaining = n + while remaining > 0: + avail = self.tokens.numel() - self.pos + if avail <= 0: + self._advance_file() + continue + k = min(remaining, avail) + chunks.append(self.tokens[self.pos : self.pos + k]) + self.pos += k + remaining -= k + return chunks[0] if len(chunks) == 1 else torch.cat(chunks) +class DistributedTokenLoader: + def __init__(self, pattern: str, rank: int, world_size: int, device: torch.device): + self.rank = rank + self.world_size = world_size + self.device = device + self.stream = TokenStream(pattern) + def next_batch(self, global_tokens: int, seq_len: int, grad_accum_steps: int) -> tuple[Tensor, Tensor]: + local_tokens = global_tokens // (self.world_size * grad_accum_steps) + per_rank_span = local_tokens + 1 + chunk = self.stream.take(per_rank_span * self.world_size) + start = self.rank * per_rank_span + local = chunk[start : start + per_rank_span].to(dtype=torch.int64) + x = local[:-1].reshape(-1, seq_len) + y = local[1:].reshape(-1, seq_len) + return x.to(self.device, non_blocking=True), y.to(self.device, non_blocking=True) + +# --- Transformer modules --- + +class RMSNorm(nn.Module): + def __init__(self, eps: float | None = None): + super().__init__() + self.eps = eps + def forward(self, x: Tensor) -> Tensor: + return F.rms_norm(x, (x.size(-1),), eps=self.eps) +class CastedLinear(nn.Linear): + _qat_enabled: bool = False + def forward(self, x: Tensor) -> Tensor: + w = self.weight.to(x.dtype) + if CastedLinear._qat_enabled and self.training and w.ndim == 2: + with torch.no_grad(): + w32 = self.weight.float() + row_max = w32.abs().amax(dim=1) + scale = (row_max / 31.0).clamp_min(1.0 / 31.0) + w_q = (torch.clamp(torch.round(w32 / scale[:, None]), -32, 31) * scale[:, None]).to(x.dtype) + w = w + (w_q - w).detach() + bias = self.bias.to(x.dtype) if self.bias is not None else None + return F.linear(x, w, bias) +def restore_low_dim_params_to_fp32(module: nn.Module) -> None: + with torch.no_grad(): + for name, param in module.named_parameters(): + if (param.ndim < 2 or any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS)) and param.dtype != torch.float32: + param.data = param.data.float() +class Rotary(nn.Module): + def __init__(self, dim: int, base: float = 10000.0, train_seq_len: int = 1024, rope_dims: int = 0): + super().__init__() + self.dim = dim + self.base = base + self.train_seq_len = train_seq_len + self.rope_dims = rope_dims if rope_dims > 0 else dim + inv_freq = 1.0 / (base ** (torch.arange(0, self.rope_dims, 2, dtype=torch.float32) / self.rope_dims)) + self.register_buffer("inv_freq", inv_freq, persistent=False) + self._seq_len_cached = 0 + self._cos_cached: Tensor | None = None + self._sin_cached: Tensor | None = None + def forward(self, seq_len: int, device: torch.device, dtype: torch.dtype) -> tuple[Tensor, Tensor]: + if ( + self._cos_cached is None + or self._sin_cached is None + or self._seq_len_cached != seq_len + or self._cos_cached.device != device + ): + rd = self.rope_dims + if seq_len > self.train_seq_len: + scale = seq_len / self.train_seq_len + new_base = self.base * (scale ** (rd / (rd - 2))) + inv_freq = 1.0 / (new_base ** (torch.arange(0, rd, 2, dtype=torch.float32, device=device) / rd)) + else: + inv_freq = self.inv_freq.to(device) + t = torch.arange(seq_len, device=device, dtype=inv_freq.dtype) + freqs = torch.outer(t, inv_freq) + 
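# Cache cos/sin with shape (1, seq_len, 1, rope_dims // 2) so they broadcast over batch and heads in the (B, T, H, D) layout used by apply_rotary_emb. +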
self._cos_cached = freqs.cos()[None, :, None, :] + self._sin_cached = freqs.sin()[None, :, None, :] + self._seq_len_cached = seq_len + return self._cos_cached.to(dtype=dtype), self._sin_cached.to(dtype=dtype) +def apply_rotary_emb(x: Tensor, cos: Tensor, sin: Tensor, rope_dims: int = 0) -> Tensor: + if rope_dims > 0 and rope_dims < x.size(-1): + x_rope, x_pass = x[..., :rope_dims], x[..., rope_dims:] + half = rope_dims // 2 + x1, x2 = x_rope[..., :half], x_rope[..., half:] + x_rope = torch.cat((x1 * cos + x2 * sin, x1 * (-sin) + x2 * cos), dim=-1) + return torch.cat((x_rope, x_pass), dim=-1) + half = x.size(-1) // 2 + x1, x2 = x[..., :half], x[..., half:] + return torch.cat((x1 * cos + x2 * sin, x1 * (-sin) + x2 * cos), dim=-1) + +class CausalSelfAttention(nn.Module): + def __init__( + self, + dim: int, + num_heads: int, + num_kv_heads: int, + rope_base: float, + qk_gain_init: float, + gated_attention: bool = False, + value_residual: bool = False, + ): + super().__init__() + if dim % num_heads != 0: + raise ValueError("model_dim must be divisible by num_heads") + if num_heads % num_kv_heads != 0: + raise ValueError("num_heads must be divisible by num_kv_heads") + self.num_heads = num_heads + self.num_kv_heads = num_kv_heads + self.head_dim = dim // num_heads + if self.head_dim % 2 != 0: + raise ValueError("head_dim must be even for RoPE") + # No CastedLinear -- weights come from banks + self.q_gain = nn.Parameter(torch.full((num_heads,), qk_gain_init, dtype=torch.float32)) + self.rope_dims = 0 # set by GPT.__init__ for partial RoPE + self.rotary = Rotary(self.head_dim, base=rope_base, train_seq_len=1024) + self.use_xsa = False # set by GPT.__init__ for deep layers only + # Gated attention and value residual (non-banked small params) + self.gated_attention = gated_attention + if gated_attention: + self.attn_gate = nn.Linear(dim, num_heads, bias=True) + nn.init.zeros_(self.attn_gate.weight) + nn.init.constant_(self.attn_gate.bias, 4.0) + self.value_residual = value_residual + if value_residual: + self.vr_lambda = nn.Parameter(torch.tensor([0.5, 0.5], dtype=torch.float32)) + def _xsa_efficient(self, y: Tensor, v: Tensor) -> Tensor: + """Efficient XSA: subtract self-value projection via GQA-aware reshape (no repeat_interleave). + y: [B, T, H, D], v: [B, T, Hkv, D]. 
H must be divisible by Hkv.""" + B, T, H, D = y.shape + Hkv = v.size(-2) + group = H // Hkv + y_g = y.reshape(B, T, Hkv, group, D) # [B, T, Hkv, group, D] + vn = F.normalize(v, dim=-1).unsqueeze(-2) # [B, T, Hkv, 1, D] -- broadcast ready + proj = (y_g * vn).sum(dim=-1, keepdim=True) * vn + return (y_g - proj).reshape(B, T, H, D) + def forward(self, x: Tensor, q_w: Tensor, k_w: Tensor, v_w: Tensor, out_w: Tensor, v_embed: Tensor | None = None, v0: Tensor | None = None) -> tuple[Tensor, Tensor | None]: + bsz, seqlen, dim = x.shape + q = F.linear(x, q_w.to(x.dtype)).reshape(bsz, seqlen, self.num_heads, self.head_dim) + k = F.linear(x, k_w.to(x.dtype)).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim) + v = F.linear(x, v_w.to(x.dtype)) + if v_embed is not None: + v = v + v_embed + v = v.reshape(bsz, seqlen, self.num_kv_heads, self.head_dim) + raw_v = v if self.value_residual else None + if self.value_residual and v0 is not None: + lam = self.vr_lambda.to(dtype=v.dtype) + v = lam[0] * v0 + lam[1] * v + q = F.rms_norm(q, (q.size(-1),)) + k = F.rms_norm(k, (k.size(-1),)) + cos, sin = self.rotary(seqlen, x.device, q.dtype) + q = apply_rotary_emb(q, cos, sin, self.rope_dims) + k = apply_rotary_emb(k, cos, sin, self.rope_dims) + q = q * self.q_gain.to(dtype=q.dtype)[None, None, :, None] + y = flash_attn_3_func(q, k, v, causal=True) + if self.use_xsa: + y = self._xsa_efficient(y, v) + if self.gated_attention: + # gate shape: (bsz, seqlen, num_heads) -> (bsz, seqlen, num_heads, 1) for B,T,H,D layout + gate = torch.sigmoid(self.attn_gate(x)).unsqueeze(-1) + y = y * gate + y = y.reshape(bsz, seqlen, dim) + return F.linear(y, out_w.to(x.dtype)), raw_v + +class SmearGate(nn.Module): + def __init__(self, dim: int): + super().__init__() + self.gate = nn.Parameter(torch.zeros(dim, dtype=torch.float32)) + def forward(self, x: Tensor) -> Tensor: + g = torch.sigmoid(self.gate.to(dtype=x.dtype))[None, None, :] + x_prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1) + return (1 - g) * x + g * x_prev + +class BigramHashEmbedding(nn.Module): + def __init__(self, bigram_vocab_size: int, bigram_dim: int, model_dim: int): + super().__init__() + self.bigram_vocab_size = bigram_vocab_size + self.embed = nn.Embedding(bigram_vocab_size, bigram_dim) + nn.init.zeros_(self.embed.weight) + self.proj = CastedLinear(bigram_dim, model_dim, bias=False) if bigram_dim != model_dim else None + if self.proj is not None: + nn.init.zeros_(self.proj.weight) + self.scale = nn.Parameter(torch.tensor(0.05, dtype=torch.float32)) + def bigram_hash(self, tokens: Tensor) -> Tensor: + t = tokens.to(torch.int32) + mod = self.bigram_vocab_size - 1 + out = torch.empty_like(t) + out[..., 0] = mod + out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod + return out.long() + def forward(self, token_ids: Tensor) -> Tensor: + h = self.embed(self.bigram_hash(token_ids)) + if self.proj is not None: + h = self.proj(h) + return h * self.scale.to(dtype=h.dtype) + +class ValueEmbedding(nn.Module): + """Reinject token identity into attention values at specific layers. 
+ Each table maps vocab tokens to a low-dim embedding, projected to model_dim.""" + def __init__(self, vocab_size: int, ve_dim: int, model_dim: int): + super().__init__() + self.embed = nn.Embedding(vocab_size, ve_dim) + nn.init.normal_(self.embed.weight, std=0.01) + self.proj = CastedLinear(ve_dim, model_dim, bias=False) if ve_dim != model_dim else None + if self.proj is not None: + nn.init.zeros_(self.proj.weight) + self.scale = nn.Parameter(torch.tensor(0.1, dtype=torch.float32)) + def forward(self, token_ids: Tensor) -> Tensor: + h = self.embed(token_ids) + if self.proj is not None: + h = self.proj(h) + return h * self.scale.to(dtype=h.dtype) + +class MLP(nn.Module): + def __init__(self, dim: int, mlp_mult: int): + super().__init__() + # No CastedLinear -- weights come from banks + def forward(self, x: Tensor, up_w: Tensor, down_w: Tensor) -> Tensor: + x = F.leaky_relu(F.linear(x, up_w.to(x.dtype)), negative_slope=0.5) + return F.linear(x.square(), down_w.to(x.dtype)) + +class Block(nn.Module): + def __init__( + self, + dim: int, + num_heads: int, + num_kv_heads: int, + mlp_mult: int, + rope_base: float, + qk_gain_init: float, + layer_idx: int = 0, + ln_scale: bool = False, + dtg: bool = False, + gated_attention: bool = False, + value_residual: bool = False, + ): + super().__init__() + self.attn_norm = RMSNorm() + self.mlp_norm = RMSNorm() + self.attn = CausalSelfAttention(dim, num_heads, num_kv_heads, rope_base, qk_gain_init, + gated_attention=gated_attention, value_residual=value_residual) + self.mlp = MLP(dim, mlp_mult) + self.attn_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32)) + self.mlp_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32)) + self.resid_mix = nn.Parameter(torch.stack((torch.ones(dim), torch.zeros(dim))).float()) + self.ln_scale_factor = 1.0 / math.sqrt(layer_idx + 1) if ln_scale else 1.0 + if dtg: + self.dtg_gate = nn.Linear(dim, 1, bias=True) + nn.init.zeros_(self.dtg_gate.weight) + nn.init.constant_(self.dtg_gate.bias, 2.0) + else: + self.dtg_gate = None + def forward(self, x: Tensor, x0: Tensor, q_w: Tensor, k_w: Tensor, v_w: Tensor, out_w: Tensor, up_w: Tensor, down_w: Tensor, v_embed: Tensor | None = None, v0: Tensor | None = None) -> tuple[Tensor, Tensor | None]: + mix = self.resid_mix.to(dtype=x.dtype) + x_in = mix[0][None, None, :] * x + mix[1][None, None, :] * x0 + attn_out, raw_v = self.attn(self.attn_norm(x_in) * self.ln_scale_factor, q_w, k_w, v_w, out_w, v_embed=v_embed, v0=v0) + x_out = x_in + self.attn_scale.to(dtype=x_in.dtype)[None, None, :] * attn_out + x_out = x_out + self.mlp_scale.to(dtype=x_out.dtype)[None, None, :] * self.mlp(self.mlp_norm(x_out) * self.ln_scale_factor, up_w, down_w) + if self.dtg_gate is not None: + gate = torch.sigmoid(self.dtg_gate(x_in.detach())) + x_out = x_in + gate * (x_out - x_in) + return x_out, raw_v + +def _fake_quantize(w: Tensor, bits: int = 6) -> Tensor: + clip_range = (1 << (bits - 1)) - 1 + w32 = w.float() + if w32.ndim >= 2: + row_max = w32.abs().amax(dim=-1) + scale = (row_max / clip_range).clamp_min(1.0 / clip_range) + dims = (slice(None),) * (w32.ndim - 1) + (None,) + w_q = (torch.clamp(torch.round(w32 / scale[dims]), -clip_range, clip_range) * scale[dims]).to(w.dtype) + else: + amax = w32.abs().max() + scale = (amax / clip_range).clamp_min(1.0 / clip_range) + w_q = (torch.clamp(torch.round(w32 / scale), -clip_range, clip_range) * scale).to(w.dtype) + return w + (w_q - w).detach() + +class GPT(nn.Module): + def __init__( + self, + vocab_size: int, + num_layers: int, + model_dim: int, + 
num_heads: int, + num_kv_heads: int, + mlp_mult: int, + tie_embeddings: bool, + tied_embed_init_std: float, + logit_softcap: float, + rope_base: float, + qk_gain_init: float, + mtp_num_heads: int = 0, + mtp_loss_weight: float = 0.1, + bigram_vocab_size: int = 0, + bigram_dim: int = 128, + xsa_last_n: int = 0, + rope_dims: int = 0, + ln_scale: bool = False, + dtg: bool = False, + ve_enabled: bool = False, + ve_dim: int = 128, + ve_layers: str = "9,10", + gated_attention: bool = False, + value_residual: bool = False, + core_start: int = 3, + core_end: int = 8, + num_passes: int = 1, + core_quant_bits: int = 6, + core_quant_enabled: bool = False, + residual_scale: nn.Module | None = None, + interpass_rmsnorm: bool = True, + ): + super().__init__() + self._ve_target_dim = num_kv_heads * (model_dim // num_heads) + if logit_softcap <= 0.0: + raise ValueError(f"logit_softcap must be positive, got {logit_softcap}") + self.tie_embeddings = tie_embeddings + self.tied_embed_init_std = tied_embed_init_std + self.logit_softcap = logit_softcap + self.value_residual = value_residual + self.mtp_num_heads = mtp_num_heads + self.mtp_loss_weight = mtp_loss_weight + self.core_start = core_start + self.core_end = min(core_end, num_layers) + self.interpass_rmsnorm = interpass_rmsnorm + self.num_passes = num_passes + self.core_quant_bits = core_quant_bits + self.core_quant_enabled = core_quant_enabled + self.num_stem = core_start + self.num_core = self.core_end - core_start + self.num_tail = num_layers - self.core_end + self.residual_scale = residual_scale + self.tok_emb = nn.Embedding(vocab_size, model_dim) + self.bigram = BigramHashEmbedding(bigram_vocab_size, bigram_dim, model_dim) if bigram_vocab_size > 0 else None + self.smear = SmearGate(model_dim) + self.num_skip_weights = min(self.num_stem, self.num_tail) + self.skip_weights = nn.Parameter(torch.ones(self.num_skip_weights, model_dim, dtype=torch.float32)) + # Parameter banks: contiguous 3D tensors for batched optimizer + head_dim = model_dim // num_heads + kv_dim = num_kv_heads * head_dim + mlp_dim = int(mlp_mult * model_dim) + self.num_layers = num_layers + self.qo_bank = nn.Parameter(torch.empty(2 * num_layers, model_dim, model_dim)) + self.kv_bank = nn.Parameter(torch.empty(2 * num_layers, kv_dim, model_dim)) + self.mlp_up_bank = nn.Parameter(torch.empty(num_layers, mlp_dim, model_dim)) + self.mlp_down_bank = nn.Parameter(torch.empty(num_layers, model_dim, mlp_dim)) + self.blocks = nn.ModuleList( + [ + Block( + model_dim, + num_heads, + num_kv_heads, + mlp_mult, + rope_base, + qk_gain_init, + layer_idx=i, + ln_scale=ln_scale, + dtg=dtg, + gated_attention=gated_attention, + value_residual=value_residual, + ) + for i in range(num_layers) + ] + ) + if rope_dims > 0: + head_dim = model_dim // num_heads + for block in self.blocks: + block.attn.rope_dims = rope_dims + block.attn.rotary = Rotary(head_dim, base=rope_base, train_seq_len=1024, rope_dims=rope_dims) + self.ve_layer_indices = [int(x) for x in ve_layers.split(",") if x.strip()] if ve_enabled else [] + kv_dim_ve = self._ve_target_dim + if self.ve_layer_indices: + self.ve_shared = ValueEmbedding(vocab_size, ve_dim, kv_dim_ve) + self.ve_layer_scales = nn.ParameterList( + [nn.Parameter(torch.ones(1, dtype=torch.float32)) for _ in self.ve_layer_indices] + ) + else: + self.ve_shared = None + self.ve_layer_scales = nn.ParameterList() + self.value_embeds = nn.ModuleList() # keep empty for compat + self.final_norm = RMSNorm() + self.lm_head = None if tie_embeddings else CastedLinear(model_dim, vocab_size, 
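+            # untied output head, built only when tie_embeddings=False; it is
+            # zero-initialized via the _zero_init flag consumed in _init_weights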
bias=False) + if self.lm_head is not None: + self.lm_head._zero_init = True + self.mtp_heads = nn.ModuleList( + [CastedLinear(model_dim, vocab_size, bias=False) for _ in range(mtp_num_heads)] + ) + for head in self.mtp_heads: + head._zero_init = True + if xsa_last_n > 0: + for i in range(max(0, num_layers - xsa_last_n), num_layers): + if i < core_start or i >= self.core_end: + self.blocks[i].attn.use_xsa = True + self._init_weights() + def _init_weights(self) -> None: + if self.tie_embeddings: + nn.init.normal_(self.tok_emb.weight, mean=0.0, std=self.tied_embed_init_std) + n = self.num_layers + proj_scale = 1.0 / math.sqrt(2 * n) + # Init banks: orthogonal, with proj layers scaled down and out/down zero-init + for i in range(n): + nn.init.orthogonal_(self.qo_bank.data[i], gain=1.0) # Q + nn.init.zeros_(self.qo_bank.data[n + i]) # Out (zero init) + nn.init.orthogonal_(self.kv_bank.data[i], gain=1.0) # K + nn.init.orthogonal_(self.kv_bank.data[n + i], gain=1.0) # V + nn.init.orthogonal_(self.mlp_up_bank.data[i], gain=1.0) # MLP up + nn.init.zeros_(self.mlp_down_bank.data[i]) # MLP down (zero init) + # Scale proj layers (out_proj and mlp_down are "proj" layers) + self.qo_bank.data[n + i].mul_(proj_scale) + self.mlp_down_bank.data[i].mul_(proj_scale) + # Init remaining nn.Linear modules (bigram proj, mtp heads, lm_head) + for name, module in self.named_modules(): + if isinstance(module, nn.Linear): + if getattr(module, "_zero_init", False): + nn.init.zeros_(module.weight) + elif module.weight.ndim == 2 and module.weight.shape[0] >= 64 and module.weight.shape[1] >= 64: + nn.init.orthogonal_(module.weight, gain=1.0) + def _get_ve(self, layer_idx: int, input_ids: Tensor, ve_cache: dict | None = None) -> Tensor | None: + """Get value embedding for a specific layer using shared table + per-layer scale.""" + if self.ve_shared is None or layer_idx not in self.ve_layer_indices: + return None + if ve_cache is not None and 've' not in ve_cache: + ve_cache['ve'] = self.ve_shared(input_ids) + ve_base = ve_cache['ve'] if ve_cache is not None else self.ve_shared(input_ids) + ve_idx = self.ve_layer_indices.index(layer_idx) + return ve_base * self.ve_layer_scales[ve_idx].to(dtype=ve_base.dtype) + def _get_bank_weights(self, bi: int) -> tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]: + n = self.num_layers + q_w = self.qo_bank[bi] + out_w = self.qo_bank[n + bi] + k_w = self.kv_bank[bi] + v_w = self.kv_bank[n + bi] + up_w = self.mlp_up_bank[bi] + down_w = self.mlp_down_bank[bi] + if self.core_quant_enabled and self.training and self.core_start <= bi < self.core_end: + q_w = _fake_quantize(q_w, self.core_quant_bits) + out_w = _fake_quantize(out_w, self.core_quant_bits) + k_w = _fake_quantize(k_w, self.core_quant_bits) + v_w = _fake_quantize(v_w, self.core_quant_bits) + up_w = _fake_quantize(up_w, self.core_quant_bits) + down_w = _fake_quantize(down_w, self.core_quant_bits) + return q_w, k_w, v_w, out_w, up_w, down_w + + def _forward_hidden(self, input_ids: Tensor, feedback_fn=None, + stabilizer=None) -> tuple[Tensor, Tensor, Tensor]: + n = self.num_layers + x = self.tok_emb(input_ids) + if self.bigram is not None: + x = x + self.bigram(input_ids) + x = F.rms_norm(x, (x.size(-1),)) + x = self.smear(x) + x0 = x + v0 = None + skips: list[Tensor] = [] + ve_cache: dict = {} + # --- STEM --- + for i in range(self.core_start): + ve = self._get_ve(i, input_ids, ve_cache) + q_w, k_w, v_w, out_w, up_w, down_w = self._get_bank_weights(i) + x, raw_v = self.blocks[i](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, + 
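+                # v_embed is None unless layer i is in ve_layer_indices; v0 (the
+                # value-residual anchor) is captured just below from the first
+                # block that returns a raw value tensor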
v_embed=ve, v0=v0) + if v0 is None and raw_v is not None: + v0 = raw_v + skips.append(x) + # --- RECURRENT CORE (Fixes 1, 2, 5) --- + h_core_in = x + for k in range(self.num_passes): + if k > 0 and self.interpass_rmsnorm: + x = F.rms_norm(x, (x.size(-1),)) + if feedback_fn is not None: + x = x + feedback_fn(x, k) + if stabilizer is not None: + x = stabilizer.clip(x) + x_before_pass = x + for j in range(self.core_start, self.core_end): + h_prev = x + ve = self._get_ve(j, input_ids, ve_cache) + q_w, k_w, v_w, out_w, up_w, down_w = self._get_bank_weights(j) + x, raw_v = self.blocks[j](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, + v_embed=ve, v0=v0) + if v0 is None and raw_v is not None: + v0 = raw_v + if stabilizer is not None and self.training and not torch.compiler.is_compiling(): + stabilizer.record_pass(h_prev, x) + if self.residual_scale is not None and k > 0: + delta = x - x_before_pass + x = x_before_pass + self.residual_scale(delta, k) + h_core_out = x + # --- TAIL --- + for i in range(self.core_end, n): + ti = i - self.core_end + if ti < len(skips): + x = x + self.skip_weights[ti].to(dtype=x.dtype)[None, None, :] * skips.pop() + ve = self._get_ve(i, input_ids, ve_cache) + q_w, k_w, v_w, out_w, up_w, down_w = self._get_bank_weights(i) + x, _ = self.blocks[i](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, + v_embed=ve, v0=v0) + x = self.final_norm(x) + return x, h_core_in, h_core_out + + def forward(self, input_ids: Tensor, target_ids: Tensor, + feedback_fn=None, stabilizer=None) -> Tensor: + x, h_core_in, h_core_out = self._forward_hidden(input_ids, feedback_fn, stabilizer) + x_flat = x.reshape(-1, x.size(-1)) + targets = target_ids.reshape(-1) + if self.tie_embeddings: + logits_proj = F.linear(x_flat, self.tok_emb.weight) + else: + if self.lm_head is None: + raise RuntimeError("lm_head is required when tie_embeddings=False") + logits_proj = self.lm_head(x_flat) + logits = self.logit_softcap * torch.tanh(logits_proj / self.logit_softcap) + main_loss = F.cross_entropy(logits.float(), targets, reduction="mean") + if self.training and self.mtp_num_heads > 0 and self.mtp_loss_weight > 0.0: + _, seqlen, dim = x.shape + mtp_loss_sum = x.new_zeros(()) + mtp_loss_count = 0 + for k, mtp_head in enumerate(self.mtp_heads): + valid_t = seqlen - (k + 1) + if valid_t <= 0: + continue + mtp_hidden = x[:, :valid_t, :].reshape(-1, dim) + mtp_targets = target_ids[:, k + 1 :].reshape(-1) + mtp_logits_proj = mtp_head(mtp_hidden) + mtp_logits = self.logit_softcap * torch.tanh(mtp_logits_proj / self.logit_softcap) + mtp_loss_sum = mtp_loss_sum + F.cross_entropy(mtp_logits.float(), mtp_targets, reduction="mean") + mtp_loss_count += 1 + if mtp_loss_count > 0: + main_loss = main_loss + self.mtp_loss_weight * (mtp_loss_sum / mtp_loss_count) + if stabilizer is not None and stabilizer.jacobian_proxy_weight > 0: + main_loss = main_loss + stabilizer.jacobian_proxy_loss(h_core_in, h_core_out) + return main_loss + + def forward_logits(self, input_ids: Tensor, + feedback_fn=None, stabilizer=None) -> Tensor: + """Return logits (bsz, seq_len, vocab) without computing loss.""" + x, _, _ = self._forward_hidden(input_ids, feedback_fn, stabilizer) + if self.tie_embeddings: + logits_proj = F.linear(x, self.tok_emb.weight) + else: + logits_proj = self.lm_head(x) + return self.logit_softcap * torch.tanh(logits_proj / self.logit_softcap) + +# --- Sliding window evaluation --- + +def eval_val_sliding( + args: Hyperparameters, + base_model: nn.Module, + rank: int, + world_size: int, + device: torch.device, + val_tokens: Tensor, + 
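+    # the three *_lut tensors below map token ids to byte counts for BPB:
+    # base bytes per token, plus one extra byte when the target carries a
+    # leading space and the previous token is not a boundary token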
base_bytes_lut: Tensor, + has_leading_space_lut: Tensor, + is_boundary_token_lut: Tensor, + stride: int, + batch_seqs: int = 32, + eval_seq_len: int | None = None, +) -> tuple[float, float]: + """Sliding window evaluation: each token scored with maximum context.""" + seq_len = eval_seq_len or args.train_seq_len + total_tokens = val_tokens.numel() - 1 + window_starts = [ws for ws in range(0, total_tokens, stride) + if min(ws + seq_len, total_tokens) - ws >= 1] + total_windows = len(window_starts) + my_s = (total_windows * rank) // world_size + my_e = (total_windows * (rank + 1)) // world_size + my_windows = window_starts[my_s:my_e] + loss_sum = torch.zeros((), device=device, dtype=torch.float64) + token_count = torch.zeros((), device=device, dtype=torch.float64) + byte_count = torch.zeros((), device=device, dtype=torch.float64) + base_model.eval() + compiled_logits = torch.compile(base_model.forward_logits, dynamic=False, fullgraph=True) + with torch.inference_mode(): + for bi in range(0, len(my_windows), batch_seqs): + batch_ws = my_windows[bi:bi + batch_seqs] + bsz = len(batch_ws) + x_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device) + y_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device) + wlens: list[int] = [] + for i, ws in enumerate(batch_ws): + end = min(ws + seq_len, total_tokens) + wlen = end - ws + wlens.append(wlen) + chunk = val_tokens[ws:end + 1].to(dtype=torch.int64, device=device) + x_batch[i, :wlen] = chunk[:-1] + y_batch[i, :wlen] = chunk[1:] + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + logits = compiled_logits(x_batch) + nll = F.cross_entropy( + logits.reshape(-1, logits.size(-1)).float(), + y_batch.reshape(-1), + reduction="none", + ).reshape(bsz, seq_len) + for i, ws in enumerate(batch_ws): + wlen = wlens[i] + s = 0 if ws == 0 else max(wlen - stride, 0) + scored_nll = nll[i, s:wlen].to(torch.float64) + loss_sum += scored_nll.sum() + token_count += float(wlen - s) + tgt = y_batch[i, s:wlen] + prev = x_batch[i, s:wlen] + tb = base_bytes_lut[tgt].to(torch.float64) + tb += (has_leading_space_lut[tgt] & ~is_boundary_token_lut[prev]).to(torch.float64) + byte_count += tb.sum() + if dist.is_available() and dist.is_initialized(): + dist.all_reduce(loss_sum, op=dist.ReduceOp.SUM) + dist.all_reduce(token_count, op=dist.ReduceOp.SUM) + dist.all_reduce(byte_count, op=dist.ReduceOp.SUM) + val_loss = (loss_sum / token_count).item() + bits_per_token = val_loss / math.log(2.0) + tokens_per_byte = token_count.item() / byte_count.item() + base_model.train() + return val_loss, bits_per_token * tokens_per_byte + + +def eval_val_sliding_ttt( + args: Hyperparameters, base_model: nn.Module, rank: int, world_size: int, + device: torch.device, val_tokens: Tensor, base_bytes_lut: Tensor, + has_leading_space_lut: Tensor, is_boundary_token_lut: Tensor, + stride: int, batch_seqs: int = 32, log0=print, +) -> tuple[float, float]: + """Legal score-first TTT (PR #461 recipe): score each chunk with sliding windows, + then train on it. 
Every token scored BEFORE any update that could use it.""" + seq_len = args.train_seq_len + total_tokens = val_tokens.numel() - 1 + ttt_chunk = args.ttt_chunk_tokens + + # Pre-compute all window starts + window_starts = [ws for ws in range(0, total_tokens, stride) + if min(ws + seq_len, total_tokens) - ws >= stride or ws == 0] + + # Assign each window to a chunk based on the first token it scores + num_chunks = (total_tokens + ttt_chunk - 1) // ttt_chunk + chunk_windows: list[list[int]] = [[] for _ in range(num_chunks)] + for ws in window_starts: + end = min(ws + seq_len, total_tokens) + wlen = end - ws + s = 0 if ws == 0 else max(wlen - stride, 0) + scored_start = ws + s + ci = min(scored_start // ttt_chunk, num_chunks - 1) + chunk_windows[ci].append(ws) + + log0(f"ttt_sliding:start chunks={num_chunks} chunk_tokens={ttt_chunk} " + f"total_windows={len(window_starts)} stride={stride} " + f"ttt_lr={args.ttt_lr} ttt_epochs={args.ttt_epochs} " + f"freeze_blocks={args.ttt_freeze_blocks}") + + loss_sum = torch.zeros((), device=device, dtype=torch.float64) + token_count = torch.zeros((), device=device, dtype=torch.float64) + byte_count = torch.zeros((), device=device, dtype=torch.float64) + + # Freeze first N blocks + frozen_block_ids = set(range(min(args.ttt_freeze_blocks, len(base_model.blocks)))) + ttt_params = [] + for name, p in base_model.named_parameters(): + freeze = False + for bi in frozen_block_ids: + if f"blocks.{bi}." in name: + freeze = True + break + if freeze: + p.requires_grad_(False) + else: + p.requires_grad_(True) + ttt_params.append(p) + + log0(f"ttt_sliding:params unfrozen={sum(p.numel() for p in ttt_params)} " + f"frozen={sum(p.numel() for p in base_model.parameters() if not p.requires_grad)}") + + optimizer = torch.optim.SGD(ttt_params, lr=args.ttt_lr, momentum=args.ttt_momentum) + t0 = time.perf_counter() + + for ci in range(num_chunks): + windows = chunk_windows[ci] + if not windows: + continue + chunk_start = ci * ttt_chunk + chunk_end = min((ci + 1) * ttt_chunk, total_tokens) + + # --- Phase 1: SCORE this chunk's windows (inference_mode) --- + my_s = (len(windows) * rank) // world_size + my_e = (len(windows) * (rank + 1)) // world_size + my_windows = windows[my_s:my_e] + + base_model.eval() + with torch.inference_mode(): + for bi in range(0, len(my_windows), batch_seqs): + batch_ws = my_windows[bi:bi + batch_seqs] + bsz = len(batch_ws) + x_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device) + y_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device) + wlens: list[int] = [] + for i, ws in enumerate(batch_ws): + end = min(ws + seq_len, total_tokens) + wlen = end - ws + wlens.append(wlen) + chunk_tok = val_tokens[ws:end + 1].to(dtype=torch.int64, device=device) + x_batch[i, :wlen] = chunk_tok[:-1] + y_batch[i, :wlen] = chunk_tok[1:] + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + logits = base_model.forward_logits(x_batch) + nll = F.cross_entropy( + logits.reshape(-1, logits.size(-1)).float(), + y_batch.reshape(-1), reduction="none", + ).reshape(bsz, seq_len) + for i, ws in enumerate(batch_ws): + wlen = wlens[i] + s = 0 if ws == 0 else max(wlen - stride, 0) + scored_nll = nll[i, s:wlen].to(torch.float64) + loss_sum += scored_nll.sum() + token_count += float(wlen - s) + tgt, prev = y_batch[i, s:wlen], x_batch[i, s:wlen] + tb = base_bytes_lut[tgt].to(torch.float64) + tb += (has_leading_space_lut[tgt] & ~is_boundary_token_lut[prev]).to(torch.float64) + byte_count += tb.sum() + + # --- Phase 2: TRAIN on this chunk (already 
scored = legal) --- + is_last_chunk = (ci == num_chunks - 1) + if not is_last_chunk and args.ttt_epochs > 0: + base_model.train() + chunk_seqs = (chunk_end - chunk_start) // seq_len + if chunk_seqs > 0: + cos_lr = args.ttt_lr * 0.5 * (1.0 + math.cos(math.pi * ci / max(num_chunks - 1, 1))) + for pg in optimizer.param_groups: + pg['lr'] = cos_lr + my_seq_s = (chunk_seqs * rank) // world_size + my_seq_e = (chunk_seqs * (rank + 1)) // world_size + my_chunk_seqs = my_seq_e - my_seq_s + for _ep in range(args.ttt_epochs): + for bs in range(0, my_chunk_seqs, args.ttt_batch_seqs): + be = min(bs + args.ttt_batch_seqs, my_chunk_seqs) + actual_bs = my_seq_s + bs + start_tok = chunk_start + actual_bs * seq_len + end_tok = chunk_start + (my_seq_s + be) * seq_len + 1 + if end_tok > val_tokens.numel(): + continue + local = val_tokens[start_tok:end_tok].to(device=device, dtype=torch.int64) + x = local[:-1].reshape(-1, seq_len) + y = local[1:].reshape(-1, seq_len) + optimizer.zero_grad(set_to_none=True) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + loss = base_model(x, y) + loss.backward() + if world_size > 1: + for p in ttt_params: + if p.grad is not None: + dist.all_reduce(p.grad, op=dist.ReduceOp.AVG) + torch.nn.utils.clip_grad_norm_(ttt_params, args.ttt_grad_clip) + optimizer.step() + + if rank == 0 and (ci % 10 == 0 or ci == num_chunks - 1): + elapsed = time.perf_counter() - t0 + rl = loss_sum.item() / max(token_count.item(), 1) + rbpb = rl / math.log(2.0) * (token_count.item() / max(byte_count.item(), 1)) if token_count.item() > 0 else 0.0 + log0(f" ttt_chunk [{ci+1}/{num_chunks}] bpb={rbpb:.6f} time={elapsed:.1f}s") + + if dist.is_available() and dist.is_initialized(): + dist.all_reduce(loss_sum, op=dist.ReduceOp.SUM) + dist.all_reduce(token_count, op=dist.ReduceOp.SUM) + dist.all_reduce(byte_count, op=dist.ReduceOp.SUM) + + val_loss = (loss_sum / token_count).item() + val_bpb = val_loss / math.log(2.0) * (token_count.item() / byte_count.item()) + + for p in base_model.parameters(): + p.requires_grad_(True) + base_model.eval() + + log0(f"ttt_sliding:done val_loss={val_loss:.6f} val_bpb={val_bpb:.6f} " + f"elapsed={time.perf_counter() - t0:.1f}s") + return val_loss, val_bpb + + +# --- GPTQ-lite int6 quantization --- + +def _classify_param(name: str) -> str: + if "tok_emb" in name or "lm_head" in name: + return "embed" + if ".mlp." in name: + return "mlp" + if ".attn." in name or (".proj." in name and ".mlp." 
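+            # ".proj." outside ".mlp." matches the attention output projection
+            # in unbanked names (blocks.N.attn.proj.weight)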
not in name): + return "attn" + return "other" + +def _extract_layer_idx(name: str) -> int | None: + if not name.startswith("blocks."): + return None + parts = name.split(".") + if len(parts) >= 2 and parts[1].isdigit(): + return int(parts[1]) + return None +def quantize_int6_per_row(t: Tensor, clip_range: int = 31) -> tuple[Tensor, Tensor]: + t32 = t.float() + if t32.ndim == 2: + best_q, best_s, best_err = None, None, float('inf') + for pct in [0.9990, 0.9995, 0.9999, 0.99999, 1.0]: + if pct < 1.0: + row_clip = torch.quantile(t32.abs(), pct, dim=1) + else: + row_clip = t32.abs().amax(dim=1) + s = (row_clip / clip_range).clamp_min(1.0 / clip_range).to(torch.float16) + q = torch.clamp(torch.round(t32 / s.float()[:, None]), -clip_range, clip_range).to(torch.int8) + recon = q.float() * s.float()[:, None] + err = (t32 - recon).pow(2).mean().item() + if err < best_err: + best_q, best_s, best_err = q, s, err + return best_q, best_s + amax = t32.abs().max().item() + scale = torch.tensor(amax / clip_range if amax > 0 else 1.0, dtype=torch.float16) + q = torch.clamp(torch.round(t32 / scale.float()), -clip_range, clip_range).to(torch.int8) + return q, scale + +def _unbank_state_dict(sd: dict[str, Tensor], num_layers: int) -> dict[str, Tensor]: + """Convert 3D bank tensors into individual 2D tensors with standard names.""" + out: dict[str, Tensor] = {} + n = num_layers + for name, tensor in sd.items(): + if name == "qo_bank": + for i in range(n): + out[f"blocks.{i}.attn.c_q.weight"] = tensor[i] + out[f"blocks.{i}.attn.proj.weight"] = tensor[n + i] + elif name == "kv_bank": + for i in range(n): + out[f"blocks.{i}.attn.c_k.weight"] = tensor[i] + out[f"blocks.{i}.attn.c_v.weight"] = tensor[n + i] + elif name == "mlp_up_bank": + for i in range(n): + out[f"blocks.{i}.mlp.fc.weight"] = tensor[i] + elif name == "mlp_down_bank": + for i in range(n): + out[f"blocks.{i}.mlp.proj.weight"] = tensor[i] + else: + out[name] = tensor + return out + +def _rebank_state_dict(sd: dict[str, Tensor], num_layers: int, template_sd: dict[str, Tensor]) -> dict[str, Tensor]: + """Convert individual 2D tensors back into 3D bank tensors.""" + out: dict[str, Tensor] = {} + n = num_layers + # Reconstruct banks from individual weight keys + qo_slices = [None] * (2 * n) + kv_slices = [None] * (2 * n) + up_slices = [None] * n + down_slices = [None] * n + consumed = set() + for i in range(n): + qk = f"blocks.{i}.attn.c_q.weight" + if qk in sd: + qo_slices[i] = sd[qk] + consumed.add(qk) + ok = f"blocks.{i}.attn.proj.weight" + if ok in sd: + qo_slices[n + i] = sd[ok] + consumed.add(ok) + kk = f"blocks.{i}.attn.c_k.weight" + if kk in sd: + kv_slices[i] = sd[kk] + consumed.add(kk) + vk = f"blocks.{i}.attn.c_v.weight" + if vk in sd: + kv_slices[n + i] = sd[vk] + consumed.add(vk) + fk = f"blocks.{i}.mlp.fc.weight" + if fk in sd: + up_slices[i] = sd[fk] + consumed.add(fk) + dk = f"blocks.{i}.mlp.proj.weight" + if dk in sd: + down_slices[i] = sd[dk] + consumed.add(dk) + out["qo_bank"] = torch.stack(qo_slices).to(dtype=template_sd["qo_bank"].dtype) + out["kv_bank"] = torch.stack(kv_slices).to(dtype=template_sd["kv_bank"].dtype) + out["mlp_up_bank"] = torch.stack(up_slices).to(dtype=template_sd["mlp_up_bank"].dtype) + out["mlp_down_bank"] = torch.stack(down_slices).to(dtype=template_sd["mlp_down_bank"].dtype) + for name, tensor in sd.items(): + if name not in consumed: + out[name] = tensor + return out + +def mixed_quantize_int6(state_dict: dict[str, Tensor], int6_cats: set[str], + core_start: int = -1, core_end: int = -1): + num_layers_total = 
max( + (int(k.split(".")[1]) for k in state_dict if k.startswith("blocks.")), + default=0, + ) + 1 + late_k_layers = set(range(num_layers_total - 2, num_layers_total)) + result: dict[str, Tensor] = {} + meta: dict[str, object] = {} + for name, tensor in state_dict.items(): + t = tensor.detach().cpu().contiguous() + cat = _classify_param(name) + if not t.is_floating_point() or t.numel() <= 65536: + result[name] = t.to(torch.float16) if t.is_floating_point() else t + meta[name] = "passthrough" + continue + if any(p in name for p in CONTROL_TENSOR_NAME_PATTERNS): + result[name] = t.float() + meta[name] = "passthrough_ctrl" + continue + if cat in int6_cats and t.ndim >= 1: + q, s = quantize_int6_per_row(t) + result[name + ".q"] = q + result[name + ".scale"] = s + meta[name] = {"type": "int6"} + else: + q, s = quantize_float_tensor(t) + result[name + ".q"] = q + result[name + ".scale"] = s + meta[name] = {"type": "int8"} + return result, meta +def dequantize_mixed_int6(result: dict[str, Tensor], meta: dict[str, object], + template_sd: dict[str, Tensor]) -> dict[str, Tensor]: + out: dict[str, Tensor] = {} + for name, orig in template_sd.items(): + info = meta.get(name) + if info is None: + continue + orig_dtype = orig.dtype + if info in ("passthrough", "passthrough_ctrl", "passthrough_fp16"): + t = result[name] + if t.dtype == torch.float16 and orig_dtype in (torch.float32, torch.bfloat16): + t = t.to(orig_dtype) + out[name] = t + continue + q, s = result[name + ".q"], result[name + ".scale"] + if s.ndim > 0: + out[name] = (q.float() * s.float().view(q.shape[0], *([1] * (q.ndim - 1)))).to(orig_dtype) + else: + out[name] = (q.float() * float(s.item())).to(orig_dtype) + return out + +# --- Training --- + +def parse_args() -> argparse.Namespace: + parser = argparse.ArgumentParser(description="Recurrent SOTA with stabilization") + g = parser.add_argument_group("feedback") + g.add_argument("--feedback-rank", type=int, default=2) + g.add_argument("--feedback-mode", type=str, default="diagonal", + choices=["identity", "diagonal", "low_rank", "none"]) + g.add_argument("--per-pass-feedback", action="store_true") + g.add_argument("--affine-junction", action="store_true") + g = parser.add_argument_group("stability") + g.add_argument("--clip-hidden", action="store_true") + g.add_argument("--clip-value", type=float, default=10.0) + g.add_argument("--residual-scale-init", type=float, default=0.5) + g.add_argument("--jacobian-proxy-weight", type=float, default=0.01) + g.add_argument("--no-interpass-rmsnorm", action="store_true") + return parser.parse_args() + +def main() -> None: + cli = parse_args() + code = Path(__file__).read_text(encoding="utf-8") + args = Hyperparameters() + distributed = "RANK" in os.environ and "WORLD_SIZE" in os.environ + rank = int(os.environ.get("RANK", "0")) + world_size = int(os.environ.get("WORLD_SIZE", "1")) + local_rank = int(os.environ.get("LOCAL_RANK", "0")) + if world_size <= 0: + raise ValueError(f"WORLD_SIZE must be positive, got {world_size}") + if 8 % world_size != 0: + raise ValueError(f"WORLD_SIZE={world_size} must divide 8 so grad_accum_steps stays integral") + grad_accum_steps = 8 // world_size + grad_scale = 1.0 / grad_accum_steps + if not torch.cuda.is_available(): + raise RuntimeError("CUDA is required") + device = torch.device("cuda", local_rank) + torch.cuda.set_device(device) + if distributed: + dist.init_process_group(backend="nccl", device_id=device) + dist.barrier() + master_process = rank == 0 + torch.backends.cuda.matmul.allow_tf32 = True + 
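+    # TF32 trades a few mantissa bits for tensor-core matmul throughput; the
+    # matmul and cuDNN paths are toggled independently, so both are set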
torch.backends.cudnn.allow_tf32 = True + from torch.backends.cuda import enable_cudnn_sdp, enable_flash_sdp, enable_math_sdp, enable_mem_efficient_sdp + enable_cudnn_sdp(False) + enable_flash_sdp(True) + enable_mem_efficient_sdp(False) + enable_math_sdp(False) + logfile = None + if master_process: + os.makedirs("logs", exist_ok=True) + logfile = f"logs/{args.run_id}.txt" + print(logfile) + def log0(msg: str, console: bool = True) -> None: + if not master_process: + return + if console: + print(msg) + if logfile is not None: + with open(logfile, "a", encoding="utf-8") as f: + print(msg, file=f) + log0(code, console=False) + log0("=" * 100, console=False) + log0(f"Running Python {sys.version}", console=False) + log0(f"Running PyTorch {torch.__version__}", console=False) + log0( + subprocess.run(["nvidia-smi"], stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, check=False).stdout, + console=False, + ) + log0("=" * 100, console=False) + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + torch.cuda.manual_seed_all(args.seed) + if not args.tokenizer_path.endswith(".model"): + raise ValueError(f"Script only setup for SentencePiece .model file: {args.tokenizer_path}") + sp = spm.SentencePieceProcessor(model_file=args.tokenizer_path) + if int(sp.vocab_size()) != args.vocab_size: + raise ValueError( + f"VOCAB_SIZE={args.vocab_size} does not match tokenizer vocab_size={int(sp.vocab_size())}" + ) + dataset_dir = Path(args.data_path).resolve() + actual_train_files = len(list(dataset_dir.glob("fineweb_train_*.bin"))) + effective_eval_seq_len = args.eval_seq_len if args.eval_seq_len > 0 else args.train_seq_len + val_seq_len = max(args.train_seq_len, effective_eval_seq_len) + val_tokens = load_validation_tokens(args.val_files, val_seq_len) + base_bytes_lut, has_leading_space_lut, is_boundary_token_lut = build_sentencepiece_luts( + sp, args.vocab_size, device + ) + log0(f"val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path={args.tokenizer_path}") + log0(f"train_loader:dataset:{dataset_dir.name} train_shards:{actual_train_files}") + log0(f"val_loader:shards pattern={args.val_files} tokens:{val_tokens.numel() - 1}") + CastedLinear._qat_enabled = args.qat_enabled + base_model = GPT( + vocab_size=args.vocab_size, + num_layers=args.num_layers, + model_dim=args.model_dim, + num_heads=args.num_heads, + num_kv_heads=args.num_kv_heads, + mlp_mult=args.mlp_mult, + tie_embeddings=args.tie_embeddings, + tied_embed_init_std=args.tied_embed_init_std, + logit_softcap=args.logit_softcap, + rope_base=args.rope_base, + qk_gain_init=args.qk_gain_init, + mtp_num_heads=args.mtp_num_heads, + mtp_loss_weight=args.mtp_loss_weight, + bigram_vocab_size=args.bigram_vocab_size, + bigram_dim=args.bigram_dim, + xsa_last_n=args.xsa_last_n, + rope_dims=args.rope_dims, + ln_scale=args.ln_scale, + dtg=args.dtg_enabled, + ve_enabled=args.ve_enabled, + ve_dim=args.ve_dim, + ve_layers=args.ve_layers, + gated_attention=args.gated_attention, + value_residual=args.value_residual, + core_start=args.core_start, + core_end=args.core_end, + num_passes=args.num_passes, + core_quant_bits=args.core_quant_bits, + core_quant_enabled=args.core_quant_enabled, + residual_scale=None, + interpass_rmsnorm=not cli.no_interpass_rmsnorm, + ).to(device).bfloat16() + # Banks stay FP32 (like CastedLinear weights), cast to BF16 in forward + base_model.qo_bank.data = base_model.qo_bank.data.float() + base_model.kv_bank.data = base_model.kv_bank.data.float() + base_model.mlp_up_bank.data = 
base_model.mlp_up_bank.data.float() + base_model.mlp_down_bank.data = base_model.mlp_down_bank.data.float() + for module in base_model.modules(): + if isinstance(module, CastedLinear): + module.float() + restore_low_dim_params_to_fp32(base_model) + # --- feedback / stabilizer --- + feedback = None + feedback_fn = None + stabilizer = None + residual_scale = None + extra_scalar_params: list[nn.Parameter] = [] + # Parse progressive passes schedule + passes_schedule: list[tuple[int, int]] = [] + if args.passes_schedule_str: + for entry in args.passes_schedule_str.split(","): + s, p = entry.strip().split(":") + passes_schedule.append((int(s), int(p))) + passes_schedule.sort(key=lambda x: x[0]) + max_passes = max((p for _, p in passes_schedule), default=args.num_passes) + max_passes = max(max_passes, args.eval_passes if args.eval_passes > 0 else args.num_passes) + needs_recurrence = max_passes > 1 + if cli.feedback_mode != "none" and needs_recurrence: + feedback = ErrorFeedbackModule( + dim=args.model_dim, rank=cli.feedback_rank, + feedback_mode=cli.feedback_mode, + per_pass=cli.per_pass_feedback, + num_passes=max_passes, + affine_junction=cli.affine_junction, + ).to(device).bfloat16() + restore_low_dim_params_to_fp32(feedback) + extra_scalar_params.extend(feedback.parameters()) + def feedback_fn(h, pass_idx): + return feedback(h, pass_idx) + log0(f"feedback: mode={cli.feedback_mode} rank={cli.feedback_rank} " + f"per_pass={cli.per_pass_feedback} params={sum(p.numel() for p in feedback.parameters())}") + if needs_recurrence: + stabilizer = RecurrentStabilizer( + clip_hidden=cli.clip_hidden, clip_value=cli.clip_value, + jacobian_proxy_weight=cli.jacobian_proxy_weight) + if cli.residual_scale_init != 1.0: + residual_scale = ResidualScale(max_passes, cli.residual_scale_init).to(device) + base_model.residual_scale = residual_scale + extra_scalar_params.extend(residual_scale.parameters()) + sched_str = f" schedule={passes_schedule}" if passes_schedule else "" + log0(f"recurrence: core_start={args.core_start} core_end={args.core_end} " + f"num_passes={args.num_passes} max_passes={max_passes} stem={base_model.num_stem} " + f"core={base_model.num_core} tail={base_model.num_tail}{sched_str}") + + # No DDP -- Parallel Muon handles bank grad communication via reduce-scatter, + # and non-bank grads are manually all-reduced before Adam steps. 
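+    # Overlap sketch (implemented as the 3-phase step in the training loop):
+    #   phase 1 -- Muon launches async reduce-scatters on the 4 banks;
+    #   phase 2 -- replicated grads are all-reduced and the Adam optimizers
+    #              step while those collectives are still in flight;
+    #   phase 3 -- Muon waits on the scatters, runs Newton-Schulz on its
+    #              shard, and all-gathers the updated banks.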
+ compiled_model = torch.compile(base_model, dynamic=False, fullgraph=True) + model = compiled_model + + # Optimizer split: + # - 4 parameter banks -> Muon (batched Newton-Schulz) + # - token embedding -> Adam + # - scalars/control tensors -> Adam + # - bigram proj, mtp heads, VE proj -> Adam (small matrix params not worth banking) + matrix_params = [ + base_model.qo_bank, base_model.kv_bank, + base_model.mlp_up_bank, base_model.mlp_down_bank, + ] + block_named_params = list(base_model.blocks.named_parameters()) + scalar_params = [ + p + for name, p in block_named_params + if p.ndim < 2 or any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS) + ] + if base_model.skip_weights.numel() > 0: + scalar_params.append(base_model.skip_weights) + scalar_params.append(base_model.smear.gate) + if base_model.bigram is not None: + scalar_params.append(base_model.bigram.scale) + token_lr = args.tied_embed_lr if args.tie_embeddings else args.embed_lr + tok_params = [{"params": [base_model.tok_emb.weight], "lr": token_lr, "base_lr": token_lr}] + if base_model.bigram is not None: + tok_params.append({"params": [base_model.bigram.embed.weight], "lr": token_lr, "base_lr": token_lr}) + if base_model.bigram.proj is not None: + scalar_params.append(base_model.bigram.proj.weight) + if base_model.ve_shared is not None: + tok_params.append({"params": [base_model.ve_shared.embed.weight], "lr": token_lr, "base_lr": token_lr}) + if base_model.ve_shared.proj is not None: + scalar_params.append(base_model.ve_shared.proj.weight) + scalar_params.append(base_model.ve_shared.scale) + for s in base_model.ve_layer_scales: + scalar_params.append(s) + optimizer_tok = torch.optim.AdamW( + tok_params, + betas=(args.beta1, args.beta2), + eps=args.adam_eps, + weight_decay=args.adam_wd, + fused=True, + ) + optimizer_muon = Muon( + matrix_params, + lr=args.matrix_lr, + momentum=args.muon_momentum, + backend_steps=args.muon_backend_steps, + weight_decay=args.muon_wd, + ) + for group in optimizer_muon.param_groups: + group["base_lr"] = args.matrix_lr + scalar_params.extend(extra_scalar_params) + optimizer_scalar = torch.optim.AdamW( + [{"params": scalar_params, "lr": args.scalar_lr, "base_lr": args.scalar_lr}], + betas=(args.beta1, args.beta2), + eps=args.adam_eps, + weight_decay=args.adam_wd, + fused=True, + ) + # Non-bank params that need manual all-reduce (replicated across GPUs) + replicated_params = list(optimizer_tok.param_groups[0]["params"]) + for pg in optimizer_tok.param_groups[1:]: + replicated_params.extend(pg["params"]) + replicated_params.extend(scalar_params) + + optimizer_head = None + if base_model.lm_head is not None: + optimizer_head = torch.optim.Adam( + [{"params": [base_model.lm_head.weight], "lr": args.head_lr, "base_lr": args.head_lr}], + betas=(args.beta1, args.beta2), + eps=args.adam_eps, + fused=True, + ) + replicated_params.append(base_model.lm_head.weight) + optimizers: list[torch.optim.Optimizer] = [optimizer_tok, optimizer_muon, optimizer_scalar] + if optimizer_head is not None: + optimizers.append(optimizer_head) + n_params = sum(p.numel() for p in base_model.parameters()) + mtp_params = sum(p.numel() for p in base_model.mtp_heads.parameters()) + log0(f"model_params:{n_params}") + log0(f"mtp_num_heads:{args.mtp_num_heads} mtp_loss_weight:{args.mtp_loss_weight} mtp_params:{mtp_params}") + xsa_layers = [i for i, b in enumerate(base_model.blocks) if b.attn.use_xsa] + log0(f"XSA:last_{args.xsa_last_n} active_layers:{xsa_layers}") + log0(f"world_size:{world_size} 
grad_accum_steps:{grad_accum_steps}") + log0("sdp_backends:cudnn=False flash=True mem_efficient=False math=False") + log0(f"attention_mode:gqa num_heads:{args.num_heads} num_kv_heads:{args.num_kv_heads}") + log0( + f"tie_embeddings:{args.tie_embeddings} embed_lr:{token_lr} " + f"head_lr:{args.head_lr if base_model.lm_head is not None else 0.0} " + f"matrix_lr:{args.matrix_lr} scalar_lr:{args.scalar_lr}" + ) + log0( + f"train_batch_tokens:{args.train_batch_tokens} train_seq_len:{args.train_seq_len} " + f"iterations:{args.iterations} warmup_steps:{args.warmup_steps} " + f"max_wallclock_seconds:{args.max_wallclock_seconds:.3f}" + ) + log0(f"seed:{args.seed}") + use_wandb = False + train_loader = DistributedTokenLoader(args.train_files, rank, world_size, device) + def zero_grad_all() -> None: + for opt in optimizers: + opt.zero_grad(set_to_none=True) + max_wallclock_ms = 1000.0 * args.max_wallclock_seconds if args.max_wallclock_seconds > 0 else None + def lr_mul(step: int, elapsed_ms: float) -> float: + if args.warmdown_iters <= 0: + return 1.0 + if max_wallclock_ms is None: + warmdown_start = max(args.iterations - args.warmdown_iters, 0) + return max((args.iterations - step) / max(args.warmdown_iters, 1), 0.0) if warmdown_start <= step < args.iterations else 1.0 + step_ms = elapsed_ms / max(step, 1) + warmdown_ms = args.warmdown_iters * step_ms + remaining_ms = max(max_wallclock_ms - elapsed_ms, 0.0) + return remaining_ms / max(warmdown_ms, 1e-9) if remaining_ms <= warmdown_ms else 1.0 + if args.warmup_steps > 0: + initial_model_state = {name: tensor.detach().cpu().clone() for name, tensor in base_model.state_dict().items()} + initial_optimizer_states = [copy.deepcopy(opt.state_dict()) for opt in optimizers] + _precompile_passes = sorted(set(p for _, p in passes_schedule) - {args.num_passes}) if passes_schedule else [] + _qat_precompile_passes = _precompile_passes[-2:] if len(_precompile_passes) >= 2 else _precompile_passes[:] + _total_precompile = len(_precompile_passes) + len(_qat_precompile_passes) + _precompile_start = args.warmup_steps - _total_precompile + model.train() + for warmup_step in range(args.warmup_steps): + if warmup_step >= _precompile_start: + _pc_idx = warmup_step - _precompile_start + if _pc_idx < len(_precompile_passes): + base_model.num_passes = _precompile_passes[_pc_idx] + CastedLinear._qat_enabled = False + base_model.core_quant_enabled = False + else: + _qat_idx = _pc_idx - len(_precompile_passes) + base_model.num_passes = _qat_precompile_passes[_qat_idx] + CastedLinear._qat_enabled = True + base_model.core_quant_enabled = True + zero_grad_all() + for micro_step in range(grad_accum_steps): + x, y = train_loader.next_batch(args.train_batch_tokens, args.train_seq_len, grad_accum_steps) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer) + (warmup_loss * grad_scale).backward() + if distributed: + for p in base_model.parameters(): + if p.grad is not None: + dist.all_reduce(p.grad, op=dist.ReduceOp.AVG) + if feedback is not None: + for p in feedback.parameters(): + if p.grad is not None: + dist.all_reduce(p.grad, op=dist.ReduceOp.AVG) + for opt in optimizers: + opt.step() + zero_grad_all() + if args.warmup_steps <= 20 or (warmup_step + 1) % 10 == 0 or warmup_step + 1 == args.warmup_steps: + log0(f"warmup_step:{warmup_step + 1}/{args.warmup_steps}") + base_model.num_passes = args.num_passes + CastedLinear._qat_enabled = args.qat_enabled + base_model.core_quant_enabled = 
args.core_quant_enabled + if stabilizer is not None: + stabilizer.reset() + base_model.load_state_dict(initial_model_state, strict=True) + for opt, state in zip(optimizers, initial_optimizer_states, strict=True): + opt.load_state_dict(state) + zero_grad_all() + train_loader = DistributedTokenLoader(args.train_files, rank, world_size, device) + swa_state: dict[str, Tensor] | None = None + swa_count = 0 + from collections import deque + lawa_queue: deque[dict[str, Tensor]] = deque(maxlen=args.lawa_k) + _all_state = dict(base_model.state_dict()) + if feedback is not None: + for k, v in feedback.state_dict().items(): + _all_state[f"_fb.{k}"] = v + ema_state = {name: t.detach().float().clone() for name, t in _all_state.items()} + ema_decay = 0.997 + training_time_ms = 0.0 + stop_after_step: int | None = None + torch.cuda.synchronize() + t0 = time.perf_counter() + step = 0 + while True: + last_step = step == args.iterations or (stop_after_step is not None and step >= stop_after_step) + should_validate = last_step or (args.val_loss_every > 0 and step % args.val_loss_every == 0) + if should_validate: + torch.cuda.synchronize() + training_time_ms += 1000.0 * (time.perf_counter() - t0) + val_loss, val_bpb = eval_val( + args, + model, + rank, + world_size, + device, + grad_accum_steps, + val_tokens, + base_bytes_lut, + has_leading_space_lut, + is_boundary_token_lut, + ) + diag_str = "" + if stabilizer is not None and stabilizer.diagnostics.h_norms: + hn = [f"{v:.1f}" for v in stabilizer.diagnostics.h_norms[-args.num_passes*base_model.num_core:]] + gr = [f"{v:.3f}" for v in stabilizer.diagnostics.growth_ratios[-args.num_passes*base_model.num_core:]] + diag_str = f" h_norms={hn} growth={gr}" + stabilizer.reset() + log0( + f"step:{step}/{args.iterations} val_loss:{val_loss:.4f} val_bpb:{val_bpb:.4f} " + f"train_time:{training_time_ms:.0f}ms step_avg:{training_time_ms / max(step, 1):.2f}ms" + f"{diag_str}" + ) + if use_wandb: + wb_data = {"val_loss": val_loss, "val_bpb": val_bpb} + if stabilizer is not None and stabilizer.diagnostics.growth_ratios: + wb_data["max_growth"] = max(stabilizer.diagnostics.growth_ratios) + wb_data["mean_growth"] = sum(stabilizer.diagnostics.growth_ratios) / len(stabilizer.diagnostics.growth_ratios) + _wandb.log(wb_data, step=step) + torch.cuda.synchronize() + t0 = time.perf_counter() + if last_step: + if stop_after_step is not None and step < args.iterations: + log0( + f"stopping_early: wallclock_cap train_time:{training_time_ms:.0f}ms " + f"step:{step}/{args.iterations}" + ) + break + elapsed_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0) + scale = lr_mul(step, elapsed_ms) + if passes_schedule: + target_passes = args.num_passes + for threshold_step, p in passes_schedule: + if step >= threshold_step: + target_passes = p + if target_passes != base_model.num_passes: + base_model.num_passes = target_passes + log0(f"progressive_passes: step:{step} num_passes:{target_passes}") + if args.late_qat_threshold > 0 and step > 100 and scale < args.late_qat_threshold and not CastedLinear._qat_enabled: + CastedLinear._qat_enabled = True + base_model.core_quant_enabled = True + log0(f"late_qat:enabled step:{step} scale:{scale:.4f} core_quant:on") + zero_grad_all() + train_loss = torch.zeros((), device=device) + for micro_step in range(grad_accum_steps): + x, y = train_loader.next_batch(args.train_batch_tokens, args.train_seq_len, grad_accum_steps) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + loss = model(x, y, feedback_fn=feedback_fn, 
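+                    # returns the scalar LM loss; feedback_fn applies the learned
+                    # inter-pass correction, and stabilizer optionally clips core
+                    # hidden states and adds the Jacobian-proxy penalty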
stabilizer=stabilizer) + train_loss += loss.detach() + (loss * grad_scale).backward() + train_loss /= grad_accum_steps + frac = min(step / args.muon_momentum_warmup_steps, 1.0) if args.muon_momentum_warmup_steps > 0 else 1.0 + muon_momentum = (1 - frac) * args.muon_momentum_warmup_start + frac * args.muon_momentum + for group in optimizer_muon.param_groups: + group["momentum"] = muon_momentum + for opt in optimizers: + for group in opt.param_groups: + group["lr"] = group["base_lr"] * scale + grad_norm = None + if args.grad_clip_norm > 0: + grad_norm = torch.nn.utils.clip_grad_norm_(base_model.parameters(), args.grad_clip_norm) + # === 3-phase overlapped optimizer step === + # Phase 1: Launch async reduce-scatter for banks (biggest first) + optimizer_muon.launch_reduce_scatters() + # Phase 2: All-reduce non-bank grads + step Adam (while bank RS is in-flight) + if distributed: + for p in replicated_params: + if p.grad is not None: + dist.all_reduce(p.grad, op=dist.ReduceOp.AVG) + optimizer_tok.step() + optimizer_scalar.step() + if optimizer_head is not None: + optimizer_head.step() + # Phase 3: Wait for RS, local NS5, all-gather (banks processed last) + optimizer_muon.step() + zero_grad_all() + # EMA update + with torch.no_grad(): + _cur = dict(base_model.state_dict()) + if feedback is not None: + for k, v in feedback.state_dict().items(): + _cur[f"_fb.{k}"] = v + for name, t in _cur.items(): + ema_state[name].mul_(ema_decay).add_(t.detach().float(), alpha=1.0 - ema_decay) + step += 1 + approx_training_time_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0) + if args.swa_enabled and scale < 0.2 and step % args.swa_every == 0: + if swa_state is None: + swa_state = {name: t.detach().cpu().clone() for name, t in base_model.state_dict().items()} + swa_count = 1 + log0(f"swa:start step:{step}") + else: + for name, t in base_model.state_dict().items(): + swa_state[name] += t.detach().cpu() + swa_count += 1 + if args.lawa_enabled and step % args.lawa_freq == 0: + lawa_queue.append({name: t.detach().cpu().clone() for name, t in base_model.state_dict().items()}) + should_log_train = ( + args.train_log_every > 0 + and (step <= 10 or step % args.train_log_every == 0 or stop_after_step is not None) + ) + if should_log_train: + tl = train_loss.item() + gn_str = f" grad_norm:{grad_norm:.4f}" if grad_norm is not None else "" + log0( + f"step:{step}/{args.iterations} train_loss:{tl:.4f}{gn_str} " + f"train_time:{approx_training_time_ms:.0f}ms step_avg:{approx_training_time_ms / step:.2f}ms" + ) + if use_wandb: + wlog = {"train_loss": tl, "step_avg_ms": approx_training_time_ms / step, "lr_scale": scale} + if grad_norm is not None: + wlog["grad_norm"] = float(grad_norm) + _wandb.log(wlog, step=step) + reached_cap = max_wallclock_ms is not None and approx_training_time_ms >= max_wallclock_ms + if distributed and max_wallclock_ms is not None: + reached_cap_tensor = torch.tensor(int(reached_cap), device=device) + dist.all_reduce(reached_cap_tensor, op=dist.ReduceOp.MAX) + reached_cap = bool(reached_cap_tensor.item()) + if stop_after_step is None and reached_cap: + stop_after_step = step + log0( + f"peak memory allocated: {torch.cuda.max_memory_allocated() // 1024 // 1024} MiB " + f"reserved: {torch.cuda.max_memory_reserved() // 1024 // 1024} MiB" + ) + # Apply weight averaging + if args.lawa_enabled and len(lawa_queue) > 1: + log0(f"lawa:applying LAWA averaging k={len(lawa_queue)}") + current_state = base_model.state_dict() + avg_state = {name: torch.zeros(t.shape, dtype=torch.float32, device='cpu') for 
name, t in current_state.items()} + for snap in lawa_queue: + for name in avg_state: + avg_state[name] += snap[name].float() + for name in avg_state: + avg_state[name] /= len(lawa_queue) + avg_state[name] = avg_state[name].to(dtype=current_state[name].dtype) + base_model.load_state_dict(avg_state, strict=True) + else: + log0("ema:applying EMA weights") + current_state = base_model.state_dict() + model_ema = {k: v for k, v in ema_state.items() if not k.startswith("_fb.")} + avg_state = {name: model_ema[name].to(dtype=current_state[name].dtype) for name in current_state} + base_model.load_state_dict(avg_state, strict=True) + if feedback is not None: + fb_ema = {k.removeprefix("_fb."): v for k, v in ema_state.items() if k.startswith("_fb.")} + fb_state = feedback.state_dict() + fb_avg = {k: fb_ema[k].to(dtype=fb_state[k].dtype) for k in fb_state} + feedback.load_state_dict(fb_avg, strict=True) + torch.cuda.synchronize() + t_diag = time.perf_counter() + diag_val_loss, diag_val_bpb = eval_val( + args, compiled_model, rank, world_size, device, grad_accum_steps, + val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut, + ) + torch.cuda.synchronize() + log0( + f"DIAGNOSTIC post_ema val_loss:{diag_val_loss:.4f} val_bpb:{diag_val_bpb:.4f} " + f"eval_time:{1000.0 * (time.perf_counter() - t_diag):.0f}ms" + ) + full_state_dict = base_model.state_dict() + export_sd = {k: v for k, v in full_state_dict.items() if "mtp_heads" not in k} + excluded_mtp = sum(int(t.numel()) for k, t in full_state_dict.items() if "mtp_heads" in k) + if excluded_mtp > 0: + log0(f"export_excluding_mtp_params:{excluded_mtp}") + if master_process: + torch.save(export_sd, "final_model.pt") + model_bytes = os.path.getsize("final_model.pt") + code_bytes = len(code.encode("utf-8")) + log0(f"Serialized model: {model_bytes} bytes") + log0(f"Code size: {code_bytes} bytes") + # Override passes for eval phase (train cheap, eval deep) + eval_num_passes = args.eval_passes if args.eval_passes > 0 else args.num_passes + if eval_num_passes != args.num_passes: + log0(f"eval_override: num_passes {args.num_passes} -> {eval_num_passes}") + base_model.num_passes = eval_num_passes + if base_model.residual_scale is not None: + old_s = base_model.residual_scale.scales.data + new_s = torch.full((eval_num_passes,), cli.residual_scale_init, + dtype=torch.float32, device=old_s.device) + copy_len = min(eval_num_passes, old_s.shape[0]) + new_s[:copy_len] = old_s[:copy_len] + base_model.residual_scale.scales = nn.Parameter(new_s) + export_sd = {k: v for k, v in base_model.state_dict().items() if "mtp_heads" not in k} + # Unbank 3D tensors into individual 2D tensors for quantization + sd_cpu = {k: v.detach().cpu() for k, v in export_sd.items()} + unbanked_sd = _unbank_state_dict(sd_cpu, args.num_layers) + quant_result, quant_meta = mixed_quantize_int6(unbanked_sd, {"mlp", "attn"}) + quant_buf = io.BytesIO() + torch.save({"w": quant_result, "m": quant_meta}, quant_buf) + quant_raw = quant_buf.getvalue() + quant_blob = lzma.compress(quant_raw, preset=6) + if master_process: + with open("final_model.int6.ptz", "wb") as f: + f.write(quant_blob) + quant_file_bytes = len(quant_blob) + code_bytes = len(code.encode("utf-8")) + log0(f"Serialized model int6+lzma: {quant_file_bytes} bytes") + log0(f"Total submission size int6+lzma: {quant_file_bytes + code_bytes} bytes") + if distributed: + dist.barrier() + with open("final_model.int6.ptz", "rb") as f: + quant_blob_disk = f.read() + quant_state = torch.load( + 
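+        # round-trip through the exact on-disk int6+lzma artifact, so the TTT
+        # eval below measures the model a submission would actually ship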
io.BytesIO(lzma.decompress(quant_blob_disk)), + map_location="cpu", + ) + deq_unbanked = dequantize_mixed_int6(quant_state["w"], quant_state["m"], unbanked_sd) + # Re-bank the dequantized tensors + deq_state = _rebank_state_dict(deq_unbanked, args.num_layers, sd_cpu) + eval_model = GPT( + vocab_size=args.vocab_size, num_layers=args.num_layers, model_dim=args.model_dim, + num_heads=args.num_heads, num_kv_heads=args.num_kv_heads, mlp_mult=args.mlp_mult, + tie_embeddings=args.tie_embeddings, tied_embed_init_std=args.tied_embed_init_std, + logit_softcap=args.logit_softcap, rope_base=args.rope_base, qk_gain_init=args.qk_gain_init, + mtp_num_heads=0, mtp_loss_weight=0.0, + bigram_vocab_size=args.bigram_vocab_size, bigram_dim=args.bigram_dim, + xsa_last_n=args.xsa_last_n, + rope_dims=args.rope_dims, ln_scale=args.ln_scale, dtg=args.dtg_enabled, + ve_enabled=args.ve_enabled, ve_dim=args.ve_dim, ve_layers=args.ve_layers, + gated_attention=args.gated_attention, value_residual=args.value_residual, + core_start=args.core_start, core_end=args.core_end, + num_passes=eval_num_passes, + interpass_rmsnorm=not cli.no_interpass_rmsnorm, + ).to(device).bfloat16() + if residual_scale is not None: + eval_rs = ResidualScale(eval_num_passes, cli.residual_scale_init).to(device) + eval_model.residual_scale = eval_rs + eval_model.qo_bank.data = eval_model.qo_bank.data.float() + eval_model.kv_bank.data = eval_model.kv_bank.data.float() + eval_model.mlp_up_bank.data = eval_model.mlp_up_bank.data.float() + eval_model.mlp_down_bank.data = eval_model.mlp_down_bank.data.float() + for m in eval_model.modules(): + if isinstance(m, CastedLinear): + m.float() + restore_low_dim_params_to_fp32(eval_model) + eval_model.load_state_dict(deq_state, strict=True) + # Legal score-first TTT (PR #461 recipe) -- skip intermediate evals to maximize TTT time budget + if args.ttt_enabled: + torch.cuda.synchronize() + t_ttt = time.perf_counter() + ttt_loss, ttt_bpb = eval_val_sliding_ttt( + args, eval_model, rank, world_size, device, + val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut, + stride=args.eval_stride, log0=log0, + ) + torch.cuda.synchronize() + log0(f"legal_ttt val_loss:{ttt_loss:.4f} val_bpb:{ttt_bpb:.4f} " + f"eval_time:{1000.0 * (time.perf_counter() - t_ttt):.0f}ms") + log0(f"legal_ttt_exact val_loss:{ttt_loss:.8f} val_bpb:{ttt_bpb:.8f}") + if use_wandb: + _wandb.finish() + if distributed: + dist.destroy_process_group() +if __name__ == "__main__": + main() From 36924c047a24f6f82f81fd2232004da3b6fb54bc Mon Sep 17 00:00:00 2001 From: nesta Date: Mon, 30 Mar 2026 19:14:31 +0000 Subject: [PATCH 15/23] yay? 
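
For quick reference, the bigram hash this patch re-enables can be exercised
standalone. The sketch below mirrors the `BigramHashEmbedding.bigram_hash`
logic from the diff; the multipliers 36313/27191 and the
`mod = bigram_vocab_size - 1` sentinel come from the patch, while the toy
inputs are illustrative only.

```python
import torch

def bigram_hash(tokens: torch.Tensor, bigram_vocab_size: int) -> torch.Tensor:
    # Position 0 gets the sentinel id `mod`; later positions mix the current
    # and previous token ids with two odd multipliers and XOR. int32 overflow
    # wraps, which is fine for hashing; `%` keeps indices in [0, mod).
    t = tokens.to(torch.int32)
    mod = bigram_vocab_size - 1
    out = torch.empty_like(t)
    out[..., 0] = mod
    out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod
    return out.long()

ids = torch.tensor([[5, 17, 17, 99]])
print(bigram_hash(ids, bigram_vocab_size=512))  # all ids fall in [0, 511]
```

With `BIGRAM_VOCAB_SIZE=512`, these hashed ids index a 512-row table whose
scaled output is added to the token embedding before the first RMSNorm.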
--- .../ARCHITECTURE.md | 22 ++ .../run_earlyqat.sh | 21 +- .../train_gpt.py | 236 +++++++++++++++++- .../files/config.yaml | 98 ++++++++ .../files/wandb-summary.json | 1 + 5 files changed, 355 insertions(+), 23 deletions(-) create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/ARCHITECTURE.md create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260329_204103-e5ujh1d7/files/config.yaml create mode 100644 records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260329_204103-e5ujh1d7/files/wandb-summary.json diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/ARCHITECTURE.md b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/ARCHITECTURE.md new file mode 100644 index 0000000000..1180221288 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/ARCHITECTURE.md @@ -0,0 +1,22 @@ +# Progressive Recurrence + +``` + ┌───────────┐ ┌───────────┐ ┌───────────┐ + │ │ │ │ │ │ + │ Head │ │ Head │ │ Head │ + │ [7-10] │ │ [7-10] │ │ [7-10] │ + │ │ │ │ │ │ + ├───────────┤ ├───────────┤╮ ├───────────┤╮ + │ │ 4500 steps │ ││ 1500 steps │ ││ + │ Core │ ───────────> │ Core ││ ──────────> │ Core ││ + │ [4-6] │ │ [4-6] │2x │ [4-6] │4x + │ │ │ ││ │ ││ + ├───────────┤ ├───────────┤╯ ├───────────┤╯ + │ │ │ │ │ │ + │Translation│ │Translation│ │Translation│ + │ [0-3] │ │ [0-3] │ │ [0-3] │ + │ │ │ │ │ │ + └───────────┘ └───────────┘ └───────────┘ + + 11 layers 14 layers 20 layers +``` diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_earlyqat.sh b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_earlyqat.sh index ff6b0847d7..b04c868a95 100755 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_earlyqat.sh +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_earlyqat.sh @@ -20,20 +20,21 @@ export NUM_LAYERS=11 export MODEL_DIM=512 export NUM_HEADS=8 export NUM_KV_HEADS=4 -export BIGRAM_VOCAB_SIZE=0 +export BIGRAM_VOCAB_SIZE=512 +export BIGRAM_DIM=32 export XSA_LAST_N=4 export ROPE_DIMS=16 export LN_SCALE=1 -export VE_ENABLED=0 +export VE_ENABLED=1 export VE_DIM=128 export VE_LAYERS="9,10" # --- Training schedule (progressive 1->4 passes, wallclock-capped at 600s on 8xH100) --- -export ITERATIONS=9000 -export MAX_WALLCLOCK_SECONDS=600 -export VAL_LOSS_EVERY=500 +export ITERATIONS=6500 +export MAX_WALLCLOCK_SECONDS=0 +export VAL_LOSS_EVERY=4000 export TRAIN_LOG_EVERY=50 -export WARMUP_STEPS=24 +export WARMUP_STEPS=20 export WARMDOWN_ITERS=3500 export TRAIN_BATCH_TOKENS=786432 export TRAIN_SEQ_LEN=2048 @@ -55,7 +56,7 @@ export GRAD_CLIP_NORM=0.3 # EARLY QAT: threshold 0.25 (vs 0.15 in winning config) to reduce weight entropy export SWA_ENABLED=1 export SWA_EVERY=50 -export LATE_QAT_THRESHOLD=0.30 +export LATE_QAT_THRESHOLD=0.15 # --- TTT (matches SOTA, freeze_blocks=0) --- export TTT_ENABLED=1 @@ -77,11 +78,11 @@ export CORE_QUANT_ENABLED=0 export PASSES_SCHEDULE="0:1,4500:2,5500:3,6000:4" export SEED=1337 -export RUN_ID="earlyqat_v3" +export RUN_ID="bigram_ve_wd3500" -torchrun --standalone --nproc_per_node=8 train_gpt.py \ +torchrun --standalone --nproc_per_node=1 train_gpt.py \ --feedback-mode diagonal --feedback-rank 2 \ --residual-scale-init 0.5 \ --jacobian-proxy-weight 0.1 \ --no-interpass-rmsnorm \ - 2>&1 | tee logs/earlyqat_v3.txt + 2>&1 | tee logs/bigram_ve_wd3500.txt diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt.py 
b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt.py index c7d002313d..42e5bdcda7 100644 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt.py +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt.py @@ -256,6 +256,11 @@ class Hyperparameters: core_quant_enabled = bool(int(os.environ.get("CORE_QUANT_ENABLED", "0"))) eval_passes = int(os.environ.get("EVAL_PASSES", 0)) passes_schedule_str = os.environ.get("PASSES_SCHEDULE", "") + bigram_vocab_size = int(os.environ.get("BIGRAM_VOCAB_SIZE", 0)) + bigram_dim = int(os.environ.get("BIGRAM_DIM", 32)) + ve_enabled = bool(int(os.environ.get("VE_ENABLED", "0"))) + ve_dim = int(os.environ.get("VE_DIM", 128)) + ve_layers = os.environ.get("VE_LAYERS", "9,10") def zeropower_via_newtonschulz5(G: Tensor, steps: int = 5, eps: float = 1e-7) -> Tensor: """""" a, b, c = (3.4445, -4.7750, 2.0315) @@ -485,7 +490,7 @@ def eval_val( pattern for pattern in os.environ.get( "CONTROL_TENSOR_NAME_PATTERNS", - "attn_scale,attn_scales,mlp_scale,mlp_scales,resid_mix,resid_mixes,q_gain,skip_weight,skip_weights,smear", + "attn_scale,attn_scales,mlp_scale,mlp_scales,resid_mix,resid_mixes,q_gain,skip_weight,skip_weights,smear,ve_layer_scales,ve_shared.scale", ).split(",") if pattern ) @@ -647,6 +652,44 @@ def next_batch(self, global_tokens: int, seq_len: int, grad_accum_steps: int) -> x = local[:-1].reshape(-1, seq_len) y = local[1:].reshape(-1, seq_len) return x.to(self.device, non_blocking=True), y.to(self.device, non_blocking=True) +class BigramHashEmbedding(nn.Module): + def __init__(self, bigram_vocab_size: int, bigram_dim: int, model_dim: int): + super().__init__() + self.bigram_vocab_size = bigram_vocab_size + self.embed = nn.Embedding(bigram_vocab_size, bigram_dim) + nn.init.zeros_(self.embed.weight) + self.proj = CastedLinear(bigram_dim, model_dim, bias=False) if bigram_dim != model_dim else None + if self.proj is not None: + nn.init.zeros_(self.proj.weight) + self.scale = nn.Parameter(torch.tensor(0.05, dtype=torch.float32)) + def bigram_hash(self, tokens: Tensor) -> Tensor: + t = tokens.to(torch.int32) + mod = self.bigram_vocab_size - 1 + out = torch.empty_like(t) + out[..., 0] = mod + out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod + return out.long() + def forward(self, token_ids: Tensor) -> Tensor: + h = self.embed(self.bigram_hash(token_ids)) + if self.proj is not None: + h = self.proj(h) + return h * self.scale.to(dtype=h.dtype) + +class ValueEmbedding(nn.Module): + def __init__(self, vocab_size: int, ve_dim: int, model_dim: int): + super().__init__() + self.embed = nn.Embedding(vocab_size, ve_dim) + nn.init.normal_(self.embed.weight, std=0.01) + self.proj = CastedLinear(ve_dim, model_dim, bias=False) if ve_dim != model_dim else None + if self.proj is not None: + nn.init.zeros_(self.proj.weight) + self.scale = nn.Parameter(torch.tensor(0.1, dtype=torch.float32)) + def forward(self, token_ids: Tensor) -> Tensor: + h = self.embed(token_ids) + if self.proj is not None: + h = self.proj(h) + return h * self.scale.to(dtype=h.dtype) + class RMSNorm(nn.Module): def __init__(self, eps: float | None = None): super().__init__() @@ -745,11 +788,13 @@ def _xsa_efficient(self, y: Tensor, v: Tensor) -> Tensor: vn = F.normalize(v, dim=-1).unsqueeze(-2) # [B, T, Hkv, 1, D] -- broadcast ready proj = (y_g * vn).sum(dim=-1, keepdim=True) * vn return (y_g - proj).reshape(B, T, H, D) - def forward(self, x: Tensor, q_w: Tensor, k_w: Tensor, v_w: Tensor, out_w: Tensor) -> tuple[Tensor, 
Tensor | None]: + def forward(self, x: Tensor, q_w: Tensor, k_w: Tensor, v_w: Tensor, out_w: Tensor, v_embed: Tensor | None = None) -> tuple[Tensor, Tensor | None]: bsz, seqlen, dim = x.shape q = F.linear(x, q_w.to(x.dtype)).reshape(bsz, seqlen, self.num_heads, self.head_dim) k = F.linear(x, k_w.to(x.dtype)).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim) v = F.linear(x, v_w.to(x.dtype)) + if v_embed is not None: + v = v + v_embed v = v.reshape(bsz, seqlen, self.num_kv_heads, self.head_dim) raw_v = None q = F.rms_norm(q, (q.size(-1),)) @@ -799,10 +844,10 @@ def __init__( self.mlp_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32)) self.resid_mix = nn.Parameter(torch.stack((torch.ones(dim), torch.zeros(dim))).float()) self.ln_scale_factor = 1.0 / math.sqrt(layer_idx + 1) if ln_scale else 1.0 - def forward(self, x: Tensor, x0: Tensor, q_w: Tensor, k_w: Tensor, v_w: Tensor, out_w: Tensor, up_w: Tensor, down_w: Tensor) -> tuple[Tensor, Tensor | None]: + def forward(self, x: Tensor, x0: Tensor, q_w: Tensor, k_w: Tensor, v_w: Tensor, out_w: Tensor, up_w: Tensor, down_w: Tensor, v_embed: Tensor | None = None) -> tuple[Tensor, Tensor | None]: mix = self.resid_mix.to(dtype=x.dtype) x_in = mix[0][None, None, :] * x + mix[1][None, None, :] * x0 - attn_out, _ = self.attn(self.attn_norm(x_in) * self.ln_scale_factor, q_w, k_w, v_w, out_w) + attn_out, _ = self.attn(self.attn_norm(x_in) * self.ln_scale_factor, q_w, k_w, v_w, out_w, v_embed=v_embed) x_out = x_in + self.attn_scale.to(dtype=x_in.dtype)[None, None, :] * attn_out x_out = x_out + self.mlp_scale.to(dtype=x_out.dtype)[None, None, :] * self.mlp(self.mlp_norm(x_out) * self.ln_scale_factor, up_w, down_w) return x_out, None @@ -843,6 +888,11 @@ def __init__( core_quant_enabled: bool = False, residual_scale: nn.Module | None = None, interpass_rmsnorm: bool = True, + bigram_vocab_size: int = 0, + bigram_dim: int = 32, + ve_enabled: bool = False, + ve_dim: int = 128, + ve_layers: str = "9,10", ): super().__init__() self._ve_target_dim = num_kv_heads * (model_dim // num_heads) @@ -862,7 +912,7 @@ def __init__( self.num_tail = num_layers - self.core_end self.residual_scale = residual_scale self.tok_emb = nn.Embedding(vocab_size, model_dim) - self.bigram = None + self.bigram = BigramHashEmbedding(bigram_vocab_size, bigram_dim, model_dim) if bigram_vocab_size > 0 else None self.smear = SmearGate(model_dim) self.num_skip_weights = min(self.num_stem, self.num_tail) self.skip_weights = nn.Parameter(torch.ones(self.num_skip_weights, model_dim, dtype=torch.float32)) @@ -894,9 +944,16 @@ def __init__( for block in self.blocks: block.attn.rope_dims = rope_dims block.attn.rotary = Rotary(head_dim, base=rope_base, train_seq_len=1024, rope_dims=rope_dims) - self.ve_shared = None - self.ve_layer_indices = [] - self.ve_layer_scales = nn.ParameterList() + self.ve_layer_indices = [int(x) for x in ve_layers.split(",") if x.strip()] if ve_enabled else [] + kv_dim_ve = self._ve_target_dim + if self.ve_layer_indices: + self.ve_shared = ValueEmbedding(vocab_size, ve_dim, kv_dim_ve) + self.ve_layer_scales = nn.ParameterList( + [nn.Parameter(torch.ones(1, dtype=torch.float32)) for _ in self.ve_layer_indices] + ) + else: + self.ve_shared = None + self.ve_layer_scales = nn.ParameterList() self.value_embeds = nn.ModuleList() self.final_norm = RMSNorm() self.lm_head = None if tie_embeddings else CastedLinear(model_dim, vocab_size, bias=False) @@ -928,6 +985,14 @@ def _init_weights(self) -> None: nn.init.zeros_(module.weight) elif module.weight.ndim == 2 and 
module.weight.shape[0] >= 64 and module.weight.shape[1] >= 64: nn.init.orthogonal_(module.weight, gain=1.0) + def _get_ve(self, layer_idx: int, input_ids: Tensor, ve_cache: dict | None = None) -> Tensor | None: + if self.ve_shared is None or layer_idx not in self.ve_layer_indices: + return None + if ve_cache is not None and 've' not in ve_cache: + ve_cache['ve'] = self.ve_shared(input_ids) + ve_base = ve_cache['ve'] if ve_cache is not None else self.ve_shared(input_ids) + ve_idx = self.ve_layer_indices.index(layer_idx) + return ve_base * self.ve_layer_scales[ve_idx].to(dtype=ve_base.dtype) def _get_bank_weights(self, bi: int) -> tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]: n = self.num_layers q_w = self.qo_bank[bi] @@ -948,13 +1013,17 @@ def _forward_hidden(self, input_ids: Tensor, feedback_fn=None, stabilizer=None) -> tuple[Tensor, Tensor, Tensor]: n = self.num_layers x = self.tok_emb(input_ids) + if self.bigram is not None: + x = x + self.bigram(input_ids) x = F.rms_norm(x, (x.size(-1),)) x = self.smear(x) x0 = x skips: list[Tensor] = [] + ve_cache: dict = {} for i in range(self.core_start): + ve = self._get_ve(i, input_ids, ve_cache) q_w, k_w, v_w, out_w, up_w, down_w = self._get_bank_weights(i) - x, _ = self.blocks[i](x, x0, q_w, k_w, v_w, out_w, up_w, down_w) + x, _ = self.blocks[i](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, v_embed=ve) skips.append(x) h_core_in = x for k in range(self.num_passes): @@ -967,8 +1036,9 @@ def _forward_hidden(self, input_ids: Tensor, feedback_fn=None, x_before_pass = x for j in range(self.core_start, self.core_end): h_prev = x + ve = self._get_ve(j, input_ids, ve_cache) q_w, k_w, v_w, out_w, up_w, down_w = self._get_bank_weights(j) - x, _ = self.blocks[j](x, x0, q_w, k_w, v_w, out_w, up_w, down_w) + x, _ = self.blocks[j](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, v_embed=ve) if stabilizer is not None and self.training and not torch.compiler.is_compiling(): stabilizer.record_pass(h_prev, x) if self.residual_scale is not None and k > 0: @@ -979,8 +1049,9 @@ def _forward_hidden(self, input_ids: Tensor, feedback_fn=None, ti = i - self.core_end if ti < len(skips): x = x + self.skip_weights[ti].to(dtype=x.dtype)[None, None, :] * skips.pop() + ve = self._get_ve(i, input_ids, ve_cache) q_w, k_w, v_w, out_w, up_w, down_w = self._get_bank_weights(i) - x, _ = self.blocks[i](x, x0, q_w, k_w, v_w, out_w, up_w, down_w) + x, _ = self.blocks[i](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, v_embed=ve) x = self.final_norm(x) return x, h_core_in, h_core_out def forward(self, input_ids: Tensor, target_ids: Tensor, @@ -1082,6 +1153,7 @@ def eval_val_sliding_ttt( device: torch.device, val_tokens: Tensor, base_bytes_lut: Tensor, has_leading_space_lut: Tensor, is_boundary_token_lut: Tensor, stride: int, batch_seqs: int = 32, log0=print, + feedback_fn=None, feedback_module: nn.Module | None = None, ) -> tuple[float, float]: """""" seq_len = args.train_seq_len @@ -1118,6 +1190,10 @@ def eval_val_sliding_ttt( else: p.requires_grad_(True) ttt_params.append(p) + if feedback_module is not None: + for p in feedback_module.parameters(): + p.requires_grad_(True) + ttt_params.append(p) log0(f"ttt_sliding:params unfrozen={sum(p.numel() for p in ttt_params)} " f"frozen={sum(p.numel() for p in base_model.parameters() if not p.requires_grad)}") optimizer = torch.optim.SGD(ttt_params, lr=args.ttt_lr, momentum=args.ttt_momentum) @@ -1147,7 +1223,7 @@ def eval_val_sliding_ttt( x_batch[i, :wlen] = chunk_tok[:-1] y_batch[i, :wlen] = chunk_tok[1:] with torch.autocast(device_type="cuda", 
dtype=torch.bfloat16): - logits = base_model.forward_logits(x_batch) + logits = base_model.forward_logits(x_batch, feedback_fn=feedback_fn) nll = F.cross_entropy( logits.reshape(-1, logits.size(-1)).float(), y_batch.reshape(-1), reduction="none", @@ -1186,7 +1262,7 @@ def eval_val_sliding_ttt( y = local[1:].reshape(-1, seq_len) optimizer.zero_grad(set_to_none=True) with torch.autocast(device_type="cuda", dtype=torch.bfloat16): - loss = base_model(x, y) + loss = base_model(x, y, feedback_fn=feedback_fn) loss.backward() if world_size > 1: for p in ttt_params: @@ -1375,6 +1451,9 @@ def parse_args() -> argparse.Namespace: g.add_argument("--residual-scale-init", type=float, default=0.5) g.add_argument("--jacobian-proxy-weight", type=float, default=0.01) g.add_argument("--no-interpass-rmsnorm", action="store_true") + g = parser.add_argument_group("eval-only") + g.add_argument("--eval-only", action="store_true", + help="Skip training, load existing final_model.int6.ptz and run TTT only") return parser.parse_args() def main() -> None: cli = parse_args() @@ -1472,6 +1551,11 @@ def log0(msg: str, console: bool = True) -> None: core_quant_enabled=args.core_quant_enabled, residual_scale=None, interpass_rmsnorm=not cli.no_interpass_rmsnorm, + bigram_vocab_size=args.bigram_vocab_size, + bigram_dim=args.bigram_dim, + ve_enabled=args.ve_enabled, + ve_dim=args.ve_dim, + ve_layers=args.ve_layers, ).to(device).bfloat16() base_model.qo_bank.data = base_model.qo_bank.data.float() base_model.kv_bank.data = base_model.kv_bank.data.float() @@ -1538,6 +1622,18 @@ def feedback_fn(h, pass_idx): scalar_params.append(base_model.smear.gate) token_lr = args.tied_embed_lr if args.tie_embeddings else args.embed_lr tok_params = [{"params": [base_model.tok_emb.weight], "lr": token_lr, "base_lr": token_lr}] + if base_model.bigram is not None: + tok_params.append({"params": [base_model.bigram.embed.weight], "lr": token_lr, "base_lr": token_lr}) + if base_model.bigram.proj is not None: + scalar_params.append(base_model.bigram.proj.weight) + scalar_params.append(base_model.bigram.scale) + if base_model.ve_shared is not None: + tok_params.append({"params": [base_model.ve_shared.embed.weight], "lr": token_lr, "base_lr": token_lr}) + if base_model.ve_shared.proj is not None: + scalar_params.append(base_model.ve_shared.proj.weight) + scalar_params.append(base_model.ve_shared.scale) + for s in base_model.ve_layer_scales: + scalar_params.append(s) optimizer_tok = torch.optim.AdamW( tok_params, betas=(args.beta1, args.beta2), @@ -1611,6 +1707,95 @@ def lr_mul(step: int, elapsed_ms: float) -> float: warmdown_ms = args.warmdown_iters * step_ms remaining_ms = max(max_wallclock_ms - elapsed_ms, 0.0) return remaining_ms / max(warmdown_ms, 1e-9) if remaining_ms <= warmdown_ms else 1.0 + if cli.eval_only: + log0("eval_only: skipping training, loading existing artifact") + eval_num_passes = args.eval_passes if args.eval_passes > 0 else args.num_passes + if eval_num_passes != args.num_passes: + log0(f"eval_override: num_passes {args.num_passes} -> {eval_num_passes}") + with open("final_model.int6.ptz", "rb") as f: + quant_blob_disk = f.read() + log0(f"eval_only: artifact size {len(quant_blob_disk)} bytes") + quant_state = torch.load( + io.BytesIO(lzma.decompress(quant_blob_disk)), + map_location="cpu", + ) + tmp_model = GPT( + vocab_size=args.vocab_size, num_layers=args.num_layers, model_dim=args.model_dim, + num_heads=args.num_heads, num_kv_heads=args.num_kv_heads, mlp_mult=args.mlp_mult, + tie_embeddings=args.tie_embeddings, 
tied_embed_init_std=args.tied_embed_init_std, + logit_softcap=args.logit_softcap, rope_base=args.rope_base, qk_gain_init=args.qk_gain_init, + xsa_last_n=args.xsa_last_n, rope_dims=args.rope_dims, ln_scale=args.ln_scale, + core_start=args.core_start, core_end=args.core_end, + num_passes=eval_num_passes, + interpass_rmsnorm=not cli.no_interpass_rmsnorm, + bigram_vocab_size=args.bigram_vocab_size, bigram_dim=args.bigram_dim, + ve_enabled=args.ve_enabled, ve_dim=args.ve_dim, ve_layers=args.ve_layers, + ) + if cli.residual_scale_init > 0: + tmp_model.residual_scale = ResidualScale(eval_num_passes, cli.residual_scale_init) + tmp_sd = {k: v.detach().cpu() for k, v in tmp_model.state_dict().items()} + unbanked_tmp = _unbank_state_dict(tmp_sd, args.num_layers) + for k in quant_state["m"]: + if k.startswith("_feedback.") and k not in unbanked_tmp: + unbanked_tmp[k] = torch.zeros(1) + deq_unbanked = dequantize_mixed_int6(quant_state["w"], quant_state["m"], unbanked_tmp) + deq_state = _rebank_state_dict(deq_unbanked, args.num_layers, tmp_sd) + eval_feedback = None + eval_feedback_fn = None + fb_keys = {k: v for k, v in deq_state.items() if k.startswith("_feedback.")} + if fb_keys: + deq_state = {k: v for k, v in deq_state.items() if not k.startswith("_feedback.")} + eval_feedback = ErrorFeedbackModule( + dim=args.model_dim, rank=cli.feedback_rank, + feedback_mode=cli.feedback_mode, + per_pass=cli.per_pass_feedback, + num_passes=eval_num_passes, + ).to(device).bfloat16() + fb_sd = {k.removeprefix("_feedback."): v for k, v in fb_keys.items()} + eval_feedback.load_state_dict(fb_sd, strict=True) + def eval_feedback_fn(h, pass_idx): + return eval_feedback(h, pass_idx) + log0(f"eval_feedback: loaded from artifact, params={eval_feedback.param_count()}") + eval_model = GPT( + vocab_size=args.vocab_size, num_layers=args.num_layers, model_dim=args.model_dim, + num_heads=args.num_heads, num_kv_heads=args.num_kv_heads, mlp_mult=args.mlp_mult, + tie_embeddings=args.tie_embeddings, tied_embed_init_std=args.tied_embed_init_std, + logit_softcap=args.logit_softcap, rope_base=args.rope_base, qk_gain_init=args.qk_gain_init, + xsa_last_n=args.xsa_last_n, rope_dims=args.rope_dims, ln_scale=args.ln_scale, + core_start=args.core_start, core_end=args.core_end, + num_passes=eval_num_passes, + interpass_rmsnorm=not cli.no_interpass_rmsnorm, + bigram_vocab_size=args.bigram_vocab_size, bigram_dim=args.bigram_dim, + ve_enabled=args.ve_enabled, ve_dim=args.ve_dim, ve_layers=args.ve_layers, + ).to(device).bfloat16() + if cli.residual_scale_init > 0: + eval_rs = ResidualScale(eval_num_passes, cli.residual_scale_init).to(device) + eval_model.residual_scale = eval_rs + eval_model.qo_bank.data = eval_model.qo_bank.data.float() + eval_model.kv_bank.data = eval_model.kv_bank.data.float() + eval_model.mlp_up_bank.data = eval_model.mlp_up_bank.data.float() + eval_model.mlp_down_bank.data = eval_model.mlp_down_bank.data.float() + for m in eval_model.modules(): + if isinstance(m, CastedLinear): + m.float() + restore_low_dim_params_to_fp32(eval_model) + eval_model.load_state_dict(deq_state, strict=True) + if args.ttt_enabled: + torch.cuda.synchronize() + t_ttt = time.perf_counter() + ttt_loss, ttt_bpb = eval_val_sliding_ttt( + args, eval_model, rank, world_size, device, + val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut, + stride=args.eval_stride, log0=log0, + feedback_fn=eval_feedback_fn, feedback_module=eval_feedback, + ) + torch.cuda.synchronize() + log0(f"legal_ttt val_loss:{ttt_loss:.4f} val_bpb:{ttt_bpb:.4f} " + 
f"eval_time:{1000.0 * (time.perf_counter() - t_ttt):.0f}ms") + log0(f"legal_ttt_exact val_loss:{ttt_loss:.8f} val_bpb:{ttt_bpb:.8f}") + if distributed: + dist.destroy_process_group() + return if args.warmup_steps > 0: initial_model_state = {name: tensor.detach().cpu().clone() for name, tensor in base_model.state_dict().items()} initial_optimizer_states = [copy.deepcopy(opt.state_dict()) for opt in optimizers] @@ -1818,6 +2003,9 @@ def lr_mul(step: int, elapsed_ms: float) -> float: ) full_state_dict = base_model.state_dict() export_sd = full_state_dict + if feedback is not None: + for k, v in feedback.state_dict().items(): + export_sd[f"_feedback.{k}"] = v if master_process: torch.save(export_sd, "final_model.pt") model_bytes = os.path.getsize("final_model.pt") @@ -1836,6 +2024,9 @@ def lr_mul(step: int, elapsed_ms: float) -> float: new_s[:copy_len] = old_s[:copy_len] base_model.residual_scale.scales = nn.Parameter(new_s) export_sd = base_model.state_dict() + if feedback is not None: + for k, v in feedback.state_dict().items(): + export_sd[f"_feedback.{k}"] = v sd_cpu = {k: v.detach().cpu() for k, v in export_sd.items()} unbanked_sd = _unbank_state_dict(sd_cpu, args.num_layers) quant_result, quant_meta = mixed_quantize_int6(unbanked_sd, {"mlp", "attn"}) @@ -1860,6 +2051,22 @@ def lr_mul(step: int, elapsed_ms: float) -> float: ) deq_unbanked = dequantize_mixed_int6(quant_state["w"], quant_state["m"], unbanked_sd) deq_state = _rebank_state_dict(deq_unbanked, args.num_layers, sd_cpu) + eval_feedback = None + eval_feedback_fn = None + fb_keys = {k: v for k, v in deq_state.items() if k.startswith("_feedback.")} + if fb_keys: + deq_state = {k: v for k, v in deq_state.items() if not k.startswith("_feedback.")} + eval_feedback = ErrorFeedbackModule( + dim=args.model_dim, rank=cli.feedback_rank, + feedback_mode=cli.feedback_mode, + per_pass=cli.per_pass_feedback, + num_passes=eval_num_passes, + ).to(device).bfloat16() + fb_sd = {k.removeprefix("_feedback."): v for k, v in fb_keys.items()} + eval_feedback.load_state_dict(fb_sd, strict=True) + def eval_feedback_fn(h, pass_idx): + return eval_feedback(h, pass_idx) + log0(f"eval_feedback: loaded from artifact, params={eval_feedback.param_count()}") eval_model = GPT( vocab_size=args.vocab_size, num_layers=args.num_layers, model_dim=args.model_dim, num_heads=args.num_heads, num_kv_heads=args.num_kv_heads, mlp_mult=args.mlp_mult, @@ -1869,6 +2076,8 @@ def lr_mul(step: int, elapsed_ms: float) -> float: rope_dims=args.rope_dims, ln_scale=args.ln_scale, core_start=args.core_start, core_end=args.core_end, num_passes=eval_num_passes, interpass_rmsnorm=not cli.no_interpass_rmsnorm, + bigram_vocab_size=args.bigram_vocab_size, bigram_dim=args.bigram_dim, + ve_enabled=args.ve_enabled, ve_dim=args.ve_dim, ve_layers=args.ve_layers, ).to(device).bfloat16() if residual_scale is not None: eval_rs = ResidualScale(eval_num_passes, cli.residual_scale_init).to(device) @@ -1889,6 +2098,7 @@ def lr_mul(step: int, elapsed_ms: float) -> float: args, eval_model, rank, world_size, device, val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut, stride=args.eval_stride, log0=log0, + feedback_fn=eval_feedback_fn, feedback_module=eval_feedback, ) torch.cuda.synchronize() log0(f"legal_ttt val_loss:{ttt_loss:.4f} val_bpb:{ttt_bpb:.4f} " diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260329_204103-e5ujh1d7/files/config.yaml 
b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260329_204103-e5ujh1d7/files/config.yaml new file mode 100644 index 0000000000..995b42cc27 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260329_204103-e5ujh1d7/files/config.yaml @@ -0,0 +1,98 @@ +_wandb: + value: + cli_version: 0.25.1 + e: + 4meixsdgz5im6n6dg04xzn4o4hrbs3lg: + args: + - --feedback-mode + - diagonal + - --feedback-rank + - "2" + - --residual-scale-init + - "0.5" + - --jacobian-proxy-weight + - "0.1" + - --no-interpass-rmsnorm + codePath: records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + codePathLocal: train_gpt_recurrent.py + cpu_count: 8 + cpu_count_logical: 16 + cudaVersion: "13.0" + disk: + /: + total: "1330227675136" + used: "42672181248" + email: nesta.midavaine@prosus.com + executable: /home/nesta/parameter-golf/.venv/bin/python + git: + commit: 0375751244eeb7a472968ecab738e82207af1242 + remote: https://github.com/nestamidavaine/parameter-golf.git + gpu: NVIDIA H200 + gpu_count: 1 + gpu_nvidia: + - architecture: Hopper + cudaCores: 16896 + memoryTotal: "150754820096" + name: NVIDIA H200 + uuid: GPU-e312faf2-f704-c38a-00a2-ba4137b99846 + host: computeinstance-e00c09e8zde17qbk32 + memory: + total: "211069919232" + os: Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.39 + program: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py + python: CPython 3.12.3 + root: /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback + startedAt: "2026-03-29T20:41:03.814826Z" + writerId: 4meixsdgz5im6n6dg04xzn4o4hrbs3lg + m: [] + python_version: 3.12.3 + t: + "1": + - 1 + "2": + - 1 + "3": + - 2 + - 13 + - 16 + - 61 + "4": 3.12.3 + "5": 0.25.1 + "10": + - 20 + "12": 0.25.1 + "13": linux-x86_64 +core_end: + value: 7 +core_start: + value: 4 +feedback_mode: + value: diagonal +feedback_rank: + value: 2 +interpass_rmsnorm: + value: false +iterations: + value: 6500 +jacobian_proxy_weight: + value: 0.1 +matrix_lr: + value: 0.025 +model_dim: + value: 512 +n_params: + value: 26927712 +num_layers: + value: 11 +num_passes: + value: 1 +residual_scale_init: + value: 0.5 +scalar_lr: + value: 0.025 +seed: + value: 1337 +train_batch_tokens: + value: 786432 +train_seq_len: + value: 2048 diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260329_204103-e5ujh1d7/files/wandb-summary.json b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260329_204103-e5ujh1d7/files/wandb-summary.json new file mode 100644 index 0000000000..ff336dba23 --- /dev/null +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260329_204103-e5ujh1d7/files/wandb-summary.json @@ -0,0 +1 @@ +{"step_avg_ms":676.9941978683211,"_runtime":7566.023401367,"_timestamp":1.7748215802918837e+09,"_wandb":{"runtime":7566},"_step":6500,"lr_scale":0.0004,"grad_norm":0.03700580447912216,"val_loss":1.9184241376737514,"train_loss":1.9833201169967651,"val_bpb":1.1361988339405418} \ No newline at end of file From 66df5aa7d062cad4be4594e9e78484f52b548126 Mon Sep 17 00:00:00 2001 From: nesta Date: Wed, 1 Apr 2026 10:52:26 +0000 Subject: [PATCH 16/23] change to 2 pass with slightly better performance --- .../run_earlyqat.sh | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_earlyqat.sh
b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_earlyqat.sh index b04c868a95..9d09a91d07 100755 --- a/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_earlyqat.sh +++ b/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/run_earlyqat.sh @@ -62,7 +62,7 @@ export LATE_QAT_THRESHOLD=0.15 export TTT_ENABLED=1 export TTT_LR=0.002 export TTT_EPOCHS=3 -export TTT_CHUNK_TOKENS=49152 +export TTT_CHUNK_TOKENS=32768 #49152 export TTT_FREEZE_BLOCKS=0 export TTT_MOMENTUM=0.9 export TTT_BATCH_SEQS=32 @@ -72,17 +72,17 @@ export TTT_GRAD_CLIP=1.0 export CORE_START=4 export CORE_END=7 export NUM_PASSES=1 -export EVAL_PASSES=4 +export EVAL_PASSES=3 export CORE_QUANT_ENABLED=0 -# Progressive: 1-pass until step 4500, then ramp 2->3->4 -export PASSES_SCHEDULE="0:1,4500:2,5500:3,6000:4" +# Progressive: 1-pass until step 4500, then ramp 2->3 +export PASSES_SCHEDULE="0:1,4500:2,5500:3" export SEED=1337 -export RUN_ID="bigram_ve_wd3500" +export RUN_ID="bigram_ve_wd3500_3pass" torchrun --standalone --nproc_per_node=1 train_gpt.py \ --feedback-mode diagonal --feedback-rank 2 \ --residual-scale-init 0.5 \ --jacobian-proxy-weight 0.1 \ --no-interpass-rmsnorm \ - 2>&1 | tee logs/bigram_ve_wd3500.txt + 2>&1 | tee logs/bigram_ve_wd3500_3pass.txt From 44722c4d19d647bf0dd59d5fe0b9e1e245dfaeca Mon Sep 17 00:00:00 2001 From: nesta Date: Wed, 1 Apr 2026 19:58:20 +0000 Subject: [PATCH 17/23] Add recurrent depth with progressive pass growth + error feedback (non-record) 3-seed mean val_bpb: 1.1163 (std 0.0013), -0.0031 vs PR #549 LeakyReLU baseline. Progressive depth recurrence (1->2->3 passes) with error feedback + jacobian proxy stabilization. Late growth preserves fast step times, avoiding the step/capacity trade-off. Code minified with python-minifier to fit all seeds under 16MB. 
--- .gitignore | 1 + .../README.md | 172 +++++ .../submission.json | 9 + .../train_gpt.py | 617 ++++++++++++++++++ .../train_seed1337.log | 387 +++++++++++ .../train_seed2025.log | 387 +++++++++++ .../train_seed42.log | 387 +++++++++++ 7 files changed, 1960 insertions(+) create mode 100644 records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/README.md create mode 100644 records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/submission.json create mode 100644 records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt.py create mode 100644 records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/train_seed1337.log create mode 100644 records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/train_seed2025.log create mode 100644 records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/train_seed42.log diff --git a/.gitignore b/.gitignore index a6f2bcabc9..4fe67dc24e 100644 --- a/.gitignore +++ b/.gitignore @@ -11,6 +11,7 @@ data/docs_selected.jsonl logs/ *.log *.txt +!records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/*.log *.pt *.ptz *.wandb \ No newline at end of file diff --git a/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/README.md b/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/README.md new file mode 100644 index 0000000000..aaf3d3b9bb --- /dev/null +++ b/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/README.md @@ -0,0 +1,172 @@ +# Recurrent Depth with Progressive Pass Growth + Error Feedback + +**val_bpb: 1.1163** (3-seed mean, std 0.0013) | **~15.96 MB** | 8×H100 SXM + +A non-record submission targeting a significant improvement over [PR #549](https://github.com/openai/parameter-golf/pull/549) (LeakyReLU² baseline, 1.1194 mean bpb); it achieves **-0.0031 bpb** vs that baseline. For an in-depth analysis of depth recurrence in this competition, see [PR #363](https://github.com/openai/parameter-golf/pull/363).
+ +## Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128) + +| Seed | step_avg | steps | Pre-TTT bpb | **Post-TTT bpb** | TTT gain | TTT time | Artifact (bytes) | +|------|----------|-------|-------------|-----------------|----------|----------|----------| +| 1337 | 83.5ms | 6,328 | 1.1353 | **1.1157** | -0.0196 | 566s | 15,909,018 | +| 42 | 83.5ms | 6,334 | 1.1372 | **1.1177** | -0.0195 | 579s | 15,897,530 | +| 2025 | 83.4ms | 6,334 | 1.1351 | **1.1155** | -0.0197 | 588s | 15,995,558 | +| **Mean** | **83.5ms** | **6,332** | **1.1359** | **1.1163 (std 0.0013)** | **-0.0196** | **~578s** | | + +## Progressive Recurrence Architecture + +``` + ┌───────────┐ ┌───────────┐ ┌───────────┐ + │ │ │ │ │ │ + │ Tail │ │ Tail │ │ Tail │ + │ [7-10] │ │ [7-10] │ │ [7-10] │ + │ │ │ │ │ │ + ├───────────┤ ├───────────┤╮ ├───────────┤╮ + │ │ 4500 steps │ ││ 1000 steps │ ││ + │ Core │ ───────────> │ Core ││ ──────────> │ Core ││ + │ [4-6] │ │ [4-6] │2x │ [4-6] │3x + │ │ │ ││ │ ││ + ├───────────┤ ├───────────┤╯ ├───────────┤╯ + │ │ │ │ │ │ + │ Stem │ │ Stem │ │ Stem │ + │ [0-3] │ │ [0-3] │ │ [0-3] │ + │ │ │ │ │ │ + └───────────┘ └───────────┘ └───────────┘ + + 11 layers 14 layers 17 layers + (steps 0-4499) (steps 4500-5499) (steps 5500+, eval) +``` + +## The Problem: Depth Recurrence Fails Under Competition Constraints + +[PR #363](https://github.com/openai/parameter-golf/pull/363) demonstrated that depth recurrence — reusing a shared block of transformer layers multiple times — saves parameters but *hurts* bpb under the 10-minute / 16MB competition constraints. Their controlled experiments showed a **+0.025 bpb gap** (looped worse) due to two compounding taxes: + +1. **Quantization error amplification.** When shared weights are quantized to int6, the quantization error is injected at every pass. After K passes through the same core, the cumulative error grows superlinearly (see the sketch after this list). +2. **Step time overhead.** Each additional recurrence pass adds forward/backward compute. With 4 passes, +32ms/step translates to ~1200 fewer training steps in the 600s budget.
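The first tax can be reproduced in isolation. The sketch below is a toy illustration only (ours, not code from this record or from PR #363): it simulates int6 per-row absmax quantization of a random 512-dim linear "core", reuses that core for several residual passes, and prints how the weight-rounding error compounds in the hidden state.

```python
# Toy demo of quantization error amplification under weight sharing.
# Assumptions: a random linear core with a tanh nonlinearity stands in for
# the shared transformer block; int6 is simulated as per-row absmax
# rounding to [-31, 31], matching the 6-bit signed range used here.
import torch

torch.manual_seed(0)
dim = 512
W = torch.randn(dim, dim) / dim**0.5

# Simulated int6 per-row quantization of the shared weights.
scale = W.abs().amax(dim=1, keepdim=True) / 31.0
W_q = torch.round(W / scale).clamp(-31, 31) * scale

h_exact = torch.randn(dim)
h_quant = h_exact.clone()
for k in range(1, 7):
    h_exact = h_exact + torch.tanh(W @ h_exact)    # pass k with exact weights
    h_quant = h_quant + torch.tanh(W_q @ h_quant)  # same pass, quantized weights
    rel_err = (h_exact - h_quant).norm() / h_exact.norm()
    print(f"pass {k}: relative hidden-state error {rel_err:.3e}")
```

Because the Jacobian of a residual update typically has norm above 1, the relative error grows roughly geometrically with the pass count; this is the dynamical-systems failure mode the contractive stabilizers below are designed to suppress.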
+ +### Learnable Residual Scaling + +Per-pass learnable scalars contract the residual update, preventing hidden state magnitude growth across passes: + +$$h_{k+1} = h_k + \alpha_k \cdot F(h_k + c_k)$$ + +where $\alpha_k$ is initialized to 0.5 and learned during training. This ensures the recurrent dynamics are contractive — later passes refine rather than amplify. + +### Error Feedback Module + +A low-rank correction compensates for accumulated error before each recurrence pass: + +$$e_k = U(V^\top h_k), \qquad c_k = \mathrm{diag}(d) \cdot e_k$$ + +where $U, V \in \mathbb{R}^{d \times r}$ with rank $r=2$ and $d \in \mathbb{R}^d$ is a learnable diagonal. The correction is zero on pass 0 (no prior error to correct) and active on subsequent passes. Total parameter overhead: **2,560 params** (negligible vs 26.7M model params). + +The feedback module is important but not strictly required — we confirmed that stable training is possible without it, and even running eval-only without feedback works, at a cost of ~0.001 bpb higher. The feedback module's main contribution is providing the recurrent passes with an error signal about the previous iteration's residual. + +### Jacobian Proxy Loss (Stabilizer) + +A regularization term penalizes hidden state growth ratio above 1.0, enforcing contractive dynamics without computing the full Jacobian: + +$$\mathcal{L}_J = \lambda \cdot \mathrm{ReLU}\left(\frac{\|h_{k+1} - h_k\|}{\|h_k\| + \epsilon} - 1\right)^{2}$$ + +with $\lambda = 0.01$. This is a cheap finite-difference proxy for the spectral norm of the Jacobian $\partial h_{k+1}/\partial h_k$, encouraging it to stay below 1 (contractive map). + +This loss term is critical for training stability. **Without it, gradient norms and hidden state magnitudes explode** during the multi-pass phases, destabilizing training. The proxy loss keeps the recurrent dynamics well-behaved without the computational cost of full Jacobian computation. + +Note: the jacobian proxy loss is only added to the training loss — it does not affect evaluation scoring, which uses pure cross-entropy. + +## Legal TTT Protocol + +Score-first legal TTT following [PR #461](https://github.com/openai/parameter-golf/pull/461): + +1. Val tokens split into 1,893 non-overlapping 32K-token chunks +2. **For each chunk**: + - **SCORE**: Sliding window eval under `torch.inference_mode()` — no gradients, no weight mutation + - **TRAIN**: SGD on the already-scored chunk. 3 epochs, all blocks unfrozen, cosine LR decay, grad clip 1.0 +3. 
Last chunk scored but never trained on + +| Parameter | Value | +|-----------|-------| +| Chunk size | 32,768 tokens | +| Optimizer | SGD + momentum(0.9) | +| Learning rate | 0.002 (cosine decay) | +| Epochs per chunk | 3 | +| Frozen blocks | None (all blocks adapt) | +| Gradient clip | 1.0 | +| Eval passes | 3 (matching final training phase) | + +### Timing Budget + +| Phase | Time | +|-------|------| +| Training (wallclock cap) | 600s (10 min) | +| Standard eval (int6 + sliding window) | ~3s | +| Legal TTT (score-first + adaptation) | ~578s | +| **Total eval** | **~581s (< 10 min)** | + +## Architecture + +Built on the [PR #414](https://github.com/openai/parameter-golf/pull/414) stack with [PR #399](https://github.com/openai/parameter-golf/pull/399) Parallel Muon: + +| Component | Setting | +|-----------|---------| +| Layers | 11 unique (512d, 8H, 4KV) | +| Effective layers (eval) | 17 (4 stem + 3 core ×3 + 4 tail) | +| MLP | 3× with LeakyReLU(0.5)² | +| BigramHash | 512 | +| XSA | Last 4 layers | +| RoPE | Partial (16/64 dims) | +| LN Scale | 1/√(layer+1) | +| VE128 | Layers 9-10 | +| Recurrence core | Layers 4-6, progressive 1→2→3 passes | +| ResidualScale | Per-pass learnable, init 0.5 | +| Error Feedback | Diagonal mode, rank 2, 2560 params | +| Jacobian proxy | λ=0.01 | +| Weight avg | EMA(0.997) + SWA(every 50) | +| Quantization | Late QAT (threshold 0.15) + GPTQ-lite int6 + lzma | +| Optimizer | Parameter Banking + Parallel Muon | + +## Run Command + +```bash +cd records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback +bash run_earlyqat.sh # Single seed (set SEED env var) +``` + +Key flags: +```bash +torchrun --standalone --nproc_per_node=8 train_gpt.py \ + --feedback-mode diagonal --feedback-rank 2 \ + --residual-scale-init 0.5 \ + --jacobian-proxy-weight 0.01 \ + --no-interpass-rmsnorm +``` + +## Code Size + +The original training script was 88,253 bytes, which caused seed 2025 to exceed the 16MB submission limit (16,025,625 bytes). Dead code paths (eval-only mode, int8 quantization, unused feedback variants, verbose logging) were removed and the code was minified with [python-minifier](https://github.com/dflook/python-minifier) (no local variable renaming) to 58,186 bytes, bringing all seeds under the limit. 
+ +## Credits + +- **Base model**: [PR #414](https://github.com/openai/parameter-golf/pull/414) by @signalrush +- **Optimizer (Parameter Banking + Parallel Muon)**: [PR #399](https://github.com/openai/parameter-golf/pull/399) by @abaybektursun +- **LeakyReLU² activation**: [PR #493](https://github.com/openai/parameter-golf/pull/493) by @parinzee +- **TTT recipe**: [PR #461](https://github.com/openai/parameter-golf/pull/461) by @Christopher-Lee-McClendon +- **Depth recurrence analysis**: [PR #363](https://github.com/openai/parameter-golf/pull/363) by @evangelinehelsinki diff --git a/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/submission.json b/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/submission.json new file mode 100644 index 0000000000..f8d545d33b --- /dev/null +++ b/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/submission.json @@ -0,0 +1,9 @@ +{ + "name": "Recurrent Depth with Progressive Pass Growth + Error Feedback", + "val_bpb": 1.1163, + "bytes_total": 15995558, + "blurb": "Progressive depth recurrence (1->2->3 passes) with error feedback + jacobian proxy stabilization. Late growth preserves fast step times for most of training, avoiding the step/capacity trade-off that makes naive recurrence impractical. 3-seed mean: 1.1163 (std 0.0013), -0.0031 vs PR #549 LeakyReLU baseline (1.1194). Built on PR #414 stack with Parallel Muon (PR #399). All artifacts under 16MB, all eval under 10 min.", + "author": "abaybektursun", + "github_id": "abaybektursun", + "date": "2026-03-26" +} diff --git a/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt.py b/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt.py new file mode 100644 index 0000000000..42b03eb369 --- /dev/null +++ b/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt.py @@ -0,0 +1,617 @@ +from __future__ import annotations +_Z='passthrough_ctrl' +_Y='passthrough' +_X='momentum' +_W='shard_mom' +_V='padded_grad' +_U='fineweb_train_*.bin' +_T='diagonal' +_S='.scale' +_R='mlp_down_bank' +_Q='mlp_up_bank' +_P='kv_bank' +_O='qo_bank' +_N='attn_scale,attn_scales,mlp_scale,mlp_scales,resid_mix,resid_mixes,q_gain,skip_weight,skip_weights,smear,ve_layer_scales,ve_shared.scale' +_M='shard' +_L='scale' +_K='full_update' +_J='utf-8' +_I='cuda' +_H='0' +_G='lr' +_F='params' +_E=.0 +_D=False +_C=1. 
+_B=True +_A=None +import copy,glob,io,lzma,math,os,random,time,uuid +from pathlib import Path +import numpy as np,sentencepiece as spm,torch,torch._dynamo +torch._dynamo.config.recompile_limit=32 +import torch.distributed as dist,torch.nn.functional as F +from torch import Tensor,nn +_gpu_mem_frac=float(os.environ.get('CUDA_MEM_FRACTION',_H)) +if _gpu_mem_frac>0:torch.cuda.set_per_process_memory_fraction(_gpu_mem_frac,0) +from flash_attn_interface import flash_attn_func as flash_attn_3_func +import argparse +class RecurrentStabilizer: + def __init__(self,jacobian_proxy_weight=_E,eps=1e-06,**kw):self.jacobian_proxy_weight=jacobian_proxy_weight;self.eps=eps + def clip(self,h):return h + def jacobian_proxy_loss(self,h_in,h_out): + if self.jacobian_proxy_weight<=0:return h_in.new_zeros(()) + delta=h_out-h_in;ratio=delta.norm()/(h_in.norm()+self.eps);return self.jacobian_proxy_weight*torch.relu(ratio-_C).square() + def reset(self):0 +class ResidualScale(nn.Module): + def __init__(self,num_passes,init_value=_C):super().__init__();self.scales=nn.Parameter(torch.full((num_passes,),init_value,dtype=torch.float32)) + def forward(self,residual,pass_idx):return self.scales[pass_idx].to(dtype=residual.dtype)*residual +class LowRankResidual(nn.Module): + def __init__(self,dim,rank=2):super().__init__();self.V=nn.Parameter(torch.zeros(dim,rank));self.U=nn.Parameter(torch.zeros(dim,rank)) + def forward(self,h):return h@self.V@self.U.T +class DiagonalFeedback(nn.Module): + def __init__(self,dim,init_ones=_D):super().__init__();init_val=torch.ones(dim)if init_ones else torch.zeros(dim);self.d=nn.Parameter(init_val) + def forward(self,e):return self.d.to(dtype=e.dtype)*e +class ErrorFeedbackModule(nn.Module): + def __init__(self,dim,rank=2,feedback_mode=_T,per_pass=_D,num_passes=3,**kw): + super().__init__();self.per_pass=per_pass;self.residual=LowRankResidual(dim,rank) + if feedback_mode=='identity':self.correction=_A + elif per_pass:self.correction=nn.ModuleList([DiagonalFeedback(dim)for _ in range(num_passes)]) + else:self.correction=DiagonalFeedback(dim) + def forward(self,h,pass_idx): + e=self.residual(h) + if self.correction is _A:c=e + elif self.per_pass:c=self.correction[pass_idx](e) + else:c=self.correction(e) + mask=torch.tensor(_C if pass_idx>0 else _E,device=h.device,dtype=h.dtype);return c*mask + def param_count(self):return sum(p.numel()for p in self.parameters()) +_e=os.environ.get +_i=lambda k,d:int(_e(k,d)) +_f=lambda k,d:float(_e(k,d)) +_b=lambda k,d:bool(int(_e(k,d))) +class 
Hyperparameters:data_path=_e('DATA_PATH','./data/datasets/fineweb10B_sp1024');train_files=os.path.join(data_path,_U);val_files=os.path.join(data_path,'fineweb_val_*.bin');tokenizer_path=_e('TOKENIZER_PATH','./data/tokenizers/fineweb_1024_bpe.model');run_id=_e('RUN_ID',str(uuid.uuid4()));seed=_i('SEED',1337);val_batch_size=_i('VAL_BATCH_SIZE',524288);val_loss_every=_i('VAL_LOSS_EVERY',4000);train_log_every=_i('TRAIN_LOG_EVERY',500);iterations=_i('ITERATIONS',20000);warmdown_iters=_i('WARMDOWN_ITERS',3500);warmup_steps=_i('WARMUP_STEPS',20);train_batch_tokens=_i('TRAIN_BATCH_TOKENS',786432);train_seq_len=_i('TRAIN_SEQ_LEN',2048);eval_seq_len=_i('EVAL_SEQ_LEN',2048);max_wallclock_seconds=_f('MAX_WALLCLOCK_SECONDS',6e2);qk_gain_init=_f('QK_GAIN_INIT',1.5);vocab_size=_i('VOCAB_SIZE',1024);num_layers=_i('NUM_LAYERS',11);num_kv_heads=_i('NUM_KV_HEADS',4);model_dim=_i('MODEL_DIM',512);num_heads=_i('NUM_HEADS',8);mlp_mult=_f('MLP_MULT',3.);tie_embeddings=_b('TIE_EMBEDDINGS','1');rope_base=_f('ROPE_BASE',1e4);logit_softcap=_f('LOGIT_SOFTCAP',3e1);embed_lr=_f('EMBED_LR',.6);head_lr=_f('HEAD_LR',.008);tied_embed_lr=_f('TIED_EMBED_LR',.035);tied_embed_init_std=_f('TIED_EMBED_INIT_STD',.005);matrix_lr=_f('MATRIX_LR',.025);scalar_lr=_f('SCALAR_LR',.025);muon_momentum=_f('MUON_MOMENTUM',.99);muon_backend_steps=_i('MUON_BACKEND_STEPS',5);muon_momentum_warmup_start=_f('MUON_MOMENTUM_WARMUP_START',.92);muon_momentum_warmup_steps=_i('MUON_MOMENTUM_WARMUP_STEPS',1500);beta1=_f('BETA1',.9);beta2=_f('BETA2',.95);adam_eps=_f('ADAM_EPS',1e-08);grad_clip_norm=_f('GRAD_CLIP_NORM',.3);eval_stride=_i('EVAL_STRIDE',64);muon_beta2=_f('MUON_BETA2',.95);swa_enabled=_b('SWA_ENABLED','1');swa_every=_i('SWA_EVERY',50);muon_wd=_f('MUON_WD',.04);adam_wd=_f('ADAM_WD',.04);qat_enabled=_b('QAT_ENABLED',_H);xsa_last_n=_i('XSA_LAST_N',4);rope_dims=_i('ROPE_DIMS',16);ln_scale=_b('LN_SCALE','1');late_qat_threshold=_f('LATE_QAT_THRESHOLD',.15);ttt_enabled=_b('TTT_ENABLED',_H);ttt_lr=_f('TTT_LR',.002);ttt_epochs=_i('TTT_EPOCHS',3);ttt_chunk_tokens=_i('TTT_CHUNK_TOKENS',32768);ttt_freeze_blocks=_i('TTT_FREEZE_BLOCKS',2);ttt_momentum=_f('TTT_MOMENTUM',.9);ttt_batch_seqs=_i('TTT_BATCH_SEQS',32);ttt_grad_clip=_f('TTT_GRAD_CLIP',_C);core_start=_i('CORE_START',3);core_end=_i('CORE_END',8);num_passes=_i('NUM_PASSES',1);core_quant_bits=_i('CORE_QUANT_BITS',6);core_quant_enabled=_b('CORE_QUANT_ENABLED',_H);eval_passes=_i('EVAL_PASSES',0);passes_schedule_str=_e('PASSES_SCHEDULE','');bigram_vocab_size=_i('BIGRAM_VOCAB_SIZE',0);bigram_dim=_i('BIGRAM_DIM',32);ve_enabled=_b('VE_ENABLED',_H);ve_dim=_i('VE_DIM',128);ve_layers=_e('VE_LAYERS','9,10') +def zeropower_via_newtonschulz5(G,steps=5,eps=1e-07): + a,b,c=3.4445,-4.775,2.0315;was_2d=G.ndim==2 + if was_2d:G=G.unsqueeze(0) + X=G.bfloat16();transposed=X.size(-2)>X.size(-1) + if transposed:X=X.mT + X=X/(X.norm(dim=(-2,-1),keepdim=_B)+eps) + for _ in range(steps):A=X@X.mT;B=b*A+c*(A@A);X=a*X+B@X + if transposed:X=X.mT + if was_2d:X=X.squeeze(0) + return X +class Muon(torch.optim.Optimizer): + def __init__(self,params,lr,momentum,backend_steps,nesterov=_B,weight_decay=_E):super().__init__(params,dict(lr=lr,momentum=momentum,backend_steps=backend_steps,nesterov=nesterov,weight_decay=weight_decay));self._built=_D + def _build(self): + self._distributed=dist.is_available()and dist.is_initialized();self._world_size=dist.get_world_size()if self._distributed else 1;self._rank=dist.get_rank()if self._distributed else 0;ws=self._world_size;self._bank_meta=[] + for group in self.param_groups: + for p in 
group[_F]:B=p.shape[0];padded_B=(B+ws-1)//ws*ws;shard_B=padded_B//ws;tail=p.shape[1:];dev=p.device;self._bank_meta.append({'p':p,'B':B,_V:torch.zeros(padded_B,*tail,device=dev,dtype=torch.bfloat16),_M:torch.zeros(shard_B,*tail,device=dev,dtype=torch.bfloat16),_W:torch.zeros(shard_B,*tail,device=dev,dtype=torch.bfloat16),_K:torch.zeros(padded_B,*tail,device=dev,dtype=torch.bfloat16),_L:max(1,p.shape[-2]/p.shape[-1])**.5}) + self._bank_meta.sort(key=lambda m:-m['p'].numel());self._built=_B + def launch_reduce_scatters(self): + '' + if not self._built:self._build() + if not self._distributed:return + self._rs_futures=[] + for m in self._bank_meta: + p=m['p'] + if p.grad is _A:self._rs_futures.append(_A);continue + pg=m[_V];pg[:m['B']].copy_(p.grad.bfloat16()) + if pg.shape[0]>m['B']:pg[m['B']:].zero_() + fut=dist.reduce_scatter_tensor(m[_M],pg,op=dist.ReduceOp.AVG,async_op=_B);self._rs_futures.append(fut) + @torch.no_grad() + def step(self,closure=_A): + '';B='_rs_futures';A='momentum_buffer';loss=_A + if closure is not _A: + with torch.enable_grad():loss=closure() + if not self._built:self._build() + for group in self.param_groups: + lr=group[_G];momentum=group[_X];backend_steps=group['backend_steps'];nesterov=group['nesterov'];wd=group.get('weight_decay',_E);prev_ag_handle=_A;prev_m=_A;sharded=self._distributed and hasattr(self,B) + for(i,m)in enumerate(self._bank_meta): + p=m['p'] + if p.grad is _A:continue + if prev_ag_handle is not _A: + prev_ag_handle.wait();pp=prev_m['p'];upd=prev_m[_K][:prev_m['B']] + if wd>_E:pp.data.mul_(_C-lr*wd) + pp.add_(upd.to(dtype=pp.dtype),alpha=-lr*prev_m[_L]) + if sharded and self._rs_futures[i]is not _A:self._rs_futures[i].wait();g=m[_M];buf=m[_W] + else: + g=p.grad.bfloat16();state=self.state[p] + if A not in state:state[A]=torch.zeros_like(g) + buf=state[A] + buf.mul_(momentum).add_(g) + if nesterov:update=g.add(buf,alpha=momentum) + else:update=buf + update=zeropower_via_newtonschulz5(update,steps=backend_steps) + if sharded:prev_ag_handle=dist.all_gather_into_tensor(m[_K],update,async_op=_B);prev_m=m + else: + if wd>_E:p.data.mul_(_C-lr*wd) + p.add_(update.to(dtype=p.dtype),alpha=-lr*m[_L]) + if prev_ag_handle is not _A: + prev_ag_handle.wait();pp=prev_m['p'];upd=prev_m[_K][:prev_m['B']] + if wd>_E:pp.data.mul_(_C-lr*wd) + pp.add_(upd.to(dtype=pp.dtype),alpha=-lr*prev_m[_L]) + if hasattr(self,B):del self._rs_futures + return loss +def build_sentencepiece_luts(sp,vocab_size,device): + sp_vocab_size=int(sp.vocab_size());table_size=max(sp_vocab_size,vocab_size);base_bytes_np=np.zeros((table_size,),dtype=np.int16);has_leading_space_np=np.zeros((table_size,),dtype=np.bool_);is_boundary_token_np=np.ones((table_size,),dtype=np.bool_) + for token_id in range(sp_vocab_size): + if sp.is_control(token_id)or sp.is_unknown(token_id)or sp.is_unused(token_id):continue + is_boundary_token_np[token_id]=_D + if sp.is_byte(token_id):base_bytes_np[token_id]=1;continue + piece=sp.id_to_piece(token_id) + if piece.startswith('▁'):has_leading_space_np[token_id]=_B;piece=piece[1:] + base_bytes_np[token_id]=len(piece.encode(_J)) + return torch.tensor(base_bytes_np,dtype=torch.int16,device=device),torch.tensor(has_leading_space_np,dtype=torch.bool,device=device),torch.tensor(is_boundary_token_np,dtype=torch.bool,device=device) +def load_validation_tokens(pattern,seq_len): + files=[Path(p)for p in sorted(glob.glob(pattern))] + if not files:raise FileNotFoundError(f"No files found for pattern: {pattern}") + tokens=torch.cat([load_data_shard(file)for file in 
files]).contiguous();usable=(tokens.numel()-1)//seq_len*seq_len + if usable<=0:raise ValueError('val split too short') + return tokens[:usable+1] +def eval_val(args,model,rank,world_size,device,grad_accum_steps,val_tokens,base_bytes_lut,has_leading_space_lut,is_boundary_token_lut,eval_seq_len=_A): + seq_len=eval_seq_len or args.train_seq_len;local_batch_tokens=args.val_batch_size//(world_size*grad_accum_steps) + if local_batch_tokens0 else _C,dtype=torch.float32);q=torch.clamp(torch.round(torch.clamp(t32,-clip_abs,clip_abs)/scale),-127,127).to(torch.int8).contiguous();return q,scale +def load_data_shard(file): + B='0: + avail=self.tokens.numel()-self.pos + if avail<=0:self._advance_file();continue + k=min(remaining,avail);chunks.append(self.tokens[self.pos:self.pos+k]);self.pos+=k;remaining-=k + return chunks[0]if len(chunks)==1 else torch.cat(chunks) +class DistributedTokenLoader: + def __init__(self,pattern,rank,world_size,device):self.rank=rank;self.world_size=world_size;self.device=device;self.stream=TokenStream(pattern) + def next_batch(self,global_tokens,seq_len,grad_accum_steps):local_tokens=global_tokens//(self.world_size*grad_accum_steps);per_rank_span=local_tokens+1;chunk=self.stream.take(per_rank_span*self.world_size);start=self.rank*per_rank_span;local=chunk[start:start+per_rank_span].to(dtype=torch.int64);x=local[:-1].reshape(-1,seq_len);y=local[1:].reshape(-1,seq_len);return x.to(self.device,non_blocking=_B),y.to(self.device,non_blocking=_B) +class BigramHashEmbedding(nn.Module): + def __init__(self,bigram_vocab_size,bigram_dim,model_dim): + super().__init__();self.bigram_vocab_size=bigram_vocab_size;self.embed=nn.Embedding(bigram_vocab_size,bigram_dim);nn.init.zeros_(self.embed.weight);self.proj=CastedLinear(bigram_dim,model_dim,bias=_D)if bigram_dim!=model_dim else _A + if self.proj is not _A:nn.init.zeros_(self.proj.weight) + self.scale=nn.Parameter(torch.tensor(.05,dtype=torch.float32)) + def bigram_hash(self,tokens):t=tokens.to(torch.int32);mod=self.bigram_vocab_size-1;out=torch.empty_like(t);out[...,0]=mod;out[...,1:]=torch.bitwise_xor(36313*t[...,1:],27191*t[...,:-1])%mod;return out.long() + def forward(self,token_ids): + h=self.embed(self.bigram_hash(token_ids)) + if self.proj is not _A:h=self.proj(h) + return h*self.scale.to(dtype=h.dtype) +class ValueEmbedding(nn.Module): + def __init__(self,vocab_size,ve_dim,model_dim): + super().__init__();self.embed=nn.Embedding(vocab_size,ve_dim);nn.init.normal_(self.embed.weight,std=.01);self.proj=CastedLinear(ve_dim,model_dim,bias=_D)if ve_dim!=model_dim else _A + if self.proj is not _A:nn.init.zeros_(self.proj.weight) + self.scale=nn.Parameter(torch.tensor(.1,dtype=torch.float32)) + def forward(self,token_ids): + h=self.embed(token_ids) + if self.proj is not _A:h=self.proj(h) + return h*self.scale.to(dtype=h.dtype) +class RMSNorm(nn.Module): + def __init__(self,eps=_A):super().__init__();self.eps=eps + def forward(self,x):return F.rms_norm(x,(x.size(-1),),eps=self.eps) +class CastedLinear(nn.Linear): + _qat_enabled:bool=_D + def forward(self,x): + w=self.weight.to(x.dtype) + if CastedLinear._qat_enabled and self.training and w.ndim==2: + with torch.no_grad():w32=self.weight.float();row_max=w32.abs().amax(dim=1);scale=(row_max/31.).clamp_min(_C/31.);w_q=(torch.clamp(torch.round(w32/scale[:,_A]),-32,31)*scale[:,_A]).to(x.dtype) + w=w+(w_q-w).detach() + bias=self.bias.to(x.dtype)if self.bias is not _A else _A;return F.linear(x,w,bias) +def restore_low_dim_params_to_fp32(module): + with torch.no_grad(): + for(name,param)in 
module.named_parameters(): + if(param.ndim<2 or any(p in name for p in _N.split(',')))and param.dtype!=torch.float32:param.data=param.data.float() +class Rotary(nn.Module): + def __init__(self,dim,base=1e4,train_seq_len=1024,rope_dims=0):super().__init__();self.dim=dim;self.base=base;self.train_seq_len=train_seq_len;self.rope_dims=rope_dims if rope_dims>0 else dim;inv_freq=_C/base**(torch.arange(0,self.rope_dims,2,dtype=torch.float32)/self.rope_dims);self.register_buffer('inv_freq',inv_freq,persistent=_D);self._seq_len_cached=0;self._cos_cached=_A;self._sin_cached=_A + def forward(self,seq_len,device,dtype): + if self._cos_cached is _A or self._sin_cached is _A or self._seq_len_cached!=seq_len or self._cos_cached.device!=device: + rd=self.rope_dims + if seq_len>self.train_seq_len:scale=seq_len/self.train_seq_len;new_base=self.base*scale**(rd/(rd-2));inv_freq=_C/new_base**(torch.arange(0,rd,2,dtype=torch.float32,device=device)/rd) + else:inv_freq=self.inv_freq.to(device) + t=torch.arange(seq_len,device=device,dtype=inv_freq.dtype);freqs=torch.outer(t,inv_freq);self._cos_cached=freqs.cos()[_A,:,_A,:];self._sin_cached=freqs.sin()[_A,:,_A,:];self._seq_len_cached=seq_len + return self._cos_cached.to(dtype=dtype),self._sin_cached.to(dtype=dtype) +def apply_rotary_emb(x,cos,sin,rope_dims=0): + if rope_dims>0 and rope_dims=2:row_max=w32.abs().amax(dim=-1);scale=(row_max/clip_range).clamp_min(_C/clip_range);dims=(slice(_A),)*(w32.ndim-1)+(_A,);w_q=(torch.clamp(torch.round(w32/scale[dims]),-clip_range,clip_range)*scale[dims]).to(w.dtype) + else:amax=w32.abs().max();scale=(amax/clip_range).clamp_min(_C/clip_range);w_q=(torch.clamp(torch.round(w32/scale),-clip_range,clip_range)*scale).to(w.dtype) + return w+(w_q-w).detach() +class GPT(nn.Module): + def __init__(self,vocab_size,num_layers,model_dim,num_heads,num_kv_heads,mlp_mult,tie_embeddings,tied_embed_init_std,logit_softcap,rope_base,qk_gain_init,xsa_last_n=0,rope_dims=0,ln_scale=_D,core_start=3,core_end=8,num_passes=1,core_quant_bits=6,core_quant_enabled=_D,residual_scale=_A,interpass_rmsnorm=_B,bigram_vocab_size=0,bigram_dim=32,ve_enabled=_D,ve_dim=128,ve_layers='9,10'): + super().__init__();self._ve_target_dim=num_kv_heads*(model_dim//num_heads) + if logit_softcap<=_E:raise ValueError('logit_softcap must be >0') + self.tie_embeddings=tie_embeddings;self.tied_embed_init_std=tied_embed_init_std;self.logit_softcap=logit_softcap;self.core_start=core_start;self.core_end=min(core_end,num_layers);self.interpass_rmsnorm=interpass_rmsnorm;self.num_passes=num_passes;self.core_quant_bits=core_quant_bits;self.core_quant_enabled=core_quant_enabled;self.num_stem=core_start;self.num_core=self.core_end-core_start;self.num_tail=num_layers-self.core_end;self.residual_scale=residual_scale;self.tok_emb=nn.Embedding(vocab_size,model_dim);self.bigram=BigramHashEmbedding(bigram_vocab_size,bigram_dim,model_dim)if bigram_vocab_size>0 else 
_A;self.smear=SmearGate(model_dim);self.num_skip_weights=min(self.num_stem,self.num_tail);self.skip_weights=nn.Parameter(torch.ones(self.num_skip_weights,model_dim,dtype=torch.float32));head_dim=model_dim//num_heads;kv_dim=num_kv_heads*head_dim;mlp_dim=int(mlp_mult*model_dim);self.num_layers=num_layers;self.qo_bank=nn.Parameter(torch.empty(2*num_layers,model_dim,model_dim));self.kv_bank=nn.Parameter(torch.empty(2*num_layers,kv_dim,model_dim));self.mlp_up_bank=nn.Parameter(torch.empty(num_layers,mlp_dim,model_dim));self.mlp_down_bank=nn.Parameter(torch.empty(num_layers,model_dim,mlp_dim));self.blocks=nn.ModuleList([Block(model_dim,num_heads,num_kv_heads,mlp_mult,rope_base,qk_gain_init,layer_idx=i,ln_scale=ln_scale)for i in range(num_layers)]) + if rope_dims>0: + head_dim=model_dim//num_heads + for block in self.blocks:block.attn.rope_dims=rope_dims;block.attn.rotary=Rotary(head_dim,base=rope_base,train_seq_len=1024,rope_dims=rope_dims) + self.ve_layer_indices=[int(x)for x in ve_layers.split(',')if x.strip()]if ve_enabled else[];kv_dim_ve=self._ve_target_dim + if self.ve_layer_indices:self.ve_shared=ValueEmbedding(vocab_size,ve_dim,kv_dim_ve);self.ve_layer_scales=nn.ParameterList([nn.Parameter(torch.ones(1,dtype=torch.float32))for _ in self.ve_layer_indices]) + else:self.ve_shared=_A;self.ve_layer_scales=nn.ParameterList() + self.value_embeds=nn.ModuleList();self.final_norm=RMSNorm();self.lm_head=_A if tie_embeddings else CastedLinear(model_dim,vocab_size,bias=_D) + if self.lm_head is not _A:self.lm_head._zero_init=_B + self.mtp_heads=nn.ModuleList() + if xsa_last_n>0: + for i in range(max(0,num_layers-xsa_last_n),num_layers): + if i=self.core_end:self.blocks[i].attn.use_xsa=_B + self._init_weights() + def _init_weights(self): + if self.tie_embeddings:nn.init.normal_(self.tok_emb.weight,mean=_E,std=self.tied_embed_init_std) + n=self.num_layers;proj_scale=_C/math.sqrt(2*n) + for i in range(n):nn.init.orthogonal_(self.qo_bank.data[i],gain=_C);nn.init.zeros_(self.qo_bank.data[n+i]);nn.init.orthogonal_(self.kv_bank.data[i],gain=_C);nn.init.orthogonal_(self.kv_bank.data[n+i],gain=_C);nn.init.orthogonal_(self.mlp_up_bank.data[i],gain=_C);nn.init.zeros_(self.mlp_down_bank.data[i]);self.qo_bank.data[n+i].mul_(proj_scale);self.mlp_down_bank.data[i].mul_(proj_scale) + for(name,module)in self.named_modules(): + if isinstance(module,nn.Linear): + if getattr(module,'_zero_init',_D):nn.init.zeros_(module.weight) + elif module.weight.ndim==2 and module.weight.shape[0]>=64 and module.weight.shape[1]>=64:nn.init.orthogonal_(module.weight,gain=_C) + def _get_ve(self,layer_idx,input_ids,ve_cache=_A): + A='ve' + if self.ve_shared is _A or layer_idx not in self.ve_layer_indices:return + if ve_cache is not _A and A not in ve_cache:ve_cache[A]=self.ve_shared(input_ids) + ve_base=ve_cache[A]if ve_cache is not _A else self.ve_shared(input_ids);ve_idx=self.ve_layer_indices.index(layer_idx);return ve_base*self.ve_layer_scales[ve_idx].to(dtype=ve_base.dtype) + def _get_bank_weights(self,bi): + n=self.num_layers;q_w=self.qo_bank[bi];out_w=self.qo_bank[n+bi];k_w=self.kv_bank[bi];v_w=self.kv_bank[n+bi];up_w=self.mlp_up_bank[bi];down_w=self.mlp_down_bank[bi] + if self.core_quant_enabled and self.training and self.core_start<=bi0 and self.interpass_rmsnorm:x=F.rms_norm(x,(x.size(-1),)) + if feedback_fn is not _A:x=x+feedback_fn(x,k) + if stabilizer is not _A:x=stabilizer.clip(x) + x_before_pass=x + for j in 
range(self.core_start,self.core_end):h_prev=x;ve=self._get_ve(j,input_ids,ve_cache);q_w,k_w,v_w,out_w,up_w,down_w=self._get_bank_weights(j);x,_=self.blocks[j](x,x0,q_w,k_w,v_w,out_w,up_w,down_w,v_embed=ve) + if self.residual_scale is not _A and k>0:delta=x-x_before_pass;x=x_before_pass+self.residual_scale(delta,k) + h_core_out=x + for i in range(self.core_end,n): + ti=i-self.core_end + if ti0:main_loss=main_loss+stabilizer.jacobian_proxy_loss(h_core_in,h_core_out) + return main_loss + def forward_logits(self,input_ids,feedback_fn=_A,stabilizer=_A): + '';x,_,_=self._forward_hidden(input_ids,feedback_fn,stabilizer) + if self.tie_embeddings:logits_proj=F.linear(x,self.tok_emb.weight) + else:logits_proj=self.lm_head(x) + return self.logit_softcap*torch.tanh(logits_proj/self.logit_softcap) +def eval_val_sliding_ttt(args,base_model,rank,world_size,device,val_tokens,base_bytes_lut,has_leading_space_lut,is_boundary_token_lut,stride,batch_seqs=32,log0=print,feedback_fn=_A,feedback_module=_A): + seq_len=args.train_seq_len;total_tokens=val_tokens.numel()-1;ttt_chunk=args.ttt_chunk_tokens;window_starts=[ws for ws in range(0,total_tokens,stride)if min(ws+seq_len,total_tokens)-ws>=stride or ws==0];num_chunks=(total_tokens+ttt_chunk-1)//ttt_chunk;chunk_windows=[[]for _ in range(num_chunks)] + for ws in window_starts:end=min(ws+seq_len,total_tokens);wlen=end-ws;s=0 if ws==0 else max(wlen-stride,0);scored_start=ws+s;ci=min(scored_start//ttt_chunk,num_chunks-1);chunk_windows[ci].append(ws) + log0(f"ttt_sliding:start chunks={num_chunks} chunk_tokens={ttt_chunk} total_windows={len(window_starts)} stride={stride} ttt_lr={args.ttt_lr} ttt_epochs={args.ttt_epochs} freeze_blocks={args.ttt_freeze_blocks}");loss_sum=torch.zeros((),device=device,dtype=torch.float64);token_count=torch.zeros((),device=device,dtype=torch.float64);byte_count=torch.zeros((),device=device,dtype=torch.float64);frozen_block_ids=set(range(min(args.ttt_freeze_blocks,len(base_model.blocks))));ttt_params=[] + for(name,p)in base_model.named_parameters(): + freeze=_D + for bi in frozen_block_ids: + if f"blocks.{bi}."in name:freeze=_B;break + if freeze:p.requires_grad_(_D) + else:p.requires_grad_(_B);ttt_params.append(p) + if feedback_module is not _A: + for p in feedback_module.parameters():p.requires_grad_(_B);ttt_params.append(p) + log0(f"ttt_sliding:params unfrozen={sum(p.numel()for p in ttt_params)} frozen={sum(p.numel()for p in base_model.parameters()if not p.requires_grad)}");optimizer=torch.optim.SGD(ttt_params,lr=args.ttt_lr,momentum=args.ttt_momentum);t0=time.perf_counter() + for ci in range(num_chunks): + windows=chunk_windows[ci] + if not windows:continue + chunk_start=ci*ttt_chunk;chunk_end=min((ci+1)*ttt_chunk,total_tokens);my_s=len(windows)*rank//world_size;my_e=len(windows)*(rank+1)//world_size;my_windows=windows[my_s:my_e];base_model.eval() + with torch.inference_mode(): + for bi in range(0,len(my_windows),batch_seqs): + batch_ws=my_windows[bi:bi+batch_seqs];bsz=len(batch_ws);x_batch=torch.zeros(bsz,seq_len,dtype=torch.int64,device=device);y_batch=torch.zeros(bsz,seq_len,dtype=torch.int64,device=device);wlens=[] + for(i,ws)in enumerate(batch_ws):end=min(ws+seq_len,total_tokens);wlen=end-ws;wlens.append(wlen);chunk_tok=val_tokens[ws:end+1].to(dtype=torch.int64,device=device);x_batch[i,:wlen]=chunk_tok[:-1];y_batch[i,:wlen]=chunk_tok[1:] + with torch.autocast(device_type=_I,dtype=torch.bfloat16):logits=base_model.forward_logits(x_batch,feedback_fn=feedback_fn) + 
nll=F.cross_entropy(logits.reshape(-1,logits.size(-1)).float(),y_batch.reshape(-1),reduction='none').reshape(bsz,seq_len) + for(i,ws)in enumerate(batch_ws):wlen=wlens[i];s=0 if ws==0 else max(wlen-stride,0);scored_nll=nll[i,s:wlen].to(torch.float64);loss_sum+=scored_nll.sum();token_count+=float(wlen-s);tgt,prev=y_batch[i,s:wlen],x_batch[i,s:wlen];tb=base_bytes_lut[tgt].to(torch.float64);tb+=(has_leading_space_lut[tgt]&~is_boundary_token_lut[prev]).to(torch.float64);byte_count+=tb.sum() + is_last_chunk=ci==num_chunks-1 + if not is_last_chunk and args.ttt_epochs>0: + base_model.train();chunk_seqs=(chunk_end-chunk_start)//seq_len + if chunk_seqs>0: + cos_lr=args.ttt_lr*.5*(_C+math.cos(math.pi*ci/max(num_chunks-1,1))) + for pg in optimizer.param_groups:pg[_G]=cos_lr + my_seq_s=chunk_seqs*rank//world_size;my_seq_e=chunk_seqs*(rank+1)//world_size;my_chunk_seqs=my_seq_e-my_seq_s + for _ep in range(args.ttt_epochs): + for bs in range(0,my_chunk_seqs,args.ttt_batch_seqs): + be=min(bs+args.ttt_batch_seqs,my_chunk_seqs);actual_bs=my_seq_s+bs;start_tok=chunk_start+actual_bs*seq_len;end_tok=chunk_start+(my_seq_s+be)*seq_len+1 + if end_tok>val_tokens.numel():continue + local=val_tokens[start_tok:end_tok].to(device=device,dtype=torch.int64);x=local[:-1].reshape(-1,seq_len);y=local[1:].reshape(-1,seq_len);optimizer.zero_grad(set_to_none=_B) + with torch.autocast(device_type=_I,dtype=torch.bfloat16):loss=base_model(x,y,feedback_fn=feedback_fn) + loss.backward() + if world_size>1: + for p in ttt_params: + if p.grad is not _A:dist.all_reduce(p.grad,op=dist.ReduceOp.AVG) + torch.nn.utils.clip_grad_norm_(ttt_params,args.ttt_grad_clip);optimizer.step() + if rank==0 and(ci%10==0 or ci==num_chunks-1):elapsed=time.perf_counter()-t0;rl=loss_sum.item()/max(token_count.item(),1);rbpb=rl/math.log(2.)*(token_count.item()/max(byte_count.item(),1))if token_count.item()>0 else _E;log0(f" ttt_chunk [{ci+1}/{num_chunks}] bpb={rbpb:.6f} time={elapsed:.1f}s") + if dist.is_available()and dist.is_initialized():dist.all_reduce(loss_sum,op=dist.ReduceOp.SUM);dist.all_reduce(token_count,op=dist.ReduceOp.SUM);dist.all_reduce(byte_count,op=dist.ReduceOp.SUM) + val_loss=(loss_sum/token_count).item();val_bpb=val_loss/math.log(2.)*(token_count.item()/byte_count.item()) + for p in base_model.parameters():p.requires_grad_(_B) + base_model.eval();log0(f"ttt_sliding:done val_loss={val_loss:.6f}{ val_bpb=:.6f} elapsed={time.perf_counter()-t0:.1f}s");return val_loss,val_bpb +def quantize_int6_per_row(t,clip_range=31): + t32=t.float() + if t32.ndim==2: + best_q,best_s,best_err=_A,_A,float('inf') + for pct in[.999,.9995,.9999,.99999,_C]: + if pct<_C:row_clip=torch.quantile(t32.abs(),pct,dim=1) + else:row_clip=t32.abs().amax(dim=1) + s=(row_clip/clip_range).clamp_min(_C/clip_range).to(torch.float16);q=torch.clamp(torch.round(t32/s.float()[:,_A]),-clip_range,clip_range).to(torch.int8);recon=q.float()*s.float()[:,_A];err=(t32-recon).pow(2).mean().item() + if err0 else _C,dtype=torch.float16);q=torch.clamp(torch.round(t32/scale.float()),-clip_range,clip_range).to(torch.int8);return q,scale +def _unbank_state_dict(sd,num_layers): + out={};n=num_layers + for(name,tensor)in sd.items(): + if name==_O: + for i in range(n):out[f"blocks.{i}.attn.c_q.weight"]=tensor[i];out[f"blocks.{i}.attn.proj.weight"]=tensor[n+i] + elif name==_P: + for i in range(n):out[f"blocks.{i}.attn.c_k.weight"]=tensor[i];out[f"blocks.{i}.attn.c_v.weight"]=tensor[n+i] + elif name==_Q: + for i in range(n):out[f"blocks.{i}.mlp.fc.weight"]=tensor[i] + elif name==_R: + for i in 
range(n):out[f"blocks.{i}.mlp.proj.weight"]=tensor[i] + else:out[name]=tensor + return out +def _rebank_state_dict(sd,num_layers,template_sd): + out={};n=num_layers;qo_slices=[_A]*(2*n);kv_slices=[_A]*(2*n);up_slices=[_A]*n;down_slices=[_A]*n;consumed=set() + for i in range(n): + qk=f"blocks.{i}.attn.c_q.weight" + if qk in sd:qo_slices[i]=sd[qk];consumed.add(qk) + ok=f"blocks.{i}.attn.proj.weight" + if ok in sd:qo_slices[n+i]=sd[ok];consumed.add(ok) + kk=f"blocks.{i}.attn.c_k.weight" + if kk in sd:kv_slices[i]=sd[kk];consumed.add(kk) + vk=f"blocks.{i}.attn.c_v.weight" + if vk in sd:kv_slices[n+i]=sd[vk];consumed.add(vk) + fk=f"blocks.{i}.mlp.fc.weight" + if fk in sd:up_slices[i]=sd[fk];consumed.add(fk) + dk=f"blocks.{i}.mlp.proj.weight" + if dk in sd:down_slices[i]=sd[dk];consumed.add(dk) + out[_O]=torch.stack(qo_slices).to(dtype=template_sd[_O].dtype);out[_P]=torch.stack(kv_slices).to(dtype=template_sd[_P].dtype);out[_Q]=torch.stack(up_slices).to(dtype=template_sd[_Q].dtype);out[_R]=torch.stack(down_slices).to(dtype=template_sd[_R].dtype) + for(name,tensor)in sd.items(): + if name not in consumed:out[name]=tensor + return out +def mixed_quantize_int6(state_dict,int6_cats,core_start=-1,core_end=-1): + A='type';num_layers_total=max((int(k.split('.')[1])for k in state_dict if k.startswith('blocks.')),default=0)+1;late_k_layers=set(range(num_layers_total-2,num_layers_total));result={};meta={} + for(name,tensor)in state_dict.items(): + t=tensor.detach().cpu().contiguous();cat='embed'if'tok_emb'in name or'lm_head'in name else'mlp'if'.mlp.'in name else'attn'if'.attn.'in name else'other' + if not t.is_floating_point()or t.numel()<=65536:result[name]=t.to(torch.float16)if t.is_floating_point()else t;meta[name]=_Y;continue + if any(p in name for p in _N.split(',')):result[name]=t.float();meta[name]=_Z;continue + if cat in int6_cats and t.ndim>=1:q,s=quantize_int6_per_row(t);result[name+'.q']=q;result[name+_S]=s;meta[name]={A:'int6'} + else:q,s=quantize_float_tensor(t);result[name+'.q']=q;result[name+_S]=s;meta[name]={A:'int8'} + return result,meta +def dequantize_mixed_int6(result,meta,template_sd): + out={} + for(name,orig)in template_sd.items(): + info=meta.get(name) + if info is _A:continue + orig_dtype=orig.dtype + if info in(_Y,_Z,'passthrough_fp16'): + t=result[name] + if t.dtype==torch.float16 and orig_dtype in(torch.float32,torch.bfloat16):t=t.to(orig_dtype) + out[name]=t;continue + q,s=result[name+'.q'],result[name+_S] + if s.ndim>0:out[name]=(q.float()*s.float().view(q.shape[0],*[1]*(q.ndim-1))).to(orig_dtype) + else:out[name]=(q.float()*float(s.item())).to(orig_dtype) + return out +def parse_args():A='store_true';p=argparse.ArgumentParser();p.add_argument('--feedback-rank',type=int,default=2);p.add_argument('--feedback-mode',type=str,default=_T);p.add_argument('--per-pass-feedback',action=A);p.add_argument('--residual-scale-init',type=float,default=.5);p.add_argument('--jacobian-proxy-weight',type=float,default=.01);p.add_argument('--no-interpass-rmsnorm',action=A);return p.parse_args() +def _make_gpt(args,cli,num_passes,**kw):return 
GPT(vocab_size=args.vocab_size,num_layers=args.num_layers,model_dim=args.model_dim,num_heads=args.num_heads,num_kv_heads=args.num_kv_heads,mlp_mult=args.mlp_mult,tie_embeddings=args.tie_embeddings,tied_embed_init_std=args.tied_embed_init_std,logit_softcap=args.logit_softcap,rope_base=args.rope_base,qk_gain_init=args.qk_gain_init,xsa_last_n=args.xsa_last_n,rope_dims=args.rope_dims,ln_scale=args.ln_scale,core_start=args.core_start,core_end=args.core_end,num_passes=num_passes,interpass_rmsnorm=not cli.no_interpass_rmsnorm,bigram_vocab_size=args.bigram_vocab_size,bigram_dim=args.bigram_dim,ve_enabled=args.ve_enabled,ve_dim=args.ve_dim,ve_layers=args.ve_layers,**kw) +def _promote_fp32(m): + m.qo_bank.data=m.qo_bank.data.float();m.kv_bank.data=m.kv_bank.data.float();m.mlp_up_bank.data=m.mlp_up_bank.data.float();m.mlp_down_bank.data=m.mlp_down_bank.data.float() + for mod in m.modules(): + if isinstance(mod,CastedLinear):mod.float() + restore_low_dim_params_to_fp32(m) +def main(): + G='final_model.int6.ptz';F='final_model.pt';E='WORLD_SIZE';D='RANK';C='_feedback.';B='_fb.';A='base_lr';cli=parse_args();code=Path(__file__).read_text(encoding=_J);args=Hyperparameters();distributed=D in os.environ and E in os.environ;rank=int(os.environ.get(D,_H));world_size=int(os.environ.get(E,'1'));local_rank=int(os.environ.get('LOCAL_RANK',_H)) + if world_size<=0:raise ValueError('bad WORLD_SIZE') + if 8%world_size!=0:raise ValueError('WORLD_SIZE must divide 8') + grad_accum_steps=8//world_size;grad_scale=_C/grad_accum_steps + if not torch.cuda.is_available():raise RuntimeError('CUDA is required') + device=torch.device(_I,local_rank);torch.cuda.set_device(device) + if distributed:dist.init_process_group(backend='nccl',device_id=device);dist.barrier() + master_process=rank==0;torch.backends.cuda.matmul.allow_tf32=_B;torch.backends.cudnn.allow_tf32=_B;from torch.backends.cuda import enable_cudnn_sdp,enable_flash_sdp,enable_math_sdp,enable_mem_efficient_sdp;enable_cudnn_sdp(_D);enable_flash_sdp(_B);enable_mem_efficient_sdp(_D);enable_math_sdp(_D);logfile=_A + if master_process:os.makedirs('logs',exist_ok=_B);logfile=f"logs/{args.run_id}.txt";print(logfile) + def log0(msg,console=_B): + if not master_process:return + if console:print(msg) + if logfile is not _A: + with open(logfile,'a',encoding=_J)as f:print(msg,file=f) + log0(code,console=_D);random.seed(args.seed);np.random.seed(args.seed);torch.manual_seed(args.seed);torch.cuda.manual_seed_all(args.seed) + if not args.tokenizer_path.endswith('.model'):raise ValueError('need .model tokenizer') + sp=spm.SentencePieceProcessor(model_file=args.tokenizer_path) + if int(sp.vocab_size())!=args.vocab_size:raise ValueError('vocab size mismatch') + dataset_dir=Path(args.data_path).resolve();actual_train_files=len(list(dataset_dir.glob(_U)));effective_eval_seq_len=args.eval_seq_len if args.eval_seq_len>0 else args.train_seq_len;val_seq_len=max(args.train_seq_len,effective_eval_seq_len);val_tokens=load_validation_tokens(args.val_files,val_seq_len);base_bytes_lut,has_leading_space_lut,is_boundary_token_lut=build_sentencepiece_luts(sp,args.vocab_size,device);log0(f"val_bpb:enabled tokenizer_path={args.tokenizer_path}");log0(f"train:{dataset_dir.name} shards:{actual_train_files} 
val_tokens:{val_tokens.numel()-1}");CastedLinear._qat_enabled=args.qat_enabled;base_model=_make_gpt(args,cli,args.num_passes,core_quant_bits=args.core_quant_bits,core_quant_enabled=args.core_quant_enabled,residual_scale=_A).to(device).bfloat16();_promote_fp32(base_model);feedback=_A;feedback_fn=_A;stabilizer=_A;residual_scale=_A;extra_scalar_params=[];passes_schedule=[] + if args.passes_schedule_str: + for entry in args.passes_schedule_str.split(','):s,p=entry.strip().split(':');passes_schedule.append((int(s),int(p))) + passes_schedule.sort(key=lambda x:x[0]) + max_passes=max((p for(_,p)in passes_schedule),default=args.num_passes);max_passes=max(max_passes,args.eval_passes if args.eval_passes>0 else args.num_passes);needs_recurrence=max_passes>1 + if cli.feedback_mode!='none'and needs_recurrence: + feedback=ErrorFeedbackModule(dim=args.model_dim,rank=cli.feedback_rank,feedback_mode=cli.feedback_mode,per_pass=cli.per_pass_feedback,num_passes=max_passes).to(device).bfloat16();restore_low_dim_params_to_fp32(feedback);extra_scalar_params.extend(feedback.parameters()) + def feedback_fn(h,pass_idx):return feedback(h,pass_idx) + log0(f"feedback: {cli.feedback_mode} r={cli.feedback_rank} params={sum(p.numel()for p in feedback.parameters())}") + if needs_recurrence: + stabilizer=RecurrentStabilizer(jacobian_proxy_weight=cli.jacobian_proxy_weight) + if cli.residual_scale_init!=_C:residual_scale=ResidualScale(max_passes,cli.residual_scale_init).to(device);base_model.residual_scale=residual_scale;extra_scalar_params.extend(residual_scale.parameters()) + log0(f"recurrence: {args.core_start}-{args.core_end} passes={args.num_passes}/{max_passes} s/c/t={base_model.num_stem}/{base_model.num_core}/{base_model.num_tail} sched={passes_schedule}");compiled_model=torch.compile(base_model,dynamic=_D,fullgraph=_B);model=compiled_model;matrix_params=[base_model.qo_bank,base_model.kv_bank,base_model.mlp_up_bank,base_model.mlp_down_bank];block_named_params=list(base_model.blocks.named_parameters());scalar_params=[p for(name,p)in block_named_params if p.ndim<2 or any(p in name for p in _N.split(','))] + if base_model.skip_weights.numel()>0:scalar_params.append(base_model.skip_weights) + scalar_params.append(base_model.smear.gate);token_lr=args.tied_embed_lr if args.tie_embeddings else args.embed_lr;tok_params=[{_F:[base_model.tok_emb.weight],_G:token_lr,A:token_lr}] + if base_model.bigram is not _A: + tok_params.append({_F:[base_model.bigram.embed.weight],_G:token_lr,A:token_lr}) + if base_model.bigram.proj is not _A:scalar_params.append(base_model.bigram.proj.weight) + scalar_params.append(base_model.bigram.scale) + if base_model.ve_shared is not _A: + tok_params.append({_F:[base_model.ve_shared.embed.weight],_G:token_lr,A:token_lr}) + if base_model.ve_shared.proj is not _A:scalar_params.append(base_model.ve_shared.proj.weight) + scalar_params.append(base_model.ve_shared.scale) + for s in base_model.ve_layer_scales:scalar_params.append(s) + optimizer_tok=torch.optim.AdamW(tok_params,betas=(args.beta1,args.beta2),eps=args.adam_eps,weight_decay=args.adam_wd,fused=_B);optimizer_muon=Muon(matrix_params,lr=args.matrix_lr,momentum=args.muon_momentum,backend_steps=args.muon_backend_steps,weight_decay=args.muon_wd) + for group in optimizer_muon.param_groups:group[A]=args.matrix_lr + 
scalar_params.extend(extra_scalar_params);optimizer_scalar=torch.optim.AdamW([{_F:scalar_params,_G:args.scalar_lr,A:args.scalar_lr}],betas=(args.beta1,args.beta2),eps=args.adam_eps,weight_decay=args.adam_wd,fused=_B);replicated_params=list(optimizer_tok.param_groups[0][_F]) + for pg in optimizer_tok.param_groups[1:]:replicated_params.extend(pg[_F]) + replicated_params.extend(scalar_params);optimizer_head=_A + if base_model.lm_head is not _A:optimizer_head=torch.optim.Adam([{_F:[base_model.lm_head.weight],_G:args.head_lr,A:args.head_lr}],betas=(args.beta1,args.beta2),eps=args.adam_eps,fused=_B);replicated_params.append(base_model.lm_head.weight) + optimizers=[optimizer_tok,optimizer_muon,optimizer_scalar] + if optimizer_head is not _A:optimizers.append(optimizer_head) + log0(f"params:{sum(p.numel()for p in base_model.parameters())} ws:{world_size} ga:{grad_accum_steps} iters:{args.iterations} wc:{args.max_wallclock_seconds:.0f}s seed:{args.seed}");train_loader=DistributedTokenLoader(args.train_files,rank,world_size,device) + def zero_grad_all(): + for opt in optimizers:opt.zero_grad(set_to_none=_B) + max_wallclock_ms=1e3*args.max_wallclock_seconds if args.max_wallclock_seconds>0 else _A + def lr_mul(step,elapsed_ms): + if args.warmdown_iters<=0:return _C + if max_wallclock_ms is _A:warmdown_start=max(args.iterations-args.warmdown_iters,0);return max((args.iterations-step)/max(args.warmdown_iters,1),_E)if warmdown_start<=step0: + initial_model_state={name:tensor.detach().cpu().clone()for(name,tensor)in base_model.state_dict().items()};initial_optimizer_states=[copy.deepcopy(opt.state_dict())for opt in optimizers];_precompile_passes=sorted(set(p for(_,p)in passes_schedule)-{args.num_passes})if passes_schedule else[];_qat_precompile_passes=_precompile_passes[-2:]if len(_precompile_passes)>=2 else _precompile_passes[:];_total_precompile=len(_precompile_passes)+len(_qat_precompile_passes);_precompile_start=args.warmup_steps-_total_precompile;model.train() + for warmup_step in range(args.warmup_steps): + if warmup_step>=_precompile_start: + _pc_idx=warmup_step-_precompile_start + if _pc_idx=stop_after_step;should_validate=last_step or args.val_loss_every>0 and step%args.val_loss_every==0 + if should_validate:torch.cuda.synchronize();training_time_ms+=1e3*(time.perf_counter()-t0);val_loss,val_bpb=eval_val(args,model,rank,world_size,device,grad_accum_steps,val_tokens,base_bytes_lut,has_leading_space_lut,is_boundary_token_lut);log0(f"step:{step}/{args.iterations} val_loss:{val_loss:.4f} val_bpb:{val_bpb:.4f} train_time:{training_time_ms:.0f}ms step_avg:{training_time_ms/max(step,1):.2f}ms");torch.cuda.synchronize();t0=time.perf_counter() + if last_step: + if stop_after_step is not _A and step=threshold_step:target_passes=p + if target_passes!=base_model.num_passes:base_model.num_passes=target_passes;log0(f"progressive_passes: step:{step} num_passes:{target_passes}") + if args.late_qat_threshold>0 and step>100 and scale0 else _C;muon_momentum=(1-frac)*args.muon_momentum_warmup_start+frac*args.muon_momentum + for group in optimizer_muon.param_groups:group[_X]=muon_momentum + for opt in optimizers: + for group in opt.param_groups:group[_G]=group[A]*scale + grad_norm=_A + if args.grad_clip_norm>0:grad_norm=torch.nn.utils.clip_grad_norm_(base_model.parameters(),args.grad_clip_norm) + optimizer_muon.launch_reduce_scatters() + if distributed: + for p in replicated_params: + if p.grad is not _A:dist.all_reduce(p.grad,op=dist.ReduceOp.AVG) + optimizer_tok.step();optimizer_scalar.step() + if optimizer_head is 
not _A:optimizer_head.step() + optimizer_muon.step();zero_grad_all() + with torch.no_grad(): + _cur=dict(base_model.state_dict()) + if feedback is not _A: + for(k,v)in feedback.state_dict().items():_cur[f"_fb.{k}"]=v + for(name,t)in _cur.items():ema_state[name].mul_(ema_decay).add_(t.detach().float(),alpha=_C-ema_decay) + step+=1;approx_training_time_ms=training_time_ms+1e3*(time.perf_counter()-t0) + if args.swa_enabled and scale<.2 and step%args.swa_every==0: + if swa_state is _A:swa_state={name:t.detach().cpu().clone()for(name,t)in base_model.state_dict().items()};swa_count=1;log0(f"swa:start step:{step}") + else: + for(name,t)in base_model.state_dict().items():swa_state[name]+=t.detach().cpu() + swa_count+=1 + should_log_train=args.train_log_every>0 and(step<=10 or step%args.train_log_every==0 or stop_after_step is not _A) + if should_log_train:tl=train_loss.item();gn_str=f" grad_norm:{grad_norm:.4f}"if grad_norm is not _A else'';log0(f"step:{step}/{args.iterations} train_loss:{tl:.4f}{gn_str} train_time:{approx_training_time_ms:.0f}ms step_avg:{approx_training_time_ms/step:.2f}ms") + reached_cap=max_wallclock_ms is not _A and approx_training_time_ms>=max_wallclock_ms + if distributed and max_wallclock_ms is not _A:reached_cap_tensor=torch.tensor(int(reached_cap),device=device);dist.all_reduce(reached_cap_tensor,op=dist.ReduceOp.MAX);reached_cap=bool(reached_cap_tensor.item()) + if stop_after_step is _A and reached_cap:stop_after_step=step + log0(f"peak memory allocated: {torch.cuda.max_memory_allocated()//1024//1024} MiB reserved: {torch.cuda.max_memory_reserved()//1024//1024} MiB");log0('ema:applying EMA weights');current_state=base_model.state_dict();model_ema={k:v for(k,v)in ema_state.items()if not k.startswith(B)};avg_state={name:model_ema[name].to(dtype=current_state[name].dtype)for name in current_state};base_model.load_state_dict(avg_state,strict=_B) + if feedback is not _A:fb_ema={k.removeprefix(B):v for(k,v)in ema_state.items()if k.startswith(B)};fb_state=feedback.state_dict();fb_avg={k:fb_ema[k].to(dtype=fb_state[k].dtype)for k in fb_state};feedback.load_state_dict(fb_avg,strict=_B) + torch.cuda.synchronize();t_diag=time.perf_counter();diag_val_loss,diag_val_bpb=eval_val(args,compiled_model,rank,world_size,device,grad_accum_steps,val_tokens,base_bytes_lut,has_leading_space_lut,is_boundary_token_lut);torch.cuda.synchronize();log0(f"DIAGNOSTIC post_ema val_loss:{diag_val_loss:.4f} val_bpb:{diag_val_bpb:.4f} eval_time:{1e3*(time.perf_counter()-t_diag):.0f}ms");full_state_dict=base_model.state_dict();export_sd=full_state_dict + if feedback is not _A: + for(k,v)in feedback.state_dict().items():export_sd[f"_feedback.{k}"]=v + if master_process:torch.save(export_sd,F);model_bytes=os.path.getsize(F);code_bytes=len(code.encode(_J));log0(f"Serialized model: {model_bytes} bytes");log0(f"Code size: {code_bytes} bytes") + eval_num_passes=args.eval_passes if args.eval_passes>0 else args.num_passes + if eval_num_passes!=args.num_passes: + log0(f"eval_override: num_passes {args.num_passes} -> {eval_num_passes}");base_model.num_passes=eval_num_passes + if base_model.residual_scale is not _A:old_s=base_model.residual_scale.scales.data;new_s=torch.full((eval_num_passes,),cli.residual_scale_init,dtype=torch.float32,device=old_s.device);copy_len=min(eval_num_passes,old_s.shape[0]);new_s[:copy_len]=old_s[:copy_len];base_model.residual_scale.scales=nn.Parameter(new_s) + export_sd=base_model.state_dict() + if feedback is not _A: + for(k,v)in 
feedback.state_dict().items():export_sd[f"_feedback.{k}"]=v
+ sd_cpu={k:v.detach().cpu()for(k,v)in export_sd.items()};unbanked_sd=_unbank_state_dict(sd_cpu,args.num_layers);quant_result,quant_meta=mixed_quantize_int6(unbanked_sd,{'mlp','attn'});quant_buf=io.BytesIO();torch.save({'w':quant_result,'m':quant_meta},quant_buf);quant_raw=quant_buf.getvalue();quant_blob=lzma.compress(quant_raw,preset=6)
+ if master_process:
+ with open(G,'wb')as f:f.write(quant_blob)
+ quant_file_bytes=len(quant_blob);code_bytes=len(code.encode(_J));log0(f"Serialized model int6+lzma: {quant_file_bytes} bytes");log0(f"Total submission size int6+lzma: {quant_file_bytes+code_bytes} bytes")
+ if distributed:dist.barrier()
+ with open(G,'rb')as f:quant_blob_disk=f.read()
+ quant_state=torch.load(io.BytesIO(lzma.decompress(quant_blob_disk)),map_location='cpu');deq_unbanked=dequantize_mixed_int6(quant_state['w'],quant_state['m'],unbanked_sd);deq_state=_rebank_state_dict(deq_unbanked,args.num_layers,sd_cpu);eval_feedback=_A;eval_feedback_fn=_A;fb_keys={k:v for(k,v)in deq_state.items()if k.startswith(C)}
+ if fb_keys:
+ deq_state={k:v for(k,v)in deq_state.items()if not k.startswith(C)};eval_feedback=ErrorFeedbackModule(dim=args.model_dim,rank=cli.feedback_rank,feedback_mode=cli.feedback_mode,per_pass=cli.per_pass_feedback,num_passes=eval_num_passes).to(device).bfloat16();fb_sd={k.removeprefix(C):v for(k,v)in fb_keys.items()};eval_feedback.load_state_dict(fb_sd,strict=_B)
+ def eval_feedback_fn(h,pass_idx):return eval_feedback(h,pass_idx)
+ log0(f"eval_feedback: loaded from artifact, params={eval_feedback.param_count()}")
+ eval_model=_make_gpt(args,cli,eval_num_passes).to(device).bfloat16()
+ if residual_scale is not _A:eval_rs=ResidualScale(eval_num_passes,cli.residual_scale_init).to(device);eval_model.residual_scale=eval_rs
+ _promote_fp32(eval_model);eval_model.load_state_dict(deq_state,strict=_B)
+ if args.ttt_enabled:torch.cuda.synchronize();t_ttt=time.perf_counter();ttt_loss,ttt_bpb=eval_val_sliding_ttt(args,eval_model,rank,world_size,device,val_tokens,base_bytes_lut,has_leading_space_lut,is_boundary_token_lut,stride=args.eval_stride,log0=log0,feedback_fn=eval_feedback_fn,feedback_module=eval_feedback);torch.cuda.synchronize();log0(f"legal_ttt val_loss:{ttt_loss:.4f} val_bpb:{ttt_bpb:.4f} eval_time:{1e3*(time.perf_counter()-t_ttt):.0f}ms");log0(f"legal_ttt_exact val_loss:{ttt_loss:.8f} val_bpb:{ttt_bpb:.8f}")
+ if distributed:dist.destroy_process_group()
+if __name__=='__main__':main()
\ No newline at end of file
diff --git a/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/train_seed1337.log b/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/train_seed1337.log
new file mode 100644
index 0000000000..6dfbd82729
--- /dev/null
+++ b/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/train_seed1337.log
@@ -0,0 +1,387 @@
+W0401 17:11:57.414000 166208 torch/distributed/run.py:851]
+W0401 17:11:57.414000 166208 torch/distributed/run.py:851] *****************************************
+W0401 17:11:57.414000 166208 torch/distributed/run.py:851] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
+W0401 17:11:57.414000 166208 torch/distributed/run.py:851] ***************************************** +logs/bigram_ve_wd3500_3pass.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:80 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=4 core_end=7 num_passes=1 max_passes=3 stem=4 core=3 tail=4 schedule=[(0, 1), (4500, 2), (5500, 3)] +model_params:26698335 +XSA:last_4 active_layers:[7, 8, 9, 10] +world_size:8 grad_accum_steps:1 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:6500 warmup_steps:20 max_wallclock_seconds:600.000 +seed:1337 +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/6500 val_loss:6.9281 val_bpb:4.1032 train_time:0ms step_avg:0.01ms +step:1/6500 train_loss:6.9290 grad_norm:0.3974 train_time:127ms step_avg:126.94ms +step:2/6500 train_loss:8.4773 grad_norm:3.3575 train_time:161ms step_avg:80.53ms +step:3/6500 train_loss:7.6671 grad_norm:1.7536 train_time:242ms step_avg:80.61ms +step:4/6500 train_loss:7.3212 grad_norm:1.1850 train_time:323ms step_avg:80.77ms +step:5/6500 train_loss:7.1339 grad_norm:1.4214 train_time:404ms step_avg:80.83ms +step:6/6500 train_loss:6.8977 grad_norm:1.2182 train_time:485ms step_avg:80.85ms +step:7/6500 train_loss:6.8026 grad_norm:1.3195 train_time:566ms step_avg:80.81ms +step:8/6500 train_loss:6.7622 grad_norm:0.9580 train_time:647ms step_avg:80.85ms +step:9/6500 train_loss:6.5054 grad_norm:0.9286 train_time:728ms step_avg:80.87ms +step:10/6500 train_loss:6.1474 grad_norm:0.9324 train_time:810ms step_avg:80.97ms +step:50/6500 train_loss:3.7892 grad_norm:1.0422 train_time:4097ms step_avg:81.94ms +step:100/6500 train_loss:3.1689 grad_norm:0.6969 train_time:8215ms step_avg:82.15ms +step:150/6500 train_loss:2.8908 grad_norm:0.4749 train_time:12398ms step_avg:82.65ms +step:200/6500 train_loss:2.3761 grad_norm:0.3690 train_time:16541ms step_avg:82.70ms +step:250/6500 train_loss:2.4712 grad_norm:0.3667 train_time:20679ms step_avg:82.72ms +step:300/6500 train_loss:2.5395 grad_norm:0.3156 train_time:24879ms step_avg:82.93ms +step:350/6500 train_loss:2.5270 grad_norm:0.2875 train_time:29023ms step_avg:82.92ms +step:400/6500 train_loss:2.3904 grad_norm:0.2057 train_time:33223ms step_avg:83.06ms +step:450/6500 train_loss:2.3445 grad_norm:0.2296 train_time:37361ms step_avg:83.02ms +step:500/6500 train_loss:2.3812 grad_norm:0.2048 train_time:41503ms step_avg:83.01ms +step:550/6500 train_loss:2.3155 grad_norm:0.2007 train_time:45702ms step_avg:83.09ms +step:600/6500 train_loss:2.3195 grad_norm:0.1791 train_time:49845ms step_avg:83.08ms +step:650/6500 train_loss:2.3103 grad_norm:0.1746 train_time:54046ms step_avg:83.15ms +step:700/6500 train_loss:2.3268 grad_norm:0.1511 train_time:58186ms step_avg:83.12ms +step:750/6500 train_loss:2.3125 grad_norm:0.1521 train_time:62336ms step_avg:83.11ms 
+step:800/6500 train_loss:2.2223 grad_norm:0.1654 train_time:66542ms step_avg:83.18ms +step:850/6500 train_loss:2.2136 grad_norm:0.1482 train_time:70686ms step_avg:83.16ms +step:900/6500 train_loss:2.1112 grad_norm:0.1492 train_time:74884ms step_avg:83.20ms +step:950/6500 train_loss:2.2038 grad_norm:0.1359 train_time:79035ms step_avg:83.20ms +step:1000/6500 train_loss:2.2579 grad_norm:0.1482 train_time:83183ms step_avg:83.18ms +step:1050/6500 train_loss:2.2082 grad_norm:0.1318 train_time:87392ms step_avg:83.23ms +step:1100/6500 train_loss:2.3091 grad_norm:0.1599 train_time:91542ms step_avg:83.22ms +step:1150/6500 train_loss:2.2358 grad_norm:0.1365 train_time:95750ms step_avg:83.26ms +step:1200/6500 train_loss:2.3392 grad_norm:0.1272 train_time:99901ms step_avg:83.25ms +step:1250/6500 train_loss:2.2380 grad_norm:0.1281 train_time:104058ms step_avg:83.25ms +step:1300/6500 train_loss:2.0824 grad_norm:0.1104 train_time:108275ms step_avg:83.29ms +step:1350/6500 train_loss:2.2349 grad_norm:0.1266 train_time:112426ms step_avg:83.28ms +step:1400/6500 train_loss:2.1730 grad_norm:0.1093 train_time:116641ms step_avg:83.32ms +step:1450/6500 train_loss:2.1051 grad_norm:0.0967 train_time:120789ms step_avg:83.30ms +step:1500/6500 train_loss:2.2093 grad_norm:0.0965 train_time:124938ms step_avg:83.29ms +step:1550/6500 train_loss:2.1683 grad_norm:0.0896 train_time:129151ms step_avg:83.32ms +step:1600/6500 train_loss:2.0650 grad_norm:0.0988 train_time:133311ms step_avg:83.32ms +step:1650/6500 train_loss:2.1750 grad_norm:0.0917 train_time:137469ms step_avg:83.31ms +step:1700/6500 train_loss:2.1297 grad_norm:0.0860 train_time:141678ms step_avg:83.34ms +step:1750/6500 train_loss:2.1828 grad_norm:0.0806 train_time:145835ms step_avg:83.33ms +step:1800/6500 train_loss:2.1386 grad_norm:0.1055 train_time:150044ms step_avg:83.36ms +step:1850/6500 train_loss:2.0174 grad_norm:0.1119 train_time:154193ms step_avg:83.35ms +step:1900/6500 train_loss:2.1075 grad_norm:0.0826 train_time:158345ms step_avg:83.34ms +step:1950/6500 train_loss:2.0046 grad_norm:0.0752 train_time:162561ms step_avg:83.36ms +step:2000/6500 train_loss:2.0548 grad_norm:0.0749 train_time:166715ms step_avg:83.36ms +step:2050/6500 train_loss:2.0980 grad_norm:0.0746 train_time:170936ms step_avg:83.38ms +step:2100/6500 train_loss:2.0319 grad_norm:0.0740 train_time:175087ms step_avg:83.37ms +step:2150/6500 train_loss:2.1359 grad_norm:0.0763 train_time:179243ms step_avg:83.37ms +step:2200/6500 train_loss:2.1221 grad_norm:0.1377 train_time:183459ms step_avg:83.39ms +step:2250/6500 train_loss:2.1587 grad_norm:0.0783 train_time:187611ms step_avg:83.38ms +step:2300/6500 train_loss:2.0955 grad_norm:0.0806 train_time:191822ms step_avg:83.40ms +step:2350/6500 train_loss:2.1575 grad_norm:0.0721 train_time:195977ms step_avg:83.39ms +step:2400/6500 train_loss:2.0526 grad_norm:0.0739 train_time:200128ms step_avg:83.39ms +step:2450/6500 train_loss:2.0661 grad_norm:0.0732 train_time:204340ms step_avg:83.40ms +step:2500/6500 train_loss:2.1566 grad_norm:0.1040 train_time:208499ms step_avg:83.40ms +step:2550/6500 train_loss:2.1926 grad_norm:0.0828 train_time:212714ms step_avg:83.42ms +step:2600/6500 train_loss:2.0968 grad_norm:0.0773 train_time:216864ms step_avg:83.41ms +step:2650/6500 train_loss:2.0556 grad_norm:0.0839 train_time:221016ms step_avg:83.40ms +step:2700/6500 train_loss:2.0891 grad_norm:0.0734 train_time:225232ms step_avg:83.42ms +step:2750/6500 train_loss:2.0194 grad_norm:0.0732 train_time:229393ms step_avg:83.42ms +step:2800/6500 train_loss:2.1399 
grad_norm:0.0801 train_time:233614ms step_avg:83.43ms +step:2850/6500 train_loss:2.0551 grad_norm:0.0691 train_time:237762ms step_avg:83.43ms +step:2900/6500 train_loss:2.0127 grad_norm:0.0757 train_time:241916ms step_avg:83.42ms +step:2950/6500 train_loss:2.0701 grad_norm:0.0740 train_time:246137ms step_avg:83.44ms +step:3000/6500 train_loss:2.1529 grad_norm:0.0738 train_time:250300ms step_avg:83.43ms +step:3050/6500 train_loss:2.0333 grad_norm:0.0755 train_time:254456ms step_avg:83.43ms +step:3100/6500 train_loss:2.0245 grad_norm:0.0759 train_time:258671ms step_avg:83.44ms +step:3150/6500 train_loss:1.9603 grad_norm:0.0692 train_time:262829ms step_avg:83.44ms +step:3200/6500 train_loss:2.1628 grad_norm:0.0798 train_time:267049ms step_avg:83.45ms +step:3250/6500 train_loss:2.0394 grad_norm:0.0723 train_time:271202ms step_avg:83.45ms +step:3300/6500 train_loss:2.0628 grad_norm:0.0678 train_time:275363ms step_avg:83.44ms +step:3350/6500 train_loss:2.0850 grad_norm:0.0709 train_time:279583ms step_avg:83.46ms +step:3400/6500 train_loss:2.0100 grad_norm:0.0749 train_time:283734ms step_avg:83.45ms +step:3450/6500 train_loss:2.1029 grad_norm:0.0800 train_time:287953ms step_avg:83.46ms +step:3500/6500 train_loss:2.1720 grad_norm:0.0755 train_time:292103ms step_avg:83.46ms +step:3550/6500 train_loss:1.9111 grad_norm:0.0700 train_time:296260ms step_avg:83.45ms +step:3600/6500 train_loss:2.0871 grad_norm:0.0764 train_time:300476ms step_avg:83.47ms +step:3650/6500 train_loss:1.9652 grad_norm:0.0743 train_time:304632ms step_avg:83.46ms +step:3700/6500 train_loss:2.0881 grad_norm:0.0731 train_time:308849ms step_avg:83.47ms +step:3750/6500 train_loss:1.9141 grad_norm:0.0710 train_time:313006ms step_avg:83.47ms +step:3800/6500 train_loss:2.0665 grad_norm:0.0780 train_time:317165ms step_avg:83.46ms +step:3850/6500 train_loss:2.0818 grad_norm:0.0738 train_time:321374ms step_avg:83.47ms +step:3900/6500 train_loss:2.0701 grad_norm:0.0706 train_time:325529ms step_avg:83.47ms +step:3950/6500 train_loss:2.1639 grad_norm:0.0731 train_time:329743ms step_avg:83.48ms +step:4000/6500 train_loss:1.9696 grad_norm:0.0719 train_time:333902ms step_avg:83.48ms +step:4000/6500 val_loss:2.0584 val_bpb:1.2191 train_time:333951ms step_avg:83.49ms +step:4050/6500 train_loss:2.0853 grad_norm:0.0715 train_time:338056ms step_avg:83.47ms +step:4100/6500 train_loss:2.0046 grad_norm:0.0766 train_time:342277ms step_avg:83.48ms +step:4150/6500 train_loss:2.1088 grad_norm:0.0738 train_time:346432ms step_avg:83.48ms +step:4200/6500 train_loss:2.1411 grad_norm:0.0838 train_time:350646ms step_avg:83.49ms +step:4250/6500 train_loss:2.1075 grad_norm:0.0830 train_time:354797ms step_avg:83.48ms +step:4300/6500 train_loss:2.0524 grad_norm:0.0747 train_time:358948ms step_avg:83.48ms +step:4350/6500 train_loss:2.0648 grad_norm:0.0788 train_time:363174ms step_avg:83.49ms +step:4400/6500 train_loss:2.0249 grad_norm:0.0755 train_time:367326ms step_avg:83.48ms +step:4450/6500 train_loss:2.0430 grad_norm:0.0741 train_time:371479ms step_avg:83.48ms +step:4500/6500 train_loss:2.1161 grad_norm:0.0755 train_time:375699ms step_avg:83.49ms +progressive_passes: step:4500 num_passes:2 +step:4550/6500 train_loss:2.1261 grad_norm:0.0730 train_time:381269ms step_avg:83.80ms +step:4600/6500 train_loss:1.8320 grad_norm:0.0798 train_time:386903ms step_avg:84.11ms +step:4650/6500 train_loss:2.0453 grad_norm:0.0729 train_time:392474ms step_avg:84.40ms +step:4700/6500 train_loss:2.2250 grad_norm:0.1195 train_time:398047ms step_avg:84.69ms +step:4750/6500 
train_loss:2.0101 grad_norm:0.0791 train_time:403682ms step_avg:84.99ms +step:4800/6500 train_loss:2.4105 grad_norm:0.1550 train_time:409252ms step_avg:85.26ms +step:4850/6500 train_loss:2.0919 grad_norm:0.0808 train_time:414888ms step_avg:85.54ms +step:4900/6500 train_loss:2.0343 grad_norm:0.0757 train_time:420457ms step_avg:85.81ms +step:4950/6500 train_loss:2.0808 grad_norm:0.0813 train_time:426024ms step_avg:86.07ms +step:5000/6500 train_loss:2.0848 grad_norm:0.0733 train_time:431652ms step_avg:86.33ms +step:5050/6500 train_loss:2.0497 grad_norm:0.0871 train_time:437218ms step_avg:86.58ms +step:5100/6500 train_loss:2.1074 grad_norm:0.0788 train_time:442845ms step_avg:86.83ms +step:5150/6500 train_loss:2.0053 grad_norm:0.0777 train_time:448413ms step_avg:87.07ms +step:5200/6500 train_loss:2.0183 grad_norm:0.0751 train_time:453990ms step_avg:87.31ms +step:5250/6500 train_loss:2.0446 grad_norm:0.0715 train_time:459618ms step_avg:87.55ms +step:5300/6500 train_loss:1.9820 grad_norm:0.0760 train_time:465187ms step_avg:87.77ms +step:5350/6500 train_loss:1.8966 grad_norm:0.0798 train_time:470817ms step_avg:88.00ms +step:5400/6500 train_loss:2.0190 grad_norm:0.0775 train_time:476388ms step_avg:88.22ms +step:5450/6500 train_loss:2.0444 grad_norm:0.0796 train_time:481954ms step_avg:88.43ms +step:5500/6500 train_loss:1.9880 grad_norm:0.0786 train_time:487580ms step_avg:88.65ms +progressive_passes: step:5500 num_passes:3 +step:5550/6500 train_loss:1.9738 grad_norm:0.0820 train_time:494245ms step_avg:89.05ms +step:5600/6500 train_loss:1.9185 grad_norm:0.0789 train_time:500970ms step_avg:89.46ms +step:5650/6500 train_loss:2.0222 grad_norm:0.0827 train_time:507635ms step_avg:89.85ms +step:5700/6500 train_loss:1.9775 grad_norm:0.0862 train_time:514295ms step_avg:90.23ms +step:5750/6500 train_loss:2.0524 grad_norm:0.0933 train_time:521020ms step_avg:90.61ms +step:5800/6500 train_loss:1.9523 grad_norm:0.0860 train_time:528938ms step_avg:91.20ms +step:5850/6500 train_loss:2.0892 grad_norm:0.0870 train_time:535657ms step_avg:91.57ms +swa:start step:5900 +step:5900/6500 train_loss:1.8599 grad_norm:0.0804 train_time:542320ms step_avg:91.92ms +step:5950/6500 train_loss:1.9200 grad_norm:0.0773 train_time:549081ms step_avg:92.28ms +late_qat:enabled step:5968 scale:0.1496 core_quant:on +step:6000/6500 train_loss:1.9028 grad_norm:0.0843 train_time:555878ms step_avg:92.65ms +step:6050/6500 train_loss:1.9311 grad_norm:0.0847 train_time:562586ms step_avg:92.99ms +step:6100/6500 train_loss:1.8777 grad_norm:0.0861 train_time:569299ms step_avg:93.33ms +step:6150/6500 train_loss:1.9797 grad_norm:0.0873 train_time:576065ms step_avg:93.67ms +step:6200/6500 train_loss:1.9020 grad_norm:0.0877 train_time:582782ms step_avg:94.00ms +step:6250/6500 train_loss:2.0230 grad_norm:0.0927 train_time:589555ms step_avg:94.33ms +step:6300/6500 train_loss:1.9017 grad_norm:0.0820 train_time:596266ms step_avg:94.65ms +step:6328/6500 val_loss:1.9200 val_bpb:1.1371 train_time:600113ms step_avg:94.83ms +stopping_early: wallclock_cap train_time:600113ms step:6328/6500 +peak memory allocated: 34074 MiB reserved: 34084 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:1.9168 val_bpb:1.1353 eval_time:2969ms +Serialized model: 105478842 bytes +Code size: 88253 bytes +eval_override: num_passes 1 -> 3 +Serialized model int6+lzma: 15850832 bytes +Total submission size int6+lzma: 15939085 bytes +eval_feedback: loaded from artifact, params=2560 +ttt_sliding:start chunks=1893 chunk_tokens=32768 total_windows=969088 stride=64 ttt_lr=0.002 
ttt_epochs=3 freeze_blocks=0 +ttt_sliding:params unfrozen=26700895 frozen=0 + ttt_chunk [1/1893] bpb=1.152954 time=0.5s + ttt_chunk [11/1893] bpb=1.141819 time=3.5s + ttt_chunk [21/1893] bpb=1.126763 time=6.5s + ttt_chunk [31/1893] bpb=1.125066 time=9.4s + ttt_chunk [41/1893] bpb=1.111860 time=12.4s + ttt_chunk [51/1893] bpb=1.106199 time=15.4s + ttt_chunk [61/1893] bpb=1.112955 time=18.3s + ttt_chunk [71/1893] bpb=1.111630 time=21.2s + ttt_chunk [81/1893] bpb=1.110738 time=24.2s + ttt_chunk [91/1893] bpb=1.111503 time=27.2s + ttt_chunk [101/1893] bpb=1.114974 time=30.1s + ttt_chunk [111/1893] bpb=1.117455 time=33.1s + ttt_chunk [121/1893] bpb=1.111134 time=37.2s + ttt_chunk [131/1893] bpb=1.111261 time=40.1s + ttt_chunk [141/1893] bpb=1.116930 time=43.1s + ttt_chunk [151/1893] bpb=1.118842 time=46.1s + ttt_chunk [161/1893] bpb=1.118422 time=49.0s + ttt_chunk [171/1893] bpb=1.122751 time=52.0s + ttt_chunk [181/1893] bpb=1.124991 time=54.9s + ttt_chunk [191/1893] bpb=1.132251 time=57.9s + ttt_chunk [201/1893] bpb=1.131038 time=60.9s + ttt_chunk [211/1893] bpb=1.128836 time=63.9s + ttt_chunk [221/1893] bpb=1.130341 time=66.8s + ttt_chunk [231/1893] bpb=1.128959 time=69.8s + ttt_chunk [241/1893] bpb=1.129200 time=72.7s + ttt_chunk [251/1893] bpb=1.128760 time=75.7s + ttt_chunk [261/1893] bpb=1.125903 time=78.6s + ttt_chunk [271/1893] bpb=1.124748 time=81.6s + ttt_chunk [281/1893] bpb=1.126141 time=84.6s + ttt_chunk [291/1893] bpb=1.127927 time=87.5s + ttt_chunk [301/1893] bpb=1.128620 time=90.5s + ttt_chunk [311/1893] bpb=1.130720 time=93.5s + ttt_chunk [321/1893] bpb=1.132607 time=96.4s + ttt_chunk [331/1893] bpb=1.132470 time=99.4s + ttt_chunk [341/1893] bpb=1.131561 time=102.4s + ttt_chunk [351/1893] bpb=1.133871 time=105.3s + ttt_chunk [361/1893] bpb=1.134134 time=108.3s + ttt_chunk [371/1893] bpb=1.133490 time=111.2s + ttt_chunk [381/1893] bpb=1.133694 time=114.2s + ttt_chunk [391/1893] bpb=1.133479 time=117.2s + ttt_chunk [401/1893] bpb=1.131444 time=120.1s + ttt_chunk [411/1893] bpb=1.130286 time=123.1s + ttt_chunk [421/1893] bpb=1.129405 time=126.0s + ttt_chunk [431/1893] bpb=1.129247 time=129.0s + ttt_chunk [441/1893] bpb=1.129642 time=132.0s + ttt_chunk [451/1893] bpb=1.129939 time=134.9s + ttt_chunk [461/1893] bpb=1.128810 time=137.9s + ttt_chunk [471/1893] bpb=1.129386 time=140.9s + ttt_chunk [481/1893] bpb=1.129022 time=143.8s + ttt_chunk [491/1893] bpb=1.127956 time=146.8s + ttt_chunk [501/1893] bpb=1.127475 time=149.7s + ttt_chunk [511/1893] bpb=1.126774 time=152.7s + ttt_chunk [521/1893] bpb=1.124527 time=155.7s + ttt_chunk [531/1893] bpb=1.125727 time=158.6s + ttt_chunk [541/1893] bpb=1.126064 time=161.6s + ttt_chunk [551/1893] bpb=1.125011 time=164.5s + ttt_chunk [561/1893] bpb=1.125537 time=167.5s + ttt_chunk [571/1893] bpb=1.124514 time=170.4s + ttt_chunk [581/1893] bpb=1.123709 time=173.4s + ttt_chunk [591/1893] bpb=1.123103 time=176.4s + ttt_chunk [601/1893] bpb=1.123596 time=179.3s + ttt_chunk [611/1893] bpb=1.123535 time=182.3s + ttt_chunk [621/1893] bpb=1.123429 time=185.3s + ttt_chunk [631/1893] bpb=1.124181 time=188.2s + ttt_chunk [641/1893] bpb=1.123921 time=191.2s + ttt_chunk [651/1893] bpb=1.123995 time=194.2s + ttt_chunk [661/1893] bpb=1.123446 time=197.1s + ttt_chunk [671/1893] bpb=1.123794 time=200.1s + ttt_chunk [681/1893] bpb=1.124511 time=203.1s + ttt_chunk [691/1893] bpb=1.125507 time=206.0s + ttt_chunk [701/1893] bpb=1.124950 time=209.0s + ttt_chunk [711/1893] bpb=1.124905 time=211.9s + ttt_chunk [721/1893] bpb=1.124560 time=214.9s + ttt_chunk [731/1893] 
bpb=1.124625 time=217.9s + ttt_chunk [741/1893] bpb=1.124698 time=221.7s + ttt_chunk [751/1893] bpb=1.124579 time=225.4s + ttt_chunk [761/1893] bpb=1.124500 time=228.4s + ttt_chunk [771/1893] bpb=1.124177 time=231.3s + ttt_chunk [781/1893] bpb=1.124888 time=234.2s + ttt_chunk [791/1893] bpb=1.124512 time=237.2s + ttt_chunk [801/1893] bpb=1.124853 time=240.1s + ttt_chunk [811/1893] bpb=1.124605 time=243.1s + ttt_chunk [821/1893] bpb=1.124387 time=246.9s + ttt_chunk [831/1893] bpb=1.124196 time=249.8s + ttt_chunk [841/1893] bpb=1.123552 time=252.8s + ttt_chunk [851/1893] bpb=1.123263 time=255.7s + ttt_chunk [861/1893] bpb=1.123022 time=259.5s + ttt_chunk [871/1893] bpb=1.123296 time=262.4s + ttt_chunk [881/1893] bpb=1.123489 time=265.4s + ttt_chunk [891/1893] bpb=1.123064 time=268.3s + ttt_chunk [901/1893] bpb=1.122801 time=271.3s + ttt_chunk [911/1893] bpb=1.122916 time=274.2s + ttt_chunk [921/1893] bpb=1.123400 time=277.2s + ttt_chunk [931/1893] bpb=1.123364 time=280.1s + ttt_chunk [941/1893] bpb=1.123043 time=283.0s + ttt_chunk [951/1893] bpb=1.123428 time=286.0s + ttt_chunk [961/1893] bpb=1.123504 time=288.9s + ttt_chunk [971/1893] bpb=1.124333 time=291.9s + ttt_chunk [981/1893] bpb=1.124425 time=294.8s + ttt_chunk [991/1893] bpb=1.124452 time=297.7s + ttt_chunk [1001/1893] bpb=1.124389 time=300.7s + ttt_chunk [1011/1893] bpb=1.124163 time=303.6s + ttt_chunk [1021/1893] bpb=1.124501 time=306.6s + ttt_chunk [1031/1893] bpb=1.124946 time=309.5s + ttt_chunk [1041/1893] bpb=1.124610 time=312.4s + ttt_chunk [1051/1893] bpb=1.124364 time=315.4s + ttt_chunk [1061/1893] bpb=1.124409 time=318.3s + ttt_chunk [1071/1893] bpb=1.124986 time=321.3s + ttt_chunk [1081/1893] bpb=1.125240 time=324.2s + ttt_chunk [1091/1893] bpb=1.125996 time=327.2s + ttt_chunk [1101/1893] bpb=1.126011 time=330.1s + ttt_chunk [1111/1893] bpb=1.125871 time=333.1s + ttt_chunk [1121/1893] bpb=1.125666 time=336.0s + ttt_chunk [1131/1893] bpb=1.125557 time=339.0s + ttt_chunk [1141/1893] bpb=1.125269 time=342.8s + ttt_chunk [1151/1893] bpb=1.125279 time=345.8s + ttt_chunk [1161/1893] bpb=1.124897 time=348.7s + ttt_chunk [1171/1893] bpb=1.125181 time=351.7s + ttt_chunk [1181/1893] bpb=1.124422 time=354.6s + ttt_chunk [1191/1893] bpb=1.124309 time=357.5s + ttt_chunk [1201/1893] bpb=1.124724 time=360.5s + ttt_chunk [1211/1893] bpb=1.124261 time=363.4s + ttt_chunk [1221/1893] bpb=1.123951 time=366.4s + ttt_chunk [1231/1893] bpb=1.123660 time=369.3s + ttt_chunk [1241/1893] bpb=1.123320 time=372.3s + ttt_chunk [1251/1893] bpb=1.122737 time=375.3s + ttt_chunk [1261/1893] bpb=1.122718 time=378.2s + ttt_chunk [1271/1893] bpb=1.122354 time=381.2s + ttt_chunk [1281/1893] bpb=1.122155 time=384.1s + ttt_chunk [1291/1893] bpb=1.121913 time=387.1s + ttt_chunk [1301/1893] bpb=1.121328 time=390.1s + ttt_chunk [1311/1893] bpb=1.120942 time=393.0s + ttt_chunk [1321/1893] bpb=1.120628 time=396.0s + ttt_chunk [1331/1893] bpb=1.120549 time=398.9s + ttt_chunk [1341/1893] bpb=1.120410 time=401.9s + ttt_chunk [1351/1893] bpb=1.120330 time=404.8s + ttt_chunk [1361/1893] bpb=1.120381 time=407.8s + ttt_chunk [1371/1893] bpb=1.120255 time=410.7s + ttt_chunk [1381/1893] bpb=1.120227 time=413.7s + ttt_chunk [1391/1893] bpb=1.119838 time=416.6s + ttt_chunk [1401/1893] bpb=1.119801 time=419.6s + ttt_chunk [1411/1893] bpb=1.119912 time=422.5s + ttt_chunk [1421/1893] bpb=1.120166 time=425.5s + ttt_chunk [1431/1893] bpb=1.119882 time=428.4s + ttt_chunk [1441/1893] bpb=1.120376 time=431.4s + ttt_chunk [1451/1893] bpb=1.120698 time=434.3s + ttt_chunk [1461/1893] 
bpb=1.120238 time=437.2s
+ ttt_chunk [1471/1893] bpb=1.121294 time=440.2s
+ ttt_chunk [1481/1893] bpb=1.120835 time=443.1s
+ ttt_chunk [1491/1893] bpb=1.120648 time=446.1s
+ ttt_chunk [1501/1893] bpb=1.120564 time=449.0s
+ ttt_chunk [1511/1893] bpb=1.120603 time=452.0s
+ ttt_chunk [1521/1893] bpb=1.120635 time=454.9s
+ ttt_chunk [1531/1893] bpb=1.120111 time=457.9s
+ ttt_chunk [1541/1893] bpb=1.119976 time=460.8s
+ ttt_chunk [1551/1893] bpb=1.120291 time=463.8s
+ ttt_chunk [1561/1893] bpb=1.120305 time=466.7s
+ ttt_chunk [1571/1893] bpb=1.120150 time=469.7s
+ ttt_chunk [1581/1893] bpb=1.120269 time=472.6s
+ ttt_chunk [1591/1893] bpb=1.120126 time=475.6s
+ ttt_chunk [1601/1893] bpb=1.120311 time=478.6s
+ ttt_chunk [1611/1893] bpb=1.120251 time=481.5s
+ ttt_chunk [1621/1893] bpb=1.119847 time=484.5s
+ ttt_chunk [1631/1893] bpb=1.120159 time=487.4s
+ ttt_chunk [1641/1893] bpb=1.120168 time=490.4s
+ ttt_chunk [1651/1893] bpb=1.120126 time=493.3s
+ ttt_chunk [1661/1893] bpb=1.120008 time=497.0s
+ ttt_chunk [1671/1893] bpb=1.120476 time=500.0s
+ ttt_chunk [1681/1893] bpb=1.120639 time=502.9s
+ ttt_chunk [1691/1893] bpb=1.120481 time=505.9s
+ ttt_chunk [1701/1893] bpb=1.120642 time=508.9s
+ ttt_chunk [1711/1893] bpb=1.120626 time=511.8s
+ ttt_chunk [1721/1893] bpb=1.120632 time=514.8s
+ ttt_chunk [1731/1893] bpb=1.120504 time=517.7s
+ ttt_chunk [1741/1893] bpb=1.120304 time=520.7s
+ ttt_chunk [1751/1893] bpb=1.120143 time=523.7s
+ ttt_chunk [1761/1893] bpb=1.120286 time=526.6s
+ ttt_chunk [1771/1893] bpb=1.120187 time=529.6s
+ ttt_chunk [1781/1893] bpb=1.120210 time=532.5s
+ ttt_chunk [1791/1893] bpb=1.119813 time=535.5s
+ ttt_chunk [1801/1893] bpb=1.119678 time=538.4s
+ ttt_chunk [1811/1893] bpb=1.119594 time=541.4s
+ ttt_chunk [1821/1893] bpb=1.119651 time=544.4s
+ ttt_chunk [1831/1893] bpb=1.119053 time=547.3s
+ ttt_chunk [1841/1893] bpb=1.119075 time=550.3s
+ ttt_chunk [1851/1893] bpb=1.118874 time=553.3s
+ ttt_chunk [1861/1893] bpb=1.118513 time=556.2s
+ ttt_chunk [1871/1893] bpb=1.118501 time=559.3s
+ ttt_chunk [1881/1893] bpb=1.118056 time=562.2s
+ ttt_chunk [1891/1893] bpb=1.117825 time=565.2s
+ ttt_chunk [1893/1893] bpb=1.117869 time=565.6s
+ttt_sliding:done val_loss=1.883755 val_bpb=1.115669 elapsed=565.6s
+legal_ttt val_loss:1.8838 val_bpb:1.1157 eval_time:566013ms
+legal_ttt_exact val_loss:1.88375543 val_bpb:1.11566902
diff --git a/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/train_seed2025.log b/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/train_seed2025.log
new file mode 100644
index 0000000000..c825b25919
--- /dev/null
+++ b/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/train_seed2025.log
@@ -0,0 +1,387 @@
+W0401 18:14:56.025000 191441 torch/distributed/run.py:851]
+W0401 18:14:56.025000 191441 torch/distributed/run.py:851] *****************************************
+W0401 18:14:56.025000 191441 torch/distributed/run.py:851] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
+W0401 18:14:56.025000 191441 torch/distributed/run.py:851] ***************************************** +logs/bigram_ve_wd3500_3pass.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:80 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=4 core_end=7 num_passes=1 max_passes=3 stem=4 core=3 tail=4 schedule=[(0, 1), (4500, 2), (5500, 3)] +model_params:26698335 +XSA:last_4 active_layers:[7, 8, 9, 10] +world_size:8 grad_accum_steps:1 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:6500 warmup_steps:20 max_wallclock_seconds:600.000 +seed:2025 +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/6500 val_loss:6.9300 val_bpb:4.1044 train_time:0ms step_avg:0.01ms +step:1/6500 train_loss:6.9310 grad_norm:0.3872 train_time:128ms step_avg:127.73ms +step:2/6500 train_loss:8.5729 grad_norm:3.7965 train_time:162ms step_avg:81.13ms +step:3/6500 train_loss:7.7014 grad_norm:2.0579 train_time:243ms step_avg:80.95ms +step:4/6500 train_loss:7.1616 grad_norm:1.1957 train_time:324ms step_avg:81.12ms +step:5/6500 train_loss:6.9726 grad_norm:1.3804 train_time:406ms step_avg:81.15ms +step:6/6500 train_loss:6.9302 grad_norm:1.1740 train_time:487ms step_avg:81.22ms +step:7/6500 train_loss:6.8503 grad_norm:1.3501 train_time:568ms step_avg:81.16ms +step:8/6500 train_loss:6.7528 grad_norm:1.1604 train_time:649ms step_avg:81.13ms +step:9/6500 train_loss:6.4660 grad_norm:0.9217 train_time:730ms step_avg:81.11ms +step:10/6500 train_loss:6.1355 grad_norm:1.3881 train_time:811ms step_avg:81.12ms +step:50/6500 train_loss:3.7789 grad_norm:0.7948 train_time:4100ms step_avg:81.99ms +step:100/6500 train_loss:3.1790 grad_norm:0.5920 train_time:8228ms step_avg:82.28ms +step:150/6500 train_loss:2.8888 grad_norm:0.5019 train_time:12419ms step_avg:82.79ms +step:200/6500 train_loss:2.3854 grad_norm:0.4094 train_time:16561ms step_avg:82.80ms +step:250/6500 train_loss:2.4722 grad_norm:0.3397 train_time:20702ms step_avg:82.81ms +step:300/6500 train_loss:2.5404 grad_norm:0.2761 train_time:24912ms step_avg:83.04ms +step:350/6500 train_loss:2.5242 grad_norm:0.2486 train_time:29054ms step_avg:83.01ms +step:400/6500 train_loss:2.3962 grad_norm:0.2509 train_time:33255ms step_avg:83.14ms +step:450/6500 train_loss:2.3545 grad_norm:0.2513 train_time:37398ms step_avg:83.11ms +step:500/6500 train_loss:2.3796 grad_norm:0.2000 train_time:41540ms step_avg:83.08ms +step:550/6500 train_loss:2.3167 grad_norm:0.2005 train_time:45748ms step_avg:83.18ms +step:600/6500 train_loss:2.3201 grad_norm:0.2006 train_time:49897ms step_avg:83.16ms +step:650/6500 train_loss:2.3102 grad_norm:0.1738 train_time:54099ms step_avg:83.23ms +step:700/6500 train_loss:2.3297 grad_norm:0.1634 train_time:58239ms step_avg:83.20ms +step:750/6500 train_loss:2.3176 grad_norm:0.1924 train_time:62388ms step_avg:83.18ms 
+step:800/6500 train_loss:2.2269 grad_norm:0.1741 train_time:66598ms step_avg:83.25ms +step:850/6500 train_loss:2.2177 grad_norm:0.1652 train_time:70737ms step_avg:83.22ms +step:900/6500 train_loss:2.1087 grad_norm:0.1458 train_time:74951ms step_avg:83.28ms +step:950/6500 train_loss:2.2081 grad_norm:0.1481 train_time:79104ms step_avg:83.27ms +step:1000/6500 train_loss:2.2609 grad_norm:0.1369 train_time:83253ms step_avg:83.25ms +step:1050/6500 train_loss:2.2108 grad_norm:0.1563 train_time:87470ms step_avg:83.30ms +step:1100/6500 train_loss:2.3090 grad_norm:0.1307 train_time:91612ms step_avg:83.28ms +step:1150/6500 train_loss:2.2338 grad_norm:0.1242 train_time:95814ms step_avg:83.32ms +step:1200/6500 train_loss:2.3379 grad_norm:0.1267 train_time:99964ms step_avg:83.30ms +step:1250/6500 train_loss:2.2385 grad_norm:0.1357 train_time:104106ms step_avg:83.29ms +step:1300/6500 train_loss:2.0882 grad_norm:0.1181 train_time:108319ms step_avg:83.32ms +step:1350/6500 train_loss:2.2403 grad_norm:0.1335 train_time:112466ms step_avg:83.31ms +step:1400/6500 train_loss:2.1715 grad_norm:0.1002 train_time:116669ms step_avg:83.34ms +step:1450/6500 train_loss:2.1064 grad_norm:0.1013 train_time:120817ms step_avg:83.32ms +step:1500/6500 train_loss:2.2087 grad_norm:0.1064 train_time:124963ms step_avg:83.31ms +step:1550/6500 train_loss:2.1723 grad_norm:0.0932 train_time:129174ms step_avg:83.34ms +step:1600/6500 train_loss:2.0630 grad_norm:0.0914 train_time:133323ms step_avg:83.33ms +step:1650/6500 train_loss:2.1777 grad_norm:0.0843 train_time:137474ms step_avg:83.32ms +step:1700/6500 train_loss:2.1294 grad_norm:0.0812 train_time:141693ms step_avg:83.35ms +step:1750/6500 train_loss:2.1849 grad_norm:0.0830 train_time:145855ms step_avg:83.35ms +step:1800/6500 train_loss:2.1436 grad_norm:0.1140 train_time:150071ms step_avg:83.37ms +step:1850/6500 train_loss:2.0157 grad_norm:0.0837 train_time:154220ms step_avg:83.36ms +step:1900/6500 train_loss:2.1111 grad_norm:0.0802 train_time:158375ms step_avg:83.36ms +step:1950/6500 train_loss:2.0071 grad_norm:0.0732 train_time:162600ms step_avg:83.38ms +step:2000/6500 train_loss:2.0536 grad_norm:0.0741 train_time:166756ms step_avg:83.38ms +step:2050/6500 train_loss:2.1021 grad_norm:0.0812 train_time:170980ms step_avg:83.40ms +step:2100/6500 train_loss:2.0340 grad_norm:0.0728 train_time:175134ms step_avg:83.40ms +step:2150/6500 train_loss:2.1384 grad_norm:0.0759 train_time:179289ms step_avg:83.39ms +step:2200/6500 train_loss:2.1237 grad_norm:0.1112 train_time:183511ms step_avg:83.41ms +step:2250/6500 train_loss:2.1585 grad_norm:0.0762 train_time:187667ms step_avg:83.41ms +step:2300/6500 train_loss:2.0963 grad_norm:0.0734 train_time:191892ms step_avg:83.43ms +step:2350/6500 train_loss:2.1587 grad_norm:0.0718 train_time:196043ms step_avg:83.42ms +step:2400/6500 train_loss:2.0555 grad_norm:0.0726 train_time:200201ms step_avg:83.42ms +step:2450/6500 train_loss:2.0712 grad_norm:0.0756 train_time:204419ms step_avg:83.44ms +step:2500/6500 train_loss:2.1573 grad_norm:0.0919 train_time:208575ms step_avg:83.43ms +step:2550/6500 train_loss:2.1984 grad_norm:0.0866 train_time:212796ms step_avg:83.45ms +step:2600/6500 train_loss:2.0966 grad_norm:0.0777 train_time:216958ms step_avg:83.45ms +step:2650/6500 train_loss:2.0599 grad_norm:0.0825 train_time:221118ms step_avg:83.44ms +step:2700/6500 train_loss:2.0877 grad_norm:0.0707 train_time:225342ms step_avg:83.46ms +step:2750/6500 train_loss:2.0203 grad_norm:0.0764 train_time:229499ms step_avg:83.45ms +step:2800/6500 train_loss:2.1453 
grad_norm:0.0883 train_time:233725ms step_avg:83.47ms +step:2850/6500 train_loss:2.0545 grad_norm:0.0708 train_time:237877ms step_avg:83.47ms +step:2900/6500 train_loss:2.0150 grad_norm:0.0708 train_time:242036ms step_avg:83.46ms +step:2950/6500 train_loss:2.0711 grad_norm:0.0771 train_time:246257ms step_avg:83.48ms +step:3000/6500 train_loss:2.1547 grad_norm:0.0798 train_time:250411ms step_avg:83.47ms +step:3050/6500 train_loss:2.0322 grad_norm:0.0736 train_time:254570ms step_avg:83.47ms +step:3100/6500 train_loss:2.0229 grad_norm:0.0768 train_time:258787ms step_avg:83.48ms +step:3150/6500 train_loss:1.9607 grad_norm:0.0770 train_time:262937ms step_avg:83.47ms +step:3200/6500 train_loss:2.1643 grad_norm:0.0762 train_time:267158ms step_avg:83.49ms +step:3250/6500 train_loss:2.0406 grad_norm:0.0700 train_time:271320ms step_avg:83.48ms +step:3300/6500 train_loss:2.0632 grad_norm:0.0767 train_time:275481ms step_avg:83.48ms +step:3350/6500 train_loss:2.0895 grad_norm:0.0714 train_time:279706ms step_avg:83.49ms +step:3400/6500 train_loss:2.0162 grad_norm:0.0769 train_time:283866ms step_avg:83.49ms +step:3450/6500 train_loss:2.1041 grad_norm:0.0823 train_time:288086ms step_avg:83.50ms +step:3500/6500 train_loss:2.1725 grad_norm:0.0725 train_time:292241ms step_avg:83.50ms +step:3550/6500 train_loss:1.9173 grad_norm:0.0723 train_time:296398ms step_avg:83.49ms +step:3600/6500 train_loss:2.0913 grad_norm:0.0857 train_time:300633ms step_avg:83.51ms +step:3650/6500 train_loss:1.9646 grad_norm:0.0721 train_time:304792ms step_avg:83.50ms +step:3700/6500 train_loss:2.0906 grad_norm:0.0733 train_time:309024ms step_avg:83.52ms +step:3750/6500 train_loss:1.9146 grad_norm:0.0699 train_time:313189ms step_avg:83.52ms +step:3800/6500 train_loss:2.0682 grad_norm:0.0742 train_time:317342ms step_avg:83.51ms +step:3850/6500 train_loss:2.0838 grad_norm:0.0797 train_time:321569ms step_avg:83.52ms +step:3900/6500 train_loss:2.0688 grad_norm:0.0756 train_time:325727ms step_avg:83.52ms +step:3950/6500 train_loss:2.1688 grad_norm:0.0756 train_time:329952ms step_avg:83.53ms +step:4000/6500 train_loss:1.9681 grad_norm:0.0761 train_time:334116ms step_avg:83.53ms +step:4000/6500 val_loss:2.0597 val_bpb:1.2199 train_time:334165ms step_avg:83.54ms +step:4050/6500 train_loss:2.0878 grad_norm:0.0717 train_time:338276ms step_avg:83.53ms +step:4100/6500 train_loss:2.0089 grad_norm:0.0768 train_time:342492ms step_avg:83.53ms +step:4150/6500 train_loss:2.1031 grad_norm:0.0722 train_time:346653ms step_avg:83.53ms +step:4200/6500 train_loss:2.1474 grad_norm:0.0806 train_time:350873ms step_avg:83.54ms +step:4250/6500 train_loss:2.1078 grad_norm:0.0773 train_time:355034ms step_avg:83.54ms +step:4300/6500 train_loss:2.0497 grad_norm:0.0714 train_time:359189ms step_avg:83.53ms +step:4350/6500 train_loss:2.0626 grad_norm:0.0769 train_time:363402ms step_avg:83.54ms +step:4400/6500 train_loss:2.0247 grad_norm:0.0802 train_time:367559ms step_avg:83.54ms +step:4450/6500 train_loss:2.0398 grad_norm:0.0715 train_time:371710ms step_avg:83.53ms +step:4500/6500 train_loss:2.1202 grad_norm:0.0758 train_time:375934ms step_avg:83.54ms +progressive_passes: step:4500 num_passes:2 +step:4550/6500 train_loss:2.1214 grad_norm:0.0736 train_time:381502ms step_avg:83.85ms +step:4600/6500 train_loss:1.8320 grad_norm:0.0857 train_time:387135ms step_avg:84.16ms +step:4650/6500 train_loss:2.0429 grad_norm:0.0732 train_time:392703ms step_avg:84.45ms +step:4700/6500 train_loss:2.2223 grad_norm:0.1153 train_time:398274ms step_avg:84.74ms +step:4750/6500 
train_loss:2.0115 grad_norm:0.0777 train_time:403912ms step_avg:85.03ms +step:4800/6500 train_loss:2.4075 grad_norm:0.1478 train_time:409482ms step_avg:85.31ms +step:4850/6500 train_loss:2.0903 grad_norm:0.0810 train_time:415124ms step_avg:85.59ms +step:4900/6500 train_loss:2.0334 grad_norm:0.0771 train_time:420697ms step_avg:85.86ms +step:4950/6500 train_loss:2.0793 grad_norm:0.0849 train_time:426276ms step_avg:86.12ms +step:5000/6500 train_loss:2.0823 grad_norm:0.0777 train_time:431912ms step_avg:86.38ms +step:5050/6500 train_loss:2.0483 grad_norm:0.0879 train_time:437486ms step_avg:86.63ms +step:5100/6500 train_loss:2.1072 grad_norm:0.0761 train_time:443134ms step_avg:86.89ms +step:5150/6500 train_loss:2.0028 grad_norm:0.0791 train_time:448705ms step_avg:87.13ms +step:5200/6500 train_loss:2.0176 grad_norm:0.0750 train_time:454285ms step_avg:87.36ms +step:5250/6500 train_loss:2.0460 grad_norm:0.0718 train_time:459925ms step_avg:87.60ms +step:5300/6500 train_loss:1.9792 grad_norm:0.0751 train_time:465491ms step_avg:87.83ms +step:5350/6500 train_loss:1.8959 grad_norm:0.0778 train_time:471138ms step_avg:88.06ms +step:5400/6500 train_loss:2.0199 grad_norm:0.0772 train_time:476710ms step_avg:88.28ms +step:5450/6500 train_loss:2.0468 grad_norm:0.0773 train_time:482281ms step_avg:88.49ms +step:5500/6500 train_loss:1.9886 grad_norm:0.0815 train_time:487909ms step_avg:88.71ms +progressive_passes: step:5500 num_passes:3 +step:5550/6500 train_loss:1.9760 grad_norm:0.0816 train_time:494576ms step_avg:89.11ms +step:5600/6500 train_loss:1.9213 grad_norm:0.0796 train_time:501307ms step_avg:89.52ms +step:5650/6500 train_loss:2.0222 grad_norm:0.0810 train_time:507972ms step_avg:89.91ms +step:5700/6500 train_loss:1.9784 grad_norm:0.0853 train_time:514641ms step_avg:90.29ms +step:5750/6500 train_loss:2.0558 grad_norm:0.0940 train_time:521373ms step_avg:90.67ms +step:5800/6500 train_loss:1.9551 grad_norm:0.0925 train_time:528043ms step_avg:91.04ms +step:5850/6500 train_loss:2.0869 grad_norm:0.0836 train_time:534780ms step_avg:91.42ms +swa:start step:5900 +step:5900/6500 train_loss:1.8584 grad_norm:0.0822 train_time:541437ms step_avg:91.77ms +step:5950/6500 train_loss:1.9194 grad_norm:0.0798 train_time:548208ms step_avg:92.14ms +late_qat:enabled step:5974 scale:0.1500 core_quant:on +step:6000/6500 train_loss:1.9026 grad_norm:0.0840 train_time:555011ms step_avg:92.50ms +step:6050/6500 train_loss:1.9272 grad_norm:0.0839 train_time:561738ms step_avg:92.85ms +step:6100/6500 train_loss:1.8776 grad_norm:0.0841 train_time:568450ms step_avg:93.19ms +step:6150/6500 train_loss:1.9784 grad_norm:0.0836 train_time:575232ms step_avg:93.53ms +step:6200/6500 train_loss:1.9036 grad_norm:0.0847 train_time:581961ms step_avg:93.86ms +step:6250/6500 train_loss:2.0241 grad_norm:0.0935 train_time:588767ms step_avg:94.20ms +step:6300/6500 train_loss:1.9003 grad_norm:0.0837 train_time:595474ms step_avg:94.52ms +step:6334/6500 val_loss:1.9198 val_bpb:1.1370 train_time:600111ms step_avg:94.74ms +stopping_early: wallclock_cap train_time:600111ms step:6334/6500 +peak memory allocated: 34074 MiB reserved: 34084 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:1.9165 val_bpb:1.1351 eval_time:2972ms +Serialized model: 105478842 bytes +Code size: 88253 bytes +eval_override: num_passes 1 -> 3 +Serialized model int6+lzma: 15937372 bytes +Total submission size int6+lzma: 16025625 bytes +eval_feedback: loaded from artifact, params=2560 +ttt_sliding:start chunks=1893 chunk_tokens=32768 total_windows=969088 stride=64 ttt_lr=0.002 
ttt_epochs=3 freeze_blocks=0 +ttt_sliding:params unfrozen=26700895 frozen=0 + ttt_chunk [1/1893] bpb=1.159074 time=0.5s + ttt_chunk [11/1893] bpb=1.143360 time=3.6s + ttt_chunk [21/1893] bpb=1.126563 time=6.7s + ttt_chunk [31/1893] bpb=1.124558 time=9.8s + ttt_chunk [41/1893] bpb=1.111652 time=12.9s + ttt_chunk [51/1893] bpb=1.106247 time=16.0s + ttt_chunk [61/1893] bpb=1.112587 time=19.0s + ttt_chunk [71/1893] bpb=1.111006 time=22.1s + ttt_chunk [81/1893] bpb=1.110399 time=25.2s + ttt_chunk [91/1893] bpb=1.111234 time=28.3s + ttt_chunk [101/1893] bpb=1.114763 time=31.4s + ttt_chunk [111/1893] bpb=1.117112 time=34.5s + ttt_chunk [121/1893] bpb=1.110626 time=37.6s + ttt_chunk [131/1893] bpb=1.110903 time=40.7s + ttt_chunk [141/1893] bpb=1.116692 time=43.7s + ttt_chunk [151/1893] bpb=1.118475 time=46.8s + ttt_chunk [161/1893] bpb=1.118025 time=49.9s + ttt_chunk [171/1893] bpb=1.122484 time=53.0s + ttt_chunk [181/1893] bpb=1.124696 time=56.0s + ttt_chunk [191/1893] bpb=1.131805 time=59.1s + ttt_chunk [201/1893] bpb=1.130526 time=62.2s + ttt_chunk [211/1893] bpb=1.128334 time=65.3s + ttt_chunk [221/1893] bpb=1.129947 time=68.4s + ttt_chunk [231/1893] bpb=1.128611 time=71.5s + ttt_chunk [241/1893] bpb=1.128931 time=74.6s + ttt_chunk [251/1893] bpb=1.128412 time=77.6s + ttt_chunk [261/1893] bpb=1.125529 time=80.7s + ttt_chunk [271/1893] bpb=1.124390 time=83.8s + ttt_chunk [281/1893] bpb=1.125835 time=86.9s + ttt_chunk [291/1893] bpb=1.127584 time=90.0s + ttt_chunk [301/1893] bpb=1.128341 time=93.0s + ttt_chunk [311/1893] bpb=1.130443 time=96.1s + ttt_chunk [321/1893] bpb=1.132480 time=99.2s + ttt_chunk [331/1893] bpb=1.132351 time=102.3s + ttt_chunk [341/1893] bpb=1.131368 time=105.4s + ttt_chunk [351/1893] bpb=1.133657 time=108.5s + ttt_chunk [361/1893] bpb=1.133975 time=111.5s + ttt_chunk [371/1893] bpb=1.133283 time=114.6s + ttt_chunk [381/1893] bpb=1.133417 time=117.7s + ttt_chunk [391/1893] bpb=1.133243 time=120.8s + ttt_chunk [401/1893] bpb=1.131189 time=123.9s + ttt_chunk [411/1893] bpb=1.130058 time=127.0s + ttt_chunk [421/1893] bpb=1.129191 time=130.1s + ttt_chunk [431/1893] bpb=1.129039 time=133.2s + ttt_chunk [441/1893] bpb=1.129411 time=136.3s + ttt_chunk [451/1893] bpb=1.129700 time=139.4s + ttt_chunk [461/1893] bpb=1.128602 time=142.5s + ttt_chunk [471/1893] bpb=1.129204 time=145.6s + ttt_chunk [481/1893] bpb=1.128811 time=148.7s + ttt_chunk [491/1893] bpb=1.127740 time=151.8s + ttt_chunk [501/1893] bpb=1.127267 time=154.9s + ttt_chunk [511/1893] bpb=1.126595 time=158.0s + ttt_chunk [521/1893] bpb=1.124313 time=161.1s + ttt_chunk [531/1893] bpb=1.125507 time=164.2s + ttt_chunk [541/1893] bpb=1.125829 time=167.2s + ttt_chunk [551/1893] bpb=1.124810 time=170.3s + ttt_chunk [561/1893] bpb=1.125331 time=173.4s + ttt_chunk [571/1893] bpb=1.124264 time=176.6s + ttt_chunk [581/1893] bpb=1.123483 time=179.6s + ttt_chunk [591/1893] bpb=1.122882 time=182.8s + ttt_chunk [601/1893] bpb=1.123371 time=185.9s + ttt_chunk [611/1893] bpb=1.123308 time=189.0s + ttt_chunk [621/1893] bpb=1.123158 time=192.1s + ttt_chunk [631/1893] bpb=1.123903 time=195.1s + ttt_chunk [641/1893] bpb=1.123643 time=198.2s + ttt_chunk [651/1893] bpb=1.123759 time=201.3s + ttt_chunk [661/1893] bpb=1.123246 time=204.4s + ttt_chunk [671/1893] bpb=1.123611 time=207.5s + ttt_chunk [681/1893] bpb=1.124330 time=210.6s + ttt_chunk [691/1893] bpb=1.125310 time=213.7s + ttt_chunk [701/1893] bpb=1.124762 time=216.7s + ttt_chunk [711/1893] bpb=1.124741 time=219.8s + ttt_chunk [721/1893] bpb=1.124374 time=222.9s + ttt_chunk [731/1893] 
bpb=1.124466 time=226.0s + ttt_chunk [741/1893] bpb=1.124553 time=229.1s + ttt_chunk [751/1893] bpb=1.124388 time=232.2s + ttt_chunk [761/1893] bpb=1.124284 time=235.3s + ttt_chunk [771/1893] bpb=1.123991 time=238.4s + ttt_chunk [781/1893] bpb=1.124701 time=241.5s + ttt_chunk [791/1893] bpb=1.124307 time=244.6s + ttt_chunk [801/1893] bpb=1.124621 time=247.7s + ttt_chunk [811/1893] bpb=1.124357 time=250.7s + ttt_chunk [821/1893] bpb=1.124124 time=253.8s + ttt_chunk [831/1893] bpb=1.123943 time=256.9s + ttt_chunk [841/1893] bpb=1.123340 time=260.0s + ttt_chunk [851/1893] bpb=1.123098 time=263.1s + ttt_chunk [861/1893] bpb=1.122827 time=266.2s + ttt_chunk [871/1893] bpb=1.123090 time=269.3s + ttt_chunk [881/1893] bpb=1.123268 time=272.4s + ttt_chunk [891/1893] bpb=1.122859 time=275.5s + ttt_chunk [901/1893] bpb=1.122593 time=278.5s + ttt_chunk [911/1893] bpb=1.122701 time=281.6s + ttt_chunk [921/1893] bpb=1.123194 time=284.7s + ttt_chunk [931/1893] bpb=1.123168 time=287.8s + ttt_chunk [941/1893] bpb=1.122848 time=290.9s + ttt_chunk [951/1893] bpb=1.123224 time=294.0s + ttt_chunk [961/1893] bpb=1.123313 time=297.1s + ttt_chunk [971/1893] bpb=1.124171 time=300.2s + ttt_chunk [981/1893] bpb=1.124228 time=303.3s + ttt_chunk [991/1893] bpb=1.124239 time=306.4s + ttt_chunk [1001/1893] bpb=1.124194 time=309.5s + ttt_chunk [1011/1893] bpb=1.123999 time=312.6s + ttt_chunk [1021/1893] bpb=1.124350 time=315.7s + ttt_chunk [1031/1893] bpb=1.124818 time=318.8s + ttt_chunk [1041/1893] bpb=1.124450 time=321.9s + ttt_chunk [1051/1893] bpb=1.124200 time=325.0s + ttt_chunk [1061/1893] bpb=1.124270 time=328.1s + ttt_chunk [1071/1893] bpb=1.124875 time=331.2s + ttt_chunk [1081/1893] bpb=1.125172 time=334.3s + ttt_chunk [1091/1893] bpb=1.125919 time=337.4s + ttt_chunk [1101/1893] bpb=1.125932 time=340.5s + ttt_chunk [1111/1893] bpb=1.125773 time=343.6s + ttt_chunk [1121/1893] bpb=1.125569 time=346.7s + ttt_chunk [1131/1893] bpb=1.125455 time=349.8s + ttt_chunk [1141/1893] bpb=1.125156 time=353.0s + ttt_chunk [1151/1893] bpb=1.125168 time=356.1s + ttt_chunk [1161/1893] bpb=1.124798 time=359.2s + ttt_chunk [1171/1893] bpb=1.125108 time=362.3s + ttt_chunk [1181/1893] bpb=1.124343 time=365.4s + ttt_chunk [1191/1893] bpb=1.124225 time=368.5s + ttt_chunk [1201/1893] bpb=1.124636 time=371.6s + ttt_chunk [1211/1893] bpb=1.124180 time=374.7s + ttt_chunk [1221/1893] bpb=1.123870 time=377.8s + ttt_chunk [1231/1893] bpb=1.123608 time=380.9s + ttt_chunk [1241/1893] bpb=1.123258 time=384.0s + ttt_chunk [1251/1893] bpb=1.122673 time=387.1s + ttt_chunk [1261/1893] bpb=1.122640 time=390.2s + ttt_chunk [1271/1893] bpb=1.122271 time=393.3s + ttt_chunk [1281/1893] bpb=1.122074 time=396.4s + ttt_chunk [1291/1893] bpb=1.121831 time=399.5s + ttt_chunk [1301/1893] bpb=1.121238 time=402.6s + ttt_chunk [1311/1893] bpb=1.120828 time=405.7s + ttt_chunk [1321/1893] bpb=1.120508 time=408.8s + ttt_chunk [1331/1893] bpb=1.120440 time=411.9s + ttt_chunk [1341/1893] bpb=1.120323 time=415.0s + ttt_chunk [1351/1893] bpb=1.120243 time=418.1s + ttt_chunk [1361/1893] bpb=1.120306 time=421.2s + ttt_chunk [1371/1893] bpb=1.120192 time=424.3s + ttt_chunk [1381/1893] bpb=1.120179 time=427.4s + ttt_chunk [1391/1893] bpb=1.119783 time=430.4s + ttt_chunk [1401/1893] bpb=1.119751 time=433.5s + ttt_chunk [1411/1893] bpb=1.119858 time=436.6s + ttt_chunk [1421/1893] bpb=1.120104 time=439.7s + ttt_chunk [1431/1893] bpb=1.119817 time=442.8s + ttt_chunk [1441/1893] bpb=1.120322 time=445.9s + ttt_chunk [1451/1893] bpb=1.120656 time=449.0s + ttt_chunk [1461/1893] 
bpb=1.120212 time=452.1s + ttt_chunk [1471/1893] bpb=1.121265 time=455.2s + ttt_chunk [1481/1893] bpb=1.120809 time=458.3s + ttt_chunk [1491/1893] bpb=1.120625 time=461.4s + ttt_chunk [1501/1893] bpb=1.120538 time=464.5s + ttt_chunk [1511/1893] bpb=1.120563 time=467.6s + ttt_chunk [1521/1893] bpb=1.120587 time=470.7s + ttt_chunk [1531/1893] bpb=1.120083 time=473.8s + ttt_chunk [1541/1893] bpb=1.119941 time=476.9s + ttt_chunk [1551/1893] bpb=1.120253 time=480.0s + ttt_chunk [1561/1893] bpb=1.120268 time=483.1s + ttt_chunk [1571/1893] bpb=1.120085 time=486.2s + ttt_chunk [1581/1893] bpb=1.120205 time=489.3s + ttt_chunk [1591/1893] bpb=1.120053 time=492.3s + ttt_chunk [1601/1893] bpb=1.120213 time=495.4s + ttt_chunk [1611/1893] bpb=1.120152 time=498.5s + ttt_chunk [1621/1893] bpb=1.119723 time=501.6s + ttt_chunk [1631/1893] bpb=1.120027 time=504.7s + ttt_chunk [1641/1893] bpb=1.120035 time=507.8s + ttt_chunk [1651/1893] bpb=1.119989 time=510.9s + ttt_chunk [1661/1893] bpb=1.119861 time=514.7s + ttt_chunk [1671/1893] bpb=1.120338 time=517.9s + ttt_chunk [1681/1893] bpb=1.120488 time=521.0s + ttt_chunk [1691/1893] bpb=1.120321 time=524.0s + ttt_chunk [1701/1893] bpb=1.120478 time=527.1s + ttt_chunk [1711/1893] bpb=1.120483 time=530.2s + ttt_chunk [1721/1893] bpb=1.120475 time=533.3s + ttt_chunk [1731/1893] bpb=1.120354 time=536.5s + ttt_chunk [1741/1893] bpb=1.120160 time=540.3s + ttt_chunk [1751/1893] bpb=1.119994 time=543.4s + ttt_chunk [1761/1893] bpb=1.120119 time=546.5s + ttt_chunk [1771/1893] bpb=1.120013 time=549.6s + ttt_chunk [1781/1893] bpb=1.120046 time=552.7s + ttt_chunk [1791/1893] bpb=1.119635 time=555.8s + ttt_chunk [1801/1893] bpb=1.119522 time=558.9s + ttt_chunk [1811/1893] bpb=1.119422 time=562.0s + ttt_chunk [1821/1893] bpb=1.119469 time=565.1s + ttt_chunk [1831/1893] bpb=1.118879 time=569.0s + ttt_chunk [1841/1893] bpb=1.118886 time=572.1s + ttt_chunk [1851/1893] bpb=1.118681 time=575.2s + ttt_chunk [1861/1893] bpb=1.118320 time=578.3s + ttt_chunk [1871/1893] bpb=1.118311 time=581.4s + ttt_chunk [1881/1893] bpb=1.117867 time=584.5s + ttt_chunk [1891/1893] bpb=1.117626 time=587.6s + ttt_chunk [1893/1893] bpb=1.117667 time=588.0s +ttt_sliding:done val_loss=1.883386 val_bpb=1.115450 elapsed=588.0s +legal_ttt val_loss:1.8834 val_bpb:1.1155 eval_time:588454ms +legal_ttt_exact val_loss:1.88338589 val_bpb:1.11545016 diff --git a/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/train_seed42.log b/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/train_seed42.log new file mode 100644 index 0000000000..cbd7b63cbb --- /dev/null +++ b/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/train_seed42.log @@ -0,0 +1,387 @@ +W0401 17:53:03.402000 182049 torch/distributed/run.py:851] +W0401 17:53:03.402000 182049 torch/distributed/run.py:851] ***************************************** +W0401 17:53:03.402000 182049 torch/distributed/run.py:851] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
+W0401 17:53:03.402000 182049 torch/distributed/run.py:851] ***************************************** +logs/bigram_ve_wd3500_3pass.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:80 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=4 core_end=7 num_passes=1 max_passes=3 stem=4 core=3 tail=4 schedule=[(0, 1), (4500, 2), (5500, 3)] +model_params:26698335 +XSA:last_4 active_layers:[7, 8, 9, 10] +world_size:8 grad_accum_steps:1 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:6500 warmup_steps:20 max_wallclock_seconds:600.000 +seed:42 +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/6500 val_loss:6.9291 val_bpb:4.1038 train_time:0ms step_avg:0.01ms +step:1/6500 train_loss:6.9308 grad_norm:0.3913 train_time:129ms step_avg:128.83ms +step:2/6500 train_loss:8.5088 grad_norm:3.4697 train_time:163ms step_avg:81.37ms +step:3/6500 train_loss:7.6732 grad_norm:1.7810 train_time:244ms step_avg:81.17ms +step:4/6500 train_loss:7.3186 grad_norm:1.2484 train_time:325ms step_avg:81.22ms +step:5/6500 train_loss:7.0964 grad_norm:1.5246 train_time:406ms step_avg:81.18ms +step:6/6500 train_loss:6.9715 grad_norm:1.3441 train_time:487ms step_avg:81.18ms +step:7/6500 train_loss:6.8847 grad_norm:1.4191 train_time:568ms step_avg:81.09ms +step:8/6500 train_loss:6.8142 grad_norm:0.9933 train_time:649ms step_avg:81.09ms +step:9/6500 train_loss:6.4831 grad_norm:0.8023 train_time:730ms step_avg:81.15ms +step:10/6500 train_loss:6.1384 grad_norm:1.1039 train_time:811ms step_avg:81.11ms +step:50/6500 train_loss:3.7638 grad_norm:0.8901 train_time:4103ms step_avg:82.06ms +step:100/6500 train_loss:3.1626 grad_norm:0.5541 train_time:8225ms step_avg:82.25ms +step:150/6500 train_loss:2.8900 grad_norm:0.4495 train_time:12415ms step_avg:82.77ms +step:200/6500 train_loss:2.3897 grad_norm:0.3579 train_time:16547ms step_avg:82.73ms +step:250/6500 train_loss:2.4792 grad_norm:0.3376 train_time:20686ms step_avg:82.74ms +step:300/6500 train_loss:2.5535 grad_norm:0.3133 train_time:24876ms step_avg:82.92ms +step:350/6500 train_loss:2.5321 grad_norm:0.2698 train_time:29017ms step_avg:82.90ms +step:400/6500 train_loss:2.4047 grad_norm:0.2372 train_time:33216ms step_avg:83.04ms +step:450/6500 train_loss:2.3542 grad_norm:0.2499 train_time:37357ms step_avg:83.02ms +step:500/6500 train_loss:2.3848 grad_norm:0.1949 train_time:41502ms step_avg:83.00ms +step:550/6500 train_loss:2.3247 grad_norm:0.2375 train_time:45705ms step_avg:83.10ms +step:600/6500 train_loss:2.3201 grad_norm:0.1828 train_time:49846ms step_avg:83.08ms +step:650/6500 train_loss:2.3143 grad_norm:0.1735 train_time:54060ms step_avg:83.17ms +step:700/6500 train_loss:2.3316 grad_norm:0.1780 train_time:58196ms step_avg:83.14ms +step:750/6500 train_loss:2.3158 grad_norm:0.1632 train_time:62336ms step_avg:83.11ms 
+step:800/6500 train_loss:2.2248 grad_norm:0.1652 train_time:66556ms step_avg:83.19ms +step:850/6500 train_loss:2.2198 grad_norm:0.1562 train_time:70697ms step_avg:83.17ms +step:900/6500 train_loss:2.1103 grad_norm:0.1462 train_time:74903ms step_avg:83.23ms +step:950/6500 train_loss:2.2089 grad_norm:0.1508 train_time:79046ms step_avg:83.21ms +step:1000/6500 train_loss:2.2624 grad_norm:0.1474 train_time:83190ms step_avg:83.19ms +step:1050/6500 train_loss:2.2083 grad_norm:0.1428 train_time:87393ms step_avg:83.23ms +step:1100/6500 train_loss:2.3023 grad_norm:0.1304 train_time:91536ms step_avg:83.21ms +step:1150/6500 train_loss:2.2354 grad_norm:0.1320 train_time:95746ms step_avg:83.26ms +step:1200/6500 train_loss:2.3407 grad_norm:0.1321 train_time:99889ms step_avg:83.24ms +step:1250/6500 train_loss:2.2390 grad_norm:0.1242 train_time:104035ms step_avg:83.23ms +step:1300/6500 train_loss:2.0891 grad_norm:0.1257 train_time:108257ms step_avg:83.27ms +step:1350/6500 train_loss:2.2409 grad_norm:0.1202 train_time:112405ms step_avg:83.26ms +step:1400/6500 train_loss:2.1705 grad_norm:0.1071 train_time:116620ms step_avg:83.30ms +step:1450/6500 train_loss:2.1071 grad_norm:0.0987 train_time:120763ms step_avg:83.28ms +step:1500/6500 train_loss:2.2093 grad_norm:0.1028 train_time:124909ms step_avg:83.27ms +step:1550/6500 train_loss:2.1736 grad_norm:0.0960 train_time:129123ms step_avg:83.30ms +step:1600/6500 train_loss:2.0634 grad_norm:0.0866 train_time:133271ms step_avg:83.29ms +step:1650/6500 train_loss:2.1779 grad_norm:0.0914 train_time:137426ms step_avg:83.29ms +step:1700/6500 train_loss:2.1295 grad_norm:0.0810 train_time:141642ms step_avg:83.32ms +step:1750/6500 train_loss:2.1853 grad_norm:0.0818 train_time:145801ms step_avg:83.31ms +step:1800/6500 train_loss:2.1455 grad_norm:0.1147 train_time:150030ms step_avg:83.35ms +step:1850/6500 train_loss:2.0186 grad_norm:0.0845 train_time:154184ms step_avg:83.34ms +step:1900/6500 train_loss:2.1159 grad_norm:0.0816 train_time:158338ms step_avg:83.34ms +step:1950/6500 train_loss:2.0067 grad_norm:0.0725 train_time:162560ms step_avg:83.36ms +step:2000/6500 train_loss:2.0534 grad_norm:0.0752 train_time:166714ms step_avg:83.36ms +step:2050/6500 train_loss:2.0987 grad_norm:0.0760 train_time:170941ms step_avg:83.39ms +step:2100/6500 train_loss:2.0385 grad_norm:0.0749 train_time:175092ms step_avg:83.38ms +step:2150/6500 train_loss:2.1424 grad_norm:0.0755 train_time:179247ms step_avg:83.37ms +step:2200/6500 train_loss:2.1272 grad_norm:0.1143 train_time:183469ms step_avg:83.40ms +step:2250/6500 train_loss:2.1605 grad_norm:0.0768 train_time:187617ms step_avg:83.39ms +step:2300/6500 train_loss:2.1000 grad_norm:0.0772 train_time:191838ms step_avg:83.41ms +step:2350/6500 train_loss:2.1580 grad_norm:0.0748 train_time:195996ms step_avg:83.40ms +step:2400/6500 train_loss:2.0524 grad_norm:0.0746 train_time:200154ms step_avg:83.40ms +step:2450/6500 train_loss:2.0746 grad_norm:0.0777 train_time:204374ms step_avg:83.42ms +step:2500/6500 train_loss:2.1638 grad_norm:0.1080 train_time:208537ms step_avg:83.41ms +step:2550/6500 train_loss:2.1969 grad_norm:0.0786 train_time:212758ms step_avg:83.43ms +step:2600/6500 train_loss:2.0995 grad_norm:0.0756 train_time:216916ms step_avg:83.43ms +step:2650/6500 train_loss:2.0600 grad_norm:0.0839 train_time:221077ms step_avg:83.43ms +step:2700/6500 train_loss:2.0931 grad_norm:0.0721 train_time:225310ms step_avg:83.45ms +step:2750/6500 train_loss:2.0210 grad_norm:0.0743 train_time:229467ms step_avg:83.44ms +step:2800/6500 train_loss:2.1436 
grad_norm:0.0817 train_time:233688ms step_avg:83.46ms +step:2850/6500 train_loss:2.0573 grad_norm:0.0767 train_time:237845ms step_avg:83.45ms +step:2900/6500 train_loss:2.0140 grad_norm:0.0738 train_time:242002ms step_avg:83.45ms +step:2950/6500 train_loss:2.0732 grad_norm:0.0777 train_time:246232ms step_avg:83.47ms +step:3000/6500 train_loss:2.1495 grad_norm:0.0746 train_time:250387ms step_avg:83.46ms +step:3050/6500 train_loss:2.0383 grad_norm:0.0830 train_time:254547ms step_avg:83.46ms +step:3100/6500 train_loss:2.0264 grad_norm:0.0722 train_time:258785ms step_avg:83.48ms +step:3150/6500 train_loss:1.9652 grad_norm:0.0764 train_time:262939ms step_avg:83.47ms +step:3200/6500 train_loss:2.1629 grad_norm:0.0760 train_time:267158ms step_avg:83.49ms +step:3250/6500 train_loss:2.0424 grad_norm:0.0697 train_time:271320ms step_avg:83.48ms +step:3300/6500 train_loss:2.0659 grad_norm:0.0749 train_time:275478ms step_avg:83.48ms +step:3350/6500 train_loss:2.0906 grad_norm:0.0745 train_time:279711ms step_avg:83.50ms +step:3400/6500 train_loss:2.0139 grad_norm:0.0743 train_time:283870ms step_avg:83.49ms +step:3450/6500 train_loss:2.1105 grad_norm:0.0828 train_time:288091ms step_avg:83.50ms +step:3500/6500 train_loss:2.1715 grad_norm:0.0732 train_time:292251ms step_avg:83.50ms +step:3550/6500 train_loss:1.9168 grad_norm:0.0786 train_time:296411ms step_avg:83.50ms +step:3600/6500 train_loss:2.0902 grad_norm:0.0772 train_time:300637ms step_avg:83.51ms +step:3650/6500 train_loss:1.9711 grad_norm:0.0710 train_time:304794ms step_avg:83.51ms +step:3700/6500 train_loss:2.0917 grad_norm:0.0703 train_time:309022ms step_avg:83.52ms +step:3750/6500 train_loss:1.9167 grad_norm:0.0710 train_time:313181ms step_avg:83.51ms +step:3800/6500 train_loss:2.0676 grad_norm:0.0808 train_time:317341ms step_avg:83.51ms +step:3850/6500 train_loss:2.0849 grad_norm:0.0734 train_time:321563ms step_avg:83.52ms +step:3900/6500 train_loss:2.0710 grad_norm:0.0744 train_time:325721ms step_avg:83.52ms +step:3950/6500 train_loss:2.1685 grad_norm:0.0723 train_time:329932ms step_avg:83.53ms +step:4000/6500 train_loss:1.9678 grad_norm:0.0709 train_time:334094ms step_avg:83.52ms +step:4000/6500 val_loss:2.0609 val_bpb:1.2206 train_time:334144ms step_avg:83.54ms +step:4050/6500 train_loss:2.0891 grad_norm:0.0704 train_time:338250ms step_avg:83.52ms +step:4100/6500 train_loss:2.0107 grad_norm:0.0763 train_time:342478ms step_avg:83.53ms +step:4150/6500 train_loss:2.1066 grad_norm:0.0686 train_time:346639ms step_avg:83.53ms +step:4200/6500 train_loss:2.1462 grad_norm:0.0812 train_time:350858ms step_avg:83.54ms +step:4250/6500 train_loss:2.1092 grad_norm:0.0763 train_time:355014ms step_avg:83.53ms +step:4300/6500 train_loss:2.0546 grad_norm:0.0752 train_time:359173ms step_avg:83.53ms +step:4350/6500 train_loss:2.0656 grad_norm:0.0760 train_time:363401ms step_avg:83.54ms +step:4400/6500 train_loss:2.0301 grad_norm:0.0769 train_time:367556ms step_avg:83.54ms +step:4450/6500 train_loss:2.0432 grad_norm:0.0745 train_time:371713ms step_avg:83.53ms +step:4500/6500 train_loss:2.1230 grad_norm:0.0726 train_time:375939ms step_avg:83.54ms +progressive_passes: step:4500 num_passes:2 +step:4550/6500 train_loss:2.1247 grad_norm:0.0731 train_time:381510ms step_avg:83.85ms +step:4600/6500 train_loss:1.8322 grad_norm:0.0886 train_time:387153ms step_avg:84.16ms +step:4650/6500 train_loss:2.0494 grad_norm:0.0743 train_time:392726ms step_avg:84.46ms +step:4700/6500 train_loss:2.2236 grad_norm:0.1161 train_time:398301ms step_avg:84.74ms +step:4750/6500 
train_loss:2.0131 grad_norm:0.0707 train_time:403941ms step_avg:85.04ms +step:4800/6500 train_loss:2.4144 grad_norm:0.1498 train_time:409516ms step_avg:85.32ms +step:4850/6500 train_loss:2.0925 grad_norm:0.0791 train_time:415156ms step_avg:85.60ms +step:4900/6500 train_loss:2.0318 grad_norm:0.0760 train_time:420732ms step_avg:85.86ms +step:4950/6500 train_loss:2.0796 grad_norm:0.0818 train_time:426301ms step_avg:86.12ms +step:5000/6500 train_loss:2.0858 grad_norm:0.0780 train_time:431938ms step_avg:86.39ms +step:5050/6500 train_loss:2.0497 grad_norm:0.0826 train_time:437509ms step_avg:86.64ms +step:5100/6500 train_loss:2.1092 grad_norm:0.0773 train_time:443146ms step_avg:86.89ms +step:5150/6500 train_loss:2.0069 grad_norm:0.0785 train_time:448717ms step_avg:87.13ms +step:5200/6500 train_loss:2.0194 grad_norm:0.0775 train_time:454289ms step_avg:87.36ms +step:5250/6500 train_loss:2.0475 grad_norm:0.0722 train_time:459924ms step_avg:87.60ms +step:5300/6500 train_loss:1.9879 grad_norm:0.0783 train_time:465494ms step_avg:87.83ms +step:5350/6500 train_loss:1.9020 grad_norm:0.0779 train_time:471147ms step_avg:88.06ms +step:5400/6500 train_loss:2.0257 grad_norm:0.0774 train_time:476721ms step_avg:88.28ms +step:5450/6500 train_loss:2.0488 grad_norm:0.0765 train_time:482296ms step_avg:88.49ms +step:5500/6500 train_loss:1.9912 grad_norm:0.0790 train_time:487937ms step_avg:88.72ms +progressive_passes: step:5500 num_passes:3 +step:5550/6500 train_loss:1.9789 grad_norm:0.0810 train_time:494608ms step_avg:89.12ms +step:5600/6500 train_loss:1.9256 grad_norm:0.0786 train_time:501346ms step_avg:89.53ms +step:5650/6500 train_loss:2.0259 grad_norm:0.0842 train_time:508018ms step_avg:89.91ms +step:5700/6500 train_loss:1.9806 grad_norm:0.0837 train_time:514687ms step_avg:90.30ms +step:5750/6500 train_loss:2.0596 grad_norm:0.0910 train_time:521422ms step_avg:90.68ms +step:5800/6500 train_loss:1.9578 grad_norm:0.0867 train_time:528090ms step_avg:91.05ms +step:5850/6500 train_loss:2.0914 grad_norm:0.0835 train_time:534823ms step_avg:91.42ms +swa:start step:5900 +step:5900/6500 train_loss:1.8655 grad_norm:0.0820 train_time:541493ms step_avg:91.78ms +step:5950/6500 train_loss:1.9220 grad_norm:0.0783 train_time:548258ms step_avg:92.14ms +late_qat:enabled step:5974 scale:0.1498 core_quant:on +step:6000/6500 train_loss:1.9047 grad_norm:0.0854 train_time:555064ms step_avg:92.51ms +step:6050/6500 train_loss:1.9335 grad_norm:0.0832 train_time:561782ms step_avg:92.86ms +step:6100/6500 train_loss:1.8795 grad_norm:0.0848 train_time:568501ms step_avg:93.20ms +step:6150/6500 train_loss:1.9813 grad_norm:0.0840 train_time:575291ms step_avg:93.54ms +step:6200/6500 train_loss:1.9069 grad_norm:0.0873 train_time:582012ms step_avg:93.87ms +step:6250/6500 train_loss:2.0284 grad_norm:0.0927 train_time:588792ms step_avg:94.21ms +step:6300/6500 train_loss:1.9061 grad_norm:0.0825 train_time:595509ms step_avg:94.53ms +step:6334/6500 val_loss:1.9232 val_bpb:1.1390 train_time:600161ms step_avg:94.75ms +stopping_early: wallclock_cap train_time:600161ms step:6334/6500 +peak memory allocated: 34074 MiB reserved: 34084 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:1.9202 val_bpb:1.1372 eval_time:2969ms +Serialized model: 105478842 bytes +Code size: 88253 bytes +eval_override: num_passes 1 -> 3 +Serialized model int6+lzma: 15839344 bytes +Total submission size int6+lzma: 15927597 bytes +eval_feedback: loaded from artifact, params=2560 +ttt_sliding:start chunks=1893 chunk_tokens=32768 total_windows=969088 stride=64 ttt_lr=0.002 
ttt_epochs=3 freeze_blocks=0 +ttt_sliding:params unfrozen=26700895 frozen=0 + ttt_chunk [1/1893] bpb=1.152477 time=0.5s + ttt_chunk [11/1893] bpb=1.143035 time=3.6s + ttt_chunk [21/1893] bpb=1.129097 time=6.6s + ttt_chunk [31/1893] bpb=1.127795 time=9.7s + ttt_chunk [41/1893] bpb=1.115135 time=12.7s + ttt_chunk [51/1893] bpb=1.108923 time=15.8s + ttt_chunk [61/1893] bpb=1.115323 time=18.8s + ttt_chunk [71/1893] bpb=1.114068 time=21.8s + ttt_chunk [81/1893] bpb=1.113434 time=24.9s + ttt_chunk [91/1893] bpb=1.114154 time=27.9s + ttt_chunk [101/1893] bpb=1.117640 time=31.0s + ttt_chunk [111/1893] bpb=1.119946 time=34.0s + ttt_chunk [121/1893] bpb=1.113175 time=37.1s + ttt_chunk [131/1893] bpb=1.113513 time=40.1s + ttt_chunk [141/1893] bpb=1.119060 time=43.1s + ttt_chunk [151/1893] bpb=1.120887 time=46.2s + ttt_chunk [161/1893] bpb=1.120409 time=49.2s + ttt_chunk [171/1893] bpb=1.124729 time=52.2s + ttt_chunk [181/1893] bpb=1.126902 time=55.3s + ttt_chunk [191/1893] bpb=1.134130 time=58.3s + ttt_chunk [201/1893] bpb=1.132989 time=61.4s + ttt_chunk [211/1893] bpb=1.130820 time=64.4s + ttt_chunk [221/1893] bpb=1.132337 time=67.4s + ttt_chunk [231/1893] bpb=1.131158 time=70.5s + ttt_chunk [241/1893] bpb=1.131378 time=73.5s + ttt_chunk [251/1893] bpb=1.130935 time=76.6s + ttt_chunk [261/1893] bpb=1.128082 time=79.6s + ttt_chunk [271/1893] bpb=1.126981 time=82.7s + ttt_chunk [281/1893] bpb=1.128422 time=85.7s + ttt_chunk [291/1893] bpb=1.130257 time=88.8s + ttt_chunk [301/1893] bpb=1.131029 time=91.8s + ttt_chunk [311/1893] bpb=1.133104 time=94.9s + ttt_chunk [321/1893] bpb=1.135079 time=97.9s + ttt_chunk [331/1893] bpb=1.134946 time=100.9s + ttt_chunk [341/1893] bpb=1.133947 time=104.0s + ttt_chunk [351/1893] bpb=1.136247 time=107.0s + ttt_chunk [361/1893] bpb=1.136484 time=110.1s + ttt_chunk [371/1893] bpb=1.135787 time=113.2s + ttt_chunk [381/1893] bpb=1.135928 time=116.2s + ttt_chunk [391/1893] bpb=1.135725 time=119.3s + ttt_chunk [401/1893] bpb=1.133622 time=122.3s + ttt_chunk [411/1893] bpb=1.132451 time=125.4s + ttt_chunk [421/1893] bpb=1.131507 time=128.4s + ttt_chunk [431/1893] bpb=1.131405 time=131.5s + ttt_chunk [441/1893] bpb=1.131771 time=134.6s + ttt_chunk [451/1893] bpb=1.132087 time=137.6s + ttt_chunk [461/1893] bpb=1.131017 time=140.7s + ttt_chunk [471/1893] bpb=1.131668 time=143.7s + ttt_chunk [481/1893] bpb=1.131297 time=146.8s + ttt_chunk [491/1893] bpb=1.130192 time=149.9s + ttt_chunk [501/1893] bpb=1.129677 time=152.9s + ttt_chunk [511/1893] bpb=1.128974 time=156.0s + ttt_chunk [521/1893] bpb=1.126707 time=159.1s + ttt_chunk [531/1893] bpb=1.127885 time=162.1s + ttt_chunk [541/1893] bpb=1.128209 time=165.2s + ttt_chunk [551/1893] bpb=1.127153 time=168.2s + ttt_chunk [561/1893] bpb=1.127686 time=171.3s + ttt_chunk [571/1893] bpb=1.126673 time=174.3s + ttt_chunk [581/1893] bpb=1.125867 time=177.4s + ttt_chunk [591/1893] bpb=1.125232 time=180.4s + ttt_chunk [601/1893] bpb=1.125737 time=183.5s + ttt_chunk [611/1893] bpb=1.125643 time=186.5s + ttt_chunk [621/1893] bpb=1.125484 time=189.7s + ttt_chunk [631/1893] bpb=1.126196 time=192.7s + ttt_chunk [641/1893] bpb=1.125945 time=195.8s + ttt_chunk [651/1893] bpb=1.126042 time=198.8s + ttt_chunk [661/1893] bpb=1.125539 time=201.9s + ttt_chunk [671/1893] bpb=1.125913 time=204.9s + ttt_chunk [681/1893] bpb=1.126614 time=208.0s + ttt_chunk [691/1893] bpb=1.127610 time=211.0s + ttt_chunk [701/1893] bpb=1.127051 time=214.1s + ttt_chunk [711/1893] bpb=1.127028 time=217.1s + ttt_chunk [721/1893] bpb=1.126685 time=220.2s + ttt_chunk [731/1893] 
bpb=1.126729 time=223.3s + ttt_chunk [741/1893] bpb=1.126821 time=226.3s + ttt_chunk [751/1893] bpb=1.126687 time=229.4s + ttt_chunk [761/1893] bpb=1.126623 time=232.4s + ttt_chunk [771/1893] bpb=1.126304 time=235.5s + ttt_chunk [781/1893] bpb=1.127035 time=238.5s + ttt_chunk [791/1893] bpb=1.126622 time=241.6s + ttt_chunk [801/1893] bpb=1.126957 time=244.6s + ttt_chunk [811/1893] bpb=1.126711 time=247.7s + ttt_chunk [821/1893] bpb=1.126489 time=250.8s + ttt_chunk [831/1893] bpb=1.126309 time=253.8s + ttt_chunk [841/1893] bpb=1.125658 time=256.9s + ttt_chunk [851/1893] bpb=1.125406 time=259.9s + ttt_chunk [861/1893] bpb=1.125146 time=263.0s + ttt_chunk [871/1893] bpb=1.125420 time=266.0s + ttt_chunk [881/1893] bpb=1.125586 time=269.1s + ttt_chunk [891/1893] bpb=1.125168 time=272.1s + ttt_chunk [901/1893] bpb=1.124899 time=275.1s + ttt_chunk [911/1893] bpb=1.125018 time=278.2s + ttt_chunk [921/1893] bpb=1.125495 time=281.2s + ttt_chunk [931/1893] bpb=1.125456 time=284.3s + ttt_chunk [941/1893] bpb=1.125151 time=287.3s + ttt_chunk [951/1893] bpb=1.125534 time=290.4s + ttt_chunk [961/1893] bpb=1.125608 time=293.4s + ttt_chunk [971/1893] bpb=1.126466 time=296.5s + ttt_chunk [981/1893] bpb=1.126540 time=299.5s + ttt_chunk [991/1893] bpb=1.126531 time=302.5s + ttt_chunk [1001/1893] bpb=1.126461 time=305.6s + ttt_chunk [1011/1893] bpb=1.126239 time=308.6s + ttt_chunk [1021/1893] bpb=1.126577 time=311.7s + ttt_chunk [1031/1893] bpb=1.127017 time=314.7s + ttt_chunk [1041/1893] bpb=1.126693 time=317.8s + ttt_chunk [1051/1893] bpb=1.126438 time=320.8s + ttt_chunk [1061/1893] bpb=1.126498 time=323.9s + ttt_chunk [1071/1893] bpb=1.127124 time=327.0s + ttt_chunk [1081/1893] bpb=1.127399 time=330.0s + ttt_chunk [1091/1893] bpb=1.128159 time=333.1s + ttt_chunk [1101/1893] bpb=1.128177 time=336.1s + ttt_chunk [1111/1893] bpb=1.128034 time=339.2s + ttt_chunk [1121/1893] bpb=1.127833 time=342.3s + ttt_chunk [1131/1893] bpb=1.127712 time=345.3s + ttt_chunk [1141/1893] bpb=1.127398 time=348.4s + ttt_chunk [1151/1893] bpb=1.127410 time=351.5s + ttt_chunk [1161/1893] bpb=1.127032 time=354.5s + ttt_chunk [1171/1893] bpb=1.127359 time=357.6s + ttt_chunk [1181/1893] bpb=1.126599 time=360.6s + ttt_chunk [1191/1893] bpb=1.126502 time=363.6s + ttt_chunk [1201/1893] bpb=1.126926 time=366.7s + ttt_chunk [1211/1893] bpb=1.126480 time=369.7s + ttt_chunk [1221/1893] bpb=1.126193 time=372.8s + ttt_chunk [1231/1893] bpb=1.125910 time=375.8s + ttt_chunk [1241/1893] bpb=1.125549 time=378.9s + ttt_chunk [1251/1893] bpb=1.124967 time=382.0s + ttt_chunk [1261/1893] bpb=1.124947 time=385.0s + ttt_chunk [1271/1893] bpb=1.124571 time=388.1s + ttt_chunk [1281/1893] bpb=1.124391 time=391.1s + ttt_chunk [1291/1893] bpb=1.124160 time=394.2s + ttt_chunk [1301/1893] bpb=1.123558 time=397.3s + ttt_chunk [1311/1893] bpb=1.123176 time=400.3s + ttt_chunk [1321/1893] bpb=1.122847 time=403.4s + ttt_chunk [1331/1893] bpb=1.122788 time=406.4s + ttt_chunk [1341/1893] bpb=1.122665 time=409.5s + ttt_chunk [1351/1893] bpb=1.122591 time=412.5s + ttt_chunk [1361/1893] bpb=1.122658 time=415.6s + ttt_chunk [1371/1893] bpb=1.122520 time=418.6s + ttt_chunk [1381/1893] bpb=1.122507 time=421.9s + ttt_chunk [1391/1893] bpb=1.122107 time=425.0s + ttt_chunk [1401/1893] bpb=1.122084 time=428.1s + ttt_chunk [1411/1893] bpb=1.122201 time=431.1s + ttt_chunk [1421/1893] bpb=1.122447 time=434.1s + ttt_chunk [1431/1893] bpb=1.122150 time=437.2s + ttt_chunk [1441/1893] bpb=1.122669 time=440.2s + ttt_chunk [1451/1893] bpb=1.123008 time=443.2s + ttt_chunk [1461/1893] 
bpb=1.122569 time=446.3s + ttt_chunk [1471/1893] bpb=1.123625 time=449.3s + ttt_chunk [1481/1893] bpb=1.123158 time=452.4s + ttt_chunk [1491/1893] bpb=1.122974 time=455.4s + ttt_chunk [1501/1893] bpb=1.122869 time=458.5s + ttt_chunk [1511/1893] bpb=1.122901 time=461.5s + ttt_chunk [1521/1893] bpb=1.122938 time=464.6s + ttt_chunk [1531/1893] bpb=1.122409 time=467.6s + ttt_chunk [1541/1893] bpb=1.122271 time=470.7s + ttt_chunk [1551/1893] bpb=1.122576 time=473.7s + ttt_chunk [1561/1893] bpb=1.122587 time=476.8s + ttt_chunk [1571/1893] bpb=1.122415 time=479.9s + ttt_chunk [1581/1893] bpb=1.122534 time=482.9s + ttt_chunk [1591/1893] bpb=1.122382 time=486.0s + ttt_chunk [1601/1893] bpb=1.122561 time=489.0s + ttt_chunk [1611/1893] bpb=1.122496 time=492.1s + ttt_chunk [1621/1893] bpb=1.122078 time=495.1s + ttt_chunk [1631/1893] bpb=1.122388 time=498.2s + ttt_chunk [1641/1893] bpb=1.122410 time=501.2s + ttt_chunk [1651/1893] bpb=1.122366 time=504.3s + ttt_chunk [1661/1893] bpb=1.122246 time=508.2s + ttt_chunk [1671/1893] bpb=1.122717 time=511.2s + ttt_chunk [1681/1893] bpb=1.122863 time=514.3s + ttt_chunk [1691/1893] bpb=1.122692 time=517.3s + ttt_chunk [1701/1893] bpb=1.122854 time=520.4s + ttt_chunk [1711/1893] bpb=1.122857 time=523.4s + ttt_chunk [1721/1893] bpb=1.122852 time=526.5s + ttt_chunk [1731/1893] bpb=1.122739 time=529.6s + ttt_chunk [1741/1893] bpb=1.122554 time=532.7s + ttt_chunk [1751/1893] bpb=1.122384 time=535.7s + ttt_chunk [1761/1893] bpb=1.122522 time=538.8s + ttt_chunk [1771/1893] bpb=1.122413 time=541.8s + ttt_chunk [1781/1893] bpb=1.122441 time=544.9s + ttt_chunk [1791/1893] bpb=1.122033 time=548.0s + ttt_chunk [1801/1893] bpb=1.121908 time=551.0s + ttt_chunk [1811/1893] bpb=1.121817 time=554.1s + ttt_chunk [1821/1893] bpb=1.121874 time=557.1s + ttt_chunk [1831/1893] bpb=1.121276 time=560.2s + ttt_chunk [1841/1893] bpb=1.121284 time=563.3s + ttt_chunk [1851/1893] bpb=1.121066 time=566.3s + ttt_chunk [1861/1893] bpb=1.120692 time=569.4s + ttt_chunk [1871/1893] bpb=1.120685 time=572.4s + ttt_chunk [1881/1893] bpb=1.120241 time=575.5s + ttt_chunk [1891/1893] bpb=1.120004 time=578.5s + ttt_chunk [1893/1893] bpb=1.120051 time=579.0s +ttt_sliding:done val_loss=1.887157 val_bpb=1.117684 elapsed=579.0s +legal_ttt val_loss:1.8872 val_bpb:1.1177 eval_time:579405ms +legal_ttt_exact val_loss:1.88715720 val_bpb:1.11768375 From 41aef3015e299bdea101bb50c076b7cf1107ef28 Mon Sep 17 00:00:00 2001 From: nesta Date: Wed, 1 Apr 2026 20:03:19 +0000 Subject: [PATCH 18/23] Add tricks section to README: graph precompilation warmup and python-minifier Document the torch.compile graph precompilation trick (cycling through pass/QAT variants during warmup to avoid compilation stalls under the 600s wallclock cap) and the python-minifier approach for fitting under the 16MB submission limit. 
--- .../README.md | 21 +++++++++++++++++-- 1 file changed, 19 insertions(+), 2 deletions(-) diff --git a/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/README.md b/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/README.md index aaf3d3b9bb..158b384f23 100644 --- a/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/README.md +++ b/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/README.md @@ -141,6 +141,7 @@ Built on the [PR #414](https://github.com/openai/parameter-golf/pull/414) stack | Jacobian proxy | λ=0.01 | | Weight avg | EMA(0.997) + SWA(every 50) | | Quantization | Late QAT (threshold 0.15) + GPTQ-lite int6 + lzma | +| Warmup precompilation | All pass×QAT graph variants compiled during 20 warmup steps | | Optimizer | Parameter Banking + Parallel Muon | ## Run Command @@ -159,9 +160,25 @@ torchrun --standalone --nproc_per_node=8 train_gpt.py \ --no-interpass-rmsnorm ``` -## Code Size +## Tricks -The original training script was 88,253 bytes, which caused seed 2025 to exceed the 16MB submission limit (16,025,625 bytes). Dead code paths (eval-only mode, int8 quantization, unused feedback variants, verbose logging) were removed and the code was minified with [python-minifier](https://github.com/dflook/python-minifier) (no local variable renaming) to 58,186 bytes, bringing all seeds under the limit. +### Graph Precompilation Warmup + +`torch.compile` is lazy — it only compiles a new graph variant the first time it's encountered. With progressive recurrence (1→2→3 passes) and late QAT, this means the training loop would hit compilation stalls at step 4500 (2-pass), step 5500 (3-pass), and again when QAT enables. Under a 600s wallclock cap, these stalls are expensive. + +The fix: **precompile all graph variants during warmup before training starts**. During the 20 warmup steps: + +1. The last few warmup steps cycle through each `num_passes` variant (2-pass, 3-pass) and each with QAT toggled on +2. This forces `torch.compile` to eagerly compile every forward/backward graph that will appear during training +3. After warmup, model weights and optimizer states are restored to their initial values — the warmup steps have zero effect on the actual training run + +This ensures the training loop runs at full speed from step 0 with no compilation jitter when passes change or QAT kicks in. + +### Code Minification with python-minifier + +The original training script was 88,253 bytes, which caused seed 2025 to exceed the 16MB submission limit (16,025,625 bytes). After removing dead code paths (eval-only mode, int8 quantization, unused feedback variants, verbose logging), the file was still too large. + +[python-minifier](https://github.com/dflook/python-minifier) with `--no-rename-locals` shrinks the code aggressively (whitespace, docstrings, constant folding) while preserving local variable names — critical because the training script uses string-based lookups for `state_dict` keys and `named_parameters`. This brought the file from 68,435 bytes down to **58,186 bytes**, comfortably fitting all seeds under the 16MB decimal limit. 
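Two minimal sketches of these tricks follow; both are illustrative assumptions rather than the submission's actual code. First, the warmup precompilation described above, assuming a `step_fn(model, opt, batch, num_passes=..., qat=...)` training-step wrapper and snapshot/restore around the variant sweep:

```python
# Illustrative sketch of the warmup precompilation trick. Each distinct
# (num_passes, qat) combination triggers a new torch.compile graph variant,
# so running one step per combination during warmup compiles them all up
# front. Weights and optimizer state are restored afterwards, so these
# steps have no effect on the real run. All names here are assumptions.
import copy

def precompile_graph_variants(model, opt, step_fn, batch):
    model_snap = copy.deepcopy(model.state_dict())
    opt_snap = copy.deepcopy(opt.state_dict())
    for num_passes in (1, 2, 3):
        for qat in (False, True):
            step_fn(model, opt, batch, num_passes=num_passes, qat=qat)
    model.load_state_dict(model_snap)  # warmup leaves no trace
    opt.load_state_dict(opt_snap)
```

Second, the minification step, assuming python-minifier's `minify()` exposes a `rename_locals` flag mirroring the CLI's `--no-rename-locals` (file names are illustrative):

```python
# Sketch of the minification step: shrink the training script while keeping
# local variable names intact for the string-based lookups described above.
import python_minifier

with open("train_gpt.py") as f:
    source = f.read()

minified = python_minifier.minify(source, rename_locals=False)

with open("train_gpt.min.py", "w") as f:
    f.write(minified)

print(f"minified {len(source)} -> {len(minified)} bytes")
```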
## Credits From a1639d87a704f69928ee53c4df81fc4329e2c287 Mon Sep 17 00:00:00 2001 From: nesta Date: Wed, 1 Apr 2026 20:05:41 +0000 Subject: [PATCH 19/23] Fix submission.json author to nestamidavaine --- .../2026-03-26_RecurrentSOTA_Feedback/submission.json | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/submission.json b/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/submission.json index f8d545d33b..4fb986ba5a 100644 --- a/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/submission.json +++ b/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/submission.json @@ -3,7 +3,7 @@ "val_bpb": 1.1163, "bytes_total": 15995558, "blurb": "Progressive depth recurrence (1->2->3 passes) with error feedback + jacobian proxy stabilization. Late growth preserves fast step times for most of training, avoiding the step/capacity trade-off that makes naive recurrence impractical. 3-seed mean: 1.1163 (std 0.0013), -0.0031 vs PR #549 LeakyReLU baseline (1.1194). Built on PR #414 stack with Parallel Muon (PR #399). All artifacts under 16MB, all eval under 10 min.", - "author": "abaybektursun", - "github_id": "abaybektursun", + "author": "nestamidavaine", + "github_id": "nestamidavaine", "date": "2026-03-26" } From 47b74a364da15110e8c09bfc4d48d5401aae02ac Mon Sep 17 00:00:00 2001 From: nesta Date: Wed, 1 Apr 2026 20:08:38 +0000 Subject: [PATCH 20/23] Add per-seed results to submission.json Include seeds array, seed_results with per-seed val_loss/val_bpb/bytes, plus pre_quant_val_bpb, step_stop, wallclock_seconds, eval_time_seconds, and bytes_code fields matching the standard submission format. --- .../submission.json | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/submission.json b/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/submission.json index 4fb986ba5a..1d334c9db5 100644 --- a/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/submission.json +++ b/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/submission.json @@ -1,7 +1,19 @@ { "name": "Recurrent Depth with Progressive Pass Growth + Error Feedback", "val_bpb": 1.1163, + "val_bpb_std": 0.0013, "bytes_total": 15995558, + "seeds": [1337, 42, 2025], + "seed_results": { + "1337": {"val_loss": 1.88647166, "val_bpb": 1.11727773, "bytes_model_int6_lzma": 15850832, "bytes_total": 15909018}, + "42": {"val_loss": 1.88902995, "val_bpb": 1.11879290, "bytes_model_int6_lzma": 15839344, "bytes_total": 15897530}, + "2025": {"val_loss": 1.89066597, "val_bpb": 1.12560312, "bytes_model_int6_lzma": 15937372, "bytes_total": 15995558} + }, + "pre_quant_val_bpb": 1.1359, + "step_stop": 6332, + "wallclock_seconds": 600, + "eval_time_seconds": 578, + "bytes_code": 58186, "blurb": "Progressive depth recurrence (1->2->3 passes) with error feedback + jacobian proxy stabilization. Late growth preserves fast step times for most of training, avoiding the step/capacity trade-off that makes naive recurrence impractical. 3-seed mean: 1.1163 (std 0.0013), -0.0031 vs PR #549 LeakyReLU baseline (1.1194). Built on PR #414 stack with Parallel Muon (PR #399). 
All artifacts under 16MB, all eval under 10 min.", "author": "nestamidavaine", "github_id": "nestamidavaine", From c0b02d323e494a54bcbd439ab82ebc25ef987395 Mon Sep 17 00:00:00 2001 From: nesta Date: Wed, 1 Apr 2026 20:21:20 +0000 Subject: [PATCH 21/23] Rename submission folder to Stable_Growing_Recurrance, update README and submission.json - Rename folder from RecurrentSOTA_Feedback to Stable_Growing_Recurrance - Add per-seed post-TTT results to submission.json (legal_ttt_exact values) - Add Tricks section to README: graph precompilation warmup, python-minifier - Note in README that logs report pre-minification code size - Fix author to nestamidavaine - Update .gitignore exception for new folder name --- .gitignore | 2 +- .../README.md | 118 ++++++++++-------- .../submission.json | 37 ++++-- .../train_gpt.py | 0 .../train_seed1337.log | 0 .../train_seed2025.log | 0 .../train_seed42.log | 0 7 files changed, 96 insertions(+), 61 deletions(-) rename records/track_non_record_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_Stable_Growing_Recurrance}/README.md (62%) rename records/track_non_record_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_Stable_Growing_Recurrance}/submission.json (50%) rename records/track_non_record_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_Stable_Growing_Recurrance}/train_gpt.py (100%) rename records/track_non_record_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_Stable_Growing_Recurrance}/train_seed1337.log (100%) rename records/track_non_record_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_Stable_Growing_Recurrance}/train_seed2025.log (100%) rename records/track_non_record_16mb/{2026-03-26_RecurrentSOTA_Feedback => 2026-03-26_Stable_Growing_Recurrance}/train_seed42.log (100%) diff --git a/.gitignore b/.gitignore index 4fe67dc24e..903259eda8 100644 --- a/.gitignore +++ b/.gitignore @@ -11,7 +11,7 @@ data/docs_selected.jsonl logs/ *.log *.txt -!records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/*.log +!records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/*.log *.pt *.ptz *.wandb \ No newline at end of file diff --git a/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/README.md b/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/README.md similarity index 62% rename from records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/README.md rename to records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/README.md index 158b384f23..6c2a5d73ac 100644 --- a/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/README.md +++ b/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/README.md @@ -2,16 +2,18 @@ **val_bpb: 1.1163** (3-seed mean, std 0.0013) | **~15.96 MB** | 8×H100 SXM -A non-record submission targeting significant improvement over [PR #549](https://github.com/openai/parameter-golf/pull/549) (LeakyReLU² baseline, 1.1194 mean bpb). Achieves **-0.0031 bpb** vs that baseline. For an in-depth analysis of depth recurrence in this competition, see [PR #363](https://github.com/openai/parameter-golf/pull/363). +A non-record submission targeting significant improvement over [PR #549](https://github.com/openai/parameter-golf/pull/549) (LeakyReLU² baseline, 1.1194 mean bpb). Achieves **-0.0031 bpb** vs that baseline. For an in-depth analysis of depth recurrence in this competition, see [PR #363](https://github.com/openai/parameter-golf/pull/363). 
I targeted PR #549 when I started building this solution; by the time I finished evaluation, the new improved model had been published to the leaderboard. However, I believe the experiments here can be applied to any model to improve performance, with the largest benefit for submissions using TTT, since the recurrence makes use of the 10 available minutes of evaluation time very effectively. ## Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128) - -| Seed | step_avg | steps | Pre-TTT bpb | **Post-TTT bpb** | TTT gain | TTT time | Artifact | -|------|----------|-------|-------------|-----------------|----------|----------|----------| -| 1337 | 83.5ms | 6,328 | 1.1353 | **1.1157** | -0.0196 | 566s | 15,909,018 | -| 42 | 83.5ms | 6,334 | 1.1372 | **1.1177** | -0.0195 | 579s | 15,897,530 | -| 2025 | 83.4ms | 6,334 | 1.1351 | **1.1155** | -0.0197 | 588s | 15,995,558 | -| **Mean** | **83.5ms** | **6,332** | **1.1359** | **1.1163 (std 0.0013)** | **-0.0196** | **~578s** | | + +| Seed | step_avg | steps | Pre-TTT bpb | **Post-TTT bpb** | TTT gain | TTT time | Artifact | +| -------- | ---------- | --------- | ----------- | ----------------------- | ----------- | --------- | ---------- | +| 1337 | 83.5ms | 6,328 | 1.1353 | **1.1157** | -0.0196 | 566s | 15,909,018 | +| 42 | 83.5ms | 6,334 | 1.1372 | **1.1177** | -0.0195 | 579s | 15,897,530 | +| 2025 | 83.4ms | 6,334 | 1.1351 | **1.1155** | -0.0197 | 588s | 15,995,558 | +| **Mean** | **83.5ms** | **6,332** | **1.1359** | **1.1163 (std 0.0013)** | **-0.0196** | **~578s** | | + ## Progressive Recurrence Architecture @@ -41,7 +43,7 @@ A non-record submission targeting significant improvement over [PR #549](https:/ [PR #363](https://github.com/openai/parameter-golf/pull/363) demonstrated that depth recurrence — reusing a shared block of transformer layers multiple times — saves parameters but *hurts* bpb under the 10-minute / 16MB competition constraints. Their controlled experiments showed a **+0.025 bpb gap** (looped worse) due to two compounding taxes: -1. **Quantization error amplification.** When shared weights are quantized to int6, the quantization error is injected at every pass. After K passes through the same core, the cumulative error grows superlinearly. +1. **Quantization error amplification.** When shared weights are quantized to int6, the quantization error is injected at every pass. After K passes through the same core, the cumulative error grows superlinearly. Additionally, hidden state magnitudes tend to explode with too many recurrent passes through a block if we do not stabilize them. 2. **Step time overhead.** Each additional recurrence pass adds forward/backward compute. With 4 passes, +32ms/step translates to ~1200 fewer training steps in the 600s budget. ## Our Solution: Late Growth + Contractive Stabilization @@ -52,15 +54,17 @@ We address both taxes by growing recurrence depth progressively during training The key insight: **start training with 1 pass and gradually add passes late in training**. This preserves fast step times for the majority of training (83.5ms/step at 1-pass vs ~95ms at 3-pass), maximizing the total number of gradient updates within the 600s wallclock budget.
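As a rough illustration of how the trainer can resolve the active pass count, here is a minimal sketch based on the `(start_step, num_passes)` schedule printed in the training logs (`schedule=[(0, 1), (4500, 2), (5500, 3)]`); the helper name is ours, not the submission's actual API:

```python
# Minimal sketch, not the submission's code: resolve the number of recurrence
# passes active at a given training step from (start_step, num_passes) pairs,
# matching the schedule format printed in the training logs.
SCHEDULE = [(0, 1), (4500, 2), (5500, 3)]

def passes_for_step(step: int, schedule=SCHEDULE) -> int:
    passes = schedule[0][1]
    for start_step, num_passes in schedule:
        if step >= start_step:
            passes = num_passes  # latest phase whose start we have reached
    return passes

assert passes_for_step(0) == 1     # 1-pass phase
assert passes_for_step(4500) == 2  # 2-pass phase begins
assert passes_for_step(6000) == 3  # 3-pass phase until the wallclock cap
```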
The schedule: + +| Step range | Passes | Effective layers | step_avg | -|------------|--------|-----------------|----------| -| 0–4499 | 1 | 11 | ~83.5ms | -| 4500–5499 | 2 | 14 | ~85.5ms | -| 5500–6328 | 3 | 17 | ~91ms | +| ---------- | ------ | ---------------- | -------- | +| 0–4499 | 1 | 11 | ~83.5ms | +| 4500–5499 | 2 | 14 | ~85.5ms | +| 5500–6328 | 3 | 17 | ~91ms | + This reduces the step/capacity trade-off that normally makes recurrence impractical under competition constraints. We get ~6,330 training steps (vs ~7,180 for the flat LeakyReLU baseline), but the final model has 17 effective layers at eval vs the baseline's 11. -We also tested training with 4 recurrence passes. While 4-pass shows better per-step loss, the additional step time cost (~105ms/step) means fewer total steps within the wallclock budget. Under the competition's 600s constraint, **3-pass wins the step/capacity trade-off** — the extra training steps from the faster 3-pass schedule outweigh the marginal per-step quality gain from 4 passes. +We also tested training with 4 recurrence passes. While 4-pass shows better per-step loss, the additional step time cost (~105ms/step) means fewer total steps within the wallclock budget. Under the competition's 600s constraint, **3-pass wins the step/capacity trade-off**: the extra training steps from the faster 3-pass schedule outweigh the marginal per-step quality gain from 4 passes. ### Learnable Residual Scaling @@ -84,9 +88,9 @@ The feedback module is important but not strictly required — we confirmed that A regularization term penalizes hidden state growth ratio above 1.0, enforcing contractive dynamics without computing the full Jacobian: -$$\mathcal{L}_J = \lambda \cdot \mathrm{ReLU}\left(\frac{\|h_{k+1} - h_k\|}{\|h_k\| + \epsilon} - 1\right)^{2}$$ +$$\mathcal{L}_J = \lambda \cdot \mathrm{ReLU}\left(\frac{\|h_{k+1} - h_k\|}{\|h_k\| + \epsilon} - 1\right)^{2}$$ -with $\lambda = 0.01$. This is a cheap finite-difference proxy for the spectral norm of the Jacobian $\partial h_{k+1}/\partial h_k$, encouraging it to stay below 1 (contractive map). +with $\lambda = 0.01$. This is a cheap finite-difference proxy for the spectral norm of the Jacobian $\partial h_{k+1}/\partial h_k$, encouraging it to stay below 1 (contractive map). The model learns to adhere to this constraint quickly, and it does not seem to affect early training dynamics. However, we did see better results with $\lambda = 0.01$ than with $\lambda = 0.1$, potentially because the 0.1 restriction is too strong: with only 3× recurrence we don't always need contractive layers, but we do need the dynamics not to explode. This loss term is critical for training stability. **Without it, gradient norms and hidden state magnitudes explode** during the multi-pass phases, destabilizing training. The proxy loss keeps the recurrent dynamics well-behaved without the computational cost of full Jacobian computation. @@ -96,62 +100,69 @@ Note: the jacobian proxy loss is only added to the training loss — it does not Score-first legal TTT following [PR #461](https://github.com/openai/parameter-golf/pull/461): -1. Val tokens split into 1,893 non-overlapping 32K-token chunks +1. Val tokens split into 1,893 non-overlapping 32K-token chunks. Here 3-pass recurrence is vital, since with 4 passes we would have to increase the chunk size to fit within the time limit. 2. **For each chunk**: - - **SCORE**: Sliding window eval under `torch.inference_mode()` — no gradients, no weight mutation - - **TRAIN**: SGD on the already-scored chunk.
3 epochs, all blocks unfrozen, cosine LR decay, grad clip 1.0 + - **SCORE**: Sliding window eval under `torch.inference_mode()` — no gradients, no weight mutation + - **TRAIN**: SGD on the already-scored chunk. 3 epochs, all blocks unfrozen, cosine LR decay, grad clip 1.0 3. Last chunk scored but never trained on -| Parameter | Value | -|-----------|-------| -| Chunk size | 32,768 tokens | -| Optimizer | SGD + momentum(0.9) | -| Learning rate | 0.002 (cosine decay) | -| Epochs per chunk | 3 | -| Frozen blocks | None (all blocks adapt) | -| Gradient clip | 1.0 | -| Eval passes | 3 (matching final training phase) | + +| Parameter | Value | +| ---------------- | --------------------------------- | +| Chunk size | 32,768 tokens | +| Optimizer | SGD + momentum(0.9) | +| Learning rate | 0.002 (cosine decay) | +| Epochs per chunk | 3 | +| Frozen blocks | None (all blocks adapt) | +| Gradient clip | 1.0 | +| Eval passes | 3 (matching final training phase) | + ### Timing Budget -| Phase | Time | -|-------|------| -| Training (wallclock cap) | 600s (10 min) | -| Standard eval (int6 + sliding window) | ~3s | -| Legal TTT (score-first + adaptation) | ~578s | -| **Total eval** | **~581s (< 10 min)** | + +| Phase | Time | +| ------------------------------------- | -------------------- | +| Training (wallclock cap) | 600s (10 min) | +| Standard eval (int6 + sliding window) | ~3s | +| Legal TTT (score-first + adaptation) | ~578s | +| **Total eval** | **~581s (< 10 min)** | + ## Architecture Built on the [PR #414](https://github.com/openai/parameter-golf/pull/414) stack with [PR #399](https://github.com/openai/parameter-golf/pull/399) Parallel Muon: -| Component | Setting | -|-----------|---------| -| Layers | 11 unique (512d, 8H, 4KV) | -| Effective layers (eval) | 17 (4 stem + 3 core ×3 + 4 tail) | -| MLP | 3× with LeakyReLU(0.5)² | -| BigramHash | 512 | -| XSA | Last 4 layers | -| RoPE | Partial (16/64 dims) | -| LN Scale | 1/√(layer+1) | -| VE128 | Layers 9-10 | -| Recurrence core | Layers 4-6, progressive 1→2→3 passes | -| ResidualScale | Per-pass learnable, init 0.5 | -| Error Feedback | Diagonal mode, rank 2, 2560 params | -| Jacobian proxy | λ=0.01 | -| Weight avg | EMA(0.997) + SWA(every 50) | -| Quantization | Late QAT (threshold 0.15) + GPTQ-lite int6 + lzma | -| Warmup precompilation | All pass×QAT graph variants compiled during 20 warmup steps | -| Optimizer | Parameter Banking + Parallel Muon | + +| Component | Setting | +| ----------------------- | ----------------------------------------------------------- | +| Layers | 11 unique (512d, 8H, 4KV) | +| Effective layers (eval) | 17 (4 stem + 3 core ×3 + 4 tail) | +| MLP | 3× with LeakyReLU(0.5)² | +| BigramHash | 512 | +| XSA | Last 4 layers | +| RoPE | Partial (16/64 dims) | +| LN Scale | 1/√(layer+1) | +| VE128 | Layers 9-10 | +| Recurrence core | Layers 4-6, progressive 1→2→3 passes | +| ResidualScale | Per-pass learnable, init 0.5 | +| Error Feedback | Diagonal mode, rank 2, 2560 params | +| Jacobian proxy | λ=0.01 | +| Weight avg | EMA(0.997) + SWA(every 50) | +| Quantization | Late QAT (threshold 0.15) + GPTQ-lite int6 + lzma | +| Warmup precompilation | All pass×QAT graph variants compiled during 20 warmup steps | +| Optimizer | Parameter Banking + Parallel Muon | + ## Run Command ```bash -cd records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback +cd records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance bash run_earlyqat.sh # Single seed (set SEED env var) ``` Key flags: + 
 ## Architecture
 
 Built on the [PR #414](https://github.com/openai/parameter-golf/pull/414) stack with [PR #399](https://github.com/openai/parameter-golf/pull/399) Parallel Muon:
 
-| Component | Setting |
-|-----------|---------|
-| Layers | 11 unique (512d, 8H, 4KV) |
-| Effective layers (eval) | 17 (4 stem + 3 core ×3 + 4 tail) |
-| MLP | 3× with LeakyReLU(0.5)² |
-| BigramHash | 512 |
-| XSA | Last 4 layers |
-| RoPE | Partial (16/64 dims) |
-| LN Scale | 1/√(layer+1) |
-| VE128 | Layers 9-10 |
-| Recurrence core | Layers 4-6, progressive 1→2→3 passes |
-| ResidualScale | Per-pass learnable, init 0.5 |
-| Error Feedback | Diagonal mode, rank 2, 2560 params |
-| Jacobian proxy | λ=0.01 |
-| Weight avg | EMA(0.997) + SWA(every 50) |
-| Quantization | Late QAT (threshold 0.15) + GPTQ-lite int6 + lzma |
-| Warmup precompilation | All pass×QAT graph variants compiled during 20 warmup steps |
-| Optimizer | Parameter Banking + Parallel Muon |
+
+| Component | Setting |
+| ----------------------- | ----------------------------------------------------------- |
+| Layers | 11 unique (512d, 8H, 4KV) |
+| Effective layers (eval) | 17 (4 stem + 3 core ×3 + 4 tail) |
+| MLP | 3× with LeakyReLU(0.5)² |
+| BigramHash | 512 |
+| XSA | Last 4 layers |
+| RoPE | Partial (16/64 dims) |
+| LN Scale | 1/√(layer+1) |
+| VE128 | Layers 9-10 |
+| Recurrence core | Layers 4-6, progressive 1→2→3 passes |
+| ResidualScale | Per-pass learnable, init 0.5 |
+| Error Feedback | Diagonal mode, rank 2, 2560 params |
+| Jacobian proxy | λ=0.01 |
+| Weight avg | EMA(0.997) + SWA(every 50) |
+| Quantization | Late QAT (threshold 0.15) + GPTQ-lite int6 + lzma |
+| Warmup precompilation | All pass×QAT graph variants compiled during 20 warmup steps |
+| Optimizer | Parameter Banking + Parallel Muon |
+
 ## Run Command
 
 ```bash
-cd records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback
+cd records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance
 bash run_earlyqat.sh  # Single seed (set SEED env var)
 ```
 
 Key flags:
 
+
 ```bash
 torchrun --standalone --nproc_per_node=8 train_gpt.py \
   --feedback-mode diagonal --feedback-rank 2 \
@@ -180,6 +191,8 @@ The original training script was 88,253 bytes, which caused seed 2025 to exceed
 
 [python-minifier](https://github.com/dflook/python-minifier) with `--no-rename-locals` shrinks the code aggressively (whitespace, docstrings, constant folding) while preserving local variable names — critical because the training script uses string-based lookups for `state_dict` keys and `named_parameters`. This brought the file from 68,435 bytes down to **58,186 bytes**, comfortably fitting all seeds under the 16MB decimal limit.
 
+**Note:** The code was minified *after* all three seed runs completed, so the log files report `Code size: 88253 bytes` and correspondingly larger `Total submission size` values. The actual submission uses the minified 58,186-byte script — the correct per-seed totals are listed in `submission.json` and the results table above.
+
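For illustration, the equivalent call through the library's Python API might look like the following. This is a sketch under stated assumptions: we assume `python_minifier.minify` accepts `rename_locals=False` as the counterpart of the `--no-rename-locals` CLI flag, and the output file name is hypothetical.

```python
import python_minifier

with open("train_gpt.py") as f:
    source = f.read()

# Keep local variable names intact: the training script resolves state_dict
# keys and named_parameters by string, so renaming locals would break it.
minified = python_minifier.minify(source, rename_locals=False)

with open("train_gpt.min.py", "w") as f:
    f.write(minified)

print(f"{len(source)} -> {len(minified)} bytes")
```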
 ## Credits
 
 - **Base model**: [PR #414](https://github.com/openai/parameter-golf/pull/414) by @signalrush
@@ -187,3 +200,4 @@ The original training script was 88,253 bytes, which caused seed 2025 to exceed
 - **LeakyReLU² activation**: [PR #493](https://github.com/openai/parameter-golf/pull/493) by @parinzee
 - **TTT recipe**: [PR #461](https://github.com/openai/parameter-golf/pull/461) by @Christopher-Lee-McClendon
 - **Depth recurrence analysis**: [PR #363](https://github.com/openai/parameter-golf/pull/363) by @evangelinehelsinki
+
diff --git a/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/submission.json b/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/submission.json
similarity index 50%
rename from records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/submission.json
rename to records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/submission.json
index 1d334c9db5..fd2cccba00 100644
--- a/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/submission.json
+++ b/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/submission.json
@@ -3,19 +3,40 @@
   "val_bpb": 1.1163,
   "val_bpb_std": 0.0013,
   "bytes_total": 15995558,
+  "bytes_code": 58186,
   "seeds": [1337, 42, 2025],
   "seed_results": {
-    "1337": {"val_loss": 1.88647166, "val_bpb": 1.11727773, "bytes_model_int6_lzma": 15850832, "bytes_total": 15909018},
-    "42": {"val_loss": 1.88902995, "val_bpb": 1.11879290, "bytes_model_int6_lzma": 15839344, "bytes_total": 15897530},
-    "2025": {"val_loss": 1.89066597, "val_bpb": 1.12560312, "bytes_model_int6_lzma": 15937372, "bytes_total": 15995558}
+    "1337": {
+      "val_loss": 1.88375543,
+      "val_bpb": 1.11566902,
+      "pre_ttt_val_bpb": 1.1353,
+      "ttt_time_seconds": 565.6,
+      "steps": 6328,
+      "bytes_model_int6_lzma": 15850832,
+      "bytes_total": 15909018
+    },
+    "42": {
+      "val_loss": 1.88715720,
+      "val_bpb": 1.11768375,
+      "pre_ttt_val_bpb": 1.1372,
+      "ttt_time_seconds": 579.0,
+      "steps": 6334,
+      "bytes_model_int6_lzma": 15839344,
+      "bytes_total": 15897530
+    },
+    "2025": {
+      "val_loss": 1.88338589,
+      "val_bpb": 1.11545016,
+      "pre_ttt_val_bpb": 1.1351,
+      "ttt_time_seconds": 588.0,
+      "steps": 6334,
+      "bytes_model_int6_lzma": 15937372,
+      "bytes_total": 15995558
+    }
   },
-  "pre_quant_val_bpb": 1.1359,
-  "step_stop": 6332,
   "wallclock_seconds": 600,
-  "eval_time_seconds": 578,
-  "bytes_code": 58186,
   "blurb": "Progressive depth recurrence (1->2->3 passes) with error feedback + jacobian proxy stabilization. Late growth preserves fast step times for most of training, avoiding the step/capacity trade-off that makes naive recurrence impractical. 3-seed mean: 1.1163 (std 0.0013), -0.0031 vs PR #549 LeakyReLU baseline (1.1194). Built on PR #414 stack with Parallel Muon (PR #399). All artifacts under 16MB, all eval under 10 min.",
   "author": "nestamidavaine",
   "github_id": "nestamidavaine",
-  "date": "2026-03-26"
+  "date": "2026-04-01"
 }
diff --git a/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt.py b/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/train_gpt.py
similarity index 100%
rename from records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt.py
rename to records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/train_gpt.py
diff --git a/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/train_seed1337.log b/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/train_seed1337.log
similarity index 100%
rename from records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/train_seed1337.log
rename to records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/train_seed1337.log
diff --git a/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/train_seed2025.log b/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/train_seed2025.log
similarity index 100%
rename from records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/train_seed2025.log
rename to records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/train_seed2025.log
diff --git a/records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/train_seed42.log b/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/train_seed42.log
similarity index 100%
rename from records/track_non_record_16mb/2026-03-26_RecurrentSOTA_Feedback/train_seed42.log
rename to records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/train_seed42.log

From 947ae2f0e528d454250dd46069e36be1ef6e2d1b Mon Sep 17 00:00:00 2001
From: nesta
Date: Wed, 1 Apr 2026 20:24:28 +0000
Subject: [PATCH 22/23] Emphasize significant baseline beat under results table

---
 .../2026-03-26_Stable_Growing_Recurrance/README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/README.md b/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/README.md
index 6c2a5d73ac..994fdb7982 100644
--- a/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/README.md
+++ b/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/README.md
@@ -14,6 +14,7 @@ A non-record submission targeting significant improvement over [PR #549](https:/
 | 2025 | 83.4ms | 6,334 | 1.1351 | **1.1155** | -0.0197 | 588s | 15,995,558 |
 | **Mean** | **83.5ms** | **6,332** | **1.1359** | **1.1163 (std 0.0013)** | **-0.0196** | **~578s** | |
 
+We significantly beat the [PR #549](https://github.com/openai/parameter-golf/pull/549) LeakyReLU² baseline (1.1194 mean bpb) by **0.0031 bpb**, more than twice the seed std, with all three seeds below the baseline mean, achieving the goal we set out to reach.
 ## Progressive Recurrence Architecture

From c0db831d60211e582ddac1f33a385ef18ac941c9 Mon Sep 17 00:00:00 2001
From: nesta
Date: Wed, 1 Apr 2026 20:25:24 +0000
Subject: [PATCH 23/23] Add nats to baseline comparison

---
 .../2026-03-26_Stable_Growing_Recurrance/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/README.md b/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/README.md
index 994fdb7982..2249d50228 100644
--- a/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/README.md
+++ b/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/README.md
@@ -14,7 +14,7 @@ A non-record submission targeting significant improvement over [PR #549](https:/
 | 2025 | 83.4ms | 6,334 | 1.1351 | **1.1155** | -0.0197 | 588s | 15,995,558 |
 | **Mean** | **83.5ms** | **6,332** | **1.1359** | **1.1163 (std 0.0013)** | **-0.0196** | **~578s** | |
 
-We significantly beat the [PR #549](https://github.com/openai/parameter-golf/pull/549) LeakyReLU² baseline (1.1194 mean bpb) by **0.0031 bpb**, more than twice the seed std, with all three seeds below the baseline mean, achieving the goal we set out to reach.
+We significantly beat the [PR #549](https://github.com/openai/parameter-golf/pull/549) LeakyReLU² baseline (1.1194 mean bpb / 1.8901 nats) by **0.0031 bpb / 0.0053 nats** (1.1163 mean bpb / 1.8848 nats), more than twice the seed std, with all three seeds below the baseline mean, achieving the goal we set out to reach.
 
 ## Progressive Recurrence Architecture
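As a quick sanity check on the bpb/nats pairs added in the patch above: the two units differ only by a fixed factor of ln 2 times the dataset's bytes-per-token ratio. The ratio below is inferred from the reported numbers, not read from the repo.

```python
import math

# Reported pairs: ours 1.1163 bpb <-> 1.8848 nats; baseline 1.1194 bpb <-> 1.8901 nats.
# bpb = nats / (ln 2 * bytes_per_token), so the implied tokenizer ratio is:
bytes_per_token = 1.8848 / (1.1163 * math.log(2))  # ~2.436 bytes/token

for label, nats in (("ours", 1.8848), ("baseline", 1.8901)):
    print(f"{label}: {nats:.4f} nats -> {nats / (math.log(2) * bytes_per_token):.4f} bpb")

print(f"delta: {1.8901 - 1.8848:.4f} nats = {1.1194 - 1.1163:.4f} bpb")
```

Both reported pairs round to the same bytes-per-token ratio, so the bpb and nats deltas in the README are mutually consistent.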