diff --git a/records/track_10min_16mb/2026-04-17_CasefoldV4_AttnOutGate_PhasedTTT/README.md b/records/track_10min_16mb/2026-04-17_CasefoldV4_AttnOutGate_PhasedTTT/README.md new file mode 100644 index 0000000000..838ecd47be --- /dev/null +++ b/records/track_10min_16mb/2026-04-17_CasefoldV4_AttnOutGate_PhasedTTT/README.md @@ -0,0 +1,198 @@ +# Record: Casefold V4 + Attention Output Gate + Multi-Phase Global SGD TTT + +**val_bpb: 1.05733** (3-seed mean, std 0.00035) | **3.04721 nats** | **~15.21 MB** | 8xH100 SXM, 600s | Phased TTT + +## Summary + +Stacks per-head **Attention Output Gate** (from PR #1667 @MarioPaerle) on top of our Casefold V4 + Multi-Phase Global SGD TTT record (PR #1670). The gate is weight-initialized to zero (identity at init) and adds 1,056 parameters total (12 x 8 heads x 11 layers). Combined with SmearGate (input-dependent per-channel mixer), these architectural additions are orthogonal to the casefold tokenizer and the phased TTT protocol, yielding a clean -0.00237 BPB improvement over PR #1670. + +**Note:** Casefold tokenizer normalization is a novel technique pending organizer review at Issue #1604. The tokenizer itself is retrained from scratch on casefolded data -- it is NOT a modified version of the standard SP8192 tokenizer. This submission is offered for evaluation under that pending ruling. The Attention Output Gate and SmearGate are pure architectural additions and do not depend on Issue #1604. + +## Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128, Phased TTT) + +### Core Results + +| Seed | Steps | ms/step | Pre-TTT BPB | **Post-TTT BPB** | TTT gain | TTT time | Artifact | +|------|-------|---------|-------------|------------------|----------|----------|----------| +| 42 | 4902 | 121.6 | 1.06633 | **1.05693** | -0.00940 | 351s | 15,936,269 | +| 0 | 4883 | 122.1 | 1.06674 | **1.05730** | -0.00944 | 347s | 15,937,514 | +| 1234 | 4906 | 121.5 | 1.06714 | **1.05777** | -0.00937 | 307s | 15,938,772 | +| **Mean** | **4897** | **121.7** | **1.06674** | **1.05733** | **-0.00940** | **335s** | **15,937,518** | +| **Std** | | | | **0.00035** | | | | + +### Supplemental Diagnostics + +| Seed | Post-EMA BPB | Quantized BPB | Post-TTT BPB | val_loss (nats) | Code size | Total submission | Train time | Eval time | +|------|-------------|---------------|--------------|-----------------|-----------|-----------------|------------|-----------| +| 42 | 1.05634 | 1.06633 | 1.05693 | 3.04604 | 124,826 B | 15,936,269 B | 596.1s | 350.9s | +| 0 | 1.05652 | 1.06674 | 1.05730 | 3.04712 | 124,826 B | 15,937,514 B | 596.1s | 347.3s | +| 1234 | 1.05707 | 1.06714 | 1.05777 | 3.04846 | 124,826 B | 15,938,772 B | 596.2s | 306.7s | + +### Record Comparison + +| Submission | val_bpb | val_loss (nats) | Delta BPB | Delta nats | +|------------|---------|-----------------|-----------|------------| +| Merged SOTA (PR #1493) | 1.08100 | - | - | - | +| PR #1530 @samacqua | 1.07336 | - | - | - | +| PR #1585 @codemath3000 (casefold leader) | 1.06390 | - | - | - | +| PR #1667 @MarioPaerle (AttnOutGate + SmearGate) | 1.07139 | - | - | - | +| PR #1670 @dexhunter (casefold v4 + phased TTT) | 1.05970 | 3.05401 | - | - | +| **This (3-seed)** | **1.05733** | **3.04721** | **-0.00657 vs #1585** | **-0.01697 vs #1585** | + +Clears the 0.005-nat record threshold vs the casefold leader (PR #1585) by 3.4x. Improves on PR #1670 by -0.00237 BPB (-0.00680 nats). + +## Key Innovations + +### 1. Attention Output Gate (from PR #1667 @MarioPaerle) + +Lightweight per-head multiplicative gate on the attention output. 
Weight-initialized to zero (so at init, all heads pass through at scale 1.0). Implemented as an inline-safe function with `.contiguous()` barriers so it works under fullgraph torch.compile:
+
+```python
+def _apply_attn_out_gate_inline(y, x_orig, gate_w):
+    """Inline-safe version: .contiguous() barriers prevent over-aggressive kernel fusion."""
+    gate_in = x_orig[:, :, :12].contiguous()
+    gate = (2.0 * torch.sigmoid(F.linear(gate_in, gate_w.to(gate_in.dtype)))).contiguous()
+    return y * gate.unsqueeze(-1)
+```
+
+- Total new parameters: 12 x 8 heads = 96 weights per layer x 11 layers = **1,056 parameters**
+- Applied in all three attention paths: standard, parallel-residual, and depth-recurrent
+- Negligible throughput cost (<2%)
+
+### 2. SmearGate (input-dependent per-channel mixer)
+
+Input-dependent SmearGate applied once to the residual stream, after the embedding norm and before the first attention block:
+
+```python
+def _apply_smear_gate_inline(x, smear_w, smear_lambda):
+    prev_x = torch.zeros_like(x)
+    prev_x[:, 1:] = x[:, :-1]
+    gate_in = x[:, :, :12].contiguous()
+    gate = torch.sigmoid(F.linear(gate_in, smear_w.to(x.dtype).unsqueeze(0))).contiguous()
+    return x + smear_lambda.to(x.dtype) * gate * prev_x
+```
+
+- Total new parameters: 12 gate weights + 1 scalar lambda = **13 parameters**
+- Zero-initialized smear_lambda (so at init, the residual stream passes through unchanged)
+
+### 3. Casefold V4 Tokenizer (from PR #1670)
+
+All input text is lowercased (casefolded) offline before SP8192 BPE retraining. Both train and validation shards are retokenized with the casefolded tokenizer, and BOS tokens are preserved. Byte-level BPB is computed over the original (non-casefolded) validation bytes through the sentencepiece piece table.
+
+### 4. Multi-Phase Global SGD TTT (from PR #1670 / PR #1610 concept)
+
+Score-first SGD adaptation on 2000 prefix documents split into 3 phases (boundaries [666, 1333, 2000]). Each phase fully scores its prefix under `torch.no_grad()` before any SGD update.
+
+## Changes from PR #1670 (Casefold V4 baseline)
+
+| Aspect | PR #1670 (base) | This submission |
+|--------|----------------|-----------------|
+| AttnOutGate | Off | **On (width=12, per-head, all 11 layers, zero-init)** |
+| SmearGate | Off | **On (width=12, zero-init lambda)** |
+| val_bpb | 1.05970 | **1.05733 (-0.00237)** |
+| val_loss (nats) | 3.05401 | 3.04721 (-0.00680) |
+| Artifact | ~15.20 MB | ~15.21 MB |
+| Tokenizer | Casefold V4 (retrained SP8192 on lowercased text) | Same |
+| TTT | Multi-Phase Global SGD (3 phases, 2000 prefix docs) | Same |
+| Code size (uncompressed) | 122,604 B | 124,826 B |
+| Code size (compressed) | ~28 KB | 28,060 B |
+
+## Architecture
+
+11L x 512d x 8H / 4KV, MLP 4x, LeakyReLU(0.5)^2, Partial RoPE (16/64 dims), layerwise LN scale, tied embeddings, logit softcap=30.0. Depth recurrence: encoder [0,1,2,3,4,5,3,4] decoder [5,3,4,5,6,7,8,9,10] (loops layers 3-5, activated at step frac=0.35). Parallel residuals from layer 8. Skip gates (sigmoid-gated U-Net connections). EMA decay 0.9965.
+
+**New this submission:** Per-head Attention Output Gate (12 x 8 heads per layer, zero-init, 11 layers). Residual-stream SmearGate (width 12, zero-init lambda).
+
+### Training
+
+MuonEq-R optimizer (row-normalized Muon, Newton-Schulz 5 steps, momentum 0.97), AdamW for embeddings/scalars. Gradient clip 0.3. ~4897 steps in 596s on 8xH100 SXM. Linear warmdown over final 75% of training.
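+
+The schedule itself is simple enough to state directly. A minimal sketch, assuming the shape implied above (the 20-step warmup from `WARMUP_STEPS`, a constant plateau, then a linear ramp toward `MIN_LR` over the final 75% of steps); `lr_scale` is an illustrative name, not a function in `train_gpt.py`:
+
+```python
+def lr_scale(step, total_steps, warmup_steps=20, warmdown_frac=0.75, min_frac=0.0):
+    """Multiplier applied to each param group's base_lr at a given step."""
+    if step < warmup_steps:
+        return (step + 1) / warmup_steps  # linear warmup
+    warmdown_start = int(total_steps * (1.0 - warmdown_frac))
+    if step < warmdown_start:
+        return 1.0  # constant plateau
+    # linear warmdown over the final warmdown_frac of training
+    progress = (step - warmdown_start) / max(1, total_steps - warmdown_start)
+    return 1.0 - (1.0 - min_frac) * progress
+```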
+
+### Quantization
+
+Full-Hessian GPTQ with SDClip: int6 for attention/MLP matrices (clip=12.85 sigma for attention, 12.0 sigma for MLP), int7 for token embeddings (clip=15.0 sigma). Brotli-11 compression. Trimmed GPTQ (reserve=4s, calibration=16 batches). The new AttnOutGate and SmearGate parameters (all scalar-like) are kept in float16 passthrough.
+
+### TTT (Test-Time Training)
+
+Multi-Phase Global SGD with score-first ordering:
+- 3 phases, each adapting on a growing prefix of 2000 validation documents
+- Phase boundaries: [666, 1333, 2000] documents
+- Per phase: score all sliding windows under `torch.no_grad()`, then SGD update
+- SGD: lr=0.001, momentum=0.9, gradient clipping at 1.0
+- Plus per-doc LoRA TTT (rank=96, lr=0.0001, chunk=48, batch size 64) for the suffix documents
+- Total TTT eval time: ~335s (within 600s eval budget)
+
+## Rule Compliance
+
+Per Issue #1017 (Track B -- legal eval-time adaptation):
+
+- **Condition 1 (Causality):** Sliding-window eval is strictly causal. Each position scored from prefix tokens only. AttnOutGate and SmearGate are both purely positional-local -- AttnOutGate multiplies the attention output by a sigmoid of the current token's first 12 channels; SmearGate mixes the current token with the previous token (strictly backward-looking).
+- **Condition 2 (Normalized distribution):** Standard softmax over full vocab. No n-gram cache, no logit biasing. Gates modulate hidden states only, not logits.
+- **Condition 3 (Score before update):** Each phase fully scored under `torch.no_grad()` BEFORE any SGD update. Training only on already-scored tokens (see the sketch below).
+- **Condition 4 (Single pass):** Each token scored exactly once per phase. Final scores from last phase only.
+
+**Casefold tokenizer normalization:** Novel technique. The tokenizer is retrained on casefolded (lowercased) text. Organizer review is pending at Issue #1604. The technique does not violate any of the four conditions above -- it only changes the tokenizer vocabulary, not the scoring or adaptation procedure. The byte-level BPB computation remains correct: each sentencepiece token maps to its constituent bytes via the piece table, and BPB is computed over all bytes in the validation set.
+
+**Attention Output Gate / SmearGate:** Pure architectural additions (training-time learned parameters). No eval-time effect beyond the trained weights. Fully legal under all Issue #1017 conditions; analogous gating constructs have precedent in SmearGate (modded-nanogpt), skip gates (PR #549 family), and parallel-lane gating (PR #1204 family).
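+
+To make Conditions 3 and 4 concrete, here is a minimal sketch of one score-first phase. `run_phase`, `windows`, and the per-token-loss `model(...)` signature are hypothetical names for illustration; the real loop lives in `train_gpt.py`:
+
+```python
+import torch
+
+def run_phase(model, windows, optimizer, grad_clip=1.0):
+    # Condition 3: score every window in this phase before any update.
+    per_token_losses = []
+    with torch.no_grad():
+        for inputs, targets in windows:
+            per_token_losses.append(model(inputs, targets))
+    # Adaptation runs only on the tokens scored above; it can influence
+    # scoring only in later phases (final scores come from the last phase).
+    for inputs, targets in windows:
+        loss = model(inputs, targets).mean()
+        optimizer.zero_grad(set_to_none=True)
+        loss.backward()
+        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
+        optimizer.step()
+    return per_token_losses
+```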
+ +Additional compliance: +- No SLOT (standard or causal) +- No pre-quant TTT on val data (model quantized once during training, TTT adapts at eval time) +- No ETLB (eval-time logit bias) +- No n-gram cache or tilt +- All artifacts under 16,000,000 bytes on all 3 seeds (max 15,938,772 B) +- Training under 600s on all seeds (596.1-596.2s actual) +- Eval (phased TTT) under 600s on all seeds (~307-351s actual) + +## Requirements + +- Python >= 3.12 +- PyTorch >= 2.9.1 +- flash-attn-3 +- brotli +- sentencepiece + +## Run Command + +```bash +# Install dependencies +pip install brotli sentencepiece +pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/ + +# Prepare casefolded data (offline, before training) +# Casefolded shards expected at DATA_DIR +# Contains: datasets/fineweb10B_sp8192/*.bin, tokenizers/fineweb_8192_bpe.model + +# 3-seed evaluation loop +for SEED in 42 0 1234; do + DATA_DIR=/path/to/casefold_data/ SEED=$SEED \ + ATTN_OUT_GATE=1 SMEAR_GATE=1 \ + PHASED_TTT_ENABLED=1 PHASED_TTT_NUM_PHASES=3 PHASED_TTT_PREFIX_DOCS=2000 \ + GLOBAL_TTT_LR=0.001 GLOBAL_TTT_MOMENTUM=0.9 GLOBAL_TTT_GRAD_CLIP=1.0 \ + GLOBAL_TTT_CHUNK_TOKENS=32768 GLOBAL_TTT_BATCH_SEQS=32 \ + torchrun --standalone --nproc_per_node=8 train_gpt.py \ + 2>&1 | tee train_seed${SEED}.log +done +``` + +## Lineage + +PR #1530 (@samacqua) -> PR #1626 (@dexhunter, multi-phase SGD TTT) -> PR #1670 (@dexhunter, casefold v4 + phased TTT) -> this PR (+ AttnOutGate from PR #1667 @MarioPaerle + SmearGate) + +## Credits + +- **@samacqua** -- PR #1530 base architecture (11L/512d/4x MLP, depth recurrence, parallel residuals, MuonEq-R, GPTQ SDClip, VarLen attention, fused MLP) +- **@MarioPaerle** -- Attention Output Gate (PR #1667), SmearGate reintroduction to parameter-golf +- **@kellerjordan** -- SmearGate concept (originally from modded-nanogpt) +- **@mikeapedia** -- Casefold tokenizer concept (PR #1578) +- **@romeerp** -- Phased TTT concept (PR #1610) +- **@abaybektursun** -- Score-first TTT framework (PR #549, merged precedent) +- **@dexhunter** -- Casefold V4 retokenization + BOS fix, multi-phase global SGD TTT, trimmed GPTQ tuning, inline-safe gate implementation compatible with fullgraph torch.compile + +## Included Files + +- `README.md` (this file) +- `submission.json` +- `train_gpt.py` +- `train_seed42.log` +- `train_seed0.log` +- `train_seed1234.log` diff --git a/records/track_10min_16mb/2026-04-17_CasefoldV4_AttnOutGate_PhasedTTT/submission.json b/records/track_10min_16mb/2026-04-17_CasefoldV4_AttnOutGate_PhasedTTT/submission.json new file mode 100644 index 0000000000..56d7db1018 --- /dev/null +++ b/records/track_10min_16mb/2026-04-17_CasefoldV4_AttnOutGate_PhasedTTT/submission.json @@ -0,0 +1,40 @@ +{ + "author": "dexhunter", + "github_id": "dexhunter", + "name": "Casefold V4 + Attention Output Gate + Multi-Phase Global SGD + Phased TTT", + "date": "2026-04-17", + "track": "10min_16mb", + "val_bpb": 1.05733, + "val_bpb_std": 0.00035, + "val_loss_nats": 3.04721, + "seeds": [42, 0, 1234], + "seed_results": { + "42": {"val_bpb": 1.05693, "val_loss": 3.04604, "artifact_bytes": 15936269}, + "0": {"val_bpb": 1.05730, "val_loss": 3.04712, "artifact_bytes": 15937514}, + "1234": {"val_bpb": 1.05777, "val_loss": 3.04846, "artifact_bytes": 15938772} + }, + "hardware": "8xH100 80GB SXM", + "pytorch_version": "2.9.1+cu128", + "technique_summary": "Casefold V4 tokenizer normalization (retrained SP8192 on lowercased text) + per-head Attention Output Gate (zero-init, 11 layers) + 
SmearGate + 3-layer depth recurrence (L3-5) + parallel residuals (L8+) + multi-phase global SGD TTT (3 phases, 2000 prefix docs) + per-doc LoRA TTT + full-Hessian GPTQ int6/int7 + Brotli-11", + "compliance": { + "train_under_600s": true, + "artifact_under_16mb": true, + "eval_under_600s": true, + "no_slot": true, + "no_pre_quant_ttt": true, + "no_etlb": true, + "no_ngram_cache": true, + "score_first_ttt": true, + "three_seeds": true, + "casefold_pending_review": "Issue #1604" + }, + "attribution": { + "base_architecture": "@samacqua (PR #1530)", + "attn_output_gate": "@MarioPaerle (PR #1667)", + "smear_gate_concept": "@kellerjordan (modded-nanogpt), @MarioPaerle (PR #1667 reintroduction)", + "casefold_concept": "@mikeapedia (PR #1578)", + "phased_ttt_concept": "@romeerp (PR #1610)", + "score_first_ttt_framework": "@abaybektursun (PR #549)", + "casefold_v4_retokenization_and_multiphase_ttt": "@dexhunter (PR #1626, PR #1670)" + } +} diff --git a/records/track_10min_16mb/2026-04-17_CasefoldV4_AttnOutGate_PhasedTTT/train_gpt.py b/records/track_10min_16mb/2026-04-17_CasefoldV4_AttnOutGate_PhasedTTT/train_gpt.py new file mode 100644 index 0000000000..7c7dbf6c5f --- /dev/null +++ b/records/track_10min_16mb/2026-04-17_CasefoldV4_AttnOutGate_PhasedTTT/train_gpt.py @@ -0,0 +1,3030 @@ +import base64, collections, copy, fcntl, glob, io, json, lzma, math, os +from pathlib import Path +import random, re, subprocess, sys, time, uuid, numpy as np, sentencepiece as spm, torch, torch.distributed as dist, torch.nn.functional as F +from torch import nn +from flash_attn_interface import ( + flash_attn_func as flash_attn_3_func, + flash_attn_varlen_func, +) +from concurrent.futures import ThreadPoolExecutor +import triton +import triton.language as tl +from triton.tools.tensor_descriptor import TensorDescriptor + + +class Hyperparameters: + data_dir = os.environ.get("DATA_DIR", "./data/") + seed = int(os.environ.get("SEED", 1337)) + run_id = os.environ.get("RUN_ID", str(uuid.uuid4())) + iterations = int(os.environ.get("ITERATIONS", 20000)) + warmdown_frac = float(os.environ.get("WARMDOWN_FRAC", 0.75)) + warmup_steps = int(os.environ.get("WARMUP_STEPS", 20)) + train_batch_tokens = int(os.environ.get("TRAIN_BATCH_TOKENS", 786432)) + train_seq_len = int(os.environ.get("TRAIN_SEQ_LEN", 2048)) + train_log_every = int(os.environ.get("TRAIN_LOG_EVERY", 500)) + max_wallclock_seconds = float(os.environ.get("MAX_WALLCLOCK_SECONDS", 6e2)) + val_batch_tokens = int(os.environ.get("VAL_BATCH_TOKENS", 524288)) + eval_seq_len = int(os.environ.get("EVAL_SEQ_LEN", 2048)) + val_loss_every = int(os.environ.get("VAL_LOSS_EVERY", 4000)) + sliding_window_enabled = bool(int(os.environ.get("SLIDING_WINDOW_ENABLED", "0"))) + vocab_size = int(os.environ.get("VOCAB_SIZE", 8192)) + num_layers = int(os.environ.get("NUM_LAYERS", 11)) + xsa_last_n = int(os.environ.get("XSA_LAST_N", 11)) + model_dim = int(os.environ.get("MODEL_DIM", 512)) + num_kv_heads = int(os.environ.get("NUM_KV_HEADS", 4)) + num_heads = int(os.environ.get("NUM_HEADS", 8)) + mlp_mult = float(os.environ.get("MLP_MULT", 4.0)) + skip_gates_enabled = bool(int(os.environ.get("SKIP_GATES_ENABLED", "1"))) + tie_embeddings = bool(int(os.environ.get("TIE_EMBEDDINGS", "1"))) + logit_softcap = float(os.environ.get("LOGIT_SOFTCAP", 3e1)) + rope_base = float(os.environ.get("ROPE_BASE", 1e4)) + rope_dims = int(os.environ.get("ROPE_DIMS", 16)) + rope_train_seq_len = int(os.environ.get("ROPE_TRAIN_SEQ_LEN", 2048)) + rope_yarn = bool(int(os.environ.get("ROPE_YARN", "0"))) + ln_scale = 
bool(int(os.environ.get("LN_SCALE", "1"))) + qk_gain_init = float(os.environ.get("QK_GAIN_INIT", 5.0)) + num_loops = int(os.environ.get("NUM_LOOPS", 2)) + loop_start = int(os.environ.get("LOOP_START", 3)) + loop_end = int(os.environ.get("LOOP_END", 5)) + enable_looping_at = float(os.environ.get("ENABLE_LOOPING_AT", 0.35)) + parallel_start_layer = int(os.environ.get("PARALLEL_START_LAYER", 8)) + parallel_final_lane = os.environ.get("PARALLEL_FINAL_LANE", "mean") + min_lr = float(os.environ.get("MIN_LR", 0.0)) + embed_lr = float(os.environ.get("EMBED_LR", 0.6)) + tied_embed_lr = float(os.environ.get("TIED_EMBED_LR", 0.03)) + tied_embed_init_std = float(os.environ.get("TIED_EMBED_INIT_STD", 0.005)) + matrix_lr = float(os.environ.get("MATRIX_LR", 0.026)) + scalar_lr = float(os.environ.get("SCALAR_LR", 0.02)) + muon_momentum = float(os.environ.get("MUON_MOMENTUM", 0.97)) + muon_backend_steps = int(os.environ.get("MUON_BACKEND_STEPS", 5)) + muon_momentum_warmup_start = float( + os.environ.get("MUON_MOMENTUM_WARMUP_START", 0.92) + ) + muon_momentum_warmup_steps = int(os.environ.get("MUON_MOMENTUM_WARMUP_STEPS", 1500)) + muon_row_normalize = bool(int(os.environ.get("MUON_ROW_NORMALIZE", "1"))) + beta1 = float(os.environ.get("BETA1", 0.9)) + beta2 = float(os.environ.get("BETA2", 0.95)) + adam_eps = float(os.environ.get("ADAM_EPS", 1e-08)) + grad_clip_norm = float(os.environ.get("GRAD_CLIP_NORM", 0.3)) + eval_stride = int(os.environ.get("EVAL_STRIDE", 64)) + adam_wd = float(os.environ.get("ADAM_WD", 0.02)) + muon_wd = float(os.environ.get("MUON_WD", 0.095)) + embed_wd = float(os.environ.get("EMBED_WD", 0.085)) + ema_decay = float(os.environ.get("EMA_DECAY", 0.9965)) + ttt_enabled = bool(int(os.environ.get("TTT_ENABLED", "1"))) + ttt_lora_rank = int(os.environ.get("TTT_LORA_RANK", 96)) + ttt_lora_lr = float(os.environ.get("TTT_LORA_LR", 0.0001)) + ttt_chunk_size = int(os.environ.get("TTT_CHUNK_SIZE", 48)) + ttt_eval_seq_len = int(os.environ.get("TTT_EVAL_SEQ_LEN", 2048)) + ttt_batch_size = int(os.environ.get("TTT_BATCH_SIZE", 64)) + ttt_grad_steps = int(os.environ.get("TTT_GRAD_STEPS", 1)) + ttt_weight_decay = float(os.environ.get("TTT_WEIGHT_DECAY", 0.5)) + ttt_beta1 = float(os.environ.get("TTT_BETA1", 0)) + ttt_beta2 = float(os.environ.get("TTT_BETA2", 0.999)) + ttt_k_lora = bool(int(os.environ.get("TTT_K_LORA", "1"))) + ttt_mlp_lora = bool(int(os.environ.get("TTT_MLP_LORA", "1"))) + ttt_o_lora = bool(int(os.environ.get("TTT_O_LORA", "1"))) + ttt_optimizer = os.environ.get("TTT_OPTIMIZER", "adam") + ttt_eval_batches = os.environ.get("TTT_EVAL_BATCHES", "") + val_doc_fraction = float(os.environ.get("VAL_DOC_FRACTION", 1.0)) + compressor = os.environ.get("COMPRESSOR", "brotli") + gptq_calibration_batches = int(os.environ.get("GPTQ_CALIBRATION_BATCHES", 16)) + gptq_reserve_seconds = float(os.environ.get("GPTQ_RESERVE_SECONDS", 4.0)) + phased_ttt_enabled = bool(int(os.environ.get("PHASED_TTT_ENABLED", "0"))) + phased_ttt_prefix_docs = int(os.environ.get("PHASED_TTT_PREFIX_DOCS", 2000)) + phased_ttt_num_phases = int(os.environ.get("PHASED_TTT_NUM_PHASES", 1)) + global_ttt_lr = float(os.environ.get("GLOBAL_TTT_LR", 0.001)) + global_ttt_momentum = float(os.environ.get("GLOBAL_TTT_MOMENTUM", 0.9)) + global_ttt_epochs = int(os.environ.get("GLOBAL_TTT_EPOCHS", 1)) + global_ttt_chunk_tokens = int(os.environ.get("GLOBAL_TTT_CHUNK_TOKENS", 32768)) + global_ttt_batch_seqs = int(os.environ.get("GLOBAL_TTT_BATCH_SEQS", 32)) + global_ttt_warmup_start_lr = float(os.environ.get("GLOBAL_TTT_WARMUP_START_LR", 0.0)) + 
global_ttt_warmup_chunks = int(os.environ.get("GLOBAL_TTT_WARMUP_CHUNKS", 0)) + global_ttt_grad_clip = float(os.environ.get("GLOBAL_TTT_GRAD_CLIP", 1.0)) + global_ttt_respect_doc_boundaries = bool(int(os.environ.get("GLOBAL_TTT_RESPECT_DOC_BOUNDARIES", "1"))) + matrix_bits = int(os.environ.get("MATRIX_BITS", 6)) + embed_bits = int(os.environ.get("EMBED_BITS", 8)) + matrix_clip_sigmas = float(os.environ.get("MATRIX_CLIP_SIGMAS", 12.85)) + embed_clip_sigmas = float(os.environ.get("EMBED_CLIP_SIGMAS", 2e1)) + mlp_clip_sigmas = float(os.environ.get("MLP_CLIP_SIGMAS", 10.0)) + attn_clip_sigmas = float(os.environ.get("ATTN_CLIP_SIGMAS", 13.0)) + smear_gate = bool(int(os.environ.get("SMEAR_GATE", "0"))) + attn_out_gate = bool(int(os.environ.get("ATTN_OUT_GATE", "0"))) + distributed = "RANK" in os.environ and "WORLD_SIZE" in os.environ + rank = int(os.environ.get("RANK", "0")) + world_size = int(os.environ.get("WORLD_SIZE", "1")) + local_rank = int(os.environ.get("LOCAL_RANK", "0")) + is_main_process = rank == 0 + grad_accum_steps = 8 // world_size + datasets_dir = os.path.join(data_dir, "datasets", f"fineweb10B_sp{vocab_size}") + train_files = os.path.join(datasets_dir, "fineweb_train_*.bin") + val_files = os.path.join(datasets_dir, "fineweb_val_*.bin") + tokenizer_path = os.path.join( + data_dir, "tokenizers", f"fineweb_{vocab_size}_bpe.model" + ) + artifact_dir = os.environ.get("ARTIFACT_DIR", "") + logfile = ( + os.path.join(artifact_dir, f"{run_id}.txt") + if artifact_dir + else f"logs/{run_id}.txt" + ) + model_path = ( + os.path.join(artifact_dir, "final_model.pt") + if artifact_dir + else "final_model.pt" + ) + quantized_model_path = ( + os.path.join(artifact_dir, "final_model.int6.ptz") + if artifact_dir + else "final_model.int6.ptz" + ) + + +_logger_hparams = None + + +def set_logging_hparams(h): + global _logger_hparams + _logger_hparams = h + + +def log(msg, console=True): + if _logger_hparams is None: + print(msg) + return + if _logger_hparams.is_main_process: + if console: + print(msg) + if _logger_hparams.logfile is not None: + with open(_logger_hparams.logfile, "a", encoding="utf-8") as f: + print(msg, file=f) + + +class ValidationData: + def __init__(self, h, device): + self.sp = spm.SentencePieceProcessor(model_file=h.tokenizer_path) + if int(self.sp.vocab_size()) != h.vocab_size: + raise ValueError( + f"VOCAB_SIZE={h.vocab_size} does not match tokenizer vocab_size={int(self.sp.vocab_size())}" + ) + self.val_tokens = load_validation_tokens(h.val_files, h.eval_seq_len) + ( + self.base_bytes_lut, + self.has_leading_space_lut, + self.is_boundary_token_lut, + ) = build_sentencepiece_luts(self.sp, h.vocab_size, device) + + +def build_sentencepiece_luts(sp, vocab_size, device): + sp_vocab_size = int(sp.vocab_size()) + assert ( + sp.piece_to_id("▁") != sp.unk_id() + ), "Tokenizer must have '▁' (space) as its own token for correct BPB byte counting" + table_size = max(sp_vocab_size, vocab_size) + base_bytes_np = np.zeros((table_size,), dtype=np.int16) + has_leading_space_np = np.zeros((table_size,), dtype=np.bool_) + is_boundary_token_np = np.ones((table_size,), dtype=np.bool_) + for token_id in range(sp_vocab_size): + if sp.is_control(token_id) or sp.is_unknown(token_id) or sp.is_unused(token_id): + continue + is_boundary_token_np[token_id] = False + if sp.is_byte(token_id): + base_bytes_np[token_id] = 1 + continue + piece = sp.id_to_piece(token_id) + if piece.startswith("▁"): + has_leading_space_np[token_id] = True + piece = piece[1:] + base_bytes_np[token_id] = 
len(piece.encode("utf-8"))
+    return (
+        torch.tensor(base_bytes_np, dtype=torch.int16, device=device),
+        torch.tensor(has_leading_space_np, dtype=torch.bool, device=device),
+        torch.tensor(is_boundary_token_np, dtype=torch.bool, device=device),
+    )
+
+
+def load_validation_tokens(pattern, seq_len):
+    files = [Path(p) for p in sorted(glob.glob(pattern))]
+    if not files:
+        raise FileNotFoundError(f"No files found for pattern: {pattern}")
+    tokens = torch.cat([load_data_shard(file) for file in files]).contiguous()
+    usable = (tokens.numel() - 1) // seq_len * seq_len
+    if usable <= 0:
+        raise ValueError(f"Validation split is too short for TRAIN_SEQ_LEN={seq_len}")
+    return tokens[: usable + 1]
+
+
+def load_data_shard(file):
+    # Shard layout (assumed, following the modded-nanogpt .bin convention):
+    # a 256-word int32 header with the token count at index 2, then uint16 tokens.
+    header_bytes = 256 * np.dtype("<i4").itemsize
+    header = np.fromfile(file, dtype="<i4", count=256)
+    num_tokens = int(header[2])
+    tokens = torch.empty(num_tokens, dtype=torch.uint16)
+    with file.open("rb", buffering=0) as f:
+        f.seek(header_bytes)
+        nbytes = f.readinto(tokens.numpy())
+    assert nbytes == 2 * num_tokens, "number of token bytes read does not match header"
+    return tokens
+
+
+def _read_num_tokens(file):
+    # Token count from the shard header, without loading the payload.
+    return int(np.fromfile(file, dtype="<i4", count=256)[2])
+
+
+_shard_memmaps = {}
+
+
+def _get_shard_memmap(file):
+    # Cached read-only memmap over a shard's uint16 token payload.
+    if file not in _shard_memmaps:
+        _shard_memmaps[file] = np.memmap(
+            file,
+            dtype="<u2",
+            mode="r",
+            offset=256 * np.dtype("<i4").itemsize,
+            shape=(_read_num_tokens(file),),
+        )
+    return _shard_memmaps[file]
+
+
+def get_next_multiple_of_n(value, n):
+    return ((value + n - 1) // n) * n
+
+
+BOS_ID = None
+
+
+def _build_cu_seqlens(doc_starts, total_len, device, max_doc_len, bucket_size):
+    # Split each document into segments of at most max_doc_len tokens, then pad
+    # the boundary list to a bucketed length for stable shapes under compile.
+    starts = list(doc_starts)
+    if not starts or starts[0] != 0:
+        starts = [0] + starts
+    ends = starts[1:] + [total_len]
+    seg_starts = []
+    for start, end in zip(starts, ends):
+        if max_doc_len > 0:
+            pos = start
+            while pos < end:
+                seg_starts.append(pos)
+                pos += max_doc_len
+        else:
+            seg_starts.append(start)
+    boundaries = seg_starts + [total_len]
+    padded_len = get_next_multiple_of_n(len(boundaries), bucket_size)
+    cu = torch.full((padded_len,), total_len, dtype=torch.int32, device=device)
+    cu[: len(boundaries)] = torch.tensor(boundaries, dtype=torch.int32, device=device)
+    seg_ends = seg_starts[1:] + [total_len]
+    max_seqlen = max(end - start for start, end in zip(seg_starts, seg_ends))
+    return cu, max_seqlen
+
+class DocumentPackingLoader:
+    _shard_pool = ThreadPoolExecutor(1)
+
+    def __init__(self, h, device, cu_bucket_size=64):
+        self.rank = h.rank
+        self.world_size = h.world_size
+        self.device = device
+        self.cu_bucket_size = cu_bucket_size
+        self.max_seq_len = h.train_seq_len
+        all_files = [Path(p) for p in sorted(glob.glob(h.train_files))]
+        if not all_files:
+            raise FileNotFoundError(f"No files found for pattern: {h.train_files}")
+        self.files = all_files
+        self.file_iter = iter(self.files)
+        self._init_shard(load_data_shard(next(self.file_iter)))
+        self._next_shard = self._submit_next_shard()
+        self._batch_pool = ThreadPoolExecutor(1)
+        self._next_batch = None
+
+    def _init_shard(self, tokens):
+        global BOS_ID
+        self.tokens = tokens
+        self.shard_size = tokens.numel()
+        if BOS_ID is None:
+            BOS_ID = 1
+        self.bos_idx = (
+            (tokens == BOS_ID).nonzero(as_tuple=True)[0].to(torch.int64).cpu().numpy()
+        )
+        if self.bos_idx.size == 0:
+            self.bos_idx = np.array([0], dtype=np.int64)
+        self.cursor = int(self.bos_idx[0])
+
+    def _submit_next_shard(self):
+        try:
+            path = next(self.file_iter)
+            return self._shard_pool.submit(load_data_shard, path)
+        except StopIteration:
+            return None
+
+    def _advance_shard(self):
+        if self._next_shard is None:
+            self.file_iter = iter(self.files)
+            self._next_shard = self._shard_pool.submit(
+                load_data_shard, next(self.file_iter)
+            )
+        self._init_shard(self._next_shard.result())
+        self._next_shard = self._submit_next_shard()
+
+    def _local_doc_starts(self, local_start, total_len):
+        lo = np.searchsorted(self.bos_idx, local_start, side="left")
+        hi = np.searchsorted(self.bos_idx, local_start + total_len, side="left")
+        return (self.bos_idx[lo:hi] - local_start).tolist()
+
+    def _prepare_batch(self, num_tokens_local, max_seq_len):
+        per_rank_span = num_tokens_local + 1
+        global_span = per_rank_span * self.world_size
+        while self.cursor + global_span > self.shard_size:
+            self._advance_shard()
+        local_start = self.cursor + self.rank * per_rank_span
+        buf = self.tokens[local_start : local_start + per_rank_span]
+        inputs = buf[:-1].to(dtype=torch.int64).pin_memory()
+        targets = buf[1:].to(dtype=torch.int64).pin_memory()
+        starts = self._local_doc_starts(local_start, inputs.numel())
+        cu_seqlens, 
max_seqlen = _build_cu_seqlens( + starts, inputs.numel(), inputs.device, max_seq_len, self.cu_bucket_size + ) + cu_seqlens = cu_seqlens.pin_memory() + self.cursor += global_span + return inputs, targets, cu_seqlens, max_seqlen + + def next_batch(self, global_tokens, grad_accum_steps): + num_tokens_local = global_tokens // (self.world_size * grad_accum_steps) + if self._next_batch is not None: + inputs, targets, cu_seqlens, max_seqlen = self._next_batch.result() + else: + inputs, targets, cu_seqlens, max_seqlen = self._prepare_batch( + num_tokens_local, self.max_seq_len + ) + self._next_batch = self._batch_pool.submit( + self._prepare_batch, num_tokens_local, self.max_seq_len + ) + return ( + inputs[None].to(self.device, non_blocking=True), + targets[None].to(self.device, non_blocking=True), + cu_seqlens.to(self.device, non_blocking=True), + max_seqlen, + ) + + +class ShuffledSequenceLoader: + def __init__(self, h, device): + self.world_size = h.world_size + self.seq_len = h.train_seq_len + self.device = device + all_files = [Path(p) for p in sorted(glob.glob(h.train_files))] + if not all_files: + raise FileNotFoundError(f"No files found for pattern: {h.train_files}") + self.files = all_files[h.rank :: h.world_size] + self.rng = np.random.Generator(np.random.PCG64(h.rank)) + self.num_tokens = [_read_num_tokens(f) for f in self.files] + self.start_inds = [[] for _ in self.files] + for si in range(len(self.files)): + self._reset_shard(si) + + def _reset_shard(self, si): + max_phase = min( + self.seq_len - 1, max(0, self.num_tokens[si] - self.seq_len - 1) + ) + phase = int(self.rng.integers(max_phase + 1)) if max_phase > 0 else 0 + num_sequences = (self.num_tokens[si] - 1 - phase) // self.seq_len + sequence_order = self.rng.permutation(num_sequences) + self.start_inds[si] = (phase + sequence_order * self.seq_len).tolist() + + def next_batch(self, global_tokens, grad_accum_steps): + device_tokens = global_tokens // (self.world_size * grad_accum_steps) + device_batch_size = device_tokens // self.seq_len + remaining = np.array([len(s) for s in self.start_inds], dtype=np.float64) + x = torch.empty((device_batch_size, self.seq_len), dtype=torch.int64) + y = torch.empty((device_batch_size, self.seq_len), dtype=torch.int64) + for bi in range(device_batch_size): + total = remaining.sum() + if total <= 0: + for si in range(len(self.files)): + self._reset_shard(si) + remaining = np.array( + [len(s) for s in self.start_inds], dtype=np.float64 + ) + total = remaining.sum() + probs = remaining / total + si = int(self.rng.choice(len(self.files), p=probs)) + start_ind = self.start_inds[si].pop() + remaining[si] -= 1 + mm = _get_shard_memmap(self.files[si]) + window = torch.as_tensor( + np.array(mm[start_ind : start_ind + self.seq_len + 1], dtype=np.int64) + ) + x[bi] = window[:-1] + y[bi] = window[1:] + return x.to(self.device, non_blocking=True), y.to( + self.device, non_blocking=True + ) + + +class RMSNorm(nn.Module): + def __init__(self, eps=None): + super().__init__() + self.eps = eps + + def forward(self, x): + return F.rms_norm(x, (x.size(-1),), eps=self.eps) + + +class CastedLinear(nn.Linear): + def forward(self, x): + w = self.weight.to(x.dtype) + bias = self.bias.to(x.dtype) if self.bias is not None else None + return F.linear(x, w, bias) + + +@triton.jit +def linear_leaky_relu_square_kernel( + a_desc, + b_desc, + c_desc, + aux_desc, + M, + N, + K, + BLOCK_SIZE_M: tl.constexpr, + BLOCK_SIZE_N: tl.constexpr, + BLOCK_SIZE_K: tl.constexpr, + NUM_SMS: tl.constexpr, + FORWARD: tl.constexpr, +): + dtype = 
tl.bfloat16 + start_pid = tl.program_id(axis=0) + num_pid_m = tl.cdiv(M, BLOCK_SIZE_M) + num_pid_n = tl.cdiv(N, BLOCK_SIZE_N) + k_tiles = tl.cdiv(K, BLOCK_SIZE_K) + num_tiles = num_pid_m * num_pid_n + tile_id_c = start_pid - NUM_SMS + for tile_id in tl.range(start_pid, num_tiles, NUM_SMS, flatten=True): + pid_m = tile_id // num_pid_n + pid_n = tile_id % num_pid_n + offs_am = pid_m * BLOCK_SIZE_M + offs_bn = pid_n * BLOCK_SIZE_N + accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32) + for ki in range(k_tiles): + offs_k = ki * BLOCK_SIZE_K + a = a_desc.load([offs_am, offs_k]) + b = b_desc.load([offs_bn, offs_k]) + accumulator = tl.dot(a, b.T, accumulator) + tile_id_c += NUM_SMS + offs_am_c = offs_am + offs_bn_c = offs_bn + acc = tl.reshape(accumulator, (BLOCK_SIZE_M, 2, BLOCK_SIZE_N // 2)) + acc = tl.permute(acc, (0, 2, 1)) + acc0, acc1 = tl.split(acc) + c0 = acc0.to(dtype) + c1 = acc1.to(dtype) + if not FORWARD: + pre0 = aux_desc.load([offs_am_c, offs_bn_c]) + pre1 = aux_desc.load([offs_am_c, offs_bn_c + BLOCK_SIZE_N // 2]) + c0 = c0 * tl.where(pre0 > 0, 2.0 * pre0, 0.5 * pre0) + c1 = c1 * tl.where(pre1 > 0, 2.0 * pre1, 0.5 * pre1) + c_desc.store([offs_am_c, offs_bn_c], c0) + c_desc.store([offs_am_c, offs_bn_c + BLOCK_SIZE_N // 2], c1) + if FORWARD: + aux0 = tl.where(c0 > 0, c0, 0.5 * c0) + aux1 = tl.where(c1 > 0, c1, 0.5 * c1) + aux_desc.store([offs_am_c, offs_bn_c], aux0 * aux0) + aux_desc.store([offs_am_c, offs_bn_c + BLOCK_SIZE_N // 2], aux1 * aux1) + + +def linear_leaky_relu_square(a, b, aux=None): + M, K = a.shape + N, K2 = b.shape + assert K == K2 + c = torch.empty((M, N), device=a.device, dtype=a.dtype) + forward = aux is None + if aux is None: + aux = torch.empty((M, N), device=a.device, dtype=a.dtype) + num_sms = torch.cuda.get_device_properties(a.device).multi_processor_count + BLOCK_SIZE_M, BLOCK_SIZE_N, BLOCK_SIZE_K = 128, 256, 64 + num_stages = 4 if forward else 3 + a_desc = TensorDescriptor.from_tensor(a, [BLOCK_SIZE_M, BLOCK_SIZE_K]) + b_desc = TensorDescriptor.from_tensor(b, [BLOCK_SIZE_N, BLOCK_SIZE_K]) + c_desc = TensorDescriptor.from_tensor(c, [BLOCK_SIZE_M, BLOCK_SIZE_N // 2]) + aux_desc = TensorDescriptor.from_tensor(aux, [BLOCK_SIZE_M, BLOCK_SIZE_N // 2]) + grid = lambda _meta: ( + min(num_sms, triton.cdiv(M, BLOCK_SIZE_M) * triton.cdiv(N, BLOCK_SIZE_N)), + ) + linear_leaky_relu_square_kernel[grid]( + a_desc, + b_desc, + c_desc, + aux_desc, + M, + N, + K, + BLOCK_SIZE_M=BLOCK_SIZE_M, + BLOCK_SIZE_N=BLOCK_SIZE_N, + BLOCK_SIZE_K=BLOCK_SIZE_K, + NUM_SMS=num_sms, + FORWARD=forward, + num_stages=num_stages, + num_warps=8, + ) + if forward: + return c, aux + return c + + +class FusedLinearLeakyReLUSquareFunction(torch.autograd.Function): + @staticmethod + def forward(ctx, x, w1, w2): + x_flat = x.reshape(-1, x.shape[-1]) + pre, post = linear_leaky_relu_square(x_flat, w1) + out = F.linear(post, w2) + ctx.save_for_backward(x, w1, w2, pre, post) + return out.view(*x.shape[:-1], out.shape[-1]) + + @staticmethod + def backward(ctx, grad_output): + x, w1, w2, pre, post = ctx.saved_tensors + x_flat = x.reshape(-1, x.shape[-1]) + grad_output_flat = grad_output.reshape(-1, grad_output.shape[-1]) + dw2 = grad_output_flat.T @ post + dpre = linear_leaky_relu_square(grad_output_flat, w2.T.contiguous(), aux=pre) + dw1 = dpre.T @ x_flat + dx = dpre @ w1 + return dx.view_as(x), dw1, dw2 + + +FusedLeakyReLUSquareMLP = FusedLinearLeakyReLUSquareFunction.apply + + +class Rotary(nn.Module): + def __init__(self, dim, base=1e4, train_seq_len=1024, rope_dims=0, yarn=True): + 
super().__init__() + self.dim = dim + self.base = base + self.train_seq_len = train_seq_len + self.yarn = yarn + self.rope_dims = rope_dims if rope_dims > 0 else dim + inv_freq = 1.0 / base ** ( + torch.arange(0, self.rope_dims, 2, dtype=torch.float32) / self.rope_dims + ) + self.register_buffer("inv_freq", inv_freq, persistent=False) + self._seq_len_cached = 0 + self._cos_cached = None + self._sin_cached = None + + def forward(self, seq_len, device, dtype): + if ( + self._cos_cached is None + or self._sin_cached is None + or self._seq_len_cached < seq_len + or self._cos_cached.device != device + ): + rd = self.rope_dims + if self.yarn and seq_len > self.train_seq_len: + scale = seq_len / self.train_seq_len + new_base = self.base * scale ** (rd / (rd - 2)) + inv_freq = 1.0 / new_base ** ( + torch.arange(0, rd, 2, dtype=torch.float32, device=device) / rd + ) + else: + inv_freq = self.inv_freq.float().to(device) + t = torch.arange(seq_len, device=device, dtype=torch.float32) + freqs = torch.outer(t, inv_freq) + self._cos_cached = freqs.cos()[None, :, None, :] + self._sin_cached = freqs.sin()[None, :, None, :] + self._seq_len_cached = seq_len + return self._cos_cached[:, :seq_len].to(dtype=dtype), self._sin_cached[:, :seq_len].to(dtype=dtype) + + +def apply_rotary_emb(x, cos, sin, rope_dims=0): + if rope_dims > 0 and rope_dims < x.size(-1): + x_rope, x_pass = x[..., :rope_dims], x[..., rope_dims:] + half = rope_dims // 2 + x1, x2 = x_rope[..., :half], x_rope[..., half:] + x_rope = torch.cat((x1 * cos + x2 * sin, x1 * -sin + x2 * cos), dim=-1) + return torch.cat((x_rope, x_pass), dim=-1) + half = x.size(-1) // 2 + x1, x2 = x[..., :half], x[..., half:] + return torch.cat((x1 * cos + x2 * sin, x1 * -sin + x2 * cos), dim=-1) + + +def _apply_attn_out_gate_inline(y, x_orig, gate_w): + """Inline-safe version: .contiguous() barriers prevent over-aggressive kernel fusion.""" + gate_in = x_orig[:, :, :12].contiguous() + gate = (2.0 * torch.sigmoid(F.linear(gate_in, gate_w.to(gate_in.dtype)))).contiguous() + return y * gate.unsqueeze(-1) + +def _apply_smear_gate_inline(x, smear_w, smear_lambda): + """Inline-safe version: .contiguous() barriers prevent over-aggressive kernel fusion.""" + prev_x = torch.zeros_like(x) + prev_x[:, 1:] = x[:, :-1] + gate_in = x[:, :, :12].contiguous() + gate = torch.sigmoid(F.linear(gate_in, smear_w.to(x.dtype).unsqueeze(0))).contiguous() + return x + smear_lambda.to(x.dtype) * gate * prev_x + +class CausalSelfAttention(nn.Module): + def __init__( + self, dim, num_heads, num_kv_heads, rope_base, qk_gain_init, train_seq_len, yarn=True + ): + super().__init__() + if dim % num_heads != 0: + raise ValueError("model_dim must be divisible by num_heads") + if num_heads % num_kv_heads != 0: + raise ValueError("num_heads must be divisible by num_kv_heads") + self.num_heads = num_heads + self.num_kv_heads = num_kv_heads + self.head_dim = dim // num_heads + if self.head_dim % 2 != 0: + raise ValueError("head_dim must be even for RoPE") + self.q_gain = nn.Parameter( + torch.full((num_heads,), qk_gain_init, dtype=torch.float32) + ) + self.rope_dims = 0 + self.rotary = Rotary(self.head_dim, base=rope_base, train_seq_len=train_seq_len, yarn=yarn) + self.use_xsa = False + self.attn_out_gate_w = None + + def _xsa_efficient(self, y, v): + B, T, H, D = y.shape + Hkv = v.size(-2) + group = H // Hkv + y_g = y.reshape(B, T, Hkv, group, D) + vn = F.normalize(v, dim=-1).unsqueeze(-2) + proj = (y_g * vn).sum(dim=-1, keepdim=True) * vn + return (y_g - proj).reshape(B, T, H, D) + + def forward(self, 
x, q_w, k_w, v_w, out_w, cu_seqlens=None, max_seqlen=0, x_orig=None): + bsz, seqlen, dim = x.shape + q = F.linear(x, q_w.to(x.dtype)).reshape(bsz, seqlen, self.num_heads, self.head_dim) + k = F.linear(x, k_w.to(x.dtype)).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim) + v = F.linear(x, v_w.to(x.dtype)).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim) + q = F.rms_norm(q, (q.size(-1),)) + k = F.rms_norm(k, (k.size(-1),)) + cos, sin = self.rotary(seqlen, x.device, q.dtype) + q = apply_rotary_emb(q, cos, sin, self.rope_dims) + k = apply_rotary_emb(k, cos, sin, self.rope_dims) + q = q * self.q_gain.to(dtype=q.dtype)[None, None, :, None] + if cu_seqlens is not None: + y = flash_attn_varlen_func( + q[0], + k[0], + v[0], + cu_seqlens_q=cu_seqlens, + cu_seqlens_k=cu_seqlens, + max_seqlen_q=max_seqlen, + max_seqlen_k=max_seqlen, + causal=True, + window_size=(-1, -1), + )[None] + else: + y = flash_attn_3_func(q, k, v, causal=True) + if self.use_xsa: + y = self._xsa_efficient(y, v) + if self.attn_out_gate_w is not None and x_orig is not None: + y = _apply_attn_out_gate_inline(y, x_orig, self.attn_out_gate_w) + y = y.reshape(bsz, seqlen, dim) + self._last_proj_input = y.detach() if getattr(self, "_calib", False) else None + return F.linear(y, out_w.to(x.dtype)) + + +class MLP(nn.Module): + def __init__(self, dim, mlp_mult): + super().__init__() + self.use_fused = True + + def forward(self, x, up_w, down_w): + if self.training and self.use_fused: + return FusedLeakyReLUSquareMLP(x, up_w.to(x.dtype), down_w.to(x.dtype)) + hidden = F.leaky_relu(F.linear(x, up_w.to(x.dtype)), negative_slope=0.5).square() + self._last_down_input = hidden.detach() if getattr(self, "_calib", False) else None + return F.linear(hidden, down_w.to(x.dtype)) + + +class Block(nn.Module): + def __init__( + self, + dim, + num_heads, + num_kv_heads, + mlp_mult, + rope_base, + qk_gain_init, + train_seq_len, + layer_idx=0, + ln_scale=False, + yarn=True, + ): + super().__init__() + self.attn_norm = RMSNorm() + self.mlp_norm = RMSNorm() + self.attn = CausalSelfAttention( + dim, num_heads, num_kv_heads, rope_base, qk_gain_init, train_seq_len, yarn=yarn + ) + self.mlp = MLP(dim, mlp_mult) + self.attn_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32)) + self.mlp_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32)) + self.resid_mix = nn.Parameter( + torch.stack((torch.ones(dim), torch.zeros(dim))).float() + ) + self.ln_scale_factor = 1.0 / math.sqrt(layer_idx + 1) if ln_scale else 1.0 + + def forward(self, x, x0, q_w, k_w, v_w, out_w, up_w, down_w, cu_seqlens=None, max_seqlen=0): + mix = self.resid_mix.to(dtype=x.dtype) + x_in = mix[0][None, None, :] * x + mix[1][None, None, :] * x0 + attn_out = self.attn( + self.attn_norm(x_in) * self.ln_scale_factor, + q_w, k_w, v_w, out_w, + cu_seqlens=cu_seqlens, + max_seqlen=max_seqlen, + x_orig=x_in, + ) + x_out = x_in + self.attn_scale.to(dtype=x_in.dtype)[None, None, :] * attn_out + x_out = x_out + self.mlp_scale.to(dtype=x_out.dtype)[ + None, None, : + ] * self.mlp(self.mlp_norm(x_out) * self.ln_scale_factor, up_w, down_w) + return x_out + +class GPT(nn.Module): + def __init__(self, h): + super().__init__() + if h.logit_softcap <= 0.0: + raise ValueError(f"logit_softcap must be positive, got {h.logit_softcap}") + self.tie_embeddings = h.tie_embeddings + self.tied_embed_init_std = h.tied_embed_init_std + self.logit_softcap = h.logit_softcap + self.tok_emb = nn.Embedding(h.vocab_size, h.model_dim) + self.num_layers = h.num_layers + head_dim = h.model_dim // h.num_heads + 
kv_dim = h.num_kv_heads * head_dim + hidden_dim = int(h.mlp_mult * h.model_dim) + self.qo_bank = nn.Parameter(torch.empty(2 * h.num_layers, h.model_dim, h.model_dim)) + self.kv_bank = nn.Parameter(torch.empty(2 * h.num_layers, kv_dim, h.model_dim)) + self.mlp_up_bank = nn.Parameter(torch.empty(h.num_layers, hidden_dim, h.model_dim)) + self.mlp_down_bank = nn.Parameter(torch.empty(h.num_layers, h.model_dim, hidden_dim)) + self.num_encoder_layers = h.num_layers // 2 + self.num_decoder_layers = h.num_layers - self.num_encoder_layers + self.blocks = nn.ModuleList( + [ + Block( + h.model_dim, + h.num_heads, + h.num_kv_heads, + h.mlp_mult, + h.rope_base, + h.qk_gain_init, + h.train_seq_len, + layer_idx=i, + ln_scale=h.ln_scale, + yarn=h.rope_yarn, + ) + for i in range(h.num_layers) + ] + ) + if h.rope_dims > 0: + head_dim = h.model_dim // h.num_heads + for block in self.blocks: + block.attn.rope_dims = h.rope_dims + block.attn.rotary = Rotary( + head_dim, + base=h.rope_base, + train_seq_len=h.train_seq_len, + rope_dims=h.rope_dims, + yarn=h.rope_yarn, + ) + self.final_norm = RMSNorm() + self.lm_head = ( + None + if h.tie_embeddings + else CastedLinear(h.model_dim, h.vocab_size, bias=False) + ) + if self.lm_head is not None: + self.lm_head._zero_init = True + if h.xsa_last_n > 0: + for i in range(max(0, h.num_layers - h.xsa_last_n), h.num_layers): + self.blocks[i].attn.use_xsa = True + self.looping_active = False + if h.num_loops > 0: + loop_seg = list(range(h.loop_start, h.loop_end + 1)) + all_indices = list(range(h.loop_start)) + for _ in range(h.num_loops + 1): + all_indices.extend(loop_seg) + all_indices.extend(range(h.loop_end + 1, h.num_layers)) + num_enc = len(all_indices) // 2 + self.encoder_indices = all_indices[:num_enc] + self.decoder_indices = all_indices[num_enc:] + else: + self.encoder_indices = list(range(self.num_encoder_layers)) + self.decoder_indices = list(range(self.num_encoder_layers, h.num_layers)) + self.num_skip_weights = min( + len(self.encoder_indices), len(self.decoder_indices) + ) + self.skip_weights = nn.Parameter( + torch.ones(self.num_skip_weights, h.model_dim, dtype=torch.float32) + ) + self.skip_gates = ( + nn.Parameter( + torch.zeros(self.num_skip_weights, h.model_dim, dtype=torch.float32) + ) + if h.skip_gates_enabled + else None + ) + self.parallel_start_layer = h.parallel_start_layer + self.parallel_final_lane = h.parallel_final_lane.lower() + self.parallel_post_lambdas = nn.Parameter( + torch.ones(h.num_layers, 2, 2, dtype=torch.float32) + ) + self.parallel_resid_lambdas = nn.Parameter( + torch.full((h.num_layers, 2), 1.1, dtype=torch.float32) + ) + self.smear_gate_enabled = h.smear_gate + if h.smear_gate: + self.smear_w = nn.Parameter(torch.zeros(12)) + self.smear_lambda = nn.Parameter(torch.zeros(1)) + else: + self.smear_w = None + self.smear_lambda = None + if h.attn_out_gate: + for block in self.blocks: + block.attn.attn_out_gate_w = nn.Parameter( + torch.zeros(h.num_heads, 12, dtype=torch.float32) + ) + self._init_weights() + + def _init_weights(self): + if self.tie_embeddings: + nn.init.normal_(self.tok_emb.weight, mean=0.0, std=self.tied_embed_init_std) + n = self.num_layers + proj_scale = 1.0 / math.sqrt(2 * n) + for i in range(n): + nn.init.orthogonal_(self.qo_bank.data[i], gain=1.0) + nn.init.zeros_(self.qo_bank.data[n + i]) + self.qo_bank.data[n + i].mul_(proj_scale) + nn.init.orthogonal_(self.kv_bank.data[i], gain=1.0) + nn.init.orthogonal_(self.kv_bank.data[n + i], gain=1.0) + for i in range(n): + nn.init.orthogonal_(self.mlp_up_bank.data[i], 
gain=1.0) + nn.init.zeros_(self.mlp_down_bank.data[i]) + self.mlp_down_bank.data[i].mul_(proj_scale) + for name, module in self.named_modules(): + if isinstance(module, nn.Linear): + if getattr(module, "_zero_init", False): + nn.init.zeros_(module.weight) + elif ( + module.weight.ndim == 2 + and module.weight.shape[0] >= 64 + and module.weight.shape[1] >= 64 + ): + nn.init.orthogonal_(module.weight, gain=1.0) + + def _bank_weights(self, i): + n = self.num_layers + return ( + self.qo_bank[i], + self.kv_bank[i], + self.kv_bank[n + i], + self.qo_bank[n + i], + self.mlp_up_bank[i], + self.mlp_down_bank[i], + ) + + def _parallel_block( + self, block_idx, lane0, lane1, x0, + q_w, k_w, v_w, out_w, up_w, down_w, + cu_seqlens=None, max_seqlen=0, + ): + block = self.blocks[block_idx] + mix = block.resid_mix.to(dtype=lane0.dtype) + attn_read = mix[0][None, None, :] * lane0 + mix[1][None, None, :] * x0 + attn_out = block.attn( + block.attn_norm(attn_read) * block.ln_scale_factor, + q_w, k_w, v_w, out_w, + cu_seqlens=cu_seqlens, max_seqlen=max_seqlen, + x_orig=attn_read, + ) + attn_out = block.attn_scale.to(dtype=attn_out.dtype)[None, None, :] * attn_out + mlp_read = lane1 + mlp_out = block.mlp_scale.to(dtype=lane1.dtype)[None, None, :] * block.mlp( + block.mlp_norm(mlp_read) * block.ln_scale_factor, up_w, down_w + ) + attn_resid = self.parallel_resid_lambdas[block_idx, 0].to(dtype=lane0.dtype) + attn_post = self.parallel_post_lambdas[block_idx, 0].to(dtype=lane0.dtype) + mlp_resid = self.parallel_resid_lambdas[block_idx, 1].to(dtype=lane0.dtype) + mlp_post = self.parallel_post_lambdas[block_idx, 1].to(dtype=lane0.dtype) + lane0 = attn_resid * lane0 + attn_post[0] * attn_out + mlp_post[0] * mlp_out + lane1 = mlp_resid * lane1 + attn_post[1] * attn_out + mlp_post[1] * mlp_out + return lane0, lane1 + + def _final_parallel_hidden(self, lane0, lane1): + if self.parallel_final_lane == "mlp": + return lane1 + if self.parallel_final_lane == "attn": + return lane0 + return 0.5 * (lane0 + lane1) + + def forward_logits(self, input_ids, cu_seqlens=None, max_seqlen=0): + x = self.tok_emb(input_ids) + x = F.rms_norm(x, (x.size(-1),)) + if self.smear_gate_enabled: + x = _apply_smear_gate_inline(x, self.smear_w, self.smear_lambda) + x0 = x + skips = [] + enc_iter = ( + self.encoder_indices + if self.looping_active + else range(self.num_encoder_layers) + ) + dec_iter = ( + self.decoder_indices + if self.looping_active + else range( + self.num_encoder_layers, + self.num_encoder_layers + self.num_decoder_layers, + ) + ) + for i in enc_iter: + q_w, k_w, v_w, out_w, up_w, down_w = self._bank_weights(i) + x = self.blocks[i](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, cu_seqlens=cu_seqlens, max_seqlen=max_seqlen) + skips.append(x) + psl = self.parallel_start_layer + lane0 = None + lane1 = None + for skip_idx, i in enumerate(dec_iter): + q_w, k_w, v_w, out_w, up_w, down_w = self._bank_weights(i) + if i >= psl and psl > 0: + if lane0 is None: + lane0 = x + lane1 = x + if skip_idx < self.num_skip_weights and skips: + skip = skips.pop() + w = self.skip_weights[skip_idx].to(dtype=lane0.dtype)[None, None, :] + if self.skip_gates is not None: + g = torch.sigmoid(self.skip_gates[skip_idx].to(dtype=lane0.dtype))[None, None, :] + lane0 = torch.lerp(w * skip, lane0, g) + else: + lane0 = lane0 + w * skip + lane0, lane1 = self._parallel_block( + i, lane0, lane1, x0, q_w, k_w, v_w, out_w, up_w, down_w, + cu_seqlens=cu_seqlens, max_seqlen=max_seqlen, + ) + else: + if skip_idx < self.num_skip_weights and skips: + scaled_skip = ( + 
self.skip_weights[skip_idx].to(dtype=x.dtype)[None, None, :] + * skips.pop() + ) + if self.skip_gates is not None: + g = torch.sigmoid(self.skip_gates[skip_idx].to(dtype=x.dtype))[None, None, :] + x = torch.lerp(scaled_skip, x, g) + else: + x = x + scaled_skip + x = self.blocks[i](x, x0, q_w, k_w, v_w, out_w, up_w, down_w, cu_seqlens=cu_seqlens, max_seqlen=max_seqlen) + if lane0 is not None: + x = self._final_parallel_hidden(lane0, lane1) + x = self.final_norm(x) + if self.tie_embeddings: + logits_proj = F.linear(x, self.tok_emb.weight) + else: + logits_proj = self.lm_head(x) + return self.logit_softcap * torch.tanh(logits_proj / self.logit_softcap) + + def forward(self, input_ids, target_ids, cu_seqlens=None, max_seqlen=0): + logits = self.forward_logits( + input_ids, cu_seqlens=cu_seqlens, max_seqlen=max_seqlen + ) + return F.cross_entropy( + logits.reshape(-1, logits.size(-1)).float(), + target_ids.reshape(-1), + reduction="mean", + ) + + def forward_ttt(self, input_ids, target_ids, lora): + x = self.tok_emb(input_ids) + x = F.rms_norm(x, (x.size(-1),)) + if self.smear_gate_enabled: + x = _apply_smear_gate_inline(x, self.smear_w, self.smear_lambda) + x0 = x + skips = [] + enc_iter = ( + self.encoder_indices + if self.looping_active + else list(range(self.num_encoder_layers)) + ) + dec_iter = ( + self.decoder_indices + if self.looping_active + else list( + range( + self.num_encoder_layers, + self.num_encoder_layers + self.num_decoder_layers, + ) + ) + ) + slot = 0 + for i in enc_iter: + q_w, k_w, v_w, out_w, up_w, down_w = self._bank_weights(i) + x = self._block_with_lora(self.blocks[i], x, x0, lora, slot, q_w, k_w, v_w, out_w, up_w, down_w) + slot += 1 + skips.append(x) + psl = self.parallel_start_layer + lane0 = None + lane1 = None + for skip_idx, i in enumerate(dec_iter): + q_w, k_w, v_w, out_w, up_w, down_w = self._bank_weights(i) + if i >= psl and psl > 0: + if lane0 is None: + lane0 = x + lane1 = x + if skip_idx < self.num_skip_weights and skips: + skip = skips.pop() + w = self.skip_weights[skip_idx].to(dtype=lane0.dtype)[None, None, :] + if self.skip_gates is not None: + g = torch.sigmoid(self.skip_gates[skip_idx].to(dtype=lane0.dtype))[None, None, :] + lane0 = torch.lerp(w * skip, lane0, g) + else: + lane0 = lane0 + w * skip + lane0, lane1 = self._parallel_block_with_lora( + i, lane0, lane1, x0, lora, slot, + q_w, k_w, v_w, out_w, up_w, down_w, + ) + else: + if skip_idx < self.num_skip_weights and skips: + scaled_skip = ( + self.skip_weights[skip_idx].to(dtype=x.dtype)[None, None, :] + * skips.pop() + ) + if self.skip_gates is not None: + g = torch.sigmoid(self.skip_gates[skip_idx].to(dtype=x.dtype))[None, None, :] + x = torch.lerp(scaled_skip, x, g) + else: + x = x + scaled_skip + x = self._block_with_lora(self.blocks[i], x, x0, lora, slot, q_w, k_w, v_w, out_w, up_w, down_w) + slot += 1 + if lane0 is not None: + x = self._final_parallel_hidden(lane0, lane1) + x = self.final_norm(x) + if self.tie_embeddings: + logits = F.linear(x, self.tok_emb.weight) + else: + logits = self.lm_head(x) + logits = logits + lora.lm_head_lora(x) + logits = self.logit_softcap * torch.tanh(logits / self.logit_softcap) + bsz, sl, V = logits.shape + return F.cross_entropy( + logits.float().reshape(-1, V), target_ids.reshape(-1), reduction="none" + ).reshape(bsz, sl) + + def _block_with_lora(self, block, x, x0, lora, slot, q_w, k_w, v_w, out_w, up_w, down_w): + mix = block.resid_mix.to(dtype=x.dtype) + x_in = mix[0][None, None, :] * x + mix[1][None, None, :] * x0 + n = block.attn_norm(x_in) * 
block.ln_scale_factor + attn = block.attn + bsz, seqlen, dim = n.shape + q = (F.linear(n, q_w.to(n.dtype)) + lora.q_loras[slot](n)).reshape( + bsz, seqlen, attn.num_heads, attn.head_dim + ) + k = F.linear(n, k_w.to(n.dtype)) + if lora.k_loras is not None: + k = k + lora.k_loras[slot](n) + k = k.reshape(bsz, seqlen, attn.num_kv_heads, attn.head_dim) + v = (F.linear(n, v_w.to(n.dtype)) + lora.v_loras[slot](n)).reshape( + bsz, seqlen, attn.num_kv_heads, attn.head_dim + ) + q = F.rms_norm(q, (q.size(-1),)) + k = F.rms_norm(k, (k.size(-1),)) + cos, sin = attn.rotary(seqlen, n.device, q.dtype) + q = apply_rotary_emb(q, cos, sin, attn.rope_dims) + k = apply_rotary_emb(k, cos, sin, attn.rope_dims) + q = q * attn.q_gain.to(dtype=q.dtype)[None, None, :, None] + y = flash_attn_3_func(q, k, v, causal=True) + if attn.use_xsa: + y = attn._xsa_efficient(y, v) + if attn.attn_out_gate_w is not None: + y = _apply_attn_out_gate_inline(y, x_in, attn.attn_out_gate_w) + y = y.reshape(bsz, seqlen, dim) + attn_out = F.linear(y, out_w.to(n.dtype)) + if lora.o_loras is not None: + attn_out = attn_out + lora.o_loras[slot](n) + x_out = x_in + block.attn_scale.to(dtype=x_in.dtype)[None, None, :] * attn_out + mlp_n = block.mlp_norm(x_out) * block.ln_scale_factor + mlp_out = block.mlp(mlp_n, up_w, down_w) + if lora.mlp_loras is not None: + mlp_out = mlp_out + lora.mlp_loras[slot](mlp_n) + x_out = x_out + block.mlp_scale.to(dtype=x_out.dtype)[None, None, :] * mlp_out + return x_out + + def _parallel_block_with_lora( + self, block_idx, lane0, lane1, x0, lora, slot, + q_w, k_w, v_w, out_w, up_w, down_w, + ): + block = self.blocks[block_idx] + mix = block.resid_mix.to(dtype=lane0.dtype) + attn_read = mix[0][None, None, :] * lane0 + mix[1][None, None, :] * x0 + n = block.attn_norm(attn_read) * block.ln_scale_factor + attn = block.attn + bsz, seqlen, dim = n.shape + q = (F.linear(n, q_w.to(n.dtype)) + lora.q_loras[slot](n)).reshape( + bsz, seqlen, attn.num_heads, attn.head_dim + ) + k = F.linear(n, k_w.to(n.dtype)) + if lora.k_loras is not None: + k = k + lora.k_loras[slot](n) + k = k.reshape(bsz, seqlen, attn.num_kv_heads, attn.head_dim) + v = (F.linear(n, v_w.to(n.dtype)) + lora.v_loras[slot](n)).reshape( + bsz, seqlen, attn.num_kv_heads, attn.head_dim + ) + q = F.rms_norm(q, (q.size(-1),)) + k = F.rms_norm(k, (k.size(-1),)) + cos, sin = attn.rotary(seqlen, n.device, q.dtype) + q = apply_rotary_emb(q, cos, sin, attn.rope_dims) + k = apply_rotary_emb(k, cos, sin, attn.rope_dims) + q = q * attn.q_gain.to(dtype=q.dtype)[None, None, :, None] + y = flash_attn_3_func(q, k, v, causal=True) + if attn.use_xsa: + y = attn._xsa_efficient(y, v) + if attn.attn_out_gate_w is not None: + y = _apply_attn_out_gate_inline(y, attn_read, attn.attn_out_gate_w) + y = y.reshape(bsz, seqlen, dim) + attn_out = F.linear(y, out_w.to(n.dtype)) + if lora.o_loras is not None: + attn_out = attn_out + lora.o_loras[slot](n) + attn_out = block.attn_scale.to(dtype=attn_out.dtype)[None, None, :] * attn_out + mlp_read = lane1 + mlp_n = block.mlp_norm(mlp_read) * block.ln_scale_factor + mlp_out = block.mlp(mlp_n, up_w, down_w) + if lora.mlp_loras is not None: + mlp_out = mlp_out + lora.mlp_loras[slot](mlp_n) + mlp_out = block.mlp_scale.to(dtype=lane1.dtype)[None, None, :] * mlp_out + attn_resid = self.parallel_resid_lambdas[block_idx, 0].to(dtype=lane0.dtype) + attn_post = self.parallel_post_lambdas[block_idx, 0].to(dtype=lane0.dtype) + mlp_resid = self.parallel_resid_lambdas[block_idx, 1].to(dtype=lane0.dtype) + mlp_post = self.parallel_post_lambdas[block_idx, 
1].to(dtype=lane0.dtype) + lane0 = attn_resid * lane0 + attn_post[0] * attn_out + mlp_post[0] * mlp_out + lane1 = mlp_resid * lane1 + attn_post[1] * attn_out + mlp_post[1] * mlp_out + return lane0, lane1 + + +class BatchedLinearLoRA(nn.Module): + def __init__(self, bsz, in_features, out_features, rank): + super().__init__() + self._bound = 1.0 / math.sqrt(in_features) + self.A = nn.Parameter( + torch.empty(bsz, rank, in_features).uniform_(-self._bound, self._bound) + ) + self.B = nn.Parameter(torch.zeros(bsz, out_features, rank)) + + def reset(self): + with torch.no_grad(): + self.A.uniform_(-self._bound, self._bound) + self.B.zero_() + + def forward(self, x): + return (x @ self.A.transpose(1, 2)) @ self.B.transpose(1, 2) + + +class BatchedTTTLoRA(nn.Module): + def __init__(self, bsz, model, rank, k_lora=True, mlp_lora=True, o_lora=True): + super().__init__() + self.bsz = bsz + dim = model.qo_bank.shape[-1] + vocab = model.tok_emb.num_embeddings + if getattr(model, "looping_active", False): + num_slots = len(model.encoder_indices) + len(model.decoder_indices) + else: + num_slots = len(model.blocks) + kv_dim = model.blocks[0].attn.num_kv_heads * ( + dim // model.blocks[0].attn.num_heads + ) + embed_dim = model.tok_emb.embedding_dim + self.lm_head_lora = BatchedLinearLoRA(bsz, embed_dim, vocab, rank) + self.q_loras = nn.ModuleList( + [BatchedLinearLoRA(bsz, dim, dim, rank) for _ in range(num_slots)] + ) + self.v_loras = nn.ModuleList( + [BatchedLinearLoRA(bsz, dim, kv_dim, rank) for _ in range(num_slots)] + ) + self.k_loras = ( + nn.ModuleList( + [BatchedLinearLoRA(bsz, dim, kv_dim, rank) for _ in range(num_slots)] + ) + if k_lora + else None + ) + self.mlp_loras = ( + nn.ModuleList( + [BatchedLinearLoRA(bsz, dim, dim, rank) for _ in range(num_slots)] + ) + if mlp_lora + else None + ) + self.o_loras = ( + nn.ModuleList( + [BatchedLinearLoRA(bsz, dim, dim, rank) for _ in range(num_slots)] + ) + if o_lora + else None + ) + + def reset(self): + with torch.no_grad(): + self.lm_head_lora.reset() + for loras in [self.q_loras, self.v_loras, self.k_loras, + self.mlp_loras, self.o_loras]: + if loras is not None: + for lora in loras: + lora.reset() + + +@torch.compile +def zeropower_via_newtonschulz5(G, steps=10, eps=1e-07): + a, b, c = 3.4445, -4.775, 2.0315 + was_2d = G.ndim == 2 + if was_2d: + G = G.unsqueeze(0) + X = G.bfloat16() + transposed = X.size(-2) > X.size(-1) + if transposed: + X = X.mT + X = X / (X.norm(dim=(-2, -1), keepdim=True) + eps) + for _ in range(steps): + A = X @ X.mT + B = b * A + c * (A @ A) + X = a * X + B @ X + if transposed: + X = X.mT + if was_2d: + X = X.squeeze(0) + return X + + +class Muon(torch.optim.Optimizer): + def __init__( + self, + params, + lr, + momentum, + backend_steps, + nesterov=True, + weight_decay=0.0, + row_normalize=False, + ): + super().__init__( + params, + dict( + lr=lr, + momentum=momentum, + backend_steps=backend_steps, + nesterov=nesterov, + weight_decay=weight_decay, + row_normalize=row_normalize, + ), + ) + self._built = False + + def _build(self): + self._distributed = dist.is_available() and dist.is_initialized() + self._world_size = dist.get_world_size() if self._distributed else 1 + self._rank = dist.get_rank() if self._distributed else 0 + ws = self._world_size + self._bank_meta = [] + for group in self.param_groups: + for p in group["params"]: + B = p.shape[0] + padded_B = ((B + ws - 1) // ws) * ws + shard_B = padded_B // ws + tail = p.shape[1:] + dev = p.device + self._bank_meta.append({ + "p": p, + "B": B, + "padded_grad": 
torch.zeros(padded_B, *tail, device=dev, dtype=torch.bfloat16), + "shard": torch.zeros(shard_B, *tail, device=dev, dtype=torch.bfloat16), + "shard_mom": torch.zeros(shard_B, *tail, device=dev, dtype=torch.bfloat16), + "full_update": torch.zeros(padded_B, *tail, device=dev, dtype=torch.bfloat16), + "scale": max(1, p.shape[-2] / p.shape[-1]) ** 0.5, + }) + self._bank_meta.sort(key=lambda m: -m["p"].numel()) + self._built = True + + def launch_reduce_scatters(self): + if not self._built: + self._build() + if not self._distributed: + return + self._rs_futures = [] + for m in self._bank_meta: + p = m["p"] + if p.grad is None: + self._rs_futures.append(None) + continue + pg = m["padded_grad"] + pg[: m["B"]].copy_(p.grad.bfloat16()) + if pg.shape[0] > m["B"]: + pg[m["B"] :].zero_() + fut = dist.reduce_scatter_tensor( + m["shard"], pg, op=dist.ReduceOp.AVG, async_op=True + ) + self._rs_futures.append(fut) + + @torch.no_grad() + def step(self, closure=None): + loss = None + if closure is not None: + with torch.enable_grad(): + loss = closure() + if not self._built: + self._build() + for group in self.param_groups: + lr = group["lr"] + momentum = group["momentum"] + backend_steps = group["backend_steps"] + nesterov = group["nesterov"] + wd = group.get("weight_decay", 0.0) + row_normalize = group.get("row_normalize", False) + prev_ag_handle = None + prev_m = None + sharded = self._distributed and hasattr(self, "_rs_futures") + for idx, m in enumerate(self._bank_meta): + p = m["p"] + if p.grad is None: + continue + if prev_ag_handle is not None: + prev_ag_handle.wait() + pp = prev_m["p"] + upd = prev_m["full_update"][: prev_m["B"]] + if wd > 0.0: + pp.data.mul_(1.0 - lr * wd) + pp.add_(upd.to(dtype=pp.dtype), alpha=-lr * prev_m["scale"]) + if sharded and self._rs_futures[idx] is not None: + self._rs_futures[idx].wait() + g = m["shard"] + buf = m["shard_mom"] + else: + g = p.grad.bfloat16() + state = self.state[p] + if "momentum_buffer" not in state: + state["momentum_buffer"] = torch.zeros_like(g) + buf = state["momentum_buffer"] + buf.mul_(momentum).add_(g) + if nesterov: + update = g.add(buf, alpha=momentum) + else: + update = buf + if row_normalize: + rn = update.float().norm(dim=-1, keepdim=True).clamp_min(1e-07) + update = update / rn.to(update.dtype) + update = zeropower_via_newtonschulz5(update, steps=backend_steps) + if sharded: + prev_ag_handle = dist.all_gather_into_tensor( + m["full_update"], update, async_op=True + ) + prev_m = m + else: + if wd > 0.0: + p.data.mul_(1.0 - lr * wd) + p.add_(update.to(dtype=p.dtype), alpha=-lr * m["scale"]) + if prev_ag_handle is not None: + prev_ag_handle.wait() + pp = prev_m["p"] + upd = prev_m["full_update"][: prev_m["B"]] + if wd > 0.0: + pp.data.mul_(1.0 - lr * wd) + pp.add_(upd.to(dtype=pp.dtype), alpha=-lr * prev_m["scale"]) + if hasattr(self, "_rs_futures"): + del self._rs_futures + return loss + + +CONTROL_TENSOR_NAME_PATTERNS = tuple( + pattern + for pattern in os.environ.get( + "CONTROL_TENSOR_NAME_PATTERNS", + "attn_scale,attn_scales,mlp_scale,mlp_scales,resid_mix,resid_mixes,q_gain,skip_weight,skip_weights,skip_gates,parallel_post_lambdas,parallel_resid_lambdas,attn_out_gate_w", + ).split(",") + if pattern +) + + +PACKED_REPLICATED_GRAD_MAX_NUMEL = 1 << 15 + + +class Optimizers: + def __init__(self, h, base_model): + matrix_params = [ + base_model.qo_bank, + base_model.kv_bank, + base_model.mlp_up_bank, + base_model.mlp_down_bank, + ] + block_named_params = list(base_model.blocks.named_parameters()) + scalar_params = [ + p + for (name, p) in 
block_named_params + if p.ndim < 2 + or any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS) + ] + if base_model.skip_weights.numel() > 0: + scalar_params.append(base_model.skip_weights) + if base_model.skip_gates is not None and base_model.skip_gates.numel() > 0: + scalar_params.append(base_model.skip_gates) + if base_model.parallel_post_lambdas is not None: + scalar_params.append(base_model.parallel_post_lambdas) + if base_model.parallel_resid_lambdas is not None: + scalar_params.append(base_model.parallel_resid_lambdas) + if base_model.smear_w is not None: + scalar_params.append(base_model.smear_w) + scalar_params.append(base_model.smear_lambda) + token_lr = h.tied_embed_lr if h.tie_embeddings else h.embed_lr + tok_params = [ + {"params": [base_model.tok_emb.weight], "lr": token_lr, "base_lr": token_lr} + ] + self.optimizer_tok = torch.optim.AdamW( + tok_params, + betas=(h.beta1, h.beta2), + eps=h.adam_eps, + weight_decay=h.embed_wd, + fused=True, + ) + self.optimizer_muon = Muon( + matrix_params, + lr=h.matrix_lr, + momentum=h.muon_momentum, + backend_steps=h.muon_backend_steps, + weight_decay=h.muon_wd, + row_normalize=h.muon_row_normalize, + ) + for group in self.optimizer_muon.param_groups: + group["base_lr"] = h.matrix_lr + self.optimizer_scalar = torch.optim.AdamW( + [{"params": scalar_params, "lr": h.scalar_lr, "base_lr": h.scalar_lr}], + betas=(h.beta1, h.beta2), + eps=h.adam_eps, + weight_decay=h.adam_wd, + fused=True, + ) + self.optimizers = [ + self.optimizer_tok, + self.optimizer_muon, + self.optimizer_scalar, + ] + self.replicated_params = list(tok_params[0]["params"]) + self.replicated_params.extend(scalar_params) + self.replicated_large_params = [] + self.replicated_packed_params = [] + for p in self.replicated_params: + if p.numel() <= PACKED_REPLICATED_GRAD_MAX_NUMEL: + self.replicated_packed_params.append(p) + else: + self.replicated_large_params.append(p) + + def __iter__(self): + return iter(self.optimizers) + + def zero_grad_all(self): + for opt in self.optimizers: + opt.zero_grad(set_to_none=True) + + def _all_reduce_packed_grads(self): + grads_by_key = collections.defaultdict(list) + for p in self.replicated_packed_params: + if p.grad is not None: + grads_by_key[(p.grad.device, p.grad.dtype)].append(p.grad) + for grads in grads_by_key.values(): + flat = torch.empty( + sum(g.numel() for g in grads), + device=grads[0].device, + dtype=grads[0].dtype, + ) + offset = 0 + for g in grads: + n = g.numel() + flat[offset : offset + n].copy_(g.contiguous().view(-1)) + offset += n + dist.all_reduce(flat, op=dist.ReduceOp.AVG) + offset = 0 + for g in grads: + n = g.numel() + g.copy_(flat[offset : offset + n].view_as(g)) + offset += n + + def step(self, distributed=False): + self.optimizer_muon.launch_reduce_scatters() + if distributed: + reduce_handles = [ + dist.all_reduce(p.grad, op=dist.ReduceOp.AVG, async_op=True) + for p in self.replicated_large_params + if p.grad is not None + ] + self._all_reduce_packed_grads() + for handle in reduce_handles: + handle.wait() + self.optimizer_tok.step() + self.optimizer_scalar.step() + self.optimizer_muon.step() + self.zero_grad_all() + + +def restore_fp32_params(model): + for module in model.modules(): + if isinstance(module, CastedLinear): + module.float() + for name, param in model.named_parameters(): + if ( + param.ndim < 2 + or any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS) + ) and param.dtype != torch.float32: + param.data = param.data.float() + if hasattr(model, "qo_bank") and model.qo_bank is not 
None: + model.qo_bank.data = model.qo_bank.data.float() + model.kv_bank.data = model.kv_bank.data.float() + model.mlp_up_bank.data = model.mlp_up_bank.data.float() + model.mlp_down_bank.data = model.mlp_down_bank.data.float() + + +def collect_hessians(model, train_loader, h, device, n_calibration_batches=64): + hessians = {} + hooks = [] + for i, block in enumerate(model.blocks): + block.attn._calib = True + block.mlp._calib = True + block.mlp.use_fused = False + + def make_attn_hook(layer_idx): + def hook_fn(module, inp, out): + x = inp[0].detach().float() + if x.ndim == 3: + x = x.reshape(-1, x.shape[-1]) + for suffix in ["c_q", "c_k", "c_v"]: + name = f"blocks.{layer_idx}.attn.{suffix}.weight" + if name not in hessians: + hessians[name] = torch.zeros( + x.shape[1], x.shape[1], dtype=torch.float32, device=device + ) + hessians[name].addmm_(x.T, x) + y = module._last_proj_input + if y is not None: + y = y.float() + if y.ndim == 3: + y = y.reshape(-1, y.shape[-1]) + name = f"blocks.{layer_idx}.attn.proj.weight" + if name not in hessians: + hessians[name] = torch.zeros( + y.shape[1], y.shape[1], dtype=torch.float32, device=device + ) + hessians[name].addmm_(y.T, y) + return hook_fn + + def make_mlp_hook(layer_idx): + def hook_fn(module, inp, out): + x = inp[0].detach().float() + if x.ndim == 3: + x = x.reshape(-1, x.shape[-1]) + name = f"blocks.{layer_idx}.mlp.fc.weight" + if name not in hessians: + hessians[name] = torch.zeros( + x.shape[1], x.shape[1], dtype=torch.float32, device=device + ) + hessians[name].addmm_(x.T, x) + h_act = module._last_down_input + if h_act is not None: + h_act = h_act.float() + if h_act.ndim == 3: + h_act = h_act.reshape(-1, h_act.shape[-1]) + name = f"blocks.{layer_idx}.mlp.proj.weight" + if name not in hessians: + hessians[name] = torch.zeros( + h_act.shape[1], h_act.shape[1], dtype=torch.float32, device=device + ) + hessians[name].addmm_(h_act.T, h_act) + return hook_fn + + for i, block in enumerate(model.blocks): + hooks.append(block.attn.register_forward_hook(make_attn_hook(i))) + hooks.append(block.mlp.register_forward_hook(make_mlp_hook(i))) + + # Hessian hooks for embedding factorization projection layers + def make_linear_input_hook(weight_name): + def hook_fn(module, inp, out): + x = inp[0].detach().float() + if x.ndim == 3: + x = x.reshape(-1, x.shape[-1]) + if weight_name not in hessians: + hessians[weight_name] = torch.zeros( + x.shape[1], x.shape[1], dtype=torch.float32, device=device + ) + hessians[weight_name].addmm_(x.T, x) + return hook_fn + + if model.tie_embeddings: + hook_module = model.final_norm + + def make_output_hook(name): + def hook_fn(module, inp, out): + x = out.detach().float() + if x.ndim == 3: + x = x.reshape(-1, x.shape[-1]) + if name not in hessians: + hessians[name] = torch.zeros( + x.shape[1], x.shape[1], dtype=torch.float32, device=device + ) + hessians[name].addmm_(x.T, x) + return hook_fn + + hooks.append( + hook_module.register_forward_hook(make_output_hook("tok_emb.weight")) + ) + model.eval() + with torch.no_grad(): + for _ in range(n_calibration_batches): + x, _ = train_loader.next_batch(h.train_batch_tokens, h.grad_accum_steps) + model.forward_logits(x) + for hook in hooks: + hook.remove() + for i, block in enumerate(model.blocks): + block.attn._calib = False + block.mlp._calib = False + block.mlp.use_fused = True + for name in hessians: + hessians[name] = hessians[name].cpu() / n_calibration_batches + return hessians + + +def gptq_quantize_weight(w, H, clip_sigmas=3.0, clip_range=63, block_size=128): + W_orig = 
w.float().clone() + rows, cols = W_orig.shape + H = H.float().clone() + dead = torch.diag(H) == 0 + H[dead, dead] = 1 + damp = 0.01 * H.diag().mean() + H.diagonal().add_(damp) + perm = torch.argsort(H.diag(), descending=True) + invperm = torch.argsort(perm) + W_perm = W_orig[:, perm].clone() + W_perm[:, dead[perm]] = 0 + H = H[perm][:, perm] + Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H)) + Hinv = torch.linalg.cholesky(Hinv, upper=True) + row_std = W_orig.std(dim=1) + s = (clip_sigmas * row_std / clip_range).clamp_min(1e-10).to(torch.float16) + sf = s.float() + Q = torch.zeros(rows, cols, dtype=torch.int8) + W_work = W_perm.clone() + for i1 in range(0, cols, block_size): + i2 = min(i1 + block_size, cols) + W_block = W_work[:, i1:i2].clone() + Hinv_block = Hinv[i1:i2, i1:i2] + Err = torch.zeros(rows, i2 - i1) + for j in range(i2 - i1): + w_col = W_block[:, j] + d = Hinv_block[j, j] + q_col = torch.clamp(torch.round(w_col / sf), -clip_range, clip_range) + Q[:, i1 + j] = q_col.to(torch.int8) + err = (w_col - q_col.float() * sf) / d + Err[:, j] = err + W_block[:, j:] -= err.unsqueeze(1) * Hinv_block[j, j:].unsqueeze(0) + if i2 < cols: + W_work[:, i2:] -= Err @ Hinv[i1:i2, i2:] + return Q[:, invperm], s + + +def gptq_mixed_quantize(state_dict, hessians, h): + result = {} + meta = {} + for (name, tensor) in state_dict.items(): + t = tensor.detach().cpu().contiguous() + if not t.is_floating_point() or t.numel() <= 65536: + result[name] = t.to(torch.float16) if t.is_floating_point() else t + meta[name] = "passthrough (float16)" + continue + if "tok_emb" in name: + cs = h.embed_clip_sigmas + elif ".mlp." in name: + cs = h.mlp_clip_sigmas + elif ".attn." in name: + cs = h.attn_clip_sigmas + else: + cs = h.matrix_clip_sigmas + bits = h.embed_bits if "tok_emb" in name else h.matrix_bits + clip_range = 2 ** (bits - 1) - 1 + ret = gptq_quantize_weight( + t, hessians[name], clip_sigmas=cs, clip_range=clip_range + ) + q, s = ret + result[name + ".q"] = q + result[name + ".scale"] = s + meta[name] = f"gptq (int{bits})" + categories = collections.defaultdict(set) + for (name, cat) in meta.items(): + short = re.sub("\\.\\d+$", "", re.sub("blocks\\.\\d+", "blocks", name)) + categories[cat].add(short) + log("Quantized weights:") + for cat in sorted(categories): + log(f" {cat}: {', '.join(sorted(categories[cat]))}") + return result, meta + + +def dequantize_mixed(result, meta, template_sd): + out = {} + for (name, orig) in template_sd.items(): + info = meta.get(name) + if info is None: + continue + orig_dtype = orig.dtype + if "passthrough" in info: + t = result[name] + if t.dtype == torch.float16 and orig_dtype in ( + torch.float32, + torch.bfloat16, + ): + t = t.to(orig_dtype) + out[name] = t + continue + q, s = result[name + ".q"], result[name + ".scale"] + if s.ndim > 0: + out[name] = ( + q.float() * s.float().view(q.shape[0], *[1] * (q.ndim - 1)) + ).to(orig_dtype) + else: + out[name] = (q.float() * float(s.item())).to(orig_dtype) + return out + + +_BSHF_MAGIC = b"BSHF" + + +def _byte_shuffle(data, stride=2): + if stride <= 1 or len(data) < stride: + return data + src = np.frombuffer(data, dtype=np.uint8) + n = len(src) + out = np.empty(n, dtype=np.uint8) + dest_off = 0 + for pos in range(stride): + chunk = src[pos::stride] + out[dest_off : dest_off + len(chunk)] = chunk + dest_off += len(chunk) + return _BSHF_MAGIC + bytes([stride]) + out.tobytes() + + +def _byte_unshuffle(data): + if len(data) < 5 or data[:4] != _BSHF_MAGIC: + return data + stride = data[4] + if stride < 2: + return data[5:] + 
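# De-interleave: per-position chunk lengths mirror the split made by _byte_shuffle. +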
payload = np.frombuffer(data, dtype=np.uint8, offset=5) + n = len(payload) + out = np.empty(n, dtype=np.uint8) + src_off = 0 + for pos in range(stride): + chunk_len = n // stride + (1 if pos < n % stride else 0) + out[pos::stride][:chunk_len] = payload[src_off : src_off + chunk_len] + src_off += chunk_len + return out.tobytes() + + +def _compress(data, compressor): + data = _byte_shuffle(data) + if compressor == "lzma": + return lzma.compress(data, preset=6) + elif compressor == "brotli": + import brotli + + return brotli.compress(data, quality=11) + raise ValueError(f"Unknown compressor: {compressor!r}") + + +def _decompress(data, compressor): + if compressor == "lzma": + raw = lzma.decompress(data) + elif compressor == "brotli": + import brotli + + raw = brotli.decompress(data) + else: + raise ValueError(f"Unknown compressor: {compressor!r}") + raw = _byte_unshuffle(raw) + return raw + + +def _unbank_state_dict(state_dict, num_layers): + sd = {} + n = num_layers + for k, v in state_dict.items(): + t = v.detach().cpu() if v is not None else None + if k == "qo_bank": + for i in range(n): + sd[f"blocks.{i}.attn.c_q.weight"] = t[i] + sd[f"blocks.{i}.attn.proj.weight"] = t[n + i] + elif k == "kv_bank": + for i in range(n): + sd[f"blocks.{i}.attn.c_k.weight"] = t[i] + sd[f"blocks.{i}.attn.c_v.weight"] = t[n + i] + elif k == "mlp_up_bank": + for i in range(n): + sd[f"blocks.{i}.mlp.fc.weight"] = t[i] + elif k == "mlp_down_bank": + for i in range(n): + sd[f"blocks.{i}.mlp.proj.weight"] = t[i] + else: + if t is not None: + sd[k] = t + return sd + + +def _rebank_state_dict(flat_sd, num_layers, model_dim, kv_dim, hidden_dim): + sd = {} + n = num_layers + sd["qo_bank"] = torch.zeros(2 * n, model_dim, model_dim) + sd["kv_bank"] = torch.zeros(2 * n, kv_dim, model_dim) + for i in range(n): + sd["qo_bank"][i] = flat_sd[f"blocks.{i}.attn.c_q.weight"] + sd["qo_bank"][n + i] = flat_sd[f"blocks.{i}.attn.proj.weight"] + sd["kv_bank"][i] = flat_sd[f"blocks.{i}.attn.c_k.weight"] + sd["kv_bank"][n + i] = flat_sd[f"blocks.{i}.attn.c_v.weight"] + sd["mlp_up_bank"] = torch.zeros(n, hidden_dim, model_dim) + sd["mlp_down_bank"] = torch.zeros(n, model_dim, hidden_dim) + for i in range(n): + sd["mlp_up_bank"][i] = flat_sd[f"blocks.{i}.mlp.fc.weight"] + sd["mlp_down_bank"][i] = flat_sd[f"blocks.{i}.mlp.proj.weight"] + for k, v in flat_sd.items(): + if not ( + k.startswith("blocks.") + and any( + p in k + for p in [ + ".attn.c_q.", ".attn.c_k.", ".attn.c_v.", + ".attn.proj.", ".mlp.fc.", ".mlp.proj.", + ] + ) + ): + sd[k] = v + return sd + + + +def _compressed_code_size(code): + code_raw = code.encode("utf-8") + minified = subprocess.run( + ["pyminify", "--no-rename-locals", "--no-hoist-literals", "--remove-literal-statements", "-"], + input=code_raw, capture_output=True, check=True, + ).stdout + compressed = lzma.compress(minified) + encoded = base64.b85encode(compressed) + wrapper = b'import lzma as L,base64 as B\nexec(L.decompress(B.b85decode("' + encoded + b'")))\n' + return len(code_raw), len(wrapper) + + +def serialize(h, base_model, code): + code_bytes_uncompressed, code_bytes = _compressed_code_size(code) + if h.is_main_process: + torch.save(base_model.state_dict(), h.model_path) + model_bytes = os.path.getsize(h.model_path) + log(f"Serialized model: {model_bytes} bytes") + log(f"Code size (uncompressed): {code_bytes_uncompressed} bytes") + log(f"Code size (compressed): {code_bytes} bytes") + sd_cpu = _unbank_state_dict(base_model.state_dict(), h.num_layers) + device = torch.device("cuda", h.local_rank) + 
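# Quantization path: collect per-layer Hessians on calibration batches, GPTQ-quantize the flat (unbanked) state dict, then byte-shuffle and compress the blob. +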
log("GPTQ:collecting Hessians from calibration data...") + t0 = time.perf_counter() + calib_loader = ShuffledSequenceLoader(h, device) + hessians = collect_hessians( + base_model, + calib_loader, + h, + device, + n_calibration_batches=h.gptq_calibration_batches, + ) + log(f"GPTQ:collected {len(hessians)} Hessians in {time.perf_counter()-t0:.1f}s") + quant_result, quant_meta = gptq_mixed_quantize(sd_cpu, hessians, h) + quant_buf = io.BytesIO() + torch.save({"w": quant_result, "m": quant_meta}, quant_buf) + quant_raw = quant_buf.getvalue() + quant_blob = _compress(quant_raw, h.compressor) + quant_file_bytes = len(quant_blob) + bytes_total = quant_file_bytes + code_bytes + if h.is_main_process: + with open(h.quantized_model_path, "wb") as f: + f.write(quant_blob) + log(f"Serialized model quantized+{h.compressor}: {quant_file_bytes} bytes") + log(f"Total submission size quantized+{h.compressor}: {bytes_total} bytes") + return bytes_total, quant_file_bytes + + +def deserialize(h, device): + eval_model = GPT(h).to(device).bfloat16() + restore_fp32_params(eval_model) + flat_template = _unbank_state_dict(eval_model.state_dict(), h.num_layers) + with open(h.quantized_model_path, "rb") as f: + quant_blob_disk = f.read() + quant_state = torch.load( + io.BytesIO(_decompress(quant_blob_disk, h.compressor)), map_location="cpu" + ) + deq_flat = dequantize_mixed(quant_state["w"], quant_state["m"], flat_template) + head_dim = h.model_dim // h.num_heads + kv_dim = h.num_kv_heads * head_dim + hidden_dim = int(h.mlp_mult * h.model_dim) + deq_state = _rebank_state_dict(deq_flat, h.num_layers, h.model_dim, kv_dim, hidden_dim) + eval_model.load_state_dict(deq_state, strict=True) + return eval_model + + +def _loss_bpb(loss_sum, token_count, byte_count): + val_loss = (loss_sum / token_count).item() + val_bpb = val_loss / math.log(2.0) * (token_count.item() / byte_count.item()) + return val_loss, val_bpb + + +def eval_val(h, device, val_data, model, forward_logits_fn=None): + seq_len = h.eval_seq_len + local_batch_tokens = h.val_batch_tokens // (h.world_size * h.grad_accum_steps) + if local_batch_tokens < seq_len: + raise ValueError( + f"VAL_BATCH_SIZE must provide at least one sequence per rank; got VAL_BATCH_SIZE={h.val_batch_tokens}, WORLD_SIZE={h.world_size}, GRAD_ACCUM_STEPS={h.grad_accum_steps}, seq_len={seq_len}" + ) + local_batch_seqs = local_batch_tokens // seq_len + total_seqs = (val_data.val_tokens.numel() - 1) // seq_len + seq_start = total_seqs * h.rank // h.world_size + seq_end = total_seqs * (h.rank + 1) // h.world_size + + # TODO: Don't truncate this. 
+ seq_end = seq_start + ((seq_end - seq_start) // local_batch_seqs) * local_batch_seqs + + val_loss_sum = torch.zeros((), device=device, dtype=torch.float64) + val_token_count = torch.zeros((), device=device, dtype=torch.float64) + val_byte_count = torch.zeros((), device=device, dtype=torch.float64) + run_forward_logits = ( + (model.module.forward_logits if hasattr(model, "module") else model.forward_logits) + if forward_logits_fn is None + else forward_logits_fn + ) + model.eval() + global BOS_ID + if BOS_ID is None: + BOS_ID = 1 + with torch.no_grad(): + for batch_seq_start in range(seq_start, seq_end, local_batch_seqs): + batch_seq_end = min(batch_seq_start + local_batch_seqs, seq_end) + raw_start = batch_seq_start * seq_len + raw_end = batch_seq_end * seq_len + 1 + local = val_data.val_tokens[raw_start:raw_end].to( + device=device, dtype=torch.int64, non_blocking=True + ) + x = local[:-1] + y = local[1:] + bos_pos = (x == BOS_ID).nonzero(as_tuple=True)[0].tolist() + cu_seqlens, max_seqlen = _build_cu_seqlens( + bos_pos, x.numel(), x.device, h.eval_seq_len, 64 + ) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + logits = run_forward_logits( + x[None], cu_seqlens=cu_seqlens, max_seqlen=max_seqlen + ).detach() + per_token_loss = F.cross_entropy( + logits.reshape(-1, logits.size(-1)).float(), + y.reshape(-1), + reduction="none", + ) + val_loss_sum += per_token_loss.to(torch.float64).sum() + val_token_count += float(y.numel()) + prev_ids = x + tgt_ids = y + token_bytes = val_data.base_bytes_lut[tgt_ids].to(dtype=torch.int16) + token_bytes += ( + val_data.has_leading_space_lut[tgt_ids] + & ~val_data.is_boundary_token_lut[prev_ids] + ).to(dtype=torch.int16) + val_byte_count += token_bytes.to(torch.float64).sum() + if dist.is_available() and dist.is_initialized(): + dist.all_reduce(val_loss_sum, op=dist.ReduceOp.SUM) + dist.all_reduce(val_token_count, op=dist.ReduceOp.SUM) + dist.all_reduce(val_byte_count, op=dist.ReduceOp.SUM) + model.train() + return _loss_bpb(val_loss_sum, val_token_count, val_byte_count) + + +def eval_val_sliding(h, device, val_data, base_model, forward_logits_fn=None, batch_seqs=32): + global BOS_ID + if BOS_ID is None: + BOS_ID = 1 + base_model.eval() + run_forward_logits = base_model.forward_logits if forward_logits_fn is None else forward_logits_fn + seq_len = h.eval_seq_len + stride = h.eval_stride + total_tokens = val_data.val_tokens.numel() - 1 + context_size = seq_len - stride + window_starts = [ws for ws in range(0, total_tokens, stride) + if ws + context_size < total_tokens] + total_windows = len(window_starts) + my_s = (total_windows * h.rank) // h.world_size + my_e = (total_windows * (h.rank + 1)) // h.world_size + my_windows = window_starts[my_s:my_e] + loss_sum = torch.zeros((), device=device, dtype=torch.float64) + token_count = torch.zeros((), device=device, dtype=torch.float64) + byte_count = torch.zeros((), device=device, dtype=torch.float64) + total_batches = (len(my_windows) + batch_seqs - 1) // batch_seqs + is_master = h.rank == 0 + cu_bucket = 64 + t_sw_start = time.perf_counter() + with torch.no_grad(): + for bi in range(0, len(my_windows), batch_seqs): + batch_idx = bi // batch_seqs + if is_master and (batch_idx % 50 == 0 or batch_idx == total_batches - 1): + elapsed = time.perf_counter() - t_sw_start + rl = float(loss_sum.item() / token_count.item()) if token_count.item() > 0 else 0.0 + rb = float((rl / math.log(2.0)) * token_count.item() / byte_count.item()) if byte_count.item() > 0 else 0.0 + 
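# Rank-0 progress report: running loss/bpb over this rank's locally scored tokens so far. +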
log(f"sliding_progress: batch {batch_idx+1}/{total_batches} " + f"tokens:{int(token_count.item())} running_loss:{rl:.4f} running_bpb:{rb:.4f} " + f"elapsed:{elapsed:.1f}s") + batch_ws = my_windows[bi:bi + batch_seqs] + x_parts = [] + y_parts = [] + cu_starts = [] + score_ranges = [] + offset = 0 + for ws in batch_ws: + end = min(ws + seq_len, total_tokens) + wlen = end - ws + chunk_cpu = val_data.val_tokens[ws:end + 1] + bos_pos = (chunk_cpu[:-1] == BOS_ID).nonzero(as_tuple=True)[0].tolist() + if not bos_pos or bos_pos[0] != 0: + bos_pos = [0] + bos_pos + cu_starts.extend(offset + pos for pos in bos_pos) + chunk = chunk_cpu.to(dtype=torch.int64, device=device) + x_parts.append(chunk[:-1]) + y_parts.append(chunk[1:]) + score_ranges.append((offset, wlen, ws)) + offset += wlen + x_cat = torch.cat(x_parts, dim=0)[None] + y_cat = torch.cat(y_parts, dim=0) + boundaries = cu_starts + [offset] + padded_len = get_next_multiple_of_n(len(boundaries), cu_bucket) + cu_seqlens = torch.full((padded_len,), offset, dtype=torch.int32, device=device) + cu_seqlens[:len(boundaries)] = torch.tensor(boundaries, dtype=torch.int32, device=device) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + logits = run_forward_logits(x_cat, cu_seqlens=cu_seqlens, max_seqlen=seq_len) + flat_nll = F.cross_entropy( + logits.reshape(-1, logits.size(-1)).float(), + y_cat, + reduction="none", + ) + flat_x = x_cat.reshape(-1) + for off, wlen, ws in score_ranges: + s = 0 if ws == 0 else context_size + lo = off + s + hi = off + wlen + scored_nll = flat_nll[lo:hi].to(torch.float64) + loss_sum += scored_nll.sum() + token_count += float(hi - lo) + tgt = y_cat[lo:hi] + prev = flat_x[lo:hi] + tb = val_data.base_bytes_lut[tgt].to(torch.float64) + tb += (val_data.has_leading_space_lut[tgt] & ~val_data.is_boundary_token_lut[prev]).to(torch.float64) + byte_count += tb.sum() + if dist.is_available() and dist.is_initialized(): + dist.all_reduce(loss_sum, op=dist.ReduceOp.SUM) + dist.all_reduce(token_count, op=dist.ReduceOp.SUM) + dist.all_reduce(byte_count, op=dist.ReduceOp.SUM) + base_model.train() + return _loss_bpb(loss_sum, token_count, byte_count) + + +def _find_docs(all_tokens): + bos_positions = (all_tokens == BOS_ID).nonzero(as_tuple=True)[0].numpy() + docs = [] + if len(bos_positions) == 0: + # Fallback for tokenizers without BOS tokens (e.g. 
casefold): + # split into synthetic documents of ~2048 tokens each + synth_doc_len = 2048 + total = all_tokens.numel() + for start in range(0, total - 1, synth_doc_len): + doc_len = min(synth_doc_len, total - start) + if doc_len >= 2: + docs.append((start, doc_len)) + return docs + for i in range(len(bos_positions)): + start = int(bos_positions[i]) + end = ( + int(bos_positions[i + 1]) + if i + 1 < len(bos_positions) + else all_tokens.numel() + ) + if i + 1 < len(bos_positions): + end += 1 + assert end - start >= 2 + docs.append((start, end - start)) + return docs + + +def _build_ttt_global_batches(doc_entries, h, ascending=False): + batch_size = h.ttt_batch_size + global_doc_entries = sorted(doc_entries, key=lambda x: x[1][1]) + global_batches = [ + global_doc_entries[i : i + batch_size] + for i in range(0, len(global_doc_entries), batch_size) + ] + indexed = list(enumerate(global_batches)) + if not ascending: + indexed.sort(key=lambda ib: -max(dl for _, (_, dl) in ib[1])) + return indexed + + +def _init_batch_counter(path): + with open(path, "wb") as f: + f.write((0).to_bytes(4, "little")) + + +def _claim_next_batch(counter_path, queue_len): + try: + with open(counter_path, "r+b") as f: + fcntl.flock(f, fcntl.LOCK_EX) + idx = int.from_bytes(f.read(4), "little") + f.seek(0) + f.write((idx + 1).to_bytes(4, "little")) + f.flush() + except FileNotFoundError: + return queue_len + return idx + + +def _compute_chunk_window(ci, pred_len, num_chunks, chunk_size, eval_seq_len): + chunk_end = pred_len if ci == num_chunks - 1 else (ci + 1) * chunk_size + win_start = max(0, chunk_end - eval_seq_len) + win_len = chunk_end - win_start + chunk_start = ci * chunk_size + chunk_offset = chunk_start - win_start + chunk_len = chunk_end - chunk_start + return win_start, win_len, chunk_offset, chunk_len + + +def _accumulate_bpb( + ptl, + x, + y, + chunk_offsets, + chunk_lens, + pos_idx, + base_bytes_lut, + has_leading_space_lut, + is_boundary_token_lut, + loss_sum, + byte_sum, + token_count, +): + pos = pos_idx[: x.size(1)].unsqueeze(0) + mask = ( + (chunk_lens.unsqueeze(1) > 0) + & (pos >= chunk_offsets.unsqueeze(1)) + & (pos < (chunk_offsets + chunk_lens).unsqueeze(1)) + ) + mask_f64 = mask.to(torch.float64) + tok_bytes = base_bytes_lut[y].to(torch.float64) + tok_bytes += (has_leading_space_lut[y] & ~is_boundary_token_lut[x]).to( + torch.float64 + ) + loss_sum += (ptl.to(torch.float64) * mask_f64).sum() + byte_sum += (tok_bytes * mask_f64).sum() + token_count += chunk_lens.to(torch.float64).sum() + + +def _loss_bpb_from_sums(loss_sum, token_count, byte_sum): + val_loss = (loss_sum / token_count).item() + val_bpb = val_loss / math.log(2.0) * (token_count.item() / byte_sum.item()) + return val_loss, val_bpb + + +def _split_doc_entries_for_phased(doc_entries, prefix_docs): + prefix_docs = max(0, min(len(doc_entries), int(prefix_docs))) + return doc_entries[:prefix_docs], doc_entries[prefix_docs:] + + +def _add_to_counter(path, delta): + try: + with open(path, "r+b") as f: + fcntl.flock(f, fcntl.LOCK_EX) + cur = int.from_bytes(f.read(8), "little", signed=True) + cur += int(delta) + f.seek(0) + f.write(int(cur).to_bytes(8, "little", signed=True)) + f.flush() + return cur + except FileNotFoundError: + return int(delta) + + +def _init_int64_counter(path): + with open(path, "wb") as f: + f.write((0).to_bytes(8, "little", signed=True)) + + +def _select_ttt_doc_entries(docs, h): + doc_entries = list(enumerate(docs)) + if h.val_doc_fraction < 1.0: + sample_n = max(1, int(round(len(docs) * h.val_doc_fraction))) + 
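# Sample with a seed-keyed RNG so every rank selects the identical document subset. +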
sampled_indices = sorted( + random.Random(h.seed).sample(range(len(docs)), sample_n) + ) + return [(i, docs[i]) for i in sampled_indices] + return doc_entries + + +def train_val_ttt_global_sgd_distributed(h, device, val_data, base_model, val_tokens, batch_seqs=None): + global BOS_ID + if BOS_ID is None: + BOS_ID = 1 + base_model.eval() + seq_len = h.eval_seq_len + total_tokens = val_tokens.numel() - 1 + ttt_chunk = h.global_ttt_chunk_tokens + batch_seqs = h.global_ttt_batch_seqs if batch_seqs is None else batch_seqs + num_chunks = (total_tokens + ttt_chunk - 1) // ttt_chunk + ttt_params = [p for p in base_model.parameters()] + for p in ttt_params: + p.requires_grad_(True) + optimizer = torch.optim.SGD( + ttt_params, lr=h.global_ttt_lr, momentum=h.global_ttt_momentum + ) + t_start = time.perf_counter() + for ci in range(num_chunks): + chunk_start = ci * ttt_chunk + chunk_end = min((ci + 1) * ttt_chunk, total_tokens) + is_last_chunk = ci == num_chunks - 1 + if is_last_chunk or h.global_ttt_epochs <= 0: + continue + base_model.train() + chunk_seqs = (chunk_end - chunk_start) // seq_len + if chunk_seqs <= 0: + continue + warmup_chunks = max(0, min(h.global_ttt_warmup_chunks, num_chunks - 1)) + if warmup_chunks > 0 and ci < warmup_chunks: + warmup_denom = max(warmup_chunks - 1, 1) + warmup_t = ci / warmup_denom + lr_now = ( + h.global_ttt_warmup_start_lr + + (h.global_ttt_lr - h.global_ttt_warmup_start_lr) * warmup_t + ) + else: + decay_steps = max(num_chunks - 1 - warmup_chunks, 1) + decay_ci = max(ci - warmup_chunks, 0) + lr_now = h.global_ttt_lr * 0.5 * ( + 1.0 + math.cos(math.pi * decay_ci / decay_steps) + ) + for pg in optimizer.param_groups: + pg["lr"] = lr_now + my_seq_s = chunk_seqs * h.rank // h.world_size + my_seq_e = chunk_seqs * (h.rank + 1) // h.world_size + my_chunk_seqs = my_seq_e - my_seq_s + for _ in range(h.global_ttt_epochs): + for bs in range(0, my_chunk_seqs, batch_seqs): + be = min(bs + batch_seqs, my_chunk_seqs) + actual_bs = my_seq_s + bs + start_tok = chunk_start + actual_bs * seq_len + end_tok = chunk_start + (my_seq_s + be) * seq_len + 1 + if end_tok > val_tokens.numel(): + continue + local = val_tokens[start_tok:end_tok].to(device=device, dtype=torch.int64) + x_flat = local[:-1] + y_flat = local[1:] + optimizer.zero_grad(set_to_none=True) + with torch.enable_grad(): + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + if h.global_ttt_respect_doc_boundaries: + bos_pos = (x_flat == BOS_ID).nonzero(as_tuple=True)[0].tolist() + cu_seqlens, max_seqlen = _build_cu_seqlens( + bos_pos, x_flat.numel(), x_flat.device, h.eval_seq_len, 64 + ) + loss = base_model( + x_flat[None], + y_flat[None], + cu_seqlens=cu_seqlens, + max_seqlen=max_seqlen, + ) + else: + x = x_flat.reshape(-1, seq_len) + y = y_flat.reshape(-1, seq_len) + loss = base_model(x, y) + loss.backward() + if dist.is_available() and dist.is_initialized(): + for p in ttt_params: + if p.grad is not None: + dist.all_reduce(p.grad, op=dist.ReduceOp.SUM) + p.grad.mul_(1.0 / h.world_size) + if h.global_ttt_grad_clip > 0: + torch.nn.utils.clip_grad_norm_(ttt_params, h.global_ttt_grad_clip) + optimizer.step() + base_model.eval() + if h.rank == 0: + elapsed = time.perf_counter() - t_start + log( + f"tttg: c{ci+1}/{num_chunks} lr:{lr_now:.6f} t:{elapsed:.1f}s" + ) + for p in base_model.parameters(): + p.requires_grad_(True) + base_model.eval() + + +def eval_val_ttt_phased(h, base_model, device, val_data, forward_ttt_train): + global BOS_ID + if BOS_ID is None: + BOS_ID = 1 + base_model.eval() + for p in 
base_model.parameters(): + p.requires_grad_(False) + all_tokens = val_data.val_tokens + all_tokens_idx = all_tokens.to(torch.int32) + docs = _find_docs(all_tokens) + doc_entries = _select_ttt_doc_entries(docs, h) + prefix_doc_limit = max(0, min(len(doc_entries), int(h.phased_ttt_prefix_docs))) + num_phases = max(1, int(h.phased_ttt_num_phases)) + phase_boundaries = [] + for pi in range(num_phases): + boundary = prefix_doc_limit * (pi + 1) // num_phases + phase_boundaries.append(boundary) + current_phase = 0 + current_phase_boundary = phase_boundaries[0] + log( + "ttt_phased:" + f" total_docs:{len(doc_entries)} prefix_docs:{prefix_doc_limit} " + f"suffix_docs:{len(doc_entries) - prefix_doc_limit}" + f" num_phases:{num_phases} boundaries:{phase_boundaries}" + ) + chunk_size, eval_seq_len = h.ttt_chunk_size, h.ttt_eval_seq_len + eval_batch_set = None + if h.ttt_eval_batches: + eval_batch_set = set(int(x) for x in h.ttt_eval_batches.split(",") if x.strip()) + use_ascending = eval_batch_set is not None + global_batches_sorted = _build_ttt_global_batches( + doc_entries, h, ascending=use_ascending + ) + queue_len = len(global_batches_sorted) + counter_path = f"/tmp/ttt_counter_{h.run_id}" + prefix_counter_path = f"/tmp/ttt_prefix_counter_{h.run_id}" + pause_flag_path = f"/tmp/ttt_pause_flag_{h.run_id}" + if h.rank == 0: + _init_batch_counter(counter_path) + _init_int64_counter(prefix_counter_path) + try: + os.remove(pause_flag_path) + except FileNotFoundError: + pass + if dist.is_available() and dist.is_initialized(): + path_list = [counter_path, prefix_counter_path, pause_flag_path] + dist.broadcast_object_list(path_list, src=0) + counter_path, prefix_counter_path, pause_flag_path = path_list + dist.barrier() + loss_sum = torch.zeros((), device=device, dtype=torch.float64) + byte_sum = torch.zeros((), device=device, dtype=torch.float64) + token_count = torch.zeros((), device=device, dtype=torch.float64) + t_start = time.perf_counter() + reusable_lora = BatchedTTTLoRA( + h.ttt_batch_size, base_model, h.ttt_lora_rank, + k_lora=h.ttt_k_lora, mlp_lora=h.ttt_mlp_lora, o_lora=h.ttt_o_lora, + ).to(device) + + def _build_opt(lora): + if h.ttt_optimizer == "sgd": + return torch.optim.SGD( + lora.parameters(), lr=h.ttt_lora_lr, + momentum=h.ttt_beta1, weight_decay=h.ttt_weight_decay, + ) + return torch.optim.AdamW( + lora.parameters(), lr=h.ttt_lora_lr, + betas=(h.ttt_beta1, h.ttt_beta2), + eps=1e-10, weight_decay=h.ttt_weight_decay, fused=True, + ) + + reusable_opt = _build_opt(reusable_lora) + local_scored_docs = [] + global_ttt_done = prefix_doc_limit == 0 + try: + while True: + queue_idx = _claim_next_batch(counter_path, queue_len) + if queue_idx >= queue_len: + break + orig_batch_idx, batch_entries = global_batches_sorted[queue_idx] + batch = [doc for _, doc in batch_entries] + bsz = len(batch) + prev_loss = loss_sum.item() + prev_bytes = byte_sum.item() + prev_tokens = token_count.item() + if bsz == reusable_lora.bsz: + reusable_lora.reset() + for s in reusable_opt.state.values(): + for k, v in s.items(): + if isinstance(v, torch.Tensor): + v.zero_() + elif k == "step": + s[k] = 0 + cur_lora = reusable_lora + cur_opt = reusable_opt + else: + cur_lora = BatchedTTTLoRA( + bsz, base_model, h.ttt_lora_rank, + k_lora=h.ttt_k_lora, mlp_lora=h.ttt_mlp_lora, o_lora=h.ttt_o_lora, + ).to(device) + cur_opt = _build_opt(cur_lora) + pred_lens = [doc_len - 1 for _, doc_len in batch] + num_chunks = [(pl + chunk_size - 1) // chunk_size for pl in pred_lens] + max_nc = max(num_chunks) + num_chunks_t = 
torch.tensor(num_chunks, dtype=torch.int64, device=device) + for ci in range(max_nc): + active = [ci < nc for nc in num_chunks] + needs_train = any(ci < nc - 1 for nc in num_chunks) + tok_starts = torch.zeros(bsz, dtype=torch.int64) + tok_wls = torch.zeros(bsz, dtype=torch.int64) + chunk_offsets_cpu = torch.zeros(bsz, dtype=torch.int64) + chunk_lens_cpu = torch.zeros(bsz, dtype=torch.int64) + for b in range(bsz): + if not active[b]: + continue + doc_start, doc_len = batch[b] + win_start, win_len, chunk_offset, chunk_len = _compute_chunk_window( + ci, pred_lens[b], num_chunks[b], chunk_size, eval_seq_len + ) + tok_starts[b] = doc_start + win_start + tok_wls[b] = win_len + chunk_offsets_cpu[b] = chunk_offset + chunk_lens_cpu[b] = chunk_len + _, context_size, chunk_offset, _ = _compute_chunk_window( + ci, (ci + 1) * chunk_size, ci + 1, chunk_size, eval_seq_len + ) + col_idx = torch.arange(context_size + 1) + idx = tok_starts.unsqueeze(1) + col_idx.unsqueeze(0) + idx.clamp_(max=all_tokens.numel() - 1) + gathered_gpu = all_tokens_idx[idx].to( + device=device, dtype=torch.int64, non_blocking=True + ) + valid = (col_idx[:context_size].unsqueeze(0) < tok_wls.unsqueeze(1)).to( + device, non_blocking=True + ) + chunk_offsets = chunk_offsets_cpu.to(device, non_blocking=True) + chunk_lens = chunk_lens_cpu.to(device, non_blocking=True) + x = torch.where(valid, gathered_gpu[:, :context_size], 0) + y = torch.where(valid, gathered_gpu[:, 1 : context_size + 1], 0) + ctx_pos = torch.arange(context_size, device=device, dtype=torch.int64) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + per_tok_loss = forward_ttt_train(x, y, lora=cur_lora) + with torch.no_grad(): + _accumulate_bpb( + per_tok_loss, + x, + y, + chunk_offsets, + chunk_lens, + ctx_pos, + val_data.base_bytes_lut, + val_data.has_leading_space_lut, + val_data.is_boundary_token_lut, + loss_sum, + byte_sum, + token_count, + ) + if needs_train: + activate_chunk_mask = (num_chunks_t - 1 > ci).float() + for gi in range(h.ttt_grad_steps): + if gi > 0: + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + per_tok_loss = forward_ttt_train(x, y, lora=cur_lora) + per_doc = per_tok_loss[ + :, chunk_offset : chunk_offset + chunk_size + ].mean(dim=-1) + cur_opt.zero_grad(set_to_none=True) + (per_doc * activate_chunk_mask).sum().backward() + cur_opt.step() + else: + del per_tok_loss + batch_num = orig_batch_idx + 1 + doc_lens = [dl for _, dl in batch] + should_report = batch_num in eval_batch_set if eval_batch_set is not None else True + if should_report: + cur_tokens = token_count.item() + cur_loss_val = loss_sum.item() + cur_bytes_val = byte_sum.item() + dt = cur_tokens - prev_tokens + db = cur_bytes_val - prev_bytes + if dt > 0 and db > 0: + b_loss = (cur_loss_val - prev_loss) / dt + b_bpb = b_loss / math.log(2.0) * (dt / db) + else: + b_loss = b_bpb = 0.0 + r_loss = cur_loss_val / max(cur_tokens, 1) + r_bpb = r_loss / math.log(2.0) * (cur_tokens / max(cur_bytes_val, 1)) + elapsed = time.perf_counter() - t_start + log( + f"ttp: b{batch_num}/{queue_len} bl:{b_loss:.4f} bb:{b_bpb:.4f} " + f"rl:{r_loss:.4f} rb:{r_bpb:.4f} dl:{min(doc_lens)}-{max(doc_lens)} " + f"gd:{int(global_ttt_done)}" + ) + if not global_ttt_done: + local_scored_docs.extend( + (orig_batch_idx, pos, doc_start, doc_len) + for pos, (doc_start, doc_len) in enumerate(batch) + ) + prefix_done = _add_to_counter(prefix_counter_path, len(batch_entries)) + if prefix_done >= current_phase_boundary: + try: + with open(pause_flag_path, "x"): + pass + except FileExistsError: 
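+ # Another rank already created the pause flag; nothing more to do.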
+ pass + should_pause = os.path.exists(pause_flag_path) + if should_pause: + if dist.is_available() and dist.is_initialized(): + dist.barrier() + gathered_scored_docs = [None] * h.world_size + if dist.is_available() and dist.is_initialized(): + dist.all_gather_object(gathered_scored_docs, local_scored_docs) + else: + gathered_scored_docs = [local_scored_docs] + scored_docs_for_global = [] + for rank_docs in gathered_scored_docs: + if rank_docs: + scored_docs_for_global.extend(rank_docs) + scored_docs_for_global.sort(key=lambda x: (x[0], x[1])) + scored_docs_for_global = scored_docs_for_global[:current_phase_boundary] + scored_token_chunks = [ + val_data.val_tokens[doc_start : doc_start + doc_len] + for _, _, doc_start, doc_len in scored_docs_for_global + ] + if scored_token_chunks: + global_ttt_tokens = torch.cat(scored_token_chunks) + else: + global_ttt_tokens = val_data.val_tokens[:0] + if h.rank == 0: + prefix_done = 0 + try: + with open(prefix_counter_path, "rb") as f: + prefix_done = int.from_bytes( + f.read(8), "little", signed=True + ) + except FileNotFoundError: + pass + log( + f"ttpp: phase:{current_phase + 1}/{num_phases} pd:{prefix_done} " + f"gd:{len(scored_docs_for_global)} " + f"t:{time.perf_counter() - t_start:.1f}s" + ) + train_val_ttt_global_sgd_distributed( + h, device, val_data, base_model, global_ttt_tokens + ) + for p in base_model.parameters(): + p.requires_grad_(False) + reusable_lora = BatchedTTTLoRA( + h.ttt_batch_size, base_model, h.ttt_lora_rank, + k_lora=h.ttt_k_lora, mlp_lora=h.ttt_mlp_lora, o_lora=h.ttt_o_lora, + ).to(device) + reusable_opt = _build_opt(reusable_lora) + current_phase += 1 + if current_phase >= num_phases: + global_ttt_done = True + else: + current_phase_boundary = phase_boundaries[current_phase] + if h.rank == 0: + try: + os.remove(pause_flag_path) + except FileNotFoundError: + pass + if dist.is_available() and dist.is_initialized(): + dist.barrier() + if h.rank == 0: + log(f"ttpr: phase:{current_phase}/{num_phases} t:{time.perf_counter() - t_start:.1f}s") + del cur_lora, cur_opt + finally: + pass + if dist.is_available() and dist.is_initialized(): + dist.all_reduce(loss_sum, op=dist.ReduceOp.SUM) + dist.all_reduce(byte_sum, op=dist.ReduceOp.SUM) + dist.all_reduce(token_count, op=dist.ReduceOp.SUM) + for p in base_model.parameters(): + p.requires_grad_(True) + base_model.train() + return _loss_bpb_from_sums(loss_sum, token_count, byte_sum) + + +def timed_eval(label, fn, *args, **kwargs): + torch.cuda.synchronize() + t0 = time.perf_counter() + val_loss, val_bpb = fn(*args, **kwargs) + torch.cuda.synchronize() + elapsed_ms = 1e3 * (time.perf_counter() - t0) + log( + f"{label} val_loss:{val_loss:.8f} val_bpb:{val_bpb:.8f} eval_time:{elapsed_ms:.0f}ms" + ) + return val_loss, val_bpb + + +def train_model(h, device, val_data): + base_model = GPT(h).to(device).bfloat16() + restore_fp32_params(base_model) + compiled_model = torch.compile(base_model, dynamic=False, fullgraph=True) + compiled_forward_logits = torch.compile( + base_model.forward_logits, dynamic=False, fullgraph=True + ) + model = compiled_model + log(f"model_params:{sum(p.numel()for p in base_model.parameters())}") + optimizers = Optimizers(h, base_model) + train_loader = DocumentPackingLoader(h, device) + max_wallclock_ms = ( + 1e3 * h.max_wallclock_seconds if h.max_wallclock_seconds > 0 else None + ) + if max_wallclock_ms is not None: + max_wallclock_ms -= h.gptq_reserve_seconds * 1e3 + log( + f"gptq:reserving {h.gptq_reserve_seconds:.0f}s, effective={max_wallclock_ms:.0f}ms" + ) + + 
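# Under a wallclock cap, the schedule fraction tracks elapsed time rather than step count. +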
def training_frac(step, elapsed_ms): + if max_wallclock_ms is None: + return step / max(h.iterations, 1) + return elapsed_ms / max(max_wallclock_ms, 1e-09) + + def lr_mul(frac): + if h.warmdown_frac <= 0: + return 1.0 + if frac >= 1.0 - h.warmdown_frac: + return max((1.0 - frac) / h.warmdown_frac, h.min_lr) + return 1.0 + + def step_fn(step, lr_scale): + optimizers.zero_grad_all() + train_loss = torch.zeros((), device=device) + for micro_step in range(h.grad_accum_steps): + x, y, cu_seqlens, _max_seqlen = train_loader.next_batch( + h.train_batch_tokens, h.grad_accum_steps + ) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + loss = model(x, y, cu_seqlens=cu_seqlens, max_seqlen=h.train_seq_len) + train_loss += loss.detach() + (loss / h.grad_accum_steps).backward() + train_loss /= h.grad_accum_steps + frac = ( + min(step / h.muon_momentum_warmup_steps, 1.0) + if h.muon_momentum_warmup_steps > 0 + else 1.0 + ) + muon_momentum = ( + 1 - frac + ) * h.muon_momentum_warmup_start + frac * h.muon_momentum + for group in optimizers.optimizer_muon.param_groups: + group["momentum"] = muon_momentum + for opt in optimizers: + for group in opt.param_groups: + group["lr"] = group["base_lr"] * lr_scale + if h.grad_clip_norm > 0: + torch.nn.utils.clip_grad_norm_(base_model.parameters(), h.grad_clip_norm) + optimizers.step(distributed=h.distributed) + return train_loss + + if h.warmup_steps > 0: + initial_model_state = { + name: tensor.detach().cpu().clone() + for (name, tensor) in base_model.state_dict().items() + } + initial_optimizer_states = [ + copy.deepcopy(opt.state_dict()) for opt in optimizers + ] + model.train() + num_tokens_local = h.train_batch_tokens // h.world_size + for blk in base_model.blocks: + blk.attn.rotary(num_tokens_local, device, torch.bfloat16) + cu_bucket_size = train_loader.cu_bucket_size + warmup_cu_buckets = tuple(cu_bucket_size * i for i in range(1, 5)) + warmup_cu_iters = 3 + x, y, cu_seqlens, _ = train_loader.next_batch( + h.train_batch_tokens, h.grad_accum_steps + ) + log(f"warmup_cu_buckets:{','.join(str(b) for b in warmup_cu_buckets)} iters_each:{warmup_cu_iters}") + def _run_cu_bucket_warmup(): + for bucket_len in warmup_cu_buckets: + boundaries = list(range(0, x.size(1), max(h.train_seq_len, 1))) + if boundaries[-1] != x.size(1): + boundaries.append(x.size(1)) + cu = torch.full((bucket_len,), x.size(1), dtype=torch.int32, device=device) + cu[: len(boundaries)] = torch.tensor(boundaries, dtype=torch.int32, device=device) + for _ in range(warmup_cu_iters): + optimizers.zero_grad_all() + with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + wloss = model(x, y, cu_seqlens=cu, max_seqlen=h.train_seq_len) + (wloss / h.grad_accum_steps).backward() + optimizers.zero_grad_all() + _run_cu_bucket_warmup() + if h.num_loops > 0: + base_model.looping_active = True + _run_cu_bucket_warmup() + base_model.looping_active = False + for warmup_step in range(h.warmup_steps): + step_fn(warmup_step, 1.0) + if ( + warmup_step <= 5 + or (warmup_step + 1) % 10 == 0 + or warmup_step + 1 == h.warmup_steps + ): + log(f"warmup_step: {warmup_step+1}/{h.warmup_steps}") + if h.num_loops > 0: + base_model.looping_active = True + log( + f"loop_warmup:enabled encoder:{base_model.encoder_indices} decoder:{base_model.decoder_indices}" + ) + for warmup_step in range(h.warmup_steps): + step_fn(warmup_step, 1.0) + if ( + warmup_step <= 5 + or (warmup_step + 1) % 10 == 0 + or warmup_step + 1 == h.warmup_steps + ): + log(f"loop_warmup_step: 
{warmup_step+1}/{h.warmup_steps}") + base_model.looping_active = False + base_model.load_state_dict(initial_model_state, strict=True) + for (opt, state) in zip(optimizers, initial_optimizer_states, strict=True): + opt.load_state_dict(state) + optimizers.zero_grad_all() + train_loader = DocumentPackingLoader(h, device) + ema_state = { + name: t.detach().float().clone() + for (name, t) in base_model.state_dict().items() + } + ema_decay = h.ema_decay + training_time_ms = 0.0 + stop_after_step = None + torch.cuda.synchronize() + t0 = time.perf_counter() + step = 0 + while True: + last_step = ( + step == h.iterations + or stop_after_step is not None + and step >= stop_after_step + ) + should_validate = ( + last_step or h.val_loss_every > 0 and step % h.val_loss_every == 0 + ) + if should_validate: + torch.cuda.synchronize() + training_time_ms += 1e3 * (time.perf_counter() - t0) + val_loss, val_bpb = eval_val( + h, device, val_data, model, compiled_forward_logits + ) + log( + f"{step}/{h.iterations} val_loss: {val_loss:.4f} val_bpb: {val_bpb:.4f}" + ) + torch.cuda.synchronize() + t0 = time.perf_counter() + if last_step: + if stop_after_step is not None and step < h.iterations: + log( + f"stopping_early: wallclock_cap train_time: {training_time_ms:.0f}ms step: {step}/{h.iterations}" + ) + break + elapsed_ms = training_time_ms + 1e3 * (time.perf_counter() - t0) + frac = training_frac(step, elapsed_ms) + scale = lr_mul(frac) + if ( + h.num_loops > 0 + and not base_model.looping_active + and frac >= h.enable_looping_at + ): + base_model.looping_active = True + log( + f"layer_loop:enabled step:{step} frac:{frac:.3f} encoder:{base_model.encoder_indices} decoder:{base_model.decoder_indices}" + ) + train_loss = step_fn(step, scale) + with torch.no_grad(): + for (name, t) in base_model.state_dict().items(): + ema_state[name].mul_(ema_decay).add_( + t.detach().float(), alpha=1.0 - ema_decay + ) + step += 1 + approx_training_time_ms = training_time_ms + 1e3 * (time.perf_counter() - t0) + should_log_train = h.train_log_every > 0 and ( + step <= 5 or step % h.train_log_every == 0 or stop_after_step is not None + ) + if should_log_train: + tok_per_sec = step * h.train_batch_tokens / (approx_training_time_ms / 1e3) + log( + f"{step}/{h.iterations} train_loss: {train_loss.item():.4f} train_time: {approx_training_time_ms/60000:.1f}m tok/s: {tok_per_sec:.0f}" + ) + reached_cap = ( + max_wallclock_ms is not None and approx_training_time_ms >= max_wallclock_ms + ) + if h.distributed and max_wallclock_ms is not None: + reached_cap_tensor = torch.tensor(int(reached_cap), device=device) + dist.all_reduce(reached_cap_tensor, op=dist.ReduceOp.MAX) + reached_cap = bool(reached_cap_tensor.item()) + if stop_after_step is None and reached_cap: + stop_after_step = step + log( + f"peak memory allocated: {torch.cuda.max_memory_allocated()//1024//1024} MiB reserved: {torch.cuda.max_memory_reserved()//1024//1024} MiB" + ) + log("ema:applying EMA weights") + current_state = base_model.state_dict() + avg_state = { + name: t.to(dtype=current_state[name].dtype) for (name, t) in ema_state.items() + } + base_model.load_state_dict(avg_state, strict=True) + return base_model, compiled_model, compiled_forward_logits + + +def train_and_eval(h, device): + random.seed(h.seed) + np.random.seed(h.seed) + torch.manual_seed(h.seed) + torch.cuda.manual_seed_all(h.seed) + if h.artifact_dir and h.is_main_process: + os.makedirs(h.artifact_dir, exist_ok=True) + val_data = ValidationData(h, device) + log( + f"train_shards: 
{len(list(Path(h.datasets_dir).resolve().glob('fineweb_train_*.bin')))}" + ) + log(f"val_tokens: {val_data.val_tokens.numel()-1}") + base_model, compiled_model, compiled_forward_logits = train_model( + h, device, val_data + ) + torch._dynamo.reset() + timed_eval( + "diagnostic pre-quantization post-ema", + eval_val, + h, + device, + val_data, + compiled_model, + compiled_forward_logits, + ) + serialize(h, base_model, Path(__file__).read_text(encoding="utf-8")) + if h.distributed: + dist.barrier() + eval_model = deserialize(h, device) + if h.num_loops > 0: + eval_model.looping_active = True + compiled_model = torch.compile(eval_model, dynamic=False, fullgraph=True) + compiled_forward_logits = torch.compile( + eval_model.forward_logits, dynamic=False, fullgraph=True + ) + timed_eval( + "diagnostic quantized", + eval_val, + h, + device, + val_data, + compiled_model, + compiled_forward_logits, + ) + if h.sliding_window_enabled: + timed_eval( + "diagnostic quantized_sliding_window", + eval_val_sliding, + h, + device, + val_data, + eval_model, + forward_logits_fn=compiled_forward_logits, + ) + if h.ttt_enabled: + del eval_model, compiled_model + torch._dynamo.reset() + torch.cuda.empty_cache() + ttt_model = deserialize(h, device) + if h.num_loops > 0: + ttt_model.looping_active = True + for p in ttt_model.parameters(): + p.requires_grad_(False) + + if h.rope_yarn: + _yarn_seqlen = h.train_batch_tokens // h.grad_accum_steps + for block in ttt_model.blocks: + block.attn.rotary(_yarn_seqlen, device, torch.bfloat16) + else: + for block in ttt_model.blocks: + block.attn.rotary._cos_cached = None + block.attn.rotary._sin_cached = None + block.attn.rotary._seq_len_cached = 0 + block.attn.rotary(h.ttt_eval_seq_len, device, torch.bfloat16) + + def _fwd_ttt_inner(input_ids, target_ids, lora): + return ttt_model.forward_ttt(input_ids, target_ids, lora=lora) + + _fwd_ttt_compiled_inner = None + + def _fwd_ttt(input_ids, target_ids, lora): + nonlocal _fwd_ttt_compiled_inner + if _fwd_ttt_compiled_inner is None: + _fwd_ttt_compiled_inner = torch.compile(_fwd_ttt_inner, dynamic=True) + return _fwd_ttt_compiled_inner(input_ids, target_ids, lora=lora) + + fwd_ttt_compiled = _fwd_ttt + log(f"ttt_lora:warming up compile (random tokens, no val data)") + global BOS_ID + if BOS_ID is None: + BOS_ID = 1 + t_warmup = time.perf_counter() + warmup_bszes = [h.ttt_batch_size] + for bsz in warmup_bszes: + wl = BatchedTTTLoRA( + bsz, ttt_model, h.ttt_lora_rank, + k_lora=h.ttt_k_lora, mlp_lora=h.ttt_mlp_lora, o_lora=h.ttt_o_lora, + ).to(device) + wo = torch.optim.AdamW( + wl.parameters(), + lr=h.ttt_lora_lr, + betas=(h.ttt_beta1, h.ttt_beta2), + eps=1e-10, + weight_decay=h.ttt_weight_decay, + fused=True, + ) + for ctx_len in (h.ttt_chunk_size, h.ttt_eval_seq_len): + xw = torch.randint(0, h.vocab_size, (bsz, ctx_len), device=device, dtype=torch.int64) + yw = torch.randint(0, h.vocab_size, (bsz, ctx_len), device=device, dtype=torch.int64) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + ptl = fwd_ttt_compiled(xw, yw, lora=wl) + ptl[:, : min(h.ttt_chunk_size, ctx_len)].mean(dim=-1).sum().backward() + wo.step() + wo.zero_grad(set_to_none=True) + del wl, wo + torch.cuda.empty_cache() + compile_elapsed = time.perf_counter() - t_warmup + log(f"ttt_lora:compile warmup done ({compile_elapsed:.1f}s)") + log("\nbeginning TTT eval timer") + torch.cuda.synchronize() + t_ttt = time.perf_counter() + ttt_val_loss, ttt_val_bpb = eval_val_ttt_phased( + h, ttt_model, device, val_data, forward_ttt_train=fwd_ttt_compiled + ) + 
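# Drain outstanding kernels before reading the TTT eval timer. +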
torch.cuda.synchronize() + ttt_eval_elapsed = time.perf_counter() - t_ttt + log( + "quantized_ttt_phased " + f"val_loss:{ttt_val_loss:.8f} val_bpb:{ttt_val_bpb:.8f} " + f"eval_time:{1e3*ttt_eval_elapsed:.0f}ms" + ) + log(f"total_eval_time:{ttt_eval_elapsed:.1f}s") + del ttt_model + + +def main(): + world_size = int(os.environ.get("WORLD_SIZE", "1")) + local_rank = int(os.environ.get("LOCAL_RANK", "0")) + distributed = "RANK" in os.environ and "WORLD_SIZE" in os.environ + if not torch.cuda.is_available(): + raise RuntimeError("CUDA is required") + if world_size <= 0: + raise ValueError(f"WORLD_SIZE must be positive, got {world_size}") + if 8 % world_size != 0: + raise ValueError( + f"WORLD_SIZE={world_size} must divide 8 so grad_accum_steps stays integral" + ) + device = torch.device("cuda", local_rank) + torch.cuda.set_device(device) + if distributed: + dist.init_process_group(backend="nccl", device_id=device) + dist.barrier() + torch.backends.cuda.matmul.allow_tf32 = True + torch.backends.cudnn.allow_tf32 = True + torch.set_float32_matmul_precision("high") + from torch.backends.cuda import ( + enable_cudnn_sdp, + enable_flash_sdp, + enable_math_sdp, + enable_mem_efficient_sdp, + ) + + enable_cudnn_sdp(False) + enable_flash_sdp(True) + enable_mem_efficient_sdp(False) + enable_math_sdp(False) + torch._dynamo.config.optimize_ddp = False + torch._dynamo.config.cache_size_limit = 16 + h = Hyperparameters() + set_logging_hparams(h) + if h.is_main_process: + os.makedirs(h.artifact_dir if h.artifact_dir else "logs", exist_ok=True) + log(100 * "=", console=False) + log("Hyperparameters:", console=True) + for (k, v) in sorted(vars(type(h)).items()): + if not k.startswith("_"): + log(f" {k}: {v}", console=True) + log("=" * 100, console=False) + log("Source code:", console=False) + log("=" * 100, console=False) + with open(__file__, "r", encoding="utf-8") as _src: + log(_src.read(), console=False) + log("=" * 100, console=False) + log(f"Running Python {sys.version}", console=False) + log(f"Running PyTorch {torch.__version__}", console=False) + log("=" * 100, console=False) + train_and_eval(h, device) + if distributed: + dist.destroy_process_group() + + +if __name__ == "__main__": + main() diff --git a/records/track_10min_16mb/2026-04-17_CasefoldV4_AttnOutGate_PhasedTTT/train_seed0.log b/records/track_10min_16mb/2026-04-17_CasefoldV4_AttnOutGate_PhasedTTT/train_seed0.log new file mode 100644 index 0000000000..a0d3e9e7ba --- /dev/null +++ b/records/track_10min_16mb/2026-04-17_CasefoldV4_AttnOutGate_PhasedTTT/train_seed0.log @@ -0,0 +1,697 @@ + +***************************************** +Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
+***************************************** +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + artifact_dir: + attn_clip_sigmas: 13.0 + attn_out_gate: True + beta1: 0.9 + beta2: 0.95 + compressor: brotli + data_dir: /tmp/casefold_data/ + datasets_dir: /tmp/casefold_data/datasets/fineweb10B_sp8192 + distributed: True + ema_decay: 0.9965 + embed_bits: 7 + embed_clip_sigmas: 15.0 + embed_lr: 0.6 + embed_wd: 0.085 + enable_looping_at: 0.35 + eval_seq_len: 2048 + eval_stride: 64 + global_ttt_batch_seqs: 32 + global_ttt_chunk_tokens: 32768 + global_ttt_epochs: 1 + global_ttt_grad_clip: 1.0 + global_ttt_lr: 0.001 + global_ttt_momentum: 0.9 + global_ttt_respect_doc_boundaries: True + global_ttt_warmup_chunks: 0 + global_ttt_warmup_start_lr: 0.0 + gptq_calibration_batches: 16 + gptq_reserve_seconds: 4.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/PR1530_casefold_gates_s0.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.026 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_clip_sigmas: 12.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_momentum: 0.97 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.095 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_final_lane: mean + parallel_start_layer: 8 + phased_ttt_enabled: True + phased_ttt_num_phases: 3 + phased_ttt_prefix_docs: 2000 + qk_gain_init: 5.0 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + rope_yarn: False + run_id: PR1530_casefold_gates_s0 + scalar_lr: 0.02 + seed: 0 + skip_gates_enabled: True + sliding_window_enabled: False + smear_gate: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: /tmp/casefold_data/tokenizers/fineweb_8192_bpe.model + train_batch_tokens: 786432 + train_files: /tmp/casefold_data/datasets/fineweb10B_sp8192/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + ttt_batch_size: 64 + ttt_beta1: 0.0 + ttt_beta2: 0.999 + ttt_chunk_size: 48 + ttt_enabled: True + ttt_eval_batches: + ttt_eval_seq_len: 2048 + ttt_grad_steps: 1 + ttt_k_lora: True + ttt_lora_lr: 0.0001 + ttt_lora_rank: 96 + ttt_mlp_lora: True + ttt_o_lora: True + ttt_optimizer: adam + ttt_weight_decay: 0.5 + val_batch_tokens: 524288 + val_doc_fraction: 1.0 + val_files: /tmp/casefold_data/datasets/fineweb10B_sp8192/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.75 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +train_shards: 72 +val_tokens: 36335616 +model_params:35945671 +gptq:reserving 4s, effective=596000ms +warmup_cu_buckets:64,128,192,256 iters_each:3 +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +0/20000 val_loss: 9.0172 val_bpb: 3.1287 +1/20000 train_loss: 9.0187 train_time: 0.0m tok/s: 12186989 +2/20000 train_loss: 12.3783 train_time: 0.0m tok/s: 11548903 +3/20000 train_loss: 11.1996 train_time: 0.0m tok/s: 10294484 
+4/20000 train_loss: 9.8573 train_time: 0.0m tok/s: 9783746 +5/20000 train_loss: 8.7009 train_time: 0.0m tok/s: 9502717 +500/20000 train_loss: 3.5583 train_time: 0.8m tok/s: 8215045 +1000/20000 train_loss: 3.4986 train_time: 1.6m tok/s: 8182454 +1500/20000 train_loss: 3.4106 train_time: 2.4m tok/s: 8180887 +2000/20000 train_loss: 3.3171 train_time: 3.2m tok/s: 8180713 +layer_loop:enabled step:2168 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +2500/20000 train_loss: 3.3800 train_time: 4.3m tok/s: 7627532 +3000/20000 train_loss: 3.1245 train_time: 5.5m tok/s: 7189899 +3500/20000 train_loss: 3.2409 train_time: 6.7m tok/s: 6888319 +4000/20000 train_loss: 3.2083 train_time: 7.9m tok/s: 6677196 +4000/20000 val_loss: 3.1608 val_bpb: 1.0967 +4500/20000 train_loss: 3.0350 train_time: 9.0m tok/s: 6536898 +4883/20000 val_loss: 3.0465 val_bpb: 1.0570 +stopping_early: wallclock_cap train_time: 596138ms step: 4883/20000 +peak memory allocated: 40372 MiB reserved: 44438 MiB +ema:applying EMA weights +diagnostic pre-quantization post-ema val_loss:3.04499851 val_bpb:1.05652454 eval_time:10116ms +Serialized model: 135417533 bytes +Code size (uncompressed): 124826 bytes +Code size (compressed): 28060 bytes +GPTQ:collecting Hessians from calibration data... +GPTQ:collected 67 Hessians in 3.4s +Quantized weights: + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int7): tok_emb.weight + passthrough (float16): blocks.attn.attn_out_gate_w, blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, parallel_post_lambdas, parallel_resid_lambdas, skip_gates, skip_weights, smear_lambda, smear_w +Serialized model quantized+brotli: 15909454 bytes +Total submission size quantized+brotli: 15937514 bytes +diagnostic quantized val_loss:3.07443402 val_bpb:1.06673779 eval_time:11586ms +ttt_lora:warming up compile (random tokens, no val data) +ttt_lora:compile warmup done (94.4s) + +beginning TTT eval timer +ttt_phased: total_docs:50000 prefix_docs:2000 suffix_docs:48000 num_phases:3 boundaries:[666, 1333, 2000] +ttp: b781/782 bl:2.8635 bb:1.0585 rl:2.8635 rb:1.0585 dl:13074-23027 gd:0 +ttpp: phase:1/3 pd:1104 gd:666 t:169.7s +tttg: c1/85 lr:0.001000 t:0.4s +tttg: c2/85 lr:0.001000 t:0.4s +tttg: c3/85 lr:0.000999 t:0.5s +tttg: c4/85 lr:0.000997 t:0.6s +tttg: c5/85 lr:0.000994 t:0.7s +tttg: c6/85 lr:0.000991 t:0.8s +tttg: c7/85 lr:0.000987 t:0.8s +tttg: c8/85 lr:0.000983 t:0.9s +tttg: c9/85 lr:0.000978 t:1.0s +tttg: c10/85 lr:0.000972 t:1.1s +tttg: c11/85 lr:0.000965 t:1.2s +tttg: c12/85 lr:0.000958 t:1.2s +tttg: c13/85 lr:0.000950 t:1.3s +tttg: c14/85 lr:0.000942 t:1.4s +tttg: c15/85 lr:0.000933 t:1.5s +tttg: c16/85 lr:0.000923 t:1.6s +tttg: c17/85 lr:0.000913 t:1.6s +tttg: c18/85 lr:0.000902 t:1.7s +tttg: c19/85 lr:0.000891 t:1.8s +tttg: c20/85 lr:0.000879 t:1.9s +tttg: c21/85 lr:0.000867 t:2.0s +tttg: c22/85 lr:0.000854 t:2.1s +tttg: c23/85 lr:0.000840 t:2.1s +tttg: c24/85 lr:0.000826 t:2.2s +tttg: c25/85 lr:0.000812 t:2.3s +tttg: c26/85 lr:0.000797 t:2.4s +tttg: c27/85 lr:0.000782 t:2.4s +tttg: c28/85 lr:0.000766 t:2.5s +tttg: c29/85 lr:0.000750 t:2.6s +tttg: c30/85 lr:0.000734 t:2.7s +tttg: c31/85 lr:0.000717 t:2.8s +tttg: c32/85 lr:0.000700 t:2.8s +tttg: c33/85 lr:0.000683 t:2.9s +tttg: c34/85 lr:0.000665 t:3.0s +tttg: c35/85 lr:0.000647 t:3.1s +tttg: c36/85 lr:0.000629 t:3.2s +tttg: c37/85 lr:0.000611 t:3.2s +tttg: c38/85 lr:0.000593 t:3.3s +tttg: c39/85 
lr:0.000575 t:3.4s +tttg: c40/85 lr:0.000556 t:3.5s +tttg: c41/85 lr:0.000537 t:3.6s +tttg: c42/85 lr:0.000519 t:3.6s +tttg: c43/85 lr:0.000500 t:3.7s +tttg: c44/85 lr:0.000481 t:3.8s +tttg: c45/85 lr:0.000463 t:3.9s +tttg: c46/85 lr:0.000444 t:4.0s +tttg: c47/85 lr:0.000425 t:4.0s +tttg: c48/85 lr:0.000407 t:4.1s +tttg: c49/85 lr:0.000389 t:4.2s +tttg: c50/85 lr:0.000371 t:4.3s +tttg: c51/85 lr:0.000353 t:4.4s +tttg: c52/85 lr:0.000335 t:4.4s +tttg: c53/85 lr:0.000317 t:4.5s +tttg: c54/85 lr:0.000300 t:4.6s +tttg: c55/85 lr:0.000283 t:4.7s +tttg: c56/85 lr:0.000266 t:4.8s +tttg: c57/85 lr:0.000250 t:4.8s +tttg: c58/85 lr:0.000234 t:4.9s +tttg: c59/85 lr:0.000218 t:5.0s +tttg: c60/85 lr:0.000203 t:5.1s +tttg: c61/85 lr:0.000188 t:5.2s +tttg: c62/85 lr:0.000174 t:5.3s +tttg: c63/85 lr:0.000160 t:5.3s +tttg: c64/85 lr:0.000146 t:5.4s +tttg: c65/85 lr:0.000133 t:5.5s +tttg: c66/85 lr:0.000121 t:5.6s +tttg: c67/85 lr:0.000109 t:5.7s +tttg: c68/85 lr:0.000098 t:5.7s +tttg: c69/85 lr:0.000087 t:5.8s +tttg: c70/85 lr:0.000077 t:5.9s +tttg: c71/85 lr:0.000067 t:6.0s +tttg: c72/85 lr:0.000058 t:6.1s +tttg: c73/85 lr:0.000050 t:6.1s +tttg: c74/85 lr:0.000042 t:6.2s +tttg: c75/85 lr:0.000035 t:6.3s +tttg: c76/85 lr:0.000028 t:6.4s +tttg: c77/85 lr:0.000022 t:6.5s +tttg: c78/85 lr:0.000017 t:6.5s +tttg: c79/85 lr:0.000013 t:6.6s +tttg: c80/85 lr:0.000009 t:6.7s +tttg: c81/85 lr:0.000006 t:6.8s +tttg: c82/85 lr:0.000003 t:6.9s +tttg: c83/85 lr:0.000001 t:7.0s +tttg: c84/85 lr:0.000000 t:7.0s +ttpr: phase:1/3 t:178.7s +ttp: b759/782 bl:3.1063 bb:1.0765 rl:2.8993 rb:1.0613 dl:2858-2941 gd:0 +ttpp: phase:2/3 pd:1808 gd:1333 t:236.2s +tttg: c1/142 lr:0.001000 t:0.1s +tttg: c2/142 lr:0.001000 t:0.2s +tttg: c3/142 lr:0.001000 t:0.2s +tttg: c4/142 lr:0.000999 t:0.3s +tttg: c5/142 lr:0.000998 t:0.4s +tttg: c6/142 lr:0.000997 t:0.5s +tttg: c7/142 lr:0.000996 t:0.6s +tttg: c8/142 lr:0.000994 t:0.6s +tttg: c9/142 lr:0.000992 t:0.7s +tttg: c10/142 lr:0.000990 t:0.8s +tttg: c11/142 lr:0.000988 t:0.9s +tttg: c12/142 lr:0.000985 t:1.0s +tttg: c13/142 lr:0.000982 t:1.0s +tttg: c14/142 lr:0.000979 t:1.1s +tttg: c15/142 lr:0.000976 t:1.2s +tttg: c16/142 lr:0.000972 t:1.3s +tttg: c17/142 lr:0.000969 t:1.3s +tttg: c18/142 lr:0.000965 t:1.4s +tttg: c19/142 lr:0.000960 t:1.5s +tttg: c20/142 lr:0.000956 t:1.6s +tttg: c21/142 lr:0.000951 t:1.7s +tttg: c22/142 lr:0.000946 t:1.7s +tttg: c23/142 lr:0.000941 t:1.8s +tttg: c24/142 lr:0.000936 t:1.9s +tttg: c25/142 lr:0.000930 t:2.0s +tttg: c26/142 lr:0.000924 t:2.1s +tttg: c27/142 lr:0.000918 t:2.1s +tttg: c28/142 lr:0.000912 t:2.2s +tttg: c29/142 lr:0.000906 t:2.3s +tttg: c30/142 lr:0.000899 t:2.4s +tttg: c31/142 lr:0.000892 t:2.4s +tttg: c32/142 lr:0.000885 t:2.5s +tttg: c33/142 lr:0.000878 t:2.6s +tttg: c34/142 lr:0.000871 t:2.7s +tttg: c35/142 lr:0.000863 t:2.8s +tttg: c36/142 lr:0.000856 t:2.8s +tttg: c37/142 lr:0.000848 t:2.9s +tttg: c38/142 lr:0.000840 t:3.0s +tttg: c39/142 lr:0.000831 t:3.1s +tttg: c40/142 lr:0.000823 t:3.2s +tttg: c41/142 lr:0.000814 t:3.2s +tttg: c42/142 lr:0.000805 t:3.3s +tttg: c43/142 lr:0.000797 t:3.4s +tttg: c44/142 lr:0.000788 t:3.5s +tttg: c45/142 lr:0.000778 t:3.6s +tttg: c46/142 lr:0.000769 t:3.6s +tttg: c47/142 lr:0.000760 t:3.7s +tttg: c48/142 lr:0.000750 t:3.8s +tttg: c49/142 lr:0.000740 t:3.9s +tttg: c50/142 lr:0.000730 t:3.9s +tttg: c51/142 lr:0.000721 t:4.0s +tttg: c52/142 lr:0.000710 t:4.1s +tttg: c53/142 lr:0.000700 t:4.2s +tttg: c54/142 lr:0.000690 t:4.3s +tttg: c55/142 lr:0.000680 t:4.3s +tttg: c56/142 lr:0.000669 t:4.4s +tttg: c57/142 
lr:0.000659 t:4.5s +tttg: c58/142 lr:0.000648 t:4.6s +tttg: c59/142 lr:0.000637 t:4.7s +tttg: c60/142 lr:0.000627 t:4.7s +tttg: c61/142 lr:0.000616 t:4.8s +tttg: c62/142 lr:0.000605 t:4.9s +tttg: c63/142 lr:0.000594 t:5.0s +tttg: c64/142 lr:0.000583 t:5.1s +tttg: c65/142 lr:0.000572 t:5.1s +tttg: c66/142 lr:0.000561 t:5.2s +tttg: c67/142 lr:0.000550 t:5.3s +tttg: c68/142 lr:0.000539 t:5.4s +tttg: c69/142 lr:0.000528 t:5.5s +tttg: c70/142 lr:0.000517 t:5.5s +tttg: c71/142 lr:0.000506 t:5.6s +tttg: c72/142 lr:0.000494 t:5.7s +tttg: c73/142 lr:0.000483 t:5.8s +tttg: c74/142 lr:0.000472 t:5.8s +tttg: c75/142 lr:0.000461 t:5.9s +tttg: c76/142 lr:0.000450 t:6.0s +tttg: c77/142 lr:0.000439 t:6.1s +tttg: c78/142 lr:0.000428 t:6.2s +tttg: c79/142 lr:0.000417 t:6.2s +tttg: c80/142 lr:0.000406 t:6.3s +tttg: c81/142 lr:0.000395 t:6.4s +tttg: c82/142 lr:0.000384 t:6.5s +tttg: c83/142 lr:0.000373 t:6.6s +tttg: c84/142 lr:0.000363 t:6.6s +tttg: c85/142 lr:0.000352 t:6.7s +tttg: c86/142 lr:0.000341 t:6.8s +tttg: c87/142 lr:0.000331 t:6.9s +tttg: c88/142 lr:0.000320 t:7.0s +tttg: c89/142 lr:0.000310 t:7.0s +tttg: c90/142 lr:0.000300 t:7.1s +tttg: c91/142 lr:0.000290 t:7.2s +tttg: c92/142 lr:0.000279 t:7.3s +tttg: c93/142 lr:0.000270 t:7.4s +tttg: c94/142 lr:0.000260 t:7.4s +tttg: c95/142 lr:0.000250 t:7.5s +tttg: c96/142 lr:0.000240 t:7.6s +tttg: c97/142 lr:0.000231 t:7.7s +tttg: c98/142 lr:0.000222 t:7.7s +tttg: c99/142 lr:0.000212 t:7.8s +tttg: c100/142 lr:0.000203 t:7.9s +tttg: c101/142 lr:0.000195 t:8.0s +tttg: c102/142 lr:0.000186 t:8.1s +tttg: c103/142 lr:0.000177 t:8.1s +tttg: c104/142 lr:0.000169 t:8.2s +tttg: c105/142 lr:0.000160 t:8.3s +tttg: c106/142 lr:0.000152 t:8.4s +tttg: c107/142 lr:0.000144 t:8.5s +tttg: c108/142 lr:0.000137 t:8.5s +tttg: c109/142 lr:0.000129 t:8.6s +tttg: c110/142 lr:0.000122 t:8.7s +tttg: c111/142 lr:0.000115 t:8.8s +tttg: c112/142 lr:0.000108 t:8.9s +tttg: c113/142 lr:0.000101 t:8.9s +tttg: c114/142 lr:0.000094 t:9.0s +tttg: c115/142 lr:0.000088 t:9.1s +tttg: c116/142 lr:0.000082 t:9.2s +tttg: c117/142 lr:0.000076 t:9.3s +tttg: c118/142 lr:0.000070 t:9.3s +tttg: c119/142 lr:0.000064 t:9.4s +tttg: c120/142 lr:0.000059 t:9.5s +tttg: c121/142 lr:0.000054 t:9.6s +tttg: c122/142 lr:0.000049 t:9.7s +tttg: c123/142 lr:0.000044 t:9.7s +tttg: c124/142 lr:0.000040 t:9.8s +tttg: c125/142 lr:0.000035 t:9.9s +tttg: c126/142 lr:0.000031 t:10.0s +tttg: c127/142 lr:0.000028 t:10.1s +tttg: c128/142 lr:0.000024 t:10.2s +tttg: c129/142 lr:0.000021 t:10.2s +tttg: c130/142 lr:0.000018 t:10.3s +tttg: c131/142 lr:0.000015 t:10.4s +tttg: c132/142 lr:0.000012 t:10.5s +tttg: c133/142 lr:0.000010 t:10.6s +tttg: c134/142 lr:0.000008 t:10.6s +tttg: c135/142 lr:0.000006 t:10.7s +tttg: c136/142 lr:0.000004 t:10.8s +tttg: c137/142 lr:0.000003 t:10.9s +tttg: c138/142 lr:0.000002 t:11.0s +tttg: c139/142 lr:0.000001 t:11.0s +tttg: c140/142 lr:0.000000 t:11.1s +tttg: c141/142 lr:0.000000 t:11.2s +ttpr: phase:2/3 t:249.4s +ttp: b749/782 bl:3.0179 bb:1.0756 rl:2.9120 rb:1.0629 dl:2339-2378 gd:0 +ttpp: phase:3/3 pd:2448 gd:2000 t:260.5s +tttg: c1/192 lr:0.001000 t:0.1s +tttg: c2/192 lr:0.001000 t:0.2s +tttg: c3/192 lr:0.001000 t:0.2s +tttg: c4/192 lr:0.000999 t:0.3s +tttg: c5/192 lr:0.000999 t:0.4s +tttg: c6/192 lr:0.000998 t:0.5s +tttg: c7/192 lr:0.000998 t:0.6s +tttg: c8/192 lr:0.000997 t:0.6s +tttg: c9/192 lr:0.000996 t:0.7s +tttg: c10/192 lr:0.000995 t:0.8s +tttg: c11/192 lr:0.000993 t:0.9s +tttg: c12/192 lr:0.000992 t:1.0s +tttg: c13/192 lr:0.000990 t:1.1s +tttg: c14/192 lr:0.000989 t:1.1s +tttg: 
c15/192 lr:0.000987 t:1.2s +tttg: c16/192 lr:0.000985 t:1.3s +tttg: c17/192 lr:0.000983 t:1.4s +tttg: c18/192 lr:0.000981 t:1.5s +tttg: c19/192 lr:0.000978 t:1.5s +tttg: c20/192 lr:0.000976 t:1.6s +tttg: c21/192 lr:0.000973 t:1.7s +tttg: c22/192 lr:0.000970 t:1.8s +tttg: c23/192 lr:0.000968 t:1.9s +tttg: c24/192 lr:0.000965 t:2.0s +tttg: c25/192 lr:0.000962 t:2.0s +tttg: c26/192 lr:0.000958 t:2.1s +tttg: c27/192 lr:0.000955 t:2.2s +tttg: c28/192 lr:0.000951 t:2.3s +tttg: c29/192 lr:0.000948 t:2.4s +tttg: c30/192 lr:0.000944 t:2.5s +tttg: c31/192 lr:0.000940 t:2.6s +tttg: c32/192 lr:0.000936 t:2.6s +tttg: c33/192 lr:0.000932 t:2.7s +tttg: c34/192 lr:0.000928 t:2.8s +tttg: c35/192 lr:0.000924 t:2.9s +tttg: c36/192 lr:0.000919 t:3.0s +tttg: c37/192 lr:0.000915 t:3.1s +tttg: c38/192 lr:0.000910 t:3.1s +tttg: c39/192 lr:0.000905 t:3.2s +tttg: c40/192 lr:0.000901 t:3.3s +tttg: c41/192 lr:0.000896 t:3.4s +tttg: c42/192 lr:0.000891 t:3.5s +tttg: c43/192 lr:0.000885 t:3.5s +tttg: c44/192 lr:0.000880 t:3.6s +tttg: c45/192 lr:0.000875 t:3.7s +tttg: c46/192 lr:0.000869 t:3.8s +tttg: c47/192 lr:0.000864 t:3.9s +tttg: c48/192 lr:0.000858 t:3.9s +tttg: c49/192 lr:0.000852 t:4.0s +tttg: c50/192 lr:0.000846 t:4.1s +tttg: c51/192 lr:0.000840 t:4.2s +tttg: c52/192 lr:0.000834 t:4.3s +tttg: c53/192 lr:0.000828 t:4.4s +tttg: c54/192 lr:0.000822 t:4.4s +tttg: c55/192 lr:0.000815 t:4.5s +tttg: c56/192 lr:0.000809 t:4.6s +tttg: c57/192 lr:0.000802 t:4.7s +tttg: c58/192 lr:0.000796 t:4.8s +tttg: c59/192 lr:0.000789 t:4.9s +tttg: c60/192 lr:0.000782 t:4.9s +tttg: c61/192 lr:0.000776 t:5.0s +tttg: c62/192 lr:0.000769 t:5.1s +tttg: c63/192 lr:0.000762 t:5.2s +tttg: c64/192 lr:0.000755 t:5.3s +tttg: c65/192 lr:0.000748 t:5.4s +tttg: c66/192 lr:0.000740 t:5.4s +tttg: c67/192 lr:0.000733 t:5.5s +tttg: c68/192 lr:0.000726 t:5.6s +tttg: c69/192 lr:0.000719 t:5.7s +tttg: c70/192 lr:0.000711 t:5.8s +tttg: c71/192 lr:0.000704 t:5.8s +tttg: c72/192 lr:0.000696 t:5.9s +tttg: c73/192 lr:0.000688 t:6.0s +tttg: c74/192 lr:0.000681 t:6.1s +tttg: c75/192 lr:0.000673 t:6.2s +tttg: c76/192 lr:0.000665 t:6.3s +tttg: c77/192 lr:0.000658 t:6.3s +tttg: c78/192 lr:0.000650 t:6.4s +tttg: c79/192 lr:0.000642 t:6.5s +tttg: c80/192 lr:0.000634 t:6.6s +tttg: c81/192 lr:0.000626 t:6.7s +tttg: c82/192 lr:0.000618 t:6.8s +tttg: c83/192 lr:0.000610 t:6.8s +tttg: c84/192 lr:0.000602 t:6.9s +tttg: c85/192 lr:0.000594 t:7.0s +tttg: c86/192 lr:0.000586 t:7.1s +tttg: c87/192 lr:0.000578 t:7.2s +tttg: c88/192 lr:0.000570 t:7.3s +tttg: c89/192 lr:0.000562 t:7.3s +tttg: c90/192 lr:0.000553 t:7.4s +tttg: c91/192 lr:0.000545 t:7.5s +tttg: c92/192 lr:0.000537 t:7.6s +tttg: c93/192 lr:0.000529 t:7.7s +tttg: c94/192 lr:0.000521 t:7.7s +tttg: c95/192 lr:0.000512 t:7.8s +tttg: c96/192 lr:0.000504 t:7.9s +tttg: c97/192 lr:0.000496 t:8.0s +tttg: c98/192 lr:0.000488 t:8.1s +tttg: c99/192 lr:0.000479 t:8.2s +tttg: c100/192 lr:0.000471 t:8.2s +tttg: c101/192 lr:0.000463 t:8.3s +tttg: c102/192 lr:0.000455 t:8.4s +tttg: c103/192 lr:0.000447 t:8.5s +tttg: c104/192 lr:0.000438 t:8.6s +tttg: c105/192 lr:0.000430 t:8.7s +tttg: c106/192 lr:0.000422 t:8.7s +tttg: c107/192 lr:0.000414 t:8.8s +tttg: c108/192 lr:0.000406 t:8.9s +tttg: c109/192 lr:0.000398 t:9.0s +tttg: c110/192 lr:0.000390 t:9.1s +tttg: c111/192 lr:0.000382 t:9.2s +tttg: c112/192 lr:0.000374 t:9.2s +tttg: c113/192 lr:0.000366 t:9.3s +tttg: c114/192 lr:0.000358 t:9.4s +tttg: c115/192 lr:0.000350 t:9.5s +tttg: c116/192 lr:0.000342 t:9.6s +tttg: c117/192 lr:0.000335 t:9.6s +tttg: c118/192 lr:0.000327 t:9.7s +tttg: 
c119/192 lr:0.000319 t:9.8s +tttg: c120/192 lr:0.000312 t:9.9s +tttg: c121/192 lr:0.000304 t:10.0s +tttg: c122/192 lr:0.000296 t:10.1s +tttg: c123/192 lr:0.000289 t:10.1s +tttg: c124/192 lr:0.000281 t:10.2s +tttg: c125/192 lr:0.000274 t:10.3s +tttg: c126/192 lr:0.000267 t:10.4s +tttg: c127/192 lr:0.000260 t:10.5s +tttg: c128/192 lr:0.000252 t:10.6s +tttg: c129/192 lr:0.000245 t:10.6s +tttg: c130/192 lr:0.000238 t:10.7s +tttg: c131/192 lr:0.000231 t:10.8s +tttg: c132/192 lr:0.000224 t:10.9s +tttg: c133/192 lr:0.000218 t:11.0s +tttg: c134/192 lr:0.000211 t:11.1s +tttg: c135/192 lr:0.000204 t:11.1s +tttg: c136/192 lr:0.000198 t:11.2s +tttg: c137/192 lr:0.000191 t:11.3s +tttg: c138/192 lr:0.000185 t:11.4s +tttg: c139/192 lr:0.000178 t:11.5s +tttg: c140/192 lr:0.000172 t:11.5s +tttg: c141/192 lr:0.000166 t:11.6s +tttg: c142/192 lr:0.000160 t:11.7s +tttg: c143/192 lr:0.000154 t:11.8s +tttg: c144/192 lr:0.000148 t:11.9s +tttg: c145/192 lr:0.000142 t:12.0s +tttg: c146/192 lr:0.000136 t:12.1s +tttg: c147/192 lr:0.000131 t:12.1s +tttg: c148/192 lr:0.000125 t:12.2s +tttg: c149/192 lr:0.000120 t:12.3s +tttg: c150/192 lr:0.000115 t:12.4s +tttg: c151/192 lr:0.000109 t:12.5s +tttg: c152/192 lr:0.000104 t:12.5s +tttg: c153/192 lr:0.000099 t:12.6s +tttg: c154/192 lr:0.000095 t:12.7s +tttg: c155/192 lr:0.000090 t:12.8s +tttg: c156/192 lr:0.000085 t:12.9s +tttg: c157/192 lr:0.000081 t:13.0s +tttg: c158/192 lr:0.000076 t:13.0s +tttg: c159/192 lr:0.000072 t:13.1s +tttg: c160/192 lr:0.000068 t:13.2s +tttg: c161/192 lr:0.000064 t:13.3s +tttg: c162/192 lr:0.000060 t:13.4s +tttg: c163/192 lr:0.000056 t:13.5s +tttg: c164/192 lr:0.000052 t:13.5s +tttg: c165/192 lr:0.000049 t:13.6s +tttg: c166/192 lr:0.000045 t:13.7s +tttg: c167/192 lr:0.000042 t:13.8s +tttg: c168/192 lr:0.000038 t:13.9s +tttg: c169/192 lr:0.000035 t:14.0s +tttg: c170/192 lr:0.000032 t:14.0s +tttg: c171/192 lr:0.000030 t:14.1s +tttg: c172/192 lr:0.000027 t:14.2s +tttg: c173/192 lr:0.000024 t:14.3s +tttg: c174/192 lr:0.000022 t:14.4s +tttg: c175/192 lr:0.000019 t:14.5s +tttg: c176/192 lr:0.000017 t:14.5s +tttg: c177/192 lr:0.000015 t:14.6s +tttg: c178/192 lr:0.000013 t:14.7s +tttg: c179/192 lr:0.000011 t:14.8s +tttg: c180/192 lr:0.000010 t:14.9s +tttg: c181/192 lr:0.000008 t:15.0s +tttg: c182/192 lr:0.000007 t:15.0s +tttg: c183/192 lr:0.000005 t:15.1s +tttg: c184/192 lr:0.000004 t:15.2s +tttg: c185/192 lr:0.000003 t:15.3s +tttg: c186/192 lr:0.000002 t:15.4s +tttg: c187/192 lr:0.000002 t:15.4s +tttg: c188/192 lr:0.000001 t:15.5s +tttg: c189/192 lr:0.000001 t:15.6s +tttg: c190/192 lr:0.000000 t:15.7s +tttg: c191/192 lr:0.000000 t:15.8s +ttpr: phase:3/3 t:278.2s +ttp: b739/782 bl:3.0119 bb:1.0179 rl:2.9204 rb:1.0588 dl:2000-2025 gd:1 +ttp: b732/782 bl:3.0393 bb:1.0446 rl:2.9288 rb:1.0578 dl:1835-1855 gd:1 +ttp: b726/782 bl:2.9525 bb:0.9995 rl:2.9303 rb:1.0539 dl:1728-1743 gd:1 +ttp: b713/782 bl:3.0505 bb:1.0424 rl:2.9367 rb:1.0532 dl:1534-1547 gd:1 +ttp: b709/782 bl:2.9911 bb:1.0318 rl:2.9393 rb:1.0521 dl:1483-1495 gd:1 +ttp: b697/782 bl:3.0342 bb:1.0246 rl:2.9434 rb:1.0509 dl:1371-1379 gd:1 +ttp: b694/782 bl:2.9856 bb:1.0128 rl:2.9451 rb:1.0493 dl:1345-1353 gd:1 +ttp: b681/782 bl:3.0936 bb:1.0452 rl:2.9505 rb:1.0491 dl:1248-1254 gd:1 +ttp: b673/782 bl:3.1366 bb:1.0844 rl:2.9567 rb:1.0503 dl:1194-1199 gd:1 +ttp: b670/782 bl:3.0710 bb:1.0404 rl:2.9603 rb:1.0500 dl:1175-1182 gd:1 +ttp: b661/782 bl:3.0140 bb:1.0210 rl:2.9619 rb:1.0491 dl:1126-1131 gd:1 +ttp: b653/782 bl:3.0668 bb:1.0582 rl:2.9648 rb:1.0494 dl:1085-1091 gd:1 +ttp: b645/782 bl:3.0477 
bb:1.0366 rl:2.9670 rb:1.0490 dl:1044-1049 gd:1 +ttp: b639/782 bl:3.0401 bb:1.0267 rl:2.9688 rb:1.0484 dl:1015-1020 gd:1 +ttp: b628/782 bl:3.1039 bb:1.0610 rl:2.9719 rb:1.0487 dl:969-973 gd:1 +ttp: b619/782 bl:3.0585 bb:1.0450 rl:2.9738 rb:1.0487 dl:932-935 gd:1 +ttp: b611/782 bl:3.0586 bb:1.0613 rl:2.9755 rb:1.0489 dl:901-905 gd:1 +ttp: b604/782 bl:3.0265 bb:1.0290 rl:2.9765 rb:1.0485 dl:875-878 gd:1 +ttp: b596/782 bl:3.0095 bb:1.0108 rl:2.9771 rb:1.0478 dl:849-852 gd:1 +ttp: b588/782 bl:3.0528 bb:1.0325 rl:2.9784 rb:1.0475 dl:823-826 gd:1 +ttp: b579/782 bl:3.0295 bb:1.0339 rl:2.9793 rb:1.0473 dl:796-799 gd:1 +ttp: b572/782 bl:3.0760 bb:1.0459 rl:2.9808 rb:1.0473 dl:777-779 gd:1 +ttp: b564/782 bl:2.9808 bb:0.9945 rl:2.9808 rb:1.0464 dl:754-757 gd:1 +ttp: b556/782 bl:3.0821 bb:1.0586 rl:2.9823 rb:1.0466 dl:732-735 gd:1 +ttp: b549/782 bl:3.1089 bb:1.0399 rl:2.9841 rb:1.0465 dl:715-718 gd:1 +ttp: b543/782 bl:3.0519 bb:1.0248 rl:2.9851 rb:1.0462 dl:698-701 gd:1 +ttp: b536/782 bl:3.0834 bb:1.0563 rl:2.9864 rb:1.0463 dl:684-686 gd:1 +ttp: b528/782 bl:3.0277 bb:1.0252 rl:2.9869 rb:1.0460 dl:666-668 gd:1 +ttp: b520/782 bl:3.0745 bb:1.0400 rl:2.9880 rb:1.0460 dl:648-650 gd:1 +ttp: b512/782 bl:2.9769 bb:1.0200 rl:2.9878 rb:1.0456 dl:630-633 gd:1 +ttp: b504/782 bl:3.1026 bb:1.0765 rl:2.9891 rb:1.0460 dl:614-616 gd:1 +ttp: b496/782 bl:3.1153 bb:1.0694 rl:2.9905 rb:1.0463 dl:598-599 gd:1 +ttp: b488/782 bl:3.1173 bb:1.0693 rl:2.9918 rb:1.0465 dl:582-584 gd:1 +ttp: b480/782 bl:3.1786 bb:1.0801 rl:2.9937 rb:1.0468 dl:566-569 gd:1 +ttp: b472/782 bl:3.0599 bb:1.0565 rl:2.9944 rb:1.0469 dl:552-554 gd:1 +ttp: b464/782 bl:3.0231 bb:1.0151 rl:2.9946 rb:1.0466 dl:538-540 gd:1 +ttp: b456/782 bl:3.1174 bb:1.0515 rl:2.9957 rb:1.0467 dl:524-526 gd:1 +ttp: b448/782 bl:2.9942 bb:1.0410 rl:2.9957 rb:1.0466 dl:511-512 gd:1 +ttp: b440/782 bl:3.0270 bb:1.0328 rl:2.9960 rb:1.0465 dl:497-499 gd:1 +ttp: b432/782 bl:3.0002 bb:1.0489 rl:2.9960 rb:1.0465 dl:485-487 gd:1 +ttp: b424/782 bl:3.0122 bb:1.0549 rl:2.9962 rb:1.0466 dl:473-474 gd:1 +ttp: b416/782 bl:3.0562 bb:1.0607 rl:2.9966 rb:1.0467 dl:460-462 gd:1 +ttp: b408/782 bl:3.1749 bb:1.0865 rl:2.9979 rb:1.0470 dl:448-449 gd:1 +ttp: b400/782 bl:3.0718 bb:1.0672 rl:2.9985 rb:1.0471 dl:436-438 gd:1 +ttp: b392/782 bl:3.1489 bb:1.0746 rl:2.9995 rb:1.0473 dl:426-427 gd:1 +ttp: b384/782 bl:2.9833 bb:1.0420 rl:2.9994 rb:1.0473 dl:415-416 gd:1 +ttp: b376/782 bl:3.1068 bb:1.0620 rl:3.0001 rb:1.0474 dl:404-405 gd:1 +ttp: b368/782 bl:3.0531 bb:1.0438 rl:3.0004 rb:1.0474 dl:393-395 gd:1 +ttp: b361/782 bl:3.0046 bb:1.0562 rl:3.0005 rb:1.0474 dl:385-386 gd:1 +ttp: b351/782 bl:3.0890 bb:1.0457 rl:3.0010 rb:1.0474 dl:373-374 gd:1 +ttp: b343/782 bl:3.1392 bb:1.0753 rl:3.0018 rb:1.0476 dl:363-364 gd:1 +ttp: b335/782 bl:3.0680 bb:1.0771 rl:3.0021 rb:1.0477 dl:353-355 gd:1 +ttp: b327/782 bl:3.1849 bb:1.1107 rl:3.0031 rb:1.0481 dl:344-345 gd:1 +ttp: b319/782 bl:3.0689 bb:1.0754 rl:3.0034 rb:1.0482 dl:335-336 gd:1 +ttp: b312/782 bl:3.0637 bb:1.0698 rl:3.0037 rb:1.0483 dl:327-328 gd:1 +ttp: b304/782 bl:3.1178 bb:1.0699 rl:3.0043 rb:1.0484 dl:317-318 gd:1 +ttp: b296/782 bl:3.1649 bb:1.0883 rl:3.0050 rb:1.0486 dl:309-310 gd:1 +ttp: b288/782 bl:3.1460 bb:1.0870 rl:3.0057 rb:1.0488 dl:300-301 gd:1 +ttp: b280/782 bl:3.1234 bb:1.1096 rl:3.0062 rb:1.0491 dl:292-293 gd:1 +ttp: b272/782 bl:3.1262 bb:1.0699 rl:3.0067 rb:1.0492 dl:284-285 gd:1 +ttp: b264/782 bl:3.0803 bb:1.0750 rl:3.0070 rb:1.0493 dl:275-276 gd:1 +ttp: b256/782 bl:3.1314 bb:1.0880 rl:3.0075 rb:1.0494 dl:268-269 gd:1 +ttp: b249/782 bl:3.1149 
bb:1.0897 rl:3.0079 rb:1.0496 dl:261-262 gd:1 +ttp: b242/782 bl:3.1145 bb:1.0788 rl:3.0083 rb:1.0497 dl:255-256 gd:1 +ttp: b234/782 bl:3.2366 bb:1.1443 rl:3.0092 rb:1.0500 dl:248-249 gd:1 +ttp: b227/782 bl:3.0997 bb:1.0557 rl:3.0095 rb:1.0500 dl:242-242 gd:1 +ttp: b215/782 bl:3.0797 bb:1.0697 rl:3.0097 rb:1.0501 dl:231-232 gd:1 +ttp: b206/782 bl:3.0945 bb:1.0998 rl:3.0100 rb:1.0503 dl:224-225 gd:1 +ttp: b200/782 bl:3.1316 bb:1.0984 rl:3.0104 rb:1.0504 dl:219-220 gd:1 +ttp: b193/782 bl:3.1980 bb:1.1218 rl:3.0110 rb:1.0506 dl:213-214 gd:1 +ttp: b186/782 bl:3.1176 bb:1.0899 rl:3.0113 rb:1.0508 dl:208-208 gd:1 +ttp: b176/782 bl:3.1924 bb:1.1146 rl:3.0118 rb:1.0509 dl:200-201 gd:1 +ttp: b169/782 bl:3.1891 bb:1.1195 rl:3.0123 rb:1.0511 dl:194-195 gd:1 +ttp: b161/782 bl:3.2370 bb:1.1418 rl:3.0129 rb:1.0514 dl:188-189 gd:1 +ttp: b152/782 bl:3.3102 bb:1.1563 rl:3.0137 rb:1.0516 dl:182-183 gd:1 +ttp: b146/782 bl:3.2722 bb:1.1379 rl:3.0143 rb:1.0519 dl:178-178 gd:1 +ttp: b138/782 bl:3.2113 bb:1.1119 rl:3.0148 rb:1.0520 dl:172-172 gd:1 +ttp: b130/782 bl:3.1996 bb:1.1215 rl:3.0152 rb:1.0522 dl:166-166 gd:1 +ttp: b120/782 bl:3.3579 bb:1.1713 rl:3.0160 rb:1.0524 dl:159-160 gd:1 +ttp: b113/782 bl:3.1482 bb:1.1278 rl:3.0163 rb:1.0526 dl:154-155 gd:1 +ttp: b106/782 bl:3.2228 bb:1.1166 rl:3.0167 rb:1.0527 dl:150-150 gd:1 +ttp: b96/782 bl:3.2728 bb:1.1733 rl:3.0172 rb:1.0530 dl:143-144 gd:1 +ttp: b92/782 bl:3.2968 bb:1.1517 rl:3.0178 rb:1.0532 dl:140-141 gd:1 +ttp: b84/782 bl:3.4326 bb:1.2264 rl:3.0185 rb:1.0535 dl:135-136 gd:1 +ttp: b76/782 bl:3.3582 bb:1.1689 rl:3.0192 rb:1.0537 dl:129-130 gd:1 +ttp: b70/782 bl:3.3148 bb:1.1615 rl:3.0197 rb:1.0539 dl:125-125 gd:1 +ttp: b61/782 bl:3.3277 bb:1.1505 rl:3.0202 rb:1.0540 dl:118-119 gd:1 +ttp: b54/782 bl:3.3379 bb:1.1933 rl:3.0207 rb:1.0542 dl:114-114 gd:1 +ttp: b45/782 bl:3.3367 bb:1.1777 rl:3.0211 rb:1.0544 dl:107-108 gd:1 +ttp: b36/782 bl:3.4517 bb:1.2496 rl:3.0217 rb:1.0547 dl:101-102 gd:1 +ttp: b30/782 bl:3.3927 bb:1.2055 rl:3.0222 rb:1.0549 dl:97-97 gd:1 +ttp: b25/782 bl:3.5058 bb:1.2663 rl:3.0228 rb:1.0551 dl:93-93 gd:1 +ttp: b16/782 bl:3.4900 bb:1.2089 rl:3.0234 rb:1.0553 dl:84-86 gd:1 +ttp: b9/782 bl:3.5745 bb:1.2473 rl:3.0240 rb:1.0555 dl:77-78 gd:1 +ttp: b1/782 bl:3.7285 bb:1.2761 rl:3.0245 rb:1.0557 dl:40-61 gd:1 +quantized_ttt_phased val_loss:3.04711711 val_bpb:1.05730364 eval_time:347304ms +total_eval_time:347.3s diff --git a/records/track_10min_16mb/2026-04-17_CasefoldV4_AttnOutGate_PhasedTTT/train_seed1234.log b/records/track_10min_16mb/2026-04-17_CasefoldV4_AttnOutGate_PhasedTTT/train_seed1234.log new file mode 100644 index 0000000000..7c863ee377 --- /dev/null +++ b/records/track_10min_16mb/2026-04-17_CasefoldV4_AttnOutGate_PhasedTTT/train_seed1234.log @@ -0,0 +1,698 @@ + +***************************************** +Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
+***************************************** +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + artifact_dir: + attn_clip_sigmas: 13.0 + attn_out_gate: True + beta1: 0.9 + beta2: 0.95 + compressor: brotli + data_dir: /tmp/casefold_data/ + datasets_dir: /tmp/casefold_data/datasets/fineweb10B_sp8192 + distributed: True + ema_decay: 0.9965 + embed_bits: 7 + embed_clip_sigmas: 15.0 + embed_lr: 0.6 + embed_wd: 0.085 + enable_looping_at: 0.35 + eval_seq_len: 2048 + eval_stride: 64 + global_ttt_batch_seqs: 32 + global_ttt_chunk_tokens: 32768 + global_ttt_epochs: 1 + global_ttt_grad_clip: 1.0 + global_ttt_lr: 0.001 + global_ttt_momentum: 0.9 + global_ttt_respect_doc_boundaries: True + global_ttt_warmup_chunks: 0 + global_ttt_warmup_start_lr: 0.0 + gptq_calibration_batches: 16 + gptq_reserve_seconds: 4.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/PR1530_casefold_gates_s1234.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.026 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_clip_sigmas: 12.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_momentum: 0.97 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.095 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_final_lane: mean + parallel_start_layer: 8 + phased_ttt_enabled: True + phased_ttt_num_phases: 3 + phased_ttt_prefix_docs: 2000 + qk_gain_init: 5.0 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + rope_yarn: False + run_id: PR1530_casefold_gates_s1234 + scalar_lr: 0.02 + seed: 1234 + skip_gates_enabled: True + sliding_window_enabled: False + smear_gate: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: /tmp/casefold_data/tokenizers/fineweb_8192_bpe.model + train_batch_tokens: 786432 + train_files: /tmp/casefold_data/datasets/fineweb10B_sp8192/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + ttt_batch_size: 64 + ttt_beta1: 0.0 + ttt_beta2: 0.999 + ttt_chunk_size: 48 + ttt_enabled: True + ttt_eval_batches: + ttt_eval_seq_len: 2048 + ttt_grad_steps: 1 + ttt_k_lora: True + ttt_lora_lr: 0.0001 + ttt_lora_rank: 96 + ttt_mlp_lora: True + ttt_o_lora: True + ttt_optimizer: adam + ttt_weight_decay: 0.5 + val_batch_tokens: 524288 + val_doc_fraction: 1.0 + val_files: /tmp/casefold_data/datasets/fineweb10B_sp8192/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.75 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +train_shards: 72 +val_tokens: 36335616 +model_params:35945671 +gptq:reserving 4s, effective=596000ms +warmup_cu_buckets:64,128,192,256 iters_each:3 +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +0/20000 val_loss: 9.0152 val_bpb: 3.1280 +1/20000 train_loss: 9.0164 train_time: 0.0m tok/s: 12190574 +2/20000 train_loss: 12.2306 train_time: 0.0m tok/s: 11440189 +3/20000 train_loss: 11.0886 train_time: 0.0m tok/s: 
10251851 +4/20000 train_loss: 9.8498 train_time: 0.0m tok/s: 9767778 +5/20000 train_loss: 8.7410 train_time: 0.0m tok/s: 9489646 +500/20000 train_loss: 3.5556 train_time: 0.8m tok/s: 8255778 +1000/20000 train_loss: 3.5009 train_time: 1.6m tok/s: 8239399 +1500/20000 train_loss: 3.4148 train_time: 2.4m tok/s: 8227793 +2000/20000 train_loss: 3.3226 train_time: 3.2m tok/s: 8224853 +layer_loop:enabled step:2180 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +2500/20000 train_loss: 3.3901 train_time: 4.3m tok/s: 7654076 +3000/20000 train_loss: 3.1295 train_time: 5.4m tok/s: 7215135 +3500/20000 train_loss: 3.2427 train_time: 6.6m tok/s: 6912143 +4000/20000 train_loss: 3.2142 train_time: 7.8m tok/s: 6717683 +4000/20000 val_loss: 3.1637 val_bpb: 1.0977 +4500/20000 train_loss: 3.0441 train_time: 9.0m tok/s: 6573558 +4906/20000 val_loss: 3.0482 val_bpb: 1.0576 +stopping_early: wallclock_cap train_time: 596167ms step: 4906/20000 +peak memory allocated: 40372 MiB reserved: 44438 MiB +ema:applying EMA weights +diagnostic pre-quantization post-ema val_loss:3.04656291 val_bpb:1.05706734 eval_time:9624ms +Serialized model: 135417533 bytes +Code size (uncompressed): 124826 bytes +Code size (compressed): 28060 bytes +GPTQ:collecting Hessians from calibration data... +GPTQ:collected 67 Hessians in 3.4s +Quantized weights: + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int7): tok_emb.weight + passthrough (float16): blocks.attn.attn_out_gate_w, blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, parallel_post_lambdas, parallel_resid_lambdas, skip_gates, skip_weights, smear_lambda, smear_w +Serialized model quantized+brotli: 15910712 bytes +Total submission size quantized+brotli: 15938772 bytes +diagnostic quantized val_loss:3.07560312 val_bpb:1.06714344 eval_time:11231ms +ttt_lora:warming up compile (random tokens, no val data) +ttt_lora:compile warmup done (84.6s) + +beginning TTT eval timer +ttt_phased: total_docs:50000 prefix_docs:2000 suffix_docs:48000 num_phases:3 boundaries:[666, 1333, 2000] +ttp: b775/782 bl:2.8961 bb:1.0364 rl:2.8961 rb:1.0364 dl:5294-5714 gd:0 +ttp: b774/782 bl:3.0449 bb:1.0829 rl:2.9678 rb:1.0589 dl:4929-5292 gd:0 +ttp: b769/782 bl:2.9453 bb:1.0215 rl:2.9618 rb:1.0486 dl:3826-4013 gd:0 +ttpp: phase:1/3 pd:1104 gd:666 t:133.1s +tttg: c1/85 lr:0.001000 t:0.3s +tttg: c2/85 lr:0.001000 t:0.4s +tttg: c3/85 lr:0.000999 t:0.4s +tttg: c4/85 lr:0.000997 t:0.5s +tttg: c5/85 lr:0.000994 t:0.6s +tttg: c6/85 lr:0.000991 t:0.7s +tttg: c7/85 lr:0.000987 t:0.7s +tttg: c8/85 lr:0.000983 t:0.8s +tttg: c9/85 lr:0.000978 t:0.9s +tttg: c10/85 lr:0.000972 t:1.0s +tttg: c11/85 lr:0.000965 t:1.0s +tttg: c12/85 lr:0.000958 t:1.1s +tttg: c13/85 lr:0.000950 t:1.2s +tttg: c14/85 lr:0.000942 t:1.3s +tttg: c15/85 lr:0.000933 t:1.3s +tttg: c16/85 lr:0.000923 t:1.4s +tttg: c17/85 lr:0.000913 t:1.5s +tttg: c18/85 lr:0.000902 t:1.6s +tttg: c19/85 lr:0.000891 t:1.6s +tttg: c20/85 lr:0.000879 t:1.7s +tttg: c21/85 lr:0.000867 t:1.8s +tttg: c22/85 lr:0.000854 t:1.9s +tttg: c23/85 lr:0.000840 t:1.9s +tttg: c24/85 lr:0.000826 t:2.0s +tttg: c25/85 lr:0.000812 t:2.1s +tttg: c26/85 lr:0.000797 t:2.2s +tttg: c27/85 lr:0.000782 t:2.2s +tttg: c28/85 lr:0.000766 t:2.3s +tttg: c29/85 lr:0.000750 t:2.4s +tttg: c30/85 lr:0.000734 t:2.4s +tttg: c31/85 lr:0.000717 t:2.5s +tttg: c32/85 lr:0.000700 t:2.6s +tttg: c33/85 lr:0.000683 t:2.7s +tttg: c34/85 lr:0.000665 
t:2.7s +tttg: c35/85 lr:0.000647 t:2.8s +tttg: c36/85 lr:0.000629 t:2.9s +tttg: c37/85 lr:0.000611 t:3.0s +tttg: c38/85 lr:0.000593 t:3.0s +tttg: c39/85 lr:0.000575 t:3.1s +tttg: c40/85 lr:0.000556 t:3.2s +tttg: c41/85 lr:0.000537 t:3.3s +tttg: c42/85 lr:0.000519 t:3.3s +tttg: c43/85 lr:0.000500 t:3.4s +tttg: c44/85 lr:0.000481 t:3.5s +tttg: c45/85 lr:0.000463 t:3.6s +tttg: c46/85 lr:0.000444 t:3.6s +tttg: c47/85 lr:0.000425 t:3.7s +tttg: c48/85 lr:0.000407 t:3.8s +tttg: c49/85 lr:0.000389 t:3.9s +tttg: c50/85 lr:0.000371 t:3.9s +tttg: c51/85 lr:0.000353 t:4.0s +tttg: c52/85 lr:0.000335 t:4.1s +tttg: c53/85 lr:0.000317 t:4.2s +tttg: c54/85 lr:0.000300 t:4.2s +tttg: c55/85 lr:0.000283 t:4.3s +tttg: c56/85 lr:0.000266 t:4.4s +tttg: c57/85 lr:0.000250 t:4.5s +tttg: c58/85 lr:0.000234 t:4.5s +tttg: c59/85 lr:0.000218 t:4.6s +tttg: c60/85 lr:0.000203 t:4.7s +tttg: c61/85 lr:0.000188 t:4.8s +tttg: c62/85 lr:0.000174 t:4.8s +tttg: c63/85 lr:0.000160 t:4.9s +tttg: c64/85 lr:0.000146 t:5.0s +tttg: c65/85 lr:0.000133 t:5.1s +tttg: c66/85 lr:0.000121 t:5.1s +tttg: c67/85 lr:0.000109 t:5.2s +tttg: c68/85 lr:0.000098 t:5.3s +tttg: c69/85 lr:0.000087 t:5.4s +tttg: c70/85 lr:0.000077 t:5.4s +tttg: c71/85 lr:0.000067 t:5.5s +tttg: c72/85 lr:0.000058 t:5.6s +tttg: c73/85 lr:0.000050 t:5.6s +tttg: c74/85 lr:0.000042 t:5.7s +tttg: c75/85 lr:0.000035 t:5.8s +tttg: c76/85 lr:0.000028 t:5.9s +tttg: c77/85 lr:0.000022 t:5.9s +tttg: c78/85 lr:0.000017 t:6.0s +tttg: c79/85 lr:0.000013 t:6.1s +tttg: c80/85 lr:0.000009 t:6.2s +tttg: c81/85 lr:0.000006 t:6.3s +tttg: c82/85 lr:0.000003 t:6.3s +tttg: c83/85 lr:0.000001 t:6.4s +tttg: c84/85 lr:0.000000 t:6.5s +ttpr: phase:1/3 t:141.4s +ttp: b762/782 bl:3.1124 bb:1.0940 rl:2.9885 rb:1.0567 dl:3096-3180 gd:0 +ttpp: phase:2/3 pd:1808 gd:1333 t:196.9s +tttg: c1/142 lr:0.001000 t:0.1s +tttg: c2/142 lr:0.001000 t:0.2s +tttg: c3/142 lr:0.001000 t:0.2s +tttg: c4/142 lr:0.000999 t:0.3s +tttg: c5/142 lr:0.000998 t:0.4s +tttg: c6/142 lr:0.000997 t:0.5s +tttg: c7/142 lr:0.000996 t:0.5s +tttg: c8/142 lr:0.000994 t:0.6s +tttg: c9/142 lr:0.000992 t:0.7s +tttg: c10/142 lr:0.000990 t:0.8s +tttg: c11/142 lr:0.000988 t:0.8s +tttg: c12/142 lr:0.000985 t:0.9s +tttg: c13/142 lr:0.000982 t:1.0s +tttg: c14/142 lr:0.000979 t:1.1s +tttg: c15/142 lr:0.000976 t:1.1s +tttg: c16/142 lr:0.000972 t:1.2s +tttg: c17/142 lr:0.000969 t:1.3s +tttg: c18/142 lr:0.000965 t:1.4s +tttg: c19/142 lr:0.000960 t:1.5s +tttg: c20/142 lr:0.000956 t:1.5s +tttg: c21/142 lr:0.000951 t:1.6s +tttg: c22/142 lr:0.000946 t:1.7s +tttg: c23/142 lr:0.000941 t:1.8s +tttg: c24/142 lr:0.000936 t:1.8s +tttg: c25/142 lr:0.000930 t:1.9s +tttg: c26/142 lr:0.000924 t:2.0s +tttg: c27/142 lr:0.000918 t:2.1s +tttg: c28/142 lr:0.000912 t:2.1s +tttg: c29/142 lr:0.000906 t:2.2s +tttg: c30/142 lr:0.000899 t:2.3s +tttg: c31/142 lr:0.000892 t:2.4s +tttg: c32/142 lr:0.000885 t:2.4s +tttg: c33/142 lr:0.000878 t:2.5s +tttg: c34/142 lr:0.000871 t:2.6s +tttg: c35/142 lr:0.000863 t:2.7s +tttg: c36/142 lr:0.000856 t:2.7s +tttg: c37/142 lr:0.000848 t:2.8s +tttg: c38/142 lr:0.000840 t:2.9s +tttg: c39/142 lr:0.000831 t:3.0s +tttg: c40/142 lr:0.000823 t:3.1s +tttg: c41/142 lr:0.000814 t:3.1s +tttg: c42/142 lr:0.000805 t:3.2s +tttg: c43/142 lr:0.000797 t:3.3s +tttg: c44/142 lr:0.000788 t:3.4s +tttg: c45/142 lr:0.000778 t:3.4s +tttg: c46/142 lr:0.000769 t:3.5s +tttg: c47/142 lr:0.000760 t:3.6s +tttg: c48/142 lr:0.000750 t:3.7s +tttg: c49/142 lr:0.000740 t:3.7s +tttg: c50/142 lr:0.000730 t:3.8s +tttg: c51/142 lr:0.000721 t:3.9s +tttg: c52/142 lr:0.000710 
t:4.0s +tttg: c53/142 lr:0.000700 t:4.0s +tttg: c54/142 lr:0.000690 t:4.1s +tttg: c55/142 lr:0.000680 t:4.2s +tttg: c56/142 lr:0.000669 t:4.3s +tttg: c57/142 lr:0.000659 t:4.4s +tttg: c58/142 lr:0.000648 t:4.4s +tttg: c59/142 lr:0.000637 t:4.5s +tttg: c60/142 lr:0.000627 t:4.6s +tttg: c61/142 lr:0.000616 t:4.7s +tttg: c62/142 lr:0.000605 t:4.7s +tttg: c63/142 lr:0.000594 t:4.8s +tttg: c64/142 lr:0.000583 t:4.9s +tttg: c65/142 lr:0.000572 t:5.0s +tttg: c66/142 lr:0.000561 t:5.0s +tttg: c67/142 lr:0.000550 t:5.1s +tttg: c68/142 lr:0.000539 t:5.2s +tttg: c69/142 lr:0.000528 t:5.3s +tttg: c70/142 lr:0.000517 t:5.4s +tttg: c71/142 lr:0.000506 t:5.4s +tttg: c72/142 lr:0.000494 t:5.5s +tttg: c73/142 lr:0.000483 t:5.6s +tttg: c74/142 lr:0.000472 t:5.7s +tttg: c75/142 lr:0.000461 t:5.7s +tttg: c76/142 lr:0.000450 t:5.8s +tttg: c77/142 lr:0.000439 t:5.9s +tttg: c78/142 lr:0.000428 t:6.0s +tttg: c79/142 lr:0.000417 t:6.0s +tttg: c80/142 lr:0.000406 t:6.1s +tttg: c81/142 lr:0.000395 t:6.2s +tttg: c82/142 lr:0.000384 t:6.3s +tttg: c83/142 lr:0.000373 t:6.3s +tttg: c84/142 lr:0.000363 t:6.4s +tttg: c85/142 lr:0.000352 t:6.5s +tttg: c86/142 lr:0.000341 t:6.6s +tttg: c87/142 lr:0.000331 t:6.6s +tttg: c88/142 lr:0.000320 t:6.7s +tttg: c89/142 lr:0.000310 t:6.8s +tttg: c90/142 lr:0.000300 t:6.9s +tttg: c91/142 lr:0.000290 t:6.9s +tttg: c92/142 lr:0.000279 t:7.0s +tttg: c93/142 lr:0.000270 t:7.1s +tttg: c94/142 lr:0.000260 t:7.2s +tttg: c95/142 lr:0.000250 t:7.2s +tttg: c96/142 lr:0.000240 t:7.3s +tttg: c97/142 lr:0.000231 t:7.4s +tttg: c98/142 lr:0.000222 t:7.5s +tttg: c99/142 lr:0.000212 t:7.5s +tttg: c100/142 lr:0.000203 t:7.6s +tttg: c101/142 lr:0.000195 t:7.7s +tttg: c102/142 lr:0.000186 t:7.8s +tttg: c103/142 lr:0.000177 t:7.8s +tttg: c104/142 lr:0.000169 t:7.9s +tttg: c105/142 lr:0.000160 t:8.0s +tttg: c106/142 lr:0.000152 t:8.1s +tttg: c107/142 lr:0.000144 t:8.1s +tttg: c108/142 lr:0.000137 t:8.2s +tttg: c109/142 lr:0.000129 t:8.3s +tttg: c110/142 lr:0.000122 t:8.4s +tttg: c111/142 lr:0.000115 t:8.4s +tttg: c112/142 lr:0.000108 t:8.5s +tttg: c113/142 lr:0.000101 t:8.6s +tttg: c114/142 lr:0.000094 t:8.7s +tttg: c115/142 lr:0.000088 t:8.7s +tttg: c116/142 lr:0.000082 t:8.8s +tttg: c117/142 lr:0.000076 t:8.9s +tttg: c118/142 lr:0.000070 t:9.0s +tttg: c119/142 lr:0.000064 t:9.0s +tttg: c120/142 lr:0.000059 t:9.1s +tttg: c121/142 lr:0.000054 t:9.2s +tttg: c122/142 lr:0.000049 t:9.3s +tttg: c123/142 lr:0.000044 t:9.3s +tttg: c124/142 lr:0.000040 t:9.4s +tttg: c125/142 lr:0.000035 t:9.5s +tttg: c126/142 lr:0.000031 t:9.6s +tttg: c127/142 lr:0.000028 t:9.6s +tttg: c128/142 lr:0.000024 t:9.7s +tttg: c129/142 lr:0.000021 t:9.8s +tttg: c130/142 lr:0.000018 t:9.9s +tttg: c131/142 lr:0.000015 t:9.9s +tttg: c132/142 lr:0.000012 t:10.0s +tttg: c133/142 lr:0.000010 t:10.1s +tttg: c134/142 lr:0.000008 t:10.2s +tttg: c135/142 lr:0.000006 t:10.2s +tttg: c136/142 lr:0.000004 t:10.3s +tttg: c137/142 lr:0.000003 t:10.4s +tttg: c138/142 lr:0.000002 t:10.5s +tttg: c139/142 lr:0.000001 t:10.5s +tttg: c140/142 lr:0.000000 t:10.6s +tttg: c141/142 lr:0.000000 t:10.7s +ttpr: phase:2/3 t:209.4s +ttp: b752/782 bl:3.0154 bb:1.0359 rl:2.9918 rb:1.0540 dl:2475-2522 gd:0 +ttpp: phase:3/3 pd:2448 gd:2000 t:220.5s +tttg: c1/192 lr:0.001000 t:0.1s +tttg: c2/192 lr:0.001000 t:0.2s +tttg: c3/192 lr:0.001000 t:0.2s +tttg: c4/192 lr:0.000999 t:0.3s +tttg: c5/192 lr:0.000999 t:0.4s +tttg: c6/192 lr:0.000998 t:0.5s +tttg: c7/192 lr:0.000998 t:0.5s +tttg: c8/192 lr:0.000997 t:0.6s +tttg: c9/192 lr:0.000996 t:0.7s +tttg: c10/192 lr:0.000995 
t:0.8s +tttg: c11/192 lr:0.000993 t:0.8s +tttg: c12/192 lr:0.000992 t:0.9s +tttg: c13/192 lr:0.000990 t:1.0s +tttg: c14/192 lr:0.000989 t:1.1s +tttg: c15/192 lr:0.000987 t:1.1s +tttg: c16/192 lr:0.000985 t:1.2s +tttg: c17/192 lr:0.000983 t:1.3s +tttg: c18/192 lr:0.000981 t:1.4s +tttg: c19/192 lr:0.000978 t:1.4s +tttg: c20/192 lr:0.000976 t:1.5s +tttg: c21/192 lr:0.000973 t:1.6s +tttg: c22/192 lr:0.000970 t:1.7s +tttg: c23/192 lr:0.000968 t:1.8s +tttg: c24/192 lr:0.000965 t:1.8s +tttg: c25/192 lr:0.000962 t:1.9s +tttg: c26/192 lr:0.000958 t:2.0s +tttg: c27/192 lr:0.000955 t:2.1s +tttg: c28/192 lr:0.000951 t:2.1s +tttg: c29/192 lr:0.000948 t:2.2s +tttg: c30/192 lr:0.000944 t:2.3s +tttg: c31/192 lr:0.000940 t:2.4s +tttg: c32/192 lr:0.000936 t:2.4s +tttg: c33/192 lr:0.000932 t:2.5s +tttg: c34/192 lr:0.000928 t:2.6s +tttg: c35/192 lr:0.000924 t:2.7s +tttg: c36/192 lr:0.000919 t:2.7s +tttg: c37/192 lr:0.000915 t:2.8s +tttg: c38/192 lr:0.000910 t:2.9s +tttg: c39/192 lr:0.000905 t:3.0s +tttg: c40/192 lr:0.000901 t:3.1s +tttg: c41/192 lr:0.000896 t:3.1s +tttg: c42/192 lr:0.000891 t:3.2s +tttg: c43/192 lr:0.000885 t:3.3s +tttg: c44/192 lr:0.000880 t:3.4s +tttg: c45/192 lr:0.000875 t:3.4s +tttg: c46/192 lr:0.000869 t:3.5s +tttg: c47/192 lr:0.000864 t:3.6s +tttg: c48/192 lr:0.000858 t:3.7s +tttg: c49/192 lr:0.000852 t:3.8s +tttg: c50/192 lr:0.000846 t:3.8s +tttg: c51/192 lr:0.000840 t:3.9s +tttg: c52/192 lr:0.000834 t:4.0s +tttg: c53/192 lr:0.000828 t:4.1s +tttg: c54/192 lr:0.000822 t:4.1s +tttg: c55/192 lr:0.000815 t:4.2s +tttg: c56/192 lr:0.000809 t:4.3s +tttg: c57/192 lr:0.000802 t:4.4s +tttg: c58/192 lr:0.000796 t:4.4s +tttg: c59/192 lr:0.000789 t:4.5s +tttg: c60/192 lr:0.000782 t:4.6s +tttg: c61/192 lr:0.000776 t:4.7s +tttg: c62/192 lr:0.000769 t:4.7s +tttg: c63/192 lr:0.000762 t:4.8s +tttg: c64/192 lr:0.000755 t:4.9s +tttg: c65/192 lr:0.000748 t:5.0s +tttg: c66/192 lr:0.000740 t:5.0s +tttg: c67/192 lr:0.000733 t:5.1s +tttg: c68/192 lr:0.000726 t:5.2s +tttg: c69/192 lr:0.000719 t:5.3s +tttg: c70/192 lr:0.000711 t:5.3s +tttg: c71/192 lr:0.000704 t:5.4s +tttg: c72/192 lr:0.000696 t:5.5s +tttg: c73/192 lr:0.000688 t:5.6s +tttg: c74/192 lr:0.000681 t:5.7s +tttg: c75/192 lr:0.000673 t:5.7s +tttg: c76/192 lr:0.000665 t:5.8s +tttg: c77/192 lr:0.000658 t:5.9s +tttg: c78/192 lr:0.000650 t:6.0s +tttg: c79/192 lr:0.000642 t:6.0s +tttg: c80/192 lr:0.000634 t:6.1s +tttg: c81/192 lr:0.000626 t:6.2s +tttg: c82/192 lr:0.000618 t:6.3s +tttg: c83/192 lr:0.000610 t:6.3s +tttg: c84/192 lr:0.000602 t:6.4s +tttg: c85/192 lr:0.000594 t:6.5s +tttg: c86/192 lr:0.000586 t:6.6s +tttg: c87/192 lr:0.000578 t:6.6s +tttg: c88/192 lr:0.000570 t:6.7s +tttg: c89/192 lr:0.000562 t:6.8s +tttg: c90/192 lr:0.000553 t:6.9s +tttg: c91/192 lr:0.000545 t:6.9s +tttg: c92/192 lr:0.000537 t:7.0s +tttg: c93/192 lr:0.000529 t:7.1s +tttg: c94/192 lr:0.000521 t:7.2s +tttg: c95/192 lr:0.000512 t:7.2s +tttg: c96/192 lr:0.000504 t:7.3s +tttg: c97/192 lr:0.000496 t:7.4s +tttg: c98/192 lr:0.000488 t:7.5s +tttg: c99/192 lr:0.000479 t:7.5s +tttg: c100/192 lr:0.000471 t:7.6s +tttg: c101/192 lr:0.000463 t:7.7s +tttg: c102/192 lr:0.000455 t:7.8s +tttg: c103/192 lr:0.000447 t:7.9s +tttg: c104/192 lr:0.000438 t:7.9s +tttg: c105/192 lr:0.000430 t:8.0s +tttg: c106/192 lr:0.000422 t:8.1s +tttg: c107/192 lr:0.000414 t:8.2s +tttg: c108/192 lr:0.000406 t:8.2s +tttg: c109/192 lr:0.000398 t:8.3s +tttg: c110/192 lr:0.000390 t:8.4s +tttg: c111/192 lr:0.000382 t:8.5s +tttg: c112/192 lr:0.000374 t:8.5s +tttg: c113/192 lr:0.000366 t:8.6s +tttg: c114/192 lr:0.000358 
t:8.7s +tttg: c115/192 lr:0.000350 t:8.8s +tttg: c116/192 lr:0.000342 t:8.8s +tttg: c117/192 lr:0.000335 t:8.9s +tttg: c118/192 lr:0.000327 t:9.0s +tttg: c119/192 lr:0.000319 t:9.1s +tttg: c120/192 lr:0.000312 t:9.1s +tttg: c121/192 lr:0.000304 t:9.2s +tttg: c122/192 lr:0.000296 t:9.3s +tttg: c123/192 lr:0.000289 t:9.4s +tttg: c124/192 lr:0.000281 t:9.5s +tttg: c125/192 lr:0.000274 t:9.5s +tttg: c126/192 lr:0.000267 t:9.6s +tttg: c127/192 lr:0.000260 t:9.7s +tttg: c128/192 lr:0.000252 t:9.8s +tttg: c129/192 lr:0.000245 t:9.8s +tttg: c130/192 lr:0.000238 t:9.9s +tttg: c131/192 lr:0.000231 t:10.0s +tttg: c132/192 lr:0.000224 t:10.1s +tttg: c133/192 lr:0.000218 t:10.1s +tttg: c134/192 lr:0.000211 t:10.2s +tttg: c135/192 lr:0.000204 t:10.3s +tttg: c136/192 lr:0.000198 t:10.4s +tttg: c137/192 lr:0.000191 t:10.4s +tttg: c138/192 lr:0.000185 t:10.5s +tttg: c139/192 lr:0.000178 t:10.6s +tttg: c140/192 lr:0.000172 t:10.7s +tttg: c141/192 lr:0.000166 t:10.7s +tttg: c142/192 lr:0.000160 t:10.8s +tttg: c143/192 lr:0.000154 t:10.9s +tttg: c144/192 lr:0.000148 t:11.0s +tttg: c145/192 lr:0.000142 t:11.0s +tttg: c146/192 lr:0.000136 t:11.1s +tttg: c147/192 lr:0.000131 t:11.2s +tttg: c148/192 lr:0.000125 t:11.3s +tttg: c149/192 lr:0.000120 t:11.3s +tttg: c150/192 lr:0.000115 t:11.4s +tttg: c151/192 lr:0.000109 t:11.5s +tttg: c152/192 lr:0.000104 t:11.6s +tttg: c153/192 lr:0.000099 t:11.6s +tttg: c154/192 lr:0.000095 t:11.7s +tttg: c155/192 lr:0.000090 t:11.8s +tttg: c156/192 lr:0.000085 t:11.9s +tttg: c157/192 lr:0.000081 t:11.9s +tttg: c158/192 lr:0.000076 t:12.0s +tttg: c159/192 lr:0.000072 t:12.1s +tttg: c160/192 lr:0.000068 t:12.2s +tttg: c161/192 lr:0.000064 t:12.2s +tttg: c162/192 lr:0.000060 t:12.3s +tttg: c163/192 lr:0.000056 t:12.4s +tttg: c164/192 lr:0.000052 t:12.5s +tttg: c165/192 lr:0.000049 t:12.5s +tttg: c166/192 lr:0.000045 t:12.6s +tttg: c167/192 lr:0.000042 t:12.7s +tttg: c168/192 lr:0.000038 t:12.8s +tttg: c169/192 lr:0.000035 t:12.9s +tttg: c170/192 lr:0.000032 t:12.9s +tttg: c171/192 lr:0.000030 t:13.0s +tttg: c172/192 lr:0.000027 t:13.1s +tttg: c173/192 lr:0.000024 t:13.2s +tttg: c174/192 lr:0.000022 t:13.2s +tttg: c175/192 lr:0.000019 t:13.3s +tttg: c176/192 lr:0.000017 t:13.4s +tttg: c177/192 lr:0.000015 t:13.5s +tttg: c178/192 lr:0.000013 t:13.5s +tttg: c179/192 lr:0.000011 t:13.6s +tttg: c180/192 lr:0.000010 t:13.7s +tttg: c181/192 lr:0.000008 t:13.8s +tttg: c182/192 lr:0.000007 t:13.8s +tttg: c183/192 lr:0.000005 t:13.9s +tttg: c184/192 lr:0.000004 t:14.0s +tttg: c185/192 lr:0.000003 t:14.1s +tttg: c186/192 lr:0.000002 t:14.1s +tttg: c187/192 lr:0.000002 t:14.2s +tttg: c188/192 lr:0.000001 t:14.3s +tttg: c189/192 lr:0.000001 t:14.4s +tttg: c190/192 lr:0.000000 t:14.4s +tttg: c191/192 lr:0.000000 t:14.5s +ttpr: phase:3/3 t:236.8s +ttp: b738/782 bl:3.0362 bb:1.0400 rl:2.9958 rb:1.0527 dl:1974-1999 gd:1 +ttp: b733/782 bl:2.9970 bb:1.0280 rl:2.9959 rb:1.0508 dl:1855-1877 gd:1 +ttp: b722/782 bl:2.9445 bb:1.0190 rl:2.9926 rb:1.0487 dl:1657-1675 gd:1 +ttp: b719/782 bl:3.0794 bb:1.0632 rl:2.9977 rb:1.0496 dl:1614-1627 gd:1 +ttp: b709/782 bl:2.9910 bb:1.0318 rl:2.9974 rb:1.0486 dl:1483-1495 gd:1 +ttp: b696/782 bl:3.0426 bb:1.0595 rl:2.9994 rb:1.0491 dl:1362-1371 gd:1 +ttp: b692/782 bl:3.1296 bb:1.0738 rl:3.0050 rb:1.0502 dl:1329-1338 gd:1 +ttp: b686/782 bl:2.9798 bb:1.0080 rl:3.0040 rb:1.0485 dl:1282-1289 gd:1 +ttp: b677/782 bl:3.0582 bb:1.0371 rl:3.0059 rb:1.0481 dl:1218-1226 gd:1 +ttp: b667/782 bl:3.0622 bb:1.0558 rl:3.0078 rb:1.0483 dl:1159-1165 gd:1 +ttp: b656/782 bl:3.0689 bb:1.0411 
rl:3.0096 rb:1.0481 dl:1099-1104 gd:1 +ttp: b654/782 bl:2.9881 bb:1.0430 rl:3.0090 rb:1.0480 dl:1091-1095 gd:1 +ttp: b646/782 bl:3.0708 bb:1.0613 rl:3.0107 rb:1.0483 dl:1049-1054 gd:1 +ttp: b639/782 bl:3.0382 bb:1.0260 rl:3.0114 rb:1.0477 dl:1015-1020 gd:1 +ttp: b625/782 bl:3.0419 bb:1.0544 rl:3.0121 rb:1.0479 dl:957-961 gd:1 +ttp: b623/782 bl:3.0775 bb:1.0437 rl:3.0136 rb:1.0478 dl:948-953 gd:1 +ttp: b615/782 bl:2.9772 bb:1.0105 rl:3.0128 rb:1.0470 dl:916-920 gd:1 +ttp: b601/782 bl:3.0443 bb:1.0328 rl:3.0135 rb:1.0467 dl:866-869 gd:1 +ttp: b593/782 bl:3.0491 bb:1.0520 rl:3.0142 rb:1.0468 dl:839-842 gd:1 +ttp: b584/782 bl:3.0053 bb:1.0199 rl:3.0140 rb:1.0463 dl:811-814 gd:1 +ttp: b580/782 bl:3.0406 bb:1.0463 rl:3.0145 rb:1.0463 dl:799-802 gd:1 +ttp: b572/782 bl:3.0728 bb:1.0448 rl:3.0154 rb:1.0463 dl:777-779 gd:1 +ttp: b563/782 bl:3.0543 bb:1.0532 rl:3.0161 rb:1.0464 dl:751-754 gd:1 +ttp: b555/782 bl:2.9852 bb:1.0291 rl:3.0156 rb:1.0461 dl:730-732 gd:1 +ttp: b547/782 bl:3.0153 bb:1.0312 rl:3.0156 rb:1.0459 dl:709-713 gd:1 +ttp: b540/782 bl:2.9828 bb:1.0224 rl:3.0151 rb:1.0455 dl:692-694 gd:1 +ttp: b532/782 bl:3.0479 bb:1.0328 rl:3.0156 rb:1.0454 dl:675-677 gd:1 +ttp: b521/782 bl:3.0269 bb:1.0284 rl:3.0157 rb:1.0451 dl:650-652 gd:1 +ttp: b513/782 bl:2.9417 bb:1.0080 rl:3.0148 rb:1.0447 dl:633-635 gd:1 +ttp: b505/782 bl:3.1445 bb:1.0658 rl:3.0163 rb:1.0449 dl:616-618 gd:1 +ttp: b498/782 bl:3.1072 bb:1.0692 rl:3.0174 rb:1.0452 dl:602-603 gd:1 +ttp: b490/782 bl:3.0424 bb:1.0536 rl:3.0177 rb:1.0453 dl:586-588 gd:1 +ttp: b482/782 bl:3.0908 bb:1.0539 rl:3.0184 rb:1.0454 dl:570-573 gd:1 +ttp: b477/782 bl:3.0374 bb:1.0126 rl:3.0186 rb:1.0451 dl:561-563 gd:1 +ttp: b469/782 bl:3.0695 bb:1.0586 rl:3.0191 rb:1.0452 dl:547-549 gd:1 +ttp: b461/782 bl:3.0864 bb:1.0338 rl:3.0198 rb:1.0451 dl:533-535 gd:1 +ttp: b450/782 bl:3.0330 bb:1.0204 rl:3.0199 rb:1.0448 dl:514-516 gd:1 +ttp: b442/782 bl:3.1304 bb:1.0740 rl:3.0209 rb:1.0451 dl:500-502 gd:1 +ttp: b434/782 bl:2.9952 bb:1.0559 rl:3.0207 rb:1.0452 dl:488-490 gd:1 +ttp: b427/782 bl:3.0859 bb:1.0623 rl:3.0212 rb:1.0453 dl:477-479 gd:1 +ttp: b421/782 bl:3.1047 bb:1.0321 rl:3.0219 rb:1.0452 dl:468-470 gd:1 +ttp: b414/782 bl:3.1737 bb:1.0790 rl:3.0231 rb:1.0455 dl:457-458 gd:1 +ttp: b407/782 bl:3.1484 bb:1.0707 rl:3.0241 rb:1.0457 dl:446-448 gd:1 +ttp: b399/782 bl:3.0869 bb:1.0569 rl:3.0245 rb:1.0458 dl:435-436 gd:1 +ttp: b390/782 bl:3.1211 bb:1.0618 rl:3.0252 rb:1.0459 dl:423-424 gd:1 +ttp: b382/782 bl:3.0769 bb:1.0520 rl:3.0256 rb:1.0459 dl:412-414 gd:1 +ttp: b374/782 bl:3.1337 bb:1.0721 rl:3.0263 rb:1.0461 dl:401-403 gd:1 +ttp: b366/782 bl:3.0377 bb:1.0297 rl:3.0264 rb:1.0460 dl:391-392 gd:1 +ttp: b357/782 bl:3.0382 bb:1.0308 rl:3.0264 rb:1.0459 dl:380-382 gd:1 +ttp: b352/782 bl:3.1602 bb:1.0767 rl:3.0272 rb:1.0461 dl:374-375 gd:1 +ttp: b344/782 bl:3.0313 bb:1.0515 rl:3.0273 rb:1.0461 dl:364-365 gd:1 +ttp: b335/782 bl:3.0788 bb:1.0809 rl:3.0276 rb:1.0463 dl:353-355 gd:1 +ttp: b326/782 bl:3.1234 bb:1.0799 rl:3.0281 rb:1.0465 dl:343-344 gd:1 +ttp: b318/782 bl:3.1266 bb:1.0850 rl:3.0286 rb:1.0467 dl:334-335 gd:1 +ttp: b313/782 bl:3.0868 bb:1.0825 rl:3.0289 rb:1.0469 dl:328-329 gd:1 +ttp: b305/782 bl:3.1795 bb:1.0967 rl:3.0297 rb:1.0471 dl:318-320 gd:1 +ttp: b297/782 bl:3.1465 bb:1.0895 rl:3.0302 rb:1.0473 dl:310-311 gd:1 +ttp: b288/782 bl:3.1540 bb:1.0898 rl:3.0308 rb:1.0475 dl:300-301 gd:1 +ttp: b280/782 bl:3.1240 bb:1.1098 rl:3.0312 rb:1.0478 dl:292-293 gd:1 +ttp: b271/782 bl:3.1256 bb:1.0838 rl:3.0316 rb:1.0480 dl:283-284 gd:1 +ttp: b263/782 bl:3.1529 
bb:1.0929 rl:3.0322 rb:1.0482 dl:274-275 gd:1 +ttp: b255/782 bl:3.1124 bb:1.0653 rl:3.0325 rb:1.0482 dl:267-268 gd:1 +ttp: b246/782 bl:3.1637 bb:1.0712 rl:3.0330 rb:1.0483 dl:259-260 gd:1 +ttp: b240/782 bl:3.2566 bb:1.1377 rl:3.0339 rb:1.0487 dl:254-254 gd:1 +ttp: b230/782 bl:3.2479 bb:1.1269 rl:3.0347 rb:1.0490 dl:244-245 gd:1 +ttp: b222/782 bl:3.1368 bb:1.0826 rl:3.0350 rb:1.0491 dl:237-238 gd:1 +ttp: b215/782 bl:3.0804 bb:1.0700 rl:3.0352 rb:1.0492 dl:231-232 gd:1 +ttp: b206/782 bl:3.0993 bb:1.1015 rl:3.0354 rb:1.0493 dl:224-225 gd:1 +ttp: b198/782 bl:3.1701 bb:1.1260 rl:3.0358 rb:1.0496 dl:217-218 gd:1 +ttp: b190/782 bl:3.2051 bb:1.1024 rl:3.0364 rb:1.0497 dl:211-212 gd:1 +ttp: b182/782 bl:3.3808 bb:1.1936 rl:3.0374 rb:1.0502 dl:204-205 gd:1 +ttp: b174/782 bl:3.1089 bb:1.0840 rl:3.0376 rb:1.0503 dl:199-199 gd:1 +ttp: b166/782 bl:3.1281 bb:1.0699 rl:3.0379 rb:1.0503 dl:192-193 gd:1 +ttp: b161/782 bl:3.2363 bb:1.1415 rl:3.0384 rb:1.0506 dl:188-189 gd:1 +ttp: b152/782 bl:3.2996 bb:1.1526 rl:3.0391 rb:1.0508 dl:182-183 gd:1 +ttp: b146/782 bl:3.2788 bb:1.1403 rl:3.0397 rb:1.0511 dl:178-178 gd:1 +ttp: b138/782 bl:3.2041 bb:1.1095 rl:3.0402 rb:1.0512 dl:172-172 gd:1 +ttp: b130/782 bl:3.2124 bb:1.1260 rl:3.0406 rb:1.0514 dl:166-166 gd:1 +ttp: b121/782 bl:3.2959 bb:1.1570 rl:3.0412 rb:1.0516 dl:160-160 gd:1 +ttp: b111/782 bl:3.2348 bb:1.1652 rl:3.0416 rb:1.0519 dl:153-154 gd:1 +ttp: b104/782 bl:3.2349 bb:1.1250 rl:3.0420 rb:1.0520 dl:148-149 gd:1 +ttp: b97/782 bl:3.2998 bb:1.1876 rl:3.0425 rb:1.0523 dl:144-144 gd:1 +ttp: b87/782 bl:3.2870 bb:1.1609 rl:3.0430 rb:1.0525 dl:137-138 gd:1 +ttp: b79/782 bl:3.2824 bb:1.1627 rl:3.0435 rb:1.0527 dl:131-132 gd:1 +ttp: b71/782 bl:3.3154 bb:1.1671 rl:3.0440 rb:1.0529 dl:125-126 gd:1 +ttp: b62/782 bl:3.3892 bb:1.2143 rl:3.0445 rb:1.0532 dl:119-120 gd:1 +ttp: b55/782 bl:3.3791 bb:1.1936 rl:3.0451 rb:1.0534 dl:114-115 gd:1 +ttp: b48/782 bl:3.3922 bb:1.1827 rl:3.0456 rb:1.0536 dl:110-110 gd:1 +ttp: b40/782 bl:3.1990 bb:1.1406 rl:3.0459 rb:1.0537 dl:104-104 gd:1 +ttp: b29/782 bl:3.4635 bb:1.2324 rl:3.0464 rb:1.0540 dl:96-97 gd:1 +ttp: b28/782 bl:3.4917 bb:1.2268 rl:3.0470 rb:1.0542 dl:95-96 gd:1 +ttp: b20/782 bl:3.4816 bb:1.2107 rl:3.0476 rb:1.0544 dl:88-89 gd:1 +ttp: b12/782 bl:3.6249 bb:1.2594 rl:3.0482 rb:1.0546 dl:81-81 gd:1 +ttp: b3/782 bl:3.5694 bb:1.1938 rl:3.0487 rb:1.0548 dl:65-68 gd:1 +quantized_ttt_phased val_loss:3.04845738 val_bpb:1.05776870 eval_time:306651ms +total_eval_time:306.7s diff --git a/records/track_10min_16mb/2026-04-17_CasefoldV4_AttnOutGate_PhasedTTT/train_seed42.log b/records/track_10min_16mb/2026-04-17_CasefoldV4_AttnOutGate_PhasedTTT/train_seed42.log new file mode 100644 index 0000000000..ebb2d03d80 --- /dev/null +++ b/records/track_10min_16mb/2026-04-17_CasefoldV4_AttnOutGate_PhasedTTT/train_seed42.log @@ -0,0 +1,698 @@ + +***************************************** +Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
+***************************************** +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + artifact_dir: + attn_clip_sigmas: 13.0 + attn_out_gate: True + beta1: 0.9 + beta2: 0.95 + compressor: brotli + data_dir: /tmp/casefold_data/ + datasets_dir: /tmp/casefold_data/datasets/fineweb10B_sp8192 + distributed: True + ema_decay: 0.9965 + embed_bits: 7 + embed_clip_sigmas: 15.0 + embed_lr: 0.6 + embed_wd: 0.085 + enable_looping_at: 0.35 + eval_seq_len: 2048 + eval_stride: 64 + global_ttt_batch_seqs: 32 + global_ttt_chunk_tokens: 32768 + global_ttt_epochs: 1 + global_ttt_grad_clip: 1.0 + global_ttt_lr: 0.001 + global_ttt_momentum: 0.9 + global_ttt_respect_doc_boundaries: True + global_ttt_warmup_chunks: 0 + global_ttt_warmup_start_lr: 0.0 + gptq_calibration_batches: 16 + gptq_reserve_seconds: 4.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/PR1530_casefold_gates_s42.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.026 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_clip_sigmas: 12.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_momentum: 0.97 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.095 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_final_lane: mean + parallel_start_layer: 8 + phased_ttt_enabled: True + phased_ttt_num_phases: 3 + phased_ttt_prefix_docs: 2000 + qk_gain_init: 5.0 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + rope_yarn: False + run_id: PR1530_casefold_gates_s42 + scalar_lr: 0.02 + seed: 42 + skip_gates_enabled: True + sliding_window_enabled: False + smear_gate: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: /tmp/casefold_data/tokenizers/fineweb_8192_bpe.model + train_batch_tokens: 786432 + train_files: /tmp/casefold_data/datasets/fineweb10B_sp8192/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + ttt_batch_size: 64 + ttt_beta1: 0.0 + ttt_beta2: 0.999 + ttt_chunk_size: 48 + ttt_enabled: True + ttt_eval_batches: + ttt_eval_seq_len: 2048 + ttt_grad_steps: 1 + ttt_k_lora: True + ttt_lora_lr: 0.0001 + ttt_lora_rank: 96 + ttt_mlp_lora: True + ttt_o_lora: True + ttt_optimizer: adam + ttt_weight_decay: 0.5 + val_batch_tokens: 524288 + val_doc_fraction: 1.0 + val_files: /tmp/casefold_data/datasets/fineweb10B_sp8192/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.75 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +train_shards: 72 +val_tokens: 36335616 +model_params:35945671 +gptq:reserving 4s, effective=596000ms +warmup_cu_buckets:64,128,192,256 iters_each:3 +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +0/20000 val_loss: 9.0165 val_bpb: 3.1285 +1/20000 train_loss: 9.0164 train_time: 0.0m tok/s: 12164971 +2/20000 train_loss: 12.3353 train_time: 0.0m tok/s: 11332301 +3/20000 train_loss: 11.2214 train_time: 0.0m tok/s: 
+0/20000 val_loss: 9.0165 val_bpb: 3.1285
+1/20000 train_loss: 9.0164 train_time: 0.0m tok/s: 12164971
+2/20000 train_loss: 12.3353 train_time: 0.0m tok/s: 11332301
+3/20000 train_loss: 11.2214 train_time: 0.0m tok/s: 10188401
+4/20000 train_loss: 9.8791 train_time: 0.0m tok/s: 9731075
+5/20000 train_loss: 8.7194 train_time: 0.0m tok/s: 9450621
+500/20000 train_loss: 3.5511 train_time: 0.8m tok/s: 8219161
+1000/20000 train_loss: 3.4946 train_time: 1.6m tok/s: 8195742
+1500/20000 train_loss: 3.4057 train_time: 2.4m tok/s: 8188091
+2000/20000 train_loss: 3.3157 train_time: 3.2m tok/s: 8185329
+layer_loop:enabled step:2169 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
+2500/20000 train_loss: 3.3822 train_time: 4.2m tok/s: 7711131
+3000/20000 train_loss: 3.1195 train_time: 5.4m tok/s: 7253288
+3500/20000 train_loss: 3.2426 train_time: 6.6m tok/s: 6958354
+4000/20000 train_loss: 3.2104 train_time: 7.8m tok/s: 6739340
+4000/20000 val_loss: 3.1630 val_bpb: 1.0975
+4500/20000 train_loss: 3.0411 train_time: 8.9m tok/s: 6590473
+4902/20000 val_loss: 3.0463 val_bpb: 1.0570
+stopping_early: wallclock_cap train_time: 596052ms step: 4902/20000
+peak memory allocated: 40372 MiB reserved: 44438 MiB
+ema:applying EMA weights
+diagnostic pre-quantization post-ema val_loss:3.04446985 val_bpb:1.05634111 eval_time:10321ms
+Serialized model: 135417533 bytes
+Code size (uncompressed): 124826 bytes
+Code size (compressed): 28060 bytes
+GPTQ:collecting Hessians from calibration data...
+GPTQ:collected 67 Hessians in 3.4s
+Quantized weights:
+ gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight
+ gptq (int7): tok_emb.weight
+ passthrough (float16): blocks.attn.attn_out_gate_w, blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, parallel_post_lambdas, parallel_resid_lambdas, skip_gates, skip_weights, smear_lambda, smear_w
+Serialized model quantized+brotli: 15908209 bytes
+Total submission size quantized+brotli: 15936269 bytes
+diagnostic quantized val_loss:3.07327027 val_bpb:1.06633401 eval_time:41297ms
+ttt_lora:warming up compile (random tokens, no val data)
+ttt_lora:compile warmup done (90.5s)
+
+beginning TTT eval timer
+ttt_phased: total_docs:50000 prefix_docs:2000 suffix_docs:48000 num_phases:3 boundaries:[666, 1333, 2000]
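The `boundaries:[666, 1333, 2000]` in the `ttt_phased` line are consistent with an even floor-division split of `phased_ttt_prefix_docs: 2000` across `phased_ttt_num_phases: 3`. A small sketch under that assumption (the function name is illustrative):

```python
def phase_boundaries(prefix_docs: int, num_phases: int) -> list[int]:
    """Cumulative document boundaries for an even split of the TTT prefix."""
    return [prefix_docs * (i + 1) // num_phases for i in range(num_phases)]

assert phase_boundaries(2000, 3) == [666, 1333, 2000]
```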
+ttp: b779/782 bl:2.9541 bb:1.0814 rl:2.9541 rb:1.0814 dl:8165-9926 gd:0
+ttp: b771/782 bl:3.0565 bb:1.0670 rl:2.9876 rb:1.0766 dl:4199-4431 gd:0
+ttp: b766/782 bl:2.8607 bb:1.0171 rl:2.9609 rb:1.0639 dl:3459-3554 gd:0
+ttpp: phase:1/3 pd:1104 gd:666 t:172.1s
+tttg: c1/85 lr:0.001000 t:0.3s
+tttg: c2/85 lr:0.001000 t:0.4s
+tttg: c3/85 lr:0.000999 t:0.5s
[... 79 tttg chunk lines (c4-c82) elided; lr continues the cosine decay toward 0 ...]
+tttg: c83/85 lr:0.000001 t:6.9s
+tttg: c84/85 lr:0.000000 t:7.0s
+ttpr: phase:1/3 t:180.9s
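The per-chunk learning rates in the `tttg:` lines match a plain cosine decay from `global_ttt_lr: 0.001` to zero over each phase's chunks, with no warmup (`global_ttt_warmup_chunks: 0`). A hedged reconstruction (not the repo's code) that reproduces the logged values:

```python
import math

def global_ttt_chunk_lr(chunk: int, num_chunks: int, base_lr: float = 1e-3) -> float:
    """Cosine decay over a phase's adaptation chunks; `chunk` is 1-indexed."""
    progress = (chunk - 1) / max(num_chunks - 1, 1)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Phase 1 above: chunk 43 of 85 gives exactly 0.000500, and chunk 84 of 85
# gives ~3.5e-7, which the log rounds to 0.000000.
```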
+ttp: b758/782 bl:3.0260 bb:1.0578 rl:2.9703 rb:1.0630 dl:2798-2857 gd:0
+ttp: b756/782 bl:2.9572 bb:1.0477 rl:2.9687 rb:1.0612 dl:2686-2734 gd:0
+ttpp: phase:2/3 pd:1808 gd:1333 t:238.0s
+tttg: c1/142 lr:0.001000 t:0.1s
+tttg: c2/142 lr:0.001000 t:0.2s
+tttg: c3/142 lr:0.001000 t:0.2s
[... 136 tttg chunk lines (c4-c139) elided; same cosine decay ...]
+tttg: c140/142 lr:0.000000 t:10.8s
+tttg: c141/142 lr:0.000000 t:10.9s
+ttpr: phase:2/3 t:250.8s
+ttp: b751/782 bl:2.9794 bb:1.0199 rl:2.9698 rb:1.0569 dl:2433-2475 gd:0
+ttpp: phase:3/3 pd:2448 gd:2000 t:261.8s
+tttg: c1/192 lr:0.001000 t:0.1s
+tttg: c2/192 lr:0.001000 t:0.2s
+tttg: c3/192 lr:0.001000 t:0.2s
[... 186 tttg chunk lines (c4-c189) elided; same cosine decay ...]
+tttg: c190/192 lr:0.000000 t:14.7s
+tttg: c191/192 lr:0.000000 t:14.7s
+ttpr: phase:3/3 t:278.4s
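Read together, the phase markers trace a score-first protocol: each phase's documents are fully scored before any global SGD update is applied over that phase's chunks. A minimal control-flow sketch under that reading; every helper passed in here is a placeholder, not the repo's API:

```python
import torch

def run_phased_ttt(model, phases, score_fn, make_chunks, sgd_step):
    """Score-first phased TTT: score phase i in full under no_grad, then run
    one global SGD pass over its chunks before moving to phase i+1."""
    for docs in phases:
        with torch.no_grad():
            score_fn(model, docs)                   # scoring pass, no updates
        chunks = make_chunks(docs)
        for c, chunk in enumerate(chunks, start=1):
            loss = model(chunk)                     # adaptation pass
            loss.backward()
            sgd_step(model, c, len(chunks))         # e.g. cosine chunk lr
            model.zero_grad(set_to_none=True)
```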
+ttp: b739/782 bl:3.0137 bb:1.0185 rl:2.9731 rb:1.0539 dl:2000-2025 gd:1
+ttp: b733/782 bl:2.9941 bb:1.0270 rl:2.9745 rb:1.0521 dl:1855-1877 gd:1
+ttp: b723/782 bl:3.1453 bb:1.0729 rl:2.9840 rb:1.0533 dl:1676-1694 gd:1
+ttp: b714/782 bl:3.0217 bb:1.0264 rl:2.9858 rb:1.0519 dl:1547-1559 gd:1
[... ttp eval lines for buckets b709 down to b28 elided ...]
+ttp: b21/782 bl:3.4213 bb:1.1919 rl:3.0365 rb:1.0533 dl:89-90 gd:1
+ttp: b15/782 bl:3.4893 bb:1.2315 rl:3.0370 rb:1.0535 dl:84-84 gd:1
+ttp: b6/782 bl:3.4899 bb:1.2031 rl:3.0374 rb:1.0537 dl:72-74 gd:1
+quantized_ttt_phased val_loss:3.04603859 val_bpb:1.05692941 eval_time:350855ms
+total_eval_time:350.9s
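As a quick cross-check, the per-seed TTT gain reported in the results table is just the post-TTT bpb minus the pre-TTT quantized diagnostic from earlier in this log:

```python
pre_ttt_bpb = 1.06633401   # "diagnostic quantized" line above
post_ttt_bpb = 1.05692941  # "quantized_ttt_phased" line above
print(f"TTT gain: {post_ttt_bpb - pre_ttt_bpb:+.5f} bpb")  # TTT gain: -0.00940 bpb
```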