diff --git a/records/track_non_record_16mb/2026-03-31_LegendreGPT/README.md b/records/track_non_record_16mb/2026-03-31_LegendreGPT/README.md
new file mode 100644
index 0000000000..4b71e1f3c8
--- /dev/null
+++ b/records/track_non_record_16mb/2026-03-31_LegendreGPT/README.md
@@ -0,0 +1,151 @@
+# LegendreGPT
+
+LegendreGPT generates all transformer layer weights from a small set of Legendre polynomial coefficients, compressing 22 middle layers into 6 coefficient matrices per weight type. As far as I know, this is the first time anyone has applied orthogonal polynomial weight parameterization to transformer language models. Best of all, it learns.
+
+## Result
+
+| Metric | Value |
+|--------|-------|
+| Pre-quantization val_bpb | 1.2079 |
+| Post-quantization val_bpb (INT7+zlib) | 1.2353 |
+| Post-quantization val_bpb (mixed INT8/INT7+LZMA) | 1.2266 |
+| Compressed model size | 15.70 MB |
+| Architecture | dim=512, 24L (2 groups), g5/2, GQA 8/4 |
+| Training | 60k steps, 80 shards, 1x RTX 5090 (~27h) |
+
+Note: The INT7+zlib number (1.2353) is from the training script's built-in roundtrip validation (see train.log). The mixed INT8/INT7+LZMA number (1.2266) comes from a separate post-hoc quantization where Legendre orders 0-1 use INT8 and the rest use INT7, compressed with LZMA instead of zlib.
+
+## How It Works
+
+Each weight matrix in the transformer is a function of depth:
+
+```
+W(layer_l) = sum_{k=0}^{K-1} C_k * P_k(t_l)
+```
+
+`P_k` are Legendre polynomials. `t_l` maps the layer index to [-1, 1]. `C_k` are learned matrices. With K=6 (degree 5), I generate 11 unique layers from 6 coefficient matrices per weight type — and I have two independent groups of 11 (see the code sketch at the end of the Architecture section).
+
+Think of it like an equalizer: the polynomials are fixed frequencies (constant, linear, quadratic...), and the coefficients are the sliders. Training only adjusts the sliders, never the frequencies.
+
+**Why Legendre and not monomials?** Orthogonality. Monomials (1, t, t^2...) become catastrophically ill-conditioned at higher degrees. Legendre polynomials stay well-behaved. NANODE (Massaroli et al., 2020) showed this matters for Neural ODEs. I confirmed it matters for transformers too.
+
+## Architecture
+
+```
+[Factorized Embedding] <- ALBERT-style, 1024 -> 128 -> 512
+[Independent Block 0] <- own weights
+[Legendre Group A: 11 layers] <- coefficients A (degree 5 attn, 2 FFN)
+[Legendre Group B: 11 layers] <- coefficients B (independent)
+[Independent Block 23] <- own weights
+[RMSNorm -> Tied Logit Head]
+```
+
+The 2-group split is important. With 1 group, each coefficient affects all 22 layers. If layer 5 wants the coefficient to go up but layer 15 wants it to go down, the gradients cancel and nothing moves. With 2 groups, each coefficient only fights with ~11 layers instead of 22.
+
+Each layer also has cheap independent scalars: attention scale, MLP scale, residual mixing ratio, query gain, and RMSNorm params. These cost < 0.05 MB total and let each layer fine-tune its behavior without defeating the compression.
+
+Other details: GQA (8 heads, 4 KV heads), ReLU^2 MLP at 3x dim, RoPE, logit soft-capping at 30.
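+
+To make the Legendre parameterization concrete, here is a minimal PyTorch sketch of the weight-generation step from How It Works. It is an illustration under assumptions, not the actual `train_gpt_legendre.py` code: the function names, the `(K, out, in)` coefficient layout, and the uniform layer positions are mine.
+
+```python
+import torch
+
+def legendre_basis(t: torch.Tensor, K: int) -> torch.Tensor:
+    # P_0..P_{K-1} evaluated at positions t in [-1, 1], built with the
+    # Bonnet recurrence: (n+1) P_{n+1}(t) = (2n+1) t P_n(t) - n P_{n-1}(t).
+    P = [torch.ones_like(t), t]
+    for n in range(1, K - 1):
+        P.append(((2 * n + 1) * t * P[n] - n * P[n - 1]) / (n + 1))
+    return torch.stack(P[:K])                     # (K, L), fixed basis
+
+def layer_weights(C: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
+    # W(t_l) = sum_k C_k * P_k(t_l): one weight matrix per virtual layer.
+    # C: (K, out, in) learned coefficients, basis: (K, L).
+    return torch.einsum("koi,kl->loi", C, basis)  # (L, out, in)
+
+K, L, dim = 6, 11, 512                     # degree 5, one group of 11 layers
+t = torch.linspace(-1.0, 1.0, L)           # uniform layer positions
+C = torch.nn.Parameter(0.02 * torch.randn(K, dim, dim))
+W = layer_weights(C, legendre_basis(t, K))
+print(W.shape)                             # torch.Size([11, 512, 512])
+```
+
+Only `C` receives gradients; the basis is a constant. That is the equalizer picture from above: training moves the sliders, never the frequencies.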
+
+## Parameter Budget
+
+| Component | Params | INT8 (MB) |
+|-----------|--------|-----------|
+| Factorized embedding | 196,608 | 0.20 |
+| Sandwich block (first) | 2,361,352 | 2.36 |
+| Legendre Group A coefficients | 9,437,250 | 9.44 |
+| Legendre Group B coefficients | 9,437,250 | 9.44 |
+| Per-layer lightweight params | 45,232 | 0.05 |
+| Sandwich block (last) | 2,361,352 | 2.36 |
+| **Total** | **23,839,044** | **15.70 MB compressed** |
+
+Compression: mixed-precision quantization (INT8 for Legendre orders 0-1, INT7 for orders 2-5 and the sandwich blocks) + LZMA.
+
+## Training
+
+- **Muon optimizer** for all 2D weight matrices, Adam for embeddings and scalars
+- **Per-order learning rates:** 1.1x higher per polynomial order. Order 0 starts at 0.025, order 5 at 0.040. Higher orders capture finer detail and need more push.
+- **Linear LR decay** from 0.2 to 0.0 over 60k steps
+- **Momentum cooldown:** Muon momentum decays from 0.95 to 0.05 over steps 10k-60k. Discovered accidentally when a checkpoint resume zeroed the momentum buffers and the model learned 8x faster. High momentum dampens updates too much near convergence.
+- **Batch size:** 393,216 tokens. The Legendre coefficients each serve 11+ layers and benefit from clean gradient estimates.
+- **Data:** 80 FineWeb sp1024 shards (~8B tokens)
+- **Hardware:** 1x RTX 5090, ~27 hours total
+
+## Training Curve
+
+| Step | val_bpb | lr_mul |
+|------|---------|--------|
+| 1,000 | 1.48 | 0.197 |
+| 5,000 | 1.35 | 0.183 |
+| 10,000 | 1.28 | 0.167 |
+| 20,000 | 1.24 | 0.133 |
+| 30,000 | 1.22 | 0.100 |
+| 40,000 | 1.21 | 0.067 |
+| 50,000 | 1.21 | 0.033 |
+| 60,000 | 1.2054 | 0.000 |
+
+## What I Learned
+
+**Dimension matters most.** For a fixed byte budget, bumping dim from 512 to 640 gained about 0.02 BPB; raising the polynomial degree from g5/2 to g8/4 gained only ~0.01. Wider layers beat more variation between them.
+
+**2 groups beat 1 group.** Each Legendre coefficient affects all layers in its group. With 1 group of 22 layers, gradients from different layers partially cancel. Splitting into 2 groups of 11 halves the cancellation and improves convergence at the same parameter cost.
+
+**Wrap doesn't help.** Tested circular weight topology (W = W - round(W)) with standard init, 8x init, and wrap-aware gradient modifications. Smooth weights win consistently — the continuity prior between adjacent layers is correct.
+
+**LoRA is less efficient than higher degree.** Per-layer low-rank corrections (W = W_legendre + A*B) at rank 8 underperform simply raising the Legendre degree. g6/3 without LoRA beats g5/2 + LoRA r8 — the bytes are better spent on polynomial expressivity.
+
+**Momentum cooldown helps late training.** High momentum (0.95) dampens updates too much near convergence. Decaying to 0.05 in the second half of training allows finer adjustments when the model is close to a good minimum.
+
+**Larger batches help disproportionately.** Going from 262k to 393k tokens/batch visibly improved convergence.
+
+**Mixed-precision quantization is key.** Legendre orders 0-1 (the constant and linear components) carry most of the weight information and need INT8 precision. Higher orders (finer detail) tolerate INT7. This gives near-INT8 quality at near-INT7 size.
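+
+A minimal sketch of that export path, assuming symmetric per-tensor scales and numpy inputs (the real exporter's names and serialization format may differ):
+
+```python
+import lzma
+import numpy as np
+
+def quantize_symmetric(w: np.ndarray, bits: int):
+    # Symmetric per-tensor quantization to a signed `bits`-bit grid.
+    qmax = 2 ** (bits - 1) - 1                  # 127 for INT8, 63 for INT7
+    scale = float(np.abs(w).max()) / qmax
+    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
+    return q, scale
+
+def export(coeffs_by_order: dict) -> bytes:
+    # Orders 0-1 carry most of the signal and keep INT8;
+    # orders 2+ drop to INT7.
+    blobs = []
+    for order in sorted(coeffs_by_order):
+        bits = 8 if order <= 1 else 7
+        q, scale = quantize_symmetric(coeffs_by_order[order], bits)
+        blobs.append(np.float32(scale).tobytes() + q.tobytes())
+    return lzma.compress(b"".join(blobs), preset=9)
+
+rng = np.random.default_rng(0)
+coeffs = {k: (0.5 ** k) * rng.standard_normal((512, 512)).astype(np.float32)
+          for k in range(6)}
+print(len(export(coeffs)), "bytes after LZMA")
+```
+
+Note the INT7 values stay in int8 containers here; the compressor is left to reclaim the unused bit. Whether that matches the submission's exact packing is an assumption.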
+
+## Experiments
+
+| Config | Steps | val_bpb | Takeaway |
+|--------|-------|---------|----------|
+| dim=256, 8L, g5/2, 1 shard | 3,000 | 1.67 | Architecture works |
+| dim=512, 24L, g5/2, 1 shard | 2,000 | 1.39 | Full model, 8.6 MB |
+| dim=640, 24L, g6/3, 80 shards | 30,000 | 1.214 | Best BPB but 18.4 MB |
+| dim=576, 24L, g6/3, 80 shards | 30,000 | 1.22 | Fits budget, tight |
+| dim=512, 2-group g5/2, wrap | 3,000 | 1.70 | Wrap hurts |
+| dim=640, g5/2 + LoRA r8 | 5,000 | 1.30 | LoRA < higher degree |
+| **dim=512, 2-group g5/2, 80 shards** | **60,000** | **1.2054** | **Final submission** |
+
+## Related Work
+
+**NANODE** (Massaroli et al., NeurIPS 2020) used Legendre polynomials to parameterize Neural ODE weights for PDE surrogate modeling. LegendreGPT extends this idea to transformer language models.
+
+**ALBERT** (Lan et al., 2020) shares identical weights across all layers. LegendreGPT generalizes this: degree 0 (a single constant coefficient) is exactly ALBERT. Higher degrees let layers diverge smoothly.
+
+**Subformer** (Reid et al., 2021) showed that sandwich-style sharing (independent first/last layers, shared middle) works better than uniform sharing. I use the same structure.
+
+## What I'd Try Next
+
+- **2D compression:** Legendre polynomials for the depth axis, DCT for the width axis. Could push dim to 1024+ in 16 MB.
+- **Learned basis:** PCA from a pretrained model's weights instead of fixed Legendre. The optimal basis probably isn't polynomial.
+- **Low-rank high orders:** Full rank for orders 0-2, low rank for orders 3+. More expressivity per byte.
+- **Learnable layer positions:** Let the model learn the optimal spacing in [-1, 1] instead of a uniform grid.
+- **Proper 8xH100 run:** All my runs were on a single RTX 5090. The competition target is 8xH100 in 10 minutes, which means a larger batch, fewer steps, and a different schedule.
+
+## Reproducibility
+
+```bash
+git clone https://github.com/openai/parameter-golf.git
+cd parameter-golf
+pip install sentencepiece huggingface-hub datasets
+python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80
+
+# Copy train_gpt_legendre.py to parameter-golf/
+
+MODE=full RUN_ID=legendregpt \
+  LEGENDRE_GROUPS=2 \
+  NUM_VIRTUAL_LAYERS=24 MODEL_DIM=512 \
+  LEGENDRE_DEGREE_ATTN=5 LEGENDRE_DEGREE_FFN=2 \
+  ITERATIONS=60000 TRAIN_BATCH_TOKENS=393216 \
+  MAX_WALLCLOCK_SECONDS=0 LR_SCHEDULE=linear \
+  python3 train_gpt_legendre.py
+```
+
+## Author
+
+**Sergio Cernuda Cueto**
diff --git a/records/track_non_record_16mb/2026-03-31_LegendreGPT/submission.json b/records/track_non_record_16mb/2026-03-31_LegendreGPT/submission.json
new file mode 100644
index 0000000000..90a69eeb67
--- /dev/null
+++ b/records/track_non_record_16mb/2026-03-31_LegendreGPT/submission.json
@@ -0,0 +1,18 @@
+{
+  "author": "Sergio Cernuda Cueto",
+  "github_id": "sergimichi",
+  "name": "LegendreGPT: 2-Group Legendre Polynomial Depth Parameterization",
+  "blurb": "Transformer weights parameterized as smooth functions of depth via Legendre polynomials. 2-group architecture with independent coefficients per group halves gradient cancellation. 24 virtual layers in 15.7MB. 
Pre-quant 1.2079 BPB, post-quant 1.2266 BPB.",
+  "date": "2026-04-04T00:00:00Z",
+  "track": "non_record_16mb",
+  "val_loss": 2.0413,
+  "val_bpb": 1.2266,
+  "pre_quant_val_loss": 2.0322,
+  "pre_quant_val_bpb": 1.2079,
+  "step_stop": 60000,
+  "wallclock_seconds": 96562,
+  "bytes_total": 15702333,
+  "bytes_model_compressed": 15630808,
+  "bytes_code": 71525,
+  "gpu": "1xRTX5090"
+}
diff --git a/records/track_non_record_16mb/2026-03-31_LegendreGPT/train.log b/records/track_non_record_16mb/2026-03-31_LegendreGPT/train.log
new file mode 100644
index 0000000000..5190afb788
--- /dev/null
+++ b/records/track_non_record_16mb/2026-03-31_LegendreGPT/train.log
@@ -0,0 +1,1872 @@
+============================================================
+LEGENDRE GPT — Mode: full
+============================================================
+val tokens: 62021632
+
+--- Parameter Budget ---
+ embedding: 196,608 params (0.79 MB fp32, 0.20 MB int8)
+ first_block: 2,361,352 params (9.45 MB fp32, 2.36 MB int8)
+ legendre_coeffs: 18,874,500 params (75.50 MB fp32, 18.87 MB int8)
+ lora: 0 params (0.00 MB fp32, 0.00 MB int8)
+ legendre_norms_scales: 45,232 params (0.18 MB fp32, 0.05 MB int8)
+ last_block: 2,361,352 params (9.45 MB fp32, 2.36 MB int8)
+ final_norm: 0 params (0.00 MB fp32, 0.00 MB int8)
+ total: 23,839,044 params (95.36 MB fp32, 23.84 MB int8)
+ Virtual layers: 24 (2 independent + 22 Legendre)
+ Legendre degree: attn=5, ffn=2
+ Wrap mode: False
+ Legendre groups: 2, ~11 layers each
+
+--- LR Schedule (linear decay) ---
+ Step 0: lr_mul=0.2
+ Step 60000: lr_mul=0.0
+ Coeff LR base=0.025, scale=1.1x per order
+ Effective start LR for C0: 0.0050
+
+Optimizer groups: embed=2, coeff_orders=6 (total 60 params), indep=12, scalar=88
+ Legendre order 0: 12 params, lr=0.0250
+ Legendre order 1: 12 params, lr=0.0275
+ Legendre order 2: 12 params, lr=0.0303
+ Legendre order 3: 8 params, lr=0.0333
+ Legendre order 4: 8 params, lr=0.0366
+ Legendre order 5: 8 params, lr=0.0403
+Resuming from checkpoint: ./checkpoints/ckpt_step30000.pt
+ Resumed at step 30000, training_time=47870254ms, skipping 11796M tokens
+ LR schedule: linear
+warmup done (20 steps)
+ eval: 100 batches (limited)
+ eval: batch 20/100 running_loss:2.0368
+ eval: batch 40/100 running_loss:2.0324
+ eval: batch 60/100 running_loss:2.0252
+ eval: batch 80/100 running_loss:2.0268
+ eval: batch 100/100 running_loss:2.0349
+step:30000/60000 val_loss:2.0349 val_bpb:1.2215 train_time:47870254ms
+step:30020/60000 train_loss:2.0055 lr_mul:0.0999 train_time:47902516ms step_avg:1595.69ms
+step:30040/60000 train_loss:2.0595 lr_mul:0.0999 train_time:47934907ms step_avg:1595.70ms
+step:30060/60000 train_loss:2.0055 lr_mul:0.0998 train_time:47967641ms step_avg:1595.73ms
+step:30080/60000 train_loss:2.0616 lr_mul:0.0997 train_time:48000009ms step_avg:1595.74ms
+step:30100/60000 train_loss:2.0802 lr_mul:0.0997 train_time:48032392ms step_avg:1595.76ms
+step:30120/60000 train_loss:2.0567 lr_mul:0.0996 train_time:48064764ms step_avg:1595.78ms
+step:30140/60000 train_loss:2.1211 lr_mul:0.0995 train_time:48097165ms step_avg:1595.79ms
+step:30160/60000 train_loss:2.0506 lr_mul:0.0995 train_time:48129558ms step_avg:1595.81ms
+step:30180/60000 train_loss:2.0410 lr_mul:0.0994 train_time:48161982ms step_avg:1595.82ms
+step:30200/60000 train_loss:2.0242 lr_mul:0.0993 train_time:48194373ms step_avg:1595.84ms
+step:30220/60000 train_loss:2.0338 lr_mul:0.0993 train_time:48226777ms step_avg:1595.86ms
+step:30240/60000 train_loss:2.0466 lr_mul:0.0992 train_time:48259168ms step_avg:1595.87ms
+step:30260/60000 train_loss:2.0605 lr_mul:0.0991 train_time:48291581ms step_avg:1595.89ms +step:30280/60000 train_loss:2.0773 lr_mul:0.0991 train_time:48323942ms step_avg:1595.90ms +step:30300/60000 train_loss:2.0401 lr_mul:0.0990 train_time:48356293ms step_avg:1595.92ms +step:30320/60000 train_loss:2.0373 lr_mul:0.0989 train_time:48388670ms step_avg:1595.93ms +step:30340/60000 train_loss:2.1124 lr_mul:0.0989 train_time:48421265ms step_avg:1595.95ms +step:30360/60000 train_loss:2.0473 lr_mul:0.0988 train_time:48453637ms step_avg:1595.97ms +step:30380/60000 train_loss:2.1506 lr_mul:0.0987 train_time:48486015ms step_avg:1595.98ms +step:30400/60000 train_loss:2.0346 lr_mul:0.0987 train_time:48518419ms step_avg:1596.00ms +step:30420/60000 train_loss:2.0190 lr_mul:0.0986 train_time:48550829ms step_avg:1596.02ms +step:30440/60000 train_loss:1.9973 lr_mul:0.0985 train_time:48583230ms step_avg:1596.03ms +step:30460/60000 train_loss:2.0716 lr_mul:0.0985 train_time:48615611ms step_avg:1596.05ms +step:30480/60000 train_loss:2.1218 lr_mul:0.0984 train_time:48647981ms step_avg:1596.06ms +step:30500/60000 train_loss:2.0374 lr_mul:0.0983 train_time:48680378ms step_avg:1596.08ms + >> Checkpoint saved: ./checkpoints/ckpt_step30500.pt +step:30520/60000 train_loss:2.0819 lr_mul:0.0983 train_time:48713002ms step_avg:1596.10ms +step:30540/60000 train_loss:2.0573 lr_mul:0.0982 train_time:48745379ms step_avg:1596.12ms +step:30560/60000 train_loss:2.1364 lr_mul:0.0981 train_time:48777722ms step_avg:1596.13ms +step:30580/60000 train_loss:2.0154 lr_mul:0.0981 train_time:48810061ms step_avg:1596.14ms +step:30600/60000 train_loss:2.0491 lr_mul:0.0980 train_time:48842410ms step_avg:1596.16ms +step:30620/60000 train_loss:2.1571 lr_mul:0.0979 train_time:48874799ms step_avg:1596.17ms +step:30640/60000 train_loss:2.0264 lr_mul:0.0979 train_time:48907347ms step_avg:1596.19ms +step:30660/60000 train_loss:2.1129 lr_mul:0.0978 train_time:48939744ms step_avg:1596.21ms +step:30680/60000 train_loss:2.0256 lr_mul:0.0977 train_time:48972143ms step_avg:1596.22ms +step:30700/60000 train_loss:2.1070 lr_mul:0.0977 train_time:49004535ms step_avg:1596.24ms +step:30720/60000 train_loss:2.1317 lr_mul:0.0976 train_time:49036949ms step_avg:1596.25ms +step:30740/60000 train_loss:2.0875 lr_mul:0.0975 train_time:49069348ms step_avg:1596.27ms +step:30760/60000 train_loss:2.0885 lr_mul:0.0975 train_time:49101746ms step_avg:1596.29ms +step:30780/60000 train_loss:2.0457 lr_mul:0.0974 train_time:49134166ms step_avg:1596.30ms +step:30800/60000 train_loss:2.0819 lr_mul:0.0973 train_time:49166565ms step_avg:1596.32ms +step:30820/60000 train_loss:2.0727 lr_mul:0.0973 train_time:49198966ms step_avg:1596.33ms +step:30840/60000 train_loss:2.0713 lr_mul:0.0972 train_time:49231369ms step_avg:1596.35ms +step:30860/60000 train_loss:2.0334 lr_mul:0.0971 train_time:49263817ms step_avg:1596.36ms +step:30880/60000 train_loss:2.0445 lr_mul:0.0971 train_time:49296250ms step_avg:1596.38ms +step:30900/60000 train_loss:2.0749 lr_mul:0.0970 train_time:49328651ms step_avg:1596.40ms +step:30920/60000 train_loss:2.0211 lr_mul:0.0969 train_time:49361269ms step_avg:1596.42ms +step:30940/60000 train_loss:2.0593 lr_mul:0.0969 train_time:49393618ms step_avg:1596.43ms +step:30960/60000 train_loss:2.0403 lr_mul:0.0968 train_time:49425976ms step_avg:1596.45ms +step:30980/60000 train_loss:2.0715 lr_mul:0.0967 train_time:49458367ms step_avg:1596.46ms +step:31000/60000 train_loss:2.0510 lr_mul:0.0967 train_time:49490771ms step_avg:1596.48ms + >> Checkpoint saved: 
./checkpoints/ckpt_step31000.pt + eval: 100 batches (limited) + eval: batch 20/100 running_loss:2.0490 + eval: batch 40/100 running_loss:2.0447 + eval: batch 60/100 running_loss:2.0374 + eval: batch 80/100 running_loss:2.0391 + eval: batch 100/100 running_loss:2.0472 +step:31000/60000 val_loss:2.0472 val_bpb:1.2289 train_time:49491249ms +step:31020/60000 train_loss:2.0343 lr_mul:0.0966 train_time:49523647ms step_avg:1596.51ms +step:31040/60000 train_loss:2.0493 lr_mul:0.0965 train_time:49556078ms step_avg:1596.52ms +step:31060/60000 train_loss:2.0841 lr_mul:0.0965 train_time:49588522ms step_avg:1596.54ms +step:31080/60000 train_loss:2.0673 lr_mul:0.0964 train_time:49620932ms step_avg:1596.56ms +step:31100/60000 train_loss:2.0752 lr_mul:0.0963 train_time:49653308ms step_avg:1596.57ms +step:31120/60000 train_loss:2.0978 lr_mul:0.0963 train_time:49685713ms step_avg:1596.58ms +step:31140/60000 train_loss:2.0221 lr_mul:0.0962 train_time:49718105ms step_avg:1596.60ms +step:31160/60000 train_loss:2.1205 lr_mul:0.0961 train_time:49750491ms step_avg:1596.61ms +step:31180/60000 train_loss:2.0737 lr_mul:0.0961 train_time:49782900ms step_avg:1596.63ms +step:31200/60000 train_loss:2.0799 lr_mul:0.0960 train_time:49815329ms step_avg:1596.65ms +step:31220/60000 train_loss:2.0721 lr_mul:0.0959 train_time:49848085ms step_avg:1596.67ms +step:31240/60000 train_loss:2.0430 lr_mul:0.0959 train_time:49880507ms step_avg:1596.69ms +step:31260/60000 train_loss:2.0706 lr_mul:0.0958 train_time:49912925ms step_avg:1596.70ms +step:31280/60000 train_loss:2.1040 lr_mul:0.0957 train_time:49945324ms step_avg:1596.72ms +step:31300/60000 train_loss:2.0271 lr_mul:0.0957 train_time:49977720ms step_avg:1596.73ms +step:31320/60000 train_loss:2.0482 lr_mul:0.0956 train_time:50010125ms step_avg:1596.75ms +step:31340/60000 train_loss:2.0583 lr_mul:0.0955 train_time:50042536ms step_avg:1596.76ms +step:31360/60000 train_loss:2.0661 lr_mul:0.0955 train_time:50074939ms step_avg:1596.78ms +step:31380/60000 train_loss:2.0096 lr_mul:0.0954 train_time:50107349ms step_avg:1596.79ms +step:31400/60000 train_loss:2.0220 lr_mul:0.0953 train_time:50139764ms step_avg:1596.81ms +step:31420/60000 train_loss:2.0581 lr_mul:0.0953 train_time:50172177ms step_avg:1596.82ms +step:31440/60000 train_loss:2.0906 lr_mul:0.0952 train_time:50204593ms step_avg:1596.84ms +step:31460/60000 train_loss:2.0961 lr_mul:0.0951 train_time:50237019ms step_avg:1596.85ms +step:31480/60000 train_loss:2.0688 lr_mul:0.0951 train_time:50269461ms step_avg:1596.87ms +step:31500/60000 train_loss:2.0653 lr_mul:0.0950 train_time:50302160ms step_avg:1596.89ms + >> Checkpoint saved: ./checkpoints/ckpt_step31500.pt +step:31520/60000 train_loss:2.0720 lr_mul:0.0949 train_time:50334832ms step_avg:1596.92ms +step:31540/60000 train_loss:2.0358 lr_mul:0.0949 train_time:50367280ms step_avg:1596.93ms +step:31560/60000 train_loss:2.0767 lr_mul:0.0948 train_time:50399738ms step_avg:1596.95ms +step:31580/60000 train_loss:2.0325 lr_mul:0.0947 train_time:50432190ms step_avg:1596.97ms +step:31600/60000 train_loss:2.1074 lr_mul:0.0947 train_time:50464620ms step_avg:1596.98ms +step:31620/60000 train_loss:2.0572 lr_mul:0.0946 train_time:50497059ms step_avg:1597.00ms +step:31640/60000 train_loss:2.0341 lr_mul:0.0945 train_time:50529483ms step_avg:1597.01ms +step:31660/60000 train_loss:2.1005 lr_mul:0.0945 train_time:50561905ms step_avg:1597.03ms +step:31680/60000 train_loss:2.0625 lr_mul:0.0944 train_time:50594296ms step_avg:1597.04ms +step:31700/60000 train_loss:2.1057 lr_mul:0.0943 
train_time:50626716ms step_avg:1597.06ms +step:31720/60000 train_loss:2.0012 lr_mul:0.0943 train_time:50659123ms step_avg:1597.07ms +step:31740/60000 train_loss:2.0817 lr_mul:0.0942 train_time:50691541ms step_avg:1597.09ms +step:31760/60000 train_loss:2.0570 lr_mul:0.0941 train_time:50723981ms step_avg:1597.10ms +step:31780/60000 train_loss:2.1041 lr_mul:0.0941 train_time:50756580ms step_avg:1597.12ms +step:31800/60000 train_loss:2.0844 lr_mul:0.0940 train_time:50789059ms step_avg:1597.14ms +step:31820/60000 train_loss:2.0306 lr_mul:0.0939 train_time:50821485ms step_avg:1597.16ms +step:31840/60000 train_loss:2.0665 lr_mul:0.0939 train_time:50853892ms step_avg:1597.17ms +step:31860/60000 train_loss:2.0268 lr_mul:0.0938 train_time:50886312ms step_avg:1597.18ms +step:31880/60000 train_loss:2.1273 lr_mul:0.0937 train_time:50918737ms step_avg:1597.20ms +step:31900/60000 train_loss:2.0123 lr_mul:0.0937 train_time:50951157ms step_avg:1597.21ms +step:31920/60000 train_loss:2.0725 lr_mul:0.0936 train_time:50983561ms step_avg:1597.23ms +step:31940/60000 train_loss:2.1014 lr_mul:0.0935 train_time:51015958ms step_avg:1597.24ms +step:31960/60000 train_loss:1.9837 lr_mul:0.0935 train_time:51048356ms step_avg:1597.26ms +step:31980/60000 train_loss:2.1149 lr_mul:0.0934 train_time:51080758ms step_avg:1597.27ms +step:32000/60000 train_loss:2.0268 lr_mul:0.0933 train_time:51113170ms step_avg:1597.29ms + >> Checkpoint saved: ./checkpoints/ckpt_step32000.pt + eval: 100 batches (limited) + eval: batch 20/100 running_loss:2.0460 + eval: batch 40/100 running_loss:2.0414 + eval: batch 60/100 running_loss:2.0342 + eval: batch 80/100 running_loss:2.0359 + eval: batch 100/100 running_loss:2.0441 +step:32000/60000 val_loss:2.0441 val_bpb:1.2271 train_time:51113459ms +step:32020/60000 train_loss:2.0356 lr_mul:0.0933 train_time:51145887ms step_avg:1597.31ms +step:32040/60000 train_loss:2.0485 lr_mul:0.0932 train_time:51178307ms step_avg:1597.33ms +step:32060/60000 train_loss:2.0110 lr_mul:0.0931 train_time:51210741ms step_avg:1597.34ms +step:32080/60000 train_loss:2.1112 lr_mul:0.0931 train_time:51243436ms step_avg:1597.36ms +step:32100/60000 train_loss:2.0343 lr_mul:0.0930 train_time:51275871ms step_avg:1597.38ms +step:32120/60000 train_loss:2.0956 lr_mul:0.0929 train_time:51308304ms step_avg:1597.39ms +step:32140/60000 train_loss:2.0714 lr_mul:0.0929 train_time:51340733ms step_avg:1597.41ms +step:32160/60000 train_loss:2.0463 lr_mul:0.0928 train_time:51373144ms step_avg:1597.42ms +step:32180/60000 train_loss:2.0368 lr_mul:0.0927 train_time:51405530ms step_avg:1597.44ms +step:32200/60000 train_loss:2.0627 lr_mul:0.0927 train_time:51437944ms step_avg:1597.45ms +step:32220/60000 train_loss:1.9947 lr_mul:0.0926 train_time:51470337ms step_avg:1597.47ms +step:32240/60000 train_loss:2.0336 lr_mul:0.0925 train_time:51502730ms step_avg:1597.48ms +step:32260/60000 train_loss:2.2059 lr_mul:0.0925 train_time:51535133ms step_avg:1597.49ms +step:32280/60000 train_loss:2.0317 lr_mul:0.0924 train_time:51567527ms step_avg:1597.51ms +step:32300/60000 train_loss:2.0262 lr_mul:0.0923 train_time:51599924ms step_avg:1597.52ms +step:32320/60000 train_loss:2.0073 lr_mul:0.0923 train_time:51632331ms step_avg:1597.54ms +step:32340/60000 train_loss:2.0509 lr_mul:0.0922 train_time:51664711ms step_avg:1597.55ms +step:32360/60000 train_loss:2.0595 lr_mul:0.0921 train_time:51697255ms step_avg:1597.57ms +step:32380/60000 train_loss:2.0831 lr_mul:0.0921 train_time:51729640ms step_avg:1597.58ms +step:32400/60000 train_loss:2.0560 lr_mul:0.0920 
train_time:51762009ms step_avg:1597.59ms +step:32420/60000 train_loss:2.0752 lr_mul:0.0919 train_time:51794380ms step_avg:1597.61ms +step:32440/60000 train_loss:1.9677 lr_mul:0.0919 train_time:51826742ms step_avg:1597.62ms +step:32460/60000 train_loss:2.0673 lr_mul:0.0918 train_time:51859114ms step_avg:1597.63ms +step:32480/60000 train_loss:2.0448 lr_mul:0.0917 train_time:51891477ms step_avg:1597.64ms +step:32500/60000 train_loss:2.0994 lr_mul:0.0917 train_time:51923833ms step_avg:1597.66ms + >> Checkpoint saved: ./checkpoints/ckpt_step32500.pt +step:32520/60000 train_loss:2.0503 lr_mul:0.0916 train_time:51956498ms step_avg:1597.68ms +step:32540/60000 train_loss:2.0139 lr_mul:0.0915 train_time:51988903ms step_avg:1597.69ms +step:32560/60000 train_loss:2.0592 lr_mul:0.0915 train_time:52021331ms step_avg:1597.71ms +step:32580/60000 train_loss:2.0490 lr_mul:0.0914 train_time:52053714ms step_avg:1597.72ms +step:32600/60000 train_loss:2.0885 lr_mul:0.0913 train_time:52086084ms step_avg:1597.73ms +step:32620/60000 train_loss:2.0408 lr_mul:0.0913 train_time:52118458ms step_avg:1597.75ms +step:32640/60000 train_loss:2.0947 lr_mul:0.0912 train_time:52151092ms step_avg:1597.77ms +step:32660/60000 train_loss:2.0269 lr_mul:0.0911 train_time:52183456ms step_avg:1597.78ms +step:32680/60000 train_loss:2.0800 lr_mul:0.0911 train_time:52215846ms step_avg:1597.79ms +step:32700/60000 train_loss:2.0064 lr_mul:0.0910 train_time:52248225ms step_avg:1597.81ms +step:32720/60000 train_loss:2.1148 lr_mul:0.0909 train_time:52280580ms step_avg:1597.82ms +step:32740/60000 train_loss:2.0234 lr_mul:0.0909 train_time:52312933ms step_avg:1597.83ms +step:32760/60000 train_loss:2.1140 lr_mul:0.0908 train_time:52345307ms step_avg:1597.84ms +step:32780/60000 train_loss:2.0689 lr_mul:0.0907 train_time:52377651ms step_avg:1597.85ms +step:32800/60000 train_loss:2.0617 lr_mul:0.0907 train_time:52410017ms step_avg:1597.87ms +step:32820/60000 train_loss:2.0764 lr_mul:0.0906 train_time:52442358ms step_avg:1597.88ms +step:32840/60000 train_loss:1.9925 lr_mul:0.0905 train_time:52474734ms step_avg:1597.89ms +step:32860/60000 train_loss:2.0681 lr_mul:0.0905 train_time:52507102ms step_avg:1597.90ms +step:32880/60000 train_loss:2.1417 lr_mul:0.0904 train_time:52539459ms step_avg:1597.92ms +step:32900/60000 train_loss:2.0417 lr_mul:0.0903 train_time:52571844ms step_avg:1597.93ms +step:32920/60000 train_loss:2.0356 lr_mul:0.0903 train_time:52604245ms step_avg:1597.94ms +step:32940/60000 train_loss:2.0415 lr_mul:0.0902 train_time:52636824ms step_avg:1597.96ms +step:32960/60000 train_loss:2.0953 lr_mul:0.0901 train_time:52669206ms step_avg:1597.97ms +step:32980/60000 train_loss:2.0577 lr_mul:0.0901 train_time:52701593ms step_avg:1597.99ms +step:33000/60000 train_loss:2.1713 lr_mul:0.0900 train_time:52733975ms step_avg:1598.00ms + >> Checkpoint saved: ./checkpoints/ckpt_step33000.pt + eval: 100 batches (limited) + eval: batch 20/100 running_loss:2.0464 + eval: batch 40/100 running_loss:2.0423 + eval: batch 60/100 running_loss:2.0352 + eval: batch 80/100 running_loss:2.0368 + eval: batch 100/100 running_loss:2.0450 +step:33000/60000 val_loss:2.0450 val_bpb:1.2276 train_time:52734255ms +step:33020/60000 train_loss:2.0457 lr_mul:0.0899 train_time:52766642ms step_avg:1598.02ms +step:33040/60000 train_loss:2.1715 lr_mul:0.0899 train_time:52799033ms step_avg:1598.03ms +step:33060/60000 train_loss:2.0136 lr_mul:0.0898 train_time:52831420ms step_avg:1598.05ms +step:33080/60000 train_loss:2.0828 lr_mul:0.0897 train_time:52863806ms step_avg:1598.06ms 
+step:33100/60000 train_loss:2.0296 lr_mul:0.0897 train_time:52896201ms step_avg:1598.07ms +step:33120/60000 train_loss:2.0488 lr_mul:0.0896 train_time:52928586ms step_avg:1598.09ms +step:33140/60000 train_loss:2.0150 lr_mul:0.0895 train_time:52960975ms step_avg:1598.10ms +step:33160/60000 train_loss:2.0655 lr_mul:0.0895 train_time:52993343ms step_avg:1598.11ms +step:33180/60000 train_loss:2.0474 lr_mul:0.0894 train_time:53025721ms step_avg:1598.12ms +step:33200/60000 train_loss:2.0369 lr_mul:0.0893 train_time:53058091ms step_avg:1598.14ms +step:33220/60000 train_loss:2.1076 lr_mul:0.0893 train_time:53090749ms step_avg:1598.16ms +step:33240/60000 train_loss:2.0329 lr_mul:0.0892 train_time:53123137ms step_avg:1598.17ms +step:33260/60000 train_loss:2.0713 lr_mul:0.0891 train_time:53155529ms step_avg:1598.18ms +step:33280/60000 train_loss:2.0670 lr_mul:0.0891 train_time:53187944ms step_avg:1598.20ms +step:33300/60000 train_loss:2.0727 lr_mul:0.0890 train_time:53220350ms step_avg:1598.21ms +step:33320/60000 train_loss:2.1149 lr_mul:0.0889 train_time:53252785ms step_avg:1598.22ms +step:33340/60000 train_loss:2.0740 lr_mul:0.0889 train_time:53285184ms step_avg:1598.24ms +step:33360/60000 train_loss:2.0871 lr_mul:0.0888 train_time:53317583ms step_avg:1598.25ms +step:33380/60000 train_loss:1.9946 lr_mul:0.0887 train_time:53350013ms step_avg:1598.26ms +step:33400/60000 train_loss:1.9395 lr_mul:0.0887 train_time:53382443ms step_avg:1598.28ms +step:33420/60000 train_loss:2.0896 lr_mul:0.0886 train_time:53414847ms step_avg:1598.29ms +step:33440/60000 train_loss:2.1394 lr_mul:0.0885 train_time:53447250ms step_avg:1598.30ms +step:33460/60000 train_loss:2.0731 lr_mul:0.0885 train_time:53479642ms step_avg:1598.32ms +step:33480/60000 train_loss:2.0719 lr_mul:0.0884 train_time:53512043ms step_avg:1598.33ms +step:33500/60000 train_loss:2.0336 lr_mul:0.0883 train_time:53544627ms step_avg:1598.35ms + >> Checkpoint saved: ./checkpoints/ckpt_step33500.pt +step:33520/60000 train_loss:2.0577 lr_mul:0.0883 train_time:53577298ms step_avg:1598.37ms +step:33540/60000 train_loss:1.8961 lr_mul:0.0882 train_time:53609730ms step_avg:1598.38ms +step:33560/60000 train_loss:2.0888 lr_mul:0.0881 train_time:53642163ms step_avg:1598.40ms +step:33580/60000 train_loss:2.0745 lr_mul:0.0881 train_time:53674606ms step_avg:1598.41ms +step:33600/60000 train_loss:2.0556 lr_mul:0.0880 train_time:53707034ms step_avg:1598.42ms +step:33620/60000 train_loss:2.1408 lr_mul:0.0879 train_time:53739457ms step_avg:1598.44ms +step:33640/60000 train_loss:2.0478 lr_mul:0.0879 train_time:53771863ms step_avg:1598.45ms +step:33660/60000 train_loss:2.1263 lr_mul:0.0878 train_time:53804299ms step_avg:1598.46ms +step:33680/60000 train_loss:1.9903 lr_mul:0.0877 train_time:53836726ms step_avg:1598.48ms +step:33700/60000 train_loss:2.0901 lr_mul:0.0877 train_time:53869150ms step_avg:1598.49ms +step:33720/60000 train_loss:2.0188 lr_mul:0.0876 train_time:53901560ms step_avg:1598.50ms +step:33740/60000 train_loss:2.0414 lr_mul:0.0875 train_time:53933972ms step_avg:1598.52ms +step:33760/60000 train_loss:2.0411 lr_mul:0.0875 train_time:53966360ms step_avg:1598.53ms +step:33780/60000 train_loss:1.9640 lr_mul:0.0874 train_time:53998750ms step_avg:1598.54ms +step:33800/60000 train_loss:2.1165 lr_mul:0.0873 train_time:54031319ms step_avg:1598.56ms +step:33820/60000 train_loss:2.0835 lr_mul:0.0873 train_time:54063711ms step_avg:1598.57ms +step:33840/60000 train_loss:2.0089 lr_mul:0.0872 train_time:54096107ms step_avg:1598.58ms +step:33860/60000 train_loss:2.0767 
lr_mul:0.0871 train_time:54128477ms step_avg:1598.60ms +step:33880/60000 train_loss:2.1004 lr_mul:0.0871 train_time:54160818ms step_avg:1598.61ms +step:33900/60000 train_loss:2.0387 lr_mul:0.0870 train_time:54193181ms step_avg:1598.62ms +step:33920/60000 train_loss:2.0442 lr_mul:0.0869 train_time:54225576ms step_avg:1598.63ms +step:33940/60000 train_loss:2.1199 lr_mul:0.0869 train_time:54257938ms step_avg:1598.64ms +step:33960/60000 train_loss:2.0476 lr_mul:0.0868 train_time:54290321ms step_avg:1598.65ms +step:33980/60000 train_loss:2.0944 lr_mul:0.0867 train_time:54322692ms step_avg:1598.67ms +step:34000/60000 train_loss:2.1322 lr_mul:0.0867 train_time:54355085ms step_avg:1598.68ms + >> Checkpoint saved: ./checkpoints/ckpt_step34000.pt + eval: 100 batches (limited) + eval: batch 20/100 running_loss:2.0449 + eval: batch 40/100 running_loss:2.0407 + eval: batch 60/100 running_loss:2.0334 + eval: batch 80/100 running_loss:2.0351 + eval: batch 100/100 running_loss:2.0433 +step:34000/60000 val_loss:2.0433 val_bpb:1.2266 train_time:54355353ms +step:34020/60000 train_loss:2.0590 lr_mul:0.0866 train_time:54387747ms step_avg:1598.70ms +step:34040/60000 train_loss:2.0478 lr_mul:0.0865 train_time:54420151ms step_avg:1598.71ms +step:34060/60000 train_loss:2.0181 lr_mul:0.0865 train_time:54452571ms step_avg:1598.72ms +step:34080/60000 train_loss:2.0790 lr_mul:0.0864 train_time:54485146ms step_avg:1598.74ms +step:34100/60000 train_loss:2.1104 lr_mul:0.0863 train_time:54517543ms step_avg:1598.75ms +step:34120/60000 train_loss:2.0708 lr_mul:0.0863 train_time:54549928ms step_avg:1598.77ms +step:34140/60000 train_loss:2.0291 lr_mul:0.0862 train_time:54582328ms step_avg:1598.78ms +step:34160/60000 train_loss:2.1055 lr_mul:0.0861 train_time:54614696ms step_avg:1598.79ms +step:34180/60000 train_loss:2.1221 lr_mul:0.0861 train_time:54647075ms step_avg:1598.80ms +step:34200/60000 train_loss:2.0703 lr_mul:0.0860 train_time:54679445ms step_avg:1598.81ms +step:34220/60000 train_loss:2.1443 lr_mul:0.0859 train_time:54711819ms step_avg:1598.83ms +step:34240/60000 train_loss:2.0883 lr_mul:0.0859 train_time:54744203ms step_avg:1598.84ms +step:34260/60000 train_loss:2.0489 lr_mul:0.0858 train_time:54776560ms step_avg:1598.85ms +step:34280/60000 train_loss:2.0306 lr_mul:0.0857 train_time:54808921ms step_avg:1598.86ms +step:34300/60000 train_loss:1.9867 lr_mul:0.0857 train_time:54841285ms step_avg:1598.87ms +step:34320/60000 train_loss:2.0943 lr_mul:0.0856 train_time:54873646ms step_avg:1598.88ms +step:34340/60000 train_loss:1.9981 lr_mul:0.0855 train_time:54905974ms step_avg:1598.89ms +step:34360/60000 train_loss:2.1184 lr_mul:0.0855 train_time:54938543ms step_avg:1598.91ms +step:34380/60000 train_loss:1.9419 lr_mul:0.0854 train_time:54970915ms step_avg:1598.92ms +step:34400/60000 train_loss:2.1951 lr_mul:0.0853 train_time:55003284ms step_avg:1598.93ms +step:34420/60000 train_loss:2.1252 lr_mul:0.0853 train_time:55035663ms step_avg:1598.94ms +step:34440/60000 train_loss:2.1620 lr_mul:0.0852 train_time:55068014ms step_avg:1598.96ms +step:34460/60000 train_loss:2.0245 lr_mul:0.0851 train_time:55100370ms step_avg:1598.97ms +step:34480/60000 train_loss:1.9879 lr_mul:0.0851 train_time:55132729ms step_avg:1598.98ms +step:34500/60000 train_loss:2.0659 lr_mul:0.0850 train_time:55165082ms step_avg:1598.99ms + >> Checkpoint saved: ./checkpoints/ckpt_step34500.pt +step:34520/60000 train_loss:2.0384 lr_mul:0.0849 train_time:55197679ms step_avg:1599.01ms +step:34540/60000 train_loss:2.1931 lr_mul:0.0849 train_time:55230036ms 
step_avg:1599.02ms +step:34560/60000 train_loss:2.0629 lr_mul:0.0848 train_time:55262414ms step_avg:1599.03ms +step:34580/60000 train_loss:2.1976 lr_mul:0.0847 train_time:55294805ms step_avg:1599.04ms +step:34600/60000 train_loss:2.0562 lr_mul:0.0847 train_time:55327198ms step_avg:1599.05ms +step:34620/60000 train_loss:1.9476 lr_mul:0.0846 train_time:55359575ms step_avg:1599.06ms +step:34640/60000 train_loss:2.0225 lr_mul:0.0845 train_time:55391946ms step_avg:1599.07ms +step:34660/60000 train_loss:2.0224 lr_mul:0.0845 train_time:55424523ms step_avg:1599.09ms +step:34680/60000 train_loss:2.0845 lr_mul:0.0844 train_time:55456876ms step_avg:1599.10ms +step:34700/60000 train_loss:2.1945 lr_mul:0.0843 train_time:55489233ms step_avg:1599.11ms +step:34720/60000 train_loss:2.2100 lr_mul:0.0843 train_time:55521595ms step_avg:1599.12ms +step:34740/60000 train_loss:2.0674 lr_mul:0.0842 train_time:55553949ms step_avg:1599.13ms +step:34760/60000 train_loss:2.0367 lr_mul:0.0841 train_time:55586314ms step_avg:1599.15ms +step:34780/60000 train_loss:2.1439 lr_mul:0.0841 train_time:55618677ms step_avg:1599.16ms +step:34800/60000 train_loss:2.0127 lr_mul:0.0840 train_time:55651040ms step_avg:1599.17ms +step:34820/60000 train_loss:2.0720 lr_mul:0.0839 train_time:55683381ms step_avg:1599.18ms +step:34840/60000 train_loss:2.0409 lr_mul:0.0839 train_time:55715732ms step_avg:1599.19ms +step:34860/60000 train_loss:2.0274 lr_mul:0.0838 train_time:55748060ms step_avg:1599.20ms +step:34880/60000 train_loss:2.0542 lr_mul:0.0837 train_time:55780413ms step_avg:1599.21ms +step:34900/60000 train_loss:2.0631 lr_mul:0.0837 train_time:55812756ms step_avg:1599.22ms +step:34920/60000 train_loss:2.1016 lr_mul:0.0836 train_time:55845100ms step_avg:1599.23ms +step:34940/60000 train_loss:2.0155 lr_mul:0.0835 train_time:55877701ms step_avg:1599.25ms +step:34960/60000 train_loss:2.0961 lr_mul:0.0835 train_time:55910044ms step_avg:1599.26ms +step:34980/60000 train_loss:2.0537 lr_mul:0.0834 train_time:55942411ms step_avg:1599.27ms +step:35000/60000 train_loss:2.0273 lr_mul:0.0833 train_time:55974786ms step_avg:1599.28ms + >> Checkpoint saved: ./checkpoints/ckpt_step35000.pt + eval: 100 batches (limited) + eval: batch 20/100 running_loss:2.0444 + eval: batch 40/100 running_loss:2.0398 + eval: batch 60/100 running_loss:2.0324 + eval: batch 80/100 running_loss:2.0342 + eval: batch 100/100 running_loss:2.0424 +step:35000/60000 val_loss:2.0424 val_bpb:1.2260 train_time:55975057ms +step:35020/60000 train_loss:2.0547 lr_mul:0.0833 train_time:56007426ms step_avg:1599.30ms +step:35040/60000 train_loss:2.0239 lr_mul:0.0832 train_time:56039790ms step_avg:1599.31ms +step:35060/60000 train_loss:2.0939 lr_mul:0.0831 train_time:56072140ms step_avg:1599.32ms +step:35080/60000 train_loss:2.0375 lr_mul:0.0831 train_time:56104479ms step_avg:1599.33ms +step:35100/60000 train_loss:2.1059 lr_mul:0.0830 train_time:56136840ms step_avg:1599.34ms +step:35120/60000 train_loss:2.0025 lr_mul:0.0829 train_time:56169196ms step_avg:1599.35ms +step:35140/60000 train_loss:2.0953 lr_mul:0.0829 train_time:56201557ms step_avg:1599.36ms +step:35160/60000 train_loss:2.0375 lr_mul:0.0828 train_time:56233882ms step_avg:1599.37ms +step:35180/60000 train_loss:2.0341 lr_mul:0.0827 train_time:56266230ms step_avg:1599.38ms +step:35200/60000 train_loss:2.0671 lr_mul:0.0827 train_time:56298579ms step_avg:1599.39ms +step:35220/60000 train_loss:2.0295 lr_mul:0.0826 train_time:56331100ms step_avg:1599.41ms +step:35240/60000 train_loss:2.0710 lr_mul:0.0825 train_time:56363447ms 
step_avg:1599.42ms +step:35260/60000 train_loss:1.9893 lr_mul:0.0825 train_time:56395807ms step_avg:1599.43ms +step:35280/60000 train_loss:2.0503 lr_mul:0.0824 train_time:56428179ms step_avg:1599.44ms +step:35300/60000 train_loss:1.9589 lr_mul:0.0823 train_time:56460538ms step_avg:1599.45ms +step:35320/60000 train_loss:2.0735 lr_mul:0.0823 train_time:56492911ms step_avg:1599.46ms +step:35340/60000 train_loss:2.0628 lr_mul:0.0822 train_time:56525323ms step_avg:1599.47ms +step:35360/60000 train_loss:2.0049 lr_mul:0.0821 train_time:56557699ms step_avg:1599.48ms +step:35380/60000 train_loss:2.0706 lr_mul:0.0821 train_time:56590051ms step_avg:1599.49ms +step:35400/60000 train_loss:2.0409 lr_mul:0.0820 train_time:56622423ms step_avg:1599.50ms +step:35420/60000 train_loss:2.1266 lr_mul:0.0819 train_time:56654790ms step_avg:1599.51ms +step:35440/60000 train_loss:1.9822 lr_mul:0.0819 train_time:56687150ms step_avg:1599.52ms +step:35460/60000 train_loss:2.0786 lr_mul:0.0818 train_time:56719539ms step_avg:1599.54ms +step:35480/60000 train_loss:1.9455 lr_mul:0.0817 train_time:56751931ms step_avg:1599.55ms +step:35500/60000 train_loss:2.0612 lr_mul:0.0817 train_time:56784296ms step_avg:1599.56ms + >> Checkpoint saved: ./checkpoints/ckpt_step35500.pt +step:35520/60000 train_loss:2.0359 lr_mul:0.0816 train_time:56816978ms step_avg:1599.58ms +step:35540/60000 train_loss:2.0469 lr_mul:0.0815 train_time:56849336ms step_avg:1599.59ms +step:35560/60000 train_loss:2.0628 lr_mul:0.0815 train_time:56881691ms step_avg:1599.60ms +step:35580/60000 train_loss:1.9762 lr_mul:0.0814 train_time:56914034ms step_avg:1599.61ms +step:35600/60000 train_loss:2.0810 lr_mul:0.0813 train_time:56946389ms step_avg:1599.62ms +step:35620/60000 train_loss:1.9597 lr_mul:0.0813 train_time:56978758ms step_avg:1599.63ms +step:35640/60000 train_loss:2.1396 lr_mul:0.0812 train_time:57011122ms step_avg:1599.64ms +step:35660/60000 train_loss:2.0380 lr_mul:0.0811 train_time:57043479ms step_avg:1599.65ms +step:35680/60000 train_loss:2.1092 lr_mul:0.0811 train_time:57075823ms step_avg:1599.66ms +step:35700/60000 train_loss:1.9990 lr_mul:0.0810 train_time:57108166ms step_avg:1599.67ms +step:35720/60000 train_loss:2.0270 lr_mul:0.0809 train_time:57140526ms step_avg:1599.68ms +step:35740/60000 train_loss:2.0621 lr_mul:0.0809 train_time:57172877ms step_avg:1599.69ms +step:35760/60000 train_loss:2.0797 lr_mul:0.0808 train_time:57205217ms step_avg:1599.70ms +step:35780/60000 train_loss:2.1004 lr_mul:0.0807 train_time:57237563ms step_avg:1599.71ms +step:35800/60000 train_loss:2.0333 lr_mul:0.0807 train_time:57270112ms step_avg:1599.72ms +step:35820/60000 train_loss:2.0575 lr_mul:0.0806 train_time:57302478ms step_avg:1599.73ms +step:35840/60000 train_loss:2.0371 lr_mul:0.0805 train_time:57334864ms step_avg:1599.75ms +step:35860/60000 train_loss:2.0305 lr_mul:0.0805 train_time:57367285ms step_avg:1599.76ms +step:35880/60000 train_loss:2.0619 lr_mul:0.0804 train_time:57399661ms step_avg:1599.77ms +step:35900/60000 train_loss:2.1197 lr_mul:0.0803 train_time:57432005ms step_avg:1599.78ms +step:35920/60000 train_loss:2.0646 lr_mul:0.0803 train_time:57464323ms step_avg:1599.79ms +step:35940/60000 train_loss:2.0310 lr_mul:0.0802 train_time:57496667ms step_avg:1599.80ms +step:35960/60000 train_loss:2.0263 lr_mul:0.0801 train_time:57529032ms step_avg:1599.81ms +step:35980/60000 train_loss:2.0187 lr_mul:0.0801 train_time:57561370ms step_avg:1599.82ms +step:36000/60000 train_loss:2.1208 lr_mul:0.0800 train_time:57593730ms step_avg:1599.83ms + >> Checkpoint saved: 
./checkpoints/ckpt_step36000.pt + eval: 100 batches (limited) + eval: batch 20/100 running_loss:2.0428 + eval: batch 40/100 running_loss:2.0381 + eval: batch 60/100 running_loss:2.0309 + eval: batch 80/100 running_loss:2.0327 + eval: batch 100/100 running_loss:2.0408 +step:36000/60000 val_loss:2.0408 val_bpb:1.2251 train_time:57593931ms +step:36020/60000 train_loss:2.0750 lr_mul:0.0799 train_time:57626323ms step_avg:1599.84ms +step:36040/60000 train_loss:2.0909 lr_mul:0.0799 train_time:57658698ms step_avg:1599.85ms +step:36060/60000 train_loss:2.0222 lr_mul:0.0798 train_time:57691086ms step_avg:1599.86ms +step:36080/60000 train_loss:2.0754 lr_mul:0.0797 train_time:57723467ms step_avg:1599.87ms +step:36100/60000 train_loss:2.0419 lr_mul:0.0797 train_time:57756018ms step_avg:1599.89ms +step:36120/60000 train_loss:2.0961 lr_mul:0.0796 train_time:57788363ms step_avg:1599.90ms +step:36140/60000 train_loss:2.0277 lr_mul:0.0795 train_time:57820730ms step_avg:1599.91ms +step:36160/60000 train_loss:2.0003 lr_mul:0.0795 train_time:57853091ms step_avg:1599.92ms +step:36180/60000 train_loss:2.0066 lr_mul:0.0794 train_time:57885452ms step_avg:1599.93ms +step:36200/60000 train_loss:2.0603 lr_mul:0.0793 train_time:57917816ms step_avg:1599.94ms +step:36220/60000 train_loss:2.0584 lr_mul:0.0793 train_time:57950171ms step_avg:1599.95ms +step:36240/60000 train_loss:2.0575 lr_mul:0.0792 train_time:57982538ms step_avg:1599.96ms +step:36260/60000 train_loss:2.1055 lr_mul:0.0791 train_time:58014900ms step_avg:1599.97ms +step:36280/60000 train_loss:2.0020 lr_mul:0.0791 train_time:58047257ms step_avg:1599.98ms +step:36300/60000 train_loss:2.0537 lr_mul:0.0790 train_time:58079606ms step_avg:1599.99ms +step:36320/60000 train_loss:2.0678 lr_mul:0.0789 train_time:58111951ms step_avg:1600.00ms +step:36340/60000 train_loss:2.0465 lr_mul:0.0789 train_time:58144300ms step_avg:1600.01ms +step:36360/60000 train_loss:2.0035 lr_mul:0.0788 train_time:58176635ms step_avg:1600.02ms +step:36380/60000 train_loss:2.0453 lr_mul:0.0787 train_time:58209173ms step_avg:1600.03ms +step:36400/60000 train_loss:2.0695 lr_mul:0.0787 train_time:58241516ms step_avg:1600.04ms +step:36420/60000 train_loss:2.0106 lr_mul:0.0786 train_time:58273880ms step_avg:1600.05ms +step:36440/60000 train_loss:2.1102 lr_mul:0.0785 train_time:58306229ms step_avg:1600.06ms +step:36460/60000 train_loss:2.0498 lr_mul:0.0785 train_time:58338576ms step_avg:1600.07ms +step:36480/60000 train_loss:2.0983 lr_mul:0.0784 train_time:58370928ms step_avg:1600.08ms +step:36500/60000 train_loss:2.0736 lr_mul:0.0783 train_time:58403305ms step_avg:1600.09ms + >> Checkpoint saved: ./checkpoints/ckpt_step36500.pt +step:36520/60000 train_loss:1.9925 lr_mul:0.0783 train_time:58435858ms step_avg:1600.11ms +step:36540/60000 train_loss:2.0012 lr_mul:0.0782 train_time:58468243ms step_avg:1600.12ms +step:36560/60000 train_loss:2.1490 lr_mul:0.0781 train_time:58500611ms step_avg:1600.13ms +step:36580/60000 train_loss:2.1697 lr_mul:0.0781 train_time:58532978ms step_avg:1600.14ms +step:36600/60000 train_loss:2.0242 lr_mul:0.0780 train_time:58565347ms step_avg:1600.15ms +step:36620/60000 train_loss:2.0410 lr_mul:0.0779 train_time:58597705ms step_avg:1600.16ms +step:36640/60000 train_loss:2.0171 lr_mul:0.0779 train_time:58630064ms step_avg:1600.17ms +step:36660/60000 train_loss:2.0488 lr_mul:0.0778 train_time:58662734ms step_avg:1600.18ms +step:36680/60000 train_loss:2.0405 lr_mul:0.0777 train_time:58695088ms step_avg:1600.19ms +step:36700/60000 train_loss:2.1419 lr_mul:0.0777 
train_time:58727492ms step_avg:1600.20ms +step:36720/60000 train_loss:1.9782 lr_mul:0.0776 train_time:58759863ms step_avg:1600.21ms +step:36740/60000 train_loss:1.9778 lr_mul:0.0775 train_time:58792236ms step_avg:1600.22ms +step:36760/60000 train_loss:2.0014 lr_mul:0.0775 train_time:58824591ms step_avg:1600.23ms +step:36780/60000 train_loss:2.0407 lr_mul:0.0774 train_time:58856908ms step_avg:1600.24ms +step:36800/60000 train_loss:1.9464 lr_mul:0.0773 train_time:58889256ms step_avg:1600.25ms +step:36820/60000 train_loss:2.1200 lr_mul:0.0773 train_time:58921608ms step_avg:1600.26ms +step:36840/60000 train_loss:1.9750 lr_mul:0.0772 train_time:58953937ms step_avg:1600.27ms +step:36860/60000 train_loss:2.0830 lr_mul:0.0771 train_time:58986253ms step_avg:1600.28ms +step:36880/60000 train_loss:1.9284 lr_mul:0.0771 train_time:59018572ms step_avg:1600.29ms +step:36900/60000 train_loss:2.0354 lr_mul:0.0770 train_time:59050901ms step_avg:1600.30ms +step:36920/60000 train_loss:1.9717 lr_mul:0.0769 train_time:59083245ms step_avg:1600.30ms +step:36940/60000 train_loss:2.1226 lr_mul:0.0769 train_time:59115585ms step_avg:1600.31ms +step:36960/60000 train_loss:1.9471 lr_mul:0.0768 train_time:59148208ms step_avg:1600.33ms +step:36980/60000 train_loss:2.0866 lr_mul:0.0767 train_time:59180536ms step_avg:1600.34ms +step:37000/60000 train_loss:1.9534 lr_mul:0.0767 train_time:59212876ms step_avg:1600.35ms + >> Checkpoint saved: ./checkpoints/ckpt_step37000.pt + eval: 100 batches (limited) + eval: batch 20/100 running_loss:2.0440 + eval: batch 40/100 running_loss:2.0396 + eval: batch 60/100 running_loss:2.0324 + eval: batch 80/100 running_loss:2.0341 + eval: batch 100/100 running_loss:2.0422 +step:37000/60000 val_loss:2.0422 val_bpb:1.2259 train_time:59213088ms +step:37020/60000 train_loss:2.0602 lr_mul:0.0766 train_time:59245447ms step_avg:1600.36ms +step:37040/60000 train_loss:1.9852 lr_mul:0.0765 train_time:59277794ms step_avg:1600.37ms +step:37060/60000 train_loss:2.0331 lr_mul:0.0765 train_time:59310135ms step_avg:1600.38ms +step:37080/60000 train_loss:2.0302 lr_mul:0.0764 train_time:59342478ms step_avg:1600.39ms +step:37100/60000 train_loss:2.0497 lr_mul:0.0763 train_time:59374809ms step_avg:1600.40ms +step:37120/60000 train_loss:2.0342 lr_mul:0.0763 train_time:59407152ms step_avg:1600.41ms +step:37140/60000 train_loss:2.0543 lr_mul:0.0762 train_time:59439492ms step_avg:1600.42ms +step:37160/60000 train_loss:2.1001 lr_mul:0.0761 train_time:59471842ms step_avg:1600.43ms +step:37180/60000 train_loss:2.0645 lr_mul:0.0761 train_time:59504202ms step_avg:1600.44ms +step:37200/60000 train_loss:1.9392 lr_mul:0.0760 train_time:59536564ms step_avg:1600.45ms +step:37220/60000 train_loss:2.0484 lr_mul:0.0759 train_time:59568936ms step_avg:1600.46ms +step:37240/60000 train_loss:2.0524 lr_mul:0.0759 train_time:59601606ms step_avg:1600.47ms +step:37260/60000 train_loss:2.0644 lr_mul:0.0758 train_time:59633968ms step_avg:1600.48ms +step:37280/60000 train_loss:2.0412 lr_mul:0.0757 train_time:59666346ms step_avg:1600.49ms +step:37300/60000 train_loss:1.9749 lr_mul:0.0757 train_time:59698712ms step_avg:1600.50ms +step:37320/60000 train_loss:2.0138 lr_mul:0.0756 train_time:59731071ms step_avg:1600.51ms +step:37340/60000 train_loss:2.0358 lr_mul:0.0755 train_time:59763450ms step_avg:1600.52ms +step:37360/60000 train_loss:2.0221 lr_mul:0.0755 train_time:59795812ms step_avg:1600.53ms +step:37380/60000 train_loss:2.0241 lr_mul:0.0754 train_time:59828182ms step_avg:1600.54ms +step:37400/60000 train_loss:2.0635 lr_mul:0.0753 
train_time:59860514ms step_avg:1600.55ms +step:37420/60000 train_loss:2.0397 lr_mul:0.0753 train_time:59892844ms step_avg:1600.56ms +step:37440/60000 train_loss:2.0483 lr_mul:0.0752 train_time:59925194ms step_avg:1600.57ms +step:37460/60000 train_loss:2.0672 lr_mul:0.0751 train_time:59957558ms step_avg:1600.58ms +step:37480/60000 train_loss:1.9856 lr_mul:0.0751 train_time:59989921ms step_avg:1600.58ms +step:37500/60000 train_loss:2.0693 lr_mul:0.0750 train_time:60022297ms step_avg:1600.59ms + >> Checkpoint saved: ./checkpoints/ckpt_step37500.pt +step:37520/60000 train_loss:2.0149 lr_mul:0.0749 train_time:60055084ms step_avg:1600.62ms +step:37540/60000 train_loss:2.1592 lr_mul:0.0749 train_time:60087491ms step_avg:1600.63ms +step:37560/60000 train_loss:2.0725 lr_mul:0.0748 train_time:60119923ms step_avg:1600.64ms +step:37580/60000 train_loss:1.9894 lr_mul:0.0747 train_time:60152329ms step_avg:1600.65ms +step:37600/60000 train_loss:2.0517 lr_mul:0.0747 train_time:60184731ms step_avg:1600.66ms +step:37620/60000 train_loss:2.0702 lr_mul:0.0746 train_time:60217132ms step_avg:1600.67ms +step:37640/60000 train_loss:2.0461 lr_mul:0.0745 train_time:60249521ms step_avg:1600.68ms +step:37660/60000 train_loss:1.9928 lr_mul:0.0745 train_time:60281913ms step_avg:1600.69ms +step:37680/60000 train_loss:2.0939 lr_mul:0.0744 train_time:60314318ms step_avg:1600.70ms +step:37700/60000 train_loss:2.1428 lr_mul:0.0743 train_time:60346714ms step_avg:1600.71ms +step:37720/60000 train_loss:1.9856 lr_mul:0.0743 train_time:60379084ms step_avg:1600.72ms +step:37740/60000 train_loss:2.0327 lr_mul:0.0742 train_time:60411453ms step_avg:1600.73ms +step:37760/60000 train_loss:2.0060 lr_mul:0.0741 train_time:60443822ms step_avg:1600.74ms +step:37780/60000 train_loss:2.0762 lr_mul:0.0741 train_time:60476202ms step_avg:1600.75ms +step:37800/60000 train_loss:2.0069 lr_mul:0.0740 train_time:60508582ms step_avg:1600.76ms +step:37820/60000 train_loss:2.0645 lr_mul:0.0739 train_time:60541155ms step_avg:1600.77ms +step:37840/60000 train_loss:1.9901 lr_mul:0.0739 train_time:60573553ms step_avg:1600.78ms +step:37860/60000 train_loss:2.1090 lr_mul:0.0738 train_time:60605944ms step_avg:1600.79ms +step:37880/60000 train_loss:2.0051 lr_mul:0.0737 train_time:60638300ms step_avg:1600.80ms +step:37900/60000 train_loss:1.9836 lr_mul:0.0737 train_time:60670659ms step_avg:1600.81ms +step:37920/60000 train_loss:2.0985 lr_mul:0.0736 train_time:60703011ms step_avg:1600.82ms +step:37940/60000 train_loss:2.0581 lr_mul:0.0735 train_time:60735375ms step_avg:1600.83ms +step:37960/60000 train_loss:1.9919 lr_mul:0.0735 train_time:60767750ms step_avg:1600.84ms +step:37980/60000 train_loss:2.0923 lr_mul:0.0734 train_time:60800114ms step_avg:1600.85ms +step:38000/60000 train_loss:1.9591 lr_mul:0.0733 train_time:60832492ms step_avg:1600.86ms + >> Checkpoint saved: ./checkpoints/ckpt_step38000.pt + eval: 100 batches (limited) + eval: batch 20/100 running_loss:2.0428 + eval: batch 40/100 running_loss:2.0382 + eval: batch 60/100 running_loss:2.0311 + eval: batch 80/100 running_loss:2.0328 + eval: batch 100/100 running_loss:2.0410 +step:38000/60000 val_loss:2.0410 val_bpb:1.2252 train_time:60832730ms +step:38020/60000 train_loss:2.1415 lr_mul:0.0733 train_time:60865124ms step_avg:1600.87ms +step:38040/60000 train_loss:1.9594 lr_mul:0.0732 train_time:60897501ms step_avg:1600.88ms +step:38060/60000 train_loss:2.0584 lr_mul:0.0731 train_time:60929865ms step_avg:1600.89ms +step:38080/60000 train_loss:2.0132 lr_mul:0.0731 train_time:60962220ms step_avg:1600.90ms 
+[train.log, steps 38100-53080: per-20-step training telemetry condensed; the 1,000-step checkpoint and eval summary lines are kept below]
+step:39000/60000 train_loss:2.2371 lr_mul:0.0700 train_time:62447843ms step_avg:1601.23ms
+ >> Checkpoint saved: ./checkpoints/ckpt_step39000.pt
+step:39000/60000 val_loss:2.0390 val_bpb:1.2240 train_time:62448060ms
+step:40000/60000 train_loss:2.0586 lr_mul:0.0667 train_time:64050032ms step_avg:1601.25ms
+ >> Checkpoint saved: ./checkpoints/ckpt_step40000.pt
+step:40000/60000 val_loss:2.0359 val_bpb:1.2222 train_time:64050238ms
+step:41000/60000 train_loss:2.0590 lr_mul:0.0633 train_time:65666466ms step_avg:1601.62ms
+ >> Checkpoint saved: ./checkpoints/ckpt_step41000.pt
+step:41000/60000 val_loss:2.0350 val_bpb:1.2216 train_time:65666666ms
+step:42000/60000 train_loss:1.9890 lr_mul:0.0600 train_time:67293530ms step_avg:1602.23ms
+ >> Checkpoint saved: ./checkpoints/ckpt_step42000.pt
+step:42000/60000 val_loss:2.0336 val_bpb:1.2208 train_time:67293740ms
+step:43000/60000 train_loss:2.0982 lr_mul:0.0567 train_time:68922923ms step_avg:1602.86ms
+ >> Checkpoint saved: ./checkpoints/ckpt_step43000.pt
+step:43000/60000 val_loss:2.0341 val_bpb:1.2211 train_time:68923150ms
+step:44000/60000 train_loss:2.1133 lr_mul:0.0533 train_time:70550632ms step_avg:1603.42ms
+ >> Checkpoint saved: ./checkpoints/ckpt_step44000.pt
+step:44000/60000 val_loss:2.0330 val_bpb:1.2204 train_time:70550857ms
+step:45000/60000 train_loss:2.1733 lr_mul:0.0500 train_time:72180060ms step_avg:1604.00ms
+ >> Checkpoint saved: ./checkpoints/ckpt_step45000.pt
+step:45000/60000 val_loss:2.0318 val_bpb:1.2197 train_time:72180273ms
+step:46000/60000 train_loss:1.9761 lr_mul:0.0467 train_time:73810060ms step_avg:1604.57ms
+ >> Checkpoint saved: ./checkpoints/ckpt_step46000.pt
+step:46000/60000 val_loss:2.0305 val_bpb:1.2189 train_time:73810271ms
+step:47000/60000 train_loss:2.0962 lr_mul:0.0433 train_time:75440618ms step_avg:1605.12ms
+ >> Checkpoint saved: ./checkpoints/ckpt_step47000.pt
+step:47000/60000 val_loss:2.0309 val_bpb:1.2191 train_time:75440818ms
+step:48000/60000 train_loss:2.0279 lr_mul:0.0400 train_time:77071227ms step_avg:1605.65ms
+ >> Checkpoint saved: ./checkpoints/ckpt_step48000.pt
+step:48000/60000 val_loss:2.0306 val_bpb:1.2190 train_time:77071434ms
+step:49000/60000 train_loss:2.0305 lr_mul:0.0367 train_time:78700062ms step_avg:1606.12ms
+ >> Checkpoint saved: ./checkpoints/ckpt_step49000.pt
+step:49000/60000 val_loss:2.0295 val_bpb:1.2183 train_time:78700271ms
+step:50000/60000 train_loss:2.0737 lr_mul:0.0333 train_time:80329441ms step_avg:1606.59ms
+ >> Checkpoint saved: ./checkpoints/ckpt_step50000.pt
+step:50000/60000 val_loss:2.0281 val_bpb:1.2175 train_time:80329648ms
+step:51000/60000 train_loss:2.0800 lr_mul:0.0300 train_time:81956571ms step_avg:1606.99ms
+ >> Checkpoint saved: ./checkpoints/ckpt_step51000.pt
+step:51000/60000 val_loss:2.0279 val_bpb:1.2174 train_time:81956770ms
+step:52000/60000 train_loss:2.0878 lr_mul:0.0267 train_time:83580968ms step_avg:1607.33ms
+ >> Checkpoint saved: ./checkpoints/ckpt_step52000.pt
+step:52000/60000 val_loss:2.0275 val_bpb:1.2171 train_time:83581168ms
+step:53000/60000 train_loss:2.0248 lr_mul:0.0233 train_time:85203864ms step_avg:1607.62ms
+ >> Checkpoint saved: ./checkpoints/ckpt_step53000.pt
+step:53000/60000 val_loss:2.0278 val_bpb:1.2173 train_time:85204064ms
+step:53100/60000 train_loss:2.0892 lr_mul:0.0230 train_time:85366495ms step_avg:1607.66ms +step:53120/60000 train_loss:1.9482 lr_mul:0.0229 train_time:85398941ms step_avg:1607.66ms +step:53140/60000 train_loss:2.0425 lr_mul:0.0229 train_time:85431367ms step_avg:1607.67ms +step:53160/60000 train_loss:1.9813 lr_mul:0.0228 train_time:85463802ms step_avg:1607.67ms +step:53180/60000 train_loss:2.0317 lr_mul:0.0227 train_time:85496227ms step_avg:1607.68ms +step:53200/60000 train_loss:2.0134 lr_mul:0.0227 train_time:85528648ms step_avg:1607.68ms +step:53220/60000 train_loss:2.0777 lr_mul:0.0226 train_time:85561103ms step_avg:1607.69ms +step:53240/60000 train_loss:2.0856 lr_mul:0.0225 train_time:85593554ms step_avg:1607.69ms +step:53260/60000 train_loss:2.0669 lr_mul:0.0225 train_time:85625975ms step_avg:1607.70ms +step:53280/60000 train_loss:2.0222 lr_mul:0.0224 train_time:85658389ms step_avg:1607.70ms +step:53300/60000 train_loss:2.0466 lr_mul:0.0223 train_time:85690774ms step_avg:1607.71ms +step:53320/60000 train_loss:2.1420 lr_mul:0.0223 train_time:85723177ms step_avg:1607.71ms +step:53340/60000 train_loss:1.9687 lr_mul:0.0222 train_time:85755825ms step_avg:1607.72ms +step:53360/60000 train_loss:2.0808 lr_mul:0.0221 train_time:85788221ms step_avg:1607.73ms +step:53380/60000 train_loss:2.0880 lr_mul:0.0221 train_time:85820622ms step_avg:1607.73ms +step:53400/60000 train_loss:2.0085 lr_mul:0.0220 train_time:85853070ms step_avg:1607.74ms +step:53420/60000 train_loss:2.0996 lr_mul:0.0219 train_time:85885525ms step_avg:1607.74ms +step:53440/60000 train_loss:2.0676 lr_mul:0.0219 train_time:85917948ms step_avg:1607.75ms +step:53460/60000 train_loss:2.0028 lr_mul:0.0218 train_time:85950359ms step_avg:1607.75ms +step:53480/60000 train_loss:2.0202 lr_mul:0.0217 train_time:85982783ms step_avg:1607.76ms +step:53500/60000 train_loss:2.0360 lr_mul:0.0217 train_time:86015195ms step_avg:1607.76ms + >> Checkpoint saved: ./checkpoints/ckpt_step53500.pt +step:53520/60000 train_loss:2.0762 lr_mul:0.0216 train_time:86047804ms step_avg:1607.77ms +step:53540/60000 train_loss:2.0364 lr_mul:0.0215 train_time:86080220ms step_avg:1607.77ms +step:53560/60000 train_loss:2.0450 lr_mul:0.0215 train_time:86112640ms step_avg:1607.78ms +step:53580/60000 train_loss:1.9909 lr_mul:0.0214 train_time:86145061ms step_avg:1607.78ms +step:53600/60000 train_loss:2.0508 lr_mul:0.0213 train_time:86177472ms step_avg:1607.79ms +step:53620/60000 train_loss:2.0063 lr_mul:0.0213 train_time:86209910ms step_avg:1607.79ms +step:53640/60000 train_loss:2.0510 lr_mul:0.0212 train_time:86242579ms step_avg:1607.80ms +step:53660/60000 train_loss:2.0306 lr_mul:0.0211 train_time:86275029ms step_avg:1607.81ms +step:53680/60000 train_loss:2.0674 lr_mul:0.0211 train_time:86307450ms step_avg:1607.81ms +step:53700/60000 train_loss:2.0397 lr_mul:0.0210 train_time:86339895ms step_avg:1607.82ms +step:53720/60000 train_loss:2.0756 lr_mul:0.0209 train_time:86372341ms step_avg:1607.82ms +step:53740/60000 train_loss:2.1029 lr_mul:0.0209 train_time:86404767ms step_avg:1607.83ms +step:53760/60000 train_loss:2.0918 lr_mul:0.0208 train_time:86437209ms step_avg:1607.83ms +step:53780/60000 train_loss:1.9087 lr_mul:0.0207 train_time:86469624ms step_avg:1607.84ms +step:53800/60000 train_loss:2.0567 lr_mul:0.0207 train_time:86502044ms step_avg:1607.84ms +step:53820/60000 train_loss:2.0792 lr_mul:0.0206 train_time:86534486ms step_avg:1607.85ms +step:53840/60000 train_loss:2.0034 lr_mul:0.0205 train_time:86566918ms step_avg:1607.86ms +step:53860/60000 train_loss:2.0712 
lr_mul:0.0205 train_time:86599349ms step_avg:1607.86ms +step:53880/60000 train_loss:2.0454 lr_mul:0.0204 train_time:86631800ms step_avg:1607.87ms +step:53900/60000 train_loss:2.0715 lr_mul:0.0203 train_time:86664215ms step_avg:1607.87ms +step:53920/60000 train_loss:1.9689 lr_mul:0.0203 train_time:86696850ms step_avg:1607.88ms +step:53940/60000 train_loss:2.0834 lr_mul:0.0202 train_time:86729259ms step_avg:1607.88ms +step:53960/60000 train_loss:1.9786 lr_mul:0.0201 train_time:86761688ms step_avg:1607.89ms +step:53980/60000 train_loss:1.9836 lr_mul:0.0201 train_time:86794109ms step_avg:1607.89ms +step:54000/60000 train_loss:2.0130 lr_mul:0.0200 train_time:86826511ms step_avg:1607.90ms + >> Checkpoint saved: ./checkpoints/ckpt_step54000.pt + eval: 100 batches (limited) + eval: batch 20/100 running_loss:2.0291 + eval: batch 40/100 running_loss:2.0248 + eval: batch 60/100 running_loss:2.0172 + eval: batch 80/100 running_loss:2.0187 + eval: batch 100/100 running_loss:2.0268 +step:54000/60000 val_loss:2.0268 val_bpb:1.2167 train_time:86826715ms +step:54020/60000 train_loss:2.0074 lr_mul:0.0199 train_time:86859140ms step_avg:1607.91ms +step:54040/60000 train_loss:2.0863 lr_mul:0.0199 train_time:86891596ms step_avg:1607.91ms +step:54060/60000 train_loss:1.9017 lr_mul:0.0198 train_time:86924041ms step_avg:1607.92ms +step:54080/60000 train_loss:2.1065 lr_mul:0.0197 train_time:86956473ms step_avg:1607.92ms +step:54100/60000 train_loss:2.1172 lr_mul:0.0197 train_time:86988915ms step_avg:1607.93ms +step:54120/60000 train_loss:2.0425 lr_mul:0.0196 train_time:87021346ms step_avg:1607.93ms +step:54140/60000 train_loss:2.0962 lr_mul:0.0195 train_time:87053757ms step_avg:1607.94ms +step:54160/60000 train_loss:2.0504 lr_mul:0.0195 train_time:87086191ms step_avg:1607.94ms +step:54180/60000 train_loss:2.0343 lr_mul:0.0194 train_time:87118624ms step_avg:1607.95ms +step:54200/60000 train_loss:2.0853 lr_mul:0.0193 train_time:87151033ms step_avg:1607.95ms +step:54220/60000 train_loss:2.0585 lr_mul:0.0193 train_time:87183631ms step_avg:1607.96ms +step:54240/60000 train_loss:2.0127 lr_mul:0.0192 train_time:87216030ms step_avg:1607.97ms +step:54260/60000 train_loss:2.0616 lr_mul:0.0191 train_time:87248407ms step_avg:1607.97ms +step:54280/60000 train_loss:2.0361 lr_mul:0.0191 train_time:87280805ms step_avg:1607.97ms +step:54300/60000 train_loss:2.0191 lr_mul:0.0190 train_time:87313194ms step_avg:1607.98ms +step:54320/60000 train_loss:2.0633 lr_mul:0.0189 train_time:87345629ms step_avg:1607.98ms +step:54340/60000 train_loss:1.9471 lr_mul:0.0189 train_time:87378064ms step_avg:1607.99ms +step:54360/60000 train_loss:2.0134 lr_mul:0.0188 train_time:87410483ms step_avg:1607.99ms +step:54380/60000 train_loss:2.1218 lr_mul:0.0187 train_time:87442925ms step_avg:1608.00ms +step:54400/60000 train_loss:2.0396 lr_mul:0.0187 train_time:87475367ms step_avg:1608.00ms +step:54420/60000 train_loss:2.0583 lr_mul:0.0186 train_time:87507811ms step_avg:1608.01ms +step:54440/60000 train_loss:2.1110 lr_mul:0.0185 train_time:87540268ms step_avg:1608.01ms +step:54460/60000 train_loss:2.0341 lr_mul:0.0185 train_time:87572705ms step_avg:1608.02ms +step:54480/60000 train_loss:2.0412 lr_mul:0.0184 train_time:87605111ms step_avg:1608.02ms +step:54500/60000 train_loss:2.0655 lr_mul:0.0183 train_time:87637804ms step_avg:1608.03ms + >> Checkpoint saved: ./checkpoints/ckpt_step54500.pt +step:54520/60000 train_loss:1.9649 lr_mul:0.0183 train_time:87670385ms step_avg:1608.04ms +step:54540/60000 train_loss:2.0166 lr_mul:0.0182 train_time:87702795ms 
step_avg:1608.05ms +step:54560/60000 train_loss:2.0552 lr_mul:0.0181 train_time:87735220ms step_avg:1608.05ms +step:54580/60000 train_loss:2.0277 lr_mul:0.0181 train_time:87767633ms step_avg:1608.05ms +step:54600/60000 train_loss:2.0857 lr_mul:0.0180 train_time:87800047ms step_avg:1608.06ms +step:54620/60000 train_loss:2.1177 lr_mul:0.0179 train_time:87832450ms step_avg:1608.06ms +step:54640/60000 train_loss:2.0437 lr_mul:0.0179 train_time:87864862ms step_avg:1608.07ms +step:54660/60000 train_loss:2.0942 lr_mul:0.0178 train_time:87897253ms step_avg:1608.07ms +step:54680/60000 train_loss:2.0033 lr_mul:0.0177 train_time:87929685ms step_avg:1608.08ms +step:54700/60000 train_loss:2.0747 lr_mul:0.0177 train_time:87962081ms step_avg:1608.08ms +step:54720/60000 train_loss:1.9686 lr_mul:0.0176 train_time:87994506ms step_avg:1608.09ms +step:54740/60000 train_loss:2.0318 lr_mul:0.0175 train_time:88026904ms step_avg:1608.09ms +step:54760/60000 train_loss:1.9565 lr_mul:0.0175 train_time:88059310ms step_avg:1608.10ms +step:54780/60000 train_loss:2.0259 lr_mul:0.0174 train_time:88091907ms step_avg:1608.10ms +step:54800/60000 train_loss:2.1275 lr_mul:0.0173 train_time:88124313ms step_avg:1608.11ms +step:54820/60000 train_loss:2.0229 lr_mul:0.0173 train_time:88156720ms step_avg:1608.11ms +step:54840/60000 train_loss:2.0920 lr_mul:0.0172 train_time:88189122ms step_avg:1608.12ms +step:54860/60000 train_loss:2.0197 lr_mul:0.0171 train_time:88221548ms step_avg:1608.12ms +step:54880/60000 train_loss:2.0693 lr_mul:0.0171 train_time:88253963ms step_avg:1608.13ms +step:54900/60000 train_loss:2.0386 lr_mul:0.0170 train_time:88286389ms step_avg:1608.13ms +step:54920/60000 train_loss:1.9925 lr_mul:0.0169 train_time:88318810ms step_avg:1608.14ms +step:54940/60000 train_loss:2.0843 lr_mul:0.0169 train_time:88351230ms step_avg:1608.14ms +step:54960/60000 train_loss:1.9355 lr_mul:0.0168 train_time:88383637ms step_avg:1608.14ms +step:54980/60000 train_loss:2.0756 lr_mul:0.0167 train_time:88416037ms step_avg:1608.15ms +step:55000/60000 train_loss:2.0661 lr_mul:0.0167 train_time:88448434ms step_avg:1608.15ms + >> Checkpoint saved: ./checkpoints/ckpt_step55000.pt + eval: 100 batches (limited) + eval: batch 20/100 running_loss:2.0293 + eval: batch 40/100 running_loss:2.0250 + eval: batch 60/100 running_loss:2.0174 + eval: batch 80/100 running_loss:2.0190 + eval: batch 100/100 running_loss:2.0271 +step:55000/60000 val_loss:2.0271 val_bpb:1.2169 train_time:88448634ms +step:55020/60000 train_loss:1.9862 lr_mul:0.0166 train_time:88481068ms step_avg:1608.16ms +step:55040/60000 train_loss:2.0417 lr_mul:0.0165 train_time:88513499ms step_avg:1608.17ms +step:55060/60000 train_loss:2.0256 lr_mul:0.0165 train_time:88545913ms step_avg:1608.17ms +step:55080/60000 train_loss:2.0050 lr_mul:0.0164 train_time:88578481ms step_avg:1608.18ms +step:55100/60000 train_loss:2.0382 lr_mul:0.0163 train_time:88610852ms step_avg:1608.18ms +step:55120/60000 train_loss:2.0983 lr_mul:0.0163 train_time:88643250ms step_avg:1608.19ms +step:55140/60000 train_loss:1.9823 lr_mul:0.0162 train_time:88675639ms step_avg:1608.19ms +step:55160/60000 train_loss:2.0479 lr_mul:0.0161 train_time:88708041ms step_avg:1608.20ms +step:55180/60000 train_loss:2.0011 lr_mul:0.0161 train_time:88740457ms step_avg:1608.20ms +step:55200/60000 train_loss:2.0601 lr_mul:0.0160 train_time:88772862ms step_avg:1608.20ms +step:55220/60000 train_loss:2.0229 lr_mul:0.0159 train_time:88805284ms step_avg:1608.21ms +step:55240/60000 train_loss:2.0227 lr_mul:0.0159 train_time:88837705ms 
step_avg:1608.21ms +step:55260/60000 train_loss:2.0514 lr_mul:0.0158 train_time:88870126ms step_avg:1608.22ms +step:55280/60000 train_loss:2.0276 lr_mul:0.0157 train_time:88902556ms step_avg:1608.22ms +step:55300/60000 train_loss:2.0616 lr_mul:0.0157 train_time:88934951ms step_avg:1608.23ms +step:55320/60000 train_loss:2.0101 lr_mul:0.0156 train_time:88967369ms step_avg:1608.23ms +step:55340/60000 train_loss:2.0590 lr_mul:0.0155 train_time:88999792ms step_avg:1608.24ms +step:55360/60000 train_loss:1.9666 lr_mul:0.0155 train_time:89032469ms step_avg:1608.25ms +step:55380/60000 train_loss:2.0100 lr_mul:0.0154 train_time:89064901ms step_avg:1608.25ms +step:55400/60000 train_loss:2.0827 lr_mul:0.0153 train_time:89097333ms step_avg:1608.26ms +step:55420/60000 train_loss:1.9670 lr_mul:0.0153 train_time:89129760ms step_avg:1608.26ms +step:55440/60000 train_loss:2.0426 lr_mul:0.0152 train_time:89162200ms step_avg:1608.26ms +step:55460/60000 train_loss:2.0664 lr_mul:0.0151 train_time:89194635ms step_avg:1608.27ms +step:55480/60000 train_loss:1.9978 lr_mul:0.0151 train_time:89227075ms step_avg:1608.27ms +step:55500/60000 train_loss:2.0036 lr_mul:0.0150 train_time:89259526ms step_avg:1608.28ms + >> Checkpoint saved: ./checkpoints/ckpt_step55500.pt +step:55520/60000 train_loss:2.0430 lr_mul:0.0149 train_time:89292136ms step_avg:1608.29ms +step:55540/60000 train_loss:1.9913 lr_mul:0.0149 train_time:89324560ms step_avg:1608.29ms +step:55560/60000 train_loss:2.0449 lr_mul:0.0148 train_time:89356989ms step_avg:1608.30ms +step:55580/60000 train_loss:2.0556 lr_mul:0.0147 train_time:89389399ms step_avg:1608.30ms +step:55600/60000 train_loss:2.0477 lr_mul:0.0147 train_time:89421813ms step_avg:1608.31ms +step:55620/60000 train_loss:2.0175 lr_mul:0.0146 train_time:89454268ms step_avg:1608.31ms +step:55640/60000 train_loss:1.9293 lr_mul:0.0145 train_time:89487015ms step_avg:1608.32ms +step:55660/60000 train_loss:2.0395 lr_mul:0.0145 train_time:89519427ms step_avg:1608.33ms +step:55680/60000 train_loss:1.9640 lr_mul:0.0144 train_time:89551847ms step_avg:1608.33ms +step:55700/60000 train_loss:2.0452 lr_mul:0.0143 train_time:89584362ms step_avg:1608.34ms +step:55720/60000 train_loss:2.0601 lr_mul:0.0143 train_time:89616781ms step_avg:1608.34ms +step:55740/60000 train_loss:2.0194 lr_mul:0.0142 train_time:89649247ms step_avg:1608.35ms +step:55760/60000 train_loss:2.0583 lr_mul:0.0141 train_time:89681714ms step_avg:1608.35ms +step:55780/60000 train_loss:2.0252 lr_mul:0.0141 train_time:89714162ms step_avg:1608.36ms +step:55800/60000 train_loss:2.1091 lr_mul:0.0140 train_time:89746611ms step_avg:1608.36ms +step:55820/60000 train_loss:2.0231 lr_mul:0.0139 train_time:89779026ms step_avg:1608.37ms +step:55840/60000 train_loss:2.0816 lr_mul:0.0139 train_time:89811464ms step_avg:1608.37ms +step:55860/60000 train_loss:1.9898 lr_mul:0.0138 train_time:89843895ms step_avg:1608.38ms +step:55880/60000 train_loss:2.0299 lr_mul:0.0137 train_time:89876328ms step_avg:1608.38ms +step:55900/60000 train_loss:1.9701 lr_mul:0.0137 train_time:89908755ms step_avg:1608.39ms +step:55920/60000 train_loss:2.0434 lr_mul:0.0136 train_time:89941156ms step_avg:1608.39ms +step:55940/60000 train_loss:1.9992 lr_mul:0.0135 train_time:89973851ms step_avg:1608.40ms +step:55960/60000 train_loss:2.0196 lr_mul:0.0135 train_time:90006289ms step_avg:1608.40ms +step:55980/60000 train_loss:2.1079 lr_mul:0.0134 train_time:90038705ms step_avg:1608.41ms +step:56000/60000 train_loss:1.9253 lr_mul:0.0133 train_time:90071129ms step_avg:1608.41ms + >> Checkpoint saved: 
./checkpoints/ckpt_step56000.pt + eval: 100 batches (limited) + eval: batch 20/100 running_loss:2.0291 + eval: batch 40/100 running_loss:2.0248 + eval: batch 60/100 running_loss:2.0173 + eval: batch 80/100 running_loss:2.0188 + eval: batch 100/100 running_loss:2.0269 +step:56000/60000 val_loss:2.0269 val_bpb:1.2167 train_time:90071338ms +step:56020/60000 train_loss:2.0448 lr_mul:0.0133 train_time:90103769ms step_avg:1608.42ms +step:56040/60000 train_loss:2.0787 lr_mul:0.0132 train_time:90136174ms step_avg:1608.43ms +step:56060/60000 train_loss:2.0412 lr_mul:0.0131 train_time:90168596ms step_avg:1608.43ms +step:56080/60000 train_loss:2.0898 lr_mul:0.0131 train_time:90201027ms step_avg:1608.43ms +step:56100/60000 train_loss:2.0189 lr_mul:0.0130 train_time:90233471ms step_avg:1608.44ms +step:56120/60000 train_loss:2.1061 lr_mul:0.0129 train_time:90265909ms step_avg:1608.44ms +step:56140/60000 train_loss:2.0691 lr_mul:0.0129 train_time:90298348ms step_avg:1608.45ms +step:56160/60000 train_loss:2.0260 lr_mul:0.0128 train_time:90330830ms step_avg:1608.45ms +step:56180/60000 train_loss:2.0348 lr_mul:0.0127 train_time:90363287ms step_avg:1608.46ms +step:56200/60000 train_loss:2.0257 lr_mul:0.0127 train_time:90395753ms step_avg:1608.47ms +step:56220/60000 train_loss:2.0559 lr_mul:0.0126 train_time:90428420ms step_avg:1608.47ms +step:56240/60000 train_loss:2.1494 lr_mul:0.0125 train_time:90460849ms step_avg:1608.48ms +step:56260/60000 train_loss:2.1171 lr_mul:0.0125 train_time:90493309ms step_avg:1608.48ms +step:56280/60000 train_loss:2.0292 lr_mul:0.0124 train_time:90525753ms step_avg:1608.49ms +step:56300/60000 train_loss:2.1690 lr_mul:0.0123 train_time:90558173ms step_avg:1608.49ms +step:56320/60000 train_loss:2.0138 lr_mul:0.0123 train_time:90590589ms step_avg:1608.50ms +step:56340/60000 train_loss:2.0410 lr_mul:0.0122 train_time:90622986ms step_avg:1608.50ms +step:56360/60000 train_loss:2.0449 lr_mul:0.0121 train_time:90655395ms step_avg:1608.51ms +step:56380/60000 train_loss:2.1152 lr_mul:0.0121 train_time:90687815ms step_avg:1608.51ms +step:56400/60000 train_loss:2.0465 lr_mul:0.0120 train_time:90720256ms step_avg:1608.52ms +step:56420/60000 train_loss:2.0033 lr_mul:0.0119 train_time:90752694ms step_avg:1608.52ms +step:56440/60000 train_loss:2.0356 lr_mul:0.0119 train_time:90785122ms step_avg:1608.52ms +step:56460/60000 train_loss:2.0718 lr_mul:0.0118 train_time:90817567ms step_avg:1608.53ms +step:56480/60000 train_loss:1.9933 lr_mul:0.0117 train_time:90850017ms step_avg:1608.53ms +step:56500/60000 train_loss:2.0731 lr_mul:0.0117 train_time:90882607ms step_avg:1608.54ms + >> Checkpoint saved: ./checkpoints/ckpt_step56500.pt +step:56520/60000 train_loss:1.9909 lr_mul:0.0116 train_time:90915214ms step_avg:1608.55ms +step:56540/60000 train_loss:2.0372 lr_mul:0.0115 train_time:90947630ms step_avg:1608.55ms +step:56560/60000 train_loss:2.0498 lr_mul:0.0115 train_time:90980070ms step_avg:1608.56ms +step:56580/60000 train_loss:2.0157 lr_mul:0.0114 train_time:91012497ms step_avg:1608.56ms +step:56600/60000 train_loss:2.0297 lr_mul:0.0113 train_time:91044914ms step_avg:1608.57ms +step:56620/60000 train_loss:2.0600 lr_mul:0.0113 train_time:91077350ms step_avg:1608.57ms +step:56640/60000 train_loss:2.1248 lr_mul:0.0112 train_time:91109777ms step_avg:1608.58ms +step:56660/60000 train_loss:2.1202 lr_mul:0.0111 train_time:91142210ms step_avg:1608.58ms +step:56680/60000 train_loss:2.1313 lr_mul:0.0111 train_time:91174637ms step_avg:1608.59ms +step:56700/60000 train_loss:2.0865 lr_mul:0.0110 
train_time:91207064ms step_avg:1608.59ms +step:56720/60000 train_loss:2.0529 lr_mul:0.0109 train_time:91239508ms step_avg:1608.59ms +step:56740/60000 train_loss:2.0355 lr_mul:0.0109 train_time:91271939ms step_avg:1608.60ms +step:56760/60000 train_loss:2.0606 lr_mul:0.0108 train_time:91304382ms step_avg:1608.60ms +step:56780/60000 train_loss:2.1495 lr_mul:0.0107 train_time:91336806ms step_avg:1608.61ms +step:56800/60000 train_loss:2.0537 lr_mul:0.0107 train_time:91369521ms step_avg:1608.62ms +step:56820/60000 train_loss:2.1298 lr_mul:0.0106 train_time:91401953ms step_avg:1608.62ms +step:56840/60000 train_loss:2.0335 lr_mul:0.0105 train_time:91434378ms step_avg:1608.63ms +step:56860/60000 train_loss:2.0640 lr_mul:0.0105 train_time:91466812ms step_avg:1608.63ms +step:56880/60000 train_loss:2.0262 lr_mul:0.0104 train_time:91499266ms step_avg:1608.64ms +step:56900/60000 train_loss:2.0726 lr_mul:0.0103 train_time:91531720ms step_avg:1608.64ms +step:56920/60000 train_loss:2.1820 lr_mul:0.0103 train_time:91564157ms step_avg:1608.65ms +step:56940/60000 train_loss:1.9506 lr_mul:0.0102 train_time:91596582ms step_avg:1608.65ms +step:56960/60000 train_loss:2.0828 lr_mul:0.0101 train_time:91629037ms step_avg:1608.66ms +step:56980/60000 train_loss:1.9933 lr_mul:0.0101 train_time:91661509ms step_avg:1608.66ms +step:57000/60000 train_loss:2.0673 lr_mul:0.0100 train_time:91693967ms step_avg:1608.67ms + >> Checkpoint saved: ./checkpoints/ckpt_step57000.pt + eval: 100 batches (limited) + eval: batch 20/100 running_loss:2.0279 + eval: batch 40/100 running_loss:2.0235 + eval: batch 60/100 running_loss:2.0160 + eval: batch 80/100 running_loss:2.0175 + eval: batch 100/100 running_loss:2.0256 +step:57000/60000 val_loss:2.0256 val_bpb:1.2160 train_time:91694174ms +step:57020/60000 train_loss:2.0035 lr_mul:0.0099 train_time:91726655ms step_avg:1608.68ms +step:57040/60000 train_loss:2.0038 lr_mul:0.0099 train_time:91759107ms step_avg:1608.68ms +step:57060/60000 train_loss:1.9740 lr_mul:0.0098 train_time:91791561ms step_avg:1608.68ms +step:57080/60000 train_loss:2.0352 lr_mul:0.0097 train_time:91824191ms step_avg:1608.69ms +step:57100/60000 train_loss:1.8961 lr_mul:0.0097 train_time:91856631ms step_avg:1608.70ms +step:57120/60000 train_loss:2.0188 lr_mul:0.0096 train_time:91889076ms step_avg:1608.70ms +step:57140/60000 train_loss:1.9681 lr_mul:0.0095 train_time:91921502ms step_avg:1608.71ms +step:57160/60000 train_loss:2.0713 lr_mul:0.0095 train_time:91953937ms step_avg:1608.71ms +step:57180/60000 train_loss:1.9868 lr_mul:0.0094 train_time:91986356ms step_avg:1608.72ms +step:57200/60000 train_loss:2.0545 lr_mul:0.0093 train_time:92018769ms step_avg:1608.72ms +step:57220/60000 train_loss:1.9695 lr_mul:0.0093 train_time:92051215ms step_avg:1608.72ms +step:57240/60000 train_loss:2.0204 lr_mul:0.0092 train_time:92083624ms step_avg:1608.73ms +step:57260/60000 train_loss:2.0369 lr_mul:0.0091 train_time:92116046ms step_avg:1608.73ms +step:57280/60000 train_loss:2.1109 lr_mul:0.0091 train_time:92148469ms step_avg:1608.74ms +step:57300/60000 train_loss:1.9996 lr_mul:0.0090 train_time:92180881ms step_avg:1608.74ms +step:57320/60000 train_loss:2.0396 lr_mul:0.0089 train_time:92213288ms step_avg:1608.75ms +step:57340/60000 train_loss:2.0465 lr_mul:0.0089 train_time:92245683ms step_avg:1608.75ms +step:57360/60000 train_loss:2.0589 lr_mul:0.0088 train_time:92278234ms step_avg:1608.76ms +step:57380/60000 train_loss:2.0249 lr_mul:0.0087 train_time:92310642ms step_avg:1608.76ms +step:57400/60000 train_loss:2.0538 lr_mul:0.0087 
train_time:92343038ms step_avg:1608.76ms +step:57420/60000 train_loss:2.0732 lr_mul:0.0086 train_time:92375426ms step_avg:1608.77ms +step:57440/60000 train_loss:2.0207 lr_mul:0.0085 train_time:92407818ms step_avg:1608.77ms +step:57460/60000 train_loss:2.0722 lr_mul:0.0085 train_time:92440232ms step_avg:1608.78ms +step:57480/60000 train_loss:2.0311 lr_mul:0.0084 train_time:92472640ms step_avg:1608.78ms +step:57500/60000 train_loss:2.1440 lr_mul:0.0083 train_time:92505053ms step_avg:1608.78ms + >> Checkpoint saved: ./checkpoints/ckpt_step57500.pt +step:57520/60000 train_loss:2.0206 lr_mul:0.0083 train_time:92537639ms step_avg:1608.79ms +step:57540/60000 train_loss:2.0426 lr_mul:0.0082 train_time:92570041ms step_avg:1608.79ms +step:57560/60000 train_loss:2.4093 lr_mul:0.0081 train_time:92602443ms step_avg:1608.80ms +step:57580/60000 train_loss:2.0083 lr_mul:0.0081 train_time:92634850ms step_avg:1608.80ms +step:57600/60000 train_loss:2.1137 lr_mul:0.0080 train_time:92667248ms step_avg:1608.81ms +step:57620/60000 train_loss:1.9534 lr_mul:0.0079 train_time:92699670ms step_avg:1608.81ms +step:57640/60000 train_loss:2.0657 lr_mul:0.0079 train_time:92732094ms step_avg:1608.81ms +step:57660/60000 train_loss:1.9410 lr_mul:0.0078 train_time:92764697ms step_avg:1608.82ms +step:57680/60000 train_loss:2.0540 lr_mul:0.0077 train_time:92797103ms step_avg:1608.83ms +step:57700/60000 train_loss:1.9795 lr_mul:0.0077 train_time:92829530ms step_avg:1608.83ms +step:57720/60000 train_loss:1.9622 lr_mul:0.0076 train_time:92861925ms step_avg:1608.83ms +step:57740/60000 train_loss:1.9521 lr_mul:0.0075 train_time:92894324ms step_avg:1608.84ms +step:57760/60000 train_loss:2.0622 lr_mul:0.0075 train_time:92926721ms step_avg:1608.84ms +step:57780/60000 train_loss:1.9841 lr_mul:0.0074 train_time:92959100ms step_avg:1608.85ms +step:57800/60000 train_loss:2.0255 lr_mul:0.0073 train_time:92991505ms step_avg:1608.85ms +step:57820/60000 train_loss:1.9585 lr_mul:0.0073 train_time:93023952ms step_avg:1608.85ms +step:57840/60000 train_loss:2.0321 lr_mul:0.0072 train_time:93056400ms step_avg:1608.86ms +step:57860/60000 train_loss:2.0107 lr_mul:0.0071 train_time:93088842ms step_avg:1608.86ms +step:57880/60000 train_loss:2.0422 lr_mul:0.0071 train_time:93121285ms step_avg:1608.87ms +step:57900/60000 train_loss:2.0188 lr_mul:0.0070 train_time:93153729ms step_avg:1608.87ms +step:57920/60000 train_loss:2.1070 lr_mul:0.0069 train_time:93186167ms step_avg:1608.88ms +step:57940/60000 train_loss:2.0333 lr_mul:0.0069 train_time:93218824ms step_avg:1608.89ms +step:57960/60000 train_loss:2.0260 lr_mul:0.0068 train_time:93251289ms step_avg:1608.89ms +step:57980/60000 train_loss:2.0564 lr_mul:0.0067 train_time:93283710ms step_avg:1608.89ms +step:58000/60000 train_loss:2.0467 lr_mul:0.0067 train_time:93316176ms step_avg:1608.90ms + >> Checkpoint saved: ./checkpoints/ckpt_step58000.pt + eval: 100 batches (limited) + eval: batch 20/100 running_loss:2.0281 + eval: batch 40/100 running_loss:2.0237 + eval: batch 60/100 running_loss:2.0162 + eval: batch 80/100 running_loss:2.0177 + eval: batch 100/100 running_loss:2.0258 +step:58000/60000 val_loss:2.0258 val_bpb:1.2161 train_time:93316380ms +step:58020/60000 train_loss:1.9960 lr_mul:0.0066 train_time:93348815ms step_avg:1608.91ms +step:58040/60000 train_loss:1.9972 lr_mul:0.0065 train_time:93381219ms step_avg:1608.91ms +step:58060/60000 train_loss:2.1030 lr_mul:0.0065 train_time:93413647ms step_avg:1608.92ms +step:58080/60000 train_loss:2.0455 lr_mul:0.0064 train_time:93446071ms step_avg:1608.92ms 
+step:58100/60000 train_loss:2.0123 lr_mul:0.0063 train_time:93478498ms step_avg:1608.92ms +step:58120/60000 train_loss:2.1076 lr_mul:0.0063 train_time:93510937ms step_avg:1608.93ms +step:58140/60000 train_loss:1.9743 lr_mul:0.0062 train_time:93543352ms step_avg:1608.93ms +step:58160/60000 train_loss:2.0845 lr_mul:0.0061 train_time:93575766ms step_avg:1608.94ms +step:58180/60000 train_loss:2.0607 lr_mul:0.0061 train_time:93608200ms step_avg:1608.94ms +step:58200/60000 train_loss:1.9888 lr_mul:0.0060 train_time:93640610ms step_avg:1608.95ms +step:58220/60000 train_loss:2.0731 lr_mul:0.0059 train_time:93673209ms step_avg:1608.95ms +step:58240/60000 train_loss:2.0221 lr_mul:0.0059 train_time:93705660ms step_avg:1608.96ms +step:58260/60000 train_loss:1.9996 lr_mul:0.0058 train_time:93738087ms step_avg:1608.96ms +step:58280/60000 train_loss:1.9625 lr_mul:0.0057 train_time:93770500ms step_avg:1608.97ms +step:58300/60000 train_loss:2.0210 lr_mul:0.0057 train_time:93802913ms step_avg:1608.97ms +step:58320/60000 train_loss:1.9888 lr_mul:0.0056 train_time:93835340ms step_avg:1608.97ms +step:58340/60000 train_loss:2.0053 lr_mul:0.0055 train_time:93867782ms step_avg:1608.98ms +step:58360/60000 train_loss:2.0433 lr_mul:0.0055 train_time:93900209ms step_avg:1608.98ms +step:58380/60000 train_loss:2.0075 lr_mul:0.0054 train_time:93932653ms step_avg:1608.99ms +step:58400/60000 train_loss:2.0720 lr_mul:0.0053 train_time:93965124ms step_avg:1608.99ms +step:58420/60000 train_loss:2.0158 lr_mul:0.0053 train_time:93997581ms step_avg:1609.00ms +step:58440/60000 train_loss:2.0984 lr_mul:0.0052 train_time:94030049ms step_avg:1609.00ms +step:58460/60000 train_loss:1.9720 lr_mul:0.0051 train_time:94062525ms step_avg:1609.01ms +step:58480/60000 train_loss:2.0857 lr_mul:0.0051 train_time:94094970ms step_avg:1609.01ms +step:58500/60000 train_loss:1.9488 lr_mul:0.0050 train_time:94127404ms step_avg:1609.02ms + >> Checkpoint saved: ./checkpoints/ckpt_step58500.pt +step:58520/60000 train_loss:2.0814 lr_mul:0.0049 train_time:94160276ms step_avg:1609.03ms +step:58540/60000 train_loss:1.9112 lr_mul:0.0049 train_time:94192701ms step_avg:1609.03ms +step:58560/60000 train_loss:2.0545 lr_mul:0.0048 train_time:94225143ms step_avg:1609.04ms +step:58580/60000 train_loss:1.9191 lr_mul:0.0047 train_time:94257589ms step_avg:1609.04ms +step:58600/60000 train_loss:2.1430 lr_mul:0.0047 train_time:94290025ms step_avg:1609.04ms +step:58620/60000 train_loss:1.9653 lr_mul:0.0046 train_time:94322453ms step_avg:1609.05ms +step:58640/60000 train_loss:2.1084 lr_mul:0.0045 train_time:94354883ms step_avg:1609.05ms +step:58660/60000 train_loss:1.9534 lr_mul:0.0045 train_time:94387305ms step_avg:1609.06ms +step:58680/60000 train_loss:2.0869 lr_mul:0.0044 train_time:94419713ms step_avg:1609.06ms +step:58700/60000 train_loss:1.9880 lr_mul:0.0043 train_time:94452152ms step_avg:1609.07ms +step:58720/60000 train_loss:2.1068 lr_mul:0.0043 train_time:94484578ms step_avg:1609.07ms +step:58740/60000 train_loss:2.0180 lr_mul:0.0042 train_time:94516982ms step_avg:1609.07ms +step:58760/60000 train_loss:2.0334 lr_mul:0.0041 train_time:94549409ms step_avg:1609.08ms +step:58780/60000 train_loss:1.9471 lr_mul:0.0041 train_time:94581830ms step_avg:1609.08ms +step:58800/60000 train_loss:2.1275 lr_mul:0.0040 train_time:94614423ms step_avg:1609.09ms +step:58820/60000 train_loss:2.0357 lr_mul:0.0039 train_time:94646851ms step_avg:1609.09ms +step:58840/60000 train_loss:2.0492 lr_mul:0.0039 train_time:94679295ms step_avg:1609.10ms +step:58860/60000 train_loss:2.0808 
lr_mul:0.0038 train_time:94711730ms step_avg:1609.10ms +step:58880/60000 train_loss:2.1050 lr_mul:0.0037 train_time:94744136ms step_avg:1609.11ms +step:58900/60000 train_loss:2.0213 lr_mul:0.0037 train_time:94776555ms step_avg:1609.11ms +step:58920/60000 train_loss:2.0328 lr_mul:0.0036 train_time:94808995ms step_avg:1609.11ms +step:58940/60000 train_loss:2.0826 lr_mul:0.0035 train_time:94841418ms step_avg:1609.12ms +step:58960/60000 train_loss:2.0098 lr_mul:0.0035 train_time:94873862ms step_avg:1609.12ms +step:58980/60000 train_loss:2.0743 lr_mul:0.0034 train_time:94906289ms step_avg:1609.13ms +step:59000/60000 train_loss:2.0971 lr_mul:0.0033 train_time:94938704ms step_avg:1609.13ms + >> Checkpoint saved: ./checkpoints/ckpt_step59000.pt + eval: 100 batches (limited) + eval: batch 20/100 running_loss:2.0278 + eval: batch 40/100 running_loss:2.0234 + eval: batch 60/100 running_loss:2.0159 + eval: batch 80/100 running_loss:2.0175 + eval: batch 100/100 running_loss:2.0256 +step:59000/60000 val_loss:2.0256 val_bpb:1.2160 train_time:94938916ms +step:59020/60000 train_loss:2.0620 lr_mul:0.0033 train_time:94971412ms step_avg:1609.14ms +step:59040/60000 train_loss:2.0137 lr_mul:0.0032 train_time:95003821ms step_avg:1609.14ms +step:59060/60000 train_loss:2.0441 lr_mul:0.0031 train_time:95036209ms step_avg:1609.15ms +step:59080/60000 train_loss:2.0807 lr_mul:0.0031 train_time:95068618ms step_avg:1609.15ms +step:59100/60000 train_loss:1.9739 lr_mul:0.0030 train_time:95101290ms step_avg:1609.16ms +step:59120/60000 train_loss:2.0538 lr_mul:0.0029 train_time:95133705ms step_avg:1609.16ms +step:59140/60000 train_loss:2.0228 lr_mul:0.0029 train_time:95166115ms step_avg:1609.17ms +step:59160/60000 train_loss:2.0736 lr_mul:0.0028 train_time:95198530ms step_avg:1609.17ms +step:59180/60000 train_loss:2.0499 lr_mul:0.0027 train_time:95230949ms step_avg:1609.17ms +step:59200/60000 train_loss:2.0094 lr_mul:0.0027 train_time:95263412ms step_avg:1609.18ms +step:59220/60000 train_loss:2.0156 lr_mul:0.0026 train_time:95295861ms step_avg:1609.18ms +step:59240/60000 train_loss:2.0228 lr_mul:0.0025 train_time:95328302ms step_avg:1609.19ms +step:59260/60000 train_loss:1.9823 lr_mul:0.0025 train_time:95360777ms step_avg:1609.19ms +step:59280/60000 train_loss:2.0586 lr_mul:0.0024 train_time:95393227ms step_avg:1609.20ms +step:59300/60000 train_loss:2.0811 lr_mul:0.0023 train_time:95425669ms step_avg:1609.20ms +step:59320/60000 train_loss:1.9258 lr_mul:0.0023 train_time:95458107ms step_avg:1609.21ms +step:59340/60000 train_loss:2.0329 lr_mul:0.0022 train_time:95490560ms step_avg:1609.21ms +step:59360/60000 train_loss:2.0505 lr_mul:0.0021 train_time:95523003ms step_avg:1609.22ms +step:59380/60000 train_loss:2.0597 lr_mul:0.0021 train_time:95555628ms step_avg:1609.22ms +step:59400/60000 train_loss:1.9542 lr_mul:0.0020 train_time:95588078ms step_avg:1609.23ms +step:59420/60000 train_loss:1.9972 lr_mul:0.0019 train_time:95620539ms step_avg:1609.23ms +step:59440/60000 train_loss:2.0525 lr_mul:0.0019 train_time:95652996ms step_avg:1609.24ms +step:59460/60000 train_loss:2.0270 lr_mul:0.0018 train_time:95685457ms step_avg:1609.24ms +step:59480/60000 train_loss:2.0308 lr_mul:0.0017 train_time:95717893ms step_avg:1609.25ms +step:59500/60000 train_loss:2.0861 lr_mul:0.0017 train_time:95750312ms step_avg:1609.25ms + >> Checkpoint saved: ./checkpoints/ckpt_step59500.pt +step:59520/60000 train_loss:2.0519 lr_mul:0.0016 train_time:95782943ms step_avg:1609.26ms +step:59540/60000 train_loss:2.1035 lr_mul:0.0015 train_time:95815396ms 
step_avg:1609.26ms +step:59560/60000 train_loss:2.0078 lr_mul:0.0015 train_time:95847844ms step_avg:1609.27ms +step:59580/60000 train_loss:2.0825 lr_mul:0.0014 train_time:95880271ms step_avg:1609.27ms +step:59600/60000 train_loss:2.0153 lr_mul:0.0013 train_time:95912687ms step_avg:1609.27ms +step:59620/60000 train_loss:2.0228 lr_mul:0.0013 train_time:95945103ms step_avg:1609.28ms +step:59640/60000 train_loss:2.0515 lr_mul:0.0012 train_time:95977522ms step_avg:1609.28ms +step:59660/60000 train_loss:2.0097 lr_mul:0.0011 train_time:96010210ms step_avg:1609.29ms +step:59680/60000 train_loss:2.0752 lr_mul:0.0011 train_time:96042654ms step_avg:1609.29ms +step:59700/60000 train_loss:2.0970 lr_mul:0.0010 train_time:96075096ms step_avg:1609.30ms +step:59720/60000 train_loss:2.0209 lr_mul:0.0009 train_time:96107540ms step_avg:1609.30ms +step:59740/60000 train_loss:2.0919 lr_mul:0.0009 train_time:96139999ms step_avg:1609.31ms +step:59760/60000 train_loss:1.9990 lr_mul:0.0008 train_time:96172437ms step_avg:1609.31ms +step:59780/60000 train_loss:2.0155 lr_mul:0.0007 train_time:96204895ms step_avg:1609.32ms +step:59800/60000 train_loss:2.0168 lr_mul:0.0007 train_time:96237357ms step_avg:1609.32ms +step:59820/60000 train_loss:2.0889 lr_mul:0.0006 train_time:96269800ms step_avg:1609.32ms +step:59840/60000 train_loss:2.0525 lr_mul:0.0005 train_time:96302246ms step_avg:1609.33ms +step:59860/60000 train_loss:1.9267 lr_mul:0.0005 train_time:96334702ms step_avg:1609.33ms +step:59880/60000 train_loss:2.0699 lr_mul:0.0004 train_time:96367148ms step_avg:1609.34ms +step:59900/60000 train_loss:1.9818 lr_mul:0.0003 train_time:96399589ms step_avg:1609.34ms +step:59920/60000 train_loss:1.9927 lr_mul:0.0003 train_time:96432019ms step_avg:1609.35ms +step:59940/60000 train_loss:1.8720 lr_mul:0.0002 train_time:96464464ms step_avg:1609.35ms +step:59960/60000 train_loss:2.0201 lr_mul:0.0001 train_time:96497076ms step_avg:1609.36ms +step:59980/60000 train_loss:2.0668 lr_mul:0.0001 train_time:96529542ms step_avg:1609.36ms +step:60000/60000 train_loss:2.0225 lr_mul:0.0000 train_time:96561976ms step_avg:1609.37ms + >> Checkpoint saved: ./checkpoints/ckpt_step60000.pt + eval: 500 batches (limited) + eval: batch 20/500 running_loss:2.0273 + eval: batch 40/500 running_loss:2.0229 + eval: batch 60/500 running_loss:2.0154 + eval: batch 80/500 running_loss:2.0169 + eval: batch 100/500 running_loss:2.0250 + eval: batch 120/500 running_loss:2.0248 + eval: batch 140/500 running_loss:2.0240 + eval: batch 160/500 running_loss:2.0294 + eval: batch 180/500 running_loss:2.0325 + eval: batch 200/500 running_loss:2.0350 + eval: batch 220/500 running_loss:2.0356 + eval: batch 240/500 running_loss:2.0373 + eval: batch 260/500 running_loss:2.0373 + eval: batch 280/500 running_loss:2.0397 + eval: batch 300/500 running_loss:2.0383 + eval: batch 320/500 running_loss:2.0371 + eval: batch 340/500 running_loss:2.0361 + eval: batch 360/500 running_loss:2.0356 + eval: batch 380/500 running_loss:2.0356 + eval: batch 400/500 running_loss:2.0347 + eval: batch 420/500 running_loss:2.0331 + eval: batch 440/500 running_loss:2.0323 + eval: batch 460/500 running_loss:2.0300 + eval: batch 480/500 running_loss:2.0305 + eval: batch 500/500 running_loss:2.0322 +step:60000/60000 val_loss:2.0322 val_bpb:1.2079 train_time:96562184ms +peak memory: 23749 MiB +--- Saving model --- +Raw model: 85587955 bytes | Code: 70969 bytes | Total: 85658924 +--- Quantizing to INT7+zlib --- + quantized 73 float tensors, 0 passthrough + raw quantized size: 24208849 bytes + zlib 
compressed: 15370972 bytes
+INT7+zlib model: 15370972 bytes | Total submission: 15441941 bytes
+Within 16MB budget: YES (15.44 MB)
+--- Roundtrip validation (loading INT7 model) ---
+ decompressing and loading weights...
+ weights loaded, starting eval...
+ eval: 500 batches (limited)
+ eval: batch 20/500 running_loss:2.0740
+ eval: batch 40/500 running_loss:2.0694
+ eval: batch 60/500 running_loss:2.0617
+ eval: batch 80/500 running_loss:2.0634
+ eval: batch 100/500 running_loss:2.0717
+ eval: batch 120/500 running_loss:2.0715
+ eval: batch 140/500 running_loss:2.0706
+ eval: batch 160/500 running_loss:2.0761
+ eval: batch 180/500 running_loss:2.0792
+ eval: batch 200/500 running_loss:2.0815
+ eval: batch 220/500 running_loss:2.0821
+ eval: batch 240/500 running_loss:2.0837
+ eval: batch 260/500 running_loss:2.0836
+ eval: batch 280/500 running_loss:2.0859
+ eval: batch 300/500 running_loss:2.0844
+ eval: batch 320/500 running_loss:2.0832
+ eval: batch 340/500 running_loss:2.0822
+ eval: batch 360/500 running_loss:2.0816
+ eval: batch 380/500 running_loss:2.0816
+ eval: batch 400/500 running_loss:2.0808
+ eval: batch 420/500 running_loss:2.0792
+ eval: batch 440/500 running_loss:2.0784
+ eval: batch 460/500 running_loss:2.0762
+ eval: batch 480/500 running_loss:2.0767
+ eval: batch 500/500 running_loss:2.0784
+final_INT7_zlib_roundtrip val_loss:2.0784 val_bpb:1.2353
diff --git a/records/track_non_record_16mb/2026-03-31_LegendreGPT/train_gpt_legendre.py b/records/track_non_record_16mb/2026-03-31_LegendreGPT/train_gpt_legendre.py
new file mode 100644
index 0000000000..0ddf9a16e8
--- /dev/null
+++ b/records/track_non_record_16mb/2026-03-31_LegendreGPT/train_gpt_legendre.py
@@ -0,0 +1,1639 @@
+"""
+=============================================================================
+LEGENDRE-PARAMETERIZED GPT for OpenAI Parameter Golf Challenge
+=============================================================================
+
+Core idea: Instead of storing independent weight matrices for each transformer
+layer, we parameterize weights as smooth functions of depth using Legendre
+polynomials. This gives us ~4x parameter compression, allowing more effective
+depth (24 virtual layers) within the 16MB budget.
+
+Architecture:
+  - Sandwich structure: independent first/last layers + Legendre middle layers
+  - Legendre degree 5 (6 coefficients) for attention weights
+  - Legendre degree 2 (3 coefficients) for FFN weights (more compressible)
+  - Per-layer learned scalar for magnitude adjustment
+  - Optional wrap (modulo) operation for fast weight variation
+  - Factorized embeddings (ALBERT-style) for vocab compression
+
+Two configs:
+  - LOCAL: dim=256, 8 virtual layers, for RTX 4060 (8GB VRAM)
+  - FULL: dim=512, 24 virtual layers, for 8xH100 submission
+
+Usage (local):
+    MODE=local RUN_ID=legendre_test python3 train_gpt_legendre.py
+
+Usage (H100):
+    MODE=full RUN_ID=legendre_sub \
+        torchrun --standalone --nproc_per_node=8 train_gpt_legendre.py
+"""
+
+from __future__ import annotations
+
+import copy
+import glob
+import io
+import math
+import os
+import random
+import subprocess
+import sys
+import time
+import uuid
+import zlib
+from pathlib import Path
+
+import numpy as np
+import sentencepiece as spm
+import torch
+import torch.distributed as dist
+import torch.nn.functional as F
+from torch import Tensor, nn
+from torch.nn.parallel import DistributedDataParallel as DDP
+
+
+# =============================================================================
+# CONFIGURATION
+# =============================================================================
+
+MODE = os.environ.get("MODE", "local")  # "local" or "full"
+
+
+class Hyperparameters:
+    # --- Data paths ---
+    data_path = os.environ.get("DATA_PATH", "./data/datasets/fineweb10B_sp1024")
+    train_files = os.path.join(data_path, "fineweb_train_*.bin")
+    val_files = os.path.join(data_path, "fineweb_val_*.bin")
+    tokenizer_path = os.environ.get("TOKENIZER_PATH", "./data/tokenizers/fineweb_1024_bpe.model")
+    run_id = os.environ.get("RUN_ID", str(uuid.uuid4()))
+    seed = int(os.environ.get("SEED", 1337))
+
+    # --- Validation ---
+    _val_defaults = {"full": (524_288, 1000, 0, 200), "medium": (32768, 500, 500, 50), "local": (65536, 500, 1000, 20)}
+    _vd = _val_defaults.get(MODE, _val_defaults["local"])
+    val_batch_size = int(os.environ.get("VAL_BATCH_SIZE", _vd[0]))
+    val_loss_every = int(os.environ.get("VAL_LOSS_EVERY", _vd[1]))
+    val_max_batches = int(os.environ.get("VAL_MAX_BATCHES", _vd[2]))  # 0 = all
+    train_log_every = int(os.environ.get("TRAIN_LOG_EVERY", _vd[3]))
+
+    # --- Training length ---
+    _train_defaults = {"full": (20000, 1200, 20, 524_288, 1024, 600.0),
+                       "medium": (2000, 1500, 10, 16384, 512, 0),
+                       "local": (2000, 1500, 10, 8192, 512, 0)}
+    _td = _train_defaults.get(MODE, _train_defaults["local"])
+    iterations = int(os.environ.get("ITERATIONS", _td[0]))
+    warmdown_iters = int(os.environ.get("WARMDOWN_ITERS", _td[1]))
+    warmup_steps = int(os.environ.get("WARMUP_STEPS", _td[2]))
+    train_batch_tokens = int(os.environ.get("TRAIN_BATCH_TOKENS", _td[3]))
+    train_seq_len = int(os.environ.get("TRAIN_SEQ_LEN", _td[4]))
+    max_wallclock_seconds = float(os.environ.get("MAX_WALLCLOCK_SECONDS", _td[5]))
+    qk_gain_init = float(os.environ.get("QK_GAIN_INIT", 1.5))
+
+    # --- Model shape (differs by mode, overridable by env vars) ---
+    vocab_size = int(os.environ.get("VOCAB_SIZE", 1024))
+
+    if MODE == "full":
+        _defaults = (24, 512, 8, 4, 3, 128, 5, 2)
+    elif MODE == "medium":
+        _defaults = (16, 512, 8, 4, 3, 128, 5, 2)
+    else:
+        _defaults = (8, 256, 4, 2, 2, 64, 5, 2)
+
+    num_virtual_layers = int(os.environ.get("NUM_VIRTUAL_LAYERS", _defaults[0]))
+    model_dim = int(os.environ.get("MODEL_DIM", _defaults[1]))
+    num_heads = int(os.environ.get("NUM_HEADS", _defaults[2]))
+    num_kv_heads = int(os.environ.get("NUM_KV_HEADS", _defaults[3]))
+    mlp_mult = int(os.environ.get("MLP_MULT", _defaults[4]))
+    embed_inner_dim = int(os.environ.get("EMBED_INNER_DIM", _defaults[5]))
+    legendre_degree_attn = int(os.environ.get("LEGENDRE_DEGREE_ATTN", _defaults[6]))
+    legendre_degree_ffn = int(os.environ.get("LEGENDRE_DEGREE_FFN", _defaults[7]))
+
+    tie_embeddings = True
+    rope_base = float(os.environ.get("ROPE_BASE", 10000.0))
+    logit_softcap = float(os.environ.get("LOGIT_SOFTCAP", 30.0))
+
+    # --- Legendre-specific ---
+    use_wrap = bool(int(os.environ.get("USE_WRAP", "0")))  # modulo wrap experiment
+    wrap_init_scale = float(os.environ.get("WRAP_INIT_SCALE", 8.0))  # larger init when wrap is active
+    lora_rank = int(os.environ.get("LORA_RANK", 0))  # 0 = disabled
+    legendre_groups = int(os.environ.get("LEGENDRE_GROUPS", 1))  # split middle layers into N groups
+    qat_after_step = int(os.environ.get("QAT_AFTER_STEP", 0))  # 0 = disabled
+    qat_bits = int(os.environ.get("QAT_BITS", 7))  # quantization bits for QAT
+
+    # --- Optimizer ---
+    embed_lr = float(os.environ.get("EMBED_LR", 0.6))
+    tied_embed_lr = float(os.environ.get("TIED_EMBED_LR", 0.05))
+    tied_embed_init_std = float(os.environ.get("TIED_EMBED_INIT_STD", 0.005))
+    # Legendre coefficient LRs: scaled by polynomial order
+    coeff_lr_base = float(os.environ.get("COEFF_LR_BASE", 0.025))
+    coeff_lr_scale_per_order = float(os.environ.get("COEFF_LR_SCALE", 1.1))
+    independent_layer_lr = float(os.environ.get("INDEPENDENT_LR", 0.04))
+    scalar_lr = float(os.environ.get("SCALAR_LR", 0.04))
+    muon_momentum = float(os.environ.get("MUON_MOMENTUM", 0.95))
+    muon_backend_steps = int(os.environ.get("MUON_BACKEND_STEPS", 5))
+    muon_momentum_warmup_start = float(os.environ.get("MUON_MOMENTUM_WARMUP_START", 0.85))
+    muon_momentum_warmup_steps = int(os.environ.get("MUON_MOMENTUM_WARMUP_STEPS", 500))
+    muon_cooldown_start = int(os.environ.get("MUON_COOLDOWN_START", 10000))
+    muon_cooldown_end = float(os.environ.get("MUON_COOLDOWN_END", 0.05))
+    beta1 = float(os.environ.get("BETA1", 0.9))
+    beta2 = float(os.environ.get("BETA2", 0.95))
+    adam_eps = float(os.environ.get("ADAM_EPS", 1e-8))
+    grad_clip_norm = float(os.environ.get("GRAD_CLIP_NORM", 0.0))
+
+    # --- Checkpointing ---
+    checkpoint_every = int(os.environ.get("CHECKPOINT_EVERY", 500))  # save every N steps (0=disabled)
+    checkpoint_dir = os.environ.get("CHECKPOINT_DIR", "./checkpoints")
+    resume_from = os.environ.get("RESUME_FROM", "")  # path to checkpoint to resume from
+    skip_tokens = int(os.environ.get("SKIP_TOKENS", 0))  # manually skip N tokens at start
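+
+
+# Illustration (hypothetical helper, not referenced elsewhere in this file):
+# the per-order learning rates implied by coeff_lr_base and
+# coeff_lr_scale_per_order. Order k trains at base * scale**k, so the defaults
+# (0.025, 1.1) give ~0.025 for order 0 and ~0.040 for order 5, matching the
+# README's "order 0 at 0.025, order 5 at 0.040".
+def _example_per_order_lrs(base: float = 0.025, scale: float = 1.1, num_orders: int = 6) -> list[float]:
+    return [base * scale ** k for k in range(num_orders)]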
+
+
+# =============================================================================
+# MUON OPTIMIZER (from modded-nanogpt, unchanged)
+# =============================================================================
+
+def zeropower_via_newtonschulz5(G: Tensor, steps: int = 10, eps: float = 1e-7) -> Tensor:
+    a, b, c = (3.4445, -4.7750, 2.0315)
+    X = G.bfloat16()
+    X /= X.norm() + eps
+    transposed = G.size(0) > G.size(1)
+    if transposed:
+        X = X.T
+    for _ in range(steps):
+        A = X @ X.T
+        B = b * A + c * A @ A
+        X = a * X + B @ X
+    return X.T if transposed else X
+
+
+class Muon(torch.optim.Optimizer):
+    def __init__(self, params, lr: float, momentum: float, backend_steps: int, nesterov: bool = True):
+        super().__init__(params, dict(lr=lr, momentum=momentum, backend_steps=backend_steps, nesterov=nesterov))
+
+    @torch.no_grad()
+    def step(self, closure=None):
+        loss = None
+        if closure is not None:
+            with torch.enable_grad():
+                loss = closure()
+
+        distributed = dist.is_available() and dist.is_initialized()
+        world_size = dist.get_world_size() if distributed else 1
+        rank = dist.get_rank() if distributed else 0
+
+        for group in self.param_groups:
+            params = group["params"]
+            if not params:
+                continue
+            lr = group["lr"]
+            momentum = group["momentum"]
+            backend_steps = group["backend_steps"]
+            nesterov = group["nesterov"]
+
+            total_params = sum(int(p.numel()) for p in params)
+            updates_flat = torch.zeros(total_params, device=params[0].device, dtype=torch.bfloat16)
+
+            curr = 0
+            for i, p in enumerate(params):
+                if i % world_size == rank and p.grad is not None:
+                    g = p.grad
+                    state = self.state[p]
+                    if "momentum_buffer" not in state:
+                        state["momentum_buffer"] = torch.zeros_like(g)
+                    buf = state["momentum_buffer"]
+                    buf.mul_(momentum).add_(g)
+                    if nesterov:
+                        g = g.add(buf, alpha=momentum)
+                    g = zeropower_via_newtonschulz5(g, steps=backend_steps)
+                    g *= max(1, g.size(0) / g.size(1)) ** 0.5
+                    updates_flat[curr : curr + p.numel()] = g.reshape(-1)
+                # Advance the offset for every param, processed on this rank or
+                # not, so all ranks agree on each param's slice before all_reduce.
+                curr += p.numel()
+
+            if distributed:
+                dist.all_reduce(updates_flat, op=dist.ReduceOp.SUM)
+
+            curr = 0
+            for p in params:
+                g = updates_flat[curr : curr + p.numel()].view_as(p).to(dtype=p.dtype)
+                p.add_(g, alpha=-lr)
+                curr += p.numel()
+
+        return loss
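+
+
+# Illustration (hypothetical helper; the actual schedule code lives later in
+# this file and may differ in detail): the momentum cooldown described in the
+# README. Momentum is held at muon_momentum (0.95) until muon_cooldown_start,
+# then decays linearly to muon_cooldown_end (0.05) by the final step.
+def _example_muon_momentum_at(step: int, total_steps: int = 60_000,
+                              start: int = 10_000, hi: float = 0.95, lo: float = 0.05) -> float:
+    if step <= start:
+        return hi
+    frac = min(1.0, (step - start) / max(1, total_steps - start))
+    return hi + frac * (lo - hi)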
+
+
+# =============================================================================
+# LEGENDRE POLYNOMIAL BASIS (pure PyTorch, no scipy)
+# =============================================================================
+
+def legendre_basis(num_layers: int, degree: int, device: torch.device = None) -> Tensor:
+    """
+    Compute Legendre polynomial basis matrix B of shape (num_layers, degree+1).
+
+    Maps layer indices [0, num_layers-1] to t in [-1, 1], then evaluates
+    P_0(t), P_1(t), ..., P_degree(t) using the Bonnet recurrence:
+        (k+1) P_{k+1}(t) = (2k+1) t P_k(t) - k P_{k-1}(t)
+
+    This is numerically stable and runs entirely on GPU.
+    """
+    if num_layers == 1:
+        t = torch.zeros(1, device=device)
+    else:
+        t = torch.linspace(-1.0, 1.0, num_layers, device=device)
+
+    K = degree + 1  # number of basis functions
+    B = torch.zeros(num_layers, K, device=device)
+    B[:, 0] = 1.0  # P_0(t) = 1
+    if K > 1:
+        B[:, 1] = t  # P_1(t) = t
+    for k in range(1, K - 1):
+        # Bonnet's recurrence
+        B[:, k + 1] = ((2 * k + 1) * t * B[:, k] - k * B[:, k - 1]) / (k + 1)
+
+    return B  # (num_layers, K)
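+
+
+# Illustration (hypothetical check, not called by the training code): the
+# conditioning argument for an orthogonal basis. Comparing the condition
+# number of the Legendre basis matrix against a raw monomial (Vandermonde)
+# basis on the same grid shows the monomials degrading much faster as the
+# degree grows, which is the ill-conditioning point made in the README.
+def _example_basis_conditioning(num_layers: int = 11, degree: int = 5) -> tuple[float, float]:
+    B = legendre_basis(num_layers, degree)                       # (L, K) Legendre basis
+    t = torch.linspace(-1.0, 1.0, num_layers)
+    V = torch.stack([t ** k for k in range(degree + 1)], dim=1)  # (L, K) monomial basis
+
+    def cond(M: Tensor) -> float:
+        s = torch.linalg.svdvals(M)  # singular values, descending
+        return (s[0] / s[-1]).item()
+
+    return cond(B), cond(V)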
+
+
+# =============================================================================
+# LEGENDRE WEIGHT GENERATOR
+# =============================================================================
+
+class LegendreWeightGen(nn.Module):
+    """
+    Generates weight matrices for multiple virtual layers from Legendre coefficients.
+
+    Stores K separate 2D coefficient matrices (in_features, out_features) — one per
+    polynomial order. This ensures Muon compatibility (needs 2D) and enables
+    per-order learning rates naturally.
+
+    Generates weights W(l) = sum_k C_k * P_k(t_l) for each virtual layer l.
+    Per-layer scalar s_l adjusts magnitude independently.
+    Optional wrap mode: W_final = W - round(W) for sharp transitions.
+    """
+
+    def __init__(
+        self,
+        num_layers: int,
+        degree: int,
+        in_features: int,
+        out_features: int,
+        use_wrap: bool = False,
+        name: str = "",
+    ):
+        super().__init__()
+        self.num_layers = num_layers
+        self.degree = degree
+        self.K = degree + 1
+        self.in_features = in_features
+        self.out_features = out_features
+        self.use_wrap = use_wrap
+        self.name = name
+
+        # Legendre basis matrix: (num_layers, K) — NOT a parameter, precomputed
+        self.register_buffer("basis", legendre_basis(num_layers, degree))
+
+        # K separate 2D coefficient matrices — Muon-compatible, per-order LR ready
+        # c_0 is the "mean" weight, higher orders capture depth variation
+        self.coeffs = nn.ParameterList([
+            nn.Parameter(torch.zeros(in_features, out_features))
+            for _ in range(self.K)
+        ])
+
+        # Per-layer scalar for magnitude control: (num_layers,)
+        self.layer_scales = nn.Parameter(torch.ones(num_layers))
+
+        self._init_coeffs()
+
+    def _init_coeffs(self):
+        """Initialize C_0 with standard init, higher orders progressively smaller."""
+        with torch.no_grad():
+            fan_in = self.in_features
+            std = 1.0 / math.sqrt(fan_in)
+            if self.use_wrap:
+                std *= Hyperparameters.wrap_init_scale
+            for k in range(self.K):
+                decay = 0.5 ** k  # each order gets 2x smaller init
+                nn.init.normal_(self.coeffs[k], mean=0.0, std=std * decay)
+
+    def _stack_coeffs(self) -> Tensor:
+        """Stack K separate 2D params into (K, in, out) for einsum."""
+        stacked = torch.stack(list(self.coeffs), dim=0)  # (K, in, out)
+        if getattr(self, "qat_active", False):
+            # Fake quantize the coefficients (what actually gets quantized at save time)
+            bits = getattr(self, "qat_bits", 7)
+            stacked = torch.stack([self.fake_quantize(stacked[k], bits=bits) for k in range(stacked.shape[0])], dim=0)
+        return stacked
+
+    @staticmethod
+    def fake_quantize(tensor: Tensor, bits: int = 7) -> Tensor:
+        """Straight-through fake quantization matching real quantize_float_tensor."""
+        max_val = 2 ** (bits - 1) - 1  # 7-bit: 63, 8-bit: 127
+        clip_q = INT8_CLIP_Q  # 99.99984 percentile
+        t = tensor
+        if t.ndim == 2:
+            clip_abs = torch.quantile(t.detach().abs(), clip_q, dim=1)
+            clip_abs = clip_abs.clamp(min=1e-8)
+            clipped = torch.maximum(torch.minimum(t, clip_abs[:, None]), -clip_abs[:, None])
+            scale = clip_abs[:, None] / max_val
+            quantized = torch.round(clipped / scale) * scale
+        else:
+            clip_abs = torch.quantile(t.detach().abs().flatten(), clip_q).clamp(min=1e-8)
+            clipped = torch.clamp(t, -clip_abs, clip_abs)
+            scale = clip_abs / max_val
+            quantized = torch.round(clipped / scale) * scale
+        return t + (quantized - t).detach()
+
+    def forward(self, layer_idx: int | None = None) -> Tensor:
+        """
+        Generate weights for a specific layer or all layers.
+
+        Args:
+            layer_idx: If None, return all layers (L, in, out).
+                       If int, return single layer (in, out).
+        """
+        stacked = self._stack_coeffs()  # (K, in, out)
+
+        if layer_idx is not None:
+            # Single layer: basis[layer_idx] is (K,)
+            b = self.basis[layer_idx]  # (K,)
+            W = torch.einsum("k, k i o -> i o", b, stacked)
+            s = self.layer_scales[layer_idx]
+            W = s * W
+            if self.use_wrap:
+                W = W - torch.round(W)
+            return W
+        else:
+            # All layers at once: basis (L, K), stacked (K, in*out)
+            c_flat = stacked.reshape(self.K, -1)  # (K, in*out)
+            W = self.basis @ c_flat  # (L, in*out)
+            W = W.reshape(self.num_layers, self.in_features, self.out_features)
+            s = self.layer_scales[:, None, None]  # (L, 1, 1)
+            W = s * W
+            if self.use_wrap:
+                W = W - torch.round(W)
+            return W
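+
+
+# Illustration (hypothetical usage, mirroring the README's
+# W(layer_l) = sum_k C_k * P_k(t_l)): six coefficient matrices generate eleven
+# distinct layer weights. Shapes are the FULL-config attention case
+# (dim=512, degree 5, 11 layers per group).
+def _example_generate_layer_weights() -> None:
+    gen = LegendreWeightGen(num_layers=11, degree=5, in_features=512, out_features=512)
+    W_all = gen()            # (11, 512, 512): all virtual layers at once
+    W_3 = gen(layer_idx=3)   # (512, 512): a single virtual layer
+    assert W_all.shape == (11, 512, 512) and W_3.shape == (512, 512)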
+        """
+        stacked = self._stack_coeffs()  # (K, in, out)
+
+        if layer_idx is not None:
+            # Single layer: basis[layer_idx] is (K,)
+            b = self.basis[layer_idx]  # (K,)
+            W = torch.einsum("k, k i o -> i o", b, stacked)
+            s = self.layer_scales[layer_idx]
+            W = s * W
+            if self.use_wrap:
+                W = W - torch.round(W)
+            return W
+        else:
+            # All layers at once: basis (L, K), stacked reshaped to (K, in*out)
+            c_flat = stacked.reshape(self.K, -1)  # (K, in*out)
+            W = self.basis @ c_flat  # (L, in*out)
+            W = W.reshape(self.num_layers, self.in_features, self.out_features)
+            s = self.layer_scales[:, None, None]  # (L, 1, 1)
+            W = s * W
+            if self.use_wrap:
+                W = W - torch.round(W)
+            return W
+
+
+# =============================================================================
+# FACTORIZED EMBEDDING (ALBERT-style)
+# =============================================================================
+
+class FactorizedEmbedding(nn.Module):
+    """
+    Embedding with bottleneck: vocab_size -> inner_dim -> model_dim
+    Saves bytes when vocab_size * model_dim >> vocab_size * inner_dim + inner_dim * model_dim
+
+    For tied embeddings, the output head computes:
+        logits = x @ proj.weight @ embed.weight.T
+    This keeps both matrices in the autograd graph.
+    """
+
+    def __init__(self, vocab_size: int, inner_dim: int, model_dim: int, init_std: float = 0.005):
+        super().__init__()
+        self.embed = nn.Embedding(vocab_size, inner_dim)
+        self.proj = nn.Linear(inner_dim, model_dim, bias=False)
+        self.inner_dim = inner_dim
+        self.model_dim = model_dim
+        nn.init.normal_(self.embed.weight, mean=0.0, std=init_std)
+
+    def forward(self, x: Tensor) -> Tensor:
+        return self.proj(self.embed(x))
+
+    def tied_logits(self, x: Tensor) -> Tensor:
+        """
+        Compute logits using the factorized embedding in reverse.
+        x: (batch*seq, model_dim)
+        returns: (batch*seq, vocab_size)
+
+        Does x @ proj.weight @ embed.weight.T without materializing the
+        full (vocab, model_dim) matrix. Keeps autograd happy.
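+
+        Reference computation (materializes the full matrix; for clarity only):
+            logits = x @ (self.embed.weight @ self.proj.weight.T).T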
+        """
+        # x @ proj.weight -> (batch*seq, inner_dim)
+        h = F.linear(x, self.proj.weight.T.contiguous())
+        # h @ embed.weight.T -> (batch*seq, vocab_size)
+        return F.linear(h, self.embed.weight)
+
+
+# =============================================================================
+# STANDARD TRANSFORMER COMPONENTS
+# =============================================================================
+
+class RMSNorm(nn.Module):
+    def __init__(self, eps: float | None = None):
+        super().__init__()
+        self.eps = eps
+
+    def forward(self, x: Tensor) -> Tensor:
+        return F.rms_norm(x, (x.size(-1),), eps=self.eps)
+
+
+class Rotary(nn.Module):
+    def __init__(self, dim: int, base: float = 10000.0):
+        super().__init__()
+        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
+        self.register_buffer("inv_freq", inv_freq, persistent=False)
+        self._seq_len_cached = 0
+        self._cos_cached: Tensor | None = None
+        self._sin_cached: Tensor | None = None
+
+    def forward(self, seq_len: int, device: torch.device, dtype: torch.dtype) -> tuple[Tensor, Tensor]:
+        if (self._cos_cached is None or self._sin_cached is None
+                or self._seq_len_cached != seq_len or self._cos_cached.device != device):
+            t = torch.arange(seq_len, device=device, dtype=self.inv_freq.dtype)
+            freqs = torch.outer(t, self.inv_freq.to(device))
+            self._cos_cached = freqs.cos()[None, None, :, :]
+            self._sin_cached = freqs.sin()[None, None, :, :]
+            self._seq_len_cached = seq_len
+        return self._cos_cached.to(dtype=dtype), self._sin_cached.to(dtype=dtype)
+
+
+def apply_rotary_emb(x: Tensor, cos: Tensor, sin: Tensor) -> Tensor:
+    half = x.size(-1) // 2
+    x1, x2 = x[..., :half], x[..., half:]
+    return torch.cat((x1 * cos + x2 * sin, x1 * (-sin) + x2 * cos), dim=-1)
+
+
+# =============================================================================
+# ATTENTION & MLP WITH EXTERNAL WEIGHTS
+# =============================================================================
+# These modules take weight matrices as arguments instead of owning nn.Linear
+# layers. This allows the Legendre generator to provide the weights.
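+# Weight shape convention: as with nn.Linear, every W below is
+# (out_features, in_features) and is applied via F.linear, e.g. W_k has
+# shape (kv_dim, dim) and maps dim -> kv_dim.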
+
+def attention_with_weights(
+    x: Tensor,
+    W_q: Tensor,     # (dim, dim)
+    W_k: Tensor,     # (kv_dim, dim)
+    W_v: Tensor,     # (kv_dim, dim)
+    W_o: Tensor,     # (dim, dim)
+    q_gain: Tensor,  # (num_heads,)
+    rotary: Rotary,
+    num_heads: int,
+    num_kv_heads: int,
+) -> Tensor:
+    bsz, seqlen, dim = x.shape
+    head_dim = dim // num_heads
+
+    # Project with external weights (cast to x.dtype for bf16 compute)
+    q = F.linear(x, W_q.to(x.dtype)).reshape(bsz, seqlen, num_heads, head_dim).transpose(1, 2)
+    k = F.linear(x, W_k.to(x.dtype)).reshape(bsz, seqlen, num_kv_heads, head_dim).transpose(1, 2)
+    v = F.linear(x, W_v.to(x.dtype)).reshape(bsz, seqlen, num_kv_heads, head_dim).transpose(1, 2)
+
+    # QK normalization + RoPE
+    q = F.rms_norm(q, (q.size(-1),))
+    k = F.rms_norm(k, (k.size(-1),))
+    cos, sin = rotary(seqlen, x.device, q.dtype)
+    q = apply_rotary_emb(q, cos, sin)
+    k = apply_rotary_emb(k, cos, sin)
+    q = q * q_gain.to(dtype=q.dtype)[None, :, None, None]
+
+    # Attention (enable_gqa in SDPA requires PyTorch >= 2.5)
+    y = F.scaled_dot_product_attention(q, k, v, attn_mask=None, is_causal=True,
+                                       enable_gqa=(num_kv_heads != num_heads))
+    y = y.transpose(1, 2).contiguous().reshape(bsz, seqlen, dim)
+
+    # Output projection
+    return F.linear(y, W_o.to(x.dtype))
+
+
+def mlp_with_weights(x: Tensor, W_up: Tensor, W_down: Tensor) -> Tensor:
+    """ReluSquared MLP with external weights."""
+    h = torch.relu(F.linear(x, W_up.to(x.dtype)))
+    return F.linear(h.square(), W_down.to(x.dtype))
+
+
+# =============================================================================
+# INDEPENDENT BLOCK (first/last layer with own weights)
+# =============================================================================
+
+class IndependentBlock(nn.Module):
+    """Standard transformer block with its own weight matrices."""
+
+    def __init__(self, dim: int, num_heads: int, num_kv_heads: int,
+                 mlp_mult: int, rope_base: float, qk_gain_init: float):
+        super().__init__()
+        self.num_heads = num_heads
+        self.num_kv_heads = num_kv_heads
+        self.head_dim = dim // num_heads
+        kv_dim = num_kv_heads * self.head_dim
+
+        # Own weight matrices
+        self.W_q = nn.Parameter(torch.empty(dim, dim))
+        self.W_k = nn.Parameter(torch.empty(kv_dim, dim))
+        self.W_v = nn.Parameter(torch.empty(kv_dim, dim))
+        self.W_o = nn.Parameter(torch.zeros(dim, dim))  # zero-init like baseline
+        self.W_up = nn.Parameter(torch.empty(mlp_mult * dim, dim))
+        self.W_down = nn.Parameter(torch.zeros(dim, mlp_mult * dim))  # zero-init
+
+        self.q_gain = nn.Parameter(torch.full((num_heads,), qk_gain_init, dtype=torch.float32))
+        self.rotary = Rotary(self.head_dim, base=rope_base)
+
+        self.attn_norm = RMSNorm()
+        self.mlp_norm = RMSNorm()
+        self.attn_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32))
+        self.mlp_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32))
+        self.resid_mix = nn.Parameter(torch.stack((torch.ones(dim), torch.zeros(dim))).float())
+
+        self._init_weights(dim)
+
+    def _init_weights(self, dim):
+        std = 1.0 / math.sqrt(dim)
+        nn.init.normal_(self.W_q, mean=0.0, std=std)
+        nn.init.normal_(self.W_k, mean=0.0, std=std)
+        nn.init.normal_(self.W_v, mean=0.0, std=std)
+        nn.init.normal_(self.W_up, mean=0.0, std=std)
+
+    def forward(self, x: Tensor, x0: Tensor) -> Tensor:
+        # Residual mixing with initial embedding (like baseline)
+        mix = self.resid_mix.to(dtype=x.dtype)
+        x = mix[0][None, None, :] * x + mix[1][None, None, :] * x0
+        # Attention
+        attn_out = attention_with_weights(
+            self.attn_norm(x), self.W_q, self.W_k, self.W_v, self.W_o,
+            self.q_gain, self.rotary, self.num_heads, self.num_kv_heads,
+        )
+        x = x + self.attn_scale.to(dtype=x.dtype)[None, None, :] * attn_out
+        # MLP
+        mlp_out = mlp_with_weights(self.mlp_norm(x), self.W_up, self.W_down)
+        x = x + self.mlp_scale.to(dtype=x.dtype)[None, None, :] * mlp_out
+        return x
+
+
+# =============================================================================
+# LEGENDRE BLOCK (middle layers with generated weights)
+# =============================================================================
+
+class LegendreBlockSet(nn.Module):
+    """
+    A set of virtual transformer blocks whose weights are generated by
+    Legendre polynomial coefficients. This is the core innovation.
+
+    Instead of storing N separate Block modules, we store:
+    - Legendre coefficients for each weight type (Q, K, V, O, up, down)
+    - Per-layer norms and scales (small, ~dim floats each)
+    - One shared Rotary embedding
+    """
+
+    def __init__(
+        self,
+        num_layers: int,
+        dim: int,
+        num_heads: int,
+        num_kv_heads: int,
+        mlp_mult: int,
+        rope_base: float,
+        qk_gain_init: float,
+        degree_attn: int,
+        degree_ffn: int,
+        use_wrap: bool = False,
+        lora_rank: int = 0,
+    ):
+        super().__init__()
+        self.num_layers = num_layers
+        self.dim = dim
+        self.num_heads = num_heads
+        self.num_kv_heads = num_kv_heads
+        self.lora_rank = lora_rank
+        head_dim = dim // num_heads
+        kv_dim = num_kv_heads * head_dim
+        mlp_hidden = mlp_mult * dim
+
+        # --- Legendre weight generators for each weight type ---
+        # Attention weights: higher degree for more expressivity
+        self.gen_q = LegendreWeightGen(num_layers, degree_attn, dim, dim, use_wrap, "Q")
+        self.gen_k = LegendreWeightGen(num_layers, degree_attn, kv_dim, dim, use_wrap, "K")
+        self.gen_v = LegendreWeightGen(num_layers, degree_attn, kv_dim, dim, use_wrap, "V")
+        self.gen_o = LegendreWeightGen(num_layers, degree_attn, dim, dim, use_wrap, "O")
+
+        # FFN weights: lower degree (reportedly more compressible)
+        self.gen_up = LegendreWeightGen(num_layers, degree_ffn, mlp_hidden, dim, use_wrap, "up")
+        self.gen_down = LegendreWeightGen(num_layers, degree_ffn, dim, mlp_hidden, use_wrap, "down")
+
+        # --- Per-layer lightweight params (NOT generated, very cheap) ---
+        self.q_gains = nn.Parameter(
+            torch.full((num_layers, num_heads), qk_gain_init, dtype=torch.float32)
+        )
+        self.attn_scales = nn.ParameterList([
+            nn.Parameter(torch.ones(dim, dtype=torch.float32)) for _ in range(num_layers)
+        ])
+        self.mlp_scales = nn.ParameterList([
+            nn.Parameter(torch.ones(dim, dtype=torch.float32)) for _ in range(num_layers)
+        ])
+        self.resid_mixes = nn.ParameterList([
+            nn.Parameter(torch.stack((torch.ones(dim), torch.zeros(dim))).float())
+            for _ in range(num_layers)
+        ])
+        self.attn_norms = nn.ModuleList([RMSNorm() for _ in range(num_layers)])
+        self.mlp_norms = nn.ModuleList([RMSNorm() for _ in range(num_layers)])
+
+        # Shared rotary (one for all layers)
+        self.rotary = Rotary(head_dim, base=rope_base)
+
+        # Zero-init output projections and FFN down
+        with torch.no_grad():
+            for gen in [self.gen_o, self.gen_down]:
+                for coeff in gen.coeffs:
+                    coeff.zero_()
+
+        # --- LoRA per-layer corrections ---
+        if lora_rank > 0:
+            r = lora_rank
+            self.lora_A_q = nn.ParameterList([nn.Parameter(torch.zeros(dim, r)) for _ in range(num_layers)])
+            self.lora_B_q = nn.ParameterList([nn.Parameter(torch.randn(r, dim) * 0.01) for _ in range(num_layers)])
+            self.lora_A_k = nn.ParameterList([nn.Parameter(torch.zeros(kv_dim, r)) for _ in range(num_layers)])
+            self.lora_B_k = nn.ParameterList([nn.Parameter(torch.randn(r, dim) * 0.01) for _ in range(num_layers)])
+            self.lora_A_v =
nn.ParameterList([nn.Parameter(torch.zeros(kv_dim, r)) for _ in range(num_layers)]) + self.lora_B_v = nn.ParameterList([nn.Parameter(torch.randn(r, dim) * 0.01) for _ in range(num_layers)]) + self.lora_A_o = nn.ParameterList([nn.Parameter(torch.zeros(dim, r)) for _ in range(num_layers)]) + self.lora_B_o = nn.ParameterList([nn.Parameter(torch.randn(r, dim) * 0.01) for _ in range(num_layers)]) + self.lora_A_up = nn.ParameterList([nn.Parameter(torch.zeros(mlp_hidden, r)) for _ in range(num_layers)]) + self.lora_B_up = nn.ParameterList([nn.Parameter(torch.randn(r, dim) * 0.01) for _ in range(num_layers)]) + self.lora_A_down = nn.ParameterList([nn.Parameter(torch.zeros(dim, r)) for _ in range(num_layers)]) + self.lora_B_down = nn.ParameterList([nn.Parameter(torch.randn(r, mlp_hidden) * 0.01) for _ in range(num_layers)]) + + def forward(self, x: Tensor, x0: Tensor, start_idx: int = 0) -> Tensor: + """ + Process x through all virtual layers sequentially. + Generates weights on-the-fly per layer to save VRAM. + """ + for i in range(self.num_layers): + # Residual mixing with initial embedding + mix = self.resid_mixes[i].to(dtype=x.dtype) + x = mix[0][None, None, :] * x + mix[1][None, None, :] * x0 + + # Generate weights for this layer + W_q = self.gen_q(i) + W_k = self.gen_k(i) + W_v = self.gen_v(i) + W_o = self.gen_o(i) + W_up = self.gen_up(i) + W_down = self.gen_down(i) + + # Add LoRA corrections + if self.lora_rank > 0: + W_q = W_q + self.lora_A_q[i] @ self.lora_B_q[i] + W_k = W_k + self.lora_A_k[i] @ self.lora_B_k[i] + W_v = W_v + self.lora_A_v[i] @ self.lora_B_v[i] + W_o = W_o + self.lora_A_o[i] @ self.lora_B_o[i] + W_up = W_up + self.lora_A_up[i] @ self.lora_B_up[i] + W_down = W_down + self.lora_A_down[i] @ self.lora_B_down[i] + + # Attention + attn_out = attention_with_weights( + self.attn_norms[i](x), W_q, W_k, W_v, W_o, + self.q_gains[i], self.rotary, self.num_heads, self.num_kv_heads, + ) + x = x + self.attn_scales[i].to(dtype=x.dtype)[None, None, :] * attn_out + + # MLP + mlp_out = mlp_with_weights(self.mlp_norms[i](x), W_up, W_down) + x = x + self.mlp_scales[i].to(dtype=x.dtype)[None, None, :] * mlp_out + + return x + + +# ============================================================================= +# FULL MODEL: LegendreGPT +# ============================================================================= + +class LegendreGPT(nn.Module): + """ + Sandwich-structured GPT with Legendre-parameterized middle layers. 
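+
+    The middle layers are split into `legendre_groups` independent coefficient
+    groups (two in the README config) to reduce cross-layer gradient
+    cancellation.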
+ + Structure: + [Factorized Embedding] + [Independent Block 0] <- first layer, own weights + [Legendre Block 1..N-2] <- middle layers, generated weights + [Independent Block N-1] <- last layer, own weights + [RMSNorm -> Tied Head] + """ + + def __init__( + self, + vocab_size: int, + num_virtual_layers: int, + model_dim: int, + num_heads: int, + num_kv_heads: int, + mlp_mult: int, + embed_inner_dim: int, + degree_attn: int, + degree_ffn: int, + tie_embeddings: bool, + tied_embed_init_std: float, + logit_softcap: float, + rope_base: float, + qk_gain_init: float, + use_wrap: bool = False, + lora_rank: int = 0, + legendre_groups: int = 1, + ): + super().__init__() + if num_virtual_layers < 3: + raise ValueError("Need at least 3 layers for sandwich structure") + self.logit_softcap = logit_softcap + self.tie_embeddings = tie_embeddings + self.model_dim = model_dim + self.num_virtual_layers = num_virtual_layers + + num_middle = num_virtual_layers - 2 # layers generated by Legendre + + # --- Factorized embedding --- + self.tok_emb = FactorizedEmbedding(vocab_size, embed_inner_dim, model_dim, tied_embed_init_std) + + # --- Sandwich: first independent block --- + self.first_block = IndependentBlock( + model_dim, num_heads, num_kv_heads, mlp_mult, rope_base, qk_gain_init, + ) + + # --- Middle: Legendre-generated blocks --- + self.middle_blocks = nn.ModuleList() + layers_per_group = num_middle // legendre_groups + remainder = num_middle % legendre_groups + for g in range(legendre_groups): + n = layers_per_group + (1 if g < remainder else 0) + self.middle_blocks.append(LegendreBlockSet( + num_layers=n, + dim=model_dim, + num_heads=num_heads, + num_kv_heads=num_kv_heads, + mlp_mult=mlp_mult, + rope_base=rope_base, + qk_gain_init=qk_gain_init, + degree_attn=degree_attn, + degree_ffn=degree_ffn, + use_wrap=use_wrap, + lora_rank=lora_rank, + )) + + # --- Sandwich: last independent block --- + self.last_block = IndependentBlock( + model_dim, num_heads, num_kv_heads, mlp_mult, rope_base, qk_gain_init, + ) + + # --- Output --- + self.final_norm = RMSNorm() + + def forward(self, input_ids: Tensor, target_ids: Tensor) -> Tensor: + x = self.tok_emb(input_ids) + x = F.rms_norm(x, (x.size(-1),)) + x0 = x + + # Sandwich forward pass + x = self.first_block(x, x0) + for group in self.middle_blocks: + x = group(x, x0) + x = self.last_block(x, x0) + + # Project to logits + x = self.final_norm(x).reshape(-1, x.size(-1)) + targets = target_ids.reshape(-1) + + if self.tie_embeddings: + logits_proj = self.tok_emb.tied_logits(x) + else: + raise NotImplementedError("Untied embeddings not implemented for Legendre model") + + logits = self.logit_softcap * torch.tanh(logits_proj / self.logit_softcap) + return F.cross_entropy(logits.float(), targets, reduction="mean") + + def count_params(self) -> dict: + """Count parameters by component for budget analysis.""" + counts = {} + counts["embedding"] = sum(p.numel() for p in self.tok_emb.parameters()) + counts["first_block"] = sum(p.numel() for p in self.first_block.parameters()) + counts["legendre_coeffs"] = sum( + p.numel() for name, p in self.middle_blocks.named_parameters() + if "coeffs" in name or "layer_scales" in name + ) + counts["lora"] = sum( + p.numel() for name, p in self.middle_blocks.named_parameters() + if "lora_" in name + ) + counts["legendre_norms_scales"] = sum( + p.numel() for name, p in self.middle_blocks.named_parameters() + if "coeffs" not in name and "layer_scales" not in name and "lora_" not in name + ) + counts["last_block"] = sum(p.numel() for p in 
self.last_block.parameters()) + counts["final_norm"] = sum(p.numel() for p in self.final_norm.parameters()) + counts["total"] = sum(p.numel() for p in self.parameters()) + return counts + + +# ============================================================================= +# TOKENIZER-AGNOSTIC EVALUATION (unchanged from baseline) +# ============================================================================= + +def build_sentencepiece_luts( + sp: spm.SentencePieceProcessor, vocab_size: int, device: torch.device +) -> tuple[Tensor, Tensor, Tensor]: + sp_vocab_size = int(sp.vocab_size()) + table_size = max(sp_vocab_size, vocab_size) + base_bytes_np = np.zeros((table_size,), dtype=np.int16) + has_leading_space_np = np.zeros((table_size,), dtype=np.bool_) + is_boundary_token_np = np.ones((table_size,), dtype=np.bool_) + for token_id in range(sp_vocab_size): + if sp.is_control(token_id) or sp.is_unknown(token_id) or sp.is_unused(token_id): + continue + is_boundary_token_np[token_id] = False + if sp.is_byte(token_id): + base_bytes_np[token_id] = 1 + continue + piece = sp.id_to_piece(token_id) + if piece.startswith("▁"): + has_leading_space_np[token_id] = True + piece = piece[1:] + base_bytes_np[token_id] = len(piece.encode("utf-8")) + return ( + torch.tensor(base_bytes_np, dtype=torch.int16, device=device), + torch.tensor(has_leading_space_np, dtype=torch.bool, device=device), + torch.tensor(is_boundary_token_np, dtype=torch.bool, device=device), + ) + + +def load_validation_tokens(pattern: str, seq_len: int) -> Tensor: + files = [Path(p) for p in sorted(glob.glob(pattern))] + if not files: + raise FileNotFoundError(f"No files found for pattern: {pattern}") + tokens = torch.cat([load_data_shard(file) for file in files]).contiguous() + usable = ((tokens.numel() - 1) // seq_len) * seq_len + if usable <= 0: + raise ValueError(f"Validation split is too short for seq_len={seq_len}") + return tokens[: usable + 1] + + +def eval_val(args, model, rank, world_size, device, grad_accum_steps, + val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut, + log_fn=None, max_batches=0): + local_batch_tokens = args.val_batch_size // (world_size * grad_accum_steps) + if local_batch_tokens < args.train_seq_len: + raise ValueError("VAL_BATCH_SIZE too small") + local_batch_seqs = local_batch_tokens // args.train_seq_len + total_seqs = (val_tokens.numel() - 1) // args.train_seq_len + seq_start = (total_seqs * rank) // world_size + seq_end = (total_seqs * (rank + 1)) // world_size + total_batches = max(1, (seq_end - seq_start + local_batch_seqs - 1) // local_batch_seqs) + if max_batches > 0: + total_batches = min(total_batches, max_batches) + val_loss_sum = torch.zeros((), device=device, dtype=torch.float64) + val_token_count = torch.zeros((), device=device, dtype=torch.float64) + val_byte_count = torch.zeros((), device=device, dtype=torch.float64) + + if log_fn: + log_fn(f" eval: {total_batches} batches {'(limited)' if max_batches > 0 else ''}") + + model.eval() + batch_idx = 0 + with torch.inference_mode(): + for batch_seq_start in range(seq_start, seq_end, local_batch_seqs): + if max_batches > 0 and batch_idx >= max_batches: + break + batch_seq_end = min(batch_seq_start + local_batch_seqs, seq_end) + raw_start = batch_seq_start * args.train_seq_len + raw_end = batch_seq_end * args.train_seq_len + 1 + local = val_tokens[raw_start:raw_end].to(device=device, dtype=torch.int64, non_blocking=True) + x = local[:-1].reshape(-1, args.train_seq_len) + y = local[1:].reshape(-1, args.train_seq_len) + with 
torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + batch_loss = model(x, y).detach() + batch_token_count = float(y.numel()) + val_loss_sum += batch_loss.to(torch.float64) * batch_token_count + val_token_count += batch_token_count + prev_ids = x.reshape(-1) + tgt_ids = y.reshape(-1) + token_bytes = base_bytes_lut[tgt_ids].to(dtype=torch.int16) + token_bytes += (has_leading_space_lut[tgt_ids] & ~is_boundary_token_lut[prev_ids]).to(dtype=torch.int16) + val_byte_count += token_bytes.to(torch.float64).sum() + batch_idx += 1 + if log_fn and (batch_idx % 20 == 0 or batch_idx == total_batches): + running_loss = (val_loss_sum / val_token_count).item() if val_token_count > 0 else 0 + log_fn(f" eval: batch {batch_idx}/{total_batches} running_loss:{running_loss:.4f}") + + if dist.is_available() and dist.is_initialized(): + dist.all_reduce(val_loss_sum, op=dist.ReduceOp.SUM) + dist.all_reduce(val_token_count, op=dist.ReduceOp.SUM) + dist.all_reduce(val_byte_count, op=dist.ReduceOp.SUM) + + val_loss = val_loss_sum / val_token_count + bits_per_token = val_loss.item() / math.log(2.0) + tokens_per_byte = val_token_count.item() / val_byte_count.item() + model.train() + return float(val_loss.item()), float(bits_per_token * tokens_per_byte) + + +# ============================================================================= +# POST-TRAINING QUANTIZATION (unchanged from baseline) +# ============================================================================= + +CONTROL_TENSOR_NAME_PATTERNS = ( + "attn_scale", "attn_scales", "mlp_scale", "mlp_scales", + "q_gain", "q_gains", "layer_scales", "resid_mix", "resid_mixes", +) +INT8_KEEP_FLOAT_FP32_NAME_PATTERNS = CONTROL_TENSOR_NAME_PATTERNS +INT8_KEEP_FLOAT_MAX_NUMEL = 65_536 +INT8_KEEP_FLOAT_STORE_DTYPE = torch.float16 +INT8_PER_ROW_SCALE_DTYPE = torch.float16 +INT8_CLIP_PERCENTILE = 99.99984 +INT8_CLIP_Q = INT8_CLIP_PERCENTILE / 100.0 + + +def tensor_nbytes(t: Tensor) -> int: + return int(t.numel()) * int(t.element_size()) + + +def keep_float_tensor(name, t, passthrough_orig_dtypes): + if any(pattern in name for pattern in INT8_KEEP_FLOAT_FP32_NAME_PATTERNS): + return t.float().contiguous() + if t.dtype in {torch.float32, torch.bfloat16}: + passthrough_orig_dtypes[name] = str(t.dtype).removeprefix("torch.") + return t.to(dtype=INT8_KEEP_FLOAT_STORE_DTYPE).contiguous() + return t + + +def quantize_float_tensor(t, bits=8): + max_val = 2 ** (bits - 1) - 1 # 8-bit: 127, 7-bit: 63 + t32 = t.float() + if t32.ndim == 2: + clip_abs = (torch.quantile(t32.abs(), INT8_CLIP_Q, dim=1) + if t32.numel() else torch.empty((t32.shape[0],), dtype=torch.float32)) + clipped = torch.maximum(torch.minimum(t32, clip_abs[:, None]), -clip_abs[:, None]) + scale = (clip_abs / max_val).clamp_min(1.0 / max_val) + q = torch.clamp(torch.round(clipped / scale[:, None]), -max_val, max_val).to(torch.int8).contiguous() + return q, scale.to(dtype=INT8_PER_ROW_SCALE_DTYPE).contiguous() + clip_abs = float(torch.quantile(t32.abs().flatten(), INT8_CLIP_Q).item()) if t32.numel() else 0.0 + scale = torch.tensor(clip_abs / max_val if clip_abs > 0 else 1.0, dtype=torch.float32) + q = torch.clamp(torch.round(torch.clamp(t32, -clip_abs, clip_abs) / scale), -max_val, max_val).to(torch.int8).contiguous() + return q, scale + + +def quantize_state_dict_int8(state_dict, bits=8): + quantized, scales, dtypes, passthrough = {}, {}, {}, {} + passthrough_orig_dtypes = {} + qmeta = {} + stats = dict.fromkeys( + ("param_count", "num_tensors", "num_float_tensors", "num_nonfloat_tensors", + 
"baseline_tensor_bytes", "int8_payload_bytes"), 0) + + for name, tensor in state_dict.items(): + t = tensor.detach().to("cpu").contiguous() + stats["param_count"] += int(t.numel()) + stats["num_tensors"] += 1 + stats["baseline_tensor_bytes"] += tensor_nbytes(t) + + if not t.is_floating_point(): + stats["num_nonfloat_tensors"] += 1 + passthrough[name] = t + stats["int8_payload_bytes"] += tensor_nbytes(t) + continue + + if t.numel() <= INT8_KEEP_FLOAT_MAX_NUMEL: + kept = keep_float_tensor(name, t, passthrough_orig_dtypes) + passthrough[name] = kept + stats["int8_payload_bytes"] += tensor_nbytes(kept) + continue + + stats["num_float_tensors"] += 1 + q, s = quantize_float_tensor(t, bits=bits) + if s.ndim > 0: + qmeta[name] = {"scheme": "per_row", "axis": 0} + quantized[name] = q + scales[name] = s + dtypes[name] = str(t.dtype).removeprefix("torch.") + stats["int8_payload_bytes"] += tensor_nbytes(q) + tensor_nbytes(s) + + obj = {"__quant_format__": "int8_clean_per_row_v1", + "quantized": quantized, "scales": scales, "dtypes": dtypes, "passthrough": passthrough} + if qmeta: + obj["qmeta"] = qmeta + if passthrough_orig_dtypes: + obj["passthrough_orig_dtypes"] = passthrough_orig_dtypes + return obj, stats + + +def dequantize_state_dict_int8(obj): + out = {} + qmeta = obj.get("qmeta", {}) + passthrough_orig_dtypes = obj.get("passthrough_orig_dtypes", {}) + for name, q in obj["quantized"].items(): + dtype = getattr(torch, obj["dtypes"][name]) + s = obj["scales"][name] + if qmeta.get(name, {}).get("scheme") == "per_row" or s.ndim > 0: + s = s.to(dtype=torch.float32) + out[name] = (q.float() * s.view(q.shape[0], *([1] * (q.ndim - 1)))).to(dtype=dtype).contiguous() + else: + out[name] = (q.float() * float(s.item())).to(dtype=dtype).contiguous() + for name, t in obj["passthrough"].items(): + out_t = t.detach().to("cpu").contiguous() + orig_dtype = passthrough_orig_dtypes.get(name) + if isinstance(orig_dtype, str): + out_t = out_t.to(dtype=getattr(torch, orig_dtype)).contiguous() + out[name] = out_t + return out + + +# ============================================================================= +# DATA LOADING (unchanged from baseline) +# ============================================================================= + +def load_data_shard(file: Path) -> Tensor: + header_bytes = 256 * np.dtype(" 0: + self._skip(skip_tokens) + + def _skip(self, n: int): + """Skip n tokens without loading them all into memory.""" + remaining = n + while remaining > 0: + avail = self.tokens.numel() - self.pos + if avail <= 0: + self._advance_file() + continue + k = min(remaining, avail) + self.pos += k + remaining -= k + + def _advance_file(self): + self.file_idx = (self.file_idx + 1) % len(self.files) + self.tokens = load_data_shard(self.files[self.file_idx]) + self.pos = 0 + + def take(self, n: int) -> Tensor: + chunks = [] + remaining = n + while remaining > 0: + avail = self.tokens.numel() - self.pos + if avail <= 0: + self._advance_file() + continue + k = min(remaining, avail) + chunks.append(self.tokens[self.pos : self.pos + k]) + self.pos += k + remaining -= k + return chunks[0] if len(chunks) == 1 else torch.cat(chunks) + + +class DistributedTokenLoader: + def __init__(self, pattern: str, rank: int, world_size: int, device: torch.device, skip_tokens: int = 0): + self.rank = rank + self.world_size = world_size + self.device = device + self.stream = TokenStream(pattern, skip_tokens=skip_tokens) + + def next_batch(self, global_tokens: int, seq_len: int, grad_accum_steps: int): + local_tokens = global_tokens // 
(self.world_size * grad_accum_steps) + per_rank_span = local_tokens + 1 + chunk = self.stream.take(per_rank_span * self.world_size) + start = self.rank * per_rank_span + local = chunk[start : start + per_rank_span].to(dtype=torch.int64) + x = local[:-1].reshape(-1, seq_len) + y = local[1:].reshape(-1, seq_len) + return x.to(self.device, non_blocking=True), y.to(self.device, non_blocking=True) + + +# ============================================================================= +# WRAP-AWARE GRADIENT MODIFICATION +# ============================================================================= + +def apply_wrap_aware_gradients(model, use_wrap): + """Modify gradients to be aware of circular wrap topology.""" + if not use_wrap: + return + for name, param in model.middle_blocks.named_parameters(): + if "coeffs" not in name or param.grad is None: + continue + with torch.no_grad(): + grad = param.grad + wrapped = param.data - torch.round(param.data) + dist_to_pos_border = 0.5 - wrapped + dist_to_neg_border = 0.5 + wrapped + + # Soft wall: scale down gradient near borders to avoid accidental crossing + border_proximity = torch.min(dist_to_pos_border, dist_to_neg_border) + border_scale = torch.clamp(border_proximity / 0.1, 0.0, 1.0) + + # Shortcut detection: if wrap path is shorter, flip the gradient + normal_dist = torch.where(grad > 0, dist_to_pos_border, dist_to_neg_border) + wrap_dist = torch.where(grad > 0, dist_to_neg_border, dist_to_pos_border) + should_flip = (wrap_dist < normal_dist) & (grad.abs() > 0.01) + + new_grad = grad * border_scale + new_grad[should_flip] = -new_grad[should_flip] + param.grad.copy_(new_grad) + + +# ============================================================================= +# TRAINING +# ============================================================================= + +def main() -> None: + global zeropower_via_newtonschulz5 + + code = Path(__file__).read_text(encoding="utf-8") + args = Hyperparameters() + zeropower_via_newtonschulz5 = torch.compile(zeropower_via_newtonschulz5) + + # --- Distributed + CUDA setup --- + distributed = "RANK" in os.environ and "WORLD_SIZE" in os.environ + rank = int(os.environ.get("RANK", "0")) + world_size = int(os.environ.get("WORLD_SIZE", "1")) + local_rank = int(os.environ.get("LOCAL_RANK", "0")) + if world_size <= 0: + raise ValueError(f"WORLD_SIZE must be positive, got {world_size}") + grad_accum_steps = max(1, 8 // world_size) + grad_scale = 1.0 / grad_accum_steps + if not torch.cuda.is_available(): + raise RuntimeError("CUDA is required") + device = torch.device("cuda", local_rank) + torch.cuda.set_device(device) + if distributed: + dist.init_process_group(backend="nccl", device_id=device) + dist.barrier() + master_process = rank == 0 + + torch.backends.cuda.matmul.allow_tf32 = True + torch.backends.cudnn.allow_tf32 = True + + logfile = None + if master_process: + os.makedirs("logs", exist_ok=True) + logfile = f"logs/{args.run_id}.txt" + print(logfile) + + def log0(msg: str, console: bool = True): + if not master_process: + return + if console: + print(msg) + if logfile is not None: + with open(logfile, "a", encoding="utf-8") as f: + print(msg, file=f) + + log0(f"{'='*60}") + log0(f"LEGENDRE GPT — Mode: {MODE}") + log0(f"{'='*60}") + + # --- Tokenizer + validation setup --- + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + torch.cuda.manual_seed_all(args.seed) + + sp = spm.SentencePieceProcessor(model_file=args.tokenizer_path) + val_tokens = load_validation_tokens(args.val_files, 
args.train_seq_len) + base_bytes_lut, has_leading_space_lut, is_boundary_token_lut = build_sentencepiece_luts( + sp, args.vocab_size, device + ) + log0(f"val tokens: {val_tokens.numel() - 1}") + + # --- Build model --- + base_model = LegendreGPT( + vocab_size=args.vocab_size, + num_virtual_layers=args.num_virtual_layers, + model_dim=args.model_dim, + num_heads=args.num_heads, + num_kv_heads=args.num_kv_heads, + mlp_mult=args.mlp_mult, + embed_inner_dim=args.embed_inner_dim, + degree_attn=args.legendre_degree_attn, + degree_ffn=args.legendre_degree_ffn, + tie_embeddings=args.tie_embeddings, + tied_embed_init_std=args.tied_embed_init_std, + logit_softcap=args.logit_softcap, + rope_base=args.rope_base, + qk_gain_init=args.qk_gain_init, + use_wrap=args.use_wrap, + lora_rank=args.lora_rank, + legendre_groups=args.legendre_groups, + ).to(device).bfloat16() + + # Keep weight generators and small params in fp32 + for name, param in base_model.named_parameters(): + if param.ndim < 2 or "coeffs" in name or "lora_" in name or any(p in name for p in CONTROL_TENSOR_NAME_PATTERNS): + param.data = param.data.float() + + # --- Parameter count report --- + param_counts = base_model.count_params() + log0(f"\n--- Parameter Budget ---") + for k, v in param_counts.items(): + mb = v * 4 / 1e6 # fp32 size + mb_int8 = v / 1e6 # int8 size + log0(f" {k}: {v:,} params ({mb:.2f} MB fp32, {mb_int8:.2f} MB int8)") + log0(f" Virtual layers: {args.num_virtual_layers} (2 independent + {args.num_virtual_layers - 2} Legendre)") + log0(f" Legendre degree: attn={args.legendre_degree_attn}, ffn={args.legendre_degree_ffn}") + log0(f" Wrap mode: {args.use_wrap}") + if args.legendre_groups > 1: + layers_per = (args.num_virtual_layers - 2) // args.legendre_groups + log0(f" Legendre groups: {args.legendre_groups}, ~{layers_per} layers each") + if args.lora_rank > 0: + lora_total = sum(p.numel() for n, p in base_model.middle_blocks.named_parameters() if "lora_" in n) + log0(f" LoRA: rank={args.lora_rank}, {lora_total:,} params ({lora_total/1e6:.2f} MB int8)") + + # LR start multiplier (can override with LR_START env var) + lr_start = float(os.environ.get("LR_START", 0.2)) + + # Log the LR schedule + log0(f"\n--- LR Schedule (linear decay) ---") + log0(f" Step 0: lr_mul={lr_start}") + log0(f" Step {args.iterations}: lr_mul=0.0") + log0(f" Coeff LR base={args.coeff_lr_base}, scale={args.coeff_lr_scale_per_order}x per order") + log0(f" Effective start LR for C0: {args.coeff_lr_base * lr_start:.4f}") + + # --- Compile + DDP --- + use_compile = bool(int(os.environ.get("USE_COMPILE", "1"))) + if use_compile: + compiled_model = torch.compile(base_model, dynamic=False, fullgraph=True) + else: + log0("torch.compile DISABLED") + compiled_model = base_model + model = DDP(compiled_model, device_ids=[local_rank], broadcast_buffers=False) if distributed else compiled_model + + # --- Optimizer setup --- + # Group 1: Embedding params (Adam) + embed_params = list(base_model.tok_emb.parameters()) + optimizer_embed = torch.optim.Adam( + [{"params": embed_params, "lr": args.tied_embed_lr, "base_lr": args.tied_embed_lr}], + betas=(args.beta1, args.beta2), eps=args.adam_eps, fused=True, + ) + + # Group 2: Legendre coefficients (Muon) — grouped by polynomial order + # Coefficients are now separate 2D params named like gen_q.coeffs.0, gen_q.coeffs.1, etc. + # The last number is the polynomial order k. We scale LR by order. 
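+    # Per the README schedule this yields order 0 at ~0.025 and
+    # order 5 at 0.025 * 1.1**5 ≈ 0.040 (base_lr * scale**k below).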
+ coeff_groups = [] + max_degree = max(args.legendre_degree_attn, args.legendre_degree_ffn) + for k in range(max_degree + 1): + order_params = [] + for name, p in base_model.middle_blocks.named_parameters(): + if f"coeffs.{k}" in name and p.ndim == 2: + order_params.append(p) + if order_params: + lr = args.coeff_lr_base * (args.coeff_lr_scale_per_order ** k) + coeff_groups.append({"params": order_params, "lr": lr, "base_lr": lr}) + + coeff_params_all = [p for g in coeff_groups for p in g["params"]] + optimizer_muon = Muon( + coeff_params_all, + lr=args.coeff_lr_base, + momentum=args.muon_momentum, + backend_steps=args.muon_backend_steps, + ) + # Override with per-order LR groups + optimizer_muon.param_groups = coeff_groups + for group in optimizer_muon.param_groups: + group.setdefault("momentum", args.muon_momentum) + group.setdefault("backend_steps", args.muon_backend_steps) + group.setdefault("nesterov", True) + + # Group 3: Independent layer weights (Muon) + indep_matrix_params = [] + for block in [base_model.first_block, base_model.last_block]: + for name, p in block.named_parameters(): + if p.ndim == 2 and not any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS): + indep_matrix_params.append(p) + + optimizer_indep = Muon( + indep_matrix_params, + lr=args.independent_layer_lr, + momentum=args.muon_momentum, + backend_steps=args.muon_backend_steps, + ) + for group in optimizer_indep.param_groups: + group["base_lr"] = args.independent_layer_lr + + # Group 3b: LoRA params (Muon) — per-layer corrections + lora_params = [] + optimizer_lora = None + if args.lora_rank > 0: + for name, p in base_model.middle_blocks.named_parameters(): + if "lora_" in name and p.ndim == 2: + lora_params.append(p) + optimizer_lora = Muon( + lora_params, + lr=args.independent_layer_lr, + momentum=args.muon_momentum, + backend_steps=args.muon_backend_steps, + ) + for group in optimizer_lora.param_groups: + group["base_lr"] = args.independent_layer_lr + + # Group 4: All scalar/vector params (Adam) — norms, scales, resid_mix, q_gains, layer_scales + scalar_params = [] + params_in_other_groups = set(id(p) for p in embed_params + coeff_params_all + indep_matrix_params + lora_params) + for name, p in base_model.named_parameters(): + if id(p) not in params_in_other_groups: + scalar_params.append(p) + + optimizer_scalar = torch.optim.Adam( + [{"params": scalar_params, "lr": args.scalar_lr, "base_lr": args.scalar_lr}], + betas=(args.beta1, args.beta2), eps=args.adam_eps, fused=True, + ) + + optimizers = [optimizer_embed, optimizer_muon, optimizer_indep, optimizer_scalar] + if optimizer_lora is not None: + optimizers.append(optimizer_lora) + + log0(f"\nOptimizer groups: embed={len(embed_params)}, " + f"coeff_orders={len(coeff_groups)} (total {len(coeff_params_all)} params), " + f"indep={len(indep_matrix_params)}, scalar={len(scalar_params)}") + for i, g in enumerate(coeff_groups): + log0(f" Legendre order {i}: {len(g['params'])} params, lr={g['lr']:.4f}") + if args.lora_rank > 0: + log0(f" LoRA: {len(lora_params)} params, lr={args.independent_layer_lr}") + + # --- Checkpoint helpers --- + def save_checkpoint(step_num, training_time_ms_val): + if not master_process: + return + os.makedirs(args.checkpoint_dir, exist_ok=True) + ckpt_path = os.path.join(args.checkpoint_dir, f"ckpt_step{step_num}.pt") + tokens_consumed = step_num * args.train_batch_tokens + ckpt = { + "step": step_num, + "training_time_ms": training_time_ms_val, + "tokens_consumed": tokens_consumed, + "model_state_dict": base_model.state_dict(), + 
"optimizer_states": [opt.state_dict() for opt in optimizers], + "args_run_id": args.run_id, + } + torch.save(ckpt, ckpt_path) + log0(f" >> Checkpoint saved: {ckpt_path}") + # Keep symlink to latest + latest = os.path.join(args.checkpoint_dir, "latest.pt") + if os.path.exists(latest): + os.remove(latest) + try: + os.symlink(os.path.abspath(ckpt_path), latest) + except OSError: + # Windows/permission fallback: just copy + import shutil + shutil.copy2(ckpt_path, latest) + + resume_step = 0 + resume_training_time_ms = 0.0 + resume_tokens_consumed = 0 + if args.resume_from: + resume_path = args.resume_from + if resume_path == "latest": + resume_path = os.path.join(args.checkpoint_dir, "latest.pt") + if os.path.exists(resume_path): + log0(f"Resuming from checkpoint: {resume_path}") + ckpt = torch.load(resume_path, map_location=device) + # Clone tensors to avoid inference-mode issues with torch.compile + clean_state = {k: v.clone() for k, v in ckpt["model_state_dict"].items()} + base_model.load_state_dict(clean_state) + for opt, opt_state in zip(optimizers, ckpt["optimizer_states"], strict=True): + opt.load_state_dict(opt_state) + resume_step = ckpt["step"] + resume_training_time_ms = ckpt.get("training_time_ms", 0.0) + resume_tokens_consumed = args.skip_tokens or ckpt.get("tokens_consumed", resume_step * args.train_batch_tokens) + log0(f" Resumed at step {resume_step}, training_time={resume_training_time_ms:.0f}ms, " + f"skipping {resume_tokens_consumed/1e6:.0f}M tokens") + else: + log0(f"[WARN] Checkpoint not found: {resume_path}, starting from scratch") + + # --- Data loader --- + train_loader = DistributedTokenLoader(args.train_files, rank, world_size, device, skip_tokens=resume_tokens_consumed) + + def zero_grad_all(): + for opt in optimizers: + opt.zero_grad(set_to_none=True) + + max_wallclock_ms = 1000.0 * args.max_wallclock_seconds if args.max_wallclock_seconds > 0 else None + + lr_schedule = os.environ.get("LR_SCHEDULE", "piecewise") # "linear" or "piecewise" + log0(f" LR schedule: {lr_schedule}") + + def lr_mul(step, elapsed_ms): + if max_wallclock_ms is not None: + remaining_ms = max(max_wallclock_ms - elapsed_ms, 0.0) + frac = remaining_ms / max_wallclock_ms + progress = 1.0 - frac + else: + progress = step / max(args.iterations, 1) + + if lr_schedule == "linear": + # Simple linear decay: lr_start → 0 + return lr_start * max(1.0 - progress, 0.0) + + # Piecewise: fast drop → plateau → long decay + p1 = 1000.0 / max(args.iterations, 1) # end of fast drop + p2 = 5000.0 / max(args.iterations, 1) # end of plateau + + if progress <= p1: + t = progress / p1 + return lr_start * (1.0 - 0.5 * t) # 0.2 → 0.1 + elif progress <= p2: + return lr_start * 0.5 # 0.1 plateau + else: + t = (progress - p2) / (1.0 - p2) + return lr_start * 0.5 * (1.0 - t) # 0.1 → 0 + + # --- Warmup (compile) --- + if args.warmup_steps > 0: + initial_model_state = {n: t.detach().cpu().clone() for n, t in base_model.state_dict().items()} + initial_optimizer_states = [copy.deepcopy(opt.state_dict()) for opt in optimizers] + model.train() + for warmup_step in range(args.warmup_steps): + zero_grad_all() + for micro_step in range(grad_accum_steps): + if distributed: + model.require_backward_grad_sync = micro_step == grad_accum_steps - 1 + x, y = train_loader.next_batch(args.train_batch_tokens, args.train_seq_len, grad_accum_steps) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + warmup_loss = model(x, y) + (warmup_loss * grad_scale).backward() + for opt in optimizers: + opt.step() + 
zero_grad_all() + if warmup_step + 1 == args.warmup_steps: + log0(f"warmup done ({args.warmup_steps} steps)") + base_model.load_state_dict(initial_model_state, strict=True) + for opt, state in zip(optimizers, initial_optimizer_states, strict=True): + opt.load_state_dict(state) + zero_grad_all() + if distributed: + model.require_backward_grad_sync = True + train_loader = DistributedTokenLoader(args.train_files, rank, world_size, device, skip_tokens=resume_tokens_consumed) + + # --- QAT state --- + qat_activated = False + + # --- Main training loop --- + training_time_ms = resume_training_time_ms + stop_after_step = None + torch.cuda.synchronize() + t0 = time.perf_counter() + + step = resume_step + while True: + last_step = step == args.iterations or (stop_after_step is not None and step >= stop_after_step) + + should_validate = last_step or (args.val_loss_every > 0 and step > 0 and step % args.val_loss_every == 0) + if should_validate: + torch.cuda.synchronize() + training_time_ms += 1000.0 * (time.perf_counter() - t0) + # More batches for final eval, fewer for intermediate checks + eval_batches = 500 if last_step else 100 + val_loss, val_bpb = eval_val( + args, model, rank, world_size, device, grad_accum_steps, + val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut, + log_fn=log0, max_batches=eval_batches, + ) + log0(f"step:{step}/{args.iterations} val_loss:{val_loss:.4f} val_bpb:{val_bpb:.4f} " + f"train_time:{training_time_ms:.0f}ms") + torch.cuda.synchronize() + t0 = time.perf_counter() + + if last_step: + break + + elapsed_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0) + scale = lr_mul(step, elapsed_ms) + zero_grad_all() + train_loss = torch.zeros((), device=device) + + for micro_step in range(grad_accum_steps): + if distributed: + model.require_backward_grad_sync = micro_step == grad_accum_steps - 1 + x, y = train_loader.next_batch(args.train_batch_tokens, args.train_seq_len, grad_accum_steps) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + loss = model(x, y) + train_loss += loss.detach() + (loss * grad_scale).backward() + train_loss /= grad_accum_steps + + # Muon momentum warmup + cooldown + if step < args.muon_momentum_warmup_steps: + frac = step / max(args.muon_momentum_warmup_steps, 1) + muon_mom = (1 - frac) * args.muon_momentum_warmup_start + frac * args.muon_momentum + elif step >= args.muon_cooldown_start: + frac = min((step - args.muon_cooldown_start) / max(args.iterations - args.muon_cooldown_start, 1), 1.0) + muon_mom = (1 - frac) * args.muon_momentum + frac * args.muon_cooldown_end + else: + muon_mom = args.muon_momentum + for opt in [optimizer_muon, optimizer_indep]: + for group in opt.param_groups: + group["momentum"] = muon_mom + + # LR scheduling + for opt in optimizers: + for group in opt.param_groups: + group["lr"] = group["base_lr"] * scale + + if args.use_wrap: + apply_wrap_aware_gradients(base_model, args.use_wrap) + if args.grad_clip_norm > 0: + torch.nn.utils.clip_grad_norm_(base_model.parameters(), args.grad_clip_norm) + for opt in optimizers: + opt.step() + zero_grad_all() + + step += 1 + + # --- QAT activation --- + if args.qat_after_step > 0 and step >= args.qat_after_step and not qat_activated: + log0(f" >> Activating QAT at step {step} (INT{args.qat_bits})") + for group in base_model.middle_blocks: + for gen in [group.gen_q, group.gen_k, group.gen_v, group.gen_o, group.gen_up, group.gen_down]: + gen.qat_active = True + gen.qat_bits = args.qat_bits + qat_activated = True + 
approx_training_time_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0) + should_log = args.train_log_every > 0 and (step <= 10 or step % args.train_log_every == 0) + if should_log: + log0(f"step:{step}/{args.iterations} train_loss:{train_loss.item():.4f} lr_mul:{scale:.4f} " + f"train_time:{approx_training_time_ms:.0f}ms step_avg:{approx_training_time_ms/step:.2f}ms") + if args.use_wrap and step % 500 == 0: + total_near = 0 + total_params = 0 + for pname, p in base_model.middle_blocks.named_parameters(): + if "coeffs" in pname: + w = p.data - torch.round(p.data) + bp = torch.min(0.5 - w, 0.5 + w) + total_near += (bp < 0.1).sum().item() + total_params += p.numel() + log0(f" wrap stats: {total_near}/{total_params} params near border ({100*total_near/max(total_params,1):.1f}%)") + + # --- Periodic checkpoint --- + if args.checkpoint_every > 0 and step % args.checkpoint_every == 0: + save_checkpoint(step, approx_training_time_ms) + + reached_cap = max_wallclock_ms is not None and approx_training_time_ms >= max_wallclock_ms + if distributed and max_wallclock_ms is not None: + reached_cap_tensor = torch.tensor(int(reached_cap), device=device) + dist.all_reduce(reached_cap_tensor, op=dist.ReduceOp.MAX) + reached_cap = bool(reached_cap_tensor.item()) + if stop_after_step is None and reached_cap: + stop_after_step = step + + log0(f"peak memory: {torch.cuda.max_memory_allocated() // 1024 // 1024} MiB") + + # --- Serialization + quantization --- + log0("--- Saving model ---") + if master_process: + torch.save(base_model.state_dict(), "final_model.pt") + model_bytes = os.path.getsize("final_model.pt") + code_bytes = len(code.encode("utf-8")) + log0(f"Raw model: {model_bytes} bytes | Code: {code_bytes} bytes | Total: {model_bytes + code_bytes}") + + quant_bits = int(os.environ.get("QUANT_BITS", 7)) + quant_label = f"INT{quant_bits}" + quant_filename = f"final_model.int{quant_bits}.ptz" + log0(f"--- Quantizing to {quant_label}+zlib ---") + quant_obj, quant_stats = quantize_state_dict_int8(base_model.state_dict(), bits=quant_bits) + log0(f" quantized {quant_stats['num_float_tensors']} float tensors, " + f"{quant_stats['num_nonfloat_tensors']} passthrough") + quant_buf = io.BytesIO() + torch.save(quant_obj, quant_buf) + log0(f" raw quantized size: {len(quant_buf.getvalue())} bytes") + quant_blob = zlib.compress(quant_buf.getvalue(), level=9) + log0(f" zlib compressed: {len(quant_blob)} bytes") + if master_process: + with open(quant_filename, "wb") as f: + f.write(quant_blob) + quant_file_bytes = os.path.getsize(quant_filename) + code_bytes = len(code.encode("utf-8")) + log0(f"{quant_label}+zlib model: {quant_file_bytes} bytes | Total submission: {quant_file_bytes + code_bytes} bytes") + within_budget = (quant_file_bytes + code_bytes) <= 16_000_000 + log0(f"Within 16MB budget: {'YES' if within_budget else 'NO'} " + f"({'%.2f' % ((quant_file_bytes + code_bytes) / 1e6)} MB)") + + # --- Roundtrip validation --- + log0(f"--- Roundtrip validation (loading {quant_label} model) ---") + if distributed: + dist.barrier() + with open(quant_filename, "rb") as f: + quant_blob_disk = f.read() + log0(" decompressing and loading weights...") + quant_state = torch.load(io.BytesIO(zlib.decompress(quant_blob_disk)), map_location="cpu") + base_model.load_state_dict(dequantize_state_dict_int8(quant_state), strict=True) + log0(" weights loaded, starting eval...") + torch.cuda.synchronize() + q_val_loss, q_val_bpb = eval_val( + args, model, rank, world_size, device, grad_accum_steps, + val_tokens, base_bytes_lut, 
has_leading_space_lut, is_boundary_token_lut, + log_fn=log0, max_batches=500, + ) + log0(f"final_{quant_label}_zlib_roundtrip val_loss:{q_val_loss:.4f} val_bpb:{q_val_bpb:.4f}") + + if distributed: + dist.destroy_process_group() + + +if __name__ == "__main__": + main()
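+
+# Launch notes: run this file directly for single-GPU training, or via
+# torchrun for DDP (which sets RANK / WORLD_SIZE / LOCAL_RANK). Environment
+# overrides read above: LR_START (default 0.2), LR_SCHEDULE ("linear" or
+# "piecewise"), USE_COMPILE (0/1), QUANT_BITS (default 7).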