@@ -0,0 +1 @@
fused-softcap-ce @ git+https://github.com/anthony-maio/fused-softcap-ce.git
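
The new dependency, together with the logit_softcap: 30.0 hyperparameter in the logs below, suggests the training loss is a tanh-softcapped cross-entropy with the cap and the loss fused into one kernel. A minimal unfused PyTorch sketch of what such a loss computes; the tanh-cap formula and the function name are assumptions for illustration, not taken from the fused-softcap-ce package itself:

import torch
import torch.nn.functional as F

def softcap_cross_entropy(logits: torch.Tensor, targets: torch.Tensor,
                          softcap: float = 30.0) -> torch.Tensor:
    # Squash logits into (-softcap, softcap) with a scaled tanh, then apply
    # ordinary cross-entropy. A fused kernel would do both in one pass without
    # materializing the capped logits; this reference version only shows the
    # math. The default cap of 30.0 matches the logit_softcap value logged below.
    capped = softcap * torch.tanh(logits / softcap)
    return F.cross_entropy(capped.reshape(-1, capped.size(-1)), targets.reshape(-1))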
470 changes: 470 additions & 0 deletions records/track_10min_16mb/2026-04-12_SP8192_Frontier/train_gpt_sota.py

Large diffs are not rendered by default.

137 changes: 137 additions & 0 deletions records/track_10min_16mb/2026-04-12_SP8192_Frontier/train_seed1337.log
@@ -0,0 +1,137 @@
W0412 14:14:57.052000 35802 torch/distributed/run.py:803]
W0412 14:14:57.052000 35802 torch/distributed/run.py:803] *****************************************
W0412 14:14:57.052000 35802 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0412 14:14:57.052000 35802 torch/distributed/run.py:803] *****************************************
Hyperparameters:
adam_eps: 1e-08
adam_wd: 0.02
beta1: 0.9
beta2: 0.95
compressor: brotli
data_dir: ./data
datasets_dir: ./data/datasets/fineweb10B_sp8192
distributed: True
ema_decay: 0.997
embed_bits: 8
embed_clip_sigmas: 20.0
embed_lr: 0.6
embed_wd: 0.085
embedding_dim: 512
enable_looping_at: 0.5
eval_seq_len: 2048
eval_stride: 64
gptq_calibration_batches: 64
gptq_reserve_seconds: 12.0
grad_accum_steps: 1
grad_clip_norm: 0.3
head_lr: 0.008
is_main_process: True
iterations: 20000
ln_scale: True
local_rank: 0
logfile: logs/sp8192_seed1337.txt
logit_softcap: 30.0
loop_end: 5
loop_start: 4
matrix_bits: 6
matrix_clip_sigmas: 12.85
matrix_lr: 0.02
max_wallclock_seconds: 600.0
min_lr: 0.0
mlp_mult: 4.0
model_dim: 512
model_path: final_model.pt
muon_backend_steps: 5
muon_beta2: 0.95
muon_momentum: 0.99
muon_momentum_warmup_start: 0.92
muon_momentum_warmup_steps: 1500
muon_row_normalize: True
muon_wd: 0.085
num_heads: 8
num_kv_heads: 4
num_layers: 11
num_loops: 2
qk_gain_init: 4.0
quantized_model_path: final_model.int6.ptz
rank: 0
rope_base: 10000.0
rope_dims: 16
rope_train_seq_len: 2048
run_id: sp8192_seed1337
scalar_lr: 0.02
seed: 1337
skip_gates_enabled: True
sliding_window_enabled: True
tie_embeddings: True
tied_embed_init_std: 0.005
tied_embed_lr: 0.03
tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model
train_batch_tokens: 786432
train_files: ./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin
train_log_every: 500
train_seq_len: 2048
val_batch_tokens: 524288
val_files: ./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin
val_loss_every: 4000
vocab_size: 8192
warmdown_frac: 0.667
warmup_steps: 20
world_size: 8
xsa_last_n: 11
train_shards: 80
val_tokens: 40548352
model_params:35943512
gptq:reserving 12s, effective=588000ms
warmup_step: 1/20
warmup_step: 2/20
warmup_step: 3/20
warmup_step: 4/20
warmup_step: 5/20
warmup_step: 6/20
warmup_step: 10/20
warmup_step: 20/20
loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 4] decoder:[5, 4, 5, 6, 7, 8, 9, 10]
loop_warmup_step: 1/20
loop_warmup_step: 2/20
loop_warmup_step: 3/20
loop_warmup_step: 4/20
loop_warmup_step: 5/20
loop_warmup_step: 6/20
loop_warmup_step: 10/20
loop_warmup_step: 20/20
0/20000 val_loss: 9.0047 val_bpb: 3.4867
1/20000 train_loss: 9.0080 train_time: 0.0m tok/s: 8089272
2/20000 train_loss: 12.3015 train_time: 0.0m tok/s: 8022559
3/20000 train_loss: 11.0711 train_time: 0.0m tok/s: 7954927
4/20000 train_loss: 9.4520 train_time: 0.0m tok/s: 7918173
5/20000 train_loss: 8.3679 train_time: 0.0m tok/s: 7892396
500/20000 train_loss: 3.3349 train_time: 0.9m tok/s: 7690797
1000/20000 train_loss: 3.2063 train_time: 1.7m tok/s: 7685016
1500/20000 train_loss: 3.0906 train_time: 2.6m tok/s: 7688746
2000/20000 train_loss: 3.0213 train_time: 3.4m tok/s: 7689501
2500/20000 train_loss: 3.0327 train_time: 4.3m tok/s: 7692100
layer_loop:enabled step:2877 frac:0.500 encoder:[0, 1, 2, 3, 4, 5, 4] decoder:[5, 4, 5, 6, 7, 8, 9, 10]
3000/20000 train_loss: 3.0867 train_time: 5.2m tok/s: 7563624
3500/20000 train_loss: 2.9550 train_time: 6.3m tok/s: 7265791
4000/20000 train_loss: 2.9969 train_time: 7.5m tok/s: 7031651
4000/20000 val_loss: 2.9178 val_bpb: 1.1298
4500/20000 train_loss: 2.8096 train_time: 8.6m tok/s: 6882955
5000/20000 train_loss: 2.7590 train_time: 9.7m tok/s: 6766993
5052/20000 val_loss: 2.8139 val_bpb: 1.0896
stopping_early: wallclock_cap train_time: 588041ms step: 5052/20000
peak memory allocated: 35373 MiB reserved: 35478 MiB
ema:applying EMA weights
pre-quantization post-ema val_loss:2.81131292 val_bpb:1.08857004 eval_time:6825ms
Serialized model: 135426937 bytes
Code size: 58367 bytes
GPTQ:collecting Hessians from calibration data...
GPTQ:collected 67 Hessians in 11.3s
Quantized weights:
gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight
gptq (int8): tok_emb.weight
passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights
Serialized model quantized+brotli: 15970240 bytes
Total submission size quantized+brotli: 16028607 bytes
quantized val_loss:2.84129693 val_bpb:1.10018017 eval_time:22233ms
quantized_sliding_window val_loss:2.79834517 val_bpb:1.08354879 eval_time:83683ms
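
For reference, the val_bpb figures in this log are consistent with converting the mean cross-entropy from nats per token to bits per byte using the validation set's token-to-byte ratio (about 0.2684 tokens per byte, i.e. roughly 151 MB of raw text behind the 40,548,352 validation tokens; that byte count is inferred from the logged numbers, not read from the script). A minimal sketch of the conversion:

import math

def loss_to_bpb(val_loss_nats: float, val_tokens: int, val_bytes: int) -> float:
    # bits per token = nats per token / ln(2); scaling by tokens/bytes
    # turns it into bits per byte of raw text.
    return (val_loss_nats / math.log(2)) * (val_tokens / val_bytes)

# e.g. loss_to_bpb(2.9178, 40_548_352, 151_000_000) ~= 1.13, matching the
# step-4000 line above (the exact byte count here is an assumption).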
@@ -0,0 +1,148 @@
W0412 17:41:11.842000 48239 torch/distributed/run.py:803]
W0412 17:41:11.842000 48239 torch/distributed/run.py:803] *****************************************
W0412 17:41:11.842000 48239 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0412 17:41:11.842000 48239 torch/distributed/run.py:803] *****************************************
Hyperparameters:
adam_eps: 1e-08
adam_wd: 0.02
beta1: 0.9
beta2: 0.95
compressor: brotli
data_dir: /workspace/data
datasets_dir: /workspace/data/datasets/fineweb10B_sp8192
distributed: True
ema_decay: 0.9965
embed_bits: 8
embed_clip_sigmas: 20.0
embed_lr: 0.6
embed_wd: 0.085
embedding_dim: 512
enable_looping_at: 0.35
etlb_clip: 3.0
etlb_enabled: False
etlb_lr: 0.05
etlb_steps: 5
eval_seq_len: 2048
eval_stride: 64
gptq_calibration_batches: 64
gptq_reserve_seconds: 12.0
grad_accum_steps: 1
grad_clip_norm: 0.3
head_lr: 0.008
is_main_process: True
iterations: 20000
ln_scale: True
local_rank: 0
logfile: logs/frontier_seed1337.txt
logit_softcap: 30.0
loop_end: 5
loop_start: 3
matrix_bits: 6
matrix_clip_sigmas: 12.85
matrix_lr: 0.022
max_wallclock_seconds: 600.0
min_lr: 0.0
mlp_mult: 4.0
model_dim: 512
model_path: final_model.pt
muon_backend_steps: 5
muon_beta2: 0.95
muon_momentum: 0.99
muon_momentum_warmup_start: 0.92
muon_momentum_warmup_steps: 1500
muon_row_normalize: True
muon_wd: 0.095
num_heads: 8
num_kv_heads: 4
num_layers: 11
num_loops: 2
parallel_residual_start: 7
qk_gain_init: 5.25
quantized_model_path: final_model.int6.ptz
rank: 0
rope_base: 10000.0
rope_dims: 16
rope_train_seq_len: 2048
run_id: frontier_seed1337
scalar_lr: 0.02
seed: 1337
skip_gates_enabled: True
sliding_window_enabled: True
tie_embeddings: True
tied_embed_init_std: 0.005
tied_embed_lr: 0.03
tokenizer_path: /workspace/data/tokenizers/fineweb_8192_bpe.model
train_batch_tokens: 786432
train_files: /workspace/data/datasets/fineweb10B_sp8192/fineweb_train_*.bin
train_log_every: 500
train_seq_len: 2048
ttt_chunk_tokens: 32768
ttt_enabled: True
ttt_epochs: 3
ttt_lr: 0.005
ttt_momentum: 0.9
val_batch_tokens: 524288
val_files: /workspace/data/datasets/fineweb10B_sp8192/fineweb_val_*.bin
val_loss_every: 4000
vocab_size: 8192
warmdown_frac: 0.72
warmup_steps: 20
world_size: 8
xsa_last_n: 11
train_shards: 80
val_tokens: 40548352
model_params:35944536
gptq:reserving 12s, effective=588000ms
warmup_step: 1/20
warmup_step: 2/20
warmup_step: 3/20
warmup_step: 4/20
warmup_step: 5/20
warmup_step: 6/20
warmup_step: 10/20
warmup_step: 20/20
loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
loop_warmup_step: 1/20
loop_warmup_step: 2/20
loop_warmup_step: 3/20
loop_warmup_step: 4/20
loop_warmup_step: 5/20
loop_warmup_step: 6/20
loop_warmup_step: 10/20
loop_warmup_step: 20/20
0/20000 val_loss: 9.0047 val_bpb: 3.4867
1/20000 train_loss: 9.0080 train_time: 0.0m tok/s: 8336072
2/20000 train_loss: 12.2992 train_time: 0.0m tok/s: 8184327
3/20000 train_loss: 11.0456 train_time: 0.0m tok/s: 8084574
4/20000 train_loss: 9.4139 train_time: 0.0m tok/s: 8030457
5/20000 train_loss: 8.3296 train_time: 0.0m tok/s: 7997738
500/20000 train_loss: 3.3332 train_time: 0.8m tok/s: 7731821
1000/20000 train_loss: 3.2115 train_time: 1.7m tok/s: 7728010
1500/20000 train_loss: 3.0985 train_time: 2.5m tok/s: 7736121
2000/20000 train_loss: 3.0193 train_time: 3.4m tok/s: 7741721
layer_loop:enabled step:2026 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
2500/20000 train_loss: 2.9987 train_time: 4.6m tok/s: 7114884
3000/20000 train_loss: 3.0367 train_time: 5.8m tok/s: 6727898
3500/20000 train_loss: 2.9188 train_time: 7.1m tok/s: 6476757
4000/20000 train_loss: 2.9547 train_time: 8.3m tok/s: 6299690
4000/20000 val_loss: 2.8728 val_bpb: 1.1124
4500/20000 train_loss: 2.7579 train_time: 9.6m tok/s: 6170374
4598/20000 val_loss: 2.8075 val_bpb: 1.0871
stopping_early: wallclock_cap train_time: 588092ms step: 4598/20000
peak memory allocated: 39046 MiB reserved: 39070 MiB
ema:applying EMA weights
pre-quantization post-ema val_loss:2.80424019 val_bpb:1.08583141 eval_time:6825ms
Serialized model: 135431033 bytes
Code size: 16791 bytes
GPTQ:collecting Hessians from calibration data...
GPTQ:collected 67 Hessians in 12.7s
Quantized weights:
gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight
gptq (int8): tok_emb.weight
passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights
Serialized model quantized+brotli: 15975659 bytes
Total submission size quantized+brotli: 15992450 bytes
quantized val_loss:2.83421669 val_bpb:1.09743862 eval_time:8477ms
quantized_sliding_window val_loss:2.79040941 val_bpb:1.08047598 eval_time:88503ms
ttt:start chunks=1238 ttt_lr=0.005 ttt_epochs=3
quantized_ttt val_loss:2.78678937 val_bpb:1.07907426 eval_time:334602ms
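
The final quantized_ttt line comes from the test-time-training pass configured by ttt_chunk_tokens, ttt_lr, ttt_momentum and ttt_epochs above: the 40.5M-token validation stream is split into 1238 chunks of 32,768 tokens and the model keeps adapting as it evaluates. A minimal sketch of one common streaming TTT protocol, assuming each chunk is scored before the model takes its ttt_epochs of SGD steps on it (so no token is predicted by weights that have already trained on it); the optimizer choice and the exact ordering in train_gpt_sota.py may differ:

import torch
import torch.nn.functional as F

def ttt_eval(model, chunks, ttt_lr=0.005, ttt_momentum=0.9, ttt_epochs=3):
    # chunks: iterable of 1-D LongTensors of token ids (here 32,768 tokens each).
    # Plain SGD with momentum is an assumption; only lr, momentum, epochs and
    # the chunk count appear in the log.
    opt = torch.optim.SGD(model.parameters(), lr=ttt_lr, momentum=ttt_momentum)
    total_nats, total_tokens = 0.0, 0
    for chunk in chunks:
        inputs, targets = chunk[:-1].unsqueeze(0), chunk[1:]
        # 1) Score the chunk with the current, not-yet-adapted weights.
        with torch.no_grad():
            loss = F.cross_entropy(model(inputs).squeeze(0), targets)
        total_nats += loss.item() * targets.numel()
        total_tokens += targets.numel()
        # 2) Adapt on the chunk before moving on, so later chunks benefit.
        for _ in range(ttt_epochs):
            opt.zero_grad()
            F.cross_entropy(model(inputs).squeeze(0), targets).backward()
            opt.step()
    return total_nats / total_tokens  # mean loss in nats per token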