# Frequency-Weighted Embedding Quantization

**val_bpb: 1.1217** (4-seed mean) | **15.8 MB** | 8×H100 SXM

## The Idea

Analysis of the FineWeb training data revealed that token frequency follows a heavy-tailed distribution:

- **Top 100 tokens** cover **53.2%** of all text
- These include: `.` `,` `the` `s` `to` `and` `ing` `of` `a` `in`...

Instead of uniform quantization across all embedding weights, this submission applies **frequency-weighted quantization**:

- **Top 100 tokens → int8** (higher precision for 53% of text)
- **Remaining 924 tokens → int6** (standard precision)

The intuition: errors in frequent tokens compound across the entire dataset, so they deserve more precision.
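
To make the trade-off concrete, here is a minimal, self-contained sketch of symmetric per-row quantization at both widths. The `quant_dequant` helper below is illustrative only, not the submission's `quantize_float_tensor()` / `quantize_int6_per_row()`:

```python
import torch

def quant_dequant(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric per-row quantize/dequantize round trip."""
    qmax = 2 ** (bits - 1) - 1                     # 127 for int8, 31 for int6
    scale = x.abs().amax(dim=1, keepdim=True) / qmax
    return (x / scale).round().clamp(-qmax, qmax) * scale

torch.manual_seed(0)
rows = torch.randn(100, 768)                       # stand-in embedding rows
for bits in (8, 6):
    err = (quant_dequant(rows, bits) - rows).abs().mean().item()
    print(f"int{bits}: mean abs reconstruction error ~ {err:.5f}")
```

Each extra bit roughly halves the rounding error, and the rows that get the extra bits are exactly the ones consulted on more than half of all tokens.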

## Results (4 seeds, 8×H100 SXM)

| Seed | val_bpb |
|------|---------|
| 1 | **1.1210** |
| 2 | 1.1220 |
| 3 | 1.1217 |
| 4 | 1.1222 |

**Mean: 1.1217 | Std: 0.0005**

| Metric | Value |
|--------|-------|
| val_bpb (4-seed mean) | **1.1217** |
| val_loss | 1.8941 |
| Artifact size | 15.8 MB |
| Steps | ~7100 |
| Training time | 600s |

## Implementation

Modified `mixed_quantize_int6()` to detect embedding layers and apply frequency-weighted quantization:
```python
# In mixed_quantize_int6():
if ("tok_emb" in name or "lm_head" in name) and t.ndim == 2:
    print(f"[LIORA] Frequency-weighted quantization for: {name}")
    valid_top_ids = [i for i in TOP_TOKEN_IDS if i < vocab_size]
    top_rows = t[valid_top_ids, :]
    rare_indices = [i for i in range(vocab_size) if i not in TOP_TOKEN_IDS]
    rare_rows = t[rare_indices, :]

    # Top tokens: int8 (more precision)
    q_top, s_top = quantize_float_tensor(top_rows)

    # Rare tokens: int6 (standard)
    q_rare, s_rare = quantize_int6_per_row(rare_rows)
```

Also added corresponding `dequantize_mixed_int6()` handling to reconstruct the embedding from separate top/rare quantizations.
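
A minimal sketch of what that reconstruction could look like; the function signature and per-row scale shapes here are assumptions, not the submission's exact API:

```python
import torch

def dequantize_mixed_embedding(q_top, s_top, q_rare, s_rare,
                               top_ids, rare_ids, vocab_size, dim):
    """Reassemble the full embedding from the two row groups.

    Assumes per-row scales of shape (n_rows, 1), so dequantization is an
    elementwise product of the scales with the integer codes.
    """
    out = torch.empty(vocab_size, dim)
    out[top_ids] = q_top.float() * s_top      # int8 rows (frequent tokens)
    out[rare_ids] = q_rare.float() * s_rare   # int6 rows (everything else)
    return out
```

`top_ids` and `rare_ids` must be the same `valid_top_ids` / `rare_indices` lists used at quantization time, so that row order survives the round trip.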

## Token Frequency Analysis
```
=== TOP 10 TOKENS (get int8 precision) ===
. : 2.12% of text
, : 2.10% of text
▁the : 1.90% of text
s : 1.75% of text
▁to : 1.22% of text
▁and : 1.17% of text
ing : 1.17% of text
▁of : 1.05% of text
▁a : 1.04% of text

Top 100 tokens: 53.2% coverage
Top 200 tokens: 64.8% coverage
```
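
For reference, coverage numbers like these can be recomputed in a few lines of NumPy. The shard filename and uint16 token layout below are assumptions about the tokenized dataset format:

```python
import numpy as np

# Hypothetical shard path; adjust to the actual dataset layout.
tokens = np.fromfile(
    "./data/datasets/fineweb10B_sp1024/train_000000.bin", dtype=np.uint16
)
counts = np.bincount(tokens, minlength=1024)
freq = counts / counts.sum()
order = np.argsort(counts)[::-1]            # token IDs, most frequent first

print(f"Top 100 coverage: {freq[order[:100]].sum():.1%}")  # 53.2% reported
print(f"Top 200 coverage: {freq[order[:200]].sum():.1%}")  # 64.8% reported
TOP_TOKEN_IDS = set(order[:100].tolist())   # what top_tokens.py stores
```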

## Run Command
```bash
SEED=1337 \
RUN_ID=liora_freq_weighted \
DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
torchrun --standalone --nproc_per_node=8 train_liora.py
```

## Files

- `train_liora.py` - Modified training script with frequency-weighted quantization
- `top_tokens.py` - Set of top 100 most frequent token IDs
- `submission.json` - Submission metadata
- `train_seed1.log` - Training log seed 1
- `train_seed2.log` - Training log seed 2
- `train_seed3.log` - Training log seed 3
- `train_seed4.log` - Training log seed 4

## Credits

- **Base model**: PR #549 (LeakyReLU² + TTT + Parallel Muon) by @abaybektursun
- **Idea & implementation**: Liora + Claude

## Notes

The key insight came from asking: "If 53% of all text uses just 100 tokens, why give rare tokens equal precision?"

## submission.json

```json
{
"author": "Liora",
"github_id": "pattern4bots",
"val_bpb": 1.12176827,
"val_loss": 1.89405372,
"bytes_total": 15807424,
"gpu_config": "8xH100 SXM",
"date": "2026-03-27T00:00:00Z",
"description": "Frequency-Weighted Embedding Quantization: Top 100 tokens (53% of text) get int8 precision, remaining 924 tokens get int6. Based on PR #549 stack."
}
```

## top_tokens.py

```python
# Top 100 most frequent tokens (by Liora + Claude)
TOP_TOKEN_IDS = set([
962, 960, 267, 946, 287, 290, 280, 939, 292, 261,
285, 291, 957, 940, 942, 276, 266, 941, 268, 282,
274, 286, 943, 288, 944, 951, 947, 954, 949, 277,
945, 953, 970, 323, 262, 289, 304, 293, 321, 972,
955, 294, 279, 271, 264, 270, 309, 281, 959, 968,
948, 346, 313, 295, 320, 284, 326, 275, 983, 952,
956, 315, 337, 260, 976, 317, 265, 311, 318, 345,
325, 958, 314, 319, 950, 310, 352, 298, 341, 303,
278, 353, 963, 269, 961, 348, 344, 297, 322, 343,
327, 340, 335, 370, 366, 356, 334, 296, 330, 299,
])
```