-
Notifications
You must be signed in to change notification settings - Fork 3.4k
H-Net: First Learned Byte-Level Tokenization (README Wishlist) -- 1.90 BPB, 22M params #1044
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
greqone
wants to merge
2
commits into
openai:main
Choose a base branch
from
greqone:hnet-byte-tokenization
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
2 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
127 changes: 127 additions & 0 deletions
127
records/track_non_record_16mb/2026-03-28_TinyHNet_Byte260_RTX4090/README.md
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,127 @@ | ||
| # Tiny H-Net: First Learned Byte-Level Tokenization for Parameter Golf | ||
|
|
||
| **Author:** GreQ (Grzegorz Nowosielski) | ||
| **Date:** 2026-03-28 | ||
| **Track:** Non-record, unlimited compute, 16MB | ||
| **val_bpb:** 1.8989 (post int6+zstd22 quantization roundtrip) | ||
| **Hardware:** 1x RTX 4090 (local), ~2.8 hours training | ||
|
|
||
| ## Summary | ||
|
|
||
| This is the first implementation of **H-Net tokenization** (arXiv:2507.07955, Hwang/Wang/Gu, Goomba Lab) at tiny scale for the Parameter Golf challenge. H-Net was specifically listed in the README's "Requests for PRs" as a creative technique the organizers wanted to see explored. | ||
|
|
||
| Instead of using a fixed BPE/SentencePiece tokenizer, this model **learns to segment raw bytes dynamically** during training via a differentiable chunking gate. The architecture eliminates the traditional tokenization pipeline entirely. | ||
|
|
||
| ## Architecture | ||
|
|
||
| ``` | ||
| Raw bytes (vocab=260) --> Embedding(260, 512) | ||
| --> Encoder: 3x CausalDepthwiseConv1d(d=512, kernel=4) | ||
| --> ChunkingGate: cosine similarity + STE --> boundary mask | ||
| --> ChunkLayer: gather boundary tokens (~25% of sequence = ~4 bytes/chunk) | ||
| --> Main Transformer: 9 layers, d=512, 8 heads, 4 KV heads, LeakyReLU(0.5)^2 MLP 3x | ||
| --> DeChunkLayer: vectorized EMA expansion back to full byte sequence | ||
| --> Decoder: 3x CausalDepthwiseConv1d(d=512, kernel=4) | ||
| --> Tied output head --> 260-dim logits | ||
| ``` | ||
|
|
||
| **Key simplification vs. reference H-Net:** Replaced Mamba-2 SSM layers (which require custom CUDA kernels from `mamba_ssm`/`causal_conv1d`) with pure-PyTorch depthwise causal Conv1d. This eliminates all exotic dependencies while preserving the core dynamic chunking mechanism. | ||
|
|
||
| **Total parameters:** 22,178,377 | ||
| **Compressed artifact:** 15,443,775 bytes (int6 + zstd-22), well under the 16MB limit. | ||
|
|
||
| ## How Dynamic Chunking Works | ||
|
|
||
| 1. The **byte encoder** (3 causal conv layers) processes raw UTF-8 bytes into hidden representations | ||
| 2. The **ChunkingGate** computes cosine similarity between consecutive encoder outputs. High dissimilarity triggers a boundary | ||
| 3. **Straight-Through Estimation (STE)** makes the discrete boundary decision differentiable | ||
| 4. A **chunk ratio auxiliary loss** steers the gate toward a target boundary density (~25%) | ||
| 5. The **ChunkLayer** gathers hidden states at boundary positions, compressing the sequence ~4x | ||
| 6. The main **transformer** processes these compressed chunks (with causal + padding attention masks) | ||
| 7. The **DeChunkLayer** expands back to full byte length using learned exponential moving average (EMA) decay | ||
| 8. The **byte decoder** (3 causal conv layers) produces final representations for next-byte prediction | ||
|
|
||
| The gate learned to create boundaries approximately every 4 bytes on average, which is remarkably close to the average bytes-per-token ratio of BPE tokenizers -- the model independently discovered a similar compression ratio. | ||
|
|
||
| ## Results | ||
|
|
||
| | Step | val_bpb | Notes | | ||
| |------|---------|-------| | ||
| | 0 | 7.9934 | Random init | | ||
| | 5,000 | 2.2424 | Gate converged to ~25% ratio | | ||
| | 10,000 | 2.0568 | | | ||
| | 15,000 | 2.0399 | | | ||
| | 20,000 | **1.9002** | Pre-quantization | | ||
| | 20,000 (int6) | **1.8989** | Post-quantization roundtrip | | ||
|
|
||
| The 1.90 BPB is not competitive with the BPE transformer SOTA (~1.12 BPB), which is expected: a byte-level model must learn character-level patterns that BPE tokenization solves for free. The value of this submission is architectural novelty, not BPB optimization. | ||
|
|
||
| ## Key Engineering Challenges Solved | ||
|
|
||
| 1. **Gate initialization:** The cosine similarity threshold must be carefully tuned. Too high = no boundaries (ratio ~0.002), too low = everything is a boundary (ratio ~1.0). We use `sigmoid(-3.0) = 0.047` as the initial threshold with a strong ratio loss (weight=1.0) to steer convergence. | ||
|
|
||
| 2. **Vectorized ChunkLayer/DeChunkLayer:** Naive Python for-loops over the batch dimension are too slow for training. We use cumsum-based segment ID computation, scatter operations for chunking, and broadcasted exponential decay for dechunking -- all fully vectorized, no batch-dim loops. | ||
|
|
||
| 3. **Rotary cache poisoning:** PyTorch's `torch.inference_mode()` creates tensors that cannot participate in autograd. The Rotary positional embedding cache must be cleared after every eval_val call to prevent `RuntimeError: Inference tensors cannot be saved for backward`. | ||
|
|
||
| 4. **Byte-level data conversion:** The competition's HF dataset does not include byte260 shards. We wrote a converter that decodes sp1024 shards back to text via SentencePiece, then re-encodes as byte260 tokens. | ||
|
|
||
| ## Reproduction | ||
|
|
||
| ```bash | ||
| # 1. Prepare byte260 data (requires sp1024 data + tokenizer already present) | ||
| python data/convert_sp_to_byte260.py | ||
|
|
||
| # 2. Train (single GPU, ~2.8 hours on RTX 4090) | ||
| RUN_ID=hnet_v6_20k \ | ||
| ITERATIONS=20000 \ | ||
| VAL_LOSS_EVERY=5000 \ | ||
| TRAIN_LOG_EVERY=200 \ | ||
| TRAIN_BATCH_TOKENS=65536 \ | ||
| ENABLE_TORCH_COMPILE=0 \ | ||
| WARMUP_STEPS=5 \ | ||
| python train_hnet.py | ||
| ``` | ||
|
|
||
| ## What Could Improve This | ||
|
|
||
| - **More training:** Loss was still decreasing at step 20K. 100K+ steps would likely push below 1.7 BPB. | ||
| - **More data:** We only used 1 train shard (244M bytes). The full dataset has 80 shards. | ||
| - **Replace Conv1d with actual Mamba-2:** The reference H-Net uses Mamba-2 SSM for encoder/decoder, which has longer effective receptive field than our 3-layer kernel-4 conv (10 positions). | ||
| - **2-stage H-Net:** The reference architecture supports nested hierarchical chunking for additional compression. | ||
| - **Larger model:** With only 15.4MB of the 16MB budget used, there's room for more transformer layers or wider dimensions. | ||
| - **SWA/EMA:** Stochastic weight averaging was not implemented for this initial submission. | ||
| - **torch.compile:** Disabled due to dynamic shapes from chunking. Could be enabled for the fixed-shape transformer portion only. | ||
|
|
||
| ## Files | ||
|
|
||
| - `train_hnet.py` -- Complete self-contained training script | ||
| - `submission.json` -- Submission metadata | ||
| - `README.md` -- This file | ||
|
|
||
| ## Data Preparation | ||
|
|
||
| The byte260 shards are not published on HF. To generate them, decode sp1024 shards back to text via SentencePiece, then re-encode as byte260 tokens. A simple converter: | ||
|
|
||
| ```python | ||
| # Requires: sp1024 shards + fineweb_1024_bpe.model already present | ||
| import sentencepiece as spm, numpy as np, glob | ||
| from pathlib import Path | ||
|
|
||
| sp = spm.SentencePieceProcessor(model_file="data/tokenizers/fineweb_1024_bpe.model") | ||
| BYTE_OFFSET, BOS_ID, MAGIC = 4, 1, 20240520 | ||
|
|
||
| for src in sorted(glob.glob("data/datasets/fineweb10B_sp1024/fineweb_*.bin")): | ||
| header = np.fromfile(src, dtype="<i4", count=256) | ||
| tokens = np.fromfile(src, dtype="<u2", count=int(header[2]), offset=1024) | ||
| text = sp.decode([int(t) for t in tokens if t >= 4]) | ||
| byte_tokens = np.array([BOS_ID] + [b + BYTE_OFFSET for b in text.encode("utf-8")], dtype="<u2") | ||
| dst = Path(src.replace("sp1024", "byte260")) | ||
| dst.parent.mkdir(parents=True, exist_ok=True) | ||
| hdr = np.zeros(256, dtype="<i4"); hdr[0], hdr[1], hdr[2] = MAGIC, 1, len(byte_tokens) | ||
| with open(dst, "wb") as f: f.write(hdr.tobytes()); f.write(byte_tokens.tobytes()) | ||
| ``` | ||
|
|
||
| ## Note on DDP | ||
|
|
||
| The `forward_with_aux()` call bypasses DDP wrapping intentionally -- this submission targets single-GPU training only. For multi-GPU, the forward call should go through the DDP wrapper with aux loss support added to `forward()`. | ||
19 changes: 19 additions & 0 deletions
19
records/track_non_record_16mb/2026-03-28_TinyHNet_Byte260_RTX4090/submission.json
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,19 @@ | ||
| { | ||
| "author": "GreQ (Grzegorz Nowosielski)", | ||
| "github_id": "greq333", | ||
| "name": "Tiny H-Net: First Learned Byte-Level Tokenization for Parameter Golf", | ||
| "blurb": "Non-record submission implementing H-Net (arXiv:2507.07955) at tiny scale: byte-level input with dynamic chunking gate (cosine similarity + STE) that learns to segment raw bytes into variable-length chunks (~4 bytes avg), processed by a 9-layer transformer, expanded via EMA dechunking. Replaces Mamba-2 SSM with pure-PyTorch depthwise causal Conv1d. First-ever sub-100M H-Net. Post-quantization val_bpb 1.8989 under 16MB.", | ||
| "date": "2026-03-28T22:00:00Z", | ||
| "track": "non-record-unlimited-compute-16mb", | ||
| "val_loss": 1.3171, | ||
| "val_bpb": 1.9002, | ||
| "pre_quant_val_loss": 1.3171, | ||
| "pre_quant_val_bpb": 1.9002, | ||
| "post_quant_val_loss": 1.3162, | ||
| "post_quant_val_bpb": 1.8989, | ||
| "step_stop": 20000, | ||
| "wallclock_seconds": 10175, | ||
| "bytes_total": 15512976, | ||
| "bytes_model_int6_zstd": 15443775, | ||
| "bytes_code": 69201 | ||
| } |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reproduction instructions reference
data/convert_sp_to_byte260.py, but that script does not appear to be included anywhere in this repository (search forconvert_sp_to_byte260returns no matches). Either add the converter script to the PR or update the README to point at the correct existing path/tooling needed to generate the byte260 shards.