# Tiny H-Net: First Learned Byte-Level Tokenization for Parameter Golf

**Author:** GreQ (Grzegorz Nowosielski)
**Date:** 2026-03-28
**Track:** Non-record, unlimited compute, 16MB
**val_bpb:** 1.8989 (post int6+zstd22 quantization roundtrip)
**Hardware:** 1x RTX 4090 (local), ~2.8 hours training

## Summary

This is the first implementation of **H-Net tokenization** (arXiv:2507.07955, Hwang/Wang/Gu, Goomba Lab) at tiny scale for the Parameter Golf challenge. H-Net was specifically listed in the README's "Requests for PRs" as a creative technique the organizers wanted to see explored.

Instead of using a fixed BPE/SentencePiece tokenizer, this model **learns to segment raw bytes dynamically** during training via a differentiable chunking gate. The architecture eliminates the traditional tokenization pipeline entirely.

## Architecture

```
Raw bytes (vocab=260) --> Embedding(260, 512)
--> Encoder: 3x CausalDepthwiseConv1d(d=512, kernel=4)
--> ChunkingGate: cosine similarity + STE --> boundary mask
--> ChunkLayer: gather boundary tokens (~25% of sequence = ~4 bytes/chunk)
--> Main Transformer: 9 layers, d=512, 8 heads, 4 KV heads, LeakyReLU(0.5)^2 MLP 3x
--> DeChunkLayer: vectorized EMA expansion back to full byte sequence
--> Decoder: 3x CausalDepthwiseConv1d(d=512, kernel=4)
--> Tied output head --> 260-dim logits
```

**Key simplification vs. reference H-Net:** Replaced Mamba-2 SSM layers (which require custom CUDA kernels from `mamba_ssm`/`causal_conv1d`) with pure-PyTorch depthwise causal Conv1d. This eliminates all exotic dependencies while preserving the core dynamic chunking mechanism.

**Total parameters:** 22,178,377
**Compressed artifact:** 15,443,775 bytes (int6 + zstd-22), well under the 16MB limit.
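Quick arithmetic on those two numbers (illustrative only, not the submission's packing code): the packed int6 weights alone come to about 16.6MB, so the zstd-22 pass buys roughly 1.2MB of headroom:

```python
PARAMS = 22_178_377
raw_int6_bytes = PARAMS * 6 // 8        # 16,633,782 bytes of packed 6-bit weights
artifact_bytes = 15_443_775             # reported int6 + zstd-22 artifact size
print(raw_int6_bytes - artifact_bytes)  # 1,190,007 bytes recovered by zstd-22
```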

## How Dynamic Chunking Works

1. The **byte encoder** (3 causal conv layers) processes raw UTF-8 bytes into hidden representations.
2. The **ChunkingGate** computes cosine similarity between consecutive encoder outputs; high dissimilarity triggers a boundary.
3. **Straight-Through Estimation (STE)** makes the discrete boundary decision differentiable.
4. A **chunk ratio auxiliary loss** steers the gate toward a target boundary density (~25%).
5. The **ChunkLayer** gathers hidden states at boundary positions, compressing the sequence ~4x.
6. The main **transformer** processes these compressed chunks (with causal + padding attention masks).
7. The **DeChunkLayer** expands back to full byte length using a learned exponential moving average (EMA) decay.
8. The **byte decoder** (3 causal conv layers) produces the final representations for next-byte prediction.

The gate learned to create boundaries approximately every 4 bytes on average, which is remarkably close to the average bytes-per-token ratio of BPE tokenizers -- the model independently discovered a similar compression ratio.

## Results

| Step | val_bpb | Notes |
|------|---------|-------|
| 0 | 7.9934 | Random init |
| 5,000 | 2.2424 | Gate converged to ~25% ratio |
| 10,000 | 2.0568 | |
| 15,000 | 2.0399 | |
| 20,000 | **1.9002** | Pre-quantization |
| 20,000 (int6) | **1.8989** | Post-quantization roundtrip |

The 1.90 BPB is not competitive with the BPE transformer SOTA (~1.12 BPB), which is expected: a byte-level model must learn character-level patterns that BPE tokenization solves for free. The value of this submission is architectural novelty, not BPB optimization.
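As a sanity check on the table: a byte-level model makes one prediction per byte, so bpb is just the validation cross-entropy in nats divided by ln 2, with no bytes-per-token correction. The submission metadata's val_loss of 1.3171 reproduces the 1.9002 figure:

```python
import math

def nats_to_bpb(val_loss_nats: float) -> float:
    # One prediction per byte, so bpb = nat loss / ln(2).
    return val_loss_nats / math.log(2)

print(round(nats_to_bpb(1.3171), 4))  # 1.9002
```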

## Key Engineering Challenges Solved

1. **Gate initialization:** The cosine similarity threshold must be carefully tuned. Too high = no boundaries (ratio ~0.002), too low = everything is a boundary (ratio ~1.0). We use `sigmoid(-3.0) = 0.047` as the initial threshold with a strong ratio loss (weight=1.0) to steer convergence.

2. **Vectorized ChunkLayer/DeChunkLayer:** Naive Python for-loops over the batch dimension are too slow for training. We use cumsum-based segment ID computation, scatter operations for chunking, and broadcasted exponential decay for dechunking -- all fully vectorized, no batch-dim loops.

3. **Rotary cache poisoning:** PyTorch's `torch.inference_mode()` creates tensors that cannot participate in autograd. The rotary positional embedding cache must be cleared after every `eval_val` call to prevent `RuntimeError: Inference tensors cannot be saved for backward`.

4. **Byte-level data conversion:** The competition's HF dataset does not include byte260 shards. We wrote a converter that decodes sp1024 shards back to text via SentencePiece, then re-encodes as byte260 tokens.
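Challenge 2 above admits a compact sketch. The function names and the scalar `decay` (which the model learns) are illustrative assumptions; the pattern is cumsum segment IDs, `scatter_add_` to gather boundary states, and a broadcasted `decay ** offset` for dechunking, with no Python loop over the batch:

```python
import torch

def chunk_gather(h, boundary):
    """h: (B, T, D); boundary: (B, T) 0/1 mask with boundary[:, 0] == 1."""
    B, T, D = h.shape
    seg_id = boundary.long().cumsum(dim=1) - 1      # chunk index per position
    n_chunks = int(seg_id.max()) + 1                # max over batch; rest is padding
    idx = seg_id.clamp(min=0).unsqueeze(-1).expand(B, T, D)
    out = torch.zeros(B, n_chunks, D, dtype=h.dtype)
    # Non-boundary positions contribute zeros, so scatter_add_ leaves exactly
    # the hidden state at each boundary position in its chunk slot.
    out.scatter_add_(1, idx, h * boundary.unsqueeze(-1).to(h.dtype))
    return out, seg_id

def dechunk_ema(chunks, seg_id, boundary, decay=0.9):
    """Expand chunks to byte length, decaying from each chunk's boundary."""
    B, T = seg_id.shape
    pos = torch.arange(T).expand(B, T)
    # Start position of each chunk, recovered with one more scatter_add_.
    start = torch.zeros(B, chunks.shape[1], dtype=torch.long)
    start.scatter_add_(1, seg_id.clamp(min=0), pos * boundary.long())
    offset = pos - start.gather(1, seg_id)          # position within its chunk
    gathered = chunks.gather(1, seg_id.unsqueeze(-1).expand(B, T, chunks.shape[2]))
    return gathered * (decay ** offset.to(chunks.dtype)).unsqueeze(-1)
```

For example, a boundary mask `[1, 0, 0, 1, 0, 0]` gathers positions 0 and 3 into two chunks, and dechunking scales each chunk's vector by `decay ** 0, decay ** 1, decay ** 2` across its three bytes.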

## Reproduction

```bash
# 1. Prepare byte260 data (requires sp1024 data + tokenizer already present)
python data/convert_sp_to_byte260.py

# 2. Train (single GPU, ~2.8 hours on RTX 4090)
RUN_ID=hnet_v6_20k \
ITERATIONS=20000 \
VAL_LOSS_EVERY=5000 \
TRAIN_LOG_EVERY=200 \
TRAIN_BATCH_TOKENS=65536 \
ENABLE_TORCH_COMPILE=0 \
WARMUP_STEPS=5 \
python train_hnet.py
```

## What Could Improve This

- **More training:** Loss was still decreasing at step 20K. 100K+ steps would likely push below 1.7 BPB.
- **More data:** We only used 1 train shard (244M bytes). The full dataset has 80 shards.
- **Replace Conv1d with actual Mamba-2:** The reference H-Net uses Mamba-2 SSM for encoder/decoder, which has longer effective receptive field than our 3-layer kernel-4 conv (10 positions).
- **2-stage H-Net:** The reference architecture supports nested hierarchical chunking for additional compression.
- **Larger model:** With only 15.4MB of the 16MB budget used, there's room for more transformer layers or wider dimensions.
- **SWA/EMA:** Stochastic weight averaging was not implemented for this initial submission.
- **torch.compile:** Disabled due to dynamic shapes from chunking. Could be enabled for the fixed-shape transformer portion only.

## Files

- `train_hnet.py` -- Complete self-contained training script
- `submission.json` -- Submission metadata
- `README.md` -- This file

## Data Preparation

The byte260 shards are not published on HF. To generate them, decode sp1024 shards back to text via SentencePiece, then re-encode as byte260 tokens. A simple converter:

```python
# Requires: sp1024 shards + fineweb_1024_bpe.model already present
import sentencepiece as spm, numpy as np, glob
from pathlib import Path

sp = spm.SentencePieceProcessor(model_file="data/tokenizers/fineweb_1024_bpe.model")
BYTE_OFFSET, BOS_ID, MAGIC = 4, 1, 20240520

for src in sorted(glob.glob("data/datasets/fineweb10B_sp1024/fineweb_*.bin")):
    header = np.fromfile(src, dtype="<i4", count=256)
    tokens = np.fromfile(src, dtype="<u2", count=int(header[2]), offset=1024)
    text = sp.decode([int(t) for t in tokens if t >= 4])
    byte_tokens = np.array([BOS_ID] + [b + BYTE_OFFSET for b in text.encode("utf-8")], dtype="<u2")
    dst = Path(src.replace("sp1024", "byte260"))
    dst.parent.mkdir(parents=True, exist_ok=True)
    hdr = np.zeros(256, dtype="<i4")
    hdr[0], hdr[1], hdr[2] = MAGIC, 1, len(byte_tokens)
    with open(dst, "wb") as f:
        f.write(hdr.tobytes())
        f.write(byte_tokens.tobytes())
```

## Note on DDP

The `forward_with_aux()` call bypasses DDP wrapping intentionally -- this submission targets single-GPU training only. For multi-GPU, the forward call should go through the DDP wrapper with aux loss support added to `forward()`.
## submission.json
{
  "author": "GreQ (Grzegorz Nowosielski)",
  "github_id": "greq333",
  "name": "Tiny H-Net: First Learned Byte-Level Tokenization for Parameter Golf",
  "blurb": "Non-record submission implementing H-Net (arXiv:2507.07955) at tiny scale: byte-level input with dynamic chunking gate (cosine similarity + STE) that learns to segment raw bytes into variable-length chunks (~4 bytes avg), processed by a 9-layer transformer, expanded via EMA dechunking. Replaces Mamba-2 SSM with pure-PyTorch depthwise causal Conv1d. First-ever sub-100M H-Net. Post-quantization val_bpb 1.8989 under 16MB.",
  "date": "2026-03-28T22:00:00Z",
  "track": "non-record-unlimited-compute-16mb",
  "val_loss": 1.3171,
  "val_bpb": 1.9002,
  "pre_quant_val_loss": 1.3171,
  "pre_quant_val_bpb": 1.9002,
  "post_quant_val_loss": 1.3162,
  "post_quant_val_bpb": 1.8989,
  "step_stop": 20000,
  "wallclock_seconds": 10175,
  "bytes_total": 15512976,
  "bytes_model_int6_zstd": 15443775,
  "bytes_code": 69201
}