
Conversation


@MadMax129 commented Oct 23, 2025

I saw this post on X: https://x.com/karpathy/status/1981012090220581340, which describes the slow fancy-regex path used during Rust tokenizer training.

I went ahead and implemented a first version of a fairly simple C parser that specifically parses the custom regex pattern described in nanochat/tokenizer.py.

Alongside that, I provide a suite of fuzz/compare tests that verify the C regex parser splits various texts identically to the Rust fancy-regex and the Python HuggingFace tokenizer.
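For context, the pattern in question is the GPT-4-style split pattern (quoted here from memory; see nanochat/tokenizer.py for the authoritative version):

SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""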

To get the regex splits from Rust I added a simple function that is callable from Python:

use fancy_regex::Regex; // fancy-regex: find_iter yields Result<Match, Error>
use pyo3::prelude::*;

#[pyfunction]
pub fn split_text(pattern: String, text: String) -> PyResult<Vec<String>> {
    let re = Regex::new(&pattern)
        .map_err(|e| pyo3::exceptions::PyValueError::new_err(format!("Invalid regex pattern: {}", e)))?;
    let mut out: Vec<String> = Vec::new();
    for m in re.find_iter(&text) {
        let m = m
            .map_err(|e| pyo3::exceptions::PyRuntimeError::new_err(format!("Regex match failed: {}", e)))?;
        out.push(m.as_str().to_string());
    }
    Ok(out)
}
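A hypothetical usage example from Python, assuming the extension module built by maturin is named rustbpe (adjust to the actual module path):

import rustbpe

# Toy pattern for illustration only; the real split pattern lives in nanochat/tokenizer.py
pattern = r"\p{L}+|\p{N}{1,3}|\s+"
parts = rustbpe.split_text(pattern, "hello 12345 world")
print(parts)  # ['hello', ' ', '123', '45', ' ', 'world']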

How to run

Alongside the fregex.c C implementation, there are the following files:

  • fuzz.py
  • compare.py
  • bench.py

For fuzzing I experimented with generating random UTF-8 sequences of adjustable length, sometimes emitting text crafted to exercise specific rules of the regex pattern.
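A minimal sketch of the generator idea (a hypothetical helper, not the actual fuzz.py code): sample random codepoints up to an adjustable length, replacing surrogates, which cannot be encoded as UTF-8:

import random

def random_utf8_text(max_len: int, rng: random.Random) -> str:
    """Generate a random string of valid Unicode scalar values."""
    out = []
    for _ in range(rng.randint(1, max_len)):
        cp = rng.randint(0, 0x10FFFF)
        if 0xD800 <= cp <= 0xDFFF:  # surrogates are not valid scalar values
            cp = 0x20               # fall back to a space
        out.append(chr(cp))
    return "".join(out)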

More interestingly, bench.py compares the Rust split_text function against the custom C regex parser. Running on a few random UTF-8 text datasets from the internet, I found the C parser to be ~2x faster, most visibly on short texts. The speedup is likely due to the minimal dependencies, but the parser is by no means optimal: I haven't begun optimizing memory allocations or copy operations for faster splitting, and threading is another possible direction.

The main idea is to keep fregex.c simple, so users interested in what the regex pattern actually does can see how it can be implemented quite simply without an external regex engine.
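To give a flavor of the approach, here is a minimal sketch (illustrative, not the actual fregex.c code) of how a single rule of the pattern, \p{N}{1,3} (one to three numeric codepoints), can be matched by hand on top of utf8proc:

#include <stdbool.h>
#include <stddef.h>
#include "utf8proc.h"

/* True if the codepoint is in a Unicode Number category (Nd, Nl, No). */
static bool is_number(utf8proc_int32_t cp) {
    utf8proc_category_t c = utf8proc_category(cp);
    return c == UTF8PROC_CATEGORY_ND || c == UTF8PROC_CATEGORY_NL ||
           c == UTF8PROC_CATEGORY_NO;
}

/* Match \p{N}{1,3} at the start of s; returns the match length in bytes
 * (0 means no match). */
static size_t match_numbers(const utf8proc_uint8_t *s, size_t len) {
    size_t pos = 0;
    for (int count = 0; count < 3 && pos < len; count++) {
        utf8proc_int32_t cp;
        utf8proc_ssize_t n =
            utf8proc_iterate(s + pos, (utf8proc_ssize_t)(len - pos), &cp);
        if (n < 0 || !is_number(cp))
            break;
        pos += (size_t)n;
    }
    return pos;
}

One natural structure for the full splitter is then a loop that tries each alternative of the pattern at the current position and takes the first that matches, mirroring the left-to-right alternation semantics of a backtracking regex engine.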

Quick overview of commands for reproducing the results

# When inside the repo directory

# Download the small 3-file Unicode library (utf8proc) into fregex/utf8proc;
# it is compiled later together with fregex.c
https://github.com/JuliaStrings/utf8proc/releases/tag/v2.10.0

# Rebuild the Rust extension so that lib.rs exposes the split_text function
# (I had some issues with this step)
uv run maturin develop --release --manifest-path rustbpe/Cargo.toml
uv sync

# Build the shared libfregex library (mac setup) so that it can be called
# from Python, e.g. via ctypes (see the sketch after this command list)
gcc -shared -fPIC -o ./fregex/libfregex.dylib ./fregex/fregex.c ./fregex/utf8proc/utf8proc.c -I. -Iutf8proc -lm -std=c99

# Python fuzz/compare/bench

# Inside a test folder we can put any number of .txt files to run a comparison between:
# the HuggingFace tokenizer,
# the Rust-exposed split_text function, and
# fregex.c
python -m fregex.compare ./fregex/tests

# If the string splits between the implementations differ, this tool highlights
# which sequences of Unicode or ASCII characters differ

python -m fregex.fuzz --iters 10000 --max-len 1200 --seed 543543 --stop-on-first

# Generates random fuzz input to test the HuggingFace regex engine against fregex.c

python -m fregex.bench .../taylorswift.txt

# Runs a benchmark between the fregex.c splitter and the Rust implementation
# Example on the taylorswift.txt Wikipedia article Andrej used in the minbpe repo
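The test scripts call the shared library from Python; here is a hedged ctypes sketch of how that can be done (the entry point shown, fregex_split_count, is hypothetical, and the real fregex.c API may differ):

import ctypes

lib = ctypes.CDLL("./fregex/libfregex.dylib")  # path from the build step above
lib.fregex_split_count.restype = ctypes.c_size_t
lib.fregex_split_count.argtypes = [ctypes.c_char_p, ctypes.c_size_t]

text = "hello 12345 world".encode("utf-8")
n = lib.fregex_split_count(text, len(text))  # number of splits produced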

Mac M2 Pro (16 GB)
--- Dataset: taylorswift.txt (185768 bytes, 10000 iterations) ---

C tokenizer              185768 B  min=5.605ms  max=196.493ms  mean=6.123ms  median=5.890ms  stdev=2.812ms
Rust split               185768 B  min=12.680ms  max=190.186ms  mean=13.456ms  median=13.027ms  stdev=3.840ms
Speedup: 2.20x (C is faster)

Compare: OK (C vs Py splits match)

Remarks/Future

  • As mentioned, fregex.c is not optimized nearly as much as it could be
  • Some of fuzz.py and the other testing files are messy and could be simplified
  • There is a small dependency on utf8proc, but I would argue it allows for clean, readable Unicode parsing
  • If there is interest in adding something simple and clean like this, I am happy to further document and expand the code. The next steps would likely be either adding the training loop into fregex.c or, given how simple the code is, rewriting it in Rust and replacing the use of fancy-regex in the rustbpe module
  • Happy to hear feedback, especially deeper analysis from those who know more about regex parsing/optimization

@MadMax129 (Author) commented

Update

  • Added optimizations: string allocations are replaced with start/end indexes into the original input buffer. After splitting the whole input, we get back the positions where the splits occurred and can slice out the strings without extra allocations (see the sketch after this list)
  • Updated lib.rs to load the fregex.c splitter and use it during training
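A sketch of the start/end-index representation described above (the names are hypothetical, not the actual fregex.c API):

/* Each match is recorded as a byte range into the original input buffer,
 * so no per-match string allocation or copy is needed. */
typedef struct {
    size_t start;  /* byte offset of the first byte of the split */
    size_t end;    /* one past the last byte of the split */
} fregex_span;

/* The splitter fills a caller-provided array of spans; callers that need
 * the actual substrings slice them out of the input lazily. */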

Run 'test_rustbpe.py' with vocab size 65536

With fregex.c:
📊 Performance comparison:
RustBPE: 1.1606s
HuggingFace: 3.6580s
Speedup: 3.15x

Without:
📊 Performance comparison:
RustBPE: 1.3044s
HuggingFace: 3.7344s
Speedup: 2.86x

python -m scripts.tok_train --max_chars=2000000000

With fregex.c:
Training time: 28.22s

Without:
Training time: 63.26s

@svlandeg added the feature (New feature or request) and suggest/feedback labels on Oct 29, 2025
@Majdoddin commented

My PR to tiktoken has mathematically provably identical output, uses regex instead of fancy-regex, and is 6x faster.

