
Conversation


@MadMax129 commented Oct 23, 2025

I saw this post on X: https://x.com/karpathy/status/1981012090220581340, which describes the slow fancy-regex path used during Rust tokenizer training.

I went ahead and implemented a first version of a fairly simple C parser that specifically parses the custom regex pattern described in nanochat/tokenizer.py.

Alongside that, I provide a suite of fuzz/compare tests that verify the C regex parser splits various texts identically to the Rust fancy-regex and the Python HuggingFace tokenizer.
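For context, the pattern in question is the GPT-4-style split pattern (quoted here from memory; see nanochat/tokenizer.py for the authoritative version):

SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""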

To get the regex splits from Rust I added a simple function that is callable from Python:

use fancy_regex::Regex; // fancy-regex: find_iter yields Result<Match, Error>
use pyo3::prelude::*;

#[pyfunction]
pub fn split_text(pattern: String, text: String) -> PyResult<Vec<String>> {
    let re = Regex::new(&pattern)
        .map_err(|e| pyo3::exceptions::PyValueError::new_err(format!("Invalid regex pattern: {}", e)))?;
    let mut out: Vec<String> = Vec::new();
    for m in re.find_iter(&text) {
        let m = m
            .map_err(|e| pyo3::exceptions::PyRuntimeError::new_err(format!("Regex match failed: {}", e)))?;
        out.push(m.as_str().to_string());
    }
    Ok(out)
}
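A hypothetical usage example from Python, assuming the extension module built by maturin is named rustbpe (adjust to the actual module path):

import rustbpe

# Toy pattern for illustration only; the real split pattern lives in nanochat/tokenizer.py
pattern = r"\p{L}+|\p{N}{1,3}|\s+"
parts = rustbpe.split_text(pattern, "hello 12345 world")
print(parts)  # ['hello', ' ', '123', '45', ' ', 'world']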

How to run

Alongside the fregex.c C implementation, there are the following files:

  • fuzz.py
  • compare.py
  • bench.py

For fuzzing I experimented with generating random UTF-8 sequences of adjustable length, sometimes emitting text crafted to exercise specific rules of the regex pattern.
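A minimal sketch of the generator idea (a hypothetical helper, not the actual fuzz.py code): sample random codepoints up to an adjustable length, replacing surrogates, which cannot be encoded as UTF-8:

import random

def random_utf8_text(max_len: int, rng: random.Random) -> str:
    """Generate a random string of valid Unicode scalar values."""
    out = []
    for _ in range(rng.randint(1, max_len)):
        cp = rng.randint(0, 0x10FFFF)
        if 0xD800 <= cp <= 0xDFFF:  # surrogates are not valid scalar values
            cp = 0x20               # fall back to a space
        out.append(chr(cp))
    return "".join(out)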

More interestingly, bench.py compares the Rust split_text function against the custom C regex parser. Running on a few random UTF-8 text datasets from the internet, I found the C parser to be ~2x faster, most visibly on short texts. The speedup is likely due to the minimal dependencies, but the parser is by no means optimal: I haven't begun optimizing memory allocations or copy operations for faster splitting, and threading is another possible direction.

The main idea is to keep fregex.c simple, so users interested in what the regex pattern actually does can see how it can be implemented quite simply without an external regex engine.
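To give a flavor of the approach, here is a minimal sketch (illustrative, not the actual fregex.c code) of how a single rule of the pattern, \p{N}{1,3} (one to three numeric codepoints), can be matched by hand on top of utf8proc:

#include <stdbool.h>
#include <stddef.h>
#include "utf8proc.h"

/* True if the codepoint is in a Unicode Number category (Nd, Nl, No). */
static bool is_number(utf8proc_int32_t cp) {
    utf8proc_category_t c = utf8proc_category(cp);
    return c == UTF8PROC_CATEGORY_ND || c == UTF8PROC_CATEGORY_NL ||
           c == UTF8PROC_CATEGORY_NO;
}

/* Match \p{N}{1,3} at the start of s; returns the match length in bytes
 * (0 means no match). */
static size_t match_numbers(const utf8proc_uint8_t *s, size_t len) {
    size_t pos = 0;
    for (int count = 0; count < 3 && pos < len; count++) {
        utf8proc_int32_t cp;
        utf8proc_ssize_t n =
            utf8proc_iterate(s + pos, (utf8proc_ssize_t)(len - pos), &cp);
        if (n < 0 || !is_number(cp))
            break;
        pos += (size_t)n;
    }
    return pos;
}

One natural structure for the full splitter is then a loop that tries each alternative of the pattern at the current position and takes the first that matches, mirroring the left-to-right alternation semantics of a backtracking regex engine.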

Quick overview of commands for reproducing the results

# When inside the repo directory

# Download the small 3-file Unicode library (utf8proc) into fregex/utf8proc;
# it is compiled later together with fregex.c
https://github.com/JuliaStrings/utf8proc/releases/tag/v2.10.0

# Rebuild the Rust extension so that lib.rs exposes the split_text function
# (I had some issues with this step)
uv run maturin develop --release --manifest-path rustbpe/Cargo.toml
uv sync

# Build the shared libfregex library (mac setup) so that it can be called
# from Python, e.g. via ctypes (see the sketch after this command list)
gcc -shared -fPIC -o ./fregex/libfregex.dylib ./fregex/fregex.c ./fregex/utf8proc/utf8proc.c -I. -Iutf8proc -lm -std=c99

# Python fuzz/compare/bench

# Inside a test folder we can put any number of .txt files to run a comparison between:
# the HuggingFace tokenizer,
# the Rust-exposed split_text function, and
# fregex.c
python -m fregex.compare ./fregex/tests

# If the string splits between the implementations differ, this tool highlights
# which sequences of Unicode or ASCII characters differ

python -m fregex.fuzz --iters 10000 --max-len 1200 --seed 543543 --stop-on-first

# Generates random fuzz input to test the HuggingFace regex engine against fregex.c

python -m fregex.bench .../taylorswift.txt

# Runs a benchmark between the fregex.c splitter and the Rust implementation
# Example on the taylorswift.txt Wikipedia article Andrej used in the minbpe repo
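The test scripts call the shared library from Python; here is a hedged ctypes sketch of how that can be done (the entry point shown, fregex_split_count, is hypothetical, and the real fregex.c API may differ):

import ctypes

lib = ctypes.CDLL("./fregex/libfregex.dylib")  # path from the build step above
lib.fregex_split_count.restype = ctypes.c_size_t
lib.fregex_split_count.argtypes = [ctypes.c_char_p, ctypes.c_size_t]

text = "hello 12345 world".encode("utf-8")
n = lib.fregex_split_count(text, len(text))  # number of splits produced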

Mac M2 Pro (16 GB)
--- Dataset: taylorswift.txt (185768 bytes, 10000 iterations) ---

C tokenizer              185768 B  min=5.605ms  max=196.493ms  mean=6.123ms  median=5.890ms  stdev=2.812ms
Rust split               185768 B  min=12.680ms  max=190.186ms  mean=13.456ms  median=13.027ms  stdev=3.840ms
Speedup: 2.20x (C is faster)

Compare: OK (C vs Py splits match)

Remarks/Future

  • As mentioned, fregex.c is not optimized nearly as much as it could be
  • Some of fuzz.py and the other testing files are messy and could be simplified
  • There is a small dependency on utf8proc, but I would argue it allows for clean, readable Unicode parsing
  • If there is interest in adding something simple and clean like this, I am happy to further document and expand the code. The next steps would likely be either adding the training loop into fregex.c or, given how simple the code is, rewriting it in Rust and replacing the use of fancy-regex in the rustbpe module
  • Happy to hear feedback, especially deeper analysis from those who know more about regex parsing/optimization

@MadMax129 (Author) commented

Update

  • Added optimizations: string allocations are replaced with start/end indexes into the original input buffer. After splitting the whole input, we get back the positions where the splits occurred and can slice out the strings without extra allocations (see the sketch after this list)
  • Updated lib.rs to load the fregex.c splitter and use it during training
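A sketch of the start/end-index representation described above (the names are hypothetical, not the actual fregex.c API):

/* Each match is recorded as a byte range into the original input buffer,
 * so no per-match string allocation or copy is needed. */
typedef struct {
    size_t start;  /* byte offset of the first byte of the split */
    size_t end;    /* one past the last byte of the split */
} fregex_span;

/* The splitter fills a caller-provided array of spans; callers that need
 * the actual substrings slice them out of the input lazily. */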

Run 'test_rustbpe.py' with vocab size 65536

With fregex.c:
📊 Performance comparison:
RustBPE: 1.1606s
HuggingFace: 3.6580s
Speedup: 3.15x

Without:
📊 Performance comparison:
RustBPE: 1.3044s
HuggingFace: 3.7344s
Speedup: 2.86x

python -m scripts.tok_train --max_chars=2000000000

With fregex.c:
Training time: 28.22s

Without:
Training time: 63.26s

@svlandeg added the feature (New feature or request) and suggest/feedback labels on Oct 29, 2025
@Majdoddin commented

My PR to tiktoken has mathematically provably identical output, uses regex instead of fancy-regex, and is 6x faster.

