Faster Regex pattern parsing in C #161
Open
+1,443
−25
I saw this post on X: https://x.com/karpathy/status/1981012090220581340. It described how slow fancy-regex is when used during Rust tokenizer training.
I went ahead and implemented a first version of a fairly simple C parser that parses specifically the custom regex pattern described in nanochat/tokenizer.py.
Alongside that, I provided a suite of fuzz/compare tests to verify that the C regex parser splits various texts the same way as the Rust fancy-regex and the Python Hugging Face tokenizer.
To get the regex splits from Rust, I added a simple function that is callable from Python:
How to run
Alongside the fregex.c C implementation, there are the following files:
For fuzzing, I played around with generating random UTF-8 sequences of adjustable length, which sometimes also include text meant to exercise specific rules of the regex pattern.
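As a rough illustration of that fuzzing idea, here is a minimal sketch of a generator that mixes random Unicode code points with hand-picked snippets. The names `SNIPPETS`, `random_codepoint`, and `random_text` are hypothetical and are not taken from the fuzzer in this PR; the snippet list is only a guess at inputs that tend to stress tokenizer regex rules (contractions, digit runs, newlines, trailing whitespace).

```python
import random

# Hypothetical snippets that tend to hit tokenizer-regex edge cases.
# This list is illustrative; it is not the one used by the PR's fuzzer.
SNIPPETS = ["'s", "'LL", " ", "\n", "\r\n", "\t", "123", "4567", "!?", "héllo", "日本語", "  \n"]


def random_codepoint(rng):
    """Return one random Unicode scalar value as a string."""
    while True:
        cp = rng.randint(0, 0x10FFFF)
        # Surrogates (U+D800..U+DFFF) cannot be encoded as UTF-8, so skip them.
        if not 0xD800 <= cp <= 0xDFFF:
            return chr(cp)


def random_text(rng, max_len=64, snippet_prob=0.3):
    """Concatenate up to max_len pieces, each a snippet or a random code point."""
    pieces = []
    for _ in range(rng.randint(0, max_len)):
        if rng.random() < snippet_prob:
            pieces.append(rng.choice(SNIPPETS))
        else:
            pieces.append(random_codepoint(rng))
    return "".join(pieces)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a failing case can be reproduced
    for _ in range(3):
        sample = random_text(rng, max_len=16)
        # ascii() keeps the output printable on any terminal encoding.
        print(ascii(sample), len(sample.encode("utf-8")))
```

Seeding the generator makes a mismatch between two splitters reproducible, which matters more for this kind of comparison testing than raw input variety.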
More interestingly, bench.py compares the Rust split_text function against the custom C regex parser. When running on a few random UTF-8 text datasets from the internet, I found the C parser to be ~2x faster, most visibly in cases where the text is short. This speedup is likely due to the small number of dependencies, and the implementation is by no means optimal. I haven't even begun optimizing memory allocations or copy operations for faster regex splitting; threading is another possible direction.
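The comparison described above can be sketched as a small harness that first checks two splitters agree and then times them. This is not the code in bench.py: `split_a` and `split_b` are hypothetical stand-ins (a `re`-based whitespace splitter and a hand-written equivalent) used in place of the Rust split_text function and the C parser, so the timings it prints say nothing about the ~2x figure.

```python
import re
import time
from itertools import groupby


def split_a(text):
    """Stand-in for a regex-engine splitter: runs of whitespace or non-whitespace."""
    return re.findall(r"\s+|\S+", text)


def split_b(text):
    """Stand-in for a hand-written splitter producing the same pieces."""
    return ["".join(group) for _, group in groupby(text, key=str.isspace)]


def bench(fn, texts, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` full passes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for t in texts:
            fn(t)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    texts = ["hello world", "a  b\nc", "", "short", "x " * 1000]
    # Correctness first: a faster splitter is useless if the splits differ.
    for t in texts:
        assert split_a(t) == split_b(t), ascii(t)
    time_a = bench(split_a, texts)
    time_b = bench(split_b, texts)
    print(f"split_a: {time_a:.6f}s  split_b: {time_b:.6f}s")
```

Taking the minimum over several passes reduces noise from the scheduler and caches, which is especially relevant for the short inputs where per-call overhead dominates.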
The main idea is to keep fregex.c simple, so that users who are interested in what is really going on in the regex pattern can see how it can be implemented quite simply, without external regex engines.
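To give a flavour of what matching a rule without a regex engine looks like, here is a minimal sketch in Python rather than C. It assumes the pattern contains a GPT-style contraction alternative such as `'(?i:[sdmt]|ll|ve|re)`; that rule, the function name `match_contraction`, and the ASCII-only case handling are assumptions for illustration and are not copied from fregex.c.

```python
def match_contraction(text, i):
    """Return the end index of a contraction starting at text[i], or i if none.

    Hand-written equivalent of the assumed rule '(?i:[sdmt]|ll|ve|re),
    treating case-insensitivity as plain lowercasing of the suffix.
    """
    if i >= len(text) or text[i] != "'":
        return i
    one = text[i + 1:i + 2].lower()
    if one in {"s", "d", "m", "t"}:
        return i + 2
    two = text[i + 1:i + 3].lower()
    if two in ("ll", "ve", "re"):
        return i + 3
    return i


if __name__ == "__main__":
    assert match_contraction("don't", 3) == 5   # matches 't
    assert match_contraction("we'LL", 2) == 5   # case-insensitive 'll
    assert match_contraction("'x", 0) == 0      # no rule applies
```

A full splitter would try one such function per alternative, in the pattern's order, at each position. Edge cases such as Unicode case folding are where a hand-written matcher and a regex engine are most likely to disagree, which is what the fuzz/compare tests are there to catch.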
Quick overview of commands related to reproducing results
Remarks/Future