Description
What happened?
The llama.cpp tokenizer for Phi-3 has odd behavior: re-tokenizing the same text over and over keeps adding whitespace to the first non-BOS token. This causes two problems:
- It doesn't match the original tokenizer behavior from Hugging Face Transformers
- Re-processing the same text causes the kv-cache to be invalidated, forcing another prompt fill of all the input tokens.
I maintain the Guidance library (https://github.com/guidance-ai/guidance), where we often need to re-tokenize inputs after adding templated/deterministic text from the user. This is causing a significant performance regression when using Phi-3 via llama.cpp in Guidance whenever we go through this cycle :(. I believe pretty much all constrained generation libraries are likely affected by this too.
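To make the kv-cache point concrete, here is a minimal sketch (not Guidance's actual caching code; common_prefix_len is just an illustrative helper) of why the extra token forces a full prefill: the cache can only be reused for the longest common token prefix, and the token inserted right after BOS shrinks that prefix to a single token.

# Illustrative only -- not Guidance's real caching logic.
# Token ids are taken from the repro further down.
def common_prefix_len(a, b):
    """Number of leading tokens shared by two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

old_tokens = [1, 6324, 306, 626, 263, 7251, 9759]         # first tokenization
new_tokens = [1, 29871, 6324, 306, 626, 263, 7251, 9759]  # after one detokenize/tokenize round trip
print(common_prefix_len(old_tokens, new_tokens))  # 1 -> only BOS is reusable; everything else must be prefilled again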
Here's an example of the bug in action (using the llama-cpp-python bindings, which are very thin wrappers around the tokenizer):
The model I'm using: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/blob/main/Phi-3-mini-4k-instruct-q4.gguf
import llama_cpp
print(llama_cpp.__version__) # '0.2.78' -- filing here because it seems like it's a lower level bug in llama.cpp
model = llama_cpp.Llama(model_path="Phi-3-mini-4k-instruct-q4.gguf", logits_all=True)
tokenizer = llama_cpp.LlamaTokenizer(model)
test_str = "Hi I am a hippo"
test_tokens = tokenizer.tokenize(test_str.encode("utf-8")) # [1, 6324, 306, 626, 263, 7251, 9759]
retokenized = b''.join([tokenizer.detokenize([i]) for i in test_tokens]) # b' Hi I am a hippo'
retokenized_tokens = tokenizer.tokenize(retokenized) # [1, 29871, 6324, 306, 626, 263, 7251, 9759]
retokenized2 = b''.join([tokenizer.detokenize([i]) for i in retokenized_tokens]) # b'  Hi I am a hippo'
Note how the token at index 1 has a continually growing whitespace when going through the tokenize/detokenize cycle. Repeating this process keeps increasing the leading whitespace:
" Hi I am a hippo" -> "  Hi I am a hippo" -> "   Hi I am a hippo" -> "    Hi I am a hippo" -> ...
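The growth can also be seen in a short loop (a minimal sketch reusing tokenizer and test_str from the snippet above; if the bug is present, the decoded bytes keep growing on each round trip):

# Reuses `tokenizer` and `test_str` defined above.
data = test_str.encode("utf-8")
for i in range(4):
    tokens = tokenizer.tokenize(data)
    data = b''.join(tokenizer.detokenize([t]) for t in tokens)
    print(i, data)  # leading whitespace keeps growing with every tokenize/detokenize round trip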
This is the heart of the issue, and it doesn't happen with the original tokenizer implementation in Transformers.
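For comparison, this is roughly how the reference behavior can be checked (a sketch assuming the transformers package and the original microsoft/Phi-3-mini-4k-instruct repo rather than the GGUF file; exact ids may differ, but the round trip should not keep injecting whitespace):

from transformers import AutoTokenizer

hf_tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
ids = hf_tok("Hi I am a hippo").input_ids
text = hf_tok.decode(ids, skip_special_tokens=True)
ids_again = hf_tok(text).input_ids
print(ids)
print(ids_again)  # expected: stable ids, no whitespace token accumulating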
Name and Version
llama-cpp-python is using this commit for their latest release: fd5ea0f
What operating system are you seeing the problem on?
Linux, Mac, Windows
Relevant log output
No response