
Bug: Phi-3 Tokenizer Adds Whitespaces on re-tokenization (which invalidates KV-cache) #7938

@Harsha-Nori

What happened?

The llama.cpp tokenizer for Phi-3 has odd behavior: re-tokenizing the same text over and over keeps prepending whitespace to the first non-BOS token. This causes two problems:

  1. It doesn't match the original tokenizer behavior from Huggingface Transformers.
  2. Re-processing the same text invalidates the kv-cache, forcing another prompt fill of all the input tokens (see the sketch after this list).
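
To make point 2 concrete, here is a minimal sketch (not actual llama.cpp or Guidance code) of the prefix matching a caching layer typically performs; the token ids are taken from the repro further down. Once the spurious token 29871 appears at index 1, only the BOS token is reusable:

# Hypothetical prefix-reuse check, assuming a cache keyed on leading tokens.
def shared_prefix_len(old_tokens, new_tokens):
    """Number of leading tokens whose KV entries can be reused."""
    n = 0
    for a, b in zip(old_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

old = [1, 6324, 306, 626, 263, 7251, 9759]         # first tokenization
new = [1, 29871, 6324, 306, 626, 263, 7251, 9759]  # after one round trip
print(shared_prefix_len(old, new))  # 1 -- only BOS matches; everything
                                    # after it must be prompt-filled again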

I maintain the Guidance library (https://github.com/guidance-ai/guidance), where we often need to re-tokenize inputs after adding templated/deterministic text from the user. This is causing a significant performance regression when using Phi-3 via llama.cpp in Guidance whenever we go through this cycle :(. I believe pretty much all constrained generation libraries are likely affected by this too.

Here's an example of the bug in action, using the llama-cpp-python bindings (which are very thin wrappers around the tokenizer).

The model I'm using: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/blob/main/Phi-3-mini-4k-instruct-q4.gguf

import llama_cpp
print(llama_cpp.__version__) # '0.2.78' -- filing here because it seems like it's a lower level bug in llama.cpp

model = llama_cpp.Llama(model_path="Phi-3-mini-4k-instruct-q4.gguf", logits_all=True)
tokenizer = llama_cpp.LlamaTokenizer(model)

test_str = "Hi I am a hippo"
test_tokens = tokenizer.tokenize(test_str.encode("utf-8")) # [1, 6324, 306, 626, 263, 7251, 9759]

retokenized = b''.join([tokenizer.detokenize([i]) for i in test_tokens]) # b' Hi I am a hippo'
retokenized_tokens = tokenizer.tokenize(retokenized) # [1, 29871, 6324, 306, 626, 263, 7251, 9759]

retokenized2 = b''.join([tokenizer.detokenize([i]) for i in retokenized_tokens]) # b'  Hi I am a hippo'

Note how the token at index 1 carries a continually growing run of whitespace when going through the tokenize/detokenize cycle. Each repetition adds one more space:

" Hi I am a hippo" ->
"  Hi I am a hippo" ->
"   Hi I am a hippo" ->
"    Hi I am a hippo" ...

This is the heart of the issue, and it doesn't happen with the original tokenizer implementation in Transformers.
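
For reference, a sketch of the equivalent round trip with the Transformers tokenizer (assuming the microsoft/Phi-3-mini-4k-instruct repo; exact ids may vary with tokenizer versions), which stays stable:

from transformers import AutoTokenizer

hf_tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

text = "Hi I am a hippo"
ids = hf_tok.encode(text)
decoded = hf_tok.decode(ids, skip_special_tokens=True)

print(decoded == text)                # True: the round trip returns the input
print(hf_tok.encode(decoded) == ids)  # True: no extra whitespace token appears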

Name and Version

llama-cpp-python is using this commit for their latest release: fd5ea0f

What operating system are you seeing the problem on?

Linux, Mac, Windows

Relevant log output

No response


Labels

bug-unconfirmed, medium severity (malfunctioning features but still usable), stale
