Enable support more diverse tokenizers #2418

eric8607242 · 2023-07-27T08:22:04Z

Hi, thanks for this awesome work.

I am currently trying to run inference on other LLMs (e.g., xgen and Aquila) with this project.

However, I always run into problems when generating Chinese text.
Using the --verbose-prompt flag, I found that Chinese characters are always tokenized into the wrong token IDs.
After digging into the root cause, I found that Chinese characters, which are encoded as multiple bytes, are always tokenized incorrectly by this line:

llama_vocab::id token_id = static_cast<uint8_t>(symbol.text[j]) + 3;

This code works for the llama series of models mainly because llama's tokenizer lays out its byte tokens in byte-value order, directly after three special tokens at the beginning of the vocabulary:

'<unk>': 0,
'<s>': 1,
'</s>': 2,
'<0x00>': 3,
'<0x01>': 4,
'<0x02>': 5,
'<0x03>': 6,
...
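
To make the offset concrete, here is a minimal sketch of what the line above does for every byte of a multi-byte character. The helper name `byte_fallback_ids` is hypothetical and not part of llama.cpp; it only assumes the layout listed above, where '<0x00>' has ID 3.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not from llama.cpp): map each raw byte of `text`
// to a token ID, assuming the llama layout where '<0x00>' has ID 3.
std::vector<int> byte_fallback_ids(const std::string & text) {
    std::vector<int> ids;
    for (char c : text) {
        // Cast through uint8_t first: plain char may be signed, so bytes
        // >= 0x80 (every byte of a multi-byte UTF-8 sequence) could
        // otherwise become negative.
        ids.push_back(static_cast<uint8_t>(c) + 3);
    }
    return ids;
}
```

For the character 中 (UTF-8 bytes E4 B8 AD), `byte_fallback_ids("\xE4\xB8\xAD")` gives 231, 187, 176. Those IDs are only meaningful if the vocabulary really places '<0xE4>', '<0xB8>' and '<0xAD>' at those positions.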

Unfortunately, not all open-source pre-trained models adopt llama's tokenizer; xgen and Aquila, mentioned above, do not.
Therefore, to support a wider range of pre-trained model tokenizers, I believe we should look up these token IDs in the vocabulary generated by convert.py in this case.

For example, xgen's tokenizer map looks like this:

b'!': 0,
b'"': 1,
b'#': 2,
b'$': 3,
b'%': 4,
b'&': 5,
b"'": 6,
b'(': 7,
...
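
The idea of consulting the vocabulary could be sketched roughly as follows. This is an illustration only, not the actual diff of this PR: the names `vocab_map`, `token_id` and `bytes_to_ids` are hypothetical, and it assumes the converted vocabulary stores each byte token under a one-byte key, which may not hold for every tokenizer.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using token_id = int;

// Hypothetical vocabulary type: token text -> token ID, as it might be
// loaded from the file written by convert.py.
using vocab_map = std::unordered_map<std::string, token_id>;

// Resolve each byte through the vocabulary first; fall back to the
// llama-style fixed offset only when the byte has no entry of its own.
std::vector<token_id> bytes_to_ids(const std::string & text,
                                   const vocab_map & token_to_id) {
    std::vector<token_id> ids;
    for (char c : text) {
        const std::string key(1, c);  // one-byte lookup key (assumption)
        auto it = token_to_id.find(key);
        if (it != token_to_id.end()) {
            ids.push_back(it->second);
        } else {
            ids.push_back(static_cast<uint8_t>(c) + 3);
        }
    }
    return ids;
}
```

With a map holding the xgen entries above, '!' resolves to 0, whereas the fixed offset alone would have produced 36.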

Although this PR modifies only one line of code, it brings significant benefits for supporting more models on multi-byte UTF-8 text. As in #2228, enabling BPE in convert.py alone is not sufficient to run inference on Chinese text correctly without this modification.

Big thanks for this amazing work again!

klosax · 2023-07-27T08:59:55Z

Note: The upcoming gguf file format (ggml-org/ggml#302) will bring support for llamas using the gpt2 bpe tokenizer (Aquila).

eric8607242 · 2023-07-27T09:03:12Z

Sorry, this pull request was opened incorrectly. I have opened #2420 as the correct PR.

supporting more diverse tokenizers

a6c25eb

eric8607242 closed this Jul 27, 2023

eric8607242 deleted the flexible-vocab branch July 27, 2023 08:48
