Enable support for more diverse tokenizers #2418
Closed
Hi, thanks for this awesome work.
I am currently trying to run inference on different LLMs (e.g., xgen and Aquila) with this project, and I consistently run into problems generating Chinese text.
Using the `--verbose-prompt` flag, I found that Chinese words are consistently tokenized into wrong token IDs. After digging into the root cause, I found that Chinese characters, which are composed of multiple bytes, are tokenized incorrectly by this part of the code.
This code works for the llama family of models primarily because llama's tokenizer lays out its byte tokens in character-encoding order, with three special tokens placed at the beginning of the vocabulary.
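As a simplified illustration of that assumption (a sketch in the spirit of the SPM byte-fallback path, not the exact llama.cpp source; `byte_fallback_fixed` is an invented name):

```cpp
// Sketch only: assumes ids 0..2 are <unk>, <s>, </s> and that the 256 byte
// tokens <0x00>..<0xFF> occupy ids 3..258 in encoding order, as in llama's
// vocabulary. This is NOT guaranteed for other tokenizers.
#include <cstdint>
#include <string>
#include <vector>

using llama_token = int32_t;

// Emit one token per raw UTF-8 byte of `text`, relying on the fixed
// "+ 3" offset that only holds for llama-style vocabularies.
std::vector<llama_token> byte_fallback_fixed(const std::string & text) {
    std::vector<llama_token> out;
    for (unsigned char byte : text) {
        // Breaks for vocabularies laid out differently (e.g., xgen, Aquila).
        out.push_back(static_cast<llama_token>(byte) + 3);
    }
    return out;
}
```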
Unfortunately, not all open-source pre-trained models adopt llama's tokenizer, including the xgen and Aquila models mentioned above.
Therefore, to support a more diverse range of pre-trained model tokenizers, I believe we should instead use the vocabulary generated by `convert.py` in this case. For example, xgen's tokenizer map looks like:
Although this PR modifies only one line of code, it brings significant benefits for supporting more models that emit multi-byte UTF-8 characters. As in #2228, enabling BPE in `convert.py` alone is not sufficient to generate Chinese text correctly without this change.
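For readers without the diff open, here is a minimal sketch of the vocabulary-based lookup this change argues for (the `token_to_id` map and the helper name are illustrative stand-ins, not the actual llama.cpp API):

```cpp
// Sketch: resolve each byte through the vocabulary actually produced by
// convert.py instead of assuming a fixed "+ 3" offset. `token_to_id` is a
// hypothetical stand-in for the model's token-string-to-id map.
#include <cstdint>
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

using llama_token = int32_t;

std::vector<llama_token> byte_fallback_lookup(
        const std::string & text,
        const std::unordered_map<std::string, llama_token> & token_to_id) {
    std::vector<llama_token> out;
    for (unsigned char byte : text) {
        // SentencePiece-style byte tokens are spelled "<0xXX>".
        char buf[8];
        std::snprintf(buf, sizeof(buf), "<0x%02X>", static_cast<unsigned>(byte));
        auto it = token_to_id.find(buf);
        if (it != token_to_id.end()) {
            out.push_back(it->second); // correct id regardless of vocab layout
        }
    }
    return out;
}
```

Looking each byte token up by its `<0xXX>` string means the mapping follows whatever IDs the model's vocabulary actually assigns, rather than assuming llama's fixed layout.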
Big thanks again for this amazing work!