Prerequisites
Please answer the following questions for yourself before submitting an issue.
- I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug or useful enhancement to share.
Feature Description
Currently llama.cpp lacks support for HuggingFace's tokenization pipeline.
HuggingFace's Tokenizers library stores its pipeline configuration in a separate JSON file named "tokenizer.json": normalization, pre-tokenization, the tokenization model itself, post-processing, and decoding. This configuration carries the information needed for features like subword regularization and customizable pre-processing, which affect tokenization quality and therefore model output.
By incorporating this metadata into the gguf format, llama.cpp could expose HuggingFace's complete tokenization pipeline while keeping its single-file packaging of language models.
Motivation
Possible Implementation
We only need to add the contents of the relevant subkeys of tokenizer.json (normalizer, pre_tokenizer, model, post_processor, decoder) to the gguf metadata. Don't forget the tokenizer_config.json file, which holds the special-token configuration. A sketch of the conversion side is shown below.
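As a rough sketch, both files could be stored as single string values via ggml's gguf C API. The `tokenizer.huggingface.json` key follows the key the gguf spec reserves for the raw tokenizer.json contents; the companion key for tokenizer_config.json is made up here for illustration:

```cpp
// Sketch: embed the raw HF tokenizer files into gguf metadata.
// Assumes ggml's gguf API; tensor data is omitted in this sketch.
#include "ggml.h"

#include <fstream>
#include <sstream>
#include <string>

static std::string read_file(const char * path) {
    std::ifstream f(path);
    std::ostringstream ss;
    ss << f.rdbuf();
    return ss.str();
}

int main() {
    const std::string tok_json = read_file("tokenizer.json");
    const std::string tok_cfg  = read_file("tokenizer_config.json");

    struct gguf_context * ctx = gguf_init_empty();

    // One string value holding the entire pipeline definition
    // (normalizer, pre_tokenizer, model, post_processor, decoder).
    gguf_set_val_str(ctx, "tokenizer.huggingface.json", tok_json.c_str());
    // Hypothetical companion key for the special-token configuration.
    gguf_set_val_str(ctx, "tokenizer.huggingface.config_json", tok_cfg.c_str());

    gguf_write_to_file(ctx, "model.gguf", /*only_meta=*/true);
    gguf_free(ctx);
    return 0;
}
```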
Subsequently, the tokenizer itself can be implemented on top of that metadata. The most effortless approach is to use the pre-existing tokenizers-cpp, which wraps and binds both the HuggingFace tokenizers library and sentencepiece, and offers a minimal common C++ interface.
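On the inference side, a minimal sketch with tokenizers-cpp could look like the following; it assumes the tokenizer.json blob has already been read back out of the gguf metadata into a string, and the function name is only illustrative:

```cpp
// Sketch: drive the full HF pipeline via tokenizers-cpp
// (https://github.com/mlc-ai/tokenizers-cpp).
#include <tokenizers_cpp.h>

#include <memory>
#include <string>
#include <vector>

std::vector<int32_t> hf_tokenize(const std::string & tok_json, const std::string & text) {
    // FromBlobJSON takes the raw tokenizer.json contents, so the blob
    // pulled from gguf metadata can be used directly, no temp file needed.
    std::unique_ptr<tokenizers::Tokenizer> tok =
        tokenizers::Tokenizer::FromBlobJSON(tok_json);
    // Encode runs normalization, pre-tokenization, the model, and
    // post-processing as declared in the JSON.
    return tok->Encode(text);
}
```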
Alternatively, all tokenizer functionality could be implemented in pure C++ without any external libraries or dependencies. For a clear example of how these pipeline steps can be implemented in a single file, see the source of transformers.js's tokenizer implementation (JavaScript): https://github.com/xenova/transformers.js/blob/main/src/tokenizers.js
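To give a flavor of the dependency-free route, here is a tiny hand-rolled sketch of just the two normalizer steps from the JSON example below (Prepend "▁", then Replace " " with "▁"); a real implementation would need to cover every normalizer, pre-tokenizer, post-processor, and decoder type the pipeline can declare:

```cpp
#include <string>

// "Replace" normalizer with a String pattern: substitute every
// occurrence of `pattern` with `content`.
static std::string replace_all(std::string s, const std::string & pattern, const std::string & content) {
    for (size_t pos = 0; (pos = s.find(pattern, pos)) != std::string::npos; pos += content.size()) {
        s.replace(pos, pattern.size(), content);
    }
    return s;
}

// "Sequence" normalizer: Prepend("▁") then Replace(" " -> "▁").
std::string normalize(const std::string & text) {
    std::string out = "\xE2\x96\x81" + text;       // Prepend U+2581 "▁"
    return replace_all(out, " ", "\xE2\x96\x81");  // Replace spaces with "▁"
}
```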
Related JSON Example in tokenizer.json
```json
{
"normalizer": {
"type": "Sequence",
"normalizers": [
{
"type": "Prepend",
"prepend": "▁"
},
{
"type": "Replace",
"pattern": {
"String": " "
},
"content": "▁"
}
]
},
"pre_tokenizer": {
"type": "Sequence",
"pretokenizers": [
{"type": "WhitespaceSplit"},
{"type": "Metaspace","replacement": "▁", ...}
]
},
"model": {
"type": "BPE",
"dropout": null,
"unk_token": "<unk>",
"continuing_subword_prefix": null,
"end_of_word_suffix": null,
"fuse_unk": true,
"byte_fallback": true,
"vocab": { ... }
},
"post_processor": {
"type": "TemplateProcessing",
"single": [
{
"SpecialToken": {
"id": "<s>",
"type_id": 0
}
},
{
"Sequence": {
"id": "A",
"type_id": 0
}
}
],
"pair": [
{
"SpecialToken": {
"id": "<s>",
"type_id": 0
}
},
{
"Sequence": {
"id": "A",
"type_id": 0
}
},
{
"SpecialToken": {
"id": "<s>",
"type_id": 1
}
},
{
"Sequence": {
"id": "B",
"type_id": 1
}
}
],
"special_tokens": {
"<s>": {
"id": "<s>",
"ids": [
1
],
"tokens": [
"<s>"
]
}
}
},
"decoder": {
"type": "Sequence",
"decoders": [
{
"type": "Replace",
"pattern": {
"String": "▁"
},
"content": " "
},
{
"type": "ByteFallback"
},
{
"type": "Fuse"
},
{
"type": "Strip",
"content": " ",
"start": 1,
"stop": 0
}
]
  }
}
```
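For completeness, here is a simplified pure-C++ sketch of the decoder Sequence above (Replace "▁" with " ", ByteFallback for "<0xXX>" tokens, Fuse into one string, then Strip one leading space per start: 1, stop: 0); real byte-fallback output would additionally need to be validated as UTF-8:

```cpp
#include <string>
#include <vector>

std::string decode(const std::vector<std::string> & tokens) {
    std::string out;
    for (std::string t : tokens) {
        // ByteFallback: a token like "<0x41>" becomes the raw byte 0x41.
        if (t.size() == 6 && t.rfind("<0x", 0) == 0 && t.back() == '>') {
            out += (char) std::stoi(t.substr(3, 2), nullptr, 16);
            continue;
        }
        // Replace: "▁" (U+2581, 3 bytes in UTF-8) back to a regular space.
        for (size_t p = 0; (p = t.find("\xE2\x96\x81", p)) != std::string::npos; ) {
            t.replace(p, 3, " ");
            p += 1;
        }
        out += t; // Fuse: concatenate everything into a single string.
    }
    // Strip: drop one leading space, none from the end.
    if (!out.empty() && out.front() == ' ') {
        out.erase(0, 1);
    }
    return out;
}
```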