Prerequisites
Please answer the following questions for yourself before submitting an issue.
- I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug or useful enhancement to share.
Feature Description
Currently llama.cpp lacks support for HuggingFace's tokenization pipeline.
HuggingFace's Tokenizers library stores its pipeline configuration in a separate JSON file named "tokenizer.json": normalization, pre-tokenization, the tokenization model itself, post-processing, and decoding. This configuration carries the information needed for features like subword regularization and customizable pre-processing, which affect tokenization quality and therefore model output.
By incorporating this metadata into the gguf format, llama.cpp could expose HuggingFace's complete tokenization pipeline while keeping its single-file packaging of language models.
Motivation
Possible Implementation
We only need to add the contents of the relevant subkeys of tokenizer.json (normalizer, pre_tokenizer, model, post_processor, decoder) to the gguf metadata. Don't forget the tokenizer_config.json file, which holds the special-token configuration. A sketch of the conversion side is shown below.
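As a rough sketch, both files could be stored as single string values via ggml's gguf C API. The `tokenizer.huggingface.json` key follows the key the gguf spec reserves for the raw tokenizer.json contents; the companion key for tokenizer_config.json is made up here for illustration:

```cpp
// Sketch: embed the raw HF tokenizer files into gguf metadata.
// Assumes ggml's gguf API; tensor data is omitted in this sketch.
#include "ggml.h"

#include <fstream>
#include <sstream>
#include <string>

static std::string read_file(const char * path) {
    std::ifstream f(path);
    std::ostringstream ss;
    ss << f.rdbuf();
    return ss.str();
}

int main() {
    const std::string tok_json = read_file("tokenizer.json");
    const std::string tok_cfg  = read_file("tokenizer_config.json");

    struct gguf_context * ctx = gguf_init_empty();

    // One string value holding the entire pipeline definition
    // (normalizer, pre_tokenizer, model, post_processor, decoder).
    gguf_set_val_str(ctx, "tokenizer.huggingface.json", tok_json.c_str());
    // Hypothetical companion key for the special-token configuration.
    gguf_set_val_str(ctx, "tokenizer.huggingface.config_json", tok_cfg.c_str());

    gguf_write_to_file(ctx, "model.gguf", /*only_meta=*/true);
    gguf_free(ctx);
    return 0;
}
```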
Subsequently, the tokenizer itself can be implemented on top of that metadata. The most effortless approach is to use the pre-existing tokenizers-cpp, which wraps and binds both the HuggingFace tokenizers library and sentencepiece, and offers a minimal common C++ interface.
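On the inference side, a minimal sketch with tokenizers-cpp could look like the following; it assumes the tokenizer.json blob has already been read back out of the gguf metadata into a string, and the function name is only illustrative:

```cpp
// Sketch: drive the full HF pipeline via tokenizers-cpp
// (https://github.com/mlc-ai/tokenizers-cpp).
#include <tokenizers_cpp.h>

#include <memory>
#include <string>
#include <vector>

std::vector<int32_t> hf_tokenize(const std::string & tok_json, const std::string & text) {
    // FromBlobJSON takes the raw tokenizer.json contents, so the blob
    // pulled from gguf metadata can be used directly, no temp file needed.
    std::unique_ptr<tokenizers::Tokenizer> tok =
        tokenizers::Tokenizer::FromBlobJSON(tok_json);
    // Encode runs normalization, pre-tokenization, the model, and
    // post-processing as declared in the JSON.
    return tok->Encode(text);
}
```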
Alternatively, all tokenizer functionality could be implemented in pure C++ without any external libraries or dependencies. For a clear example of how these pipeline steps can be implemented in a single file, see the source of transformers.js's tokenizer implementation (JavaScript): https://github.com/xenova/transformers.js/blob/main/src/tokenizers.js
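To give a flavor of the dependency-free route, here is a tiny hand-rolled sketch of just the two normalizer steps from the JSON example below (Prepend "▁", then Replace " " with "▁"); a real implementation would need to cover every normalizer, pre-tokenizer, post-processor, and decoder type the pipeline can declare:

```cpp
#include <string>

// "Replace" normalizer with a String pattern: substitute every
// occurrence of `pattern` with `content`.
static std::string replace_all(std::string s, const std::string & pattern, const std::string & content) {
    for (size_t pos = 0; (pos = s.find(pattern, pos)) != std::string::npos; pos += content.size()) {
        s.replace(pos, pattern.size(), content);
    }
    return s;
}

// "Sequence" normalizer: Prepend("▁") then Replace(" " -> "▁").
std::string normalize(const std::string & text) {
    std::string out = "\xE2\x96\x81" + text;       // Prepend U+2581 "▁"
    return replace_all(out, " ", "\xE2\x96\x81");  // Replace spaces with "▁"
}
```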
Related JSON Example in tokenizer.json
```json
{
"normalizer": {
"type": "Sequence",
"normalizers": [
{
"type": "Prepend",
"prepend": "▁"
},
{
"type": "Replace",
"pattern": {
"String": " "
},
"content": "▁"
}
]
},
"pre_tokenizer": {
"type": "Sequence",
"pretokenizers": [
{"type": "WhitespaceSplit"},
{"type": "Metaspace","replacement": "▁", ...}
]
},
"model": {
"type": "BPE",
"dropout": null,
"unk_token": "<unk>",
"continuing_subword_prefix": null,
"end_of_word_suffix": null,
"fuse_unk": true,
"byte_fallback": true,
"vocab": { ... }
},
"post_processor": {
"type": "TemplateProcessing",
"single": [
{
"SpecialToken": {
"id": "<s>",
"type_id": 0
}
},
{
"Sequence": {
"id": "A",
"type_id": 0
}
}
],
"pair": [
{
"SpecialToken": {
"id": "<s>",
"type_id": 0
}
},
{
"Sequence": {
"id": "A",
"type_id": 0
}
},
{
"SpecialToken": {
"id": "<s>",
"type_id": 1
}
},
{
"Sequence": {
"id": "B",
"type_id": 1
}
}
],
"special_tokens": {
"<s>": {
"id": "<s>",
"ids": [
1
],
"tokens": [
"<s>"
]
}
}
},
"decoder": {
"type": "Sequence",
"decoders": [
{
"type": "Replace",
"pattern": {
"String": "▁"
},
"content": " "
},
{
"type": "ByteFallback"
},
{
"type": "Fuse"
},
{
"type": "Strip",
"content": " ",
"start": 1,
"stop": 0
}
]
  }
}
```
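For completeness, here is a simplified pure-C++ sketch of the decoder Sequence above (Replace "▁" with " ", ByteFallback for "<0xXX>" tokens, Fuse into one string, then Strip one leading space per start: 1, stop: 0); real byte-fallback output would additionally need to be validated as UTF-8:

```cpp
#include <string>
#include <vector>

std::string decode(const std::vector<std::string> & tokens) {
    std::string out;
    for (std::string t : tokens) {
        // ByteFallback: a token like "<0x41>" becomes the raw byte 0x41.
        if (t.size() == 6 && t.rfind("<0x", 0) == 0 && t.back() == '>') {
            out += (char) std::stoi(t.substr(3, 2), nullptr, 16);
            continue;
        }
        // Replace: "▁" (U+2581, 3 bytes in UTF-8) back to a regular space.
        for (size_t p = 0; (p = t.find("\xE2\x96\x81", p)) != std::string::npos; ) {
            t.replace(p, 3, " ");
            p += 1;
        }
        out += t; // Fuse: concatenate everything into a single string.
    }
    // Strip: drop one leading space, none from the end.
    if (!out.empty() && out.front() == ' ') {
        out.erase(0, 1);
    }
    return out;
}
```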