
ggml : unified file format #220

@philpax

Obsoletes #147, #150, ggml-org/llama.cpp#1575, ggml-org/llama.cpp#1590, rustformers/llm#143, and probably some other issues across some other repositories.

Please see the spec PR at #302; the following is left as-is so you can see the original proposal.


Current state of affairs

Overview

At present, there are two GGML file formats floating around for LLMs (and potentially other ggml-using projects; I haven't looked too closely at the implementation of whisper):

  • GGML unversioned
  • GGJTv3 (same as v1 and v2, but with different quantization formats), which is similar to GGML but includes a version and aligns the tensors to allow for memory-mapping

Both of these formats share the same fundamental structure (a minimal reader sketch follows the list below):

  • a magic number with an optional version number
  • model-specific hyperparameters that include an ftype that should describe the type of the majority of the tensors, and, for GGML files, the quantization version encoded using a modulo in the ftype
  • an embedded vocabulary, which is a list of length-prepended strings. The GGMF/GGJT formats embed an f32 score next to each string
  • finally, a list of tensors with their length-prepended name, type, and (aligned, in the case of GGJT) tensor data
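For illustration, here is a minimal sketch in Python of what reading the start of one of these legacy files might look like. The magic constants are the ones llama.cpp defines; the hyperparameter block shown is the LLaMA-style layout, and other architectures store different fields here, which is exactly the problem described below.

import struct

# Magic values as defined by llama.cpp; other executors reuse them.
MAGIC_GGML = 0x67676D6C  # unversioned GGML
MAGIC_GGJT = 0x67676A74  # GGJT (versioned, tensors aligned for mmap)

def read_legacy_header(path):
    # Sketch only: reads the magic, optional version, and a LLaMA-style
    # hyperparameter block (seven little-endian i32s).
    with open(path, "rb") as f:
        magic, = struct.unpack("<I", f.read(4))
        if magic == MAGIC_GGML:
            version = None  # the unversioned format has no version field
        elif magic == MAGIC_GGJT:
            version, = struct.unpack("<I", f.read(4))
        else:
            raise ValueError(f"unknown magic {magic:#x}")
        names = ("n_vocab", "n_embd", "n_mult", "n_head", "n_layer", "n_rot", "ftype")
        hparams = dict(zip(names, struct.unpack("<7i", f.read(28))))
        return magic, version, hparams

Note that a reader has no way of telling from the file itself whether this hyperparameter layout is the right one; it has to assume an architecture up front.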

We have more details on the format here: https://github.com/rustformers/llm/tree/main/crates/ggml#format

Drawbacks

Unfortunately, over the last few months, a few issues have become apparent with the existing models:

  • There's no way to identify which model architecture a given model is for, because that information isn't present
    • Similarly, existing programs cannot intelligently fail upon encountering new architectures
  • Adding or removing any hyperparameter is a breaking change, which is impossible for a reader to detect without herculean hacks
  • Each model architecture requires its own conversion script to its architecture's variant of GGML
  • Maintaining backwards compatibility without breaking the structure of the format requires clever tricks, like packing the quantization version into the ftype, which are not guaranteed to be picked up by readers/writers, and are not consistent between the two formats

GGJTv4/GGUF

Based on this, I'd like to propose a new format that's designed to be universal and addresses these issues. It is largely identical to GGJTv3, but makes one important change: the hyperparameters are encoded as an array of key-value pairs that can be read in any order, and these pairs are also used to encode additional information about the model. A really important property I'd like to keep is single-file deployment: if I give you a GGUF file and you have a compatible executor, it should Just Work™ without any additional conversion or extra files.

"Specification"

To quote from ggml-org/llama.cpp#1575 (comment):

Instead of storing the hyperparameters as

n_vocab: i32,
n_ctx: i32,
n_embd: i32,
n_head: i32,
n_layer: i32,
n_rot: i32,
use_parallel_residual: bool,
file_type: i32,

it's instead stored as an array of

key_length: u32,
key: [u8; key_length],
value_type: ValueType,
value: raw binary little-endian representation of value

so that you might have

[
  {
    key_length: 6,
    key: 'n_embd',
    value_type: ValueType::I32,
    value: 2560
  },
  {
    key_length: 21,
    key: 'use_parallel_residual',
    value_type: ValueType::Bool,
    value: true
  },
  ...
]

The brackets are for notational convenience - in practice, they're flatpacked and would come after each other in the binary. The ValueType enum would be standardized (like ggml_type), and so would the ways to represent each type of value.
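As a rough sketch of what reading one of these flatpacked pairs might look like (the ValueType numbering here is made up purely for illustration; the real enum would be standardized alongside ggml_type):

import struct

# Hypothetical ValueType numbering, for illustration only.
VT_U32, VT_I32, VT_F32, VT_STRING, VT_BYTES, VT_BOOL = range(6)

def read_kv_pair(f):
    # Read one key-value pair from the flatpacked metadata array.
    key_length, = struct.unpack("<I", f.read(4))
    key = f.read(key_length).decode("ascii")
    value_type, = struct.unpack("<I", f.read(4))  # assumed to be stored as a u32
    if value_type == VT_U32:
        value, = struct.unpack("<I", f.read(4))
    elif value_type == VT_I32:
        value, = struct.unpack("<i", f.read(4))
    elif value_type == VT_F32:
        value, = struct.unpack("<f", f.read(4))
    elif value_type in (VT_STRING, VT_BYTES):
        length, = struct.unpack("<I", f.read(4))
        raw = f.read(length)
        value = raw.decode("utf-8") if value_type == VT_STRING else raw
    elif value_type == VT_BOOL:
        b = f.read(1)[0]
        if b not in (0, 1):
            raise ValueError("invalid boolean value")  # strictness catches bad writers
        value = bool(b)
    else:
        raise ValueError(f"unknown value type {value_type}")
    return key, value

Because every value is either fixed-size or length-prepended, a reader can skip pairs whose keys it doesn't recognise without understanding them.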

This would allow for the addition of more parameters and make readers more resilient to models coming from other sources, because you'd be looking up values by key and reading them according to their declared type.

It wouldn't be freeform - the storage medium would be entirely structured, so that any reader could pick up data from it without having to know about the other fields. As time goes on, I imagine this would look like ID3v2, with commonly-used tags being standardized by the community for whatever metadata they want to attach.

The main thing I want to achieve is to a) allow the reading of a GGML file knowing nothing else about it, even if you can't do anything with it, and b) allow community model authors to add useful metadata in a way that won't cause breakage for future readers, while still remaining maximally compatible.

Filling in some of the missing details:

Keys

Keys are ASCII lower_snake_case with dots for separation. Their length is stored before the key. They have a maximum length of 256 (open for debate; just a number I picked that seems like a reasonable upper bound).

This means that:

  • vocabulary.hugging_face is a valid key
  • vocabulary-hugging-face is not
  • Vocabulary.HuggingFace is not
  • vocabulary.hugging-face is not

I'd say we're looking at something like TOML keys without quotation.
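A minimal validity check for such keys might look like this (the exact grammar is open for debate, and allowing digits in segments is an assumption on my part):

import re

# Dot-separated lower_snake_case segments, ASCII only, at most 256 bytes in total.
KEY_PATTERN = re.compile(r"[a-z0-9_]+(\.[a-z0-9_]+)*")

def is_valid_key(key: str) -> bool:
    return len(key) <= 256 and KEY_PATTERN.fullmatch(key) is not None

assert is_valid_key("vocabulary.hugging_face")
assert not is_valid_key("vocabulary-hugging-face")
assert not is_valid_key("Vocabulary.HuggingFace")
assert not is_valid_key("vocabulary.hugging-face")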

Values

Values are one of the following types (a small writer sketch follows the list):

  • U32: little-endian unsigned 32-bit integer
  • I32: little-endian signed 32-bit integer (honestly not sure if this is necessary; I feel like a lot of the existing i32 use is due more to the use of int than anything)
  • F32: IEEE754 32-bit floating point number
  • String: UTF-8 string data, length prepended
  • Bytes: Raw binary data with no specific meaning attached, length prepended
  • Boolean: 1-byte value where 0 is false and 1 is true. Anything else is invalid. I considered making anything other than 0 true, but being strict on this will help detect misbehaving writers.
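A corresponding writer sketch for these encodings, using the same hypothetical ValueType numbering as the reader sketch above:

import struct

VT_U32, VT_I32, VT_F32, VT_STRING, VT_BYTES, VT_BOOL = range(6)

def encode_value(value_type: int, value) -> bytes:
    # Encode a single value as little-endian bytes, per the rules above.
    if value_type == VT_U32:
        return struct.pack("<I", value)
    if value_type == VT_I32:
        return struct.pack("<i", value)
    if value_type == VT_F32:
        return struct.pack("<f", value)
    if value_type == VT_STRING:
        data = value.encode("utf-8")
        return struct.pack("<I", len(data)) + data
    if value_type == VT_BYTES:
        return struct.pack("<I", len(value)) + bytes(value)
    if value_type == VT_BOOL:
        return b"\x01" if value else b"\x00"  # strictly 0 or 1, nothing else
    raise ValueError(f"unknown value type {value_type}")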

Standardized key-value pairs

This list is incomplete. Feel free to suggest additions. Where possible, I've tried to use the original names from the models to remove a layer of semantic confusion.

This is just from a quick appraisal of the models that llm supports. There are likely other fields that we can standardise ahead of time by looking at the HuggingFace config. (A sketch of how a converter might populate these keys follows the vocabulary list below.)

General

  • general.architecture: String: describes what architecture this model implements. Values can include llama, mpt, gpt-neox, gpt-j, gpt-2, bloom, etc. (List more if you can think of them, and they're not just variants of existing architectures!)
  • general.quantization_version: u32: version of quantization scheme
  • general.file_type: String: type of the majority of the tensors in the file. This shouldn't have any semantic meaning and should be purely informational, hence the use of String.
  • general.license: String: SPDX license of the model
  • general.description: String: information about the model, including provenance
  • general.original_model_url: String: URL of the original model that this GGML file was created from

LLM

  • llm.context_length: u32: size of the maximum supported context
  • llm.hidden_size: u32: embedding layer size
  • llm.num_hidden_layers: u32: number of hidden layers
  • llm.num_rotary: u32: number of rotary dimensions, computed as int(hparams["rotary_pct"] * (hparams["hidden_size"] // hparams["num_attention_heads"]))
  • llm.use_parallel_residual: bool: whether or not the parallel residual logic should be used
  • llm.max_seq_len: u32: Maximum sequence length
  • llm.attention.num_heads: u32: number of attention heads
  • llm.attention.alibi_bias_max: f32: The maximum bias to use for ALiBI
  • llm.attention.clip_kqv: f32: not sure

Vocabulary

  • vocabulary.embedded_size: u32: size of the embedded vocabulary. Zero if there is no embedded vocabulary.
  • vocabulary.huggingface_tokenizer_json: String: the entirety of the HF tokenizer.json for a given model (e.g. https://huggingface.co/mosaicml/mpt-7b-instruct/blob/main/tokenizer.json). Optional, but highly recommended for best tokenization quality with supported executors.
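As a concrete but hypothetical example, a converter for a GPT-NeoX-style model might populate the standardized keys roughly as follows. The names read from config.json are real HuggingFace GPT-NeoX fields; the key mapping itself and the quantization version shown are illustrative, not part of any finalized spec.

import json

def standard_metadata_from_config(config_path: str) -> dict:
    # Sketch: map a HuggingFace config.json to the proposed standardized keys.
    with open(config_path) as f:
        hparams = json.load(f)
    return {
        "general.architecture": "gpt-neox",
        "general.quantization_version": 2,  # illustrative value only
        "llm.context_length": hparams["max_position_embeddings"],
        "llm.hidden_size": hparams["hidden_size"],
        "llm.num_hidden_layers": hparams["num_hidden_layers"],
        "llm.attention.num_heads": hparams["num_attention_heads"],
        "llm.num_rotary": int(hparams["rotary_pct"]
                              * (hparams["hidden_size"] // hparams["num_attention_heads"])),
        "llm.use_parallel_residual": hparams.get("use_parallel_residual", True),
    }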

Future

This is not something we should aim for in the MVP, but ggml now has support for exporting the computation graph. A sample computation graph could be embedded to allow an executor to run the model without having direct support for the architecture.

Migration

The existing migrations have been pretty messy for the ecosystem and for the community. We should try to avoid causing significant upset by providing a migration path.

My suggestion is to switch over all model implementations, including llama.cpp, to GGUF, but offer a very straightforward conversion utility that does not require Python and can convert GGML and GGJTv3 to GGUF with all required information.

If interested, we could also include support for GGJT v1 and v2 using ggml-org/llama.cpp#1504 (although the requantisation process is inherently lossy).

Hopefully, this is the last time we have to bite this bullet. Even if we make breaking changes (like quantization version) again, software consuming GGUF can intelligently decide what to do based on the available information in the hyperparameters.

New model architectures can use GGUF without any additional work, so no breaking changes should be necessary there, either.

Conversion of Python models to GGUF

Ideally, all of the existing convert-h5-to-ggml.py and convert.py scripts can be entirely deprecated. Instead, there would be one script that takes an arbitrary HuggingFace model and converts it to a compatible GGUF file. This vastly reduces the maintenance burden and makes it simpler to action changes across the ecosystem when necessary.
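A rough skeleton of what that single script might look like; the architecture table and write_gguf are hypothetical placeholders, and a real implementation would also need per-architecture tensor-name mappings and the vocabulary handling described above.

from transformers import AutoConfig, AutoModelForCausalLM

# Hypothetical mapping from HuggingFace architecture class names to
# general.architecture values; a real table would cover many more.
ARCHITECTURE_HANDLERS = {
    "GPTNeoXForCausalLM": "gpt-neox",
    "LlamaForCausalLM": "llama",
}

def convert(model_id: str, output_path: str) -> None:
    # Sketch of a single HuggingFace-to-GGUF conversion entry point.
    config = AutoConfig.from_pretrained(model_id)
    arch = (config.architectures or ["unknown"])[0]
    if arch not in ARCHITECTURE_HANDLERS:
        raise SystemExit(f"unsupported architecture: {arch}")  # fail intelligibly
    model = AutoModelForCausalLM.from_pretrained(model_id)
    # write_gguf is hypothetical: it would serialize the key-value metadata,
    # the optional embedded vocabulary, and the converted tensors.
    # write_gguf(output_path, ARCHITECTURE_HANDLERS[arch], config, model.state_dict())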


cc @ggerganov @LostRuins @KerfuffleV2 @LLukas22 @TheBloke @iacore @comex and others who work with GGML models
