
ggml : unified file format #220

@philpax

Obsoletes #147, #150, ggml-org/llama.cpp#1575, ggml-org/llama.cpp#1590, rustformers/llm#143, and probably some other issues across some other repositories.

Please see the spec PR at #302; the following is left as-is so you can see the original proposal.


Current state of affairs

Overview

At present, there are two GGML file formats floating around for LLMs (and potentially other ggml-using projects; I haven't looked too closely at the implementation of whisper):

  • GGML unversioned
  • GGJTv3 (same as v1 and v2, but with different quantization formats), which is similar to GGML but includes a version and aligns the tensors to allow for memory-mapping

Both of these formats share the same fundamental structure (a minimal reader sketch follows the list below):

  • a magic number with an optional version number
  • model-specific hyperparameters that include an ftype that should describe the type of the majority of the tensors, and, for GGML files, the quantization version encoded using a modulo in the ftype
  • an embedded vocabulary, which is a list of length-prepended strings. The GGMF/GGJT formats embed an f32 score next to each string
  • finally, a list of tensors with their length-prepended name, type, and (aligned, in the case of GGJT) tensor data
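For illustration, here is a minimal sketch in Python of what reading the start of one of these legacy files might look like. The magic constants are the ones llama.cpp defines; the hyperparameter block shown is the LLaMA-style layout, and other architectures store different fields here, which is exactly the problem described below.

import struct

# Magic values as defined by llama.cpp; other executors reuse them.
MAGIC_GGML = 0x67676D6C  # unversioned GGML
MAGIC_GGJT = 0x67676A74  # GGJT (versioned, tensors aligned for mmap)

def read_legacy_header(path):
    # Sketch only: reads the magic, optional version, and a LLaMA-style
    # hyperparameter block (seven little-endian i32s).
    with open(path, "rb") as f:
        magic, = struct.unpack("<I", f.read(4))
        if magic == MAGIC_GGML:
            version = None  # the unversioned format has no version field
        elif magic == MAGIC_GGJT:
            version, = struct.unpack("<I", f.read(4))
        else:
            raise ValueError(f"unknown magic {magic:#x}")
        names = ("n_vocab", "n_embd", "n_mult", "n_head", "n_layer", "n_rot", "ftype")
        hparams = dict(zip(names, struct.unpack("<7i", f.read(28))))
        return magic, version, hparams

Note that a reader has no way of telling from the file itself whether this hyperparameter layout is the right one; it has to assume an architecture up front.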

We have more details on the format here: https://github.com/rustformers/llm/tree/main/crates/ggml#format

Drawbacks

Unfortunately, over the last few months, a few issues have become apparent with the existing models:

  • There's no way to identify which model architecture a given model is for, because that information isn't present
    • Similarly, existing programs cannot intelligently fail upon encountering new architectures
  • Adding or removing any hyperparameter is a breaking change, which is impossible for a reader to detect without herculean hacks
  • Each model architecture requires its own conversion script to its architecture's variant of GGML
  • Maintaining backwards compatibility without breaking the structure of the format requires clever tricks, like packing the quantization version into the ftype, which are not guaranteed to be picked up by readers/writers, and are not consistent between the two formats

GGJTv4/GGUF

Based on this, I'd like to propose a new format that's designed to be universal and addresses these issues. It is largely identical to GGJTv3, but makes one important change: the hyperparameters are encoded as an array of key-value pairs that can be read in any order, and these pairs are also used to encode additional information about the model. A really important property I'd like to keep is single-file deployment: if I give you a GGUF file and you have a compatible executor, it should Just Work™ without any additional conversion or extra files.

"Specification"

To quote from ggml-org/llama.cpp#1575 (comment):

Instead of storing the hyperparameters as

n_vocab: i32,
n_ctx: i32,
n_embd: i32,
n_head: i32,
n_layer: i32,
n_rot: i32,
use_parallel_residual: bool,
file_type: i32,

it's instead stored as an array of

key_length: u32,
key: [u8; key_length],
value_type: ValueType,
value: raw binary little-endian representation of value

so that you might have

[
  {
    key_length: 6,
    key: 'n_embd',
    value_type: ValueType::I32,
    value: 2560
  },
  {
    key_length: 21,
    key: 'use_parallel_residual',
    value_type: ValueType::Bool,
    value: true
  },
  ...
]

The brackets are for notational convenience - in practice, they're flatpacked and would come after each other in the binary. The ValueType enum would be standardized (like ggml_type), and so would the ways to represent each type of value.
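As a rough sketch of what reading one of these flatpacked pairs might look like (the ValueType numbering here is made up purely for illustration; the real enum would be standardized alongside ggml_type):

import struct

# Hypothetical ValueType numbering, for illustration only.
VT_U32, VT_I32, VT_F32, VT_STRING, VT_BYTES, VT_BOOL = range(6)

def read_kv_pair(f):
    # Read one key-value pair from the flatpacked metadata array.
    key_length, = struct.unpack("<I", f.read(4))
    key = f.read(key_length).decode("ascii")
    value_type, = struct.unpack("<I", f.read(4))  # assumed to be stored as a u32
    if value_type == VT_U32:
        value, = struct.unpack("<I", f.read(4))
    elif value_type == VT_I32:
        value, = struct.unpack("<i", f.read(4))
    elif value_type == VT_F32:
        value, = struct.unpack("<f", f.read(4))
    elif value_type in (VT_STRING, VT_BYTES):
        length, = struct.unpack("<I", f.read(4))
        raw = f.read(length)
        value = raw.decode("utf-8") if value_type == VT_STRING else raw
    elif value_type == VT_BOOL:
        b = f.read(1)[0]
        if b not in (0, 1):
            raise ValueError("invalid boolean value")  # strictness catches bad writers
        value = bool(b)
    else:
        raise ValueError(f"unknown value type {value_type}")
    return key, value

Because every value is either fixed-size or length-prepended, a reader can skip pairs whose keys it doesn't recognise without understanding them.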

This would allow for the addition of more parameters and make readers more resilient to models coming from other sources, because you'd be looking up values by key and reading them according to their declared type.

It wouldn't be freeform - the storage medium would be entirely structured, so that any reader could pick up data from it without having to know about the other fields. As time goes on, I imagine this would look like ID3v2, with commonly-used tags being standardized by the community for whatever metadata they want to attach.

The main thing I want to achieve is to a) allow the reading of a GGML file knowing nothing else about it, even if you can't do anything with it, and b) allow community model authors to add useful metadata in a way that won't cause breakage for future readers, while still remaining maximally compatible.

Filling in some of the missing details:

Keys

Keys are ASCII lower_snake_case with dots for separation. Their length is stored before the key. They have a maximum length of 256 (open for debate; just a number I picked that seems like a reasonable upper bound).

This means that:

  • vocabulary.hugging_face is a valid key
  • vocabulary-hugging-face is not
  • Vocabulary.HuggingFace is not
  • vocabulary.hugging-face is not

I'd say we're looking at something like TOML keys without quotation.
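A minimal validity check for such keys might look like this (the exact grammar is open for debate, and allowing digits in segments is an assumption on my part):

import re

# Dot-separated lower_snake_case segments, ASCII only, at most 256 bytes in total.
KEY_PATTERN = re.compile(r"[a-z0-9_]+(\.[a-z0-9_]+)*")

def is_valid_key(key: str) -> bool:
    return len(key) <= 256 and KEY_PATTERN.fullmatch(key) is not None

assert is_valid_key("vocabulary.hugging_face")
assert not is_valid_key("vocabulary-hugging-face")
assert not is_valid_key("Vocabulary.HuggingFace")
assert not is_valid_key("vocabulary.hugging-face")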

Values

Values are one of the following types (a small writer sketch follows the list):

  • U32: little-endian unsigned 32-bit integer
  • I32: little-endian signed 32-bit integer (honestly not sure if this is necessary; I feel like a lot of the existing i32 use is due more to the use of int than anything)
  • F32: IEEE754 32-bit floating point number
  • String: UTF-8 string data, length prepended
  • Bytes: Raw binary data with no specific meaning attached, length prepended
  • Boolean: 1-byte value where 0 is false and 1 is true. Anything else is invalid. I considered making anything other than 0 true, but being strict on this will help detect misbehaving writers.
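A corresponding writer sketch for these encodings, using the same hypothetical ValueType numbering as the reader sketch above:

import struct

VT_U32, VT_I32, VT_F32, VT_STRING, VT_BYTES, VT_BOOL = range(6)

def encode_value(value_type: int, value) -> bytes:
    # Encode a single value as little-endian bytes, per the rules above.
    if value_type == VT_U32:
        return struct.pack("<I", value)
    if value_type == VT_I32:
        return struct.pack("<i", value)
    if value_type == VT_F32:
        return struct.pack("<f", value)
    if value_type == VT_STRING:
        data = value.encode("utf-8")
        return struct.pack("<I", len(data)) + data
    if value_type == VT_BYTES:
        return struct.pack("<I", len(value)) + bytes(value)
    if value_type == VT_BOOL:
        return b"\x01" if value else b"\x00"  # strictly 0 or 1, nothing else
    raise ValueError(f"unknown value type {value_type}")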

Standardized key-value pairs

This list is incomplete. Feel free to suggest additions. Where possible, I've tried to use the original names from the models to remove a layer of semantic confusion.

This is just from a quick appraisal of the models that llm supports. There are likely other fields that we can standardise ahead of time by looking at the HuggingFace config. (A sketch of how a converter might populate these keys follows the vocabulary list below.)

General

  • general.architecture: String: describes what architecture this model implements. Values can include llama, mpt, gpt-neox, gpt-j, gpt-2, bloom, etc. (List more if you can think of them, and they're not just variants of existing architectures!)
  • general.quantization_version: u32: version of quantization scheme
  • general.file_type: String: type of the majority of the tensors in the file. This shouldn't have any semantic meaning and should be purely informational, hence the use of String.
  • general.license: String: SPDX license of the model
  • general.description: String: information about the model, including provenance
  • general.original_model_url: String: URL of the original model that this GGML file was created from

LLM

  • llm.context_length: u32: size of the maximum supported context
  • llm.hidden_size: u32: embedding layer size
  • llm.num_hidden_layers: u32: number of hidden layers
  • llm.num_rotary: u32: number of rotary dimensions, computed as int(hparams["rotary_pct"] * (hparams["hidden_size"] // hparams["num_attention_heads"]))
  • llm.use_parallel_residual: bool: whether or not the parallel residual logic should be used
  • llm.max_seq_len: u32: Maximum sequence length
  • llm.attention.num_heads: u32: number of attention heads
  • llm.attention.alibi_bias_max: f32: The maximum bias to use for ALiBI
  • llm.attention.clip_kqv: f32: not sure

Vocabulary

  • vocabulary.embedded_size: u32: size of the embedded vocabulary. Zero if there is no embedded vocabulary.
  • vocabulary.huggingface_tokenizer_json: String: the entirety of the HF tokenizer.json for a given model (e.g. https://huggingface.co/mosaicml/mpt-7b-instruct/blob/main/tokenizer.json). Optional, but highly recommended for best tokenization quality with supported executors.
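As a concrete but hypothetical example, a converter for a GPT-NeoX-style model might populate the standardized keys roughly as follows. The names read from config.json are real HuggingFace GPT-NeoX fields; the key mapping itself and the quantization version shown are illustrative, not part of any finalized spec.

import json

def standard_metadata_from_config(config_path: str) -> dict:
    # Sketch: map a HuggingFace config.json to the proposed standardized keys.
    with open(config_path) as f:
        hparams = json.load(f)
    return {
        "general.architecture": "gpt-neox",
        "general.quantization_version": 2,  # illustrative value only
        "llm.context_length": hparams["max_position_embeddings"],
        "llm.hidden_size": hparams["hidden_size"],
        "llm.num_hidden_layers": hparams["num_hidden_layers"],
        "llm.attention.num_heads": hparams["num_attention_heads"],
        "llm.num_rotary": int(hparams["rotary_pct"]
                              * (hparams["hidden_size"] // hparams["num_attention_heads"])),
        "llm.use_parallel_residual": hparams.get("use_parallel_residual", True),
    }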

Future

This is not something we should aim for in the MVP, but ggml now has support for exporting the computation graph. A sample computation graph could be embedded to allow an executor to run the model without having direct support for the architecture.

Migration

The existing migrations have been pretty messy for the ecosystem and for the community. We should try to avoid causing significant upset by providing a migration path.

My suggestion is to switch over all model implementations, including llama.cpp, to GGUF, but offer a very straightforward conversion utility that does not require Python and can convert GGML and GGJTv3 to GGUF with all required information.

If interested, we could also include support for GGJT v1 and v2 using ggml-org/llama.cpp#1504 (although the requantisation process is inherently lossy).

Hopefully, this is the last time we have to bite this bullet. Even if we make breaking changes (like quantization version) again, software consuming GGUF can intelligently decide what to do based on the available information in the hyperparameters.

New model architectures can use GGUF without any additional work, so no breaking changes should be necessary there, either.

Conversion of Python models to GGUF

Ideally, all of the existing convert-h5-to-ggml.py and convert.py scripts can be entirely deprecated. Instead, there would be one script that takes an arbitrary HuggingFace model and converts it to a compatible GGUF file. This vastly reduces the maintenance burden and makes it simpler to action changes across the ecosystem when necessary.
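A rough skeleton of what that single script might look like; the architecture table and write_gguf are hypothetical placeholders, and a real implementation would also need per-architecture tensor-name mappings and the vocabulary handling described above.

from transformers import AutoConfig, AutoModelForCausalLM

# Hypothetical mapping from HuggingFace architecture class names to
# general.architecture values; a real table would cover many more.
ARCHITECTURE_HANDLERS = {
    "GPTNeoXForCausalLM": "gpt-neox",
    "LlamaForCausalLM": "llama",
}

def convert(model_id: str, output_path: str) -> None:
    # Sketch of a single HuggingFace-to-GGUF conversion entry point.
    config = AutoConfig.from_pretrained(model_id)
    arch = (config.architectures or ["unknown"])[0]
    if arch not in ARCHITECTURE_HANDLERS:
        raise SystemExit(f"unsupported architecture: {arch}")  # fail intelligibly
    model = AutoModelForCausalLM.from_pretrained(model_id)
    # write_gguf is hypothetical: it would serialize the key-value metadata,
    # the optional embedded vocabulary, and the converted tensors.
    # write_gguf(output_path, ARCHITECTURE_HANDLERS[arch], config, model.state_dict())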


cc @ggerganov @LostRuins @KerfuffleV2 @LLukas22 @TheBloke @iacore @comex and others who work with GGML models
