Commit 33eff40

ngxson and ggerganov authored
server : vision support via libmtmd (#12898)
* server : (experimental) vision support via libmtmd
* mtmd : add more api around mtmd_image_tokens
* mtmd : add more api around mtmd_image_tokens
* mtmd : ability to calc image hash
* shared_ptr for mtmd_image_tokens
* move hash to user-define ID (fixed)
* abstract out the batch management
* small fix
* refactor logic adding tokens to batch
* implement hashing image
* use FNV hash, now hash bitmap instead of file data
* allow decoding image embedding to be split into batches
* rm whitespace
* disable some features when mtmd is on
* fix --no-mmproj-offload
* mtmd_context_params no timings
* refactor server_inp to server_tokens
* fix the failing test case
* init
* wip
* working version
* add mtmd::bitmaps
* add test target
* rm redundant define
* test: mtmd_input_chunks_free
* rm outdated comment
* fix merging issue
* explicitly create mtmd::input_chunks
* mtmd_input_chunk_copy
* add clone()
* improve server_input struct
* clip : fix confused naming ffn_up and ffn_down
* rm ffn_i/o/g naming
* rename n_embd, n_ff
* small fix
* no check n_ff
* fix detokenize
* add const to various places
* add warning about breaking changes
* add c api
* helper: use mtmd_image_tokens_get_n_pos
* fix ctx_shift
* fix name shadowing
* more strict condition
* support remote image_url
* remote image_url log
* add CI test
* do not log base64
* add "has_multimodal" to /props
* remove dangling image
* speculative: use slot.cache_tokens.insert
* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <[email protected]>

* rm can_be_detokenized
* on prompt processing done, assert cache_tokens.size
* handle_completions_impl returns void
* adapt the new web ui
* update docs and hot topics
* rm assert
* small fix (2)

---------

Co-authored-by: Georgi Gerganov <[email protected]>
1 parent 17512a9 commit 33eff40
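
For orientation, the change can be exercised end to end with something like the sketch below; the build directory is an assumption, while the `-hf` usage and the Gemma 3 model come from the documentation added in this commit:

```sh
# Build llama-server (now linked against the mtmd library, see the CMake change below)
cmake -B build
cmake --build build --target llama-server -j

# Start the server with a vision-capable model; the multimodal projector is
# fetched along with the model unless --no-mmproj is passed
./build/bin/llama-server -hf ggml-org/gemma-3-4b-it-GGUF
```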

File tree

10 files changed (+774 / -101 lines changed)

README.md

Lines changed: 2 additions & 1 deletion
@@ -16,8 +16,9 @@ Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others)
 
 ## Hot topics
 
+- 🔥 Multimodal support arrived in `llama-server`: [#12898](https://github.com/ggml-org/llama.cpp/pull/12898) | [documentation](./docs/multimodal.md)
 - **GGML developer experience survey (organized and reviewed by NVIDIA):** [link](https://forms.gle/Gasw3cRgyhNEnrwK9)
-- A new binary `llama-mtmd-cli` is introduced to replace `llava-cli`, `minicpmv-cli`, `gemma3-cli` ([#13012](https://github.com/ggml-org/llama.cpp/pull/13012)) and `qwen2vl-cli` ([#13141]((https://github.com/ggml-org/llama.cpp/pull/13141))), `libllava` will be deprecated
+- A new binary `llama-mtmd-cli` is introduced to replace `llava-cli`, `minicpmv-cli`, `gemma3-cli` ([#13012](https://github.com/ggml-org/llama.cpp/pull/13012)) and `qwen2vl-cli` ([#13141](https://github.com/ggml-org/llama.cpp/pull/13141)), `libllava` will be deprecated
 - VS Code extension for FIM completions: https://github.com/ggml-org/llama.vscode
 - Universal [tool call support](./docs/function-calling.md) in `llama-server` https://github.com/ggml-org/llama.cpp/pull/9639
 - Vim/Neovim plugin for FIM completions: https://github.com/ggml-org/llama.vim

common/arg.cpp

Lines changed: 1 addition & 1 deletion
@@ -40,7 +40,7 @@ using json = nlohmann::ordered_json;
 
 std::initializer_list<enum llama_example> mmproj_examples = {
     LLAMA_EXAMPLE_LLAVA,
-    // TODO: add LLAMA_EXAMPLE_SERVER when it's ready
+    LLAMA_EXAMPLE_SERVER,
 };
 
 static std::string read_file(const std::string & fname) {

docs/multimodal.md

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
+# Multimodal
+
+llama.cpp supports multimodal input via `libmtmd`. Currently, there are 2 tools that support this feature:
+- [llama-mtmd-cli](../tools/mtmd/README.md)
+- [llama-server](../tools/server/README.md) via the OpenAI-compatible `/chat/completions` API
+
+To enable it, use one of the 2 methods below:
+
+- Use the `-hf` option with a [supported model](../../docs/multimodal.md)
+    - To load a model using `-hf` while disabling multimodal, use `--no-mmproj`
+    - To load a model using `-hf` while using a custom mmproj file, use `--mmproj local_file.gguf`
+- Use the `-m model.gguf` option with `--mmproj file.gguf` to specify the text model and the multimodal projector, respectively
+
+By default, the multimodal projector will be offloaded to the GPU. To disable this, add `--no-mmproj-offload`
+
+For example:
+
+```sh
+# simple usage with CLI
+llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF
+
+# simple usage with server
+llama-server -hf ggml-org/gemma-3-4b-it-GGUF
+
+# using local file
+llama-server -m gemma-3-4b-it-Q4_K_M.gguf --mmproj mmproj-gemma-3-4b-it-Q4_K_M.gguf
+
+# no GPU offload
+llama-server -hf ggml-org/gemma-3-4b-it-GGUF --no-mmproj-offload
+```
+
+## Pre-quantized models
+
+These are ready-to-use models; most of them come with `Q4_K_M` quantization by default.
+
+Replace `(tool_name)` with the name of the binary you want to use, for example `llama-mtmd-cli` or `llama-server`.
+
+NOTE: some models may require a large context window, for example: `-c 8192`
+
+```sh
+# Gemma 3
+(tool_name) -hf ggml-org/gemma-3-4b-it-GGUF
+(tool_name) -hf ggml-org/gemma-3-12b-it-GGUF
+(tool_name) -hf ggml-org/gemma-3-27b-it-GGUF
+
+# SmolVLM
+(tool_name) -hf ggml-org/SmolVLM-Instruct-GGUF
+(tool_name) -hf ggml-org/SmolVLM-256M-Instruct-GGUF
+(tool_name) -hf ggml-org/SmolVLM-500M-Instruct-GGUF
+(tool_name) -hf ggml-org/SmolVLM2-2.2B-Instruct-GGUF
+(tool_name) -hf ggml-org/SmolVLM2-256M-Video-Instruct-GGUF
+(tool_name) -hf ggml-org/SmolVLM2-500M-Video-Instruct-GGUF
+
+# Pixtral 12B
+(tool_name) -hf ggml-org/pixtral-12b-GGUF
+
+# Qwen 2 VL
+(tool_name) -hf ggml-org/Qwen2-VL-2B-Instruct-GGUF
+(tool_name) -hf ggml-org/Qwen2-VL-7B-Instruct-GGUF
+
+# Qwen 2.5 VL
+(tool_name) -hf ggml-org/Qwen2.5-VL-3B-Instruct-GGUF
+(tool_name) -hf ggml-org/Qwen2.5-VL-7B-Instruct-GGUF
+(tool_name) -hf ggml-org/Qwen2.5-VL-32B-Instruct-GGUF
+(tool_name) -hf ggml-org/Qwen2.5-VL-72B-Instruct-GGUF
+
+# Mistral Small 3.1 24B (IQ2_M quantization)
+(tool_name) -hf ggml-org/Mistral-Small-3.1-24B-Instruct-2503-GGUF
+```
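
The two `-hf` variations described in the bullet list of the new file above (disabling multimodal, or pointing at a custom projector file) are not shown in the file's own example block; under the same assumptions as those examples (model and projector file names reused from them), they would look roughly like this:

```sh
# load a -hf model but keep it text-only (no multimodal projector)
llama-server -hf ggml-org/gemma-3-4b-it-GGUF --no-mmproj

# load a -hf model but use a locally downloaded projector file instead
llama-server -hf ggml-org/gemma-3-4b-it-GGUF --mmproj mmproj-gemma-3-4b-it-Q4_K_M.gguf
```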

tools/mtmd/README.md

Lines changed: 1 addition & 32 deletions
@@ -16,38 +16,7 @@ The naming and structure related to multimodal support have evolved, which might
 
 ## Pre-quantized models
 
-These are ready-to-use models, most of them come with `Q4_K_M` quantization by default:
-
-```sh
-# Gemma 3
-llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF
-llama-mtmd-cli -hf ggml-org/gemma-3-12b-it-GGUF
-llama-mtmd-cli -hf ggml-org/gemma-3-27b-it-GGUF
-
-# SmolVLM
-llama-mtmd-cli -hf ggml-org/SmolVLM-Instruct-GGUF
-llama-mtmd-cli -hf ggml-org/SmolVLM-256M-Instruct-GGUF
-llama-mtmd-cli -hf ggml-org/SmolVLM-500M-Instruct-GGUF
-llama-mtmd-cli -hf ggml-org/SmolVLM2-2.2B-Instruct-GGUF
-llama-mtmd-cli -hf ggml-org/SmolVLM2-256M-Video-Instruct-GGUF
-llama-mtmd-cli -hf ggml-org/SmolVLM2-500M-Video-Instruct-GGUF
-
-# Pixtral 12B
-llama-mtmd-cli -hf ggml-org/pixtral-12b-GGUF
-
-# Qwen 2 VL
-llama-mtmd-cli -hf ggml-org/Qwen2-VL-2B-Instruct-GGUF
-llama-mtmd-cli -hf ggml-org/Qwen2-VL-7B-Instruct-GGUF
-
-# Qwen 2.5 VL
-llama-mtmd-cli -hf ggml-org/Qwen2.5-VL-3B-Instruct-GGUF
-llama-mtmd-cli -hf ggml-org/Qwen2.5-VL-7B-Instruct-GGUF
-llama-mtmd-cli -hf ggml-org/Qwen2.5-VL-32B-Instruct-GGUF
-llama-mtmd-cli -hf ggml-org/Qwen2.5-VL-72B-Instruct-GGUF
-
-# Mistral Small 3.1 24B (IQ2_M quantization)
-llama-mtmd-cli -hf ggml-org/Mistral-Small-3.1-24B-Instruct-2503-GGUF
-```
+See the list of pre-quantized models [here](../../docs/multimodal.md)
 
 ## How it works and what is `mmproj`?
 
tools/server/CMakeLists.txt

Lines changed: 2 additions & 1 deletion
@@ -34,8 +34,9 @@ endforeach()
 add_executable(${TARGET} ${TARGET_SRCS})
 install(TARGETS ${TARGET} RUNTIME)
 
+target_include_directories(${TARGET} PRIVATE ../llava)
 target_include_directories(${TARGET} PRIVATE ${CMAKE_SOURCE_DIR})
-target_link_libraries(${TARGET} PRIVATE common ${CMAKE_THREAD_LIBS_INIT})
+target_link_libraries(${TARGET} PRIVATE common mtmd ${CMAKE_THREAD_LIBS_INIT})
 
 if (LLAMA_SERVER_SSL)
     find_package(OpenSSL REQUIRED)

tools/server/README.md

Lines changed: 12 additions & 0 deletions
@@ -193,6 +193,12 @@ services:
       LLAMA_ARG_PORT: 8080
 ```
 
+### Multimodal support
+
+Multimodal support was added in [#12898](https://github.com/ggml-org/llama.cpp/pull/12898) and is currently an experimental feature.
+
+For more details, please refer to the [multimodal documentation](../../docs/multimodal.md)
+
 ## Build
 
 `llama-server` is built alongside everything else from the root of the project
@@ -749,6 +755,9 @@ This endpoint is public (no API key check). By default, it is read-only. To make
   "total_slots": 1,
   "model_path": "../models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
   "chat_template": "...",
+  "modalities": {
+    "vision": false
+  },
   "build_info": "b(build number)-(build commit hash)"
 }
 ```
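
To illustrate the new field shown in the response above, a client can probe `/props` before deciding whether to attach images; a minimal sketch with `curl` and `jq`, assuming the server listens on the default port 8080:

```sh
# Inspect the modalities reported by the running server
curl -s http://localhost:8080/props | jq '.modalities'

# Gate image input on the reported capability
if [ "$(curl -s http://localhost:8080/props | jq -r '.modalities.vision')" = "true" ]; then
  echo "vision is supported - image_url content parts can be sent to the chat endpoint"
else
  echo "text-only model - reload the server with an mmproj file to enable vision"
fi
```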
@@ -757,6 +766,7 @@ This endpoint is public (no API key check). By default, it is read-only. To make
 - `total_slots` - the total number of slots for process requests (defined by `--parallel` option)
 - `model_path` - the path to model file (same with `-m` argument)
 - `chat_template` - the model's original Jinja2 prompt template
+- `modalities` - the list of supported modalities
 
 ### POST `/props`: Change server global properties.
 
@@ -1069,6 +1079,8 @@ print(completion.choices[0].text)
 
 Given a ChatML-formatted json description in `messages`, it returns the predicted completion. Both synchronous and streaming mode are supported, so scripted and interactive applications work fine. While no strong claims of compatibility with OpenAI API spec is being made, in our experience it suffices to support many apps. Only models with a [supported chat template](https://github.com/ggml-org/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template) can be used optimally with this endpoint. By default, the ChatML template will be used.
 
+If the model supports multimodal input, you can pass media files via the `image_url` content part. Both base64-encoded data and remote URLs are accepted as input. See the OpenAI documentation for more details.
+
 *Options:*
 
 See [OpenAI Chat Completions API documentation](https://platform.openai.com/docs/api-reference/chat). llama.cpp `/completion`-specific features such as `mirostat` are also supported.
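
As a worked example of the paragraph added above, a request mixing a text part and an image part could look like the sketch below (host, prompt, and image URL are placeholders; the payload shape follows the OpenAI content-part convention that the endpoint mirrors):

```sh
# One text part plus one remote image, sent to the OpenAI-compatible endpoint.
# For a local file, replace the URL with a data URI: "data:image/png;base64,<base64 bytes>"
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "What is shown in this image?" },
          { "type": "image_url", "image_url": { "url": "https://example.com/example.png" } }
        ]
      }
    ]
  }'
```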

0 commit comments

Comments
 (0)