context : allow cache-less context for embeddings #13108

ggerganov · 2025-04-25T11:15:32Z

Embedding models such as BERT do not need a KV cache, so a context for them can now be created without one. This saves memory compared to master.

llama_encode() is now the recommended way to compute embeddings and to rerank.
llama_decode() can still be used to compute embeddings as before.
For embedding models such as BERT, llama_decode() falls back to llama_encode() and prints a warning.

In short: use llama_encode() whenever the KV cache is not needed, and llama_decode() otherwise. The changes are backwards compatible.
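The rule above can be illustrated with a small toy model. This is only a sketch of the dispatch behaviour described in this PR, not the llama.cpp implementation: ToyContext, Path, toy_encode(), toy_decode() and toy_process() are hypothetical stand-ins invented for illustration, and the warning text is made up.

```cpp
#include <cstdio>

// Hypothetical stand-ins for illustration only; none of these exist in llama.cpp.
enum class Path { Encode, Decode };

struct ToyContext {
    bool has_kv_cache; // false models a cache-less embedding context (e.g. BERT)
    int  warnings;     // number of fallback warnings printed so far

    explicit ToyContext(bool kv) : has_kv_cache(kv), warnings(0) {}
};

// Models the encode entry point: it never needs a KV cache.
inline Path toy_encode(ToyContext &) {
    return Path::Encode;
}

// Models the decode entry point: it uses the KV cache when one exists,
// otherwise it prints a warning and falls back to the encode path.
inline Path toy_decode(ToyContext & ctx) {
    if (!ctx.has_kv_cache) {
        ++ctx.warnings;
        std::fprintf(stderr, "warning: context has no KV cache, falling back to encode\n");
        return toy_encode(ctx);
    }
    return Path::Decode;
}

// Caller-side rule from the description: encode when the KV cache is not
// needed, decode otherwise.
inline Path toy_process(ToyContext & ctx, bool needs_kv_cache) {
    return needs_kv_cache ? toy_decode(ctx) : toy_encode(ctx);
}
```

In this model, calling toy_decode() on a cache-less context still succeeds, which mirrors the backwards-compatibility claim, while calling toy_encode() directly avoids the warning.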

ggerganov · 2025-05-02T14:51:50Z

I'll work on rebasing and merging this next - it should be a good improvement for embedding models by reducing the allocated memory during inference.


aviallon · 2025-05-04T07:46:37Z

Thanks for your awesome work Georgi.
Is there a donation / sponsoring page btw?

ggerganov mentioned this pull request Apr 25, 2025

kv-cache : separate recurrent vs non-recurrent impl #12799

Merged


github-actions bot added examples server labels Apr 25, 2025

ggerganov force-pushed the gg/llama-kv-cache-v6 branch 5 times, most recently from 58115a2 to 7e79a42 Compare May 2, 2025 13:02

Base automatically changed from gg/llama-kv-cache-v6 to master May 2, 2025 14:48

Green-Sky reviewed May 2, 2025

examples/embedding/embedding.cpp (outdated, resolved)

ggerganov added 3 commits May 3, 2025 11:21

context : allow cache-less context for embeddings

9770efa

ggml-ci

context : enable reranking with encode()

a21ff6c

ggml-ci

context : encode() clears embd_seq

c14ee72

ggml-ci

ggerganov force-pushed the gg/embeddings-no-kv branch from 4f0ea9b to c14ee72 Compare May 3, 2025 08:23

ggerganov added 3 commits May 3, 2025 17:46

examples : use llama_encode() when appropriate

c709275

ggml-ci

models : nomic bert moe does not require KV cache

97b975d

llama : update comments for llama_decode/llama_encode

3b4f6c0

ggml-ci

ggerganov marked this pull request as ready for review May 3, 2025 15:23

ggerganov requested a review from ngxson as a code owner May 3, 2025 15:23

context : update warning log [no ci]

abe25e7
