
Question: KV Cache Size in GGML_FLASH_ATTN_EXT Operator #13816

Answered by ggerganov
Zijie-Tian asked this question in Q&A


Yes, the graph is currently constructed for every ubatch, which introduces some overhead.

The padding is a necessary step towards reusing the graphs, but it is not sufficient on its own. Currently, there is a dynamic offset `head` into the KV cache that changes for every ubatch:

```cpp
ggml_tensor * llama_kv_cache_unified::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, int32_t il) const {
    const int32_t ikv = map_layer_ids.at(il);
    auto * k = layers[ikv].k;

    const int64_t n_tokens = k_cur->ne[2];

    // the byte offset of the view depends on `head`, which changes per ubatch
    ggml_tensor * k_view = ggml_view_1d(ctx, k,
            n_tokens*hparams.n_embd_k_gqa(il),
            ggml_row_size(k->type, hparams.n_embd_k_gqa(il))*head);

    return ggml_cpy(ctx, k_cur, k_view);
}
```
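To make the effect of that offset concrete, here is a minimal standalone sketch (not llama.cpp code; the sizes and the names `k_cache`, `k_cur`, `head` are illustrative). It builds the same kind of 1-D view at a byte offset derived from `head`: the offset is stored in the view tensor when the graph is constructed, so a graph built for one value of `head` cannot simply be replayed for another, which is why padding alone is not enough to reuse graphs.

```cpp
// Standalone sketch against the public ggml API (assumes "ggml.h" is available).
// All sizes and names are illustrative, not taken from llama.cpp.
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    const int64_t n_embd_k = 128; // K row size per token (illustrative)
    const int64_t n_ctx    = 512; // cache capacity in cells
    const int64_t n_tokens = 8;   // tokens in the current ubatch
    const int64_t head     = 32;  // first free cell -- changes every ubatch

    // one layer's K cache and the freshly computed K rows for this ubatch
    struct ggml_tensor * k_cache = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_embd_k*n_ctx);
    struct ggml_tensor * k_cur   = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_embd_k*n_tokens);

    // view of the destination region, starting at cell `head`;
    // the byte offset below becomes a fixed property of the view tensor
    struct ggml_tensor * k_view = ggml_view_1d(ctx, k_cache,
            n_tokens*n_embd_k,
            ggml_row_size(k_cache->type, n_embd_k)*head);

    // copying k_cur into k_view writes the new rows at that offset; a graph
    // containing this node is only valid for this particular value of `head`
    struct ggml_tensor * cpy = ggml_cpy(ctx, k_cur, k_view);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, cpy);

    ggml_free(ctx);
    return 0;
}
```

Building the graph is enough to show the point: the offset computed for `head == 32` is fixed inside `k_view`, so the next ubatch, with a different `head`, needs a newly constructed graph.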
