Question: KV Cache Size in GGML_FLASH_ATTN_EXT Operator #13816
-
I've been tracing the KV cache tensor that is passed into the GGML_FLASH_ATTN_EXT operator, and its size appears to be aligned to 256 rather than growing with the decode sequence length. I have a hypothesis about this: could it be to facilitate parallel processing? If so, this approach would primarily benefit GPUs, so does it hurt performance on the CPU? Could someone clarify why the KV cache input for FLASH_ATTN_EXT appears to be aligned to 256 instead of growing with the decode sequence length?
-
The KV cache is padded here: llama.cpp/src/llama-kv-cache.cpp, lines 19 to 24 in 4f81b33. The main reason is to be able to write more efficient GPU kernels that don't have to worry about out-of-bounds access. You are correct that this is technically an overhead for the CPU, since it currently uses a scalar implementation. But in the future, the FA CPU implementation could be vectorized and would also benefit from this padding. This padding is also a necessary step for reusing compute graphs during generation.
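For illustration, here is a minimal C++ sketch of the padding idea (the names `kv_pad` and `n_kv_used` are made up for this example and are not the actual llama.cpp identifiers): the number of KV cells presented to the attention kernel is rounded up to a multiple of 256, so a GPU kernel can iterate over fixed-size blocks without per-block bounds checks.

```cpp
#include <cstdint>
#include <cstdio>

// Round n up to the next multiple of align (illustrative helper).
static uint32_t kv_pad(uint32_t n, uint32_t align = 256) {
    return ((n + align - 1) / align) * align;
}

int main() {
    // During decoding, the number of KV cells actually in use grows by one per
    // token, but the tensor handed to the attention op stays at the padded size.
    const uint32_t samples[] = {1, 200, 256, 257, 1000};
    for (uint32_t n_kv_used : samples) {
        printf("n_kv_used = %4u -> padded n_kv = %4u\n", n_kv_used, kv_pad(n_kv_used));
    }
    return 0;
}
```

The extra padded cells are masked out during attention (e.g. with -INF in the KQ mask), so they don't change the result, only the shape the kernel sees.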
-
I generally understand the purpose of the 256 alignment, but based on the code below, isn't the graph still built multiple times on the CPU side? llama.cpp/src/llama-context.cpp, lines 977 to 985 in 1e8659e. Furthermore, does this typically introduce a significant performance overhead?
Yes, the graph is currently constructed for every ubatch and this introduces some overhead.
The padding is a necessary step to be able to reuse the graphs, but it's not everything. Currently, there is a dynamic offset `head` in the KV cache that changes for every ubatch: llama.cpp/src/llama-kv-cache.cpp, lines 542 to 554 in 05f6ac6.
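To make that concrete, here is a hedged sketch of why a per-ubatch offset blocks graph reuse; the struct and variable names are made up for illustration and are not llama.cpp's actual API. If the byte offset of the KV view is baked into the graph when the view node is created, a graph built for one value of `head` is not valid for the next ubatch, so the graph has to be rebuilt.

```cpp
#include <cstddef>
#include <cstdio>

// Illustrative stand-in for a view node: it records a fixed byte offset into
// the KV buffer at the time the graph is built.
struct kv_view {
    size_t offs_bytes;
};

int main() {
    const size_t row_size = 4096; // bytes per KV cell (illustrative value)

    size_t head = 0; // index of the first free cell in the KV cache
    for (int ubatch = 0; ubatch < 3; ++ubatch) {
        const size_t n_tokens = 32;

        // The view where this ubatch writes its K/V data; the offset is
        // frozen into the graph node here.
        kv_view v = { head * row_size };
        printf("ubatch %d: view offset = %zu bytes\n", ubatch, v.offs_bytes);

        head += n_tokens; // the dynamic offset moves -> the old graph no longer matches
    }
    return 0;
}
```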