kv cache slot search improvements #3493
Conversation
llama.cpp (Outdated)
```diff
@@ -1316,8 +1316,8 @@ static bool llama_kv_cache_find_slot(
     while (true) {
         if (cache.head + n_tokens > n_ctx) {
             n_tested += cache.size - cache.head;
```
This might be wrong, but I really don't understand the logic of setting `n_tested` to the full cache size in this case. That means we just give up searching in that case, but it's possible there's an empty slot of `n_tokens` at a point before the current `cache.head`, right?

At the least, `n_tested += n_ctx - cache.head` is kind of redundant since `cache.head` just got set to 0. So this is like `n_tested += n_ctx - 0`, i.e. the full context size.
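For reference, here is a self-contained sketch of the search loop being discussed, using simplified stand-in types rather than the real llama.cpp structs (the cell check is reduced to a bare `pos` array). It only illustrates the wrap-around accounting, not the exact upstream code:

```cpp
#include <cstdint>
#include <vector>

// Simplified stand-in (not the real llama.cpp structs): a cell is free when
// its pos is negative; the cache holds n_ctx cells.
struct kv_cells_sketch {
    uint32_t             head = 0;
    std::vector<int32_t> pos;
};

// Sketch of the slot search. The wrap-around branch counts only the tail
// cells that were skipped, *then* resets head; doing the reset first (the
// pre-PR order) makes the increment n_ctx - 0, i.e. the whole buffer, so the
// search gives up after a single wrap.
static bool find_slot_sketch(kv_cells_sketch & cache, uint32_t n_tokens) {
    const uint32_t n_ctx = (uint32_t) cache.pos.size();
    if (n_tokens > n_ctx) {
        return false; // the batch cannot fit at all
    }

    uint32_t n_tested = 0;
    while (true) {
        if (cache.head + n_tokens > n_ctx) {
            n_tested  += n_ctx - cache.head; // tail cells that cannot hold the batch
            cache.head = 0;                  // wrap and keep searching from the start
            continue;
        }

        bool found = true;
        for (uint32_t i = 0; i < n_tokens; i++) {
            if (cache.pos[cache.head + i] >= 0) {
                found       = false;
                cache.head += i + 1;         // restart just past the occupied cell
                n_tested   += i + 1;
                break;
            }
        }

        if (found) {
            return true; // cache.head now points at n_tokens free cells
        }
        if (n_tested >= n_ctx) {
            return false; // the whole buffer was tested without finding room
        }
    }
}
```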
I also really wanted to write something like

```cpp
// Check if there's an empty slot past what we allocated. If so, we can
// set head to it and immediately find an empty slot next time. Otherwise
// just reset head to 0.
if (cache.head + n_tokens < n_ctx && cache.cells[cache.head + n_tokens].pos < 0 && cache.cells[cache.head + n_tokens].seq_id.empty()) {
    cache.head += n_tokens;
} else {
    cache.head = 0;
}
```

at the end of the function, but it seems like `cache.head` is used to communicate the start of the tokens to evaluate for the actual evaluation (and it also does `cache.head += n_tokens` after).
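To illustrate the constraint described above, here is a rough stand-in sketch (illustrative names only, not the real llama.cpp decode path) of why `head` has to stay pointing at the batch that was just placed:

```cpp
#include <cstdint>

// Stand-in state, not the real llama.cpp struct.
struct kv_state_sketch {
    uint32_t head = 0;
    uint32_t size = 0;
};

// The decode step reads head as "where the new tokens were placed" before it
// advances it, so the slot search cannot move head past the batch it just
// reserved without breaking this hand-off.
static void decode_step_sketch(kv_state_sketch & kv, uint32_t n_tokens) {
    const uint32_t kv_head = kv.head; // slot chosen by the slot search

    // ... the graph for this batch would be built and evaluated against
    //     cells [kv_head, kv_head + n_tokens) here ...
    (void) kv_head;

    // only afterwards is head advanced past the cells that were just written
    kv.head += n_tokens;
}
```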
Oops, that was a bug! The fix is OK, but we should either use `cache.size` or `n_ctx` at all places in the function. No need to mix it.
```cpp
}

static void llama_kv_cache_seq_rm(
        struct llama_kv_cache & cache,
                 llama_seq_id   seq_id,
                    llama_pos   p0,
                    llama_pos   p1) {
    uint32_t new_head = cache.size;
```
The logic for these few functions follows basically the same pattern. We'll set `new_head` to the first freed slot if there is one. `new_head = cache.size` is a safe but invalid value we know will fit in `uint32_t`.

This could maybe be improved by only changing `cache.head` if it 1) doesn't already point at a free slot, and 2) points at an index greater than the one that got freed up. The idea would be to try to maximize using slots near the beginning of the cache. I'm not sure doing this is really worth the complexity though.

These changes (even in the simple form) will slow down the cache manipulation functions a little. I think that at least is worth it because searching for a slot is probably the most common case.
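To make the pattern concrete, here is a simplified, self-contained sketch of the "remember the first freed cell" idea, using stand-in types rather than the real llama.cpp cache structures:

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Simplified stand-ins for the cache structures (not the real llama.cpp types).
struct kv_cell_sketch {
    int32_t           pos = -1;   // < 0 means the cell is free
    std::set<int32_t> seq_id;
};

struct kv_cache_sketch {
    uint32_t                    head = 0;
    std::vector<kv_cell_sketch> cells;
};

// Remove seq_id from cells whose pos is in [p0, p1) and, if that frees any
// cell, point head at the first one so the next slot search starts there.
static void seq_rm_sketch(kv_cache_sketch & cache, int32_t seq_id, int32_t p0, int32_t p1) {
    const uint32_t size     = (uint32_t) cache.cells.size();
    uint32_t       new_head = size; // "safe but invalid" sentinel: nothing freed yet

    for (uint32_t i = 0; i < size; ++i) {
        kv_cell_sketch & cell = cache.cells[i];
        if (cell.pos >= p0 && cell.pos < p1 && cell.seq_id.count(seq_id) > 0) {
            cell.seq_id.erase(seq_id);
            if (cell.seq_id.empty()) {
                cell.pos = -1;                       // cell is now free
                if (new_head == size) new_head = i;  // remember the first freed cell
            }
        }
    }

    if (new_head != size) {
        cache.head = new_head; // the next find-slot call starts at the freed region
    }
}
```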
Because of […] rather than […], I really feel like there's a way to hit line 1325 where […]. If […] got changed to set […]. What do you think about changing it to something like:

```cpp
if (ctx.kv_self.head + n_tokens >= ctx.kv_self.size) {
    ctx.kv_self.head = 0;
} else {
    ctx.kv_self.head += n_tokens;
}
```
I don't think […]. Example: […]. This fits in the cache.
Right, that change isn't possible right now in […]. So I was talking about changing […] to

```cpp
if (ctx.kv_self.head + n_tokens >= ctx.kv_self.size) {
    ctx.kv_self.head = 0;
} else {
    ctx.kv_self.head += n_tokens;
}
```

instead of just `ctx.kv_self.head += n_tokens`.
Yes, that makes sense.
Add some comments to prevent dumb people (like me) from getting confused.
This should be OK to merge. Make sure you did a few tests and when you think it's stable - merge it
Btw, is there any measurable performance gain from the change?
I had done some testing at the point I submitted, but I did more extensive tests as well. As far as I can see, it doesn't break anything; tested with […].

I think any noticeable performance gains in a realistic situation would come from less fragmentation and being able to avoid splitting batches, rather than from the slot search part. I think I mentioned in the issue about defragging that I realized later the slot search wasn't really the issue: it seems like generation speeds just slow down over time when running in batched mode (this doesn't happen with only one item in the batch).

Pretty much the ideal case for speeding up the slot search would be a small model and a very long context. I tried generating 8,000 tokens with a 3B model. The difference was pretty small, and may just have been random fluctuation: 29.56 TPS with this pull vs 29.21 TPS without it. It seems like it possibly makes more of a difference when running parallel (because of potentially resisting fragmentation more), but because of the way the […].

What will actually make a big difference is being able to defrag the KV cache, but I haven't figured out how to do that. If I had any kind of example of the operations necessary to swap two slots (or swap based on a list of deltas like the shift stuff generates), then I could probably handle the rest of it.

edit: I created #3507 to look for guidance...
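Purely as an illustration of the bookkeeping half of that idea, here is a hypothetical sketch (stand-in types, not llama.cpp code) of swapping the metadata of two cells; the hard part the comment is actually asking about, moving the matching rows of the K and V tensors, is deliberately left out:

```cpp
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// Hypothetical sketch, not part of llama.cpp: swapping only the *metadata*
// of two KV cache cells.
struct kv_cell_meta {
    int32_t           pos = -1;
    std::set<int32_t> seq_id;
};

static void swap_cell_meta(std::vector<kv_cell_meta> & cells, uint32_t i, uint32_t j) {
    std::swap(cells[i], cells[j]);
    // NOTE: the actual K/V data stored for cells i and j (one row per layer
    // in the cache tensors) is not touched here; without moving that data as
    // well, attention would read the wrong values for these cells, which is
    // exactly the part #3507 asks for guidance on.
}
```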
…example

* 'master' of github.com:ggerganov/llama.cpp:
  kv cache slot search improvements (ggml-org#3493)
  prompts : fix editorconfig checks after ggml-org#3416
  parallel : add option to load external prompt file (ggml-org#3416)
  server : reuse llama_sample_token common util (ggml-org#3494)
  llama : correct hparams comparison (ggml-org#3446)
  ci : fix xcodebuild destinations (ggml-org#3491)
  convert : update Falcon script for new HF config (ggml-org#3448)
  build : use std::make_tuple() for compatibility with older GCC versions (ggml-org#3488)
  common : process escape sequences in reverse prompts (ggml-org#3461)
  CLBlast: Fix handling of on-device tensor data
  server : fix incorrect num_tokens_predicted (ggml-org#3480)
  swift : disable ACCELERATE_NEW_LAPACK (ggml-org#3481)
  ci : add swift build via xcodebuild (ggml-org#3482)
* kv cache slot search improvements
* Use n_ctx in kv find slot for consistency
* Ensure kv cache head points to a valid slot in llama_decode internal
* Add some comments to prevent dumb people (like me) from getting confused.
Some improvements to the KV cache `head` handling logic which may speed up finding an empty slot and reduce fragmentation in some cases.