kv cache slot search improvements #3493


Merged: 3 commits into ggml-org:master, Oct 6, 2023

Conversation

KerfuffleV2 (Collaborator, PR author)

Some improvements to the KV cache head handling logic, which may speed up finding an empty slot and reduce fragmentation in some cases.

llama.cpp (outdated diff)

@@ -1316,8 +1316,8 @@ static bool llama_kv_cache_find_slot(

    while (true) {
        if (cache.head + n_tokens > n_ctx) {
            n_tested += cache.size - cache.head;
KerfuffleV2 (Collaborator, author) commented on this diff on Oct 5, 2023:

This might be wrong, but I really don't understand the logic of setting n_tested to the full cache size in this case. That means we just give up searching, but it's possible there's an empty run of n_tokens slots at a point before the current cache.head, right?

At the least, n_tested += n_ctx - cache.head is kind of redundant since cache.head just got set to 0, so this is effectively n_tested += n_ctx - 0, i.e. the full context size.


I also really wanted to write something like

// Check if there's an empty slot past what we allocated. If so, we can
// set head to it and immediately find an empty slot next time. Otherwise
// just reset head to 0.
if (cache.head + n_tokens < n_ctx && cache.cells[cache.head + n_tokens].pos < 0 && cache.cells[cache.head + n_tokens].seq_id.empty()) {
    cache.head += n_tokens;
} else {
    cache.head = 0;
}

at the end of the function, but it seems like cache.head is used to tell the actual evaluation where the tokens to evaluate start (and it also does cache.head += n_tokens afterwards).
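
For reference, here is a minimal standalone sketch of the wrap-around search with the tested-slot count updated before head is reset. The kv_cell / kv_cache structs below are simplified stand-ins, not the actual llama.cpp definitions, so treat this as an illustration of the idea rather than the real implementation.

#include <cstdint>
#include <set>
#include <vector>

struct kv_cell {
    int32_t           pos = -1;   // -1 means the cell is empty
    std::set<int32_t> seq_id;
};

struct kv_cache {
    uint32_t             head = 0;
    uint32_t             size = 0;
    std::vector<kv_cell> cells;
};

// Returns true and leaves cache.head at the start of a free run of n_tokens
// cells, or returns false once every position has been tested.
static bool find_slot_sketch(kv_cache & cache, uint32_t n_tokens) {
    const uint32_t n_ctx = cache.size;

    if (n_tokens > n_ctx) {
        return false; // a batch larger than the cache can never fit
    }

    uint32_t n_tested = 0;

    while (true) {
        if (cache.head + n_tokens > n_ctx) {
            // count the skipped tail cells *before* resetting head,
            // otherwise the search gives up too early
            n_tested  += n_ctx - cache.head;
            cache.head = 0;
            continue;
        }

        bool found = true;
        for (uint32_t i = 0; i < n_tokens; i++) {
            if (cache.cells[cache.head + i].pos >= 0) {
                found       = false;
                cache.head += i + 1;
                n_tested   += i + 1;
                break;
            }
        }

        if (found) {
            return true;
        }

        if (n_tested >= n_ctx) {
            return false;
        }
    }
}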

ggerganov (Member)

Oops, that was a bug! The fix is OK, but we should use either cache.size or n_ctx at all places in the function. No need to mix them.

}

static void llama_kv_cache_seq_rm(
        struct llama_kv_cache & cache,
                 llama_seq_id   seq_id,
                    llama_pos   p0,
                    llama_pos   p1) {
    uint32_t new_head = cache.size;
KerfuffleV2 (Collaborator, author) commented on this diff:

The logic for these few functions follows basically the same pattern: we set new_head to the first freed slot if there is one. new_head = cache.size is a safe but invalid value that we know will fit in uint32_t.

This could maybe be improved by only changing cache.head if it 1) doesn't already point at a free slot, and 2) points at an index greater than the one that got freed up. The idea would be to maximize use of slots near the beginning of the cache. I'm not sure doing this is really worth the complexity, though.

These changes (even in the simple form) will slow down the cache manipulation functions a little. I think that is worth it, at least, because searching for a slot is probably the most common case.
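
A simplified sketch of that pattern follows, again with stand-in structs rather than the real llama.cpp types (the real function handles more edge cases). The point is just how new_head serves both as a sentinel and as the index of the first freed cell.

#include <cstdint>
#include <set>
#include <vector>

struct kv_cell  { int32_t pos = -1; std::set<int32_t> seq_id; };
struct kv_cache { uint32_t head = 0; uint32_t size = 0; std::vector<kv_cell> cells; };

// Remove positions [p0, p1) of sequence seq_id and move head back to the
// first freed cell so the next slot search starts there.
static void seq_rm_sketch(kv_cache & cache, int32_t seq_id, int32_t p0, int32_t p1) {
    uint32_t new_head = cache.size; // safe but invalid sentinel: "nothing freed yet"

    for (uint32_t i = 0; i < cache.size; ++i) {
        kv_cell & cell = cache.cells[i];
        if (cell.pos >= p0 && cell.pos < p1 && cell.seq_id.count(seq_id) > 0) {
            cell.seq_id.erase(seq_id);
            if (cell.seq_id.empty()) {
                cell.pos = -1;                 // cell is now free
                if (new_head == cache.size) {
                    new_head = i;              // remember the first freed cell
                }
            }
        }
    }

    // cache.size is never a valid index, so it doubles as the "unchanged" marker
    if (new_head != cache.size) {
        cache.head = new_head;
    }
}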

KerfuffleV2 (Collaborator, author) commented on Oct 5, 2023:

Regarding https://github.com/ggerganov/llama.cpp/blob/48edda30ee545fdac2e7a33d505382888f748bbf/llama.cpp#L1318-L1332

Because of

cache.head + n_tokens > n_ctx

rather than

cache.head + n_tokens >= n_ctx

I really feel like there's a way to hit line 1325 where cache.head == cache.size, which means indexing cache.cells[cache.size], which would be invalid. So far I haven't come up with a scenario to prove it, though.

If

https://github.com/ggerganov/llama.cpp/blob/48edda30ee545fdac2e7a33d505382888f748bbf/llama.cpp#L4545-L4547

got changed to set head to 0 when it is >= the cache size, that would solve the problem. I was thinking about also adding a check for whether the slot it points at is free, but that feels like the wrong place for that kind of logic.

What do you think about changing it to something like:

if (ctx.kv_self.head + n_tokens >= ctx.kv_self.size) {
    ctx.kv_self.head = 0;
} else {
    ctx.kv_self.head += n_tokens;
}

KerfuffleV2 requested a review from ggerganov on October 5, 2023.
ggerganov (Member)

I don't think cache.head + n_tokens >= n_ctx is correct. We need > instead of >=.

Example:

head == 0
size == 10
n_tokens == 10

This fits in the cache (0 + 10 > 10 is false, so the batch occupies slots 0..9 without wrapping).

KerfuffleV2 (Collaborator, author)

> I don't think cache.head + n_tokens >= n_ctx is correct.

Right, that change isn't possible right now in llama_kv_cache_find_slot, because then we wouldn't find a slot even when one existed. What I meant is that, because of the way the logic is, later on in llama_decode_internal cache.head can get set to cache.size.

So I was talking about changing llama_decode_internal to do

if (ctx.kv_self.head + n_tokens >= ctx.kv_self.size) {
    ctx.kv_self.head = 0;
} else {
    ctx.kv_self.head += n_tokens;
}

instead of just ctx.kv_self.head += n_tokens. That way cache.head will always point to a valid index.

ggerganov (Member)

Yes, that makes sense.

Commit: Add some comments to prevent dumb people (like me) from getting confused.
ggerganov (Member) left a comment:

This should be OK to merge. Make sure you did a few tests and when you think it's stable - merge it

Btw, is there any measurable performance gain from the change?

KerfuffleV2 (Collaborator, author) commented on Oct 6, 2023:

> Make sure you did a few tests and when you think it's stable - merge it

I had done some testing at the point I submitted, but I did more extensive tests as well. As far as I can see, it doesn't break anything - tested with main, parallel, batched and perplexity.

> Btw, is there any measurable performance gain from the change?

I think any noticeable performance gains in a realistic situation would come from less fragmentation and being able to avoid splitting batches, rather than from the slot search part. I think I mentioned in the issue about defragging that I realized later the slot search wasn't the issue. It seems like generation speeds just slow down over time when running in batched mode (this doesn't happen with only one item in the batch).

Pretty much the ideal case for speeding up the slot search would be a small model and a very long context. I tried generating 8,000 tokens with a 3B model. The difference was pretty small and may just have been random fluctuation: 29.56 TPS with this pull vs. 29.21 TPS without it.

It seems like it possibly makes more of a difference when running parallel (because of potentially resisting fragmentation more), but because of the way the parallel example chooses prompts and generates, I'm not sure it's really an apples-to-apples comparison. My results there weren't really conclusive.

What would actually make a big difference is being able to defrag the KV cache, but I haven't figured out how to do that. If I had any kind of example of the operations necessary to swap two slots (or to swap based on a list of deltas like the shift stuff generates), then I could probably handle the rest of it. Edit: I created #3507 to look for guidance.

KerfuffleV2 merged commit 9ca79d5 into ggml-org:master on Oct 6, 2023.
joelkuiper added a commit to vortext/llama.cpp that referenced this pull request Oct 6, 2023
…example

* 'master' of github.com:ggerganov/llama.cpp:
  kv cache slot search improvements (ggml-org#3493)
  prompts : fix editorconfig checks after ggml-org#3416
  parallel : add option to load external prompt file (ggml-org#3416)
  server : reuse llama_sample_token common util (ggml-org#3494)
  llama : correct hparams comparison (ggml-org#3446)
  ci : fix xcodebuild destinations (ggml-org#3491)
  convert : update Falcon script for new HF config (ggml-org#3448)
  build : use std::make_tuple() for compatibility with older GCC versions (ggml-org#3488)
  common : process escape sequences in reverse prompts (ggml-org#3461)
  CLBlast: Fix handling of on-device tensor data
  server : fix incorrect num_tokens_predicted (ggml-org#3480)
  swift : disable ACCELERATE_NEW_LAPACK (ggml-org#3481)
  ci : add swift build via xcodebuild (ggml-org#3482)
yusiwen pushed a commit to yusiwen/llama.cpp that referenced this pull request Oct 7, 2023
* kv cache slot search improvements

* Use n_ctx in kv find slot for consistency

* Ensure kv cache head points to a valid slot in llama_decode internal

* Add some comments to prevent dumb people (like me) from getting confused.
KerfuffleV2 deleted the feat-kv_cache_improvements branch on November 17, 2023.