kv cache slot search improvements #3493
Conversation
llama.cpp (Outdated)
```diff
@@ -1316,8 +1316,8 @@ static bool llama_kv_cache_find_slot(
     while (true) {
         if (cache.head + n_tokens > n_ctx) {
             n_tested += cache.size - cache.head;
```
This might be wrong, but I really don't understand the logic of setting `n_tested` to the full cache size in this case. That means we just give up searching in that case, but it's possible there's an empty slot of `n_tokens` at a point before the current `cache.head`, right?

At the least, `n_tested += n_ctx - cache.head` is kind of redundant since `cache.head` just got set to 0. So this is like `n_tested += n_ctx - 0`, i.e. the full context size.
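For reference, here is a self-contained sketch of the search loop being discussed, using simplified stand-in types rather than the real llama.cpp structs (the cell check is reduced to a bare `pos` array). It only illustrates the wrap-around accounting, not the exact upstream code:

```cpp
#include <cstdint>
#include <vector>

// Simplified stand-in (not the real llama.cpp structs): a cell is free when
// its pos is negative; the cache holds n_ctx cells.
struct kv_cells_sketch {
    uint32_t             head = 0;
    std::vector<int32_t> pos;
};

// Sketch of the slot search. The wrap-around branch counts only the tail
// cells that were skipped, *then* resets head; doing the reset first (the
// pre-PR order) makes the increment n_ctx - 0, i.e. the whole buffer, so the
// search gives up after a single wrap.
static bool find_slot_sketch(kv_cells_sketch & cache, uint32_t n_tokens) {
    const uint32_t n_ctx = (uint32_t) cache.pos.size();
    if (n_tokens > n_ctx) {
        return false; // the batch cannot fit at all
    }

    uint32_t n_tested = 0;
    while (true) {
        if (cache.head + n_tokens > n_ctx) {
            n_tested  += n_ctx - cache.head; // tail cells that cannot hold the batch
            cache.head = 0;                  // wrap and keep searching from the start
            continue;
        }

        bool found = true;
        for (uint32_t i = 0; i < n_tokens; i++) {
            if (cache.pos[cache.head + i] >= 0) {
                found       = false;
                cache.head += i + 1;         // restart just past the occupied cell
                n_tested   += i + 1;
                break;
            }
        }

        if (found) {
            return true; // cache.head now points at n_tokens free cells
        }
        if (n_tested >= n_ctx) {
            return false; // the whole buffer was tested without finding room
        }
    }
}
```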
I also really wanted to write something like

```cpp
// Check if there's an empty slot past what we allocated. If so, we can
// set head to it and immediately find an empty slot next time. Otherwise
// just reset head to 0.
if (cache.head + n_tokens < n_ctx && cache.cells[cache.head + n_tokens].pos < 0 && cache.cells[cache.head + n_tokens].seq_id.empty()) {
    cache.head += n_tokens;
} else {
    cache.head = 0;
}
```

at the end of the function, but it seems like `cache.head` is used to communicate the start of the tokens to evaluate for the actual evaluation (and it also does `cache.head += n_tokens` after).
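To illustrate the constraint described above, here is a rough stand-in sketch (illustrative names only, not the real llama.cpp decode path) of why `head` has to stay pointing at the batch that was just placed:

```cpp
#include <cstdint>

// Stand-in state, not the real llama.cpp struct.
struct kv_state_sketch {
    uint32_t head = 0;
    uint32_t size = 0;
};

// The decode step reads head as "where the new tokens were placed" before it
// advances it, so the slot search cannot move head past the batch it just
// reserved without breaking this hand-off.
static void decode_step_sketch(kv_state_sketch & kv, uint32_t n_tokens) {
    const uint32_t kv_head = kv.head; // slot chosen by the slot search

    // ... the graph for this batch would be built and evaluated against
    //     cells [kv_head, kv_head + n_tokens) here ...
    (void) kv_head;

    // only afterwards is head advanced past the cells that were just written
    kv.head += n_tokens;
}
```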
Oops, that was a bug! The fix is OK, but we should either use `cache.size` or `n_ctx` at all places in the function. No need to mix it.
```cpp
}

static void llama_kv_cache_seq_rm(
        struct llama_kv_cache & cache,
                 llama_seq_id   seq_id,
                    llama_pos   p0,
                    llama_pos   p1) {
    uint32_t new_head = cache.size;
```
The logic for these few functions follows basically the same pattern. We'll set `new_head` to the first freed slot if there is one. `new_head = cache.size` is a safe but invalid value we know will fit in `uint32_t`.

This could maybe be improved by only changing `cache.head` if it 1) doesn't already point at a free slot, and 2) points at an index greater than the one that got freed up. The idea would be to try to maximize using slots near the beginning of the cache. I'm not sure doing this is really worth the complexity though.

These changes (even in the simple form) will slow down the cache manipulation functions a little. I think that at least is worth it because searching for a slot is probably the most common case.
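To make the pattern concrete, here is a simplified, self-contained sketch of the "remember the first freed cell" idea, using stand-in types rather than the real llama.cpp cache structures:

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Simplified stand-ins for the cache structures (not the real llama.cpp types).
struct kv_cell_sketch {
    int32_t           pos = -1;   // < 0 means the cell is free
    std::set<int32_t> seq_id;
};

struct kv_cache_sketch {
    uint32_t                    head = 0;
    std::vector<kv_cell_sketch> cells;
};

// Remove seq_id from cells whose pos is in [p0, p1) and, if that frees any
// cell, point head at the first one so the next slot search starts there.
static void seq_rm_sketch(kv_cache_sketch & cache, int32_t seq_id, int32_t p0, int32_t p1) {
    const uint32_t size     = (uint32_t) cache.cells.size();
    uint32_t       new_head = size; // "safe but invalid" sentinel: nothing freed yet

    for (uint32_t i = 0; i < size; ++i) {
        kv_cell_sketch & cell = cache.cells[i];
        if (cell.pos >= p0 && cell.pos < p1 && cell.seq_id.count(seq_id) > 0) {
            cell.seq_id.erase(seq_id);
            if (cell.seq_id.empty()) {
                cell.pos = -1;                       // cell is now free
                if (new_head == size) new_head = i;  // remember the first freed cell
            }
        }
    }

    if (new_head != size) {
        cache.head = new_head; // the next find-slot call starts at the freed region
    }
}
```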
Because of […] rather than […], I really feel like there's a way to hit line 1325 where […]. If […] got changed to set […]. What do you think about changing it to something like:

```cpp
if (ctx.kv_self.head + n_tokens >= ctx.kv_self.size) {
    ctx.kv_self.head = 0;
} else {
    ctx.kv_self.head += n_tokens;
}
```
I don't think […]. Example: […]. This fits in the cache.
Right, that change isn't possible right now in […]. So I was talking about changing […] to

```cpp
if (ctx.kv_self.head + n_tokens >= ctx.kv_self.size) {
    ctx.kv_self.head = 0;
} else {
    ctx.kv_self.head += n_tokens;
}
```

instead of just `ctx.kv_self.head += n_tokens`.
Yes, that makes sense.
Add some comments to prevent dumb people (like me) from getting confused.
This should be OK to merge. Make sure you did a few tests and when you think it's stable - merge it
Btw, is there any measurable performance gain from the change?
I had done some testing at the point I submitted, but I did more extensive tests as well. As far as I can see, it doesn't break anything; tested with […].

I think any noticeable performance gains in a realistic situation would come from less fragmentation and being able to avoid splitting batches, rather than from the slot search part. I think I mentioned in the issue about defragging that I realized later the slot search wasn't really the issue: it seems like generation speeds just slow down over time when running in batched mode (this doesn't happen with only one item in the batch).

Pretty much the ideal case for speeding up the slot search would be a small model and a very long context. I tried generating 8,000 tokens with a 3B model. The difference was pretty small, and may just have been random fluctuation: 29.56 TPS with this pull vs 29.21 TPS without it. It seems like it possibly makes more of a difference when running parallel (because of potentially resisting fragmentation more), but because of the way the […].

What will actually make a big difference is being able to defrag the KV cache, but I haven't figured out how to do that. If I had any kind of example of the operations necessary to swap two slots (or swap based on a list of deltas like the shift stuff generates), then I could probably handle the rest of it.

edit: I created #3507 to look for guidance...
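Purely as an illustration of the bookkeeping half of that idea, here is a hypothetical sketch (stand-in types, not llama.cpp code) of swapping the metadata of two cells; the hard part the comment is actually asking about, moving the matching rows of the K and V tensors, is deliberately left out:

```cpp
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// Hypothetical sketch, not part of llama.cpp: swapping only the *metadata*
// of two KV cache cells.
struct kv_cell_meta {
    int32_t           pos = -1;
    std::set<int32_t> seq_id;
};

static void swap_cell_meta(std::vector<kv_cell_meta> & cells, uint32_t i, uint32_t j) {
    std::swap(cells[i], cells[j]);
    // NOTE: the actual K/V data stored for cells i and j (one row per layer
    // in the cache tensors) is not touched here; without moving that data as
    // well, attention would read the wrong values for these cells, which is
    // exactly the part #3507 asks for guidance on.
}
```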
…example

* 'master' of github.com:ggerganov/llama.cpp:
  kv cache slot search improvements (ggml-org#3493)
  prompts : fix editorconfig checks after ggml-org#3416
  parallel : add option to load external prompt file (ggml-org#3416)
  server : reuse llama_sample_token common util (ggml-org#3494)
  llama : correct hparams comparison (ggml-org#3446)
  ci : fix xcodebuild destinations (ggml-org#3491)
  convert : update Falcon script for new HF config (ggml-org#3448)
  build : use std::make_tuple() for compatibility with older GCC versions (ggml-org#3488)
  common : process escape sequences in reverse prompts (ggml-org#3461)
  CLBlast: Fix handling of on-device tensor data
  server : fix incorrect num_tokens_predicted (ggml-org#3480)
  swift : disable ACCELERATE_NEW_LAPACK (ggml-org#3481)
  ci : add swift build via xcodebuild (ggml-org#3482)
* kv cache slot search improvements
* Use n_ctx in kv find slot for consistency
* Ensure kv cache head points to a valid slot in llama_decode internal
* Add some comments to prevent dumb people (like me) from getting confused.
Some improvements to the KV cache `head` handling logic which may speed up finding an empty slot and reduce fragmentation in some cases.