Misc. bug: [SERVER] Multiple slots, generation speed is degraded after each generation/slot used #10860
Comments
Does the total throughput still increase when you use more slots?
Hello,
The Python code was edited like this (the Python code is in the archive posted in the first message):

runParallel()
runSequential()
runParallel()
runSequential()

(run twice to fill/glitch/whatever all slots)

When the server is started with 15 slots:
Time taken for parallel: 19.48640537261963
Time taken for sequential: 47.198315382003784
Time taken for parallel: 24.64862847328186
Time taken for sequential: 46.10104751586914

When the server is run with only 1 slot:
Time taken for parallel: 27.321959018707275
Time taken for sequential: 18.821592569351196
Time taken for parallel: 17.11741614341736
Time taken for sequential: 17.89683699607849

To answer your question: no, a parallel run on a multi-slot server is not faster than a sequential run on a single-slot server.
Sequential speeds are expected to be the same in single- and multi-slot configurations, but a glitch prevents that. Parallel speed on a multi-slot server is expected to be faster than sequential speed on a single-slot server, yet the 1-slot server is always faster than any other slot configuration.

Bonus run, as I didn't see the bug on a 2-slot server:
Time taken for parallel: 15.20603609085083
Time taken for sequential: 20.369807481765747
Time taken for parallel: 15.667339563369751
Time taken for sequential: 19.47208571434021
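For reference, a minimal sketch of what such a benchmark client could look like; the endpoint URL, prompt, and payload fields are assumptions on my part, and the actual script is in the attached archive:

```python
import time
import requests
from concurrent.futures import ThreadPoolExecutor

# Assumed server address and completion endpoint; adjust to your setup.
URL = "http://localhost:8080/completion"
N_REQUESTS = 15
PAYLOAD = {"prompt": "Write a short story about a robot.", "n_predict": 256}

def send_request(_):
    # One blocking completion request; returns when generation finishes.
    return requests.post(URL, json=PAYLOAD).json()

def run_parallel():
    start = time.time()
    with ThreadPoolExecutor(max_workers=N_REQUESTS) as pool:
        list(pool.map(send_request, range(N_REQUESTS)))
    print("Time taken for parallel:", time.time() - start)

def run_sequential():
    start = time.time()
    for i in range(N_REQUESTS):
        send_request(i)
    print("Time taken for sequential:", time.time() - start)

if __name__ == "__main__":
    run_parallel()
    run_sequential()
    run_parallel()
    run_sequential()
```

The two helpers mirror the runParallel()/runSequential() calls above; running them twice makes sure every slot has been used at least once before the second measurement.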
Context: I did a lot of work on CUDA performance with a focus on a single user/slot. So far I have not prioritized throughput for multiple users/slots. I'm currently working on llama.cpp training/finetuning though and will eventually require more throughput for evaluating model quality post-finetuning, so I will likely look into better server throughput in a few months' time. I cannot speak for the priorities of other devs though.
But since you nailed down the problem to a specific commit, there is a good chance that it can be fixed. I just meant to say more generally that in the future there will likely be more dev attention on server throughput.
There is a confusion: it's not this commit that created the bug, it's this commit that easily revealed it, because before it the server was only using the first slot. As soon as you use more slots (even before that commit), performance goes down.
I could be terribly wrong, but isn't batch processing supposed to provide higher total token/s throughput than a single slot? For example, with one prompt you will get 150 t/s, but with 10 you will get 100 t/s per prompt, which implies 1000 t/s in total, which is faster at the end of the day, right?
This is expected - it's a side effect of the unified KV cache. Effectively, all slots keep their context in the common context, so with each request the KV cache grows. This will be fixed after we refactor the implementation to support a parallel-friendly KV cache. For now, don't use more than 4 slots and use …
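To illustrate why this degrades generation speed, a toy model (my own simplification, assuming per-token attention cost grows with the total number of occupied cells in the shared cache; this is not the actual llama.cpp implementation):

```python
# Toy model (not the real implementation): per-token attention cost is taken
# to be proportional to the number of occupied cells in the shared KV cache.

N_SLOTS = 15
TOKENS_PER_REQUEST = 512

occupied_cells = 0  # cells retained in the unified cache by previous requests

for request in range(1, N_SLOTS + 1):
    # Each request lands on a fresh slot, whose context stays in the cache.
    occupied_cells += TOKENS_PER_REQUEST
    relative_cost = occupied_cells / TOKENS_PER_REQUEST
    print(f"request {request}: ~{relative_cost:.0f}x the per-token cost of request 1")
```

In this model the fifteenth request pays roughly fifteen times the attention cost of the first, which matches the "slower and slower after each generation" pattern reported above.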
Hello and thanks for the answer Ggerganov. For later reference, I ran more tests: a 4-slot server + short prompt + low … is fine. As soon as I use a longer prompt, more than 4 slots, or a higher …, the slowdown shows up again.
I also ran into the problem of speed degradation when using multiple slots. Is it possible, as a temporary solution, to clear the KV cache after the slot finishes generating? This can be added as an additional parameter.
The problem is we don't know when the slot has finished.
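One possible client-side workaround (an assumption on my part, not something confirmed in this thread): if the build exposes the POST /slots/{id_slot}?action=erase endpoint and the id_slot request field, the client could erase a slot's cache itself after receiving its response, since the client does know when its own request has finished. A rough sketch:

```python
import requests

SERVER = "http://localhost:8080"  # assumed server address

def complete_and_erase(prompt: str, slot_id: int) -> str:
    # Pin the request to a specific slot so we know which cache to erase.
    # The id_slot field and the erase action are assumptions about this build.
    resp = requests.post(
        f"{SERVER}/completion",
        json={"prompt": prompt, "n_predict": 128, "id_slot": slot_id},
    ).json()
    # Once the response is back, this slot is done from the client's point of
    # view, so its KV cache can be dropped to keep the unified cache small.
    requests.post(f"{SERVER}/slots/{slot_id}?action=erase")
    return resp["content"]
```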
Stale but not fixed, AFAIK! (Just in case it auto-closes the issue.)
#11213 will be a major step towards resolving this. It's one of the higher priorities now.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Bad bot!
I didn't take the time to re-run the tests, but I doubt this was fixed in #12181 as it was only a refactor. Is it possible to reopen?
Name and Version
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Problem description & steps to reproduce
Hello,
Short version:
When using llama-server with only one slot (`--threads-http 1 -np 1`) you can sequentially send prompts to process and there is no speed degradation. When you use multiple slots (it starts showing up from 3 slots and doesn't show up with 2), generation becomes slower and slower after each finished generation.
Used CLI:
`./build/bin/llama-server --host 0.0.0.0 --port 8080 --model /opt/IdExtend/models/llm/Qwen2.5-7B-Instruct-Q4_K_M.gguf --ctx-size 122880 --threads-http 15 -np 15 --tensor-split 1.0,0.0,0.0 -ngl 99999`
Also gave a try with:
- `--cache-reuse 50000`: ineffective
- `--defrag-thold 0.0` or `--defrag-thold 0.99`: ineffective
- `--model /opt/IdExtend/models/llm/Mistral-7B-Instruct-v0.3.Q8_0.gguf`: ineffective
- `-sm none`: ineffective
- `--flash-attn --cache-type-k q8_0 --cache-type-v q8_0`: ineffective (I was using these from the start but decided to reduce to as few args as possible)

Yes, I understand that having multiple slots and using them in sequence is dumb. The issue is that I tried moving my backend from sequential use to parallel (so I had to create slots), but it doesn't go faster, which is why I tried tracking down the cause, and here I am.
Final run:
Python script logs:
You can find a zip with the Python script to reproduce it attached: responses.zip
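For readers who don't want to open the archive, a minimal sketch of the kind of client that shows the slowdown; the endpoint, prompt, and payload values here are my assumptions rather than the exact contents of the script:

```python
import time
import requests

# Assumed server address; matches the CLI above (port 8080).
URL = "http://localhost:8080/completion"
PAYLOAD = {"prompt": "Tell me a long story about a dragon.", "n_predict": 256}

# Send the same completion request several times in a row and time each one.
# With more than ~3 slots configured, each generation takes longer than the
# previous one even though the requests are strictly sequential.
for i in range(10):
    start = time.time()
    requests.post(URL, json=PAYLOAD)
    print(f"request {i}: {time.time() - start:.2f}s")
```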
Full server logs: server-logs.txt
Cleaned server logs:
First Bad Commit
No response
Relevant log output
No response
Edit:
I gave a try on another machine with build …
Issue persists.

Edit: I'm performing a binary search
-> version: 4149 (1bb30bf2) fail ❌
-> version: 4063 (505f3327) fail ❌
-> version: 4024 (329ed914) fail ❌
-> version: 4016 (42cadc74) fail ❌
-> version: 4015 (45950415) no issue ✔️
-> version: 4012 (7554aa46) no issue ✔️

Related PR ~~causing~~ introducing the issue: #10126. I doubt it CREATED the bug, I think it just revealed the existing bug.
The more slots are used, the slower it gets: