Misc. bug: [SERVER] Multiple slots, generation speed is degraded after each generation/slot used #10860

Open
ExtReMLapin opened this issue Dec 17, 2024 · 15 comments

Comments

@ExtReMLapin
Contributor

ExtReMLapin commented Dec 17, 2024

Name and Version

 ./build/bin/llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
version: 4338 (7b1ec53f)
built with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Problem description & steps to reproduce

Hello,

Short version:

When using llama-server with only one slot (--threads-http 1 -np 1), you can sequentially send prompts to process and there is no speed degradation.

When you use multiple slots (the issue starts showing up at 3 slots; it does not appear with 2), generation gets slower and slower after each finished generation.

CLI used:

./build/bin/llama-server --host 0.0.0.0 --port 8080 --model /opt/IdExtend/models/llm/Qwen2.5-7B-Instruct-Q4_K_M.gguf --ctx-size 122880 --threads-http 15 -np 15 --tensor-split 1.0,0.0,0.0 -ngl 99999

Also tried:

  • --cache-reuse 50000 (ineffective)
  • --defrag-thold 0.0 or --defrag-thold 0.99 (ineffective)
  • --model /opt/IdExtend/models/llm/Mistral-7B-Instruct-v0.3.Q8_0.gguf (ineffective)
  • -sm none (ineffective)
  • --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 (ineffective; I was using these from the start but decided to reduce to as few args as possible)

Yes, I understand that having multiple slots and using them in sequence is dumb. The issue is that I tried moving my backend from sequential to parallel use (so I had to create slots), but it doesn't go any faster; that's why I tried tracking down the cause, and here I am.

Final run:

./build/bin/llama-server --host 0.0.0.0 --port 8080 --model /opt/IdExtend/models/llm/Qwen2.5-7B-Instruct-Q4_K_M.gguf --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --ctx-size 122880 --threads-http 15 -np 15 --tensor-split 1.0,0.0,0.0 -sm none -ngl 99999

Python script logs:

[0]Time taken:  1.1400415897369385
[1]Time taken:  0.9648196697235107
[2]Time taken:  1.002309799194336
[3]Time taken:  1.353079080581665
[4]Time taken:  0.8274390697479248
[5]Time taken:  1.4006707668304443
[6]Time taken:  1.5088953971862793
[7]Time taken:  2.5358529090881348
[8]Time taken:  1.6904234886169434
[9]Time taken:  2.6186017990112305
[10]Time taken:  2.290717601776123
[11]Time taken:  2.0220725536346436
[12]Time taken:  1.9455785751342773
[13]Time taken:  3.2140021324157715
[14]Time taken:  2.404296636581421
[15]Time taken:  2.5479960441589355
[16]Time taken:  3.0076818466186523
[17]Time taken:  6.665952205657959
TOTAL Time taken for sequential:  39.140857458114624

You can find a zip with the python script to reproduce it attached.

responses.zip
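
For reference, a minimal sketch of what the sequential part of such a script might look like (this is only an approximation of the attached script; the endpoint, payload fields, and n_predict value are assumptions):

import time
import requests  # assumed dependency

SERVER = "http://localhost:8080"  # assumed host/port, matching the command line above

def run_sequential(prompts):
    """Send prompts one after another and print per-request and total timings."""
    start = time.time()
    for i, prompt in enumerate(prompts):
        t0 = time.time()
        requests.post(
            f"{SERVER}/completion",
            json={"prompt": prompt, "n_predict": 128},  # n_predict is an assumption
            timeout=600,
        )
        print(f"[{i}]Time taken: ", time.time() - t0)
    print("TOTAL Time taken for sequential: ", time.time() - start)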

Full server logs: server-logs.txt

Cleaned server logs:

prompt eval time =     254.31 ms /  2310 tokens (    0.11 ms per token,  9083.40 tokens per second)
       eval time =     879.65 ms /    97 tokens (    9.07 ms per token,   110.27 tokens per second)
      total time =    1133.96 ms /  2407 tokens
prompt eval time =     261.95 ms /  2343 tokens (    0.11 ms per token,  8944.49 tokens per second)
       eval time =     694.21 ms /    85 tokens (    8.17 ms per token,   122.44 tokens per second)
      total time =     956.16 ms /  2428 tokens
prompt eval time =     284.46 ms /  2285 tokens (    0.12 ms per token,  8032.76 tokens per second)
       eval time =     707.39 ms /    80 tokens (    8.84 ms per token,   113.09 tokens per second)
      total time =     991.85 ms /  2365 tokens
prompt eval time =     409.38 ms /  2924 tokens (    0.14 ms per token,  7142.46 tokens per second)
       eval time =     930.37 ms /    95 tokens (    9.79 ms per token,   102.11 tokens per second)
      total time =    1339.75 ms /  3019 tokens
prompt eval time =     357.83 ms /  2282 tokens (    0.16 ms per token,  6377.29 tokens per second)
       eval time =     454.73 ms /    44 tokens (   10.33 ms per token,    96.76 tokens per second)
      total time =     812.57 ms /  2326 tokens
prompt eval time =     388.00 ms /  2277 tokens (    0.17 ms per token,  5868.57 tokens per second)
       eval time =     996.40 ms /    89 tokens (   11.20 ms per token,    89.32 tokens per second)
      total time =    1384.39 ms /  2366 tokens
prompt eval time =     556.35 ms /  3011 tokens (    0.18 ms per token,  5412.09 tokens per second)
       eval time =     930.15 ms /    76 tokens (   12.24 ms per token,    81.71 tokens per second)
      total time =    1486.50 ms /  3087 tokens
prompt eval time =     618.16 ms /  3027 tokens (    0.20 ms per token,  4896.82 tokens per second)
       eval time =    1890.54 ms /   144 tokens (   13.13 ms per token,    76.17 tokens per second)
      total time =    2508.70 ms /  3171 tokens
prompt eval time =     651.99 ms /  2935 tokens (    0.22 ms per token,  4501.60 tokens per second)
       eval time =    1008.49 ms /    72 tokens (   14.01 ms per token,    71.39 tokens per second)
      total time =    1660.48 ms /  3007 tokens
prompt eval time =     903.68 ms /  2957 tokens (    0.31 ms per token,  3272.17 tokens per second)
       eval time =    1681.54 ms /   112 tokens (   15.01 ms per token,    66.61 tokens per second)
      total time =    2585.22 ms /  3069 tokens
prompt eval time =     805.01 ms /  2965 tokens (    0.27 ms per token,  3683.17 tokens per second)
       eval time =    1447.53 ms /    91 tokens (   15.91 ms per token,    62.87 tokens per second)
      total time =    2252.55 ms /  3056 tokens
prompt eval time =     831.70 ms /  2965 tokens (    0.28 ms per token,  3564.97 tokens per second)
       eval time =    1149.78 ms /    69 tokens (   16.66 ms per token,    60.01 tokens per second)
      total time =    1981.48 ms /  3034 tokens
prompt eval time =     996.94 ms /  2940 tokens (    0.34 ms per token,  2949.01 tokens per second)
       eval time =     905.74 ms /    52 tokens (   17.42 ms per token,    57.41 tokens per second)
      total time =    1902.69 ms /  2992 tokens
prompt eval time =     960.80 ms /  3074 tokens (    0.31 ms per token,  3199.42 tokens per second)
       eval time =    2201.62 ms /   118 tokens (   18.66 ms per token,    53.60 tokens per second)
      total time =    3162.42 ms /  3192 tokens
prompt eval time =    1161.53 ms /  2977 tokens (    0.39 ms per token,  2562.99 tokens per second)
       eval time =    1189.15 ms /    62 tokens (   19.18 ms per token,    52.14 tokens per second)
      total time =    2350.68 ms /  3039 tokens
prompt eval time =    1017.35 ms /  2934 tokens (    0.35 ms per token,  2883.97 tokens per second)
       eval time =    1481.01 ms /    76 tokens (   19.49 ms per token,    51.32 tokens per second)
      total time =    2498.35 ms /  3010 tokens
prompt eval time =    1035.18 ms /  2966 tokens (    0.35 ms per token,  2865.20 tokens per second)
       eval time =    1915.50 ms /    97 tokens (   19.75 ms per token,    50.64 tokens per second)
      total time =    2950.68 ms /  3063 tokens
prompt eval time =     638.59 ms /  1778 tokens (    0.36 ms per token,  2784.25 tokens per second)
       eval time =    5996.03 ms /   303 tokens (   19.79 ms per token,    50.53 tokens per second)
      total time =    6634.62 ms /  2081 tokens

First Bad Commit

No response

Relevant log output

No response

Edit:

I gave it a try on another machine, with this build:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
  Device 0: GRID A100-40C, compute capability 8.0, VMM: no
  Device 1: GRID A100-40C, compute capability 8.0, VMM: no
  Device 2: GRID A100-40C, compute capability 8.0, VMM: no
version: 4149 (1bb30bf2)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

The issue persists.

Edit: I'm performing a binary search (bisect):

-> version: 4149 (1bb30bf2) fail ❌

-> version: 4063 (505f3327) fail ❌

-> version: 4024 (329ed914) fail ❌

-> version: 4016 (42cadc74) fail ❌

-> version: 4015 (45950415) no issue ✔️

-> version: 4012 (7554aa46) no issue ✔️

Related PR that introduced the issue: #10126

I doubt it CREATED the bug; I think it just revealed an existing bug.

The more slots are used, the slower it gets:

[Attached images: plots showing generation slowing down as more slots are used]

@ExtReMLapin ExtReMLapin changed the title Misc. bug: [SERVER] Multiple slots, generation speed is degraded after each generation Misc. bug: [SERVER] Multiple slots, generation speed is degraded after each generation/slot used Dec 17, 2024
@JohannesGaessler
Collaborator

Does the total throughput still increase when you use more slots?

@ExtReMLapin
Contributor Author

ExtReMLapin commented Dec 17, 2024

Hello,

The Python code was edited like this (the Python code is in the archive posted in the first message):

runParallel()
runSequential()

runParallel()
runSequential()

(ran twice so that all slots get filled/affected)
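
For context, a rough sketch of what these two helpers could look like (hypothetical; the real code is in the attached archive, and the endpoint, payload, and worker count are assumptions):

import time
from concurrent.futures import ThreadPoolExecutor
import requests  # assumed dependency

SERVER = "http://localhost:8080"  # assumed host/port

def send_one(prompt):
    requests.post(f"{SERVER}/completion",
                  json={"prompt": prompt, "n_predict": 128}, timeout=600)

def runSequential(prompts):
    start = time.time()
    for p in prompts:
        send_one(p)
    print("Time taken for sequential: ", time.time() - start)

def runParallel(prompts, workers=15):  # workers chosen to match -np 15, an assumption
    start = time.time()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(send_one, prompts))
    print("Time taken for parallel: ", time.time() - start)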

When the server is started with 15 slots:

Time taken for parallel:  19.48640537261963
Time taken for sequential:  47.198315382003784
Time taken for parallel:  24.64862847328186
Time taken for sequential:  46.10104751586914

When the server is run with only 1 slot:

Time taken for parallel:  27.321959018707275
Time taken for sequential:  18.821592569351196
Time taken for parallel:  17.11741614341736
Time taken for sequential:  17.89683699607849

To answer your question:

No, a parallel run on a multi-slot server is not faster than a sequential run on a single-slot server.
Yes, a parallel run on a multi-slot server is faster than a sequential run on a multi-slot server, because it uses different slots instead of reusing slot 0.

Sequential speeds are expected to be the same in a single-slot or multi-slot configuration, but a glitch prevents that.

Parallel runs on a multi-slot server are expected to be faster than sequential runs on a single-slot server.

A 1-slot server is always faster than any other slot configuration.

Bonus run, as I didn't see the bug on a 2-slot server:

Time taken for parallel:  15.20603609085083
Time taken for sequential:  20.369807481765747
Time taken for parallel:  15.667339563369751
Time taken for sequential:  19.47208571434021

@JohannesGaessler
Collaborator

Context: I did a lot of work on CUDA performance with a focus on a single user/slot. So far I have not prioritized throughput for multiple users/slots. I'm currently working on llama.cpp training/finetuning though and will eventually require more throughput for evaluating model quality post finetuning. So I will likely look into better server throughput in a few months' time. I cannot speak for the priorities of other devs though.

@JohannesGaessler
Collaborator

But since you nailed down the problem to a specific commit there is a good chance that it can be fixed. I just meant to say more generally that in the future there will likely be more dev attention on server throughput.

@ExtReMLapin
Contributor Author

There is some confusion: it's not this commit that created the bug, it's this commit that easily revealed it, because before it only the first slot was being used. As soon as you use more slots (even before that commit), performance goes down.

@ExtReMLapin
Contributor Author

Context: I did a lot of work on CUDA performance with a focus on a single user/slot. So far I did not prioritize throughput for multiple users/slots. I'm currently working on llama.cpp training/finetuning though and will eventually require more throughput for evaluating model quality post finetuning. So I will likely look into better server throughput in a few months time. I cannot speak for the priorities of other devs though.

I could be terribly wrong, but isn't batch processing supposed to provide higher total token/s throughput than a single slot?

For example, with one prompt you will get 150 t/s, but with 10 prompts you will get 100 t/s per prompt, which implies 1000 t/s in total, so it is faster at the end of the day, right?
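
In code, the arithmetic behind that expectation (the throughput numbers are the hypothetical ones from the question above):

single_request_tps = 150       # tokens/s with 1 concurrent request (hypothetical)
per_request_tps_at_10 = 100    # tokens/s per request with 10 concurrent requests (hypothetical)
aggregate_tps = 10 * per_request_tps_at_10
print(aggregate_tps)           # 1000 tokens/s in total, well above the 150 tokens/s of a single slot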

@ggerganov
Member

This is expected: it's a side effect of the unified KV cache. Effectively, all slots keep their context in the common context, so with each request the KV cache grows. This will be fixed after we refactor the implementation to support a parallel-friendly KV cache. For now, don't use more than 4 slots and use -dt 0.1.
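
As an illustration of that advice, the command from the first message adapted to the suggested limits might look something like this (a sketch only; compared to the original command, only -np is lowered and -dt 0.1 is added):

./build/bin/llama-server --host 0.0.0.0 --port 8080 --model /opt/IdExtend/models/llm/Qwen2.5-7B-Instruct-Q4_K_M.gguf --ctx-size 122880 --threads-http 15 -np 4 --tensor-split 1.0,0.0,0.0 -ngl 99999 -dt 0.1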

@ExtReMLapin
Contributor Author

Hello, and thanks for the answer @ggerganov.

For future reference, I ran more tests:

A 4-slot server + short prompt + low n_predict leads to no issue, and everything works well.

As soon as I use a longer prompt, more than 4 slots, or a higher n_predict, performance goes down.

@elter-tef

I also ran into the problem of speed degradation when using multiple slots. Would it be possible, as a temporary workaround, to clear the KV cache after a slot finishes generating? This could be added as an additional parameter.

@ggerganov
Member

The problem is we don't know when the slot has finished.

@github-actions github-actions bot added the stale label Jan 21, 2025
@ExtReMLapin
Contributor Author

ExtReMLapin commented Jan 27, 2025

Stale but not fixed! (just in case it auto-closes the issue) AFAIK

@ggerganov ggerganov removed the stale label Jan 27, 2025
@ggerganov
Member

#11213 will be a major step towards resolving this. It's one of the higher priorities now.

@github-actions github-actions bot added the stale label Feb 27, 2025
@github-actions
Contributor

This issue was closed because it has been inactive for 14 days since being marked as stale.

@ExtReMLapin
Contributor Author

Bad bot!

@ExtReMLapin
Contributor Author

I didn't take the time to re-re-run the tests, but I doubt it was fixed in #12181, as it was only a refactor. Is it possible to reopen?

@ggerganov ggerganov reopened this Mar 26, 2025
@github-actions github-actions bot removed the stale label Mar 27, 2025