Misc. bug: [SERVER] Multiple slots, generation speed is degraded after each generation/slot used #10860

Open
ExtReMLapin opened this issue Dec 17, 2024 · 15 comments

Comments

@ExtReMLapin
Contributor

ExtReMLapin commented Dec 17, 2024

Name and Version

 ./build/bin/llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
version: 4338 (7b1ec53f)
built with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Problem description & steps to reproduce

Hello,

Short version:

When using llama-server with only one slot (--threads-http 1 -np 1), you can sequentially send prompts to process and there is no speed degradation.

When you use multiple slots (the issue starts showing up at 3 slots; it does not appear with 2), generation gets slower and slower after each finished generation.

CLI used:

./build/bin/llama-server --host 0.0.0.0 --port 8080 --model /opt/IdExtend/models/llm/Qwen2.5-7B-Instruct-Q4_K_M.gguf --ctx-size 122880 --threads-http 15 -np 15 --tensor-split 1.0,0.0,0.0 -ngl 99999

Also tried:

  • --cache-reuse 50000 (ineffective)
  • --defrag-thold 0.0 or --defrag-thold 0.99 (ineffective)
  • --model /opt/IdExtend/models/llm/Mistral-7B-Instruct-v0.3.Q8_0.gguf (ineffective)
  • -sm none (ineffective)
  • --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 (ineffective; I was using these from the start but decided to reduce to as few args as possible)

Yes, I understand that having multiple slots and using them in sequence is dumb. The issue is that I tried moving my backend from sequential to parallel use (so I had to create slots), but it doesn't go any faster; that's why I tried tracking down the cause, and here I am.

Final run:

./build/bin/llama-server --host 0.0.0.0 --port 8080 --model /opt/IdExtend/models/llm/Qwen2.5-7B-Instruct-Q4_K_M.gguf --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --ctx-size 122880 --threads-http 15 -np 15 --tensor-split 1.0,0.0,0.0 -sm none -ngl 99999

Python script logs:

[0]Time taken:  1.1400415897369385
[1]Time taken:  0.9648196697235107
[2]Time taken:  1.002309799194336
[3]Time taken:  1.353079080581665
[4]Time taken:  0.8274390697479248
[5]Time taken:  1.4006707668304443
[6]Time taken:  1.5088953971862793
[7]Time taken:  2.5358529090881348
[8]Time taken:  1.6904234886169434
[9]Time taken:  2.6186017990112305
[10]Time taken:  2.290717601776123
[11]Time taken:  2.0220725536346436
[12]Time taken:  1.9455785751342773
[13]Time taken:  3.2140021324157715
[14]Time taken:  2.404296636581421
[15]Time taken:  2.5479960441589355
[16]Time taken:  3.0076818466186523
[17]Time taken:  6.665952205657959
TOTAL Time taken for sequential:  39.140857458114624

You can find a zip with the python script to reproduce it attached.

responses.zip
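
For reference, a minimal sketch of what the sequential part of such a script might look like (this is only an approximation of the attached script; the endpoint, payload fields, and n_predict value are assumptions):

import time
import requests  # assumed dependency

SERVER = "http://localhost:8080"  # assumed host/port, matching the command line above

def run_sequential(prompts):
    """Send prompts one after another and print per-request and total timings."""
    start = time.time()
    for i, prompt in enumerate(prompts):
        t0 = time.time()
        requests.post(
            f"{SERVER}/completion",
            json={"prompt": prompt, "n_predict": 128},  # n_predict is an assumption
            timeout=600,
        )
        print(f"[{i}]Time taken: ", time.time() - t0)
    print("TOTAL Time taken for sequential: ", time.time() - start)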

Full server logs: server-logs.txt

Cleaned server logs:

prompt eval time =     254.31 ms /  2310 tokens (    0.11 ms per token,  9083.40 tokens per second)
       eval time =     879.65 ms /    97 tokens (    9.07 ms per token,   110.27 tokens per second)
      total time =    1133.96 ms /  2407 tokens
prompt eval time =     261.95 ms /  2343 tokens (    0.11 ms per token,  8944.49 tokens per second)
       eval time =     694.21 ms /    85 tokens (    8.17 ms per token,   122.44 tokens per second)
      total time =     956.16 ms /  2428 tokens
prompt eval time =     284.46 ms /  2285 tokens (    0.12 ms per token,  8032.76 tokens per second)
       eval time =     707.39 ms /    80 tokens (    8.84 ms per token,   113.09 tokens per second)
      total time =     991.85 ms /  2365 tokens
prompt eval time =     409.38 ms /  2924 tokens (    0.14 ms per token,  7142.46 tokens per second)
       eval time =     930.37 ms /    95 tokens (    9.79 ms per token,   102.11 tokens per second)
      total time =    1339.75 ms /  3019 tokens
prompt eval time =     357.83 ms /  2282 tokens (    0.16 ms per token,  6377.29 tokens per second)
       eval time =     454.73 ms /    44 tokens (   10.33 ms per token,    96.76 tokens per second)
      total time =     812.57 ms /  2326 tokens
prompt eval time =     388.00 ms /  2277 tokens (    0.17 ms per token,  5868.57 tokens per second)
       eval time =     996.40 ms /    89 tokens (   11.20 ms per token,    89.32 tokens per second)
      total time =    1384.39 ms /  2366 tokens
prompt eval time =     556.35 ms /  3011 tokens (    0.18 ms per token,  5412.09 tokens per second)
       eval time =     930.15 ms /    76 tokens (   12.24 ms per token,    81.71 tokens per second)
      total time =    1486.50 ms /  3087 tokens
prompt eval time =     618.16 ms /  3027 tokens (    0.20 ms per token,  4896.82 tokens per second)
       eval time =    1890.54 ms /   144 tokens (   13.13 ms per token,    76.17 tokens per second)
      total time =    2508.70 ms /  3171 tokens
prompt eval time =     651.99 ms /  2935 tokens (    0.22 ms per token,  4501.60 tokens per second)
       eval time =    1008.49 ms /    72 tokens (   14.01 ms per token,    71.39 tokens per second)
      total time =    1660.48 ms /  3007 tokens
prompt eval time =     903.68 ms /  2957 tokens (    0.31 ms per token,  3272.17 tokens per second)
       eval time =    1681.54 ms /   112 tokens (   15.01 ms per token,    66.61 tokens per second)
      total time =    2585.22 ms /  3069 tokens
prompt eval time =     805.01 ms /  2965 tokens (    0.27 ms per token,  3683.17 tokens per second)
       eval time =    1447.53 ms /    91 tokens (   15.91 ms per token,    62.87 tokens per second)
      total time =    2252.55 ms /  3056 tokens
prompt eval time =     831.70 ms /  2965 tokens (    0.28 ms per token,  3564.97 tokens per second)
       eval time =    1149.78 ms /    69 tokens (   16.66 ms per token,    60.01 tokens per second)
      total time =    1981.48 ms /  3034 tokens
prompt eval time =     996.94 ms /  2940 tokens (    0.34 ms per token,  2949.01 tokens per second)
       eval time =     905.74 ms /    52 tokens (   17.42 ms per token,    57.41 tokens per second)
      total time =    1902.69 ms /  2992 tokens
prompt eval time =     960.80 ms /  3074 tokens (    0.31 ms per token,  3199.42 tokens per second)
       eval time =    2201.62 ms /   118 tokens (   18.66 ms per token,    53.60 tokens per second)
      total time =    3162.42 ms /  3192 tokens
prompt eval time =    1161.53 ms /  2977 tokens (    0.39 ms per token,  2562.99 tokens per second)
       eval time =    1189.15 ms /    62 tokens (   19.18 ms per token,    52.14 tokens per second)
      total time =    2350.68 ms /  3039 tokens
prompt eval time =    1017.35 ms /  2934 tokens (    0.35 ms per token,  2883.97 tokens per second)
       eval time =    1481.01 ms /    76 tokens (   19.49 ms per token,    51.32 tokens per second)
      total time =    2498.35 ms /  3010 tokens
prompt eval time =    1035.18 ms /  2966 tokens (    0.35 ms per token,  2865.20 tokens per second)
       eval time =    1915.50 ms /    97 tokens (   19.75 ms per token,    50.64 tokens per second)
      total time =    2950.68 ms /  3063 tokens
prompt eval time =     638.59 ms /  1778 tokens (    0.36 ms per token,  2784.25 tokens per second)
       eval time =    5996.03 ms /   303 tokens (   19.79 ms per token,    50.53 tokens per second)
      total time =    6634.62 ms /  2081 tokens

First Bad Commit

No response

Relevant log output

No response

Edit:

I gave it a try on another machine, with this build:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
  Device 0: GRID A100-40C, compute capability 8.0, VMM: no
  Device 1: GRID A100-40C, compute capability 8.0, VMM: no
  Device 2: GRID A100-40C, compute capability 8.0, VMM: no
version: 4149 (1bb30bf2)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

The issue persists.

Edit: I'm performing a binary search (bisect):

-> version: 4149 (1bb30bf2) fail ❌

-> version: 4063 (505f3327) fail ❌

-> version: 4024 (329ed914) fail ❌

-> version: 4016 (42cadc74) fail ❌

-> version: 4015 (45950415) no issue ✔️

-> version: 4012 (7554aa46) no issue ✔️

Related PR that introduced the issue: #10126

I doubt it CREATED the bug; I think it just revealed an existing bug.

The more slots are used, the slower it gets:

[Attached images: plots showing generation slowing down as more slots are used]

@ExtReMLapin ExtReMLapin changed the title Misc. bug: [SERVER] Multiple slots, generation speed is degraded after each generation Misc. bug: [SERVER] Multiple slots, generation speed is degraded after each generation/slot used Dec 17, 2024
@JohannesGaessler
Collaborator

Does the total throughput still increase when you use more slots?

@ExtReMLapin
Contributor Author

ExtReMLapin commented Dec 17, 2024

Hello,

The Python code was edited like this (the Python code is in the archive posted in the first message):

runParallel()
runSequential()

runParallel()
runSequential()

(ran twice so that all slots get filled/affected)
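
For context, a rough sketch of what these two helpers could look like (hypothetical; the real code is in the attached archive, and the endpoint, payload, and worker count are assumptions):

import time
from concurrent.futures import ThreadPoolExecutor
import requests  # assumed dependency

SERVER = "http://localhost:8080"  # assumed host/port

def send_one(prompt):
    requests.post(f"{SERVER}/completion",
                  json={"prompt": prompt, "n_predict": 128}, timeout=600)

def runSequential(prompts):
    start = time.time()
    for p in prompts:
        send_one(p)
    print("Time taken for sequential: ", time.time() - start)

def runParallel(prompts, workers=15):  # workers chosen to match -np 15, an assumption
    start = time.time()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(send_one, prompts))
    print("Time taken for parallel: ", time.time() - start)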

When the server is started with 15 slots:

Time taken for parallel:  19.48640537261963
Time taken for sequential:  47.198315382003784
Time taken for parallel:  24.64862847328186
Time taken for sequential:  46.10104751586914

When the server is run with only 1 slot:

Time taken for parallel:  27.321959018707275
Time taken for sequential:  18.821592569351196
Time taken for parallel:  17.11741614341736
Time taken for sequential:  17.89683699607849

To answer your question:

No, a parallel run on a multi-slot server is not faster than a sequential run on a single-slot server.
Yes, a parallel run on a multi-slot server is faster than a sequential run on a multi-slot server, because it uses different slots instead of reusing slot 0.

Sequential speeds are expected to be the same in a single-slot or multi-slot configuration, but a glitch prevents that.

Parallel runs on a multi-slot server are expected to be faster than sequential runs on a single-slot server.

A 1-slot server is always faster than any other slot configuration.

Bonus run, as I didn't see the bug on a 2-slot server:

Time taken for parallel:  15.20603609085083
Time taken for sequential:  20.369807481765747
Time taken for parallel:  15.667339563369751
Time taken for sequential:  19.47208571434021

@JohannesGaessler
Collaborator

Context: I did a lot of work on CUDA performance with a focus on a single user/slot. So far I have not prioritized throughput for multiple users/slots. I'm currently working on llama.cpp training/finetuning though and will eventually require more throughput for evaluating model quality post finetuning. So I will likely look into better server throughput in a few months' time. I cannot speak for the priorities of other devs though.

@JohannesGaessler
Collaborator

But since you nailed down the problem to a specific commit there is a good chance that it can be fixed. I just meant to say more generally that in the future there will likely be more dev attention on server throughput.

@ExtReMLapin
Contributor Author

There is some confusion: it's not this commit that created the bug, it's this commit that easily revealed it, because before it only the first slot was being used. As soon as you use more slots (even before that commit), performance goes down.

@ExtReMLapin
Contributor Author

Context: I did a lot of work on CUDA performance with a focus on a single user/slot. So far I did not prioritize throughput for multiple users/slots. I'm currently working on llama.cpp training/finetuning though and will eventually require more throughput for evaluating model quality post finetuning. So I will likely look into better server throughput in a few months time. I cannot speak for the priorities of other devs though.

I could be terribly wrong, but isn't batch processing supposed to provide higher total token/s throughput than a single slot?

For example, with one prompt you will get 150 t/s, but with 10 prompts you will get 100 t/s per prompt, which implies 1000 t/s in total, so it is faster at the end of the day, right?
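
In code, the arithmetic behind that expectation (the throughput numbers are the hypothetical ones from the question above):

single_request_tps = 150       # tokens/s with 1 concurrent request (hypothetical)
per_request_tps_at_10 = 100    # tokens/s per request with 10 concurrent requests (hypothetical)
aggregate_tps = 10 * per_request_tps_at_10
print(aggregate_tps)           # 1000 tokens/s in total, well above the 150 tokens/s of a single slot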

@ggerganov
Member

This is expected: it's a side effect of the unified KV cache. Effectively, all slots keep their context in the common context, so with each request the KV cache grows. This will be fixed after we refactor the implementation to support a parallel-friendly KV cache. For now, don't use more than 4 slots and use -dt 0.1.
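
As an illustration of that advice, the command from the first message adapted to the suggested limits might look something like this (a sketch only; compared to the original command, only -np is lowered and -dt 0.1 is added):

./build/bin/llama-server --host 0.0.0.0 --port 8080 --model /opt/IdExtend/models/llm/Qwen2.5-7B-Instruct-Q4_K_M.gguf --ctx-size 122880 --threads-http 15 -np 4 --tensor-split 1.0,0.0,0.0 -ngl 99999 -dt 0.1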

@ExtReMLapin
Contributor Author

Hello, and thanks for the answer @ggerganov.

For future reference, I ran more tests:

A 4-slot server + short prompt + low n_predict leads to no issue, and everything works well.

As soon as I use a longer prompt, more than 4 slots, or a higher n_predict, performance goes down.

@elter-tef

I also ran into the problem of speed degradation when using multiple slots. Would it be possible, as a temporary workaround, to clear the KV cache after a slot finishes generating? This could be added as an additional parameter.

@ggerganov
Member

The problem is we don't know when the slot has finished.

@github-actions github-actions bot added the stale label Jan 21, 2025
@ExtReMLapin
Contributor Author

ExtReMLapin commented Jan 27, 2025

Stale but not fixed! (just in case it auto-closes the issue) AFAIK

@ggerganov ggerganov removed the stale label Jan 27, 2025
@ggerganov
Member

#11213 will be a major step towards resolving this. It's one of the higher priorities now.

@github-actions github-actions bot added the stale label Feb 27, 2025
@github-actions
Contributor

This issue was closed because it has been inactive for 14 days since being marked as stale.

@ExtReMLapin
Contributor Author

Bad bot!

@ExtReMLapin
Contributor Author

I didn't take the time to re-re-run the tests, but I doubt it was fixed in #12181, as it was only a refactor. Is it possible to reopen?

@ggerganov ggerganov reopened this Mar 26, 2025
@github-actions github-actions bot removed the stale label Mar 27, 2025