
Conversation

@Matthew-Jenkins
Contributor

Checklist:

$ pyflakes modules/shared.py
$ pycodestyle modules/shared.py
$

The default of 0 for threads and batch_threads means spawning as many threads as there are tokens, all at once. This can cause periods where the desktop stops being interactive, plus high memory use and swapping on some setups. So I changed the defaults to threads = 1/2 the CPU count and batch_threads = 2x the CPU count. This should result in about 80% CPU utilization. People with H100s who run entirely on GPU hardware can raise this as needed.

I lowered batch-size as well. 256 is 'safe' for consumer hardware: it uses slightly less VRAM and system RAM and prevents periods where the desktop loses interactivity. batch-size is the number of tokens to process at once. When it was set to 2k and batch_threads was set to 0, it would drop 2k threads on the CPU/GPU at a time, which is obviously not acceptable for consumer hardware given I/O and interconnect bandwidth limits. A lower number can actually improve throughput on consumer hardware because it lowers resource contention.
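
Roughly, the new defaults amount to something like the following sketch (the variable names are illustrative, not the actual shared.py settings):

import os

cpu_count = os.cpu_count() or 2
threads = max(1, cpu_count // 2)   # generation threads: half the logical CPU count
threads_batch = cpu_count * 2      # prompt/batch processing threads: twice the logical CPU count
batch_size = 256                   # tokens submitted to the backend per batch

print(threads, threads_batch, batch_size)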

I also added an additional KV cache description. The cache type should match the GGUF's quantization level: there is no benefit to using an FP16 cache with a Q4 model. It just means your cache can only hold 1/4 as many entries with no accuracy benefit. Using the matching cache level can increase performance a lot.

@oobabooga oobabooga changed the base branch from main to dev April 22, 2025 15:07
@oobabooga
Owner

I'm concerned that your findings may not generalize and might apply only to your particular system. My policy has been to follow the defaults in llama.cpp; for the batch size, llama-cpp-python had 512 as the default for a long time, but after seeing the 2048 below, I changed it:


-t,    --threads N                      number of threads to use during generation (default: -1)
                                        (env: LLAMA_ARG_THREADS)
-tb,   --threads-batch N                number of threads to use during batch and prompt processing (default:
                                        same as --threads)
-b,    --batch-size N                   logical maximum batch size (default: 2048)
                                        (env: LLAMA_ARG_BATCH)

About the system ceasing to be interactive, wouldn't that be best handled by setting the 'nice' level of llama-server in the launch command?

I also added an additional KV cache description. The cache type should match the GGUF's quantization level: there is no benefit to using an FP16 cache with a Q4 model. It just means your cache can only hold 1/4 as many entries with no accuracy benefit. Using the matching cache level can increase performance a lot.

This is the first time I've read that as well; do you have evidence to support it?

@Matthew-Jenkins
Contributor Author

My Goal

My goal here is to have defaults generalized to consumer-level hardware. Power users with HEDTs or H100s can finagle settings later. But if you want this to be as accessible as possible to the generic 'user', then conservative, safe defaults are best. Ideally, most of these settings should detect a 'safe' value or use the model defaults, whichever is lower. That can be done later. You know, the 80/20 rule.

--threads, --threads-batch, and --batch-size

Just to clarify: --threads and --threads-batch are CPU-specific arguments. Only --batch-size affects both CPU and GPU.
Sources:
https://github.com/ggml-org/llama.cpp/blob/658987cfc9d752dca7758987390d5fb1a7a0a54a/common/arg.cpp#L1174-L1178
https://github.com/ggml-org/llama.cpp/blob/658987cfc9d752dca7758987390d5fb1a7a0a54a/common/arg.cpp#L1183-L1188
https://github.com/ggml-org/llama.cpp/blob/658987cfc9d752dca7758987390d5fb1a7a0a54a/common/arg.cpp#L1313-L1316

KV Cache calculation

To clarify how batch size impacts GPU and memory pressure, here is a breakdown of the KV cache formula:
KV Cache = 2 × batch_size × n_ctx × num_layers × num_heads × head_dim × bytes_per_element

On being nice

Niceness does not do anything about I/O resource contention. Examples of I/O resource contention: PCIe saturation, memory swap storms, IRQ floods, and too many threads under memory pressure.
Being nice affects CPU scheduling, not I/O scheduling. And even when niced, thousands of threads will still crowd out less-niced threads.

Once PCIe bandwidth is saturated, you get to sit there and either hard-reboot or wait for your desktop to come back to life, which can sometimes take a long time.

In addition, suppose you flood your system with 2000 threads and it dutifully chugs along servicing each one. Each thread starts a memory or PCIe transfer and goes to sleep, so now you're in an IRQ storm as each transfer completes, and each thread may need many transfers. Every IRQ interrupts the CPU, leading to very bad latency. Being nice doesn't help here either.

In my own testing, changing batch-size from 2048 to 256 saved 600 MB of VRAM with no noticeable drop in tok/s on Gemma 3 12B Q4 at 14k context, and it could very well prevent a lock-up on many other systems.

So my tl;dr is: batch-size shouldn't be tuned by default for peak throughput; it should be tuned for the widest usable default on consumer hardware. Model-aware defaults (later, not now) or just 256/512 keep it accessible.

On quantization matching cache format

This is true by definition. Using an FP16 KV cache with a Q4 model does not improve quality but does quadruple the memory cost. Once a model is quantized below FP16, you aren't going to magically get FP16 accuracy back out. The only exception would be a LoRA, but the gains would probably not be worth the memory.
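
To put rough numbers on the 4x factor, here is a sketch using the common per-token KV estimate (2 tensors x layers x KV heads x head dim x bytes per element); the model dimensions below are illustrative assumptions, not any particular model's config:

n_ctx = 14_336        # context length (illustrative)
n_layers = 48         # transformer layers (illustrative)
n_kv_heads = 8        # KV attention heads (illustrative)
head_dim = 256        # dimension per head (illustrative)

def kv_cache_gib(bytes_per_element):
    # K and V each hold n_ctx * n_layers * n_kv_heads * head_dim elements
    return 2 * n_ctx * n_layers * n_kv_heads * head_dim * bytes_per_element / 2**30

print(f"FP16 cache:   {kv_cache_gib(2.0):.2f} GiB")   # ~5.25 GiB
print(f"Q4-ish cache: {kv_cache_gib(0.5):.2f} GiB")   # ~1.31 GiB, about 1/4 the size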

@oobabooga
Owner

I was looking at how ollama handles this to have an additional reference, and apparently they use

	defaultThreads := systemInfo.GetOptimalThreadCount()
	if opts.NumThread > 0 {
		params = append(params, "--threads", strconv.Itoa(opts.NumThread))
	} else if defaultThreads > 0 {
		params = append(params, "--threads", strconv.Itoa(defaultThreads))
	}

which itself uses

// Return the optimal number of threads to use for inference
func (si SystemInfo) GetOptimalThreadCount() int {
	if len(si.System.CPUs) == 0 {
		return 0
	}

	coreCount := 0
	for _, c := range si.System.CPUs {
		coreCount += c.CoreCount - c.EfficiencyCoreCount
	}

	return coreCount
}

see https://github.com/ollama/ollama/blob/main/llm/server.go and https://github.com/ollama/ollama/blob/main/discover/types.go.

The second function can't be replicated in a cross-platform way in Python, because Python doesn't have easy access to the number of 'efficiency cores'. How does your code compare to this?

As to the batch size, the default there is 512 (https://github.com/ollama/ollama/blob/424f648632c925ce14a75018c4dcab395e035993/api/types.go#L671); since you mention that 256 was your optimal value but 256/512 are both acceptable, I think we could go for 512.

@oobabooga
Owner

About recommending quantizing the KV cache: I haven't seen this recommendation in the context of EXL2, which uses a more sophisticated KV cache quantization algorithm than llama.cpp's. There it was always framed as an optional feature. I think it's best not to recommend quantizing the cache by default.

@Matthew-Jenkins
Contributor Author

Matthew-Jenkins commented Apr 23, 2025

My code just does 'max(1, int(os.cpu_count() / 2))', which should be safe regardless of the CPU. I know there are people without HT or with efficiency cores, but that should still be a 'safe' number; I'm not aware of a way to detect efficiency cores from Python. If you don't mind bringing psutil in, it has a function to identify physical cores.

https://psutil.readthedocs.io/en/latest/index.html#psutil.cpu_count
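
If psutil were brought in, the default could be computed roughly like this (the fallback branch is just the current half-of-logical-cores heuristic, and is an assumption):

import os

try:
    import psutil
    physical_cores = psutil.cpu_count(logical=False)  # physical cores only; may return None
except ImportError:
    physical_cores = None

# Fall back to half the logical CPUs when the physical core count is unavailable
default_threads = physical_cores or max(1, (os.cpu_count() or 2) // 2)
print(default_threads)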

Bringing in psutil might be a good idea overall, because it would allow setting automatic 'safe' values for a whole host of things, or possibly benchmarking at a later date to detect ideal CPU settings. I'm not aware of a cross-platform way to detect ideal GPU settings; maybe a GPU reference list, or platform-specific modules, could be developed later.

My code doesn't change the quantization level; it just adds a further description for it. The EXL2 people seem to think Q4 is optimal.
https://github.com/turboderp-org/exllamav2/blob/master/doc/qcache_eval.md

Seeing how the behavior might be very different between llama.cpp and EXL2, perhaps having separate settings for this is optimal. Reading up, it seems the EXL2 people consider their Q4 better than FP8 and almost as good as FP16, so for EXL2 the choice should really be FP16 or a Q type.

As for batch-size, 512 is good. The only reason I picked 256 is that there was no noticeable performance change on my rig, and I figured someone with low-end hardware might benefit from the reduced VRAM use.

Let me know if you want me to break the settings out and/or bring in psutil. I'll make all the appropriate changes then.

@oobabooga
Owner

I wouldn't mind adding an additional requirement (preferably a simple, no-dependencies one). Two things would benefit from this:

  • Being able to compute the optimal cores for the purpose of this PR
  • Being able to estimate the n_gpu_layers value automatically from the available GPU memory

These things are discussed in this issue

ddh0/easy-llama#11

where this library is discussed

https://github.com/ModelCloud/Device-SMI/

but apparently it doesn't explicitly report the number of 'efficiency cores', just 'cores' and 'threads'. Also, the version on PyPI doesn't work on my computer with

from device_smi import Device

dev = Device("cpu")
print(dev)
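
For what it's worth, a very rough sketch of what the n_gpu_layers estimation from the second point above could look like; the function, the headroom constant, and the example numbers are all assumptions, not anything Device-SMI or the webui currently provides:

def estimate_n_gpu_layers(model_file_bytes, n_layers, free_vram_bytes,
                          headroom_bytes=1_000_000_000):
    # Assume the layers are roughly equal in size and keep some VRAM headroom
    # for the KV cache and compute buffers.
    per_layer_bytes = model_file_bytes / max(1, n_layers)
    usable = max(0, free_vram_bytes - headroom_bytes)
    return min(n_layers, int(usable // per_layer_bytes))

# Example: a 7 GB GGUF with 32 layers on a GPU reporting 6 GB free
print(estimate_n_gpu_layers(7_000_000_000, 32, 6_000_000_000))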

@oobabooga
Owner

I ran 108 speed comparisons for

  • batch size: 256, 512, 2048
  • threads: physical cores, physical + logical cores, (physical + logical cores) * 2
  • portable webui version: cuda, cpu
  • system: laptop (with CPU offloading in the cuda run), server (full GPU in the cuda run)

where I measured prompt processing speed and text generation speed (3 runs for each combination, taking the average of the 3 measurements). That gave me 8 columns (prompt processing and text generation for (2 systems) × (2 webui versions)); for each column, I divided the values by that column's maximum, and then for each row I took the harmonic mean of its 8 normalized values. I got this as the global maximum:

Best row number: 22 (0-indexed) with harmonic score: 0.9528

Corresponding rows from each file:

File: 1
batch                2048.00
threads                 8.00
threads-batch           8.00
prompt-processing      80.79
text-generation         4.56
Name: 22, dtype: float64
Raw PP: 80.79, Raw TG: 4.56
Normalized PP: 0.9908, Normalized TG: 1.0000

File: 2
batch                2048.00
threads                 8.00
threads-batch           8.00
prompt-processing     118.64
text-generation        19.65
Name: 22, dtype: float64
Raw PP: 118.64, Raw TG: 19.65
Normalized PP: 0.9239, Normalized TG: 1.0000

File: 3
batch                2048.00
threads                12.00
threads-batch          12.00
prompt-processing     241.64
text-generation        30.93
Name: 22, dtype: float64
Raw PP: 241.64, Raw TG: 30.93
Normalized PP: 0.9022, Normalized TG: 0.9511

File: 4
batch                2048.00
threads                12.00
threads-batch          12.00
prompt-processing    1895.28
text-generation        24.11
Name: 22, dtype: float64
Raw PP: 1895.28, Raw TG: 24.11
Normalized PP: 0.8807, Normalized TG: 0.9906

Here threads = threads-batch = (physical + logical cores).

If I look only at the GPU runs, I get

Best row number: 0 (0-indexed) with harmonic score: 0.9826

Corresponding rows from each file:

File: 1
batch                256.00
threads                4.00
threads-batch          4.00
prompt-processing     79.01
text-generation        4.39
Name: 0, dtype: float64
Raw PP: 79.01, Raw TG: 4.39
Normalized PP: 0.9690, Normalized TG: 0.9627

File: 2
batch                 256.00
threads                 6.00
threads-batch           6.00
prompt-processing    2151.97
text-generation        24.34
Name: 0, dtype: float64
Raw PP: 2151.97, Raw TG: 24.34
Normalized PP: 1.0000, Normalized TG: 1.0000

Here threads = threads-batch = (number of physical cores).

If I look only at CPU, I get

Best row number: 19 (0-indexed) with harmonic score: 0.9449

Corresponding rows from each file:

File: 1
batch                2048.00
threads                 4.00
threads-batch           8.00
prompt-processing     120.68
text-generation        18.66
Name: 19, dtype: float64
Raw PP: 120.68, Raw TG: 18.66
Normalized PP: 0.9398, Normalized TG: 0.9496

File: 2
batch                2048.00
threads                 6.00
threads-batch          12.00
prompt-processing     241.66
text-generation        32.27
Name: 19, dtype: float64
Raw PP: 241.66, Raw TG: 32.27
Normalized PP: 0.9023, Normalized TG: 0.9923

Where threads = physical cores and threads-batch = physical + logical cores.

llama.cpp by default sets threads and threads-batch according to the maximum for the GPU case (which includes a case with CPU offloading), so I don't think it's necessary to change that default. But the optimal batch size turned out to be 256, which aligns with what you found.
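
For reference, the scoring used above (normalize each column by its maximum, then take the harmonic mean of each row's normalized values) boils down to something like this; the numbers below are placeholders, not the actual measurements:

from statistics import harmonic_mean

# Each row is one (batch, threads, threads-batch) configuration; each column is one
# of the 8 measured speeds (PP/TG across the systems and webui builds).
rows = [
    [80.0, 4.5, 118.0, 19.5, 240.0, 30.0, 1890.0, 24.0],
    [78.0, 4.4, 120.0, 18.5, 230.0, 32.0, 2150.0, 24.3],
]

col_max = [max(col) for col in zip(*rows)]
scores = [harmonic_mean([v / m for v, m in zip(row, col_max)]) for row in rows]
best = max(range(len(rows)), key=scores.__getitem__)
print(f"Best row: {best} with harmonic score: {scores[best]:.4f}")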

@oobabooga oobabooga merged commit 8f2493c into oobabooga:dev Apr 25, 2025
@Matthew-Jenkins Matthew-Jenkins deleted the patch-2 branch April 25, 2025 03:22