Add slider for --ubatch-size for llama.cpp loader #7316
Conversation
Added a new slider for the --ubatch-size parameter of llama-server in the llama.cpp loader
Thoughts on this? In #6870, the default batch size was lowered to 256 because 2048 seemed like a value aimed at batching/servers/multiple users, and it led to bottlenecking and worse performance with both full offloading and GPU + CPU. Maybe things behave differently for MoE models?
I can say that for mixed CPU+GPU inference of a MoE model, increasing batch_size gave me at least a 2x performance gain, and ubatch_size increased it about 3x further. With the defaults (batch_size 256, ubatch_size unset, so 512) I got less than 100 T/s on GLM-4.5-Air; now I get over 500 T/s. As for the threads setting, I hear it is generally recommended to set it to fewer cores than the CPU has, at most the number of cores, so I didn't change those values in my tests. Some big server CPUs might work better with more threads. Later today I will quickly test some batch sizes (both batch_size and ubatch_size) with dense models and post some results.
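For reference, the flags behind these two sliders are llama-server's -b/--batch-size and -ub/--ubatch-size. Below is a minimal sketch of launching the server with both raised, assuming a partial CPU+GPU offload; the model path, layer split and values are placeholders, not my exact setup:

```python
import subprocess

# Illustrative only: start llama-server with the logical (-b) and physical (-ub)
# batch sizes raised above their defaults. The model path, layer split and
# values below are placeholders.
cmd = [
    "llama-server",
    "--model", "GLM-4.5-Air-Q4_K_M.gguf",  # placeholder path
    "--n-gpu-layers", "35",                # partial CPU+GPU offload
    "--batch-size", "1024",                # logical batch size (-b)
    "--ubatch-size", "1024",               # physical batch size (-ub); llama-server default is 512
]
subprocess.run(cmd, check=True)
```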
I ran some tests using Magistral Small 24B, Q6 quant. Here are the results for prompt processing (PP); values were rounded, so treat them as ±10:

For full GPU offload (prompt length 8k): token generation (TG) consistently stayed at 72 T/s; prompt processing showed some noticeable variance, so I'd set the margin of error at 100 T/s.

For partial offload to CPU (35/40 layers on GPU, prompt length 2k because it is slower): TG consistently stayed at 20 T/s, PP margin of error around 10 T/s.

And finally CPU only (prompt length 512, the default, generating 32 tokens so it doesn't take forever): TG consistently stayed at 4 T/s, PP margin of error around 10 T/s to be safe.

Based on this short test I'd say that batch_size could safely be increased to 512. What do you think about this? I think the "lockup" on consumer hardware mentioned in #6870 was due to the number of threads; in my tests I left the threads setting at its default.
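For anyone who wants to repeat this kind of sweep, here is a rough sketch using llama-bench; the model filename and offload split are placeholders, and the batch/ubatch pairs are just the ones discussed in this thread:

```python
import subprocess

# Rough sketch of a batch-size sweep with llama-bench: vary the logical (-b)
# and physical (-ub) batch sizes and compare the reported pp/tg throughput.
# The model filename and offload split below are placeholders.
model = "Magistral-Small-24B-Q6_K.gguf"
for batch, ubatch in [(256, 512), (512, 512), (1024, 1024), (2048, 512)]:
    subprocess.run(
        [
            "llama-bench",
            "-m", model,
            "-ngl", "35",      # 35/40 layers on GPU (partial offload case)
            "-p", "2048",      # prompt length
            "-n", "32",        # generated tokens
            "-b", str(batch),
            "-ub", str(ubatch),
        ],
        check=True,
    )
```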
I have run some new measurements with MoE models, in Q8_0 precision on an RTX 6000 Ada GPU:
If I apply the same harmonic mean score as in #6870, I get:
Since the SOTA models that run on consumer hardware are all MoE models nowadays, I think it's fine to use 1024 as the default, even though 256 or 512 are possibly better for dense models.
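As a reminder of what that score does, here is a minimal sketch of the harmonic-mean calculation; the PP/TG numbers are placeholders and the exact weighting used in #6870 may differ:

```python
from statistics import harmonic_mean

# Placeholder throughputs (tokens/s) for one batch/ubatch configuration:
# prompt processing (pp) and token generation (tg) as reported by llama-bench.
pp, tg = 1850.0, 62.0

# The harmonic mean favours configurations that are decent at both pp and tg,
# penalizing ones that are fast at one but slow at the other.
score = harmonic_mean([pp, tg])
print(f"harmonic mean score: {score:.1f} tokens/s")
```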
Added some new tests for 2048/512 and 256/512 (the llama.cpp default and this project's default so far) to make sure, and 1024/1024 still wins:
Thanks for the PR -- this is a nice performance improvement on MoE models.
This is for issue #7309, which I opened.
This adds a new slider for the --ubatch-size parameter of llama-server in the llama.cpp loader, right below batch_size.
Setting this to a higher value along with batch_size allows much better prompt-processing performance for mixed CPU+GPU inference (I've seen around a 3x gain in speed, as mentioned in the issue).
The default value is set to the llama-server default, which can be seen here.
I think it would also be good to consider whether the default for the batch_size slider (the --batch-size parameter in llama-server) should be only 256 in the webui, when llama-server itself has a default value of 2048? (This can also be seen in the above link.)
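For illustration only, here is a minimal sketch of what such a slider and its pass-through to llama-server could look like; the function names, settings keys and Gradio wiring are hypothetical simplifications, not the actual code of this PR:

```python
import gradio as gr

def make_ubatch_slider() -> gr.Slider:
    # 512 mirrors the llama-server default for --ubatch-size.
    return gr.Slider(
        minimum=1,
        maximum=4096,
        value=512,
        step=1,
        label="ubatch_size",
        info="Physical batch size (--ubatch-size); raising it together with "
             "batch_size can speed up prompt processing, especially for MoE "
             "models on mixed CPU+GPU.",
    )

def build_llama_server_args(settings: dict) -> list[str]:
    # Translate the UI settings into llama-server command-line flags.
    args = []
    if settings.get("batch_size"):
        args += ["--batch-size", str(settings["batch_size"])]
    if settings.get("ubatch_size"):
        args += ["--ubatch-size", str(settings["ubatch_size"])]
    return args

# Example: 1024/1024, the defaults proposed in this thread.
print(build_llama_server_args({"batch_size": 1024, "ubatch_size": 1024}))
# ['--batch-size', '1024', '--ubatch-size', '1024']
```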