Description
Currently, the llama.cpp loader has a "See more options" sub-window on the model page, which includes a batch_size slider. However, llama-server also has a parameter called ubatch-size, and setting it to a higher value together with batch_size can have a huge impact on prompt-processing performance for mixed GPU-CPU inference with MoE models. Right now users can set ubatch-size via the extra-flags field (like "ubatch-size=XXXX"). It would be great to either expose ubatch-size as a slider, like batch_size is today, for ease of access, OR to set ubatch-size to the same value as the batch_size slider.
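To illustrate the second option, here is a minimal sketch of how ubatch-size could default to the batch_size slider value when the llama-server command line is assembled. The function and variable names are hypothetical and only meant to show the idea, not the actual loader code:

```python
def build_server_command(model_path: str, batch_size: int,
                         ubatch_size: int | None = None) -> list[str]:
    """Hypothetical helper: assemble a llama-server invocation.

    If no explicit ubatch_size is given, fall back to batch_size so that
    prompt processing uses the larger physical batch size as well.
    """
    return [
        "llama-server",
        "--model", model_path,
        "--batch-size", str(batch_size),
        # --ubatch-size is llama-server's physical (per-step) batch size;
        # defaulting it to batch_size is the behaviour proposed in this issue.
        "--ubatch-size", str(ubatch_size if ubatch_size is not None else batch_size),
    ]
```

For example, `build_server_command("GLM-4.5-Air-Q4.gguf", batch_size=2048)` would pass both `--batch-size 2048` and `--ubatch-size 2048` to llama-server.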
For example, here are my results showing how large a performance improvement is possible by changing this option:
- batch_size slider set to 2048, no ubatch-size in extra-flags: around 170-180 tokens/s prompt processing
- batch_size slider set to 2048, "ubatch-size=2048" in extra-flags: around 530-540 tokens/s prompt processing
This is almost a 3x performance gain.
The above results are from a system with 32GB VRAM + 128GB RAM (DDR5) and a consumer i7 CPU. The model is GLM-4.5-Air at Q4 quantization (it should also work with 96GB RAM), with most experts offloaded to the CPU so that the important tensors fit in VRAM. Tests were run on a multi-turn chat of almost 15k tokens.
I've seen other people on Discord servers and on Reddit report similar performance gains when setting both batch_size and ubatch-size on somewhat weaker systems (e.g. RTX 3090 + 64GB DDR4 RAM, which still gave at least around a 2x speedup).
Additional Context
As mentioned above, the current workaround is to use the extra-flags field and write the llama-server parameter "ubatch-size=XXXX".
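For reference, with the batch size used in the tests above, the extra-flags field would simply contain (2048 here is just the value from my tests):

```
ubatch-size=2048
```

which llama-server then receives as --ubatch-size 2048.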