Description
I'm using tabbyAPI and setting a list of GPU splits, but the values no longer seem to be respected. The maximum VRAM allocated on any card is about 23700 MiB, and anything larger than that is ignored. Granular settings like 22.1 don't behave predictably either: the loader just stops filling the card earlier or later than requested.
As a result, I can't get above 8k context on Qwen 235B, even though one GPU sits at 95% VRAM usage and the main GPU at only 93%. Two of the GPUs load to 97%, and their usage doesn't really grow during inference.
On llama.cpp derivatives I know I can allocate at least 24050 MiB, and with exl2 there was at least some control by incrementing the decimal part of the split; I've even seen 98% before.
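For reference, the split I'm setting looks roughly like this, a minimal sketch assuming tabbyAPI's config.yml key names; the model name and the split values are just examples:

```yaml
# Relevant part of config.yml (key names as in my tabbyAPI setup; values illustrative)
model:
  model_name: Qwen-235B-exl        # example name, not the exact folder
  gpu_split_auto: false            # manual split so I can pin per-GPU allocation
  gpu_split: [22.1, 24, 24, 24]    # GB per GPU; granular values like 22.1 are the ones being ignored
  max_seq_len: 8192                # can't go higher before OOM with the split above
```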