Description
I'm using tabbyAPI and setting a list of GPU splits, but the values no longer seem to be respected. The maximum VRAM allocated on any card is about 23700 MiB, and anything larger than that is ignored. Granular settings like 22.1 don't behave predictably either: the loader just stops filling the card earlier or later than requested.
As a result, I can't get above 8k context on Qwen 235B, even though one GPU sits at 95% VRAM usage and the main GPU at only 93%. Two of the GPUs load to 97%, and their usage doesn't really grow during inference.
On llama.cpp derivatives I know I can allocate at least 24050 MiB, and with exl2 there was at least some control by incrementing the decimal part of the split; I've even seen 98% before.
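For reference, the split I'm setting looks roughly like this, a minimal sketch assuming tabbyAPI's config.yml key names; the model name and the split values are just examples:

```yaml
# Relevant part of config.yml (key names as in my tabbyAPI setup; values illustrative)
model:
  model_name: Qwen-235B-exl        # example name, not the exact folder
  gpu_split_auto: false            # manual split so I can pin per-GPU allocation
  gpu_split: [22.1, 24, 24, 24]    # GB per GPU; granular values like 22.1 are the ones being ignored
  max_seq_len: 8192                # can't go higher before OOM with the split above
```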