GPU split isn't followed here? #54

Description

@Ph0rk0z

I'm using tabbyAPI and setting a list of splits, but it doesn't seem to respect them anymore. The maximum VRAM allocated per GPU is about 23700 MiB, and anything larger than that is ignored. Granular values like 22.1 just load whatever they feel like, stopping either too early or too late.

As a result, I can't get over 8k context on Qwen 235B, yet one GPU sits at only 95% and the main GPU at only 93%. Two of the GPUs load to 97%, and usage doesn't really grow during inference.

On llama.cpp derivatives I know I can allocate at least 24050 MiB, and with exl2 there was at least some control by incrementing the decimal; I've seen as high as 98% before.
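
For reference, here is a minimal sketch for checking what each GPU actually ended up holding, so the numbers can be compared against the configured split. This is not part of tabbyAPI; it just assumes PyTorch with CUDA support is installed in the same environment:

```python
# Minimal sketch (not tabbyAPI's own tooling): print per-GPU VRAM usage
# so it can be compared against the gpu_split values in the config.
# Assumes PyTorch with CUDA support is available.
import torch

def report_vram() -> None:
    for i in range(torch.cuda.device_count()):
        free, total = torch.cuda.mem_get_info(i)  # bytes, driver-level
        used = total - free
        print(f"GPU {i}: {used / 2**20:8.0f} MiB used "
              f"/ {total / 2**20:8.0f} MiB total "
              f"({100 * used / total:.1f}%)")

if __name__ == "__main__":
    report_vram()
```

Because `torch.cuda.mem_get_info` reports driver-level free/total memory, the output matches what nvidia-smi shows, which is where the 93-97% figures above come from.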
