VRAM usage vs. EXL2 #60

Description

@cmoncure

I'm working with Mistral Large and finding that the VRAM footprint of models loaded with EXL3 seems to be much larger than with EXL2.

For instance, the 2.5 bpw quantization with EXL2 fits on a single 48 GB card.
The 5.0 bpw quantization with EXL2 fits handily across 2×48 GB cards (~41 GB each).

However, the 2.5 bpw quantization with EXL3 uses just over 50 GB, and loading spills over onto my second GPU. The 5.0 bpw quantization fails to load at all with a MemoryError. Is this expected, or have I done something wrong? If I have to drop to a lower quant to fit in the same memory capacity, that somewhat cancels out the benefit of switching to EXL3.
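
For context, here is the back-of-envelope arithmetic behind the numbers above (a minimal sketch: the ~123B parameter count for Mistral Large and the uniform bits-per-weight model are my assumptions, and it ignores embeddings, output layers, KV cache, and the per-tensor bpw mixing that EXL2/EXL3 quants actually apply):

```python
# Rough weight-only VRAM estimate for a quantized model.
# Assumptions (mine, not confirmed anywhere in this issue):
#   - Mistral Large has ~123B parameters
#   - the quant applies a uniform bits-per-weight (bpw)
# Real EXL2/EXL3 quants mix bpw per tensor and keep some tensors at
# higher precision, and KV cache plus CUDA overhead come on top.

def weight_gb(n_params: float, bpw: float) -> float:
    """Bytes occupied by the quantized weight tensors alone, in GB."""
    return n_params * bpw / 8 / 1e9

N = 123e9  # assumed parameter count

for bpw in (2.5, 5.0):
    print(f"{bpw} bpw: ~{weight_gb(N, bpw):.1f} GB of weights")

# 2.5 bpw: ~38.4 GB of weights  -> plausibly fits one 48 GB card
# 5.0 bpw: ~76.9 GB of weights  -> plausibly fits two 48 GB cards
```

By this estimate the weight tensors of both quants should fit the hardware described above, so the extra ~12 GB I'm seeing with EXL3 presumably comes from something other than the weights themselves.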
