I'm working with Mistral Large and finding that the VRAM footprint of models loaded with EXL3 seems to be much larger than with EXL2.
For instance, the 2.5 bpw EXL2 quantization fits on one 48 GB card.
The 5.0 bpw EXL2 quantization fits handily across 2x48 GB cards (~41 GB each).
However, the 2.5 bpw EXL3 quantization uses just over 50 GB, so loading spills over onto my second GPU, and the 5.0 bpw EXL3 quantization fails to load with a MemoryError. Is this expected, or have I done something wrong? If I have to drop down to a lower quant to fit in the same memory capacity, that somewhat cancels out the benefit of using EXL3.
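For reference, here is the back-of-envelope math I'd expect for the weights, plus the snippet I use to check actual per-GPU usage after loading. This assumes Mistral Large 2 has roughly 123B parameters (my assumption; the exact count may differ), and it only uses standard PyTorch calls:

```python
import torch

# Rough expected weight footprint: params * bits-per-weight / 8 bytes.
# Assumes ~123B parameters for Mistral Large 2 (approximate).
params = 123e9
for bpw in (2.5, 5.0):
    print(f"{bpw} bpw -> ~{params * bpw / 8 / 2**30:.1f} GiB of weights")

# Actual per-GPU usage after loading the model (free/total in bytes).
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {(total - free) / 2**30:.1f} / {total / 2**30:.1f} GiB used")
```

By that estimate, 2.5 bpw should be around 36 GiB of weights before cache and overhead, which is well under what I'm actually seeing with EXL3.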