I'm working with Mistral Large and finding that the VRAM footprint of models loaded with EXL3 seems to be much larger than with EXL2.
For instance, the 2.5 bpw EXL2 quantization fits on one 48 GB card.
The 5.0 bpw EXL2 quantization fits handily across 2x48 GB cards (~41 GB each).
However, the 2.5 bpw EXL3 quantization uses just over 50 GB, so loading spills over onto my second GPU, and the 5.0 bpw EXL3 quantization fails to load with a MemoryError. Is this expected, or have I done something wrong? If I have to drop down to a lower quant to fit in the same memory capacity, that somewhat cancels out the benefit of using EXL3.
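For reference, here is the back-of-envelope math I'd expect for the weights, plus the snippet I use to check actual per-GPU usage after loading. This assumes Mistral Large 2 has roughly 123B parameters (my assumption; the exact count may differ), and it only uses standard PyTorch calls:

```python
import torch

# Rough expected weight footprint: params * bits-per-weight / 8 bytes.
# Assumes ~123B parameters for Mistral Large 2 (approximate).
params = 123e9
for bpw in (2.5, 5.0):
    print(f"{bpw} bpw -> ~{params * bpw / 8 / 2**30:.1f} GiB of weights")

# Actual per-GPU usage after loading the model (free/total in bytes).
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {(total - free) / 2**30:.1f} / {total / 2**30:.1f} GiB used")
```

By that estimate, 2.5 bpw should be around 36 GiB of weights before cache and overhead, which is well under what I'm actually seeing with EXL3.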