Q2k interleaving implementation - x86/x64 SIMD #14373
Block Interleaving Formats
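As background for the interleaved format, a standard Q2_K super-block holds 256 weights. It stores four 2-bit quants per byte, a packed 4-bit scale and 4-bit min for each 16-weight sub-block, and two super-block factors `d` and `dmin`. The sketch below shows the scalar decode that SIMD kernels vectorize. It assumes the ggml convention, with the scale in the low nibble and the min in the high nibble, and it uses `float` for `d`/`dmin` where ggml stores fp16. The helper names `q2_get` and `q2k_dequant` are hypothetical.

```c
#include <stdint.h>

/* Extract the idx-th 2-bit field (idx = 0..3, low bits first) from a packed byte.
 * Which weight each field maps to depends on the block layout and is not modeled here. */
static inline int q2_get(uint8_t packed, int idx) {
    return (packed >> (2 * idx)) & 3;
}

/* Dequantize one 2-bit quant q (0..3).
 * sc is the packed sub-block byte: low nibble = scale, high nibble = min (assumed ggml convention).
 * d and dmin are the super-block factors (fp16 in ggml, float here for simplicity). */
static inline float q2k_dequant(float d, float dmin, uint8_t sc, int q) {
    float dl = d * (float)(sc & 0xF);    /* effective sub-block scale */
    float ml = dmin * (float)(sc >> 4);  /* effective sub-block min */
    return dl * (float)q - ml;
}
```

For example, with `d = 1.0f`, `dmin = 0.5f`, `sc = 0x23` and `q = 2`, the decode gives 3 × 2 − 2 × 0.5 = 5.0.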
Block_Q2_Kx8:
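A minimal sketch of how eight Q2_K blocks could be packed into one `Kx8` block. This assumes the eight source blocks come from eight different weight rows. The `q2k_block` field sizes mirror ggml's `block_q2_K`. The `q2k_block_x8` layout, the 8-byte `CHUNK` size and the round-robin order of `qs` are illustrative assumptions, not the exact layout used in this PR. The intent is that one contiguous SIMD load then covers the same quant positions from all eight blocks, so a kernel can accumulate eight outputs at once.

```c
#include <stdint.h>
#include <string.h>

#define QK_K  256  /* weights per Q2_K super-block */
#define NB    8    /* number of blocks interleaved together */
#define CHUNK 8    /* ASSUMPTION: bytes of qs taken from each block per round */

typedef uint16_t half_bits; /* raw fp16 bits; conversion not needed for repacking */

/* Simplified single Q2_K block (field sizes follow ggml's block_q2_K). */
typedef struct {
    uint8_t scales[QK_K / 16]; /* 16 bytes: low nibble scale, high nibble min */
    uint8_t qs[QK_K / 4];      /* 64 bytes: four 2-bit quants per byte */
    half_bits d;               /* super-block scale for the scales */
    half_bits dmin;            /* super-block scale for the mins */
} q2k_block;

/* Hypothetical 8-way interleaved layout. */
typedef struct {
    half_bits d[NB];
    half_bits dmin[NB];
    uint8_t scales[NB * QK_K / 16]; /* 128 bytes */
    uint8_t qs[NB * QK_K / 4];      /* 512 bytes */
} q2k_block_x8;

static void interleave_q2k_x8(const q2k_block in[NB], q2k_block_x8 *out) {
    /* Gather the per-block super-block factors side by side. */
    for (int b = 0; b < NB; b++) {
        out->d[b] = in[b].d;
        out->dmin[b] = in[b].dmin;
    }
    /* ASSUMPTION: each block's 16 scale bytes are kept contiguous. */
    for (int b = 0; b < NB; b++)
        memcpy(out->scales + b * (QK_K / 16), in[b].scales, QK_K / 16);
    /* Round-robin CHUNK-byte slices: round r holds slice r of block 0, then block 1, ... */
    const int rounds = (QK_K / 4) / CHUNK;
    for (int r = 0; r < rounds; r++)
        for (int b = 0; b < NB; b++)
            memcpy(out->qs + (r * NB + b) * CHUNK, in[b].qs + r * CHUNK, CHUNK);
}
```

After repacking, `out.qs[0..7]` holds the first 8 quant bytes of block 0, `out.qs[8..15]` the first 8 bytes of block 1, and `out.qs[64..71]` the second 8 bytes of block 0.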
Performance Impact:
Gains of ~5.5% were seen with the AVX2 version and ~25.5% with the AVX512 version over the base commit with GCC on Linux.
GCC Linux:
Q2_K Model:
GCC Version: 12.3
Clang Linux:
Larger gains of ~26.3% were seen with the AVX2 version and ~53.9% with the AVX512 version over the base commit with Clang on Linux.
Q2_K Model:
Clang Version: 20.1.0
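The description does not state which metric the percentages are computed from. Assuming they are relative throughput changes against the base commit (for example tokens/second), the gain is (new - base) / base × 100, as in this small helper. The name `percent_gain` is hypothetical.

```c
/* Relative gain, in percent, of a new throughput over a baseline throughput.
 * ASSUMPTION: both values use the same unit, e.g. tokens/second. */
static double percent_gain(double base_tps, double new_tps) {
    return (new_tps - base_tps) / base_tps * 100.0;
}
```

For instance, a baseline of 100 t/s rising to 125.5 t/s corresponds to a 25.5% gain.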
The model tested was: https://huggingface.co/bartowski/Phi-3-mini-4k-instruct-GGUF
The PR was tested on an AMD Ryzen 5 9600X, which supports the following flags by default:
CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
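To check whether another x86 machine can take the AVX2 or AVX512 paths, a small sketch using the GCC/Clang builtin `__builtin_cpu_supports` may help. It is independent of how ggml itself detects features, and the wrapper names `has_avx2`, `has_avx512f` and `print_simd_support` are hypothetical.

```c
#include <stdio.h>

/* Runtime CPU feature checks via the GCC/Clang x86 builtins; report 0 on other targets. */
static int has_avx2(void) {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? 1 : 0;
#else
    return 0;
#endif
}

static int has_avx512f(void) {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx512f") ? 1 : 0;
#else
    return 0;
#endif
}

/* Print the two flags in a style similar to the feature line above. */
static void print_simd_support(void) {
    printf("AVX2 = %d | AVX512F = %d\n", has_avx2(), has_avx512f());
}
```

On the Ryzen 5 9600X described above, both checks would be expected to report 1.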
Perplexity was also measured with the Q2_K model and found to be similar.
The perplexity results are tabulated as follows: