Q2k interleaving implementation - x86/x64 SIMD #14373


Open · wants to merge 3 commits into master

Conversation

Srihari-mcw (Collaborator)

  • The PR adds a block-interleaving approach for Q2_K quantization on x86/x64 AVX2/AVX512 SIMD architectures
  • AVX512 and AVX2 versions are implemented for the GEMM function, whereas GEMV is implemented with AVX2 intrinsics
  • The existing quantize_q8_K_4x8 function quantizes the float values to block_q8_Kx4 format
  • The repack_q2_K_to_q2_K_8_bl function rearranges weights in Q2_K format into Q2_Kx8 format (block_q2_Kx8); a sketch of the interleaving step follows this list
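
As a rough illustration of the repacking step, here is a minimal sketch of how the quant bytes of eight source blocks could be interleaved into eight-byte groups. This is an assumption-based sketch, not the PR's actual repack_q2_K_to_q2_K_8_bl; it assumes each Q2_K block carries QK_K/4 = 64 quant bytes:

```c
#include <stdint.h>
#include <string.h>

// Hypothetical helper: interleave the quant bytes of 8 Q2_K super-blocks.
// src[b] points to the 64-byte qs array of source block b (QK_K/4 = 64).
// dst receives 8 * 64 = 512 bytes: for each 8-byte chunk position, the
// chunk from block 0 is written first, then block 1, and so on.
static void interleave_q2_K_qs(uint8_t *dst, const uint8_t *src[8]) {
    for (int chunk = 0; chunk < 64 / 8; chunk++) { // 8 chunk positions per block
        for (int b = 0; b < 8; b++) {              // 8 source blocks
            memcpy(dst, src[b] + chunk * 8, 8);    // copy one 8-byte group
            dst += 8;
        }
    }
}
```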

Block Interleaving Formats

block_q2_Kx8:

  • Holds the data of 8 Q2_K blocks in interleaved fashion
  • uint8 scales[128]: scales and mins taken from the source Q2_K blocks. Every 16 bytes is packed so that it contains the scales and mins for the corresponding sub-blocks of the Q2_K structure (there are 16 sub-blocks in the original Q2_K structure)
  • The d and dmin values from the source Q2_K blocks are stored together in an array
  • Quant values from the source Q2_K blocks are sequentially extracted and interleaved into groups of eight bytes; a possible overall layout is sketched below
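
For reference, a minimal C sketch of what the interleaved block might look like, based purely on the description above. Field names, field ordering, and the fp16 typedef are assumptions, not the PR's actual definition; QK_K = 256 as in ggml, so each source block contributes 16 scale bytes and 64 quant bytes:

```c
#include <stdint.h>

typedef uint16_t ggml_half; // fp16 storage type, as used by ggml (assumed here)

// Hypothetical layout of 8 interleaved Q2_K super-blocks (QK_K = 256).
typedef struct {
    ggml_half d[8];      // super-block scales, one per source block
    ggml_half dmin[8];   // super-block mins, stored alongside d (the PR may
                         // pack d and dmin together in a single array)
    uint8_t scales[128]; // 16 packed scale/min bytes x 8 blocks; every 16 bytes
                         // holds the scales/mins for corresponding sub-blocks
    uint8_t qs[512];     // 2-bit quants: 64 bytes x 8 blocks, interleaved in
                         // 8-byte groups as sketched above
} block_q2_Kx8;
```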

Performance Impact:

Gains of ~5.5% are seen with the AVX2 version and ~25.5% with the AVX512 version over the base commit with GCC on Linux.

GCC Linux:

Q2_K Model:

| model | size | params | backend | threads | test | t/s | speedup | commit |
|---|---|---|---|---|---|---|---|---|
| phi3 3B Q2_K - Medium | 1.32 GiB | 3.82 B | CPU | 6 | pp 512 | 84.64 ± 0.20 | | 38de3fb (base) |
| phi3 3B Q2_K - Medium | 1.32 GiB | 3.82 B | CPU | 6 | pp 512 | 89.26 ± 0.21 | 5.45% | ef03580 (AVX2) |
| phi3 3B Q2_K - Medium | 1.32 GiB | 3.82 B | CPU | 6 | pp 512 | 106.27 ± 0.32 | 25.54% | ef03580 (AVX512) |
| phi3 3B Q2_K - Medium | 1.32 GiB | 3.82 B | CPU | 6 | tg 128 | 37.81 ± 0.02 | | 38de3fb (base) |
| phi3 3B Q2_K - Medium | 1.32 GiB | 3.82 B | CPU | 6 | tg 128 | 37.80 ± 0.02 | -0.03% | ef03580 (AVX2) |
| phi3 3B Q2_K - Medium | 1.32 GiB | 3.82 B | CPU | 6 | tg 128 | 37.64 ± 0.01 | -0.46% | ef03580 (AVX512) |

GCC version: 12.3

Clang Linux:

Higher gains are seen with Clang on Linux: ~26.3% with the AVX2 version and ~53.9% with the AVX512 version over the base commit.

Q2_K Model:

| model | size | params | backend | threads | test | t/s | speedup | commit |
|---|---|---|---|---|---|---|---|---|
| phi3 3B Q2_K - Medium | 1.32 GiB | 3.82 B | CPU | 6 | pp 512 | 92.33 ± 0.20 | | 38de3fb (base) |
| phi3 3B Q2_K - Medium | 1.32 GiB | 3.82 B | CPU | 6 | pp 512 | 116.68 ± 0.40 | 26.37% | ef03580 (AVX2) |
| phi3 3B Q2_K - Medium | 1.32 GiB | 3.82 B | CPU | 6 | pp 512 | 142.13 ± 0.63 | 53.93% | ef03580 (AVX512) |
| phi3 3B Q2_K - Medium | 1.32 GiB | 3.82 B | CPU | 6 | tg 128 | 38.26 ± 0.00 | | 38de3fb (base) |
| phi3 3B Q2_K - Medium | 1.32 GiB | 3.82 B | CPU | 6 | tg 128 | 38.11 ± 0.01 | -0.38% | ef03580 (AVX2) |
| phi3 3B Q2_K - Medium | 1.32 GiB | 3.82 B | CPU | 6 | tg 128 | 37.98 ± 0.01 | -0.71% | ef03580 (AVX512) |

Clang version: 20.1.0

The model tested was https://huggingface.co/bartowski/Phi-3-mini-4k-instruct-GGUF

The PR was tested on an AMD Ryzen 5 9600X, which supports the following flags by default:

CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

Further, the perplexity was tested and found to be similar with the Q2_K model.

The perplexity results are tabulated as follows:

| model | perplexity (final estimate PPL) | commit |
|---|---|---|
| phi3 3B Q2_K - Medium | 9.5511 ± 0.064212 | 38de3fb (base) |
| phi3 3B Q2_K - Medium | 9.5488 ± 0.06419 | ef03580 (updated) |

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Jun 25, 2025
@Srihari-mcw changed the title from "Q2k interleaving implementation" to "Q2k interleaving implementation - x86/x64 SIMD" on Jun 25, 2025
@Srihari-mcw force-pushed the q2k_interleaving_implementation branch 2 times, most recently from ba56a3c to 39ab344 on June 26, 2025 06:00
@Srihari-mcw force-pushed the q2k_interleaving_implementation branch from 39ab344 to c2c53bc on June 26, 2025 06:02