Q2k interleaving implementation - x86/x64 SIMD #14373


Open · wants to merge 3 commits into master

Conversation

Srihari-mcw (Collaborator)

  • The PR adds a block-interleaving approach for Q2_K quantization on x86/x64 AVX2/AVX512 SIMD architectures
  • AVX512 and AVX2 versions are implemented for the GEMM function, whereas GEMV is implemented with AVX2 intrinsics
  • The existing quantize_q8_K_4x8 function quantizes the float values to block_q8_Kx4 format
  • The repack_q2_K_to_q2_K_8_bl function rearranges weights in Q2_K format into Q2_Kx8 format (block_q2_Kx8); a sketch of the interleaving step follows this list
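
As a rough illustration of the repacking step, here is a minimal sketch of how the quant bytes of eight source blocks could be interleaved into eight-byte groups. This is an assumption-based sketch, not the PR's actual repack_q2_K_to_q2_K_8_bl; it assumes each Q2_K block carries QK_K/4 = 64 quant bytes:

```c
#include <stdint.h>
#include <string.h>

// Hypothetical helper: interleave the quant bytes of 8 Q2_K super-blocks.
// src[b] points to the 64-byte qs array of source block b (QK_K/4 = 64).
// dst receives 8 * 64 = 512 bytes: for each 8-byte chunk position, the
// chunk from block 0 is written first, then block 1, and so on.
static void interleave_q2_K_qs(uint8_t *dst, const uint8_t *src[8]) {
    for (int chunk = 0; chunk < 64 / 8; chunk++) { // 8 chunk positions per block
        for (int b = 0; b < 8; b++) {              // 8 source blocks
            memcpy(dst, src[b] + chunk * 8, 8);    // copy one 8-byte group
            dst += 8;
        }
    }
}
```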

Block Interleaving Formats

block_q2_Kx8:

  • Holds the data of 8 Q2_K blocks in interleaved fashion
  • uint8 scales[128]: scales and mins taken from the source Q2_K blocks. Every 16 bytes is packed so that it contains the scales and mins for the corresponding sub-blocks of the Q2_K structure (there are 16 sub-blocks in the original Q2_K structure)
  • The d and dmin values from the source Q2_K blocks are stored together in an array
  • Quant values from the source Q2_K blocks are sequentially extracted and interleaved into groups of eight bytes; a possible overall layout is sketched below
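
For reference, a minimal C sketch of what the interleaved block might look like, based purely on the description above. Field names, field ordering, and the fp16 typedef are assumptions, not the PR's actual definition; QK_K = 256 as in ggml, so each source block contributes 16 scale bytes and 64 quant bytes:

```c
#include <stdint.h>

typedef uint16_t ggml_half; // fp16 storage type, as used by ggml (assumed here)

// Hypothetical layout of 8 interleaved Q2_K super-blocks (QK_K = 256).
typedef struct {
    ggml_half d[8];      // super-block scales, one per source block
    ggml_half dmin[8];   // super-block mins, stored alongside d (the PR may
                         // pack d and dmin together in a single array)
    uint8_t scales[128]; // 16 packed scale/min bytes x 8 blocks; every 16 bytes
                         // holds the scales/mins for corresponding sub-blocks
    uint8_t qs[512];     // 2-bit quants: 64 bytes x 8 blocks, interleaved in
                         // 8-byte groups as sketched above
} block_q2_Kx8;
```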

Performance Impact:

Gains of ~5.5% are seen with the AVX2 version and ~25.5% with the AVX512 version over the base commit with GCC on Linux.

GCC Linux:

Q2_K Model:

| model | size | params | backend | threads | test | t/s | speedup | commit |
|---|---|---|---|---|---|---|---|---|
| phi3 3B Q2_K - Medium | 1.32 GiB | 3.82 B | CPU | 6 | pp 512 | 84.64 ± 0.20 | | 38de3fb (base) |
| phi3 3B Q2_K - Medium | 1.32 GiB | 3.82 B | CPU | 6 | pp 512 | 89.26 ± 0.21 | 5.45% | ef03580 (AVX2) |
| phi3 3B Q2_K - Medium | 1.32 GiB | 3.82 B | CPU | 6 | pp 512 | 106.27 ± 0.32 | 25.54% | ef03580 (AVX512) |
| phi3 3B Q2_K - Medium | 1.32 GiB | 3.82 B | CPU | 6 | tg 128 | 37.81 ± 0.02 | | 38de3fb (base) |
| phi3 3B Q2_K - Medium | 1.32 GiB | 3.82 B | CPU | 6 | tg 128 | 37.80 ± 0.02 | -0.03% | ef03580 (AVX2) |
| phi3 3B Q2_K - Medium | 1.32 GiB | 3.82 B | CPU | 6 | tg 128 | 37.64 ± 0.01 | -0.46% | ef03580 (AVX512) |

GCC version: 12.3

Clang Linux:

Higher gains are seen with Clang on Linux: ~26.3% with the AVX2 version and ~53.9% with the AVX512 version over the base commit.

Q2_K Model:

| model | size | params | backend | threads | test | t/s | speedup | commit |
|---|---|---|---|---|---|---|---|---|
| phi3 3B Q2_K - Medium | 1.32 GiB | 3.82 B | CPU | 6 | pp 512 | 92.33 ± 0.20 | | 38de3fb (base) |
| phi3 3B Q2_K - Medium | 1.32 GiB | 3.82 B | CPU | 6 | pp 512 | 116.68 ± 0.40 | 26.37% | ef03580 (AVX2) |
| phi3 3B Q2_K - Medium | 1.32 GiB | 3.82 B | CPU | 6 | pp 512 | 142.13 ± 0.63 | 53.93% | ef03580 (AVX512) |
| phi3 3B Q2_K - Medium | 1.32 GiB | 3.82 B | CPU | 6 | tg 128 | 38.26 ± 0.00 | | 38de3fb (base) |
| phi3 3B Q2_K - Medium | 1.32 GiB | 3.82 B | CPU | 6 | tg 128 | 38.11 ± 0.01 | -0.38% | ef03580 (AVX2) |
| phi3 3B Q2_K - Medium | 1.32 GiB | 3.82 B | CPU | 6 | tg 128 | 37.98 ± 0.01 | -0.71% | ef03580 (AVX512) |

Clang version: 20.1.0

The model tested was https://huggingface.co/bartowski/Phi-3-mini-4k-instruct-GGUF

The PR was tested on an AMD Ryzen 5 9600X, which supports the following flags by default:

CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

Further, the perplexity was tested and found to be similar with the Q2_K model.

The perplexity results are tabulated as follows:

| model | perplexity (final estimate PPL) | commit |
|---|---|---|
| phi3 3B Q2_K - Medium | 9.5511 ± 0.064212 | 38de3fb (base) |
| phi3 3B Q2_K - Medium | 9.5488 ± 0.06419 | ef03580 (updated) |

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Jun 25, 2025
@Srihari-mcw changed the title from "Q2k interleaving implementation" to "Q2k interleaving implementation - x86/x64 SIMD" on Jun 25, 2025
@Srihari-mcw force-pushed the q2k_interleaving_implementation branch 2 times, most recently from ba56a3c to 39ab344 on June 26, 2025 06:00
@Srihari-mcw force-pushed the q2k_interleaving_implementation branch from 39ab344 to c2c53bc on June 26, 2025 06:02