About a cache-friendly data layout #8432
luoyu-intel started this conversation in General
1 comment · 4 replies
-
It would be possible to write a ggml-backend buffer type that uses a different layout for some types. To do so, you would need to modify the …
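As a rough, standalone illustration of that suggestion (every type below is a simplified stand-in, not the real ggml-backend interface), the idea is that a custom buffer type can intercept weight uploads and repack Q4_0 tensors into another layout at that point:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Stand-ins for ggml types, for illustration only.
enum sketch_type { SKETCH_Q4_0, SKETCH_OTHER };
struct sketch_tensor { sketch_type type; };

// A buffer that could repack Q4_0 data while it is being uploaded,
// mirroring the role a set_tensor callback plays in a custom
// ggml-backend buffer type.
struct sketch_repacking_buffer {
    std::vector<uint8_t> mem; // stand-in for device memory

    void set_tensor(const sketch_tensor & t, const void * data, size_t size) {
        mem.resize(size);
        // A real implementation would de-interleave block_q4_0 data here
        // instead of copying it verbatim (the de-interleaving itself is
        // sketched at the end of this page).
        std::memcpy(mem.data(), data, size);
        (void) t; // the tensor type would select the repacking path
    }
};
```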
-
Hi @ggerganov @slaren!
I'm trying to improve next-token latency on Intel GPUs. I think the current data layout is not cache-friendly, which makes it very difficult for the SYCL kernels to fully use the memory bandwidth. For example, in Q4_0 the delta value (2 bytes) breaks the cache alignment of the quantized data.
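For reference, this is the interleaved block that ggml stores (the ggml_half typedef below is a stand-in for ggml's fp16 type):

```cpp
#include <cstdint>

#define QK4_0 32            // weights per Q4_0 block, as in ggml
typedef uint16_t ggml_half; // raw fp16 bits

typedef struct {
    ggml_half d;             // 2-byte scale ("delta")
    uint8_t   qs[QK4_0 / 2]; // 16 bytes of packed 4-bit quants
} block_q4_0;                // 18 bytes total: consecutive qs arrays sit
                             // at 18-byte strides, so they keep shifting
                             // relative to 64-byte cache lines
```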
With a continuous layout, each GPU core can always read a full cache line of quantized data from memory.
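A sketch of what such a de-interleaved ("continuous") layout could look like; the names here are illustrative, not taken from the author's kernel:

```cpp
#include <cstdint>

#define QK4_0 32            // weights per Q4_0 block, as in ggml
typedef uint16_t ggml_half; // raw fp16 bits

// Hypothetical continuous view of n_blocks Q4_0 blocks: all scales in
// one contiguous array and all packed nibbles in another, so every
// 64-byte line of qs holds quants from four whole blocks with no
// scale bytes mixed in.
struct q4_0_continuous {
    ggml_half * d;  // n_blocks scales
    uint8_t   * qs; // n_blocks * (QK4_0 / 2) bytes of quants
};
```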
I did some tests on these two layouts and profiled llama-cli with Intel VTune. I wrote a new gemv kernel for the continuous weights, plus a kernel that converts a block_q4_0 weight to the continuous layout before the gemv kernel runs (code). The native block_q4_0 gemv took ~56 µs for a 4096x4096 weight; the continuous layout took ~33 µs, a ~70% speedup. With this, an Intel A770M GPU can run llama2-q4_0 at 18.5 ms/token (excluding the latency of the conversion kernels).

My question is: is it possible to convert block_q4_0 weights to a continuous layout in llama.cpp? Or can it be done for the Intel SYCL backend only?