PR: Refine ggml-hexagon backend (Qualcomm Hexagon NPU backend) for latest ggml, whisper.cpp, llama.cpp #12326
Conversation
why "mapping the entire ggml's computational graph to QNN graph"(the second technical approach in above post) is not practical in ggml-qnn backend
the key function in this complicated C++ source file shows that an ideal or expected QNN graph, a single QNN graph containing very many graph nodes, is generated/composed in this function; the rest of the code in QnnSampleMain.cpp is just routine or skeleton code. in other words, the single QNN graph was generated by Qualcomm's dedicated tool.
https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-100/introduction.html after tracking all the relevant code in the QNN SDK, we can clearly see that the core process of offloading inference to the NPU (HTP) backend is 90%-99% the same as the general approach of "utilize the Hexagon NPU maximally in the QNN NPU backend" in Qualcomm's QNN sample. here too, the single QNN graph was generated by Qualcomm's dedicated tool.
we can clearly see that a customized model, trained and provided by Xiaomi's AI team, is used as a binary model in this open source project: they claimed a 10x performance gain with NPU inference. at the same time, after tracking the code carefully, the main logic of this open source project is 90% the same as Qualcomm's QNN sample, but we still don't know how that single QNN graph was generated. what should we think at this point?
this open source project comes from a famous top Chinese university and can be considered a derived or highly-customized project of llama.cpp. one of the highlights of this derived project is that its R&D developers implemented a closed-source QNN backend. recently I found a highly related project on GitHub, with help from a programmer unknown to me, @zhuipiaochen. after tracking the code carefully, the approach of "utilize the Hexagon NPU maximally in the QNN NPU backend" in this interesting project is 90% the same as the approach in Qualcomm's Genie or in Qualcomm's QNN sample:
the last 3 steps are exactly analogous to offloading 2D/3D matrix multiplication to the QNN backend in this PR. the difference between the two scenarios is that there are only 2 QNN graph nodes in the QNN graph of a 2D/3D mulmat on the QNN backend. once again, we don't know how the single QNN graph was generated. what should we think at this point?
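to make the contrast concrete, here is a rough self-contained C++ sketch. the helper names are hypothetical stand-ins for the much more verbose QNN C API (they are not real QNN SDK calls), and the two-node structure follows the mulmat description above:

```cpp
// Hypothetical stand-ins for the verbose QNN C API; only the shape of the
// logic matters here, not the names or signatures.
#include <cstdio>
#include <string>
#include <vector>

struct qnn_graph { std::vector<std::string> nodes; };

static qnn_graph qnn_graph_create(const char * /*name*/) { return {}; }
static void qnn_graph_add_node(qnn_graph & g, const std::string & op) { g.nodes.push_back(op); }
static void qnn_graph_finalize(const qnn_graph & g) { printf("finalized graph with %zu nodes\n", g.nodes.size()); }

int main() {
    // per-op graph, as built by the HWACCEL_QNN approach in this PR: one
    // mulmat becomes a tiny QNN graph with just 2 nodes.
    qnn_graph per_op = qnn_graph_create("mulmat");
    qnn_graph_add_node(per_op, "Transpose"); // node 1
    qnn_graph_add_node(per_op, "MatMul");    // node 2
    qnn_graph_finalize(per_op);

    // versus: the dedicated conversion tools emit ONE graph holding hundreds
    // of nodes covering the whole model, finalized once, executed per token.
    qnn_graph whole_model = qnn_graph_create("llm");
    for (int i = 0; i < 800; i++) qnn_graph_add_node(whole_model, "SomeOp");
    qnn_graph_finalize(whole_model);
    return 0;
}
```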
ok, let me do an interesting experiment with the ggml-qnn backend in this PR:
what can we see from the adb logcat output? we can clearly see that there is no entire or complete GGML graph in this function; accordingly, the logic or inference procedure in this function is exactly the same as the original, general approach used by all ggml backends. this is a limitation of the existing inference architecture in llama.cpp. conclusion:
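for reference, this is roughly what the graph_compute callback of a typical ggml backend looks like. a simplified sketch, not this PR's exact code; it assumes access to ggml's internal cgraph struct (n_nodes/nodes), and the function name is illustrative:

```cpp
// The backend receives the full cgraph, but the conventional implementation
// just walks it node by node and offloads each supported op individually.
#include "ggml.h"
#include "ggml-backend.h"

static enum ggml_status hexagon_graph_compute(ggml_backend_t backend, struct ggml_cgraph * cgraph) {
    (void) backend;
    for (int i = 0; i < cgraph->n_nodes; i++) {
        struct ggml_tensor * node = cgraph->nodes[i];
        switch (node->op) {
            case GGML_OP_ADD:     /* offload this single ADD     */ break;
            case GGML_OP_MUL_MAT: /* offload this single MUL_MAT */ break;
            default:              /* unsupported: CPU fallback   */ break;
        }
    }
    // nothing here ever composes the whole cgraph into one NPU graph
    return GGML_STATUS_SUCCESS;
}
```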
[updated on 21:56, 03/12/2025] the conclusion here is incorrect because the analysis in case 5 is WRONG; the first tech approach in this PR is still meaningful (all op functions can be reused in the second tech approach after some minor adjustment), and the second tech approach should be finished in this PR or a similar PR. the analysis in cases 1/2/3/4 is still correct, and the logic of this tech doc holds: Qualcomm provides dedicated binary tools for LLM model conversion, which is exactly the hard part (composing an ideal QNN graph from the complete ggml cgraph, i.e. mapping the complete ggml cgraph to a single QNN graph) of the second tech approach of the ggml-qnn backend. the second tech approach could also be implemented in this PR, but I don't think I can completely finish it given my limited AI knowledge (there are hundreds of cgraph nodes and about 50+ ops), and real AI experts must be involved in the remaining parts of ggml-qnn. so, good luck to other similar PRs. I made a wrong analysis in step 5 and a misunderstanding in #12342, which slaren already explained; the root cause of these two stupid mistakes is my very limited knowledge of hard-core AI tech.
Nice job. NPU support is huge for this project. Do you think it's also possible to make it work on Exynos 2200 and 2400 NPUs?
thanks for your kind comment.
@zhouwg
currently about 2 GiB of shared memory is allocated through the rpcmem API.
Can I increase my NPU allocation?
@zhouwg
sorry for the delayed answer.
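for context, a minimal sketch of how shared memory is typically allocated with the Hexagon SDK's rpcmem API (names per rpcmem.h); the 2 GiB figure above is presumably a sum of such buffers, and the wrapper function here is just an example:

```cpp
// Allocate ION/DMA shared memory visible to both the ARM-AP side and the
// cDSP via FastRPC, using the Hexagon SDK's rpcmem API.
#include "rpcmem.h"

void * alloc_npu_shared(int size) {
    rpcmem_init();
    // system heap, default (cached) flags
    void * buf = rpcmem_alloc(RPCMEM_HEAP_ID_SYSTEM, RPCMEM_DEFAULT_FLAGS, size);
    return buf; // release later with rpcmem_free(buf), then rpcmem_deinit()
}
```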
* [ ] Low
* [x] Medium (complexity of code on the ARM-AP side is medium; complexity of code on the cDSP side (hexagon-kernels) is high)
* [ ] High
* [x] `test-backend-ops` and `llama-cli` through HWACCEL_QNN on Qualcomm Snapdragon 8Gen3 & 8Elite equipped Android phones
* [x] `test-backend-ops` and `llama-cli` through HWACCEL_CDSP on Qualcomm Snapdragon 8Gen3 & 8Elite equipped Android phones
* [x] the major features in the ggml backend subsystem through HWACCEL_CDSP (the main approach in this PR) have been verified on Qualcomm Snapdragon 8Gen3 & 8Elite equipped Android phones
PR Description
this PR is a continued effort of my original PR #6869 from 04/2024, focused on the final mission:
the full and TL;DR description of this PR can be found at my forked llama.cpp project: zhouwg#30.
the high-level data path, or so-called high-level architecture, of ggml-hexagon can be found at my forked llama.cpp project: high-level data path of ggml-hexagon
Features
provide a concise reference implementation of HWACCEL_QNN in this PR: offload ggml ops to QNN.
provide a very fast approach (HWACCEL_CDSP), analogous to Intel's ggml-sycl or Qualcomm's ggml-opencl, in this PR: offload some performance-sensitive ggml ops directly to the Hexagon cDSP.
the Hexagon NPU performance of the HWACCEL_QNN approach and the HWACCEL_CDSP approach can be easily compared: this PR provides a computation visualization approach to help other developers and AI experts visualize the comparison between the cDSP approach and the QNN approach (a timing sketch follows below).
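a sketch of the kind of per-op timing behind such a comparison; `ggml_time_us()` and `ggml_op_name()` are ggml's own helpers, while the offload callback is a hypothetical stand-in for either the QNN or the cDSP path:

```cpp
#include <cstdio>
#include "ggml.h"

typedef void (*offload_fn)(struct ggml_tensor * node); // QNN or cDSP path

static void profile_node(struct ggml_tensor * node, offload_fn run) {
    int64_t t0 = ggml_time_us();
    run(node);                       // execute the op on the chosen backend
    int64_t t1 = ggml_time_us();
    printf("%-16s %lld us\n", ggml_op_name(node->op), (long long) (t1 - t0));
}
```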
dynamic runtime parameter adjustment through ggml-hexagon.cfg (this idea comes from @ngxson in his draft AI-dedicated PR; more parameters can be added to this configuration file).

probe/detect Snapdragon SoC information at runtime; accordingly, the code might/should run well on the following Qualcomm DSPs (a sketch of such a version-to-SoC mapping follows the list below):


```
#v68 --- Snapdragon 888
#v69 --- Snapdragon 8 Gen1
#v73 --- Snapdragon 8 Gen2
#v75 --- Snapdragon 8 Gen3 (verified)
#v79 --- Snapdragon 8 Elite (aka 8 Gen4) (verified)
```
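a minimal sketch of mapping a detected Hexagon HTP architecture version to a human-readable Snapdragon SoC name, based on the table above; the struct and helper names are illustrative, and the actual PR queries the SoC via Qualcomm APIs:

```cpp
#include <cstdio>

struct htp_arch_info {
    int          version;  // Hexagon HTP architecture version, e.g. 75 for v75
    const char * soc_name; // marketing name of the matching Snapdragon SoC
};

static const htp_arch_info k_htp_arch_table[] = {
    { 68, "Snapdragon 888"                   },
    { 69, "Snapdragon 8 Gen1"                },
    { 73, "Snapdragon 8 Gen2"                },
    { 75, "Snapdragon 8 Gen3"                },
    { 79, "Snapdragon 8 Elite (aka 8 Gen4)"  },
};

// Return the SoC name for a detected HTP version, or nullptr if unknown.
static const char * htp_arch_to_soc_name(int version) {
    for (const auto & e : k_htp_arch_table) {
        if (e.version == version) return e.soc_name;
    }
    return nullptr;
}

int main() {
    int detected = 75; // pretend the runtime probe reported v75
    const char * name = htp_arch_to_soc_name(detected);
    printf("Hexagon v%d -> %s\n", detected, name ? name : "unknown SoC");
    return 0;
}
```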
provide a customized tiny ggml-dsp, borrowed/ported directly from the original ggml, that runs well on the Hexagon cDSP side. this feature will be very helpful for domain or AI experts who want to do AI innovation directly with Qualcomm's amazing lightweight, low-level Hexagon SDK (C/C++ plus HVX assembly, able to operate the hardware directly) on the cDSP side, rather than learning Qualcomm's highly-engineered heavyweight, high-level QNN SDK API on the ARM-AP side (a tiny HVX kernel sketch follows this list).
provide the big picture of the ggml-hexagon backend in this PR for further or related dev activity in this great pure-tech community.
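to illustrate the ggml-dsp idea, here is a tiny element-wise f32 add kernel using HVX intrinsics. a sketch only, not this PR's actual kernel: it assumes a v68+ HVX target with qfloat support, 128-byte aligned buffers, and n being a multiple of 32, and the function name is illustrative:

```cpp
#include <hexagon_types.h>
#include <hexagon_protos.h>

// Element-wise f32 add on the cDSP: 32 floats per 128-byte HVX vector.
void ggmldsp_vec_add_f32(const float * x, const float * y, float * z, int n) {
    const HVX_Vector * vx = (const HVX_Vector *) x;
    const HVX_Vector * vy = (const HVX_Vector *) y;
    HVX_Vector * vz = (HVX_Vector *) z;
    for (int i = 0; i < n / 32; i++) {
        HVX_Vector sum = Q6_Vqf32_vadd_VsfVsf(vx[i], vy[i]); // sf + sf -> qf32
        vz[i] = Q6_Vsf_equals_Vqf32(sum);                    // qf32 -> IEEE sf
    }
}
```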
How to build ggml-hexagon source code for Android and verify the ggml-hexagon backend on a Snapdragon-based phone
Ubuntu 20.04/22.04 is validated and recommended as the host machine (other Linux distributions, a Linux VM, or WSL on Windows 10/11 might also work):
use build-run-android.sh to download the Android NDK and Qualcomm QNN SDK automatically; the Qualcomm Hexagon SDK must be obtained with a Qualcomm developer account and cannot be downloaded automatically by this script.
we will need an adb-connected Android smartphone running one of the following Qualcomm SoCs:
SM8450 (Snapdragon 8 Gen 1+)
SM8550 (Snapdragon 8 Gen 2)
SM8650 (Snapdragon 8 Gen 3)
SM8750-AB (Snapdragon 8 Elite) (aka Snapdragon 8 Gen 4)
we can verify from the log output of `adb logcat | grep ggml-hexagon` that this backend works as expected.
Hexagon NPU Performance
test phones are a Snapdragon 8 Gen3 and a Snapdragon 8 Elite (aka 8 Gen4) Android phone; the test model is qwen1_5-1_8b-chat-q4_0.gguf. QNN SDK is v2.32.0.250228, Hexagon SDK is v6.2.0.1.
case-1: GGML_OP_ADD performance comparison between QNN-NPU and cDSP in real LLM inference
case-2: GGML_OP_MUL_MAT performance comparison between QNN-NPU and cDSP (small-matrix mulmat through test-backend-ops)
[updated on 04/09/2025, 09:19] I suddenly found that QNN-NPU's performance improved significantly after I upgraded the QNN SDK to v2.33.0.250327.
test phones are a Snapdragon 8 Gen3 and a Snapdragon 8 Elite (aka 8 Gen4) Android phone; the test model is qwen1_5-1_8b-chat-q4_0.gguf. QNN SDK is v2.33.0.250327, Hexagon SDK is v6.2.0.1.
the details and how to reproduce the above results can be found at my forked llama.cpp project: zhouwg#28.
Big picture of ggml-hexagon backend
there are three tech approaches to implementing the ggml-hexagon backend for Qualcomm's Hexagon NPU: HWACCEL_QNN (offload individual ggml ops through the QNN SDK), HWACCEL_CDSP (offload ggml ops directly to the Hexagon cDSP), and the special approach through QNN (map the entire ggml cgraph to a single QNN graph).
the tech details of "the special approach through QNN" can be found at my forked llama.cpp project: zhouwg#24.
10+ reasons why I think HWACCEL_CDSP is the correct direction can be found at my forked llama.cpp project: zhouwg#28.
Acknowledgement
Conclusion
after spending so much effort on the ggml-hexagon backend, I personally think:
[updated on 04/02/2025, 22:18] @ggerganov @slaren, sorry to bother you, I understand your time is valuable: could you help change the label of this PR to "Qualcomm NPU" and remove the labels "testing", "script", and "build"? thanks so much!