
QNN: mobilebert failed to generate Qnn context binary #7946


Closed · guangy10 opened this issue Jan 24, 2025 · 9 comments
Labels
module: examples - Issues related to demos under examples/
module: qnn - Issues related to Qualcomm's QNN delegate and code under backends/qualcomm/
partner: qualcomm - For backend delegation, kernels, demo, etc. from the 3rd-party partner, Qualcomm
triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

guangy10 (Contributor) commented Jan 24, 2025

🐛 Describe the bug

python -m examples.qualcomm.scripts.mobilebert_fine_tune -b cmake-out -m SM8450 --compile_only --use_fp16

stacktrace:


[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Context binary size calculation failed
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to get serialized binary
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to get context binary size with err 0x7532
[ERROR] [Qnn ExecuTorch]: Can't determine the size of graph binary to be saved to cache. Error 30002
E 00:01:45.674883 executorch:QnnManager.cpp:484] Fail to get context binary.
Traceback (most recent call last):
  File "/data/users/guangyang/executorch/examples/qualcomm/scripts/mobilebert_fine_tune.py", line 399, in <module>
    main(args)
  File "/data/users/guangyang/executorch/examples/qualcomm/scripts/mobilebert_fine_tune.py", line 270, in main
    build_executorch_binary(
  File "/home/guangyang/executorch/examples/qualcomm/utils.py", line 329, in build_executorch_binary
    exported_program = to_backend(edge_prog.exported_program, qnn_partitioner)
  File "/home/guangyang/.conda/envs/executorch/lib/python3.10/functools.py", line 878, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/home/guangyang/executorch/exir/backend/backend_api.py", line 396, in _
    tagged_graph_module = _partition_and_lower(
  File "/home/guangyang/executorch/exir/backend/backend_api.py", line 319, in _partition_and_lower
    partitioned_module = _partition_and_lower_one_graph_module(
  File "/home/guangyang/executorch/exir/backend/backend_api.py", line 249, in _partition_and_lower_one_graph_module
    lowered_submodule = to_backend(
  File "/home/guangyang/.conda/envs/executorch/lib/python3.10/functools.py", line 878, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/home/guangyang/executorch/exir/backend/backend_api.py", line 113, in _
    preprocess_result: PreprocessResult = cls.preprocess(
  File "/home/guangyang/executorch/backends/qualcomm/qnn_preprocess.py", line 110, in preprocess
    assert len(qnn_context_binary) != 0, "Failed to generate Qnn context binary."
AssertionError: Failed to generate Qnn context binary.

Versions

trunk

cc @cccclai @winskuo-quic @shewu-quic

winskuo-quic (Collaborator) commented:

Hi @guangy10,
Could you please share which QNN SDK version you are currently using?
I tried it with QNN 2.28.0 on ExecuTorch mainline (commit: 433e30b) and everything works fine.
Below is the command I used to compile the model, which should be almost the same as yours:
python -m examples.qualcomm.scripts.mobilebert_fine_tune -b build-android -m SM8650 --compile_only --use_fp16 --pretrained_weight ../artifacts/mobilebert_fine_tune/finetuned_mobilebert_epoch_5.model
Thanks
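
A quick way to confirm which SDK a build picks up is to check QNN_SDK_ROOT, the environment variable the ExecuTorch Qualcomm backend setup expects to point at the installed SDK; the version number is usually embedded in the install path. A minimal sketch, assuming that variable is set (the libQnnHtp.so path below is illustrative and may differ across SDK layouts):

# Minimal check of which QNN SDK the environment points at.
# QNN_SDK_ROOT is the variable used by the ExecuTorch Qualcomm backend setup;
# its path usually carries the SDK version (e.g. .../qairt/2.28.0.xxxxxx).
import os

sdk_root = os.environ.get("QNN_SDK_ROOT")
if sdk_root is None:
    print("QNN_SDK_ROOT is not set")
else:
    print("QNN_SDK_ROOT =", sdk_root)
    # Illustrative sanity check for the HTP backend library; the exact
    # subdirectory can vary between SDK releases.
    htp_lib = os.path.join(sdk_root, "lib", "x86_64-linux-clang", "libQnnHtp.so")
    print("libQnnHtp.so present:", os.path.exists(htp_lib))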

guangy10 (Author) commented:

Mine is 2.26. Let me try upgrading to 2.28.

guangy10 (Author) commented:

@winskuo-quic I hit another issue after updating the QNN version to 2.28. What numpy version are you using? I'm getting the following error with numpy 2.2.2 in my setup.

RuntimeError: Failed to import transformers.generation.utils because of the following error (look up to see its traceback):
numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

winskuo-quic (Collaborator) commented:

I believe the error above is not directly related to the QNN version; it is more likely caused by a Python library version mismatch. On my side I have numpy 2.2.3 and transformers 4.47.1. Could you share the library installation process you are currently using? We use python install_requirements.py.
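
One quick way to narrow this down is to compare the installed versions against a known-good setup and re-trigger the failing import in isolation, so the traceback points at the compiled extension that was built against a different numpy ABI. A minimal sketch (the packages listed are just the ones mentioned above):

# Print the installed versions of the two packages discussed above and
# re-run the import that failed, so its traceback shows which extension
# module was compiled against an older numpy ABI.
import importlib.metadata as md

for pkg in ("numpy", "transformers"):
    print(pkg, md.version(pkg))

# This is the import that raised "numpy.dtype size changed ..." earlier;
# rerunning it in isolation localizes the stale compiled extension.
import transformers.generation.utils  # noqa: F401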

guangy10 (Author) commented:

Yeah, I'm running the same installation script. The transformers version is the same, and the numpy version differs only by a patch release, so I assume that doesn't matter. Here is my detailed env:

Collecting environment information...
PyTorch version: 2.7.0.dev20250131+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: CentOS Stream 9 (x86_64)
GCC version: (GCC) 11.5.0 20240719 (Red Hat 11.5.0-2)
Clang version: Could not collect
CMake version: version 3.29.0
Libc version: glibc-2.34

Python version: 3.10.0 | packaged by conda-forge | (default, Nov 20 2021, 02:24:10) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.19.0-0_fbk21_hardened_12633_g4db063a1bcb5-x86_64-with-glibc2.34
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   40 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          72
On-line CPU(s) list:             0-71
Vendor ID:                       GenuineIntel
Model name:                      Intel Core Processor (Broadwell)
CPU family:                      6
Model:                           61
Thread(s) per core:              1
Core(s) per socket:              36
Socket(s):                       2
Stepping:                        2
BogoMIPS:                        3990.61
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat
Virtualization:                  VT-x
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       2.3 MiB (72 instances)
L1i cache:                       2.3 MiB (72 instances)
L2 cache:                        288 MiB (72 instances)
L3 cache:                        32 MiB (2 instances)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-71
Vulnerability Itlb multihit:     KVM: Mitigation: VMX disabled
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX vulnerable, SMT disabled
Vulnerability Mds:               Vulnerable; SMT Host state unknown
Vulnerability Meltdown:          Vulnerable
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:        Vulnerable, STIBP: disabled
Vulnerability Srbds:             Unknown: Dependent on hypervisor status
Vulnerability Tsx async abort:   Vulnerable

Versions of relevant libraries:
[pip3] executorch==0.6.0a0+f2720fa
[pip3] flake8==6.1.0
[pip3] flake8-breakpoint==1.1.0
[pip3] flake8-bugbear==24.4.26
[pip3] flake8-comprehensions==3.14.0
[pip3] flake8-executable==2.1.3
[pip3] flake8-logging-format==0.9.0
[pip3] flake8-plugin-utils==1.3.3
[pip3] flake8-pyi==23.5.0
[pip3] flake8-simplify==0.19.3
[pip3] mypy==1.14.1
[pip3] mypy-extensions==1.0.0
[pip3] numpy==2.2.2
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==8.9.2.26
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-nccl-cu12==2.18.1
[pip3] nvidia-nvjitlink-cu12==12.5.82
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] optree==0.10.0
[pip3] torch==2.7.0.dev20250131+cpu
[pip3] torchao==0.8.0+git11333ba2
[pip3] torchaudio==2.6.0.dev20250131+cpu
[pip3] torchsr==1.0.4
[pip3] torchvision==0.22.0.dev20250131+cpu
[pip3] triton==3.0.0
[conda] executorch                0.6.0a0+f2720fa          pypi_0    pypi
[conda] numpy                     2.2.2                    pypi_0    pypi
[conda] nvidia-cublas-cu12        12.1.3.1                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.1.105                 pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.1.105                 pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.1.105                 pypi_0    pypi
[conda] nvidia-cudnn-cu12         8.9.2.26                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.0.2.54                pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.2.106               pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.4.5.107               pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.1.0.106               pypi_0    pypi
[conda] nvidia-nccl-cu12          2.18.1                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.5.82                  pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.1.105                 pypi_0    pypi
[conda] optree                    0.10.0                   pypi_0    pypi
[conda] torch                     2.7.0.dev20250131+cpu          pypi_0    pypi
[conda] torchao                   0.8.0+git11333ba2          pypi_0    pypi
[conda] torchaudio                2.6.0.dev20250131+cpu          pypi_0    pypi
[conda] torchfix                  0.6.0                    pypi_0    pypi
[conda] torchsr                   1.0.4                    pypi_0    pypi
[conda] torchvision               0.22.0.dev20250131+cpu          pypi_0    pypi
[conda] triton                    3.0.0                    pypi_0    pypi

guangy10 (Author) commented:

@winskuo-quic It's hard to debug a local dev env. If the model works fine on your side, should we just enable it in CI? The setup there can serve as the source of truth for future reference. Here are the QNN models we are currently running in CI: https://github.com/pytorch/executorch/blob/main/.github/workflows/trunk.yml#L305-L329

Can you add mobilebert to it? The CI only needs to test it in --compile_only mode, to ensure there is no compilation issue.

winskuo-quic (Collaborator) commented Feb 21, 2025

@guangy10,
Yes, I think we can try to enable both mobilebert and wav2letter mentioned in #7634 (comment).

guangy10 (Author) commented:

The same comment applies here as well: #7634 (comment)

cccclai (Contributor) commented Mar 25, 2025

#8616 is merged. Can we close it now?
