
Missing Out Variants When Running Llama3.2 Example Without XNNPack #6975


Open
sheetalarkadam opened this issue Nov 20, 2024 · 12 comments
Labels: module: runtime, module: xnnpack, triaged


@sheetalarkadam

sheetalarkadam commented Nov 20, 2024

I am following the instructions in the Llama2 README to test the Llama model with ExecuTorch.
I want to compare the performance of the model with and without XNNPack. From the code, it seems that DQLinear operations are delegated to XNNPack by default. However, I would like to understand how to use the quantized ops defined in ExecuTorch, as listed in quantized.yaml. Could you provide guidance on configuring the model to use ExecuTorch's quantized ops instead of XNNPack?

I encounter the following error when the -X (--xnnpack) flag is removed from the Python export:

raise RuntimeError(f"Missing out variants: {missing_out_vars}")
RuntimeError: Missing out variants: {'quantized_decomposed::choose_qparams_per_token_asymmetric', 'quantized_decomposed::dequantize_per_channel', 'quantized_decomposed::dequantize_per_channel_group', 'quantized_decomposed::dequantize_per_token', 'quantized_decomposed::quantize_per_token'}

LLAMA_QUANTIZED_CHECKPOINT=/content/SpinQuant_workspace/consolidated.00.pth
LLAMA_PARAMS=/src/gitrepo/llama/Llama3.2-1B/params.json
python -m examples.models.llama2.export_llama \
   --checkpoint "${LLAMA_QUANTIZED_CHECKPOINT:?}" \
   --params "${LLAMA_PARAMS:?}" \
   --use_sdpa_with_kv_cache \
   --preq_mode 8da4w_output_8da8w \
   --preq_group_size 32 \
   --max_seq_length 2048 \
   --output_name "llama3_2_noxnn.pte" \
   -kv \
   -d fp32 \
   --preq_embedding_quantize 8,0 \
   --use_spin_quant native \
   --metadata '{"append_eos_to_prompt": 0, "get_bos_id":128000, "get_eos_ids":[128009, 128001], "get_n_bos": 0, "get_n_eos": 0}'

What adjustments are required to resolve the "missing out variants" error when the -X flag is omitted?
Thank you for your assistance!

Versions

Collecting environment information...
PyTorch version: 2.6.0.dev20240927+cpu
Is debug build: False
CUDA used to build PyTorch: Could not collect
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.31.0
Libc version: glibc-2.35

Python version: 3.10.0 (default, Mar 3 2022, 09:58:08) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.15.167.1-1.cm2-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: 12.6.77

Versions of relevant libraries:
[pip3] executorch==0.5.0a0+20a157f
[pip3] numpy==1.26.4
[pip3] torch==2.6.0.dev20240927+cpu
[pip3] torchao==0.5.0+git0916b5b2
[pip3] torchaudio==2.5.0.dev20240927+cpu
[pip3] torchsr==1.0.4
[pip3] torchvision==0.20.0.dev20240927+cpu
[conda] executorch 0.5.0a0+20a157f pypi_0 pypi
[conda] numpy 1.26.4 pypi_0 pypi
[conda] torch 2.6.0.dev20240927+cpu pypi_0 pypi
[conda] torchaudio 2.5.0.dev20240927+cpu pypi_0 pypi
[conda] torchsr 1.0.4 pypi_0 pypi
[conda] torchvision 0.20.0.dev20240927+cpu pypi_0 pypi

cc @digantdesai @mcr229 @JacobSzwejbka @dbort

@metascroy
Contributor

Can you try adding import executorch.kernels.quantized to export_llama.py, like this:

import executorch.kernels.quantized  # noqa: F401 ('executorch.kernels.quantized' imported but unused)

I don't think we have a quantized linear kernel in ExecuTorch outside of XNNPACK or torchao, so using those ops probably dequantizes the weights and does the linear computation in float32, which might not make for a good comparison.

cc @larryliu0820 for missing ops and @digantdesai for XNNPACK
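
To make that caveat concrete, here is a minimal conceptual sketch in plain PyTorch (not the actual ExecuTorch kernels; the shapes, scale, and zero point are made up) of what the dequantize-then-fp32-linear path amounts to:

import torch

# Hypothetical shapes/values just for illustration; zero point assumed to be 0.
w_int8 = torch.randint(-128, 127, (64, 64), dtype=torch.int8)  # quantized weight
scale = torch.rand(64, 1)                                      # per-channel scale
x = torch.randn(2, 64)                                         # activations

w_fp32 = w_int8.to(torch.float32) * scale   # dequantize the weight back to fp32
y = torch.nn.functional.linear(x, w_fp32)   # ordinary fp32 linear, no fused quantized matmul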

@digantdesai
Contributor

Hmm... We should have quantize_per_token_out, for example, in executorch/kernels/quantized/cpu/op_quantize.cpp, and we should link against the quantized_ops_lib. We should also have tests for running quantized Llama with portable ops only, though I am not sure about Llama 3.2.

@sheetalarkadam
Author

sheetalarkadam commented Nov 26, 2024

@digantdesai the only missing op in executorch/kernels/quantized/cpu is dequantize_per_channel_group. Even after adding import executorch.kernels.quantized I get the same error, although it does find the op dequantize_per_channel. I also see the linkage target_link_options_shared_lib(quantized_ops_lib) in CMakeLists.txt.

@metascroy To try the torchao ops I am currently using the main branch, but I am hitting some minor issues, like quantization args not getting passed to ModelArgs.

@AkiSakurai
Contributor

I encountered this problem as well. I found that it is necessary to register the out variants with PyTorch, and the quantized library depends on the portable library, so it has to be loaded explicitly. However, why doesn't ExecuTorch load it implicitly?

import executorch.extension.pybindings.portable_lib
import executorch.kernels.quantized
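
One quick way to sanity-check whether these registrations took effect is to probe the torch.ops namespace for the ops named in the error (a diagnostic sketch; it assumes the ops live under the quantized_decomposed namespace shown in the error message):

import torch
import executorch.extension.pybindings.portable_lib  # noqa: F401
import executorch.kernels.quantized  # noqa: F401

# If registration worked, these lookups succeed instead of raising AttributeError,
# and overloads() shows which variants (e.g. an out overload) are registered.
for name in (
    "quantize_per_token",
    "dequantize_per_token",
    "choose_qparams_per_token_asymmetric",
):
    packet = getattr(torch.ops.quantized_decomposed, name)
    print(name, packet.overloads())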

@sheetalarkadam
Author

sheetalarkadam commented Jan 8, 2025

@AkiSakurai thanks, linking the portable library helped with all the ops defined in executorch/kernels/quantized/cpu, but dequantize_per_channel_group is still missing. Were you able to get the op definition from somewhere?

RuntimeError: Missing out variants: {'quantized_decomposed::dequantize_per_channel_group'}

@AkiSakurai
Contributor

AkiSakurai commented Jan 8, 2025

Were you able to get the op definition from somewhere?

No, it looks like this operation is not yet implemented.

@sheetalarkadam
Author

@digantdesai can you help with the implementation of the missing op quantized_decomposed::dequantize_per_channel_group?

@digantdesai
Contributor

As @AkiSakurai correctly said, it seems we do not have that op implemented in the quantized library (executorch/kernels/quantized/cpu). Let's create an issue first; then either you, I, someone from this thread, or someone from the ExecuTorch team can implement it.
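
For whoever picks this up, here is a rough sketch of the math such a kernel would need to implement, written in plain PyTorch against the quantized_decomposed::dequantize_per_channel_group schema; the function name, shapes, and argument handling here are illustrative only, not the ExecuTorch kernel API:

import torch

def dequantize_per_channel_group_ref(input, scales, zero_points, quant_min,
                                     quant_max, dtype, group_size, output_dtype):
    # Reference only: each row of `input` is split into groups of `group_size`
    # columns, and every group has its own scale (and optional zero point).
    # `quant_min`, `quant_max`, and `dtype` would only be used for validation here.
    out_channels, in_channels = input.shape
    num_groups = in_channels // group_size
    x = input.to(output_dtype).reshape(out_channels, num_groups, group_size)
    s = scales.to(output_dtype).reshape(out_channels, num_groups, 1)
    if zero_points is None:
        zp = torch.zeros_like(s)
    else:
        zp = zero_points.to(output_dtype).reshape(out_channels, num_groups, 1)
    return ((x - zp) * s).reshape(out_channels, in_channels)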

@mcr229 added the module: xnnpack and module: runtime labels on Jan 14, 2025
@sheetalarkadam
Author

Got it, thanks

@BodhiHu

BodhiHu commented Jan 20, 2025

Is #7775 a duplicate of this?

We got a runtime error when trying to convert Llama 3.1 8B:

RuntimeError: Missing out variants: {'quantized_decomposed::dequantize_per_token', 'quantized_decomposed::choose_qparams_per_token_asymmetric', 'quantized_decomposed::dequantize_per_channel_group', 'quantized_decomposed::quantize_per_token'}

@digantdesai
Contributor

Is #7775 a duplicate of this?

I don't think so. #7775 is lowering with the XNNPACK delegate and still running into missing q/dq ops, whereas here we want the model to run without the XNNPACK delegate.

@BodhiHu

BodhiHu commented Jan 22, 2025


Thanks a lot for the clarification. We added some debug logs, and it turns out the SchemaKind is different when converting the op:

INFO:root:Failed converting '<EdgeOpOverload: quantized_decomposed.dequantize_per_token.default>: schema = quantized_decomposed::dequantize_per_token(Tensor input, Tensor scales, Tensor zero_points, int quant_min, int quant_max, ScalarType dtype, ScalarType output_dtype) -> Tensor' to its out variant with error: 'SchemaKind.out variant of operator quantized_decomposed::dequantize_per_token can't be found. We've found the schemas of all the overloads: ['quantized_decomposed::dequantize_per_token(Tensor input, Tensor scales, Tensor zero_points, int quant_min, int quant_max, ScalarType dtype, ScalarType output_dtype) -> Tensor']'
>>>>>>>>>
  SchemaKind.functional == SchemaKind.out:
    equals: False
  quantized_decomposed::dequantize_per_channel_group(Tensor input, Tensor scales, Tensor? zero_points, int quant_min, int quant_max, ScalarType dtype, int group_size, ScalarType output_dtype) -> ()
  quantized_decomposed::dequantize_per_channel_group(Tensor input, Tensor scales, Tensor? zero_points, int quant_min, int quant_max, ScalarType dtype, int group_size, ScalarType output_dtype) -> ()
    equals: True
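
For reference, one way to inspect what is actually registered, outside the export flow, is to dump the schemas PyTorch knows for the op (a diagnostic sketch that uses a private torch API, so treat the exact call as an assumption):

import torch
import executorch.extension.pybindings.portable_lib  # noqa: F401
import executorch.kernels.quantized  # noqa: F401

# Prints every schema registered for the op; a functional-only registration shows
# just the '-> Tensor' signature, with no out-style overload alongside it.
for schema in torch._C._jit_get_schemas_for_operator(
    "quantized_decomposed::dequantize_per_token"
):
    print(schema)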

@metascroy added the triaged label on Feb 4, 2025