[XNNPACK] Serialize weights as fp16 rather than fp32 #9753
CI status (Dr. CI): ✅ No failures as of commit cb31420 with merge base ce74f8e. See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9753.
```diff
@@ -368,7 +368,7 @@ def define_tensor(  # noqa: C901
             constant data. If used along with convert_to_nhwc, this
             swap will happen before converting to nhwc.
         quant_params: Quantization meta data for this tensor, None if it is not quantized
-        fp32_static_weights: XNN_FLAG_FP32_STATIC_WEIGHTS for fp16 conv
+        force_fp32: forces tensor to be serialized as fp32, used for bias of dynamically quantized ops
```
s/fp32_static_weights/force_fp32 - seems a little too vague if you ask me.
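For context, here is a minimal sketch of how a `force_fp32`-style flag can gate whether constant data is written out as fp16 or kept as fp32. This is not the actual ExecuTorch `define_tensor` implementation; the helper names `choose_serialization_dtype` and `serialize_constant` are made up for illustration.

```python
# Illustrative sketch only (hypothetical helpers, not the ExecuTorch API):
# decide the dtype a constant tensor is serialized with, given whether the
# op computes in fp16 and whether fp32 is being forced (e.g. the bias of a
# dynamically quantized op).
import torch


def choose_serialization_dtype(
    op_runs_in_fp16: bool,
    force_fp32: bool,
) -> torch.dtype:
    """Pick the dtype to serialize a constant tensor with."""
    if op_runs_in_fp16 and not force_fp32:
        return torch.float16
    return torch.float32


def serialize_constant(
    tensor: torch.Tensor,
    op_runs_in_fp16: bool,
    force_fp32: bool = False,
) -> bytes:
    dtype = choose_serialization_dtype(op_runs_in_fp16, force_fp32)
    # Cast before flattening to raw bytes, so the payload matches the dtype.
    return tensor.to(dtype).contiguous().numpy().tobytes()


# Example: an fp32 linear weight serialized as fp16 halves the payload,
# while force_fp32 keeps the full-width bytes.
w = torch.randn(1024, 1024)
fp16_bytes = serialize_constant(w, op_runs_in_fp16=True)
fp32_bytes = serialize_constant(w, op_runs_in_fp16=True, force_fp32=True)
assert len(fp16_bytes) * 2 == len(fp32_bytes)
```

In this sketch, `force_fp32` plays the role the new docstring describes: keeping the bias of a dynamically quantized op in fp32 even when other constants are serialized as fp16.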
Summary

Previously we used the FP32_STATIC_WEIGHTS flag in XNNPACK to coerce fp32 weights into fp16 for linear and conv. This allowed us to mimic fp16 computation because the weights would be converted and packed as fp16 at runtime. However, this means we lose the benefit of a smaller .pte file, because the weights are serialized as fp32 rather than fp16. Additionally, we still have to load the weights as fp32, since they are only converted at runtime, which adds load-time and memory overhead.

Test plan

```
python -m unittest backends.xnnpack.test.ops.test_linear.TestLinear.test_fp16_linear
python -m unittest backends.xnnpack.test.ops.test_linear.TestLinear
python -m unittest backends.xnnpack.test.ops.test_conv2d.TestConv2d
```

Llama 3.2 with bf16 weights:

Before:

```
-rw-r--r--  1 maxren  staff  5468937344 Mar 28 17:00 llama3_2_fp16_direct_convert_runtime.pte
```

After:

```
-rw-r--r--  1 maxren  staff  2997443712 Mar 28 16:57 llama3_2_fp16_direct_convert_runtime.pte
```
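As a rough illustration of the numerical side of the test plan (a hypothetical standalone check, not the actual `test_fp16_linear` test): storing linear weights as fp16 loses some precision, but the output should stay close to the fp32 reference.

```python
# Hypothetical sketch: round-trip a linear layer's weights through fp16 to
# model fp16 serialization, then compare against the fp32 reference.
import torch


def linear_with_fp16_serialized_weights(
    x: torch.Tensor, w: torch.Tensor, b: torch.Tensor
) -> torch.Tensor:
    # Downcast to fp16 and back to fp32, mimicking weights that were stored
    # as fp16 in the .pte and converted back at load time.
    return torch.nn.functional.linear(x, w.half().float(), b.half().float())


x = torch.randn(8, 64)
w = torch.randn(32, 64)
b = torch.randn(32)

ref = torch.nn.functional.linear(x, w, b)
out = linear_with_fp16_serialized_weights(x, w, b)

# fp16 weight rounding error is small relative to the fp32 result.
assert torch.allclose(out, ref, atol=5e-2, rtol=1e-2)
```

Consistent with the serialization change, the before/after listings above show the .pte shrinking from roughly 5.5 GB to 3.0 GB, about a 45% reduction.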