[XNNPACK] Serialize weights as fp16 rather than fp32 (#9753)
### Summary
Previously, we used the FP32_STATIC_WEIGHTS flag in XNNPACK to coerce fp32
weights into fp16 for linear and conv ops. This mimicked fp16 computation
because the weights were converted and packed as fp16 at runtime. However,
it forfeited the benefit of a smaller .pte file, since the weights were
serialized as fp32 rather than fp16. Additionally, we still had to load the
full fp32 weights before converting them at runtime, which hurts
performance. This change serializes the weights as fp16 instead.
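For intuition, here is a minimal sketch in plain PyTorch (illustrative only, not the delegate's actual serialization path; the 4096x4096 shape is an arbitrary example) comparing the on-disk payload of one linear weight stored as fp32 versus fp16:
```python
import torch

# An example fp32 linear weight.
weight_fp32 = torch.randn(4096, 4096, dtype=torch.float32)

# Old behavior: serialize fp32 (4 bytes/element), convert to fp16 at runtime.
fp32_bytes = weight_fp32.numel() * weight_fp32.element_size()

# New behavior: convert to fp16 (2 bytes/element) ahead of time, then serialize.
weight_fp16 = weight_fp32.to(torch.float16)
fp16_bytes = weight_fp16.numel() * weight_fp16.element_size()

print(f"fp32 payload: {fp32_bytes / 1e6:.1f} MB")  # ~67.1 MB
print(f"fp16 payload: {fp16_bytes / 1e6:.1f} MB")  # ~33.6 MB
```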
### Test plan
```
python -m unittest backends.xnnpack.test.ops.test_linear.TestLinear.test_fp16_linear
python -m unittest backends.xnnpack.test.ops.test_linear.TestLinear
python -m unittest backends.xnnpack.test.ops.test_conv2d.TestConv2d
```
Llama 3.2 with bf16 weights:
Before:
```
-rw-r--r-- 1 maxren staff 5468937344 Mar 28 17:00 llama3_2_fp16_direct_convert_runtime.pte
```
After:
```
-rw-r--r-- 1 maxren staff 2997443712 Mar 28 16:57 llama3_2_fp16_direct_convert_runtime.pte
```
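That is roughly a 45% reduction (~2.99 GB vs ~5.47 GB), close to the halving expected from storing the converted weights as 2-byte fp16 instead of 4-byte fp32.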