@@ -108,7 +108,7 @@ First, install vLLM with torchao support:
 
 
 Inference with Transformers
----------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Install the required packages:
 
@@ -160,12 +160,12 @@ Install the required packages:
     print(output[0]['generated_text'])
 
 Mobile Deployment with ExecuTorch
----------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 ExecuTorch enables on-device inference using torchao's mobile-optimized quantization schemes. The 8da4w (8-bit dynamic activation, 4-bit weight) configuration is specifically designed for mobile deployment.
 
 Step 1: Untie Embedding Weights
-===============================
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 We want to quantize the embedding and lm_head differently. Since those layers are tied, we first need to untie the model:
 
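Context for the hunk above: untying matters because tied ``embedding``/``lm_head`` layers share a single parameter, so quantizing one layer would silently change the other. A minimal PyTorch sketch of what the untying step does — toy shapes and layer names are illustrative, not the actual checkpoint code from this diff:

```python
import torch
import torch.nn as nn

# Toy tied pair (hypothetical shapes): embedding and lm_head share one
# Parameter object, so a weight transform applied to one would hit both.
embedding = nn.Embedding(10, 4)
lm_head = nn.Linear(4, 10, bias=False)
lm_head.weight = embedding.weight  # tied: both names point at the same Parameter

# Untie: give lm_head an independent copy before per-layer quantization.
lm_head.weight = nn.Parameter(embedding.weight.detach().clone())

print(lm_head.weight is embedding.weight)             # False: separate storage now
print(torch.equal(lm_head.weight, embedding.weight))  # True: values unchanged
```

After this, the embedding and lm_head can receive different quantization configs without interfering with each other.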
@@ -206,7 +206,7 @@ We want to quantize the embedding and lm_head differently. Since those layers ar
     tokenizer.save_pretrained(save_to)
 
 Step 2: Create Mobile-Optimized Quantization
-============================================
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Quantizing the model for mobile deployment using TorchAO's **Int8DynamicActivationIntxWeightConfig** configuration:
 
@@ -284,7 +284,7 @@ Quantizing the model for mobile deployment using TorchAO's **Int8DynamicActivati
 
 
 Step 3: Export to ExecuTorch
-============================
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. code-block:: bash
 
@@ -327,7 +327,7 @@ Evaluation
 ###########
 
 Model Quality Assessment
-------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~
 
 Evaluate quantized models using lm-evaluation-harness:
 
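For the evaluation step referenced in the hunk above, a typical lm-evaluation-harness invocation looks like the following. The model path and task list are placeholders, not values from this diff; the document's actual command is in the elided code block.

```shell
# Placeholder invocation: substitute your quantized checkpoint and tasks.
lm_eval --model hf \
    --model_args pretrained=<path-or-hub-id-of-quantized-model> \
    --tasks hellaswag,arc_easy \
    --batch_size 8
```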
@@ -343,7 +343,7 @@ Evaluate quantized models using lm-evaluation-harness: