Commit bd2600f

Fix formatting
1 parent 17b7cb8 commit bd2600f

File tree

1 file changed: +13 -13 lines

docs/source/serving.rst

Lines changed: 13 additions & 13 deletions
@@ -75,7 +75,7 @@ Serving and Inference
 ######################
 
 Serving and Inference with vLLM
--------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 vLLM automatically leverages torchao's optimized kernels when serving quantized models, providing significant throughput improvements.
 
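The subsection retitled in the hunk above covers serving torchao-quantized checkpoints with vLLM. As a rough sketch of that workflow (not the file's own example), assuming vLLM's offline `LLM` API and the pytorch/Phi-4-mini-instruct-float8dq checkpoint that appears later in this diff; the prompt and sampling settings are illustrative:

    # Sketch only: model name, prompt, and sampling settings are illustrative.
    from vllm import LLM, SamplingParams

    llm = LLM(model="pytorch/Phi-4-mini-instruct-float8dq")  # pre-quantized checkpoint
    params = SamplingParams(temperature=0.7, max_tokens=64)
    outputs = llm.generate(["Explain float8 quantization in one sentence."], params)
    print(outputs[0].outputs[0].text)
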
@@ -108,7 +108,7 @@ First, install vLLM with torchao support:
 
 
 Inference with Transformers
----------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Install the required packages:
 
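The Transformers subsection retitled here ends with `print(output[0]['generated_text'])`, visible as context in a later hunk. A minimal sketch of how such output is typically produced with the Hugging Face `pipeline` API; the checkpoint name is the one used elsewhere in this file, and the generation settings are assumptions:

    # Sketch only: checkpoint name and generation settings are assumptions.
    from transformers import pipeline

    pipe = pipeline(
        "text-generation",
        model="pytorch/Phi-4-mini-instruct-float8dq",
        device_map="auto",
    )
    output = pipe("What is float8 quantization?", max_new_tokens=64)
    print(output[0]['generated_text'])
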
@@ -160,12 +160,12 @@ Install the required packages:
     print(output[0]['generated_text'])
 
 Mobile Deployment with ExecuTorch
----------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 ExecuTorch enables on-device inference using torchao's mobile-optimized quantization schemes. The 8da4w (8-bit dynamic activation, 4-bit weight) configuration is specifically designed for mobile deployment.
 
 Step 1: Untie Embedding Weights
-===============================
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 We want to quantize the embedding and lm_head differently. Since those layers are tied, we first need to untie the model:
 
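Step 1's unchanged code block (not shown here apart from its closing `tokenizer.save_pretrained(save_to)` line in the next hunk) unties the shared embedding/lm_head weights. A sketch of one way to do that with the Transformers API; the base checkpoint, the cloning approach, and the `save_to` path are assumptions rather than the tutorial's exact code:

    # Sketch only: checkpoint name, untie approach, and save path are assumptions.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "microsoft/Phi-4-mini-instruct"
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Give lm_head its own copy of the shared embedding weights, then mark them untied.
    model.lm_head.weight = torch.nn.Parameter(model.get_input_embeddings().weight.clone())
    model.config.tie_word_embeddings = False

    save_to = "phi4-mini-untied"  # illustrative path
    model.save_pretrained(save_to)
    tokenizer.save_pretrained(save_to)
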
@@ -206,7 +206,7 @@ We want to quantize the embedding and lm_head differently. Since those layers ar
     tokenizer.save_pretrained(save_to)
 
 Step 2: Create Mobile-Optimized Quantization
-============================================
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Quantizing the model for mobile deployment using TorchAO's **Int8DynamicActivationIntxWeightConfig** configuration:
 
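Step 2 applies torchao's Int8DynamicActivationIntxWeightConfig, i.e. the 8da4w scheme described above. A sketch of what such a configuration can look like; the parameter names, group size, and checkpoint path are assumptions that may differ between torchao versions and from the tutorial's actual settings:

    # Sketch only: parameter names/values may differ across torchao versions.
    import torch
    from transformers import AutoModelForCausalLM
    from torchao.quantization import Int8DynamicActivationIntxWeightConfig, quantize_
    from torchao.quantization.granularity import PerGroup

    # Load the untied checkpoint produced in Step 1 (illustrative path from above).
    model = AutoModelForCausalLM.from_pretrained("phi4-mini-untied", torch_dtype=torch.float32)

    # int8 dynamic activations + int4 weights with per-group scales (group size assumed).
    linear_config = Int8DynamicActivationIntxWeightConfig(
        weight_dtype=torch.int4,
        weight_granularity=PerGroup(32),
    )
    quantize_(model, linear_config)
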
@@ -284,7 +284,7 @@ Quantizing the model for mobile deployment using TorchAO's **Int8DynamicActivati
 
 
 Step 3: Export to ExecuTorch
-============================
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. code-block:: bash
 
@@ -327,7 +327,7 @@ Evaluation
 ###########
 
 Model Quality Assessment
-------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~
 
 Evaluate quantized models using lm-evaluation-harness:
 
@@ -343,7 +343,7 @@ Evaluate quantized models using lm-evaluation-harness:
     lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq --tasks hellaswag --device cuda:0 --batch_size 8
 
 Memory Benchmarking
---------------------
+~~~~~~~~~~~~~~~~~~~
 
 .. code-block:: python
 
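The Memory Benchmarking code block itself is unchanged and therefore not part of this diff. A sketch of the kind of peak-memory measurement such a section typically performs, assuming `torch.cuda` statistics and the quantized checkpoint named above; not the file's actual code:

    # Sketch only: the real benchmarking code in serving.rst is not shown in this diff.
    import torch
    from transformers import AutoModelForCausalLM

    torch.cuda.reset_peak_memory_stats()
    model = AutoModelForCausalLM.from_pretrained(
        "pytorch/Phi-4-mini-instruct-float8dq",
        torch_dtype="auto",
        device_map="cuda:0",
    )
    print(f"Peak memory reserved: {torch.cuda.max_memory_reserved() / 1e9:.2f} GB")
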
@@ -392,10 +392,10 @@ Memory Benchmarking
 +-------------------+---------------------+------------------------------+
 
 Performance Benchmarking
-------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~
 
 **Latency Benchmarking**:
-=========================
+^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. code-block:: bash
 
@@ -406,7 +406,7 @@ Performance Benchmarking
     VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-float8dq --batch-size 1
 
 **Serving Benchmarking**:
-=========================
+^^^^^^^^^^^^^^^^^^^^^^^^^
 
 We benchmarked the throughput in a serving environment.
 
@@ -439,7 +439,7 @@ We benchmarked the throughput in a serving environment.
     python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-float8dq --num-prompts 1
 
 **Results (H100 machine)**:
-============================
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 +----------------------------+---------------------+------------------------------+
 | Benchmark                  | Phi-4-mini-instruct | Phi-4-mini-instruct-float8dq |
@@ -454,7 +454,7 @@ We benchmarked the throughput in a serving environment.
 +----------------------------+---------------------+------------------------------+
 
 **Conclusion**
-==============
+^^^^^^^^^^^^^^^
 
 This tutorial demonstrated how torchao's quantization and sparsity techniques integrate seamlessly across the entire ML deployment stack:
 