Update QAT: add grad clipping, torch.compile, collate fn #1854

Merged: 1 commit into pytorch:main on Nov 8, 2024

Conversation

@andrewor14 (Contributor) commented Oct 16, 2024

Summary:

Update the qat_distributed recipe to match the full_finetune_distributed recipe. This commit adds features to the QAT recipe such as gradient clipping, torch.compile support, and a user-configurable collate function for data pre-processing. It mirrors all changes in full_finetune_distributed as of 506e099.
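
For orientation, the snippet below is a minimal, hypothetical sketch of how these three pieces (gradient clipping, torch.compile, and a user-supplied collate function) typically fit together in a PyTorch training loop. It is not the recipe's code; the model, dataset, learning rate, and max_norm value are placeholders.

```python
# Hypothetical illustration only -- not the torchtune recipe's actual code.
import torch
from torch.utils.data import DataLoader

# Toy dataset of variable-length token lists.
dataset = [{"tokens": [1, 2, 3]}, {"tokens": [4, 5]}, {"tokens": [6]}]

def padded_collate(batch, pad_id=0):
    # Placeholder collate fn: right-pad each sample to the longest sequence in the batch.
    max_len = max(len(x["tokens"]) for x in batch)
    padded = [x["tokens"] + [pad_id] * (max_len - len(x["tokens"])) for x in batch]
    return {"tokens": torch.tensor(padded)}

model = torch.nn.Embedding(100, 8)   # stand-in for the fine-tuned model
model = torch.compile(model)         # compile is opt-in, gated by a config flag
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loader = DataLoader(dataset, batch_size=2, collate_fn=padded_collate)  # user-supplied collate fn

for batch in loader:
    loss = model(batch["tokens"]).sum()   # placeholder loss
    loss.backward()
    # Clip gradients before the optimizer step (only applied when a max norm is configured).
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
```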

Helpful commands for quick review:

```
diff --color recipes/full_finetune_distributed.py recipes/qat_distributed.py
diff --color recipes/configs/llama2/7B_full.yaml recipes/configs/llama2/7B_qat_full.yaml
diff --color recipes/configs/llama3/8B_full.yaml recipes/configs/llama3/8B_qat_full.yaml
```

Test Plan:

Fine-tune on the alpaca dataset for 1 epoch, with and without QAT:

```
CUDA_VISIBLE_DEVICES=2,3,4,5,6,7 tune run --nnodes 1 --nproc_per_node 6 qat_distributed --config llama3/8B_qat_full \
    epochs=1 \
    checkpointer.output_dir="$LOG_DIR" \
    metric_logger.output_dir="${LOG_DIR}/metrics" \
    quantizer._component_=torchtune.training.quantization.Int8DynActInt4WeightQATQuantizer

CUDA_VISIBLE_DEVICES=1 tune run quantize --config recipes/configs/quantization.yaml \
    model._component_=torchtune.models.llama3.llama3_8b \
    checkpointer._component_=torchtune.training.FullModelMetaCheckpointer \
    checkpointer.checkpoint_dir="$LOG_DIR" \
    checkpointer.output_dir="$LOG_DIR" \
    checkpointer.checkpoint_files=[meta_model_0.pt] \
    checkpointer.model_type=LLAMA3 \
    quantizer._component_=torchtune.training.quantization.Int8DynActInt4WeightQuantizer

CUDA_VISIBLE_DEVICES=1 tune run eleuther_eval --config eleuther_evaluation \
    tasks=[wikitext] \
    model._component_=torchtune.models.llama3.llama3_8b \
    checkpointer._component_=torchtune.training.FullModelTorchTuneCheckpointer \
    checkpointer.checkpoint_dir="$LOG_DIR" \
    checkpointer.output_dir="$LOG_DIR" \
    checkpointer.checkpoint_files=[meta_model_0-8da4w.pt] \
    checkpointer.model_type=LLAMA3 \
    tokenizer._component_=torchtune.models.llama3.llama3_tokenizer \
    tokenizer.path=/tmp/Meta-Llama-3-8B-Instruct/original/tokenizer.model \
    quantizer._component_=torchtune.training.quantization.Int8DynActInt4WeightQuantizer
```
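
For context on what the quantizer components in these commands do, the sketch below outlines the torchao QAT flow the recipe builds on: prepare a model with fake-quantized linears for fine-tuning, then convert it to actually quantized weights (the `tune run quantize` step). This is a simplified outline, not the recipe's code; the model is a small placeholder and the torchao import path varies across versions.

```python
# Simplified outline of the torchao QAT flow (import path differs across torchao
# versions; the model here is a small placeholder, not an LLM).
import torch
from torchao.quantization.qat import Int8DynActInt4WeightQATQuantizer

model = torch.nn.Sequential(torch.nn.Linear(256, 256))  # stand-in for the model

# 1. Prepare: insert fake quantization so training "sees" int8/int4 rounding error.
qat_quantizer = Int8DynActInt4WeightQATQuantizer(groupsize=256)
model = qat_quantizer.prepare(model)

# ... fine-tune as usual (this is what the qat_distributed recipe drives) ...

# 2. Convert: swap the fake-quantized modules for actually quantized ones
#    (int8 dynamic activations, int4 grouped weights).
model = qat_quantizer.convert(model)
```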

With QAT:

| Tasks  |Version|Filter|n-shot|    Metric     |   | Value |   |Stderr|
|--------|------:|------|------|---------------|---|------:|---|------|
|wikitext|      2|none  |None  |bits_per_byte  |↓  | 0.9821|±  |   N/A|
|        |       |none  |None  |byte_perplexity|↓  | 1.9754|±  |   N/A|
|        |       |none  |None  |word_perplexity|↓  |38.1039|±  |   N/A|

Without QAT:

| Tasks  |Version|Filter|n-shot|    Metric     |   |  Value  |   |Stderr|
|--------|------:|------|------|---------------|---|--------:|---|------|
|wikitext|      2|none  |None  |bits_per_byte  |↓  |   2.2017|±  |   N/A|
|        |       |none  |None  |byte_perplexity|↓  |   4.6003|±  |   N/A|
|        |       |none  |None  |word_perplexity|↓  |3501.1122|±  |   N/A|
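
As a quick consistency check on these tables: bits_per_byte is the base-2 logarithm of byte_perplexity, and both rows match to within rounding.

```python
# Sanity check (not from the PR): bits_per_byte = log2(byte_perplexity).
import math

for byte_ppl, reported_bpb in [(1.9754, 0.9821), (4.6003, 2.2017)]:
    print(f"log2({byte_ppl}) = {math.log2(byte_ppl):.4f} (reported: {reported_bpb})")
```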

pytorch-bot bot commented Oct 16, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1854

✅ No Failures

As of commit 258cd8b with merge base 506e099: 💚 Looks good so far! There are no failures yet. 💚

@facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Oct 16, 2024
@codecov-commenter commented Oct 16, 2024

Codecov Report

Attention: Patch coverage is 0% with 23 lines in your changes missing coverage. Please review.

Project coverage is 69.19%. Comparing base (c70ad29) to head (5aef800).
Report is 11 commits behind head on main.

| Files with missing lines   | Patch % | Lines         |
|----------------------------|--------:|---------------|
| recipes/qat_distributed.py |   0.00% | 23 Missing ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #1854      +/-   ##
==========================================
+ Coverage   67.30%   69.19%   +1.89%     
==========================================
  Files         304      305       +1     
  Lines       16000    16031      +31     
==========================================
+ Hits        10768    11092     +324     
+ Misses       5232     4939     -293     
```


@joecummings (Contributor) left a comment
Can you attach some output from your run? Then this looks good to me.

@joecummings merged commit 96dea61 into pytorch:main on Nov 8, 2024
17 checks passed
@ebsmothers mentioned this pull request on Nov 26, 2024
andrewor14 added a commit to andrewor14/torchtune that referenced this pull request on May 12, 2025:
**Summary:** Similar to pytorch#1854. Update the `qat_distributed` recipe to mirror `full_finetune_distributed` up to a6db644. The one major new feature excluded from `qat_distributed` is FP8 finetuning (pytorch#2546), since FP8 QAT is not yet supported in torchao.

Diff between full finetune and QAT recipes: P1809370361
```
diff --color recipes/full_finetune_distributed.py recipes/qat_distributed.py
```

**Test Plan:**

Finetune:
```
tune run --nnodes 1 --nproc_per_node 4 qat_distributed --config llama3_2/3B_qat_full \
    epochs=1 \
    batch_size=16 \
    dataset._component_=torchtune.datasets.alpaca_cleaned_dataset \
    checkpointer.output_dir=/home/andrewor/local/logs/tune/Llama3.2-3B_alpaca_qat \
    output_dir=/home/andrewor/local/logs/tune/Llama3.2-3B_alpaca_qat/metrics \
    metric_logger.log_dir=/home/andrewor/local/logs/tune/Llama3.2-3B_alpaca_qat/metrics \
    quantizer._component_=torchtune.training.quantization.Int8DynActInt4WeightQATQuantizer \
    quantizer.groupsize=32
```
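
As an aside on the `quantizer._component_` / `quantizer.groupsize` overrides used above: a `_component_`-style config entry is resolved to a dotted import path and instantiated with the remaining keys as constructor arguments. The snippet below is a simplified stand-in for that pattern (it mirrors the idea behind torchtune's config instantiation but is not the library's actual code); the helper name is made up for illustration.

```python
# Simplified illustration of how a `_component_` config entry becomes an object.
# `instantiate_component` is a hypothetical helper, not a torchtune API.
import importlib

def instantiate_component(cfg: dict):
    # Split "package.module.ClassName" into a module path and an attribute name.
    module_path, _, name = cfg["_component_"].rpartition(".")
    cls = getattr(importlib.import_module(module_path), name)
    kwargs = {k: v for k, v in cfg.items() if k != "_component_"}
    return cls(**kwargs)

quantizer_cfg = {
    "_component_": "torchtune.training.quantization.Int8DynActInt4WeightQATQuantizer",
    "groupsize": 32,
}
quantizer = instantiate_component(quantizer_cfg)  # requires torchtune (and torchao) installed
```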

Quantize:
```
tune run quantize --config quantization \
    model._component_=torchtune.models.llama3_2.llama3_2_3b \
    checkpointer._component_=torchtune.training.FullModelHFCheckpointer \
    checkpointer.checkpoint_dir=/home/andrewor/local/logs/tune/Llama3.2-3B_alpaca_qat/epoch_0 \
    checkpointer.output_dir=/home/andrewor/local/logs/tune/Llama3.2-3B_alpaca_qat/epoch_0_out \
    'checkpointer.checkpoint_files=[model-00001-of-00002.safetensors,model-00002-of-00002.safetensors]' \
    checkpointer.model_type=LLAMA3 \
    quantizer._component_=torchtune.training.quantization.Int8DynActInt4WeightQuantizer \
    quantizer.groupsize=32
```

Eval:
```
tune run eleuther_eval --config eleuther_evaluation \
    batch_size=1 \
    'tasks=[wikitext]' \
    model._component_=torchtune.models.llama3_2.llama3_2_3b \
    checkpointer._component_=torchtune.training.FullModelTorchTuneCheckpointer \
    checkpointer.checkpoint_dir=/home/andrewor/local/logs/tune/Llama3.2-3B_alpaca_qat/epoch_0 \
    checkpointer.output_dir=/home/andrewor/local/logs/tune/Llama3.2-3B_alpaca_qat/epoch_0_out \
    'checkpointer.checkpoint_files=[model-00001-of-00002-8da4w.ckpt]' \
    checkpointer.model_type=LLAMA3 \
    tokenizer._component_=torchtune.models.llama3.llama3_tokenizer \
    tokenizer.path=/tmp/Meta-Llama-3-8B-Instruct/original/tokenizer.model \
    quantizer._component_=torchtune.training.quantization.Int8DynActInt4WeightQuantizer \
    quantizer.groupsize=32
```

Results:
```
experiment_name          tok/s                peak_mem_active    peak_mem_alloc    peak_mem_reserved
-----------------------  -------------------  -----------------  ----------------  -------------------
Llama3.2-3B_alpaca_full  4677.163 (+0.000%)   12.261 (+0.000%)   12.261 (+0.000%)  15.778 (+0.000%)
Llama3.2-3B_alpaca_qat   1873.316 (-59.948%)  13.047 (+6.409%)   13.047 (+6.409%)  17.226 (+9.176%)

experiment_name          hellaswag_acc                   wikitext_word_perplexity
-----------------------  ------------------------------  -------------------------------
Llama3.2-3B_alpaca_full  0.470 quant, 0.534 float        18.563 quant, 12.364 float
Llama3.2-3B_alpaca_qat   0.511 quant, recovered 63.043%  13.792 quant, recovered 76.962%
```
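
Reading the "recovered" column: it appears to report how much of the quantization-induced degradation QAT wins back relative to the float baseline. For wikitext word perplexity, (18.563 - 13.792) / (18.563 - 12.364) is roughly 76.96%, in line with the 76.962% shown (the small difference comes from rounding in the table). A short sketch of that interpretation, which is my reading rather than a definition stated in the commit:

```python
# Assumed definition of "recovered": fraction of the PTQ-vs-float gap closed by QAT.
def recovered_fraction(ptq: float, qat: float, float_baseline: float) -> float:
    return (ptq - qat) / (ptq - float_baseline)

# Wikitext word perplexity from the table: PTQ 18.563, QAT 13.792, float 12.364.
print(f"{recovered_fraction(18.563, 13.792, 12.364):.3%}")  # ~76.96% (table: 76.962%)
```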