
Use Core ML Quantizer in Llama Export #4458


Closed · wants to merge 1 commit

Conversation

@YifanShenSZ (Collaborator) commented Jul 30, 2024

This PR is an initial step toward adding the Core ML quantizer to Llama export. We start with "quantize the model with the XNNPack quantizer, then fully delegate to the Core ML backend"; "quantize with the Core ML quantizer" is still under development.
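As a rough sketch of that first path (not the exact export code in this PR), assuming the standard PT2E quantization APIs and the ExecuTorch Core ML partitioner; the toy module and import paths are illustrative:

```python
import torch
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)
from executorch.backends.apple.coreml.partition import CoreMLPartitioner
from executorch.exir import to_edge

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 16)

    def forward(self, x):
        return torch.nn.functional.relu(self.linear(x))

model = TinyModel().eval()
example_inputs = (torch.randn(1, 16),)

# 1. Capture the model (the exact capture API varies across PyTorch versions)
#    and quantize it with the XNNPack quantizer via the PT2E flow.
captured = torch.export.export(model, example_inputs).module()
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
prepared = prepare_pt2e(captured, quantizer)
prepared(*example_inputs)  # one calibration pass with representative inputs
quantized = convert_pt2e(prepared)

# 2. Re-export the quantized graph and fully delegate it to the Core ML backend.
edge = to_edge(torch.export.export(quantized, example_inputs))
edge = edge.to_backend(CoreMLPartitioner())
executorch_program = edge.to_executorch()
```

The idea is that the quant/dequant ops inserted by convert_pt2e end up in the re-exported graph, so the Core ML partitioner sees an already-quantized program and can take it wholesale.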

This PR does two things:

  1. Adds Core ML quantizer options and uses them in Llama export.
  2. Uses different iOS deployment targets for different features: an fp16 model can run on iOS 15, 8a8w quantization requires iOS 17, and 4w quantization requires iOS 18 (see the mapping sketch after this list).
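As a rough illustration of point 2, the minimum deployment target could be derived from the requested quantization mode along these lines. This is a sketch using coremltools' `ct.target` enum (assuming a coremltools version that defines `ct.target.iOS18`); the mode names and helper are hypothetical, not the actual flags in the export script:

```python
import coremltools as ct

# Hypothetical mode names; the real export script may spell these differently.
_MIN_DEPLOYMENT_TARGET = {
    None: ct.target.iOS15,    # plain fp16 model
    "8a8w": ct.target.iOS17,  # 8-bit activations, 8-bit weights
    "4w": ct.target.iOS18,    # 4-bit weight-only quantization
}

def minimum_deployment_target(quantize_mode=None):
    """Return the lowest iOS version that supports the requested quantization."""
    return _MIN_DEPLOYMENT_TARGET[quantize_mode]
```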

pytorch-bot (bot) commented Jul 30, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/4458

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 68d345c with merge base 5a20a49:


This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the "CLA Signed" label on Jul 30, 2024.
@YifanShenSZ (Collaborator, Author) commented

Adding @cccclai @shoumikhin as reviewers.

(I cannot assign reviewers myself 😂 so I'm @-ing you here.)

@YifanShenSZ (Collaborator, Author) commented Jul 30, 2024

@cymbalrush now that we have skip_model_load=True in preprocess, the compute unit setting in the Llama export script is no longer necessary, right?

-        # using `ComputeUnit.ALL` can increase the model load time, default to `ComputeUnit.CPU_AND_GPU`
-        compute_unit=ct.ComputeUnit[ct.ComputeUnit.CPU_AND_GPU.name.upper()],
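For context, both knobs roughly correspond to arguments of coremltools' `ct.convert`, which the Core ML backend invokes during preprocess. A minimal sketch of the trade-off being discussed, where `exported_model` is a placeholder rather than the actual Llama export code:

```python
import coremltools as ct

# Sketch only: in practice this call happens inside the Core ML backend's preprocess.
mlmodel = ct.convert(
    exported_model,                             # placeholder for the captured model
    minimum_deployment_target=ct.target.iOS17,
    compute_units=ct.ComputeUnit.CPU_AND_GPU,   # the setting under discussion
    skip_model_load=True,                       # convert without loading/compiling the model
)
```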

@cccclai (Contributor) commented Jul 30, 2024

Thanks for putting up the PR! For the change in llama_transformer.py, there was actually a CI regression, as shown in #3786.

@YifanShenSZ (Collaborator, Author) commented Jul 30, 2024

> Thanks for putting up the PR! For the change in llama_transformer.py, there was actually a CI regression, as shown in #3786.

Yeah, this is quite interesting:

  1. I can export Stories 110M locally without issue.
  2. I took a look at the failed CI in #3786, and it's the same error we discussed on Slack when I tried to tag mutated buffers... so it's probably flakiness, e.g. in the exported-program constructor?

Anyway, since that's just a minor fix, I reverted the llama_transformer.py and sdpa.py changes to make CI green.

@YifanShenSZ force-pushed the llama_coreml-quantizer branch from 2902624 to 7ca8eba on July 30, 2024 05:29
@cccclai (Contributor) left a comment

Looks good in general. Can we fix the CI and rename coreml_xnnpack and coreml_xnnpack_qc4?

@cymbalrush (Contributor) commented

> @cymbalrush now that we have skip_model_load=True in preprocess, the compute unit setting in the Llama export script is no longer necessary, right?
>
> -        # using `ComputeUnit.ALL` can increase the model load time, default to `ComputeUnit.CPU_AND_GPU`
> -        compute_unit=ct.ComputeUnit[ct.ComputeUnit.CPU_AND_GPU.name.upper()],

We still want it to be CPU_AND_GPU; I am concerned about the model load time.

@YifanShenSZ force-pushed the llama_coreml-quantizer branch 2 times, most recently from 7b993ae to 4ef6875 on July 30, 2024 17:35
@YifanShenSZ (Collaborator, Author) commented

> We still want it to be CPU_AND_GPU; I am concerned about the model load time.

OK, reverted that change to keep CPU_AND_GPU there.

@cccclai (Contributor) commented Jul 30, 2024

There is also a lint error in the CI. Mind addressing it?

@facebook-github-bot (Contributor) commented

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@YifanShenSZ (Collaborator, Author) commented

> There is also a lint error in the CI. Mind addressing it?

Applied the lint fix

@facebook-github-bot (Contributor) commented

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cccclai (Contributor) left a comment

There are some Pyre errors, but they aren't part of the OSS CI...

@cccclai (Contributor) commented Jul 31, 2024

lint error again 😅

@YifanShenSZ force-pushed the llama_coreml-quantizer branch from d88bcad to 68d345c on July 31, 2024 04:34
@facebook-github-bot (Contributor) commented

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@YifanShenSZ (Collaborator, Author) commented

> lint error again 😅

Fixed 😅

@facebook-github-bot (Contributor) commented

@cccclai merged this pull request in 6bfefa8.
