
Use Core ML Quantizer in Llama Export #4458


Closed · wants to merge 1 commit

Conversation

@YifanShenSZ (Collaborator) commented Jul 30, 2024

This PR is an initial step toward adding the Core ML quantizer to Llama export. We start with "quantize the model with the XNNPack quantizer, then fully delegate to the Core ML backend"; "quantize with the Core ML quantizer" is still under development.
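As a rough sketch of that first path (not the exact export code in this PR), assuming the standard PT2E quantization APIs and the ExecuTorch Core ML partitioner; the toy module and import paths are illustrative:

```python
import torch
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)
from executorch.backends.apple.coreml.partition import CoreMLPartitioner
from executorch.exir import to_edge

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 16)

    def forward(self, x):
        return torch.nn.functional.relu(self.linear(x))

model = TinyModel().eval()
example_inputs = (torch.randn(1, 16),)

# 1. Capture the model (the exact capture API varies across PyTorch versions)
#    and quantize it with the XNNPack quantizer via the PT2E flow.
captured = torch.export.export(model, example_inputs).module()
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
prepared = prepare_pt2e(captured, quantizer)
prepared(*example_inputs)  # one calibration pass with representative inputs
quantized = convert_pt2e(prepared)

# 2. Re-export the quantized graph and fully delegate it to the Core ML backend.
edge = to_edge(torch.export.export(quantized, example_inputs))
edge = edge.to_backend(CoreMLPartitioner())
executorch_program = edge.to_executorch()
```

The idea is that the quant/dequant ops inserted by convert_pt2e end up in the re-exported graph, so the Core ML partitioner sees an already-quantized program and can take it wholesale.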

This PR does two things:

  1. Adds Core ML quantizer options and uses them in Llama export.
  2. Uses different iOS deployment targets for different features: an fp16 model can run on iOS 15, 8a8w quantization requires iOS 17, and 4w quantization requires iOS 18 (see the mapping sketch after this list).
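As a rough illustration of point 2, the minimum deployment target could be derived from the requested quantization mode along these lines. This is a sketch using coremltools' `ct.target` enum (assuming a coremltools version that defines `ct.target.iOS18`); the mode names and helper are hypothetical, not the actual flags in the export script:

```python
import coremltools as ct

# Hypothetical mode names; the real export script may spell these differently.
_MIN_DEPLOYMENT_TARGET = {
    None: ct.target.iOS15,    # plain fp16 model
    "8a8w": ct.target.iOS17,  # 8-bit activations, 8-bit weights
    "4w": ct.target.iOS18,    # 4-bit weight-only quantization
}

def minimum_deployment_target(quantize_mode=None):
    """Return the lowest iOS version that supports the requested quantization."""
    return _MIN_DEPLOYMENT_TARGET[quantize_mode]
```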

pytorch-bot (bot) commented Jul 30, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/4458

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 68d345c with merge base 5a20a49:


This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the "CLA Signed" label on Jul 30, 2024.
@YifanShenSZ (Collaborator, Author) commented

Adding @cccclai @shoumikhin as reviewers.

(I cannot assign reviewers myself 😂 so I'm @-ing you here.)

@YifanShenSZ (Collaborator, Author) commented Jul 30, 2024

@cymbalrush now that we have skip_model_load=True in preprocess, the compute unit setting in the Llama export script is no longer necessary, right?

-        # using `ComputeUnit.ALL` can increase the model load time, default to `ComputeUnit.CPU_AND_GPU`
-        compute_unit=ct.ComputeUnit[ct.ComputeUnit.CPU_AND_GPU.name.upper()],
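For context, both knobs roughly correspond to arguments of coremltools' `ct.convert`, which the Core ML backend invokes during preprocess. A minimal sketch of the trade-off being discussed, where `exported_model` is a placeholder rather than the actual Llama export code:

```python
import coremltools as ct

# Sketch only: in practice this call happens inside the Core ML backend's preprocess.
mlmodel = ct.convert(
    exported_model,                             # placeholder for the captured model
    minimum_deployment_target=ct.target.iOS17,
    compute_units=ct.ComputeUnit.CPU_AND_GPU,   # the setting under discussion
    skip_model_load=True,                       # convert without loading/compiling the model
)
```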

@cccclai (Contributor) commented Jul 30, 2024

Thanks for putting up the PR! For the change in llama_transformer.py, there was actually a CI regression, as shown in #3786.

@YifanShenSZ (Collaborator, Author) commented Jul 30, 2024

> Thanks for putting up the PR! For the change in llama_transformer.py, there was actually a CI regression, as shown in #3786.

Yeah, this is quite interesting:

  1. I can export Stories 110M locally without issue.
  2. I took a look at the failed CI in #3786, and it's the same error we discussed on Slack when I tried to tag mutated buffers... so it's probably flakiness, e.g. in the exported-program constructor?

Anyway, since that's just a minor fix, I reverted the llama_transformer.py and sdpa.py changes to make CI green.

@YifanShenSZ force-pushed the llama_coreml-quantizer branch from 2902624 to 7ca8eba on July 30, 2024 05:29
@cccclai (Contributor) left a comment

Looks good in general. Can we fix the CI and rename coreml_xnnpack and coreml_xnnpack_qc4?

@cymbalrush (Contributor) commented

> @cymbalrush now that we have skip_model_load=True in preprocess, the compute unit setting in the Llama export script is no longer necessary, right?
>
> -        # using `ComputeUnit.ALL` can increase the model load time, default to `ComputeUnit.CPU_AND_GPU`
> -        compute_unit=ct.ComputeUnit[ct.ComputeUnit.CPU_AND_GPU.name.upper()],

We still want it to be CPU_AND_GPU; I am concerned about the model load time.

@YifanShenSZ force-pushed the llama_coreml-quantizer branch 2 times, most recently from 7b993ae to 4ef6875 on July 30, 2024 17:35
@YifanShenSZ (Collaborator, Author) commented

> We still want it to be CPU_AND_GPU; I am concerned about the model load time.

OK, reverted that change to keep CPU_AND_GPU there.

@cccclai (Contributor) commented Jul 30, 2024

There is also a lint error in the CI. Mind addressing it?

@facebook-github-bot (Contributor) commented

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@YifanShenSZ (Collaborator, Author) commented

> There is also a lint error in the CI. Mind addressing it?

Applied the lint fix

@facebook-github-bot (Contributor) commented

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cccclai (Contributor) left a comment

There are some Pyre errors, but they aren't part of the OSS CI...

@cccclai (Contributor) commented Jul 31, 2024

lint error again 😅

@YifanShenSZ force-pushed the llama_coreml-quantizer branch from d88bcad to 68d345c on July 31, 2024 04:34
@facebook-github-bot (Contributor) commented

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@YifanShenSZ (Collaborator, Author) commented

> lint error again 😅

Fixed 😅

@facebook-github-bot (Contributor) commented

@cccclai merged this pull request in 6bfefa8.
