Feature request
Unlock a new workflow for on-device use-cases via torch.export and ExecuTorch.
Ideally, users get an end-to-end experience: load a pretrained transformer model from HuggingFace, export and lower it to ExecuTorch, and get reasonable performance out of the box.
For example:
- Load a model with StaticCache:
model = AutoModelForCausalLM.from_pretrained(
    hf_model_repo,
    config=config,
    attn_implementation="sdpa",
    cache_config={
        "use_cache": True,
        "cache_implementation": "static",
        "max_cache_length": 128,
    },  # Mandatory field to set ONLY for "Export to ExecuTorch" workflow, optional in other use-cases
)
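To illustrate why a static cache suits export, here is a minimal sketch in plain Python. It is not the `StaticCache` implementation from transformers: the class name `ToyStaticCache`, its `update` method, and the use of lists instead of tensors are all assumptions made for illustration. The point it shows is that the buffers are allocated once at `max_cache_length` and only written in place at a given cache position, so their shape never changes between decoding steps.

```python
# Hypothetical toy model of a static KV cache (not the transformers API).
from typing import List, Tuple


class ToyStaticCache:
    def __init__(self, max_cache_length: int, head_dim: int) -> None:
        self.max_cache_length = max_cache_length
        # Buffers are allocated once at full size and never resized.
        self.keys: List[List[float]] = [[0.0] * head_dim for _ in range(max_cache_length)]
        self.values: List[List[float]] = [[0.0] * head_dim for _ in range(max_cache_length)]

    def update(
        self, key: List[float], value: List[float], cache_position: int
    ) -> Tuple[List[List[float]], List[List[float]]]:
        # Writing past the preallocated length is an error, not a resize.
        if not 0 <= cache_position < self.max_cache_length:
            raise IndexError(f"cache_position {cache_position} is outside the cache")
        self.keys[cache_position] = list(key)
        self.values[cache_position] = list(value)
        return self.keys, self.values


cache = ToyStaticCache(max_cache_length=4, head_dim=2)
cache.update([1.0, 2.0], [3.0, 4.0], cache_position=0)
cache.update([5.0, 6.0], [7.0, 8.0], cache_position=1)
print(len(cache.keys))  # stays at max_cache_length regardless of how many slots are filled
```

Because the cache length is fixed up front, a value such as `"max_cache_length": 128` also bounds how many tokens a single generation can hold.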
- Then export the model with StaticCache:
exported_program = convert_and_export_with_cache(
    model,
    args=(model_inputs,),
    kwargs={"position_ids": <val>, "inputs_embeds": <val>, "cache_position": <val>},
)
- Then lower the exported program to ExecuTorch with delegates for performance:
executorch_m = lower_to_executorch(
    model,
    recipes="xnnpack_fp32",  # Delegate to XNNPACK backend
)
# The lowered artifact can be saved into a `.pte` binary format for integration and distribution.
This gives you an on-device model with reasonable performance as a starting point.
From there, still within the ExecuTorch stack, you can tailor the experience to your use-cases and tune for better performance. Note that ExecuTorch supports delegation to the XNNPACK backend, Apple Core ML and MPS, Qualcomm QNN, ARM Ethos-U, Vulkan GPU and more. You can learn more by reading our tutorial.
- Use the exported/lowered artifact for inference:
# The lowered artifact can run on a local device in the ExecuTorch runtime, in C++ or via pybind, providing the same experience as running inference with the eager model on a server.
generate(model=executorch_m, prompt="Hello world") # Will generate up to the maximal sequence length/cache length
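As a rough illustration of "generate up to the cache length", the sketch below runs a greedy decoding loop that is bounded by a fixed cache size. It is not the proposed `generate` API or the ExecuTorch runner: `greedy_generate`, the `step` callable (token id and cache position in, next token id out), and `toy_step` are hypothetical stand-ins for a lowered model, and feeding the prompt one token at a time is a simplifying assumption.

```python
# Hypothetical sketch of a cache-bounded greedy decoding loop (not a real API).
from typing import Callable, List, Optional


def greedy_generate(
    step: Callable[[int, int], int],
    prompt_ids: List[int],
    max_cache_length: int,
    eos_id: Optional[int] = None,
) -> List[int]:
    if not prompt_ids:
        raise ValueError("prompt_ids must not be empty")
    if len(prompt_ids) > max_cache_length:
        raise ValueError("prompt does not fit in the cache")

    tokens = list(prompt_ids)
    next_id = tokens[-1]
    # Prefill: feed the prompt one token per step, filling cache slots 0..len-1.
    for position, token in enumerate(tokens):
        next_id = step(token, position)

    # Decode: stop at EOS or when every cache slot is used.
    while len(tokens) < max_cache_length:
        tokens.append(next_id)
        if next_id == eos_id or len(tokens) == max_cache_length:
            break
        next_id = step(next_id, len(tokens) - 1)
    return tokens


def toy_step(token_id: int, cache_position: int) -> int:
    # Stand-in for the model: always predicts the next integer modulo 10.
    return (token_id + 1) % 10


print(greedy_generate(toy_step, [1, 2], max_cache_length=6))  # [1, 2, 3, 4, 5, 6]
```

The returned list includes the prompt, so the number of newly generated tokens is at most `max_cache_length - len(prompt_ids)`.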
The example workflow above shows direct integration between ExecuTorch and HF transformers models. Eventually this workflow could be accessible via optimum exporters-et, Transformers.js or in ExecuTorch and torchchat.
Motivation
Unlock a whole new on-device experience of using HuggingFace models w/o leaving the PyTorch ecosystem (ExecuTorch is native PyTorch!)
Issues Tracker
Fundamental
- `StaticCache` compatible with `torch.export`: PR "Make static cache compatible with torch.export" #32168
- `StaticCache`: PR "[WIP] Dynamic length in static cache" #30862
- `generate` (inference) for torch exported text-generation models #32504: PR "Generate using exported model and enable gemma2-2b in ExecuTorch" #33707
- `llm_runner` consumable: "How to convert tokenizer of SmolLM model as accepted by executorch" pytorch/executorch#6813
E2E workflow
- `Optimum` enablement: "Export-to-ExecuTorch via Optimum integration" optimum#2128
- `Transformers.js` enablement: "Export-to-ExecuTorch via transformers.js integration" transformers.js#1039
Optimization
Models
And more! We are ambitious about massively expanding model coverage. Please comment below if you are interested in a particular model for on-device use-cases!
Even better, we warmly welcome direct contributions from the community to support more models in exporting to ExecuTorch!
- `torch.export` #34103
Your contribution
- Co-design the "Export to ExecuTorch" workflow.
- Co-design the `generate` for exported models and the integration in `Optimum`.
- Identify and fill gaps in DevX and UX.
Here is how ExecuTorch implements the `generate()` for llama2/3 in eager python and c++.
cc: @amyeroberts @gante @ArthurZucker @michaelbenayoun