Export to ExecuTorch #32253

## Feature request

Unlock a new workflow for on-device use-cases via [**torch.export**](https://pytorch.org/tutorials/intermediate/torch_export_tutorial.html) and [**ExecuTorch**](https://pytorch.org/executorch/main/intro-overview.html).

Ideally, users get an end-to-end experience: load a pretrained transformer model from Hugging Face, export and lower it to `ExecuTorch`, and get reasonable performance out of the box.

For example:

1. Load a model with StaticCache:
```
model = AutoModelForCausalLM.from_pretrained(
    hf_model_repo,
    config=config,
    attn_implementation="sdpa",
    cache_config={
        "use_cache": True, 
        "cache_implementation": "static", 
        "max_cache_length": 128,
    },  # Mandatory field to set ONLY for "Export to ExecuTorch" workflow, optional in other use-cases
)
```

2. Then export the model with StaticCache:
```
exported_program = convert_and_export_with_cache(
    model,
    args=(model_inputs,),
    kwargs={"position_ids": <val>, "inputs_embeds": <val>, "cache_position": <val>},
)
```
Then lower the exported program further to `ExecuTorch`, with delegates for performance:
```
executorch_m = lower_to_executorch(
    exported_program,  # The program produced by the export step above
    recipes="xnnpack_fp32",  # Delegate to XNNPACK backend
)

# The lowered artifact can be saved into a `.pte` binary format for integration and distribution.
```
This gives you an on-device model with reasonable performance as a starting point.

From there, still within the `ExecuTorch` stack, you can tailor the experience to your use case and tune for better performance. Note that `ExecuTorch` supports delegation to the [XNNPACK backend](https://pytorch.org/executorch/main/native-delegates-executorch-xnnpack-delegate.html), [Apple Core ML](https://pytorch.org/executorch/main/build-run-coreml.html) and [MPS](https://github.com/pytorch/executorch/tree/main/examples/apple/mps), [Qualcomm QNN](https://pytorch.org/executorch/main/build-run-qualcomm-ai-engine-direct-backend.html), [ARM Ethos-U](https://pytorch.org/executorch/stable/executorch-arm-delegate-tutorial.html), [Vulkan GPU](https://pytorch.org/executorch/main/build-run-vulkan.html) and more. You can learn more by reading our [tutorial](https://pytorch.org/executorch/main/examples-end-to-end-to-lower-model-to-delegate.html).


3. Use the exported/lowered artifact for inference:
```
# The lowered artifact can run on a local device in the ExecuTorch runtime, in C++ or via pybind,
# providing the same experience as running inference with the eager model on a server.

generate(model=executorch_m, prompt="Hello world")  # Generates up to the maximum sequence/cache length
```

The example workflow above shows direct integration between `ExecuTorch` and HF `transformers` models. Eventually this workflow could be accessible via `optimum exporters-et`, `Transformers.js` or in [`ExecuTorch`](https://github.com/pytorch/executorch) and [`torchchat`](https://github.com/pytorch/torchchat).
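To illustrate why steps 1 and 2 fix `max_cache_length` up front, here is a minimal pure-Python sketch of the idea behind a static cache: every slot is allocated once, and each update overwrites the slot at `cache_position` instead of growing the buffer, so shapes stay constant for export. `ToyStaticCache` and its methods are hypothetical names for illustration only; this is not the `transformers` `StaticCache` implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ToyStaticCache:
    """Hypothetical fixed-capacity key/value store (illustration only)."""

    max_cache_length: int
    head_dim: int
    keys: List[List[float]] = field(init=False)
    values: List[List[float]] = field(init=False)

    def __post_init__(self) -> None:
        # Pre-allocate every slot once so the buffers never change shape.
        self.keys = [[0.0] * self.head_dim for _ in range(self.max_cache_length)]
        self.values = [[0.0] * self.head_dim for _ in range(self.max_cache_length)]

    def update(self, cache_position: int, key: List[float], value: List[float]) -> None:
        # Overwrite the slot in place instead of appending a new one.
        if not 0 <= cache_position < self.max_cache_length:
            raise IndexError("cache_position is outside the pre-allocated cache")
        if len(key) != self.head_dim or len(value) != self.head_dim:
            raise ValueError("key and value must have length head_dim")
        self.keys[cache_position] = list(key)
        self.values[cache_position] = list(value)

    def shape(self) -> Tuple[int, int]:
        # Measured from the actual buffer, not from the configured sizes.
        return (len(self.keys), len(self.keys[0]))


cache = ToyStaticCache(max_cache_length=4, head_dim=2)
shape_before = cache.shape()
cache.update(0, [1.0, 2.0], [3.0, 4.0])
cache.update(1, [5.0, 6.0], [7.0, 8.0])
shape_after = cache.shape()  # Same as shape_before: updates never resize the buffer.
```

A dynamically growing cache would change shape on every decoded token, which is what makes it awkward to capture in a single exported graph.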

## Motivation

Unlock a whole new on-device experience of using HuggingFace models w/o leaving the PyTorch ecosystem ([`ExecuTorch`](https://pytorch.org/executorch/main/intro-overview.html) is native PyTorch!)


## Issues Tracker

### Fundamental
- [x] Make `StaticCache` compatible with `torch.export`: PR #32168
- [x] #32500: PR #32830
- [x] #32503
- [ ] Support dynamic length slicing in `StaticCache`: PR #30862
- [x] #32504: PR #33707
- [ ] Convert Hugging Face tokenizer files into a format the C++ `llm_runner` can consume: https://github.com/pytorch/executorch/issues/6813

### E2E workflow
- [x] Umbrella task for `Optimum` enablement: https://github.com/huggingface/optimum/issues/2128
- [ ] Umbrella task for `Transformers.js` enablement: https://github.com/huggingface/transformers.js/issues/1039

### Optimization
- [ ] Support quantized models w/ ExecuTorch + TorchAO #34787

### Models
- [x] #33709: PR #33707
- [x] #32505: PR #34101
- [ ] #32506
- [x] #32507: PR #34424
- [ ] #32508
- [ ] #32509
- [x] #33833: PR #34102
- [x] #33834: PR #36486
- [x] #33835: PR #34475
- [x] #33836: PR #34476
- [ ] #33837
- [x] #33838
- [ ] #33839
- [x] #33840: PR #34181
- [x] #33841: PR #34425
- [x] #33843: PR #34473
- [x] #34879
- [ ] #35327
- [x] #37727: PR https://github.com/huggingface/transformers/pull/37728

And more! We aim to expand model coverage massively. Please comment below if you are interested in a particular model for on-device use cases!

Even better, we warmly welcome direct contributions from the community to support more models in exporting to ExecuTorch!
- [x] Cohere2: #35224
- [x] OLMo2: #34551
- [x] DPT, DepthAnything & ZoeDepth: #34103
- [x] #33842: PR https://github.com/huggingface/optimum-executorch/pull/45
- [x] https://github.com/huggingface/transformers/issues/37844: https://github.com/huggingface/transformers/pull/36878


## Your contribution

1. Co-design the "Export to ExecuTorch" workflow.
2. Co-design `generate` for exported models and its integration in `Optimum`.
3. Identify and fill gaps in DevX and UX

Here is how ExecuTorch implements `generate()` for llama2/3 in [eager Python](https://github.com/pytorch/executorch/blob/main/examples/models/llama2/runner/generation.py) and [C++](https://github.com/pytorch/executorch/blob/main/examples/models/llama2/runner/runner.cpp).
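As a starting point for the `generate` co-design, here is a rough sketch of a greedy decode loop bounded by the static cache length. It is not the code from the linked runners. `greedy_generate`, `StepFn` and `toy_step` are hypothetical names, and the one-token-at-a-time prefill is an assumption made to keep the example small.

```python
from typing import Callable, List

# Hypothetical signature: (token_id, cache_position) -> next_token_id.
# A real exported model would return logits; argmax is folded into this callable.
StepFn = Callable[[int, int], int]


def greedy_generate(
    step: StepFn, prompt_ids: List[int], max_cache_length: int, eos_id: int
) -> List[int]:
    if not prompt_ids:
        raise ValueError("prompt_ids must not be empty")
    if len(prompt_ids) > max_cache_length:
        raise ValueError("prompt does not fit in the static cache")

    tokens = list(prompt_ids)

    # Prefill: feed the prompt token by token; only the last prediction is kept.
    next_id = eos_id
    for position, token in enumerate(prompt_ids):
        next_id = step(token, position)

    # Decode: stop at EOS or when the pre-allocated cache is full.
    while len(tokens) < max_cache_length:
        tokens.append(next_id)
        if next_id == eos_id or len(tokens) == max_cache_length:
            break
        next_id = step(next_id, len(tokens) - 1)
    return tokens


def toy_step(token_id: int, cache_position: int) -> int:
    # Stand-in for a model forward pass: always predicts token_id + 1.
    return token_id + 1


full_cache = greedy_generate(toy_step, [1, 2], max_cache_length=5, eos_id=99)
early_stop = greedy_generate(toy_step, [1, 2], max_cache_length=5, eos_id=4)
```

The hard stop at `max_cache_length` mirrors the workflow above, where generation runs at most up to the cache length chosen at export time.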


cc: @amyeroberts @gante @ArthurZucker @michaelbenayoun 
