Description
🚀 The feature, motivation and pitch
To make language models easy to use on ExecuTorch, we likely need to define the "core" components for running LLMs on ET. This is an evolving area, of course, but there is a lot of interest right now in running decoder-only, auto-regressive text generation, and we want to provide a clean solution for it. The most important use case, in my opinion, is a streamlined path for runtime integration of HF models, though the more general the solution, the better.
In a nutshell, I hope to be able to give users "the" way to run LLMs on ET. Don't force them into technical decisions when we can avoid it. We can have lower-level composable components and power-user APIs, but I hope to handle the majority of use cases by being able to tell users: "get your model from HF, export with Optimum, and here are the runner APIs in C++, Java, and Obj-C/Swift". Minimal decision making or lower-level understanding required.
The Android + iOS bindings are likely the most critical here, but I want to ensure that we standardize on the components and API design before building them.
Requirements
- Define "core" runtime components needed to run text generation decoder models, particularly with a focus on models from HF through Optimum.
- Move all reusable components out of the examples directory (may already be done).
- Clearly document these components.
- Provide API bindings for these components from Java and Objective-C / Swift. Ideally, the API surface should match as closely as possible while respecting the appropriate language conventions, such that there is parity in usage and capability between Java, Obj-C/Swift, and C++ runners.
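To make the desired API surface concrete, here is a hypothetical sketch of what a minimal C++ runner interface could look like, with the intent that Java and Obj-C/Swift bindings mirror the same load/generate shape. All names here (`LlmRunner`, `GenerationConfig`, `EchoRunner`) are illustrative assumptions for discussion, not the actual ExecuTorch API; the stub "runner" just echoes the prompt token by token so the streaming shape is easy to see.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical generation parameters; field names are illustrative only.
struct GenerationConfig {
  int max_new_tokens = 128;
  float temperature = 0.8f;
};

// Hypothetical runner interface, intended to have 1:1 counterparts in
// Java and Obj-C/Swift: load a model once, then stream tokens via callback.
class LlmRunner {
 public:
  virtual ~LlmRunner() = default;
  virtual bool load(const std::string& model_path) = 0;
  // Streams each decoded token piece to `on_token`; returns the full text.
  virtual std::string generate(
      const std::string& prompt,
      const GenerationConfig& config,
      const std::function<void(const std::string&)>& on_token) = 0;
};

// Toy stub: "decodes" by echoing the prompt word by word, standing in for
// a real decoder-only auto-regressive loop over an exported .pte model.
class EchoRunner : public LlmRunner {
 public:
  bool load(const std::string& model_path) override {
    loaded_ = !model_path.empty();
    return loaded_;
  }
  std::string generate(
      const std::string& prompt,
      const GenerationConfig& config,
      const std::function<void(const std::string&)>& on_token) override {
    if (!loaded_) return "";
    std::string out;
    int emitted = 0;
    size_t start = 0;
    while (start < prompt.size() && emitted < config.max_new_tokens) {
      size_t end = prompt.find(' ', start);
      if (end == std::string::npos) end = prompt.size();
      std::string token = prompt.substr(start, end - start) + " ";
      on_token(token);  // streaming callback, e.g. to update a chat UI
      out += token;
      ++emitted;
      start = end + 1;
    }
    return out;
  }

 private:
  bool loaded_ = false;
};
```

A matching Java binding would expose the same `load`/`generate` pair (with the callback as a listener interface), and Swift the same pair with a closure, so usage and capability stay in parity across the three languages.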
CC @guangy10 @larryliu0820 for core APIs, @kirklandsign @shoumikhin for language bindings, @byjlw @mergennachin for usability workstream
Alternatives
No response
Additional context
No response
RFC (Optional)
No response