Support for Activated LoRA (https://github.com/ggml-org/llama.cpp/issues/15212) #15213
Replies: 1 comment
Reply (excerpt): initial support and next steps. The current implementation focuses on enabling/disabling the adapter based on the presence of the invocation sequence, splitting the prefill batch at that point and setting the adapter scale accordingly.
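To make that mechanism concrete, here is a minimal sketch of the split against the current llama.cpp adapter API (`llama_set_adapter_lora`, `llama_clear_adapter_lora`, `llama_batch_get_one`). The helper names, the naive invocation-sequence scan, and the scale values are assumptions for illustration, not the actual patch.

```cpp
// Minimal sketch (not the actual implementation): split the prefill at the
// end of the invocation sequence and toggle the adapter scale around it.
// Error handling is omitted; llama_decode infers positions for the batch.
#include "llama.h"

#include <vector>

// Index of the first token *after* the invocation sequence, or -1 if the
// sequence does not occur in the prompt. (Naive scan, for illustration.)
static int find_invocation_end(const std::vector<llama_token> & prompt,
                               const std::vector<llama_token> & invocation) {
    if (invocation.empty() || prompt.size() < invocation.size()) {
        return -1;
    }
    for (size_t i = 0; i + invocation.size() <= prompt.size(); ++i) {
        bool match = true;
        for (size_t j = 0; j < invocation.size(); ++j) {
            if (prompt[i + j] != invocation[j]) { match = false; break; }
        }
        if (match) {
            return (int) (i + invocation.size());
        }
    }
    return -1;
}

static void prefill_with_alora(llama_context * ctx,
                               llama_adapter_lora * adapter,
                               std::vector<llama_token> & prompt,
                               const std::vector<llama_token> & invocation) {
    const int split = find_invocation_end(prompt, invocation);

    if (split < 0) {
        // No invocation sequence: the whole prefill runs on the base model.
        llama_clear_adapter_lora(ctx);
        llama_decode(ctx, llama_batch_get_one(prompt.data(), (int32_t) prompt.size()));
        return;
    }

    // Tokens up to and including the invocation sequence: adapter disabled,
    // so their KV entries match the base model's and stay reusable.
    llama_clear_adapter_lora(ctx);
    llama_decode(ctx, llama_batch_get_one(prompt.data(), split));

    // Tokens after the invocation sequence: adapter active.
    llama_set_adapter_lora(ctx, adapter, 1.0f);
    llama_decode(ctx, llama_batch_get_one(prompt.data() + split, (int32_t) prompt.size() - split));
}
```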
Apologies if this is slightly out of order: I have created issue #15212 requesting support for Activated LoRA adapters (see the issue for details and motivation). These adapters are invoked by including an invocation sequence in the prompt, and they only affect the weights for tokens that come after the invocation sequence. This means the adapter can reuse the KV cache from the base model, leading to huge improvements in time to first token (TTFT) compared to hot-swapping LoRA adapters, especially if you apply the adapter deep into a multi-turn interaction with the model. Appreciate any feedback or thoughts on this!
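A back-of-the-envelope illustration of the TTFT argument, with assumed token counts rather than measurements:

```cpp
// Illustrative arithmetic only: prefill work when an adapter is engaged
// mid-conversation. Token counts are made up for the example.
#include <cstdio>

int main() {
    const int n_cached     = 4000; // tokens already in the KV cache
    const int n_invocation = 16;   // length of the invocation sequence

    // Hot-swapping a LoRA adapter changes the weights at every position, so
    // the cached entries no longer match and the whole context is re-prefilled.
    const int hot_swap_prefill = n_cached + n_invocation;

    // An Activated LoRA leaves tokens before the invocation sequence on the
    // base weights, so the cache is reused and only the suffix is prefilled.
    const int alora_prefill = n_invocation;

    std::printf("hot-swap prefill: %d tokens\n", hot_swap_prefill); // 4016
    std::printf("aLoRA prefill:    %d tokens\n", alora_prefill);    //   16
    return 0;
}
```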
Our plan would be to start this integration work ourselves and submit a PR for this feature in the near future, building on the existing support for hot-swapping LoRA adapters.
This complements existing PRs to both Huggingface PEFT (huggingface/peft#2609) and vLLM (vllm-project/vllm#19710).
cc @gabe-l-hart