aLoRA Support #15327

Conversation
One interesting update: for the specific adapters I'm using to test here, the invocation sequence is an incomplete turn with a non-standard role rather than a normal assistant generation prompt. I've updated my sniff test script above to use client-side template expansion and the raw /v1/completions endpoint. NOTE: This is a property of these adapters and not of aLoRA in general; theoretically, an adapter could be trained to invoke on the full assistant generation prompt.
Add the following to your request to remove the assistant generation prompt: "add_generation_prompt": false

Ah, yep, that will definitely help, but it won't eliminate the issue entirely.

Ah, didn't notice that, I suppose that's just because the template doesn't properly handle unknown roles?

Yeah, the real issue is that it was trained to act like the generation prompt, so the activation sequence is intentionally an incomplete turn, but with a different role.
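For reference, a minimal sketch of the client-side expansion being discussed here, assuming the Granite 3.2 chat template and the certainty activation role used by the UQ adapter (the HF repo id and messages are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.2-8b-instruct")

messages = [
    {"role": "user", "content": "Who does Gabe work for?"},
    {"role": "assistant", "content": "Gabe works for IBM."},
]

# No assistant generation prompt: the activation sequence plays that role,
# just as an (intentionally incomplete) turn with a different role name.
raw_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=False
)
uq_prompt = raw_prompt + "<|start_of_role|>certainty<|end_of_role|>"
print(uq_prompt)
```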
Update

I've now added support for correctly applying the adapter only to the tokens starting at the invocation sequence.
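Conceptually, that amounts to locating the adapter's invocation token sequence inside the tokenized prompt and only enabling the adapter from that point on. A rough Python illustration (not the actual server code, which lives in tools/server):

```python
def find_alora_start(prompt_tokens: list[int], invocation_tokens: list[int]) -> int:
    """Return the index where the alora invocation sequence begins, or -1.

    This sketch searches back-to-front so that the last occurrence wins; the
    actual server logic may differ in details.
    """
    n, m = len(prompt_tokens), len(invocation_tokens)
    for start in range(n - m, -1, -1):
        if prompt_tokens[start:start + m] == invocation_tokens:
            return start
    return -1

# Tokens before the returned index can be prefilled (or pulled from cache)
# with the adapter disabled; tokens from the index onward need the adapter.
```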
Testing

I've got a few tweaks to my test script that allow it to simulate these conditions:

uq-req.py

```python
import json
import time
from transformers import AutoTokenizer
import requests
tokenizer = AutoTokenizer.from_pretrained("/Users/ghart/models/granite-3.2-8b-instruct")
url = "http://localhost:8081"
documents = [
{"text": "My name is Gabe"},
{"text": "I work for IBM"}
]
messages = [{"role": "user", "content": "Who does Gabe work for?"}]
adapter_message = {
"role": "certainty",
"content": ""
}
# Run base messages
print("----")
start = time.time()
resp = requests.post(f"{url}/v1/chat/completions", json={
"model": "unused",
"messages": messages,
"chat_template_kwargs": {
"documents": documents,
},
"temperature": 0.0,
"lora": [
# alora
{"id": 0, "scale": 0.0},
# lora
{"id": 1, "scale": 0.0},
],
})
end = time.time()
assistant_resp = resp.json()["choices"][0]["message"]
print(f"ASSISTANT RESPONSE ({end-start}s):")
print(assistant_resp["content"])
# UNCOMMENT this to extend the assistant's response so that it isn't cached
"""
assistant_resp["content"] = assistant_resp["content"] + "\nRespect my authority!"
"""
# Create the serialized version as a string so we can append the right prompt
messages.append(assistant_resp)
raw_prompt = tokenizer.apply_chat_template(messages, documents=documents, tokenize=False)
uq_prompt = raw_prompt + "<|start_of_role|>certainty<|end_of_role|>"
# Run with both adapters disabled
# UNCOMMENT this to exercise the case where the invocation string itself has
# been cached without the adapter
"""
print("----")
start = time.time()
resp = requests.post(f"{url}/v1/completions", json={
"model": "unused",
"prompt": uq_prompt,
"temperature": 0.0,
"lora": [
# alora
{"id": 0, "scale": 0.0},
# lora
{"id": 1, "scale": 0.0},
],
})
end = time.time()
js = resp.json()
uq_resp = js["choices"][0]["text"]
print(f"UQ RESPONSE w/out adapters ({end-start}s)")
print(uq_resp)
print(">>")
print(json.dumps(js["usage"], indent=2))
print(json.dumps(js["timings"], indent=2))
"""
# Run with the adapter and the prompt for UQ with the alora enabled
print("----")
start = time.time()
resp = requests.post(f"{url}/v1/completions", json={
"model": "unused",
"prompt": uq_prompt,
"temperature": 0.0,
"max_tokens": 100,
"lora": [
# alora
{"id": 0, "scale": 1.0},
# lora
{"id": 1, "scale": 0.0},
],
})
end = time.time()
js = resp.json()
uq_resp = js["choices"][0]["text"]
print(f"UQ RESPONSE w/ aLoRA ({end-start}s)")
print(uq_resp)
print(">>")
print(json.dumps(js["usage"], indent=2))
print(json.dumps(js["timings"], indent=2))
# Run with the adapter and the prompt for UQ with the lora enabled
print("----")
start = time.time()
resp = requests.post(f"{url}/v1/completions", json={
"model": "unused",
"prompt": uq_prompt,
"temperature": 0.0,
"max_tokens": 100,
"lora": [
# alora
{"id": 0, "scale": 0.0},
# lora
{"id": 1, "scale": 1.0},
],
})
end = time.time()
js = resp.json()
uq_resp = js["choices"][0]["text"]
print(f"UQ RESPONSE w/ LoRA ({end-start}s)")
print(uq_resp)
print(">>")
print(json.dumps(js["usage"], indent=2))
print(json.dumps(js["timings"], indent=2))
```

Don't use cached invocation sequence from base model

This simulates the case where the user ran the invocation sequence through the base model without the adapter and those tokens are cached (uncomment starting at line 57).
Don't use adapter for uncached tokens before invocation sequence

This simulates the case where, for some reason, there are additional tokens not pulled from cache that come before the invocation sequence (uncomment line 45).
I've now extended this to test with multiple adapters.

Adapters are converted using convert_lora_to_gguf.py.

Boot with adapters

```bash
./bin/llama-server \
-m ~/models/granite-3.2-8b-instruct/granite-3.2-8B-instruct-F16.gguf \
--lora ~/models/granite-3.2-8b-alora-uncertainty/granite-3.2-8B-alora-uncertainty-F16-LoRA.gguf \
--lora ~/models/granite-3.2-8b-alora-rag-answerability-prediction/granite-3.2-8B-alora-rag-answerability-prediction-F16-LoRA.gguf \
--port 8081 \
--jinja \
--reasoning-budget 0
```

Test Script

(sorry, it requires my personal logging framework just 'cuz 😉 ...)

alora-chat.py

```python
#!/usr/bin/env python
"""
This is a simple implementation of an interactive chat that leverages several
aLoRA adapters during the flow
"""
# Standard
import argparse
import os
# First Party
import alog
# Third Party
import requests
log = alog.use_channel("MAIN")
def make_document(i: int, doc: str) -> dict:
"""Make a document dict from the given doc as either text or a path"""
log.info("Adding document: %s", doc)
if os.path.exists(doc):
with open(doc, "r") as handle:
return {"text": handle.read(), "doc_id": i, "title": doc}
return {"text": doc, "doc_id": i}
def make_lora_req(adapter_ids: list[int], loras: list[int]) -> list[dict]:
return [
{"id": i, "scale": 1.0 if i in loras else 0.0}
for i in adapter_ids
]
def make_chat_req(messages: list[dict], documents: list[dict], adapter_ids: list[int], loras: list[int]) -> dict:
return {
"messages": messages,
"chat_template_kwargs": {
"documents": documents,
},
"temperature": 0.0,
"lora": make_lora_req(adapter_ids, loras),
}
def make_completion_req(prompt: str, documents: list[dict], adapter_ids: list[int], loras: list[int], **kwargs) -> dict:
kwargs.update({
"prompt": prompt,
"chat_template_kwargs": {
"documents": documents,
},
"temperature": 0.0,
"lora": make_lora_req(adapter_ids, loras),
})
return kwargs
def run_main_loop(host: str, documents: list[dict], uq_id: int, ans_id: int, adapter_ids: list[int]):
"""Run the main loop with questions"""
help_cmd = "/?"
doc_cmd = "/doc"
reset_cmd = "/reset"
quit_cmd = "/quit"
doc_pfx = f"{doc_cmd} "
def print_help():
print("Commands:")
print(f"{help_cmd}: Print help")
print(f"{doc_cmd}: Add a document")
print(f"{reset_cmd}: Reset the chat history")
print(f"{quit_cmd}: Quit")
messages = []
print_help()
while True:
inp = input("?> ").strip()
if inp == quit_cmd:
break
if not inp:
continue
if inp == help_cmd:
print_help()
continue
if inp == reset_cmd:
messages.clear()
continue
if inp.startswith(doc_pfx):
doc = inp[len(doc_pfx):].lstrip()
documents.append(make_document(len(documents), doc))
continue
# Apply the chat template with the user query
user_message = {"role": "user", "content": inp}
resp = requests.post(f"{host}/apply-template", json=make_chat_req(messages + [user_message], documents, adapter_ids, []))
resp.raise_for_status()
formatted_prompt = resp.json()["prompt"]
log.debug4("Formatted prompt: %s", formatted_prompt)
# Run the Answerability query
ans_prompt = formatted_prompt + "<|end_of_text|>\n<|start_of_role|>answerability<|end_of_role|>"
resp = requests.post(f"{host}/v1/completions", json=make_completion_req(ans_prompt, documents, adapter_ids, [ans_id], max_tokens=3))
resp.raise_for_status()
js = resp.json()
answerability = js["choices"][0]["text"]
log.debug("Answerability: %s", answerability)
log.debug2("Usage: %s", js["usage"])
log.debug2("Timings: %s", js["timings"])
answerable = not answerability.split()[0].lower().startswith("unanswerable")
if answerable:
print(">> The question is answerable!")
else:
print(">> I'm sorry, but that question isn't answerable with the given context")
if input("?> Do you want to try anyway [yN]? ").strip().lower() not in ["y", "yes"]:
continue
messages.append(user_message)
# If not unanswerable, run the question and get the assistant's response
resp = requests.post(f"{host}/v1/chat/completions", json=make_chat_req(messages, documents, adapter_ids, []))
resp.raise_for_status()
js = resp.json()
assistant_msg = js["choices"][0]["message"]
answer = assistant_msg["content"]
messages.append(assistant_msg)
print(f"ASSISTANT: {answer}")
# Get the uncertainty
formatted_prompt = requests.post(f"{host}/apply-template", json=make_chat_req(messages, documents, adapter_ids, [])).json()["prompt"]
uq_prompt = formatted_prompt + "<|end_of_text|>\n<|start_of_role|>certainty<|end_of_role|>"
resp = requests.post(f"{host}/v1/completions", json=make_completion_req(uq_prompt, documents, adapter_ids, [uq_id], max_tokens=5))
resp.raise_for_status()
js = resp.json()
uq = js["choices"][0]["text"]
print(f">> CERTAINTY: {uq}")
log.debug2("Usage: %s", js["usage"])
log.debug2("Timings: %s", js["timings"])
print()
def main():
parser = argparse.ArgumentParser(description=__doc__)
# Logging
parser.add_argument("--log-level", "-l", default=os.getenv("LOG_LEVEL", "info"))
parser.add_argument("--log-filters", "-lf", default=os.getenv("LOG_FILTERS", "urllib3.connectionpool:info"))
parser.add_argument("--log-json", "-lj", action="store_true", default=os.getenv("LOG_JSON", "").lower() == "true")
# Models
parser.add_argument("--alora-uq", "-u", type=int, default=None, help="Adapter ID for the UQ adapter")
parser.add_argument("--alora-answerability", "-a", type=int, default=None, help="Adapter ID for the Answerability adapter")
# Server
parser.add_argument("--host", "-s", default="http://localhost:8081", help="Host where llama-server is running")
# Docs
parser.add_argument("--document", "-d", nargs="+", help="document (text or path) to add as context")
# Configure logging
args = parser.parse_args()
alog.configure(
default_level=args.log_level,
filters=args.log_filters,
formatter="json" if args.log_json else "pretty",
thread_id=True,
)
# Make sure llama-server is up!
resp = requests.get(f"{args.host}/health")
resp.raise_for_status()
log.info("llama-server is up at %s", args.host)
# Get the loaded adapters
resp = requests.get(f"{args.host}/lora-adapters")
adapters = resp.json()
adapter_ids = [entry["id"] for entry in adapters]
# Figure out which adapter is which
uq_id = args.alora_uq
if uq_id is None:
candidates = [entry for entry in adapters if "uncertainty" in entry["path"]]
assert len(candidates) == 1, "Couldn't auto-deduce UQ adapter ID"
uq_id = candidates[0]["id"]
ans_id = args.alora_answerability
if ans_id is None:
candidates = [entry for entry in adapters if "answerability" in entry["path"]]
assert len(candidates) == 1, "Couldn't auto-deduce Answerability adapter ID"
ans_id = candidates[0]["id"]
log.info("UQ aLoRA ID: %d, Answerability aLoRA ID: %d", uq_id, ans_id)
# Load documents
documents = []
for i, doc in enumerate(args.document or []):
documents.append(make_document(i, doc))
# Start the prompt loop
log.info("Starting main loop")
run_main_loop(args.host, documents, uq_id, ans_id, adapter_ids)
if __name__ == "__main__":
    main()
```

Example Output
(NOTE: It's clear from my experiments that these adapters are not particularly robust, but that's a property of these specific ones, which are being continuously refined!)
I realized that my local …
The other contingency for this PR is #15404. The two are not functionally linked at all, but the above chat script will fail when trying to perform the chat template expansion without the fix there.
One additional note: these adapters seem to still work well when attached to a quantized model, so they don't require losing the speed/footprint benefits of quantization.

```bash
./bin/llama-server -m ~/models/granite-3.2-8b-instruct/ggml-model-Q4_K_M.gguf --lora ~/models/granite-3.2-8b-alora-uncertainty/granite-3.2-8B-alora-uncertainty-F16-LoRA.gguf --lora ~/models/granite-3.2-8b-alora-rag-answerability-prediction/granite-3.2-8B-alora-rag-answerability-prediction-F16-LoRA.gguf --port 8081 --jinja --reasoning-budget 0
```
EDIT: I also tried with …
Also important to test will be concurrent requests to the same llama-server.
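For example, a quick-and-dirty way to poke at that (a sketch only; it assumes the server above is on port 8081 and was started with enough parallel slots, e.g. -np 2, for the requests to actually overlap):

```python
# Hedged sketch: concurrent requests with different adapter configurations.
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8081/v1/completions"
PROMPT = "Who does Gabe work for?"

def run(lora_cfg: list[dict]) -> str:
    resp = requests.post(URL, json={
        "prompt": PROMPT,
        "temperature": 0.0,
        "max_tokens": 32,
        "lora": lora_cfg,
    })
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

configs = [
    [{"id": 0, "scale": 1.0}, {"id": 1, "scale": 0.0}],  # alora only
    [{"id": 0, "scale": 0.0}, {"id": 1, "scale": 0.0}],  # base model
]
with ThreadPoolExecutor(max_workers=2) as pool:
    for text in pool.map(run, configs):
        print("----\n" + text)
```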
@ryan-mangeno if you have any cycles for testing, I'd love some help putting this one through its paces.
…ation_string Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]>
Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]>
Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]>
This is the preferred method in PEFT which is the source of ground truth https://github.com/huggingface/peft/pull/2609/files#diff-13380145401d203d5935c5189dd09879f990b81aa63e8e3aaff8ce9110333f0e Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]>
Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]>
Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]>
This does not yet do the part to identify the invocation tokens and only apply the lora adapter afterwards, but it does seem to produce correct results if the invocation tokens are the beginning of the uncached input. Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]>
This currently limits to a single enabled alora per slot. Multiple aloras with different invocation sequences would be possible, but it would require a more complex integration of the adapter toggling and is not really a well studied case for alora since it's unclear if one alora can reuse cache from previous prefill computed with a different alora. Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]>
This is a bit of an edge case, but theoretically a user could try the same query with the alora disabled (just using the base model), then retry with the alora. The cached tokens from the first pass should be invalid. Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]>
The solution is to only fill up to the token before the invocation start in the batch if there are any tokens to be prefilled between those pulled from cache and the invocation start. When this is detected, the alora is temporarily disabled with a scale of 0.0, then immediately re-enabled after it has been initialized for the internal graph. Since the batch does not complete the prompt tokens, the remaining prompt tokens are handled in the next task, pulling all of the non-alora tokens from cache and proceeding with prefill for the alora tokens. Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]>
Too much python 🤦 Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]>
This was the cause of the inconsistent results from the dummy test script with and without the turn that runs the prompt without the adapter before running it with the adapter. Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]>
…er_config.json While this has been replaced in the PEFT PR in favor of alora_invocation_tokens, the existing adapters in the ibm-granite org on HF use "invocation_string," so this will enable backwards compatibility and enable testing now (before PEFT PR changes have percolated everywhere). Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]>
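In other words, when the new field is missing, conversion falls back to tokenizing the legacy string. A hedged sketch of that resolution logic (the helper below is illustrative, not the actual convert_lora_to_gguf.py code):

```python
def get_invocation_tokens(adapter_config: dict, tokenizer) -> list[int] | None:
    """Resolve aLoRA invocation tokens from an adapter_config.json dict.

    Prefers the new "alora_invocation_tokens" field; falls back to tokenizing
    the older "invocation_string" used by the existing ibm-granite adapters.
    """
    if "alora_invocation_tokens" in adapter_config:
        return adapter_config["alora_invocation_tokens"]
    invocation_string = adapter_config.get("invocation_string")
    if invocation_string:
        return tokenizer.encode(invocation_string, add_special_tokens=False)
    return None
```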
@ngxson @ggerganov @CISC this PR should be ready for review. The first question, though, is whether you all think this is a feature that we should add. It should be a net-addition (no change to existing functionality), but it's certainly a new feature that will require maintenance going forward (which I'm happy to be on the hook for).
I think it's a very interesting feature with definite worth, and doesn't seem to add much maintenance burden even if you hadn't committed yourself to it. :)
…upport * origin/master: (61 commits) scripts: add sqlite3 check for compare-commits.sh (ggml-org#15633) kv-cache : remove LLAMA_SET_ROWS checks (ggml-org#15505) gguf-py: byteswapping improvements (ggml-org#12851) cli : change log to warning to explain reason for stopping (ggml-org#15604) model-conversion : add mmproj conversion target (ggml-org#15628) cuda: Add cublasLt_static linking when GGML_STATIC is enabled (ggml-org#15622) server: higher timeout for tests (ggml-org#15621) presets : add qwen3-30B-a3b FIM (ggml-org#15616) HIP: Enable support for ggml_backend_cuda_register_host_buffer (ggml-org#15615) kv-cache : better estimate of n_kv for multi-sequence batches (ggml-org#15610) CANN: refactor mask handling and improve performance in FA (ggml-org#15561) ggml-cpu : add basic RVV support for vector f32 ops (ggml-org#15057) common : add -m to bash completion for --model [no ci] (ggml-org#15591) OpenCL: add fused group_norm/norm, mul, add (ggml-org#15314) tests : fix test-opt with GGML_BACKEND_DL (ggml-org#15599) SYCL: fix rms_norm_mul_add for tensor dim not a multiple of sg_size (ggml-org#15592) mtmd : fix mtmd ios build (ggml-org#15579) tests: add performance test for mul mat id (ggml-org#15543) llamafile: PowerPC Sgemm Optimization (ggml-org#15558) graph : fix assert in memory-less build_attn (ggml-org#15590) ...
@CISC It looks like there are some new conflicts with #13693. Looking over the changes there, I don't think the new task_name/prompt_prefix fields serve the same purpose as the aLoRA invocation tokens.
Yeah, they are somewhat similar, but I think the purpose they serve is different enough that it justifies separating them, esp. as …
Great, I'll resolve the conflicts the simple way then.
…upport * origin/master: ggml : fix SSM_SCAN for n_groups > 1 (ggml-org#15625) kv-cache : fix find_slot to not search for continuous slot (ggml-org#15638) model : jina-embeddings-v3 support (ggml-org#13693) Signed-off-by: Gabe Goodhart <[email protected]>
…upport * origin/master: (72 commits) metal : Add template specialization for mul_mm_id w/ ne20 == 10 (ggml-org#15799) llama : set n_outputs to 1 to avoid 0 outputs mean-pooling (ggml-org#15791) CANN: Refactor ND to NZ workspace to be per-device (ggml-org#15763) server: add exceed_context_size_error type (ggml-org#15780) Document the new max GPU layers default in help (ggml-org#15771) ggml: add ops for WAN video model (cuda && cpu) (ggml-org#15669) CANN: Fix precision issue on 310I DUO multi-devices (ggml-org#15784) opencl: add hs=40 to FA (ggml-org#15758) CANN: fix acl_rstd allocation size in ggml_cann_rms_norm (ggml-org#15760) vulkan: fix mmv subgroup16 selection (ggml-org#15775) vulkan: don't use std::string in load_shaders, to improve compile time (ggml-org#15724) vulkan : update ggml_vk_instance_validation_ext_available (ggml-org#15666) ggml vulkan: add hardsigmoid and hardswish operations (ggml-org#15762) CUDA: Optimize `rms_norm_f32` kernel and its fused variants, giving 1-6% perf E2E (ggml-org#15715) model-conversion : fix pyright errors (ggml-org#15770) sampling : optimize dist sampler (ggml-org#15704) llama : fix incorrect model type for Gemma 270M (ggml-org#15764) model-conversion : remove hardcoded /bin/bash shebangs [no ci] (ggml-org#15765) CANN: Add RoPE contiguous check for 310I DUP device (ggml-org#15735) ggml-cpu : optimize RVV kernels (ggml-org#15720) ...
Would it perhaps make sense to expose the invocation tokens here?
llama.cpp/tools/server/server.cpp
Lines 5082 to 5088 in d8651c8
```cpp
result.push_back({
    {"id",            i},
    {"path",          lora.path},
    {"scale",         lora.scale},
    {"task_name",     lora.task_name},
    {"prompt_prefix", lora.prompt_prefix},
});
```
Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: Sigbjørn Skjæret <[email protected]>
…upport * origin/master: Thinking model disabled assistant prefill (ggml-org#15404) Implement --log-colors with always/never/auto (ggml-org#15792) CUDA: fastdiv, launch bounds for mmvq + q8_1 quant (ggml-org#15802) tests : add --list-ops and --show-coverage options (ggml-org#15745) gguf: gguf_writer refactor (ggml-org#15691) kv-cache : fix SWA checks + disable cacheless iSWA (ggml-org#15811) model-conversion : add --embeddings flag to modelcard.template [no ci] (ggml-org#15801) chat : fixed crash when Hermes 2 <tool_call> had a newline before it (ggml-org#15639) chat : nemotron thinking & toolcalling support (ggml-org#15676) scripts : add Jinja tester PySide6 simple app (ggml-org#15756) llama : add support for EmbeddingGemma 300m (ggml-org#15798)
Yep, great idea. The only tricky bit would be whether it's better to detokenize it and show the string version?
If so, have both (since you can send raw tokens), but really the user can detokenize if necessary.
… /lora-adapters Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]>
It was pretty easy to detokenize, so I added both. This will be very nice for reflection!
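With that in place, a client can discover the activation sequence at runtime, roughly like this (field names per the commit above; whether non-aLoRA entries omit the fields or report them empty is an assumption here):

```python
# Hedged sketch: discover aLoRA adapters and their activation sequences.
import requests

adapters = requests.get("http://localhost:8081/lora-adapters").json()
for entry in adapters:
    # Non-aLoRA adapters simply won't carry the invocation fields.
    if entry.get("alora_invocation_tokens"):
        print(entry["id"], entry["path"])
        print("  invocation string:", entry["alora_invocation_string"])
        print("  invocation tokens:", entry["alora_invocation_tokens"])
```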
* feat: Add python-side constants and conversion for adapter.lora.invocation_string Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]> * feat: Add c++ side constants for adapter.lora.invocation_string Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]> * feat: Parse invocation string for adapters from GGUF Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]> * fix(python): Update conversion to alora_invocation_tokens This is the preferred method in PEFT which is the source of ground truth https://github.com/huggingface/peft/pull/2609/files#diff-13380145401d203d5935c5189dd09879f990b81aa63e8e3aaff8ce9110333f0e Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]> * fix(cpp): Update to alora_invocation_tokens on c++ side Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]> * feat: Add C APIs to get alora invocation token array from lora Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]> * feat: Initial implementation of alora cache logic in server This does not yet do the part to identify the invocation tokens and only apply the lora adapter afterwards, but it does seem to produce correct results if the invocation tokens are the beginning of the uncached input. Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]> * feat: Identify alora invocation sequences This currently limits to a single enabled alora per slot. Multiple aloras with different invocation sequences would be possible, but it would require a more complex integration of the adapter toggling and is not really a well studied case for alora since it's unclear if one alora can reuse cache from previous prefill computed with a different alora. Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]> * feat: Only reuse cache for tokens before the alora invocation start This is a bit of an edge case, but theoretically a user could try the same query with the alora disabled (just using the base model), then retry with the alora. The cached tokens from the first pass should be invalid. Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]> * feat: Handle un-cached tokens that come before the alora activation The solution is to only fill up to the token before the invocation start in the batch if there are any tokens to be prefilled between those pulled from cache and the invocation start. When this is detected, the alora is temporarily disabled with a scale of 0.0, then immediately re-enabled after it has been initialized for the internal graph. Since the batch does not complete the prompt tokens, the remaining prompt tokens are handled in the next task, pulling all of the non-alora tokens from cache and proceeding with prefill for the alora tokens. Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]> * fix: Use || instead of 'or' Too much python 🤦 Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]> * fix: Fix off-by-one for limiting cached tokens to before alora start This was the cause of the inconsistent results from the dummy test script with and without the turn that runs the prompt without the adapter before running it with the adapter. 
Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]> * fix: Support backwards-compatibility for "invocation_string" in adapter_config.json While this has been replaced in the PEFT PR in favor of alora_invocation_tokens, the existing adapters in the ibm-granite org on HF use "invocation_string," so this will enable backwards compatibility and enable testing now (before PEFT PR changes have percolated everywhere). Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]> * fix: Remove duplicate logging Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: Sigbjørn Skjæret <[email protected]> * feat: Report alora_invocation_string and alora_invocation_tokens from /lora-adapters Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]> --------- Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: Sigbjørn Skjæret <[email protected]>
Just noticed (also: if this is a bug it's nothing to do with this PR! I first noticed it a couple of months ago; it's just that now I might need to distribute some quite high-rank LoRAs and am trying to work out if it's possible to create a quantized LoRA).
@jukofyork I have not tried creating quantized adapters beyond an initial attempt to point the conversion script at one.
CISC answered here: so it looks like it could easily be hacked to create quantised LoRAs if needed.
DRAFT STATUS

This PR was originally in draft as a proof-of-concept while we discussed the best path forward. The implementation is now robust enough to be ready for full review. The changes were a bit more involved than I had originally hoped based on Georgi's comment, but they are all contained to tools/server except for the changes to support the new GGUF field.

Description
Closes #15212
Supports #15213
This PR adds support for Activated LoRA (aLoRA) in llama-server and in the GGUF representation of a LoRA adapter. The primary benefit of aLoRA is the ability to hot-swap adapters without needing to clear cache. This enables a much more efficient multi-adapter model where individual adapters provide "add-on" features to a model and can be applied during a model flow without redoing the prefill work.

Current Changes
- Add the adapter.alora.invocation_tokens GGUF KV
- Populate adapter.alora.invocation_tokens from "alora_invocation_tokens" in convert_lora_to_gguf.py
- Parse adapter.alora.invocation_tokens when loading an adapter
- Add alora_invocation_tokens to the llama_lora_adapter struct
- Add C APIs in llama.h to support getting the invocation tokens from a const llama_lora_adapter *
- Update server to conditionally not clear cache when a request with an adapter change arrives under the following conditions: the only adapters being toggled are aloras (see the sketch below this list)
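To make the cache-reuse behavior concrete, the sketch referenced above issues a base-model request and then an aLoRA request over the same prefix, changing only the adapter scales between the two; under the condition above, the server should keep the prefill cache instead of clearing it. Endpoints and payload shapes follow the test scripts earlier in this thread; the prompt and adapter id are illustrative.

```python
# Sketch: hot-swap an aLoRA without redoing prefill of the shared context.
import requests

URL = "http://localhost:8081"
context = "Some long shared context that we only want to prefill once...\n"

def complete(prompt: str, alora_scale: float) -> dict:
    resp = requests.post(f"{URL}/v1/completions", json={
        "prompt": prompt,
        "temperature": 0.0,
        "max_tokens": 16,
        "lora": [{"id": 0, "scale": alora_scale}],  # id 0 assumed to be the aLoRA
    })
    resp.raise_for_status()
    return resp.json()

# 1) Base-model pass over the context (adapter disabled)
base = complete(context, 0.0)

# 2) aLoRA pass: same prefix plus the activation sequence, adapter enabled.
#    Since only an alora changed, the cache is kept and only the activation
#    tokens (and anything after them) are prefilled fresh.
uq = complete(context + "<|start_of_role|>certainty<|end_of_role|>", 1.0)

print(base["timings"])
print(uq["timings"])
```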
TODO

The main remaining work for alora is to identify the invocation tokens within an input request and only use the adapter for tokens starting with the invocation sequence. This may require a much deeper intervention to support adapter scaling on a per-token basis rather than on a per-computation basis.
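In practice, the approach that landed (see the "Handle un-cached tokens that come before the alora activation" commit above) sidesteps per-token scaling by splitting the prompt batch at the invocation start. A rough Python rendering of that decision (illustrative only; the real logic lives in the server's prompt/slot handling):

```python
def plan_prompt_batch(prompt_tokens, n_cached, alora_start):
    """Decide which prompt tokens to put in the next batch.

    prompt_tokens: full tokenized prompt
    n_cached:      tokens already reusable from the KV cache
    alora_start:   index where the alora invocation sequence begins

    If there are uncached tokens *before* the invocation start, only process
    those in this batch (with the adapter scale temporarily set to 0.0); the
    invocation tokens and everything after them are left for the next pass,
    which runs with the adapter enabled.
    """
    if n_cached < alora_start:
        return prompt_tokens[n_cached:alora_start], False  # adapter disabled
    return prompt_tokens[n_cached:], True                  # adapter enabled

# Example: 10 cached tokens, invocation starts at index 14
tokens = list(range(20))
batch, adapter_on = plan_prompt_batch(tokens, n_cached=10, alora_start=14)
assert batch == [10, 11, 12, 13] and not adapter_on
```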
Testing

I'm testing this using the following models and adapters:
Conversion
Execution
Sniff test
This script simply verifies that the two adapters can be toggled and that the cache is cleared appropriately. The example inputs are trivial, so the timings are not particularly valuable.
server-req.py
Response