aLoRA Support #15327

Conversation
One interesting update: for the specific adapters I'm using to test here, the invocation sequence is an incomplete turn with a non-standard role rather than a normal assistant generation prompt. I've updated my sniff test script above to use client-side template expansion and the raw /v1/completions endpoint. NOTE: This is a property of these adapters and not of aLoRA in general; theoretically, an adapter could be trained to invoke on the full assistant generation prompt.
Add the following to your request to remove the assistant generation prompt: "add_generation_prompt": false

Ah, yep, that will definitely help, but it won't eliminate the issue entirely.

Ah, didn't notice that, I suppose that's just because the template doesn't properly handle unknown roles?

Yeah, the real issue is that it was trained to act like the generation prompt, so the activation sequence is intentionally an incomplete turn, but with a different role.
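For reference, a minimal sketch of the client-side expansion being discussed here, assuming the Granite 3.2 chat template and the certainty activation role used by the UQ adapter (the HF repo id and messages are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.2-8b-instruct")

messages = [
    {"role": "user", "content": "Who does Gabe work for?"},
    {"role": "assistant", "content": "Gabe works for IBM."},
]

# No assistant generation prompt: the activation sequence plays that role,
# just as an (intentionally incomplete) turn with a different role name.
raw_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=False
)
uq_prompt = raw_prompt + "<|start_of_role|>certainty<|end_of_role|>"
print(uq_prompt)
```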
Update

I've now added support for correctly applying the adapter only to the tokens starting at the invocation sequence.
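Conceptually, that amounts to locating the adapter's invocation token sequence inside the tokenized prompt and only enabling the adapter from that point on. A rough Python illustration (not the actual server code, which lives in tools/server):

```python
def find_alora_start(prompt_tokens: list[int], invocation_tokens: list[int]) -> int:
    """Return the index where the alora invocation sequence begins, or -1.

    This sketch searches back-to-front so that the last occurrence wins; the
    actual server logic may differ in details.
    """
    n, m = len(prompt_tokens), len(invocation_tokens)
    for start in range(n - m, -1, -1):
        if prompt_tokens[start:start + m] == invocation_tokens:
            return start
    return -1

# Tokens before the returned index can be prefilled (or pulled from cache)
# with the adapter disabled; tokens from the index onward need the adapter.
```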
Testing

I've got a few tweaks to my test script that allow it to simulate these conditions:

uq-req.py

```python
import json
import time
from transformers import AutoTokenizer
import requests
tokenizer = AutoTokenizer.from_pretrained("/Users/ghart/models/granite-3.2-8b-instruct")
url = "http://localhost:8081"
documents = [
{"text": "My name is Gabe"},
{"text": "I work for IBM"}
]
messages = [{"role": "user", "content": "Who does Gabe work for?"}]
adapter_message = {
"role": "certainty",
"content": ""
}
# Run base messages
print("----")
start = time.time()
resp = requests.post(f"{url}/v1/chat/completions", json={
"model": "unused",
"messages": messages,
"chat_template_kwargs": {
"documents": documents,
},
"temperature": 0.0,
"lora": [
# alora
{"id": 0, "scale": 0.0},
# lora
{"id": 1, "scale": 0.0},
],
})
end = time.time()
assistant_resp = resp.json()["choices"][0]["message"]
print(f"ASSISTANT RESPONSE ({end-start}s):")
print(assistant_resp["content"])
# UNCOMMENT this to extend the assistant's response so that it isn't cached
"""
assistant_resp["content"] = assistant_resp["content"] + "\nRespect my authority!"
"""
# Create the serialized version as a string so we can append the right prompt
messages.append(assistant_resp)
raw_prompt = tokenizer.apply_chat_template(messages, documents=documents, tokenize=False)
uq_prompt = raw_prompt + "<|start_of_role|>certainty<|end_of_role|>"
# Run with both adapters disabled
# UNCOMMENT this to exercise the case where the invocation string itself has
# been cached without the adapter
"""
print("----")
start = time.time()
resp = requests.post(f"{url}/v1/completions", json={
"model": "unused",
"prompt": uq_prompt,
"temperature": 0.0,
"lora": [
# alora
{"id": 0, "scale": 0.0},
# lora
{"id": 1, "scale": 0.0},
],
})
end = time.time()
js = resp.json()
uq_resp = js["choices"][0]["text"]
print(f"UQ RESPONSE w/out adapters ({end-start}s)")
print(uq_resp)
print(">>")
print(json.dumps(js["usage"], indent=2))
print(json.dumps(js["timings"], indent=2))
"""
# Run with the adapter and the prompt for UQ with the alora enabled
print("----")
start = time.time()
resp = requests.post(f"{url}/v1/completions", json={
"model": "unused",
"prompt": uq_prompt,
"temperature": 0.0,
"max_tokens": 100,
"lora": [
# alora
{"id": 0, "scale": 1.0},
# lora
{"id": 1, "scale": 0.0},
],
})
end = time.time()
js = resp.json()
uq_resp = js["choices"][0]["text"]
print(f"UQ RESPONSE w/ aLoRA ({end-start}s)")
print(uq_resp)
print(">>")
print(json.dumps(js["usage"], indent=2))
print(json.dumps(js["timings"], indent=2))
# Run with the adapter and the prompt for UQ with the lora enabled
print("----")
start = time.time()
resp = requests.post(f"{url}/v1/completions", json={
"model": "unused",
"prompt": uq_prompt,
"temperature": 0.0,
"max_tokens": 100,
"lora": [
# alora
{"id": 0, "scale": 0.0},
# lora
{"id": 1, "scale": 1.0},
],
})
end = time.time()
js = resp.json()
uq_resp = js["choices"][0]["text"]
print(f"UQ RESPONSE w/ LoRA ({end-start}s)")
print(uq_resp)
print(">>")
print(json.dumps(js["usage"], indent=2))
print(json.dumps(js["timings"], indent=2))
```

Don't use cached invocation sequence from base model

This simulates the case where the user ran the invocation sequence through the base model without the adapter and those tokens are cached (uncomment starting at line 57).
Don't use adapter for uncached tokens before invocation sequence

This simulates the case where, for some reason, there are additional tokens not pulled from cache that come before the invocation sequence (uncomment line 45).
I've now extended this to test with multiple adapters.

Adapters are converted using convert_lora_to_gguf.py.

Boot with adapters

```bash
./bin/llama-server \
-m ~/models/granite-3.2-8b-instruct/granite-3.2-8B-instruct-F16.gguf \
--lora ~/models/granite-3.2-8b-alora-uncertainty/granite-3.2-8B-alora-uncertainty-F16-LoRA.gguf \
--lora ~/models/granite-3.2-8b-alora-rag-answerability-prediction/granite-3.2-8B-alora-rag-answerability-prediction-F16-LoRA.gguf \
--port 8081 \
--jinja \
--reasoning-budget 0
```

Test Script

(sorry, it requires my personal logging framework just 'cuz 😉 ...)

alora-chat.py

```python
#!/usr/bin/env python
"""
This is a simple implementation of an interactive chat that leverages several
aLoRA adapters during the flow
"""
# Standard
import argparse
import os
# First Party
import alog
# Third Party
import requests
log = alog.use_channel("MAIN")
def make_document(i: int, doc: str) -> dict:
"""Make a document dict from the given doc as either text or a path"""
log.info("Adding document: %s", doc)
if os.path.exists(doc):
with open(doc, "r") as handle:
return {"text": handle.read(), "doc_id": i, "title": doc}
return {"text": doc, "doc_id": i}
def make_lora_req(adapter_ids: list[int], loras: list[int]) -> list[dict]:
return [
{"id": i, "scale": 1.0 if i in loras else 0.0}
for i in adapter_ids
]
def make_chat_req(messages: list[dict], documents: list[dict], adapter_ids: list[int], loras: list[int]) -> dict:
return {
"messages": messages,
"chat_template_kwargs": {
"documents": documents,
},
"temperature": 0.0,
"lora": make_lora_req(adapter_ids, loras),
}
def make_completion_req(prompt: str, documents: list[dict], adapter_ids: list[int], loras: list[int], **kwargs) -> dict:
kwargs.update({
"prompt": prompt,
"chat_template_kwargs": {
"documents": documents,
},
"temperature": 0.0,
"lora": make_lora_req(adapter_ids, loras),
})
return kwargs
def run_main_loop(host: str, documents: list[dict], uq_id: int, ans_id: int, adapter_ids: list[int]):
"""Run the main loop with questions"""
help_cmd = "/?"
doc_cmd = "/doc"
reset_cmd = "/reset"
quit_cmd = "/quit"
doc_pfx = f"{doc_cmd} "
def print_help():
print("Commands:")
print(f"{help_cmd}: Print help")
print(f"{doc_cmd}: Add a document")
print(f"{reset_cmd}: Reset the chat history")
print(f"{quit_cmd}: Quit")
messages = []
print_help()
while True:
inp = input("?> ").strip()
if inp == quit_cmd:
break
if not inp:
continue
if inp == help_cmd:
print_help()
continue
if inp == reset_cmd:
messages.clear()
continue
if inp.startswith(doc_pfx):
doc = inp[len(doc_pfx):].lstrip()
documents.append(make_document(len(documents), doc))
continue
# Apply the chat template with the user query
user_message = {"role": "user", "content": inp}
resp = requests.post(f"{host}/apply-template", json=make_chat_req(messages + [user_message], documents, adapter_ids, []))
resp.raise_for_status()
formatted_prompt = resp.json()["prompt"]
log.debug4("Formatted prompt: %s", formatted_prompt)
# Run the Answerability query
ans_prompt = formatted_prompt + "<|end_of_text|>\n<|start_of_role|>answerability<|end_of_role|>"
resp = requests.post(f"{host}/v1/completions", json=make_completion_req(ans_prompt, documents, adapter_ids, [ans_id], max_tokens=3))
resp.raise_for_status()
js = resp.json()
answerability = js["choices"][0]["text"]
log.debug("Answerability: %s", answerability)
log.debug2("Usage: %s", js["usage"])
log.debug2("Timings: %s", js["timings"])
answerable = not answerability.split()[0].lower().startswith("unanswerable")
if answerable:
print(">> The question is answerable!")
else:
print(">> I'm sorry, but that question isn't answerable with the given context")
if input("?> Do you want to try anyway [yN]? ").strip().lower() not in ["y", "yes"]:
continue
messages.append(user_message)
# If not unanswerable, run the question and get the assistant's response
resp = requests.post(f"{host}/v1/chat/completions", json=make_chat_req(messages, documents, adapter_ids, []))
resp.raise_for_status()
js = resp.json()
assistant_msg = js["choices"][0]["message"]
answer = assistant_msg["content"]
messages.append(assistant_msg)
print(f"ASSISTANT: {answer}")
# Get the uncertainty
formatted_prompt = requests.post(f"{host}/apply-template", json=make_chat_req(messages, documents, adapter_ids, [])).json()["prompt"]
uq_prompt = formatted_prompt + "<|end_of_text|>\n<|start_of_role|>certainty<|end_of_role|>"
resp = requests.post(f"{host}/v1/completions", json=make_completion_req(uq_prompt, documents, adapter_ids, [uq_id], max_tokens=5))
resp.raise_for_status()
js = resp.json()
uq = js["choices"][0]["text"]
print(f">> CERTAINTY: {uq}")
log.debug2("Usage: %s", js["usage"])
log.debug2("Timings: %s", js["timings"])
print()
def main():
parser = argparse.ArgumentParser(description=__doc__)
# Logging
parser.add_argument("--log-level", "-l", default=os.getenv("LOG_LEVEL", "info"))
parser.add_argument("--log-filters", "-lf", default=os.getenv("LOG_FILTERS", "urllib3.connectionpool:info"))
parser.add_argument("--log-json", "-lj", action="store_true", default=os.getenv("LOG_JSON", "").lower() == "true")
# Models
parser.add_argument("--alora-uq", "-u", type=int, default=None, help="Adapter ID for the UQ adapter")
parser.add_argument("--alora-answerability", "-a", type=int, default=None, help="Adapter ID for the Answerability adapter")
# Server
parser.add_argument("--host", "-s", default="http://localhost:8081", help="Host where llama-server is running")
# Docs
parser.add_argument("--document", "-d", nargs="+", help="document (text or path) to add as context")
# Configure logging
args = parser.parse_args()
alog.configure(
default_level=args.log_level,
filters=args.log_filters,
formatter="json" if args.log_json else "pretty",
thread_id=True,
)
# Make sure llama-server is up!
resp = requests.get(f"{args.host}/health")
resp.raise_for_status()
log.info("llama-server is up at %s", args.host)
# Get the loaded adapters
resp = requests.get(f"{args.host}/lora-adapters")
adapters = resp.json()
adapter_ids = [entry["id"] for entry in adapters]
# Figure out which adapter is which
uq_id = args.alora_uq
if uq_id is None:
candidates = [entry for entry in adapters if "uncertainty" in entry["path"]]
assert len(candidates) == 1, "Couldn't auto-deduce UQ adapter ID"
uq_id = candidates[0]["id"]
ans_id = args.alora_answerability
if ans_id is None:
candidates = [entry for entry in adapters if "answerability" in entry["path"]]
assert len(candidates) == 1, "Couldn't auto-deduce Answerability adapter ID"
ans_id = candidates[0]["id"]
log.info("UQ aLoRA ID: %d, Answerability aLoRA ID: %d", uq_id, ans_id)
# Load documents
documents = []
for i, doc in enumerate(args.document or []):
documents.append(make_document(i, doc))
# Start the prompt loop
log.info("Starting main loop")
run_main_loop(args.host, documents, uq_id, ans_id, adapter_ids)
if __name__ == "__main__":
    main()
```

Example Output
(NOTE: It's clear from my experiments that these adapters are not particularly robust, but that's a property of these specific ones, which are being continuously refined!)
I realized that my local …
The other contingency for this PR is #15404. The two are not functionally linked at all, but the above chat script will fail when trying to perform the chat template expansion without the fix there.
One additional note: these adapters seem to still work well when attached to a quantized model, so they don't require losing the speed/footprint benefits of quantization.

```bash
./bin/llama-server -m ~/models/granite-3.2-8b-instruct/ggml-model-Q4_K_M.gguf --lora ~/models/granite-3.2-8b-alora-uncertainty/granite-3.2-8B-alora-uncertainty-F16-LoRA.gguf --lora ~/models/granite-3.2-8b-alora-rag-answerability-prediction/granite-3.2-8B-alora-rag-answerability-prediction-F16-LoRA.gguf --port 8081 --jinja --reasoning-budget 0
```
EDIT: I also tried with …
Also important to test will be concurrent requests to the same llama-server.
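For example, a quick-and-dirty way to poke at that (a sketch only; it assumes the server above is on port 8081 and was started with enough parallel slots, e.g. -np 2, for the requests to actually overlap):

```python
# Hedged sketch: concurrent requests with different adapter configurations.
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8081/v1/completions"
PROMPT = "Who does Gabe work for?"

def run(lora_cfg: list[dict]) -> str:
    resp = requests.post(URL, json={
        "prompt": PROMPT,
        "temperature": 0.0,
        "max_tokens": 32,
        "lora": lora_cfg,
    })
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

configs = [
    [{"id": 0, "scale": 1.0}, {"id": 1, "scale": 0.0}],  # alora only
    [{"id": 0, "scale": 0.0}, {"id": 1, "scale": 0.0}],  # base model
]
with ThreadPoolExecutor(max_workers=2) as pool:
    for text in pool.map(run, configs):
        print("----\n" + text)
```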
@ryan-mangeno if you have any cycles for testing, I'd love some help putting this one through its paces.
…ation_string Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]>
Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]>
Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]>
This is the preferred method in PEFT which is the source of ground truth https://github.com/huggingface/peft/pull/2609/files#diff-13380145401d203d5935c5189dd09879f990b81aa63e8e3aaff8ce9110333f0e Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]>
Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]>
Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]>
This does not yet do the part to identify the invocation tokens and only apply the lora adapter afterwards, but it does seem to produce correct results if the invocation tokens are the beginning of the uncached input. Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]>
This currently limits to a single enabled alora per slot. Multiple aloras with different invocation sequences would be possible, but it would require a more complex integration of the adapter toggling and is not really a well studied case for alora since it's unclear if one alora can reuse cache from previous prefill computed with a different alora. Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]>
This is a bit of an edge case, but theoretically a user could try the same query with the alora disabled (just using the base model), then retry with the alora. The cached tokens from the first pass should be invalid. Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]>
The solution is to only fill up to the token before the invocation start in the batch if there are any tokens to be prefilled between those pulled from cache and the invocation start. When this is detected, the alora is temporarily disabled with a scale of 0.0, then immediately re-enabled after it has been initialized for the internal graph. Since the batch does not complete the prompt tokens, the remaining prompt tokens are handled in the next task, pulling all of the non-alora tokens from cache and proceeding with prefill for the alora tokens. Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]>
Too much python 🤦 Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]>
This was the cause of the inconsistent results from the dummy test script with and without the turn that runs the prompt without the adapter before running it with the adapter. Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]>
…er_config.json While this has been replaced in the PEFT PR in favor of alora_invocation_tokens, the existing adapters in the ibm-granite org on HF use "invocation_string," so this will enable backwards compatibility and enable testing now (before PEFT PR changes have percolated everywhere). Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]>
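In other words, when the new field is missing, conversion falls back to tokenizing the legacy string. A hedged sketch of that resolution logic (the helper below is illustrative, not the actual convert_lora_to_gguf.py code):

```python
def get_invocation_tokens(adapter_config: dict, tokenizer) -> list[int] | None:
    """Resolve aLoRA invocation tokens from an adapter_config.json dict.

    Prefers the new "alora_invocation_tokens" field; falls back to tokenizing
    the older "invocation_string" used by the existing ibm-granite adapters.
    """
    if "alora_invocation_tokens" in adapter_config:
        return adapter_config["alora_invocation_tokens"]
    invocation_string = adapter_config.get("invocation_string")
    if invocation_string:
        return tokenizer.encode(invocation_string, add_special_tokens=False)
    return None
```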
@ngxson @ggerganov @CISC this PR should be ready for review. The first question, though, is whether you all think this is a feature that we should add. It should be a net-addition (no change to existing functionality), but it's certainly a new feature that will require maintenance going forward (which I'm happy to be on the hook for).
I think it's a very interesting feature with definite worth, and doesn't seem to add much maintenance burden even if you hadn't committed yourself to it. :)
…upport * origin/master: (61 commits) scripts: add sqlite3 check for compare-commits.sh (ggml-org#15633) kv-cache : remove LLAMA_SET_ROWS checks (ggml-org#15505) gguf-py: byteswapping improvements (ggml-org#12851) cli : change log to warning to explain reason for stopping (ggml-org#15604) model-conversion : add mmproj conversion target (ggml-org#15628) cuda: Add cublasLt_static linking when GGML_STATIC is enabled (ggml-org#15622) server: higher timeout for tests (ggml-org#15621) presets : add qwen3-30B-a3b FIM (ggml-org#15616) HIP: Enable support for ggml_backend_cuda_register_host_buffer (ggml-org#15615) kv-cache : better estimate of n_kv for multi-sequence batches (ggml-org#15610) CANN: refactor mask handling and improve performance in FA (ggml-org#15561) ggml-cpu : add basic RVV support for vector f32 ops (ggml-org#15057) common : add -m to bash completion for --model [no ci] (ggml-org#15591) OpenCL: add fused group_norm/norm, mul, add (ggml-org#15314) tests : fix test-opt with GGML_BACKEND_DL (ggml-org#15599) SYCL: fix rms_norm_mul_add for tensor dim not a multiple of sg_size (ggml-org#15592) mtmd : fix mtmd ios build (ggml-org#15579) tests: add performance test for mul mat id (ggml-org#15543) llamafile: PowerPC Sgemm Optimization (ggml-org#15558) graph : fix assert in memory-less build_attn (ggml-org#15590) ...
@CISC It looks like there are some new conflicts with #13693. Looking over the changes there, I don't think the new task_name/prompt_prefix fields serve the same purpose as the aLoRA invocation tokens.
Yeah, they are somewhat similar, but I think the purpose they serve is different enough that it justifies separating them, esp. as …
Great, I'll resolve the conflicts the simple way then.
…upport * origin/master: ggml : fix SSM_SCAN for n_groups > 1 (ggml-org#15625) kv-cache : fix find_slot to not search for continuous slot (ggml-org#15638) model : jina-embeddings-v3 support (ggml-org#13693) Signed-off-by: Gabe Goodhart <[email protected]>
…upport * origin/master: (72 commits) metal : Add template specialization for mul_mm_id w/ ne20 == 10 (ggml-org#15799) llama : set n_outputs to 1 to avoid 0 outputs mean-pooling (ggml-org#15791) CANN: Refactor ND to NZ workspace to be per-device (ggml-org#15763) server: add exceed_context_size_error type (ggml-org#15780) Document the new max GPU layers default in help (ggml-org#15771) ggml: add ops for WAN video model (cuda && cpu) (ggml-org#15669) CANN: Fix precision issue on 310I DUO multi-devices (ggml-org#15784) opencl: add hs=40 to FA (ggml-org#15758) CANN: fix acl_rstd allocation size in ggml_cann_rms_norm (ggml-org#15760) vulkan: fix mmv subgroup16 selection (ggml-org#15775) vulkan: don't use std::string in load_shaders, to improve compile time (ggml-org#15724) vulkan : update ggml_vk_instance_validation_ext_available (ggml-org#15666) ggml vulkan: add hardsigmoid and hardswish operations (ggml-org#15762) CUDA: Optimize `rms_norm_f32` kernel and its fused variants, giving 1-6% perf E2E (ggml-org#15715) model-conversion : fix pyright errors (ggml-org#15770) sampling : optimize dist sampler (ggml-org#15704) llama : fix incorrect model type for Gemma 270M (ggml-org#15764) model-conversion : remove hardcoded /bin/bash shebangs [no ci] (ggml-org#15765) CANN: Add RoPE contiguous check for 310I DUP device (ggml-org#15735) ggml-cpu : optimize RVV kernels (ggml-org#15720) ...
Would it perhaps make sense to expose the invocation tokens here?
llama.cpp/tools/server/server.cpp
Lines 5082 to 5088 in d8651c8
```cpp
result.push_back({
    {"id",            i},
    {"path",          lora.path},
    {"scale",         lora.scale},
    {"task_name",     lora.task_name},
    {"prompt_prefix", lora.prompt_prefix},
});
```
Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: Sigbjørn Skjæret <[email protected]>
…upport * origin/master: Thinking model disabled assistant prefill (ggml-org#15404) Implement --log-colors with always/never/auto (ggml-org#15792) CUDA: fastdiv, launch bounds for mmvq + q8_1 quant (ggml-org#15802) tests : add --list-ops and --show-coverage options (ggml-org#15745) gguf: gguf_writer refactor (ggml-org#15691) kv-cache : fix SWA checks + disable cacheless iSWA (ggml-org#15811) model-conversion : add --embeddings flag to modelcard.template [no ci] (ggml-org#15801) chat : fixed crash when Hermes 2 <tool_call> had a newline before it (ggml-org#15639) chat : nemotron thinking & toolcalling support (ggml-org#15676) scripts : add Jinja tester PySide6 simple app (ggml-org#15756) llama : add support for EmbeddingGemma 300m (ggml-org#15798)
Yep, great idea. The only tricky bit would be whether it's better to detokenize it and show the string version?
If so, have both (since you can send raw tokens), but really the user can detokenize if necessary.
… /lora-adapters Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]>
It was pretty easy to detokenize, so I added both. This will be very nice for reflection!
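With that in place, a client can discover the activation sequence at runtime, roughly like this (field names per the commit above; whether non-aLoRA entries omit the fields or report them empty is an assumption here):

```python
# Hedged sketch: discover aLoRA adapters and their activation sequences.
import requests

adapters = requests.get("http://localhost:8081/lora-adapters").json()
for entry in adapters:
    # Non-aLoRA adapters simply won't carry the invocation fields.
    if entry.get("alora_invocation_tokens"):
        print(entry["id"], entry["path"])
        print("  invocation string:", entry["alora_invocation_string"])
        print("  invocation tokens:", entry["alora_invocation_tokens"])
```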
* feat: Add python-side constants and conversion for adapter.lora.invocation_string Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]> * feat: Add c++ side constants for adapter.lora.invocation_string Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]> * feat: Parse invocation string for adapters from GGUF Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]> * fix(python): Update conversion to alora_invocation_tokens This is the preferred method in PEFT which is the source of ground truth https://github.com/huggingface/peft/pull/2609/files#diff-13380145401d203d5935c5189dd09879f990b81aa63e8e3aaff8ce9110333f0e Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]> * fix(cpp): Update to alora_invocation_tokens on c++ side Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]> * feat: Add C APIs to get alora invocation token array from lora Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]> * feat: Initial implementation of alora cache logic in server This does not yet do the part to identify the invocation tokens and only apply the lora adapter afterwards, but it does seem to produce correct results if the invocation tokens are the beginning of the uncached input. Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]> * feat: Identify alora invocation sequences This currently limits to a single enabled alora per slot. Multiple aloras with different invocation sequences would be possible, but it would require a more complex integration of the adapter toggling and is not really a well studied case for alora since it's unclear if one alora can reuse cache from previous prefill computed with a different alora. Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]> * feat: Only reuse cache for tokens before the alora invocation start This is a bit of an edge case, but theoretically a user could try the same query with the alora disabled (just using the base model), then retry with the alora. The cached tokens from the first pass should be invalid. Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]> * feat: Handle un-cached tokens that come before the alora activation The solution is to only fill up to the token before the invocation start in the batch if there are any tokens to be prefilled between those pulled from cache and the invocation start. When this is detected, the alora is temporarily disabled with a scale of 0.0, then immediately re-enabled after it has been initialized for the internal graph. Since the batch does not complete the prompt tokens, the remaining prompt tokens are handled in the next task, pulling all of the non-alora tokens from cache and proceeding with prefill for the alora tokens. Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]> * fix: Use || instead of 'or' Too much python 🤦 Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]> * fix: Fix off-by-one for limiting cached tokens to before alora start This was the cause of the inconsistent results from the dummy test script with and without the turn that runs the prompt without the adapter before running it with the adapter. 
Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]> * fix: Support backwards-compatibility for "invocation_string" in adapter_config.json While this has been replaced in the PEFT PR in favor of alora_invocation_tokens, the existing adapters in the ibm-granite org on HF use "invocation_string," so this will enable backwards compatibility and enable testing now (before PEFT PR changes have percolated everywhere). Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]> * fix: Remove duplicate logging Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: Sigbjørn Skjæret <[email protected]> * feat: Report alora_invocation_string and alora_invocation_tokens from /lora-adapters Branch: gabe-l-hart/alora-support Signed-off-by: Gabe Goodhart <[email protected]> --------- Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: Sigbjørn Skjæret <[email protected]>
Just noticed (also: if this is a bug it's nothing to do with this PR! I first noticed it a couple of months ago; it's just that now I might need to distribute some quite high-rank LoRAs and am trying to work out if it's possible to create a quantized LoRA).
@jukofyork I have not tried creating quantized adapters beyond an initial attempt to point the conversion script at one.
CISC answered here: so it looks like it could easily be hacked to create quantised LoRAs if needed.
DRAFT STATUS

This PR was originally in draft as a proof-of-concept while we discussed the best path forward. The implementation is now robust enough to be ready for full review. The changes were a bit more involved than I had originally hoped based on Georgi's comment, but they are all contained to tools/server except for the changes to support the new GGUF field.

Description
Closes #15212
Supports #15213
This PR adds support for Activated LoRA (aLoRA) in llama-server and in the GGUF representation of a LoRA adapter. The primary benefit of aLoRA is the ability to hot-swap adapters without needing to clear cache. This enables a much more efficient multi-adapter model where individual adapters provide "add-on" features to a model and can be applied during a model flow without redoing the prefill work.

Current Changes
- Add the adapter.alora.invocation_tokens GGUF KV
- Populate adapter.alora.invocation_tokens from "alora_invocation_tokens" in convert_lora_to_gguf.py
- Parse adapter.alora.invocation_tokens when loading an adapter
- Add alora_invocation_tokens to the llama_lora_adapter struct
- Add C APIs in llama.h to support getting the invocation tokens from a const llama_lora_adapter *
- Update server to conditionally not clear cache when a request with an adapter change arrives under the following conditions: the only adapters being toggled are aloras (see the sketch below this list)
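To make the cache-reuse behavior concrete, the sketch referenced above issues a base-model request and then an aLoRA request over the same prefix, changing only the adapter scales between the two; under the condition above, the server should keep the prefill cache instead of clearing it. Endpoints and payload shapes follow the test scripts earlier in this thread; the prompt and adapter id are illustrative.

```python
# Sketch: hot-swap an aLoRA without redoing prefill of the shared context.
import requests

URL = "http://localhost:8081"
context = "Some long shared context that we only want to prefill once...\n"

def complete(prompt: str, alora_scale: float) -> dict:
    resp = requests.post(f"{URL}/v1/completions", json={
        "prompt": prompt,
        "temperature": 0.0,
        "max_tokens": 16,
        "lora": [{"id": 0, "scale": alora_scale}],  # id 0 assumed to be the aLoRA
    })
    resp.raise_for_status()
    return resp.json()

# 1) Base-model pass over the context (adapter disabled)
base = complete(context, 0.0)

# 2) aLoRA pass: same prefix plus the activation sequence, adapter enabled.
#    Since only an alora changed, the cache is kept and only the activation
#    tokens (and anything after them) are prefilled fresh.
uq = complete(context + "<|start_of_role|>certainty<|end_of_role|>", 1.0)

print(base["timings"])
print(uq["timings"])
```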
TODO

The main remaining work for alora is to identify the invocation tokens within an input request and only use the adapter for tokens starting with the invocation sequence. This may require a much deeper intervention to support adapter scaling on a per-token basis rather than on a per-computation basis.
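In practice, the approach that landed (see the "Handle un-cached tokens that come before the alora activation" commit above) sidesteps per-token scaling by splitting the prompt batch at the invocation start. A rough Python rendering of that decision (illustrative only; the real logic lives in the server's prompt/slot handling):

```python
def plan_prompt_batch(prompt_tokens, n_cached, alora_start):
    """Decide which prompt tokens to put in the next batch.

    prompt_tokens: full tokenized prompt
    n_cached:      tokens already reusable from the KV cache
    alora_start:   index where the alora invocation sequence begins

    If there are uncached tokens *before* the invocation start, only process
    those in this batch (with the adapter scale temporarily set to 0.0); the
    invocation tokens and everything after them are left for the next pass,
    which runs with the adapter enabled.
    """
    if n_cached < alora_start:
        return prompt_tokens[n_cached:alora_start], False  # adapter disabled
    return prompt_tokens[n_cached:], True                  # adapter enabled

# Example: 10 cached tokens, invocation starts at index 14
tokens = list(range(20))
batch, adapter_on = plan_prompt_batch(tokens, n_cached=10, alora_start=14)
assert batch == [10, 11, 12, 13] and not adapter_on
```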
Testing

I'm testing this using the following models and adapters:
Conversion
Execution
Sniff test
This script simply verifies that the two adapters can be toggled and that the cache is cleared appropriately. The example inputs are trivial, so the timings are not particularly valuable.
server-req.py
Response