
server : added --no-prefill-assistant flag #13608

Merged
merged 3 commits into ggml-org:master from no-prefill-assistant on May 17, 2025

Conversation

isaac-mcfadyen
Contributor

Following up on PR #13174.

Overview

After some discussion, the decision was made to add an opt-out flag for the assistant prefill behavior so it can be disabled, restoring the previous functionality.

  • This PR adds the --no-prefill-assistant flag, specific to llama-server, along with a corresponding environment variable, LLAMA_ARG_NO_PREFILL_ASSISTANT (see the launch example after this list).
  • When the flag is not specified, the default behavior is to prefill the response from the assistant message if it is at the end of the messages array, so that use cases such as Feature Request: Prefix assistant answer #11536 continue to work.
  • When the flag is specified, the trailing assistant message is treated as a complete message, as was the behavior before Prefilling assistant message in openai compatible API #13174.
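
For reference, here is a minimal launch sketch showing both ways to opt out; the model path is just a placeholder, and only the flag and environment variable names come from this PR:

# Opt out via the CLI flag (model.gguf is a placeholder path)
llama-server -m model.gguf --no-prefill-assistant

# Equivalent opt-out via the environment variable
LLAMA_ARG_NO_PREFILL_ASSISTANT=1 llama-server -m model.gguf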

Testing

Used bartowski/Llama-3.2-1B-Instruct-GGUF for testing as I had it on hand. Tested with both /apply-template and /v1/chat/completions, as they both use the shared prompt templating functions.

/apply-template:

# Flag omitted
curl http://127.0.0.1:8080/apply-template --json '{"messages": [{"role": "assistant", "content": "My name is"}]}' -s
# {"prompt":"<|start_header_id|>assistant<|end_header_id|>\n\nMy name is"}

# --no-prefill-assistant added (also tested with LLAMA_ARG_NO_PREFILL_ASSISTANT=1)
curl http://127.0.0.1:8080/apply-template --json '{"messages": [{"role": "assistant", "content": "My name is"}]}' -s
# {"prompt":"<|start_header_id|>assistant<|end_header_id|>\n\nMy name is<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"}

/v1/chat/completions:

# Flag omitted
curl http://127.0.0.1:8080/v1/chat/completions --json '{"max_tokens": 12, "messages": [{"role": "assistant", "content": "My name is"}]}' -s | jq ".choices[0].message.content"
# " Rohan, and I'm an assistant here. What seems"

# --no-prefill-assistant added (also tested with LLAMA_ARG_NO_PREFILL_ASSISTANT=1)
curl http://127.0.0.1:8080/v1/chat/completions --json '{"max_tokens": 12, "messages": [{"role": "assistant", "content": "My name is"}]}' -s | jq ".choices[0].message.content"
# "It seems like you're about to start a conversation, but"

This is my first non-docs PR to llama.cpp so let me know if I need to make any changes 😅

@ngxson ngxson merged commit 6a2bc8b into ggml-org:master May 17, 2025
46 checks passed
@isaac-mcfadyen isaac-mcfadyen deleted the no-prefill-assistant branch May 18, 2025 01:27
@strawberrymelonpanda
Contributor

strawberrymelonpanda commented May 18, 2025

Glad to see it, though personally I would have preferred the other way around (--prefill-assistant), I think.

Not a big deal, but the general policy I'd personally like to see is that standard behavior should be the default. My understanding is that #13174 is not standard OpenAI API behavior, so now this flag is "needed" to restore it.

Just my 2c.

@isaac-mcfadyen
Contributor Author

though personally I would have preferred the other way around

This was also my personal opinion, but in #13174 the counterargument was that the new behavior had already been the default for a week or two, so reverting it again would be more breaking.

infil00p pushed a commit to baseweight/llama.cpp that referenced this pull request May 22, 2025
* added no-prefill-assistant flag

* reworded documentation comment

* updated server README.md