
server : added --no-prefill-assistant flag #13608

Merged
merged 3 commits into ggml-org:master from no-prefill-assistant on May 17, 2025

Conversation

isaac-mcfadyen
Contributor

Following up on PR #13174.

Overview

After some discussion, the decision was made to add an opt-out flag for the assistant prefill behavior so it can be disabled, restoring the previous functionality.

  • This PR adds the --no-prefill-assistant flag, specific to llama-server, along with a corresponding environment variable, LLAMA_ARG_NO_PREFILL_ASSISTANT (see the launch example after this list).
  • When the flag is not specified, the default behavior is to prefill the response from the assistant message if it is at the end of the messages array, so that use cases such as Feature Request: Prefix assistant answer #11536 continue to work.
  • When the flag is specified, the trailing assistant message is treated as a complete message, as was the behavior before Prefilling assistant message in openai compatible API #13174.
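
For reference, here is a minimal launch sketch showing both ways to opt out; the model path is just a placeholder, and only the flag and environment variable names come from this PR:

# Opt out via the CLI flag (model.gguf is a placeholder path)
llama-server -m model.gguf --no-prefill-assistant

# Equivalent opt-out via the environment variable
LLAMA_ARG_NO_PREFILL_ASSISTANT=1 llama-server -m model.gguf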

Testing

Used bartowski/Llama-3.2-1B-Instruct-GGUF for testing as I had it on hand. Tested with both /apply-template and /v1/chat/completions, as they both use the shared prompt templating functions.

/apply-template:

# Flag omitted
curl http://127.0.0.1:8080/apply-template --json '{"messages": [{"role": "assistant", "content": "My name is"}]}' -s
# {"prompt":"<|start_header_id|>assistant<|end_header_id|>\n\nMy name is"}

# --no-prefill-assistant added (also tested with LLAMA_ARG_NO_PREFILL_ASSISTANT=1)
curl http://127.0.0.1:8080/apply-template --json '{"messages": [{"role": "assistant", "content": "My name is"}]}' -s
# {"prompt":"<|start_header_id|>assistant<|end_header_id|>\n\nMy name is<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"}

/v1/chat/completions:

# Flag omitted
curl http://127.0.0.1:8080/v1/chat/completions --json '{"max_tokens": 12, "messages": [{"role": "assistant", "content": "My name is"}]}' -s | jq ".choices[0].message.content"
# " Rohan, and I'm an assistant here. What seems"

# --no-prefill-assistant added (also tested with LLAMA_ARG_NO_PREFILL_ASSISTANT=1)
curl http://127.0.0.1:8080/v1/chat/completions --json '{"max_tokens": 12, "messages": [{"role": "assistant", "content": "My name is"}]}' -s | jq ".choices[0].message.content"
# "It seems like you're about to start a conversation, but"

This is my first non-docs PR to llama.cpp so let me know if I need to make any changes 😅

@ngxson ngxson merged commit 6a2bc8b into ggml-org:master May 17, 2025
46 checks passed
@isaac-mcfadyen isaac-mcfadyen deleted the no-prefill-assistant branch May 18, 2025 01:27
@strawberrymelonpanda
Contributor

strawberrymelonpanda commented May 18, 2025

Glad to see it, though personally I would have preferred the other way around (--prefill-assistant), I think.

Not a big deal, but the general policy I'd personally like to see is that standard behavior should be the default. My understanding is that #13174 is not standard OpenAI API behavior, so now this flag is "needed" to restore it.

Just my 2c.

@isaac-mcfadyen
Contributor Author

though personally I would have preferred the other way around

This was also my personal opinion, but in #13174 the counterargument was that the new behavior had already been the default for a week or two, so reverting it again would be more breaking.

infil00p pushed a commit to baseweight/llama.cpp that referenced this pull request May 22, 2025
* added no-prefill-assistant flag

* reworded documentation comment

* updated server README.md