Prefilling assistant message in openai compatible API #13174
Conversation
Just a heads-up that this is potentially a very breaking change, especially because this is an OpenAI-compatible API but this is not OpenAI's behavior. The main situation I can think of is if someone wants to generate a new assistant message after the last one - i.e. for ChatML they want the <|im_start|>assistant prompt appended after the final message. I'd suggest we add this to #9291 at a minimum. |
A better alternative would be to use an additional field in the request body, like the prefix flag in the Mistral API. There is also this issue about a prefix API. I think there is also an issue with token healing. |
The feature is aligned with the Claude API and the open-webui client. Using an additional field would not be supported by those clients. |
That is because the Claude API is strictly worse than the Mistral API. You can't even tell whether the Claude API is broken without inspecting the output and you can't shut it off if you don't want that behavior. |
I believe those clients would still allow adding custom fields to the request body. |
I am not aware of clients that support a prefix field. An alternative implementation is an explicit flag in the request body. For reference, here is example code that shows how to use both options:
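A minimal sketch, assuming a llama.cpp server on 127.0.0.1:8080; the "prefill" flag in the second request is a hypothetical name for illustration, not an existing parameter:

# Option 1: prefill by making the last message an assistant message (this PR's behavior)
curl http://127.0.0.1:8080/v1/chat/completions --json '{"messages": [{"role": "user", "content": "Hello?"}, {"role": "assistant", "content": "Hello! How"}]}'

# Option 2: prefill requested via an explicit flag in the request body (hypothetical)
curl http://127.0.0.1:8080/v1/chat/completions --json '{"prefill": true, "messages": [{"role": "user", "content": "Hello?"}, {"role": "assistant", "content": "Hello! How"}]}'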
|
That sounds good, I'd very much vote for this being changed to a field in the body rather than the default. 😄 |
I just noticed that this change affected /apply-template as well. @ngxson Apologies for not pinging when all of this was discussed this past week, but this might be something we want to revert? Aside from not being OpenAI compatible (OpenAI does not have this behavior), it breaks applications that don't expect it... perhaps this could be put behind an optional parameter like discussed above (in a future PR)? For context, this was my use-case that this change broke (with different text but same idea):
curl http://127.0.0.1:8080/apply-template --json '{"messages": [{"role": "user", "content": "Hello?"}, {"role": "assistant", "content": "Hello! How are you?"}]}'
# before: {"prompt":"<|im_start|>user\nHello?<|im_end|>\n<|im_start|>assistant\nHello! How are you?<|im_end|>\n<|im_start|>assistant\n"}
# after: {"prompt":"<|im_start|>user\nHello?<|im_end|>\n<|im_start|>assistant\nHello! How are you?"} |
I think what we can do is add a boolean to control this behaviour. Re. your point about OAI compat, I think OAI doesn't allow 2 assistant messages (correct me if I'm wrong). The original PR suggests that this feature is indeed copied from the Claude API, though tbh I haven't had time to test it myself. Nevertheless, I think we should still keep this feature because it's the simplest way to control reasoning models. |
AFAIK, OAI doesn't allow "assistant" as the last role. It was allowed in older models for prefilling, using the same API as in this PR; in recent models that feature is disabled. I'm thinking of adding a command-line flag to optionally disable this and revert to the older behavior. I have been busy IRL, but I'll submit a PR when I have some time to write the code.
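For illustration, such an opt-out might look like the sketch below; the flag name is hypothetical, not an existing option:

# hypothetical flag: a trailing assistant message is no longer treated as a prefill
llama-server -m model.gguf --no-prefill-assistant |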
Makes sense, and @matteoserva I'm happy to PR a flag if you're good with that. Do we want opt-in or opt-out behavior for the flag? Personally I think opt-in might be better to prevent surprises, but given this is already added we could also do opt-out. |
I have no preference, but logically speaking, since we already introduced this as an "official" feature, we want to avoid a breaking change by allowing opt-out. |
Yeah, I'm certainly happy if you submit the PR. My vote is for opt-out. |
Opt-out makes sense, I'll see about PRing later today when I get the chance! |
This adds support for prefilling the assistant response (or its thought process) using the OpenAI-compatible API.
The same feature is offered by the Claude API, for example.
It can be tested using open-webui or with the following curl command:
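A minimal example, assuming a server on the default 127.0.0.1:8080; the completion continues from the prefilled text instead of starting a fresh assistant turn:

curl http://127.0.0.1:8080/v1/chat/completions --json '{"messages": [{"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "The capital of France is"}]}'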
Example advanced scenario: enforcing a time limit on the thinking process. When the limit is reached, the client interrupts generation, appends </think> to the model's partial response, and resubmits it as an assistant message, forcing the model to stop thinking and produce its final answer.
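A hedged sketch of that flow, assuming a llama.cpp server on 127.0.0.1:8080 and a reasoning model whose thoughts are streamed as regular content; requires curl 7.82+ and jq:

Q="How many prime numbers are there below 100?"

# 1. Stream the response, but cut the model off after 10 seconds of thinking.
PARTIAL=$(timeout 10 curl -sN http://127.0.0.1:8080/v1/chat/completions \
  --json "$(jq -n --arg q "$Q" '{stream: true, messages: [{role: "user", content: $q}]}')" \
  | sed -n 's/^data: //p' | grep -v '^\[DONE\]' \
  | jq -rj '.choices[0].delta.content // empty')

# 2. Close the thought block and resubmit; the model must now answer directly.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  --json "$(jq -n --arg q "$Q" --arg p "$PARTIAL" \
      '{messages: [{role: "user", content: $q}, {role: "assistant", content: ($p + "</think>")}]}')"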