server : streaming of tool calls and thoughts when --jinja is on #12379
Conversation
Pull Request Overview
This PR introduces streaming of tool calls and reasoning content when using the --jinja flag, while also improving partial JSON parsing and the handling of grammar triggers. Key changes include:
- Addition of streaming request utilities and extensive updates to unit tests for tool call and chat completion behavior.
- Integration of a unified chat syntax structure in server.cpp and changes in sampling.cpp to better handle grammar trigger patterns.
- Major enhancements in the JSON partial parsing/healing logic and chat parser to recover truncated model outputs.
Reviewed Changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tools/server/tests/utils.py | Added a new make_any_request method to support both streaming and non-streaming calls. |
| tools/server/tests/unit/test_tool_call.py | Updated test functions to support streaming and adjusted assertions for tool calls. |
| tools/server/tests/unit/test_chat_completion.py | Modified stream assertion checks (using is None rather than empty string). |
| tools/server/server.cpp | Integrated common_chat_syntax and refined grammar trigger handling and logging. |
| tests/test-json-partial.cpp | Added comprehensive tests for partial JSON healing with various fragment examples. |
| tests/test-chat-parser.cpp | Expanded test coverage for chat parsing, including reasoning and regex consumption. |
| src/llama-grammar.cpp | Updated grammar trigger handling to capture the first non-empty capturing group. |
| scripts/tool_bench.py | Enhanced command-line benchmarking parameters by adding chat_template_file support. |
| models/templates/README.md & Qwen-QwQ-32B.jinja | Added documentation and a new Jinja template for Qwen-QwQ-32B. |
| docs/function-calling.md | Included a caution regarding extreme KV quantizations affecting tool calling. |
| common/sampling.cpp | Refactored grammar trigger pattern concatenation by removing patterns_at_start logic. |
| common/json-partial.* | Introduced and refined JSON healing logic with detailed parsing and recovery steps. |
| common/chat*.{h,cpp} | Updated chat structures and the parser to leverage the new chat syntax and reasoning. |
Comments suppressed due to low confidence (2)
tools/server/tests/unit/test_tool_call.py:98
- Consider removing or re-enabling the commented assertion for a non-empty tool call id if it is no longer applicable. This cleanup can help avoid confusion in test outputs and maintain code clarity.
# assert len(tool_call.get("id", "") > 0, f'Expected non empty tool call id in {tool_call}')
common/sampling.cpp:164
- [nitpick] The variable name 'trigger_patterns' (previously replacing 'patterns_at_start') could be made more descriptive. Consider renaming it (e.g. to 'grammar_trigger_patterns') for clarity and consistency with its usage later in the code.
std::vector<std::string> trigger_patterns;
```cpp
it = temptative_end;
return true;
} catch (const std::exception & ex) {
    // No, needs healing.
```
The healing logic for partial JSON parsing is quite complex. Adding detailed inline comments explaining the steps and decisions—especially the conditions under which different closing characters are appended—would improve maintainability and clarity.
I'd be a fan of merging this, even if it was imperfect, so the community can build upon it
+1 to @ericcurtin. This capability starts to enable llama.cpp as a backend for vibe coding with self-hosted models.
Merged this as is 😅🤞🙏
[error] langchain stream tool call error for no tool_call_id error:
I see a similar problem in n8n: [ERROR: All OpenAI tool calls must have an "id" field.]
A similar error occurs with Spring AI
Thanks all for the feedback so far! (and sorry for the ahem lil breakages & unpolished edges 😅) As I have only 10h left of copyright ownership to push fixes 🙈, here's a recap of the PRs out for review (cc/ @ngxson @ggerganov @CISC ):
All pending fixes were merged (thanks for the prompt reviews!), please all sync and report more bugs if you have a chance, I'll do a last round in a few hours ✌🏼 (ideally please provide self-contained repro instructions, esp. for integration issues with agentic frameworks: even if it's on a hacky branch of yours, it will help tremendously)

Update: 1st release w/ the fixes is b5497
Just as an FYI, there's a minor regression with predicted logits.

```console
20:13:18 | /mnt/valerie/public/agent
(.venv) git:(main | Δ) λ python -m agent.openai.stream
{'type': 'role', 'value': 'assistant'}
{'type': 'reasoning.open', 'value': '<think>Okay'}
{'type': 'content', 'value': ','}
{'type': 'content', 'value': ' the'}
```

Previously, in earlier commits before this PR was merged, this was not the case. This is okay here in a controlled fashion, but I suspect this might creep up in unsuspected ways further down the line.

Updated simple example:
```python
# agent.openai.stream
import json
import os
from pprint import pprint
from typing import Any, Dict, Generator, Optional

import dotenv
from openai import OpenAI

from agent.tools import tools


class GPTRequest:
    def __init__(self, base_url: str = None, api_key: str = None):
        self.client = self._connect(base_url, api_key)

    def _connect(self, base_url: str = None, api_key: str = None) -> OpenAI:
        if not base_url or not api_key:
            dotenv.load_dotenv(".env")
        if not base_url:
            base_url = os.getenv("OPENAI_BASE_URL", "")
        if not api_key:
            api_key = os.getenv("OPENAI_API_KEY", "")
        if not api_key:
            raise ValueError("EnvironmentError: OPENAI_API_KEY not set in .env")
        # Setup default base URL if using local mode
        if api_key == "sk-no-key-required" and not base_url:
            base_url = "http://localhost:8080/v1"
        return OpenAI(api_key=api_key, base_url=base_url)

    def _classify_reasoning(self, text: str) -> Optional[str]:
        if "<think>" in text:
            return "reasoning.open"
        if "</think>" in text:
            return "reasoning.close"
        return None

    def _classify_tool(self, tool_call, buffer, args_fragments):
        fn = tool_call.function
        if fn.name:
            buffer["name"] = fn.name
        if fn.arguments:
            args_fragments.append(fn.arguments)
        if fn.arguments and fn.arguments.strip().endswith("}"):
            try:
                args = "".join(args_fragments)
                buffer["arguments"] = json.loads(args)
                args_fragments.clear()
                return {"type": "tool_call", "value": buffer.copy()}
            except json.JSONDecodeError:
                # Wait for more fragments
                pass
        return None

    def stream(self, **kwargs) -> Generator[Dict[str, Any], None, None]:
        response = self.client.chat.completions.create(**kwargs)
        tool_buffer = {}
        args_fragments = []
        for chunk in response:
            delta = chunk.choices[0].delta
            if hasattr(delta, "role") and delta.role:
                yield {"type": "role", "value": delta.role}
            if hasattr(delta, "content") and delta.content:
                reasoning_type = self._classify_reasoning(delta.content)
                yield {"type": reasoning_type or "content", "value": delta.content}
            if hasattr(delta, "tool_calls") and delta.tool_calls:
                for tool_call in delta.tool_calls:
                    result = self._classify_tool(tool_call, tool_buffer, args_fragments)
                    if result:
                        yield result
            if hasattr(delta, "refusal") and delta.refusal:
                yield {"type": "refusal", "value": delta.refusal.model_dump()}
            if hasattr(delta, "reasoning") and delta.reasoning:  # Future-compatible
                yield {"type": "reasoning", "value": delta.reasoning.model_dump()}


def main():
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the weather like in Paris, France?"},
    ]
    request = GPTRequest()
    generator = request.stream(
        model="gpt-3.5-turbo",  # Llama.Cpp expects this model definition
        messages=messages,
        max_tokens=-1,  # Allow the model to stop naturally on its own
        stream=True,
        temperature=0.8,
        tools=tools,
    )
    for obj in generator:
        pprint(obj)


if __name__ == "__main__":
    main()
```

My current suspicion is that it's an issue with token healing which I have to handle upstream because of this issue. It's only obvious here because it shows up right away in a predictable and deterministic fashion.
Thanks so much for pushing this over the finish line @ochafik!
Minja is open-source, right? Do you need copyright ownership?
Both
@strawberrymelonpanda I hope to be able to clarify soon w/ my new employer whether I can contribute to minja, to a fork of it or... not at all (in any case, I've added a couple of extra maintainers from Google to the google/minja repo cc/ @matthewchan-g @Ferev, and there's always the option for llama.cpp to just fork it under ggml-org, etc)
Do maintainers in the Google namespace have to be Google employees? If yes, that might be a decent reason to move to the ggml-org namespace.
- Streams `<think>` reasoning content inside the content (same output for all thinking models when using the default `--reasoning-format deepseek`, even for those not using the `<think>` syntax like Command R7B), and even if the `<think>` tag was added at the end of the prompt by the template (as for DeepSeek R1 & QwQ).
- Raw Python tool calls are returned as `{"code": "json-encoded code"}` (also for multiline programs).

This fixes #12107, #10920, #11861
Follow up to #9639
Pending follow ups / bugfixes:
- server : fix format of streamed tool call deltas (diff name, fix id location) #13800: fixes delta format (return `tool_call.function.name` as deltas, put `id` at the right place, output `type`)
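For reference, the OpenAI streaming shape that clients such as langchain / n8n (see the errors above) expect looks roughly like the sketch below: the `id`, `type` and function name arrive in the first delta, and the JSON-encoded arguments are split across the following ones. The tool name and values are made up for illustration.

```python
# Hypothetical OpenAI-style streamed tool call deltas (illustration only): the
# first chunk carries id/type/name, later chunks carry argument fragments whose
# concatenation is the full JSON arguments string.
chunks = [
    {"delta": {"tool_calls": [{"index": 0, "id": "call_123", "type": "function",
                               "function": {"name": "get_weather", "arguments": ""}}]}},
    {"delta": {"tool_calls": [{"index": 0, "function": {"arguments": "{\"city\": \"Par"}}]}},
    {"delta": {"tool_calls": [{"index": 0, "function": {"arguments": "is\"}"}}]}},
]
args = "".join(c["delta"]["tool_calls"][0]["function"].get("arguments", "") for c in chunks)
print(args)  # -> {"city": "Paris"}
```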
How to test / use
- Get and build this PR's branch
- Run `llama-server` w/ any model (see more details in the tool calling docs; note that some GGUFs require a chat template override!)
- Call the chat completions endpoint in streamed mode with any OpenAI-compatible library, or plain curl:
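For instance, a minimal sketch (not from this PR) using the `openai` Python client against a local server, started e.g. with the `llama-server --jinja -fa -hf bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q6_K_L` command that appears in the TODOs below; the `get_weather` tool is a placeholder:

```python
# Minimal sketch: stream a chat completion with tools from a local llama-server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

stream = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder; a single-model llama-server serves whatever is loaded
    messages=[{"role": "user", "content": "What's the weather like in Paris?"}],
    tools=tools,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:  # regular (or <think>...) content
        print(delta.content, end="", flush=True)
    for tc in delta.tool_calls or []:  # streamed tool call deltas
        print(tc.function.name or "", tc.function.arguments or "", end="", flush=True)
```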
You can also open http://localhost:8080/ to see thoughts being streamed back properly even for models whose template adds an opening `<think>` tag to the end of the prompt (QwQ, now DeepSeek R1 too although most GGUFs have their initial version) and models like Cohere Command R7B that natively use a different thinking tags syntax (now normalized, since `--reasoning-format deepseek` is the default).

Context
Supporting OpenAI's streaming delta format was a bit tricky, as it returns chunks of JSON-encoded arguments for each function call, but that's not necessarily what models give us.
While tool calls are returned in a standard format, each w/ a function name, tool call id and JSON encoded arguments, model outputs vary greatly in their syntax. That syntax mostly uses JSON for arguments but not always.
Function calls and their arguments can be at various levels:
- at the root of the output (e.g. Mistral Nemo: `[TOOL_CALLS][{"name": "special_function", "arguments": {"arg1": 1}, "id": "123456789"}]`)
- inside tags (e.g. Hermes-style: `<tool_call>{"name": "special_function", "arguments": {"arg1": 1}}</tool_call>`; note that some models use other keys here, e.g. `tool_name`, `parameters`, and may have the tool call id too)
- wrapped in model-specific special tokens (e.g. DeepSeek R1: <|tool▁calls▁begin|><|tool▁call▁begin|>function<|tool▁sep|>special_function\n```json\n{"arg1": 1}\n```<|tool▁call▁end|><|tool▁calls▁end|>, or functionary v3.2: `special_function\n{"arg1": 1}`)
- nested inside JSON: `{"tool_call": {"name": "special_function", "arguments": {"arg1": 1}}}` (or inside a `tool_calls` array if `parallel_tool_calls` is on)
- as a `python` tool call, with two variants:
  - `<|python_tag|>multiline python code here` (functionary v3.1), `python\nmultiline python code here` (functionary v3.2; w/ prefix `>>>` if after a textual response)
  - `<|python_tag|>python.call(code="multiline\npython\ncode\nhere")`
Side note about raw python code:
`<|python_tag|>foo.call(bar="baz")` in Llama 3.x style will return `"tool_calls": [{"name": "foo", "arguments": "{\"bar\": \"baz\"}"}]`, while the same output from Functionary would be parsed as `"tool_calls": [{"name": "python", "arguments": "{\"code\": \"foo.call(bar=\\\"baz\\\")\"}"}]`.

Now when streaming, we may have sampled only a prefix of the aforementioned output, and want ideally to parse what can be parsed out of it, and send a JSON-encoded arguments object that is cut at a safe place, so that the sum of all the deltas adds up to the full arguments JSON string.
(A primary use case for partial JSON arguments streaming is streaming large multiline diff tool arguments in tools such as RooCode / Cline / Cursor)
The cleanest option would have been to create a unified parser / state machine that can be drip-fed tokens, and preserve its state in the server slot. But I figured the complexity was too high for now (see notes on speeding up below), and instead I've implemented something definitely inefficient but relatively simple (chat.cpp is still the same size): for every token coming in, I try and parse the entire output so far, with partial regex & JSON parsing support, which allows recovering cleanly cut-off JSON-encoded function arguments (regardless of the original format of said arguments). I then compare the full `common_chat_msg` against the last one we sent back, and compute OpenAI-compatible deltas out of this.
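As a toy illustration of that diffing approach (Python, not the actual chat.cpp code), assuming each re-parse yields a JSON-encoded arguments string that extends the previous one:

```python
# Toy sketch of the delta computation: each re-parse of the full output yields a
# JSON-encoded arguments string cut at a safe place; only the suffix that hasn't
# been sent yet is emitted, so the concatenated deltas add up to the final string.
def argument_deltas(partial_parses: list[str]) -> list[str]:
    sent = ""
    deltas = []
    for args in partial_parses:
        assert args.startswith(sent), "each partial parse must extend the last"
        deltas.append(args[len(sent):])
        sent = args
    return deltas

print(argument_deltas(['{"arg1', '{"arg1": 1', '{"arg1": 1}']))
# -> ['{"arg1', '": 1', '}']
```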
Location, location, location 🏡

Note that the output of the model may be truncated (max token output length reached or streaming in progress), and that may fall inside an expected literal (e.g. `<think>` isn't a single token on QwQ-32B), inside a regex (used for some matchers), or inside some JSON.

But more interesting is where it happens, esp. for partial JSON:

tests/test-chat-parser.cpp should make this a bit clearer, and I'm in the process of adding partial examples w/ the actual formats in tests/test-chat.cpp (look out for `/* is_partial= */ true`).

See examples of streamed tool call deltas
Implementation notes

Partial parsing utils

I added a `common_chat_msg_parser` utility with syntax reminiscent of @ngxson's suggestions in #11607 (comment), but relying on control flow to allow more flexibility:
- partial regexes with `common_regex` (see `common/regex-partial.cpp`); e.g. `/abc/` gives `/((?:(?:c)?b)?a)[\s\S]*/`, with a single capturing group whose end indicates - in reverse - where the partial match started (toy sketch below)
- partial JSON, using `nlohmann/json`'s SAX interface to build location awareness / stack to know how to heal a JSON that fails to parse (`consume_json` accepts a list of json paths under which to expect arguments objects; could be from the root = empty path if the entire json object is an arguments object) (toy sketch below)
- `try_*` parsing methods. This makes the code relatively easy to read and debug. No exotic syntax (apart from `optional`s, they really help here imho), which should make it easier to convert to coroutines when we wanna make it all incremental.

This allows parsing of partial model outputs, whether in streaming mode or when reaching the token limit (currently, tool calls give ugly unparsed outputs when `finish_reason` != `tool_call`).
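A toy Python illustration of the reversed-pattern trick for the literal `/abc/` case mentioned above (not the actual `common/regex-partial.cpp` implementation):

```python
import re

# Match the reversed pattern /((?:(?:c)?b)?a)[\s\S]*/ against the *reversed*
# input; the end of group 1, counted back from the end of the original string,
# is where a (possibly partial) match of "abc" starts.
reversed_abc = re.compile(r"((?:(?:c)?b)?a)[\s\S]*")

def partial_match_start(text: str):
    m = reversed_abc.match(text[::-1])
    return len(text) - m.end(1) if m else None

print(partial_match_start("say ab"))   # -> 4: "ab" could still grow into "abc"
print(partial_match_start("say abc"))  # -> 4: full match
print(partial_match_start("say xyz"))  # -> None
```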
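A toy Python sketch of the healing idea (not the actual `common/json-partial.cpp` logic, which also handles trickier truncation points):

```python
import json

# Track open strings / objects / arrays while scanning the fragment, then append
# whatever closing characters are needed to make the truncated JSON parseable.
def heal_json(fragment: str) -> str:
    closers, in_string, escaped = [], False, False
    for ch in fragment:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            closers.append("}" if ch == "{" else "]")
        elif ch in "}]":
            closers.pop()
    return fragment + ('"' if in_string else "") + "".join(reversed(closers))

# A truncated arguments object becomes parseable again:
print(json.loads(heal_json('{"code": "print(\'hey')))  # -> {'code': "print('hey"}
```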
To think or not to think... what is the prompt?

I've also introduced `common_chat_syntax`, which wraps `common_reasoning_format` and `common_chat_format` together with:
- `thinking_forced_open`: whether the prompt was detected to end w/ a (model-specific) `<think>` tag to force thinking mode
- `reasoning_in_content`: whether the thinking tags should be left in the content, which is currently the case in streaming mode as the DeepSeek API does.

This allows streaming back a standard `<think>...` syntax even for models that use a different set of tags (e.g. Command R7B). And of course, `--reasoning-format none` is still allowed to get the raw output.

Note: Ideally, we'd stream the thoughts as a `reasoning_content` delta (now trivial to implement), but for now we are just aiming for compatibility w/ DeepSeek's API (if `--reasoning-format deepseek`, which is the default).
Triggering thoughts 😓

I noticed DeepSeek R1 Qwen 7B sometimes obsesses over the tool call syntax and "thinks" about how it's gonna call it... which triggers the lazy grammars for said calls before the thoughts are closed.

To address this, I made it possible for `common_chat_templates_apply` to create trigger regexes that match on the entire output (this was already the case in the sampler). `COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN_FULL` (renamed from `_START`) is now expected to have a single capturing group from the start of which the grammar sampler will be activated.
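A toy Python illustration of such a full-output trigger (the pattern below is hypothetical, not one of the actual templates' triggers): the search spans the whole output, and the start of the capturing group marks where the lazy grammar should kick in, so a `<tool_call>` mentioned inside the thoughts doesn't fire it.

```python
import re

# Hypothetical full-output trigger: only fire once the thoughts are closed, and
# report the start of group 1 as the position where the grammar gets activated.
trigger = re.compile(r"</think>[\s\S]*?(<tool_call>)")

def grammar_activation_pos(output: str):
    m = trigger.search(output)
    return m.start(1) if m else None

text = "<think>maybe I should call <tool_call>...</think>\n<tool_call>"
print(grammar_activation_pos(text))  # -> 50, the real tool call after </think>
```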
Functionary v3.2 w/ raw python

Ask `bartowski/functionary-small-v3.2-GGUF:Q4_K_M` to write a hello world in Python and it outputs `python\n{"code": "print('hey')"}`.

But ask it to print a hello world in python w/ matplotlib, and it uses its raw multiline python syntax `python\nprint('hey')\n# many other lines`. This is now supported.
TODOs
- tool-call : ensure there's always a non-empty tool call id #12292
- `logprobs` for tools mode (right now, forbidden; we don't return diffs for every token, for instance if a function name is in multiple tokens we don't want to send its name in chunks)
- … (`llama-server --jinja -fa -hf bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q6_K_L`)
- … (`common_regex`) as separate PR: common : add partial regex support #12808
- … (`common_json`) as separate PR(?) or fold into `chat-parser.cpp`
- … `<|START_RESPONSE|>` at the end of the prompt. Output will contain an `<|END_RESPONSE|>` that needs handling (would fit nicely in the new `common_chat_syntax` struct). Maybe combine w/ forced/disabled thinking modes as a follow up PR
- … `scripts/tool_bench.sh` to compare against `master` (+ compare timings)

Future follow ups:
cc/ @jpohhhh