server: streaming of tool calls and thoughts when --jinja is on #12379

Draft: ochafik wants to merge 19 commits into ggml-org:master from ochafik:tool-diffs
Conversation
Labels: documentation, examples, python, server, testing
This PR is still WIP (see todos at the bottom) but welcoming early feedback / testing
- Streams reasoning content inside `<think>` tags in the content (same output for all thinking models when using the default `--reasoning-format deepseek`, even for those not using the `<think>` syntax, like Command R7B), and even if the `<think>` tag was added at the end of the prompt by the template (as for DeepSeek R1 & QwQ).
- Streams tool call arguments as JSON-encoded strings (e.g. `{"code": "json-encoded code"}` for multiline programs).
- This fixes #12107, #10920, #11861.
- Follow-up to #9639.
Context
Supporting OpenAI's streaming delta format was a bit tricky, as it returns chunks of JSON-encoded arguments for each function call, but that's not necessarily what models give us.
Taking a step back from streaming: Llama 3.x, for instance, may call a couple of "builtin tools" with the syntax `<|python_tag|>foo.call(bar="baz")`, for which the non-streaming API will return `"tool_calls": [{"name": "foo", "arguments": "{\"bar\": \"baz\"}"}]`; the same output in Functionary would be parsed as `"tool_calls": [{"name": "python", "arguments": "{\"code\": \"foo.call(bar=\\\"baz\\\")\"}"}]`.

See examples of streamed tool call deltas.
Now when streaming, we may have sampled only a prefix of the aforementioned output, and ideally we want to parse what can be parsed out of it, and send a JSON-encoded arguments object that is cut at a safe place, so that the sum of all the deltas adds up to the full arguments JSON string.

(A primary use case for partial JSON arguments streaming is streaming large multiline diff tool arguments in tools such as RooCode / Cline / Cursor.)
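As a toy illustration of what "cut at a safe place" means in practice, here is a crude stand-alone scanner that heals a truncated arguments string by closing whatever is still open. This is only a sketch: the real implementation described below uses `nlohmann/json`'s SAX interface and is considerably more careful (this version ignores many edge cases, e.g. truncation right after a key or a comma):

```cpp
#include <iostream>
#include <string>
#include <vector>

// Toy healer: track open strings / objects / arrays in a JSON prefix and
// append the closers needed to make it parseable again.
static std::string heal_json(const std::string & partial) {
    std::vector<char> closers;
    bool in_string = false, escaped = false;
    for (char c : partial) {
        if (in_string) {
            if (escaped)        escaped = false;
            else if (c == '\\') escaped = true;
            else if (c == '"')  in_string = false;
            continue;
        }
        if      (c == '{') closers.push_back('}');
        else if (c == '[') closers.push_back(']');
        else if (c == '}' || c == ']') { if (!closers.empty()) closers.pop_back(); }
        else if (c == '"') in_string = true;
    }
    std::string healed = partial;
    if (escaped)   healed.pop_back(); // drop a dangling backslash
    if (in_string) healed += '"';
    while (!closers.empty()) { healed += closers.back(); closers.pop_back(); }
    return healed;
}

int main() {
    // A cut-off tool call arguments object:
    std::cout << heal_json(R"({"code": "print('hey)") << "\n";
    // -> {"code": "print('hey"}
}
```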
The cleanest option would have been to create a unified parser / state machine that can be drip-fed tokens and preserves its state in the server slot. But I figured the complexity was too high for now (see notes on speeding things up below), and instead I've implemented something definitely inefficient but relatively simple (chat.cpp is still the same size): for every token coming in, I try to parse the entire output so far, with partial regex & JSON parsing support, which allows recovering cleanly cut-off JSON-encoded function arguments (regardless of the original format of said arguments). I then compare the full `common_chat_msg` against the last one we sent back, and compute OpenAI-compatible deltas out of this.
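Since each partial parse cuts the arguments JSON at a safe place, every new partial arguments string extends the previous one, so each OpenAI-compatible delta is just the new suffix. A minimal sketch of that invariant (hypothetical helper, not the actual chat.cpp code):

```cpp
#include <cassert>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical helper: the previous partial arguments string must be a
// prefix of the current one; the streamed delta is the remaining suffix.
static std::string arguments_delta(const std::string & prev, const std::string & curr) {
    assert(curr.compare(0, prev.size(), prev) == 0);
    return curr.substr(prev.size());
}

int main() {
    // Successive safe cuts of the same arguments object:
    const std::vector<std::string> partials = {
        R"({"code": ")",
        R"({"code": "print('hey'))",
        R"({"code": "print('hey')\nprint('ho')"})",
    };
    std::string prev;
    for (const auto & curr : partials) {
        std::cout << "delta: " << arguments_delta(prev, curr) << "\n";
        prev = curr;
    }
    // Concatenating all deltas reconstructs the full arguments JSON string.
}
```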
Implementation notes

Partial parsing utils
I added a `common_chat_msg_parser` utility with syntax reminiscent of @ngxson's suggestions in #11607 (comment), but relying on control flow to allow more flexibility:

- Partial regex matching support in `common_regex` (see `common/regex-partial.cpp`): the reversed input is matched against a reversed pattern, so `/abc/` gives `/((?:(?:c)?b)?a)[\s\S]*/`, with a single capturing group whose end indicates, in reverse, where the partial match started (see the demo right after this list).
- Partial JSON parsing support, using `nlohmann/json`'s SAX interface to build location awareness / a stack that tells us how to heal a JSON string that fails to parse (`consume_json` accepts a list of JSON paths under which to expect arguments objects; this could be from the root = an empty path, if the entire JSON object is an arguments object).
- Exceptions as control flow in the `try_*` parsing methods. This makes the code relatively easy to read and debug. No exotic syntax (apart from `optional`s, they really help here imho), which should make it easier to convert to coroutines when we wanna make it all incremental.

This allows parsing of partial model outputs, whether in streaming mode or when reaching the token limit (currently, tool calls give ugly unparsed outputs when `finish_reason` != `tool_call`).
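Here is a self-contained demo of that reversed-pattern trick, using plain `std::regex` rather than the actual `common_regex` API (the pattern transformation is hand-inlined for `/abc/`):

```cpp
#include <iostream>
#include <regex>
#include <string>

int main() {
    // Reversed form of /abc/: /((?:(?:c)?b)?a)[\s\S]*/ matched against the
    // reversed input detects a partial match of "abc" at the end of the
    // original input; the end of group 1, read in reverse, is where it starts.
    const std::regex reversed_partial(R"(((?:(?:c)?b)?a)[\s\S]*)");

    for (const auto & input : {std::string("hello a"), std::string("hello ab"), std::string("nothing")}) {
        const std::string rev(input.rbegin(), input.rend());
        std::smatch m;
        if (std::regex_match(rev, m, reversed_partial)) {
            const size_t start = input.size() - (m.position(1) + m.length(1));
            std::cout << "\"" << input << "\": partial /abc/ match starts at " << start << "\n";
        } else {
            std::cout << "\"" << input << "\": no partial match\n";
        }
    }
}
```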
To think or not to think... what is the prompt?

I've also introduced `common_chat_syntax`, which wraps `common_reasoning_format` and `common_chat_format` together with:

- `thinking_forced_open`: whether the prompt was detected to end w/ a (model-specific) `<think>` tag to force thinking mode
- `reasoning_in_content`: whether the thinking tags should be left in the content, which is currently the case in streaming mode, as the DeepSeek API does

This allows streaming back a standard `<think>...` syntax even for models that use a different set of tags (e.g. Command R7B). And of course, `--reasoning-format none` is still allowed to get the raw output.

Note: ideally, we'd stream the thoughts as a `reasoning_content` delta (now trivial to implement), but for now we are just aiming for compatibility w/ DeepSeek's API (if `--reasoning-format deepseek`, which is the default).
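For reference, a rough sketch of what this wrapper might look like (names taken from the description above; the actual definition in the code may differ):

```cpp
// Sketch only: the enum values shown are assumptions based on the flags
// discussed in this PR, not a verbatim copy of the real header.
enum common_reasoning_format {
    COMMON_REASONING_FORMAT_NONE,
    COMMON_REASONING_FORMAT_DEEPSEEK,
};
enum common_chat_format {
    COMMON_CHAT_FORMAT_CONTENT_ONLY,
    // ... one entry per supported template family ...
};

struct common_chat_syntax {
    common_chat_format      format;
    common_reasoning_format reasoning_format;
    // Whether the prompt was detected to end with a model-specific <think>
    // tag, forcing thinking mode (DeepSeek R1, QwQ).
    bool thinking_forced_open = false;
    // Whether thinking tags are left inside the content, as the DeepSeek
    // API does in streaming mode.
    bool reasoning_in_content = false;
};
```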
Triggering thoughts 😓

I noticed DeepSeek R1 Qwen 7B sometimes obsesses over the tool call syntax and "thinks" about how it's gonna call it... which triggers the lazy grammars for said calls before the thoughts are closed.

To address this, I made it possible for `common_chat_templates_apply` to create trigger regexes that match on the entire output (this was already the case in the sampler). `COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN_FULL` (renamed from `_START`) is now expected to have a single capturing group, from the start of which the grammar sampler will be activated.
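A minimal illustration of that convention (the `</think>` / `<tool_call>` tags and the pattern are made up for this example; only the capturing-group behaviour mirrors the PR):

```cpp
#include <iostream>
#include <regex>
#include <string>

int main() {
    // Full-output trigger: only fire the tool-call grammar once the thoughts
    // are closed; the grammar sampler would be activated from the start of
    // the single capturing group.
    const std::regex trigger(R"(</think>[\s\S]*?(<tool_call>))");
    const std::string output = "<think>I should call foo...</think>\n<tool_call>";
    std::smatch m;
    if (std::regex_search(output, m, trigger) && m[1].matched) {
        std::cout << "activate grammar at offset " << m.position(1) << "\n";
    }
}
```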
Functionary v3.2 w/ raw python

Ask `bartowski/functionary-small-v3.2-GGUF:Q4_K_M` to write a hello world in Python and it outputs `python\n{"code": "print('hey')"}`.

But ask it to print a hello world in Python w/ matplotlib, and it uses its raw multiline Python syntax `python\nprint('hey')\n# many other lines`. This is now supported.
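Under the hood, turning such raw output into the OpenAI-style arguments shape is just JSON encoding; a quick stand-alone sketch with `nlohmann/json` (not the server's actual parsing path):

```cpp
#include <iostream>
#include <string>
#include <nlohmann/json.hpp>

int main() {
    // Raw multiline python emitted by the model:
    const std::string raw_code = "print('hey')\nimport matplotlib.pyplot as plt\n# many other lines";
    // JSON-encode it into the {"code": ...} arguments object:
    const nlohmann::json args = { { "code", raw_code } };
    std::cout << args.dump() << "\n";
    // -> {"code":"print('hey')\nimport matplotlib.pyplot as plt\n# many other lines"}
}
```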
TODOs

- Land the partial regex support (`common_regex`) as a separate PR
- Land the partial JSON parsing support (`common_json`) as a separate PR (?), or fold it into `chat-parser.cpp`
- Support `logprobs` for tools mode (right now, forbidden: we don't return diffs for every token; for instance, if a function name spans multiple tokens we don't want to send its name in chunks)
- tool-call: ensure there's always a non-empty tool call id #12292
- Run `scripts/tool_bench.sh` to compare against `master` (+ compare timings)

Future follow-ups:
cc/ @jpohhhh