
server: streaming of tool calls and thoughts when --jinja is on #12379

Draft · ochafik wants to merge 19 commits into master

Conversation

@ochafik ochafik commented Mar 14, 2025

This PR is still WIP (see TODOs at the bottom), but early feedback / testing is welcome.

  • Support streaming of tool calls in OpenAI format
  • Improve handling of thinking models (DeepSeek R1 Distills, QwQ, Command R7B):
    • Stream <think> reasoning content inside the content (same output for all thinking models with the default --reasoning-format deepseek, even for those like Command R7B that don't use the <think> syntax), and even if the <think> tag was added at the end of the prompt by the template (as for DeepSeek R1 & QwQ).
    • Avoid spurious triggers from "thoughts about tool calls" (only trigger tool-call grammars after any unclosed thoughts are closed)
  • Improve Functionary v3.2 support (allow raw Python code, which models prefer over {"code": "json-encoded code"} for multiline programs)
  • Support truncated outputs incl. reasoning_content & tool_calls (salvageable fields are returned when finish_reason = length)

This fixes #12107, #10920, #11861

Follow up to #9639

Context

Supporting OpenAI's streaming delta format was a bit tricky, as it returns chunks of JSON-encoded arguments for each function call, but that's not necessarily what models give us.

Taking a step back from streaming: Llama 3.x, for instance, may call a couple of "builtin tools" with the syntax <|python_tag|>foo.call(bar="baz"), for which the non-streaming API will return "tool_calls": [{"name": "foo", "arguments": "{\"bar\": \"baz\"}"}]; the same output from Functionary would be parsed as "tool_calls": [{"name": "python", "arguments": "{\"code\": \"foo.call(bar=\\\"baz\\\")\"}"}].

See examples of streamed tool call deltas
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-3.5-turbo",
    "tools": [
        {
        "type":"function",
        "function":{
            "name":"python",
            "description":"Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
            "parameters":{
            "type":"object",
            "properties":{
                "code":{
                "type":"string",
                "description":"The code to run in the ipython interpreter."
                }
            },
            "required":["code"]
            }
        }
        }
    ],
    "messages": [
        {
        "role": "user",
        "content": "Print a hello world message with python."
        }
    ], "stream": true
}'
data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant","content":null,"tool_calls":[{"index":0,"id":"call_aqwOReHDKPnqiF7NbRxzDTY1","type":"function","function":{"name":"python","arguments":""}}],"refusal":null},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{\""}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"code"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\":\""}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"print"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"('"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"Hello"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":","}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":" World"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"!"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"')"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\"}"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"tool_calls"}]}

data: [DONE]

Now, when streaming, we may have sampled only a prefix of the aforementioned output. Ideally we want to parse whatever can be parsed out of it and send a JSON-encoded arguments object that is cut at a safe place, so that the concatenation of all the deltas adds up to the full arguments JSON string.

(A primary use case for partial JSON arguments streaming is streaming large multiline diff tool arguments in tools such as RooCode / Cline / Cursor)

The cleanest option would have been to create a unified parser / state machine that can be drip-fed tokens and preserve its state in the server slot. But I figured the complexity was too high for now (see notes on speeding things up below), so instead I've implemented something definitely inefficient but relatively simple (chat.cpp is still about the same size): for every incoming token, I try to parse the entire output so far, with partial regex & JSON parsing support, which allows recovering cleanly cut-off JSON-encoded function arguments (regardless of the original format of said arguments). I then compare the full common_chat_msg against the last one we sent back, and compute OpenAI-compatible deltas from the difference.
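To make the diff-then-delta idea concrete, here is a minimal sketch; the struct and function names are illustrative, not the exact ones added by this PR:

// Rough sketch of the diff-then-delta computation (illustrative names, not the
// exact structs/functions from this PR).
#include <string>
#include <vector>

struct tool_call { std::string name, arguments; };  // arguments = JSON-encoded string
struct chat_msg  { std::string content, reasoning_content; std::vector<tool_call> tool_calls; };

// Assumes cur extends prev, which holds because healed arguments are cut at a
// "safe" place and only ever grow as more tokens arrive.
static std::string suffix_since(const std::string & prev, const std::string & cur) {
    return cur.substr(prev.size());
}

// On every new token: re-parse the whole generation so far into cur, then emit
// only what changed since the previously sent message.
chat_msg compute_delta(const chat_msg & prev, const chat_msg & cur) {
    chat_msg delta;
    delta.content           = suffix_since(prev.content, cur.content);
    delta.reasoning_content = suffix_since(prev.reasoning_content, cur.reasoning_content);
    for (size_t i = 0; i < cur.tool_calls.size(); i++) {
        const bool is_new = i >= prev.tool_calls.size();
        tool_call tc;
        tc.name      = is_new ? cur.tool_calls[i].name : "";  // name is sent once, in the first chunk
        tc.arguments = suffix_since(is_new ? std::string() : prev.tool_calls[i].arguments, cur.tool_calls[i].arguments);
        delta.tool_calls.push_back(tc);
    }
    return delta;
}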

Implementation notes

Partial parsing utils

I added a common_chat_msg_parser utility with syntax reminiscent of @ngxson's suggestions in #11607 (comment), but relying on control flow to allow more flexibility:

  • Supports partial regex parsing
    • Since the STL still doesn't have partial matching support (unlike Boost), I had to implement my own in common_regex (see common/regex-partial.cpp).
    • The trick: transform the original regex into a regex that matches the reversed string from its start (e.g. /abc/ gives /((?:(?:c)?b)?a)[\s\S]*/), with a single capturing group whose end indicates, in reverse, where the partial match started (see the sketch below)
  • Supports partial JSON parsing:
    • Used nlohmann/json's SAX interface to build location awareness / stack to know how to heal a JSON that fails to parse
    • Healing the JSON w/ a healing marker that can then be found when visiting the resulting JSON: this lets us drop the things we don't want to heal (e.g. a function name) and cut any JSON-encoded result at the "right" place, which must be somewhere inside the function arguments. consume_json accepts a list of JSON paths under which to expect arguments objects (the empty path, i.e. the root, if the entire JSON object is an arguments object); see the healing sketch below.
  • Supports control flow w/ try_* parsing methods. This makes the code relatively easy to read and debug. No exotic syntax (apart from optionals, which really help here imho), which should make it easier to convert to coroutines when we want to make it all incremental.
  • Supports full or partial parsing w/ same code (throws partial exceptions to interrupt the control flow without making parsing code more complex)

This allows parsing partial model outputs, whether in streaming mode or when the token limit is reached (currently, tool calls yield ugly unparsed outputs when finish_reason != tool_call).
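Here is a minimal, self-contained sketch of the reverse-match trick, using std::regex and a hand-reversed pattern (the real common_regex builds the reversed pattern automatically and exposes a different API):

// Find where a partial match of /abc/ starts at the end of the input by
// matching the reversed pattern against the reversed string.
#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string input = "some text ab";               // ends with a partial match of "abc"
    std::string rev(input.rbegin(), input.rend());    // "ba txet emos"

    // Hand-built reverse pattern for /abc/: /((?:(?:c)?b)?a)[\s\S]*/
    std::regex rev_pattern(R"(((?:(?:c)?b)?a)[\s\S]*)");

    std::smatch m;
    if (std::regex_match(rev, m, rev_pattern) && m[1].matched) {
        // The end of group 1 in the reversed string maps back to the start of
        // the partial match in the original string.
        size_t partial_start = input.size() - (m.position(1) + m.length(1));
        std::cout << "partial match starts at index " << partial_start << "\n"; // 10
    }
}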
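And a sketch of the healing-marker idea on a truncated arguments string. The PR uses nlohmann/json's SAX interface to track nesting; this sketch tracks it by hand, with a simplistic marker, just to show the principle:

// Heal a JSON string that was cut mid-generation, parse it, then cut the
// recovered value at the healing marker.
#include <nlohmann/json.hpp>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::string partial = R"({"code": "print('Hello)";  // cut mid-string
    const std::string marker = "$$HEAL$$";

    // Track whether we are inside a string and which brackets are open.
    std::vector<char> open;
    bool in_string = false, escaped = false;
    for (char c : partial) {
        if (in_string) {
            if (escaped)        { escaped = false; }
            else if (c == '\\') { escaped = true; }
            else if (c == '"')  { in_string = false; }
        } else if (c == '"')    { in_string = true; }
        else if (c == '{' || c == '[') { open.push_back(c); }
        else if (c == '}' || c == ']') { if (!open.empty()) open.pop_back(); }
    }

    // Heal: drop the marker in, close the string and every open bracket.
    std::string healed = partial + marker;
    if (in_string) healed += '"';
    for (auto it = open.rbegin(); it != open.rend(); ++it) {
        healed += (*it == '{') ? '}' : ']';
    }

    auto j = nlohmann::json::parse(healed);               // {"code": "print('Hello$$HEAL$$"}
    std::string code = j.at("code").get<std::string>();
    code.resize(code.find(marker));                        // cut at the healing marker
    std::cout << code << "\n";                             // print('Hello
}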

To think or not to think... what is the prompt?

I've also introduced common_chat_syntax, which wraps common_reasoning_format and common_chat_format together with:

  • thinking_forced_open: whether the prompt was detected to end w/ a (model-specific) <think> tag to force thinking mode
  • reasoning_in_content: whether the thinking tags should be left in the content; this is currently the case in streaming mode, mirroring the DeepSeek API.

This allows streaming back a standard <think>... syntax even for models that use a different set of tags (e.g. Command R7B). And of course, --reasoning-format none is still allowed to get the raw output.

Note: Ideally, we'd stream the thoughts as a reasoning_content delta (now trivial to implement), but for now we are just aiming for compatibility w/ DeepSeek's API (if --reasoning-format deepseek, which is the default).
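For orientation, roughly the shape of the new struct (a sketch only: the field names follow the description above, common_chat_format and common_reasoning_format are the existing enums from common, and the actual definition in the source may differ):

struct common_chat_syntax {
    common_chat_format      format;
    common_reasoning_format reasoning_format;
    // Leave <think>...</think> (or the model-specific equivalent) in the content.
    bool reasoning_in_content = false;
    // The prompt ended with an opened (model-specific) <think> tag, forcing thinking mode.
    bool thinking_forced_open = false;
};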

Triggering thoughts 😓

I noticed DeepSeek R1 Qwen 7B sometimes obsesses over the tool call syntax and "thinks" about how it's gonna call it... which triggers the lazy grammars for said calls before the thoughts are closed.

To address this, I made it possible for common_chat_templates_apply to create trigger regexes that match against the entire output (this was already the case in the sampler). COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN_FULL (renamed from _START) is now expected to have a single capturing group, and the grammar sampler is activated from the start of that group.
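Purely for illustration (the actual trigger patterns are template-specific and generated from the chat template), a full-output trigger of this kind could look like the following, where tool-call syntax that appears inside an unclosed think block does not activate the grammar:

// Hypothetical full-match trigger pattern: the grammar is activated at the
// start of capturing group 1, and only once any <think>...</think> block is
// closed. "<tool_call>" and "<think>" are placeholder tags, not a real template's.
#include <regex>
#include <string>

bool should_trigger(const std::string & output, size_t & trigger_pos) {
    static const std::regex full_pattern(
        R"(^(?:<think>[\s\S]*?</think>|[^<])*(<tool_call>)[\s\S]*)");
    std::smatch m;
    if (std::regex_match(output, m, full_pattern) && m[1].matched) {
        trigger_pos = (size_t) m.position(1);  // grammar sampling starts here
        return true;
    }
    return false;
}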

Functionary v3.2 w/ raw python

Ask bartowski/functionary-small-v3.2-GGUF:Q4_K_M to write a hello world in Python and it outputs python\n{"code": "print('hey')"}.

But ask it to print a hello world in python w/ matplotlib, and it uses its raw multiline python syntax python\nprint('hey')\n# many other lines. This is now supported.

TODOs

  • Add some docs
  • Add more tests
  • Send partial regex (common_regex) as separate PR
  • Send partial JSON (common_json) as separate PR(?) or fold into chat-parser.cpp
  • Decide what to do about logprobs for tools mode (right now, forbidden; we don't return diffs for every token, for instance if a function name is in multiple tokens we don't want to send its name in chunks)
  • Fix tool call id attribution logic (disabled for now) from tool-call: ensure there's always a non-empty tool call id #12292
  • Might need one last diff in the final response after a stream, say, to close any raw python code
  • Run scripts/tool_bench.sh to compare against master (+ compare timings)

Future follow ups:

  • To make this faster, I suggest two options:
    • Wait for the project to switch to C++20 & turn all the parser functions into resumable coroutines (feed them tokens and persist their state in the slot)
    • Only compute and send deltas after N milliseconds

cc/ @jpohhhh

github-actions bot added the documentation, testing, examples, python, and server labels on Mar 14, 2025