Misc. bug: tool call issues with hf unsloth/Qwen2.5-Coder-7B-Instruct-128K-GGUF #12279
Comments
@codefromthecrypt Thank you so much for the detailed & actionable report(s)!!
Fixing in #12292
Fixing in #12293
This one is... interesting. Turns out you can fix this by adding a small repetition penalty:
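Something along these lines (a sketch only, assuming the server invocation used elsewhere in this thread; penalty values of 1.1–1.2 both worked per the discussion below):

# sketch: add a small repetition penalty to the server invocation from this thread
llama-server --jinja -fa -c 0 -hf unsloth/Qwen2.5-Coder-7B-Instruct-128K-GGUF --repeat-penalty 1.1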
Not sure if the need for a repetition penalty is new, but it came up in #12234 (for the Qwen2.5 family) and in #12251 (for QwQ but w/ other specifics, e.g. shorter context length).
@ochafik excellent, thanks for the help on this! I confirmed that your repeat-penalty technique sorted out my Vercel config, too.
@codefromthecrypt Hopefully all fixed now (except the need for a penalty, that is), please let me know of any further issues you may find!
Thanks tons! One last question I suspect I could ask deep research about, but I prefer hearing from you: how did you come upon the repeat penalty solution to tool calls? By accident, or is there a practice doc somewhere I can look at? Does this help solve other matters besides missed tool calls? What downsides are there to adding this (besides knowing you have to)?
@codefromthecrypt Hah, human-to-human knowledge transfer? Old school, I like it ;-) Repetition penalties came up recently in #12234 (comment) (first time I've found an acceptable use case for them, tbh). @edmcman linked to this paper that seems to have introduced it (ref) and suggests a value of 1.2, but I would assume this depends on other sampling settings. Tbh I thought tweaking this kind of parameter was a thing of the past as models got so much better. And in terms of downsides, set the penalty too high and your model will neurotically avoid even useful repetitions.
Appreciate the reply. I will make the eval score high on this human-to-human transfer ;)
@codefromthecrypt Please check out my comment in #12234 (comment) and retry the Vercel test using greedy sampling instead of a repetition penalty, as I explained there. Let me know how it goes.
@ggerganov so, unless I misunderstood your suggestion, adjusting the sampling like this doesn't prevent the tool miss: llama-server --jinja -fa -c 0 --samplers top_k --top-k 1 -hf unsloth/Qwen2.5-Coder-7B-Instruct-128K-GGUF. Unrelated: I was able to reduce the repeat penalty from 1.2 to 1.1 and still reliably proceed.
FYI I can confirm that on the latest llama.cpp, the semantic-kernel and Spring AI examples work. If curious to try Vercel yourself, get https://github.com/elastic/observability-examples/tree/main/genai-function-calling/vercel-ai and set your .env to the following, then follow any of the README instructions (docker or npm).
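(The variable names below are only an illustration of pointing an OpenAI-compatible client at a local llama-server; the actual names expected by that example app may differ.)

# hypothetical .env — variable names are assumptions, not confirmed against the example repo
OPENAI_BASE_URL=http://localhost:8080/v1
OPENAI_API_KEY=unused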
Above are the three examples I can use at the ramalama meetup next week in North Carolina (https://www.meetup.com/raleigh-rhug/events/306421516/). I haven't added ramalama instructions yet, as I was still testing things out.
Could you provide the curl command?
@ggerganov here it is:

Curl command:

curl -sX POST localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "unused",
"temperature": 0,
"messages": [
{
"role": "user",
"content": "What is the latest version of Elasticsearch 8?"
},
{
"role": "assistant",
"content": "",
"tool_calls": [
{
"id": "oj1FYDDy8KuKK4JSNcrljuXoVQ8gLsPW",
"type": "function",
"function": {
"name": "getLatestElasticsearchVersion",
"arguments": "{\"majorVersion\":8}"
}
}
]
},
{
"role": "tool",
"tool_call_id": "oj1FYDDy8KuKK4JSNcrljuXoVQ8gLsPW",
"content": "8.17.3"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "getLatestElasticsearchVersion",
"description": "Get the latest version of Elasticsearch",
"parameters": {
"type": "object",
"properties": {
"majorVersion": {
"type": "number",
"description": "Major version to filter by (e.g. 7, 8). Defaults to latest"
}
},
"additionalProperties": false,
"$schema": "http://json-schema.org/draft-07/schema#"
}
}
}
],
"tool_choice": "auto"
}'
I can't say how the Unsloth version of the model was created and what its effect on the output is, but if you try the vanilla Qwen2.5-Coder-7B-Instruct converted and quantized with llama.cpp, the tool call completes correctly:

# create Q4_K quantization
./bin/llama-quantize ../models/qwen2.5-7b-coder-instruct/ggml-model-f16.gguf ../models/qwen2.5-7b-coder-instruct/ggml-model-q4_k.gguf q4_k
# start server
./bin/llama-server --jinja -fa -c 0 -m ../models/qwen2.5-7b-coder-instruct/ggml-model-q4_k.gguf
# request from your post above:
curl -sX POST localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "unused",
"temperature": 0,
"messages": [...
...
[
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "The latest version of Elasticsearch 8 is 8.17.3."
}
}
]

Using the Unsloth model that you linked indeed produces the reported behavior.
@ggerganov enlightening. thanks! For now, I will consider all the related topics closed. Appreciate all the insights folks, I'm totally good for my conference next week. |
Name and Version
I'm running my server like this, to test #12034
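Presumably along these lines, based on the flags quoted elsewhere in this thread (the exact original command may have differed):

# assumed invocation — reconstructed from flags mentioned later in the thread
llama-server --jinja -fa -c 0 -hf unsloth/Qwen2.5-Coder-7B-Instruct-128K-GGUF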
Using various LLM frameworks in different languages, I couldn't get a successful tool call to complete. I've listed the errors, which vary, in the details below.
Operating systems
No response
Which llama.cpp modules do you know to be affected?
No response
Command line
Problem description & steps to reproduce
I ran each tool calling example app in this directory, catching where it errored via socat -v TCP-LISTEN:8080,fork TCP:localhost:8081, then I re-ran the corresponding curl for that failure.

Semantic Kernel dotnet: fails because tool_call.id is returned empty.
FYI this was first noticed here: microsoft/semantic-kernel#10842
Here's the equiv request in curl:
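A minimal request of that shape, assuming the same getLatestElasticsearchVersion tool definition quoted elsewhere in this thread (the exact payload sent by the example app may differ):

curl -sX POST localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "unused",
  "temperature": 0,
  "messages": [
    { "role": "user", "content": "What is the latest version of Elasticsearch 8?" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "getLatestElasticsearchVersion",
        "description": "Get the latest version of Elasticsearch",
        "parameters": {
          "type": "object",
          "properties": {
            "majorVersion": { "type": "number", "description": "Major version to filter by (e.g. 7, 8). Defaults to latest" }
          },
          "additionalProperties": false
        }
      }
    }
  ],
  "tool_choice": "auto"
}'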
Spring AI: llama-server returns 500: failed to parse messages: Expected 'content'
Notes: if I swap in ollama://qwen2.5:3b in place of llama-server, it completes fine.
Here's the equiv request in curl:
Vercel AI (node.js): returns choices[0].message.content as xml of the tool content instead of completing. The last message sent to the LLM is the result of the tool call; it should have completed the initial request, not reformatted that same message as xml.
Notes: if I swap in ollama://qwen2.5:3b in place of llama-server, it completes fine.

First Bad Commit
No response
Relevant log output