Misc. bug: tool call issues with hf unsloth/Qwen2.5-Coder-7B-Instruct-128K-GGUF #12279

Closed
codefromthecrypt opened this issue Mar 9, 2025 · 13 comments
Labels
bug Something isn't working

@codefromthecrypt

codefromthecrypt commented Mar 9, 2025

Name and Version

I'm running my server like this, to test #12034

llama-server --jinja -fa -c 0 -hf unsloth/Qwen2.5-Coder-7B-Instruct-128K-GGUF

Using various LLM frameworks in different languages, I couldn't get a successful tool call to complete. I've listed the errors, which vary, in the details below.

Operating systems

No response

Which llama.cpp modules do you know to be affected?

No response

Command line

Here's the version of llama.cpp:

$ llama-cli --version
version: 4856 (6fefc05a)
built with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.2.0

Problem description & steps to reproduce

I ran each tool-calling example app in this directory, capturing where each one errored via socat -v TCP-LISTEN:8080,fork TCP:localhost:8081, then re-ran the corresponding curl request for that failure.

Semantic Kernel dotnet: fails because tool_call.id is returned empty.

FYI, this was first noticed in microsoft/semantic-kernel#10842.

Here's the equivalent request in curl:

curl -sX POST localhost:8080/v1/chat/completions   -H "Content-Type: application/json"   -d '{
  "temperature": 0,
  "tools": [
    {
      "function": {
        "description": "Returns the latest GA version of Elasticsearch in \"X.Y.Z\" format.",
        "name": "Elasticsearch-get_latest_version",
        "strict": false,
        "parameters": {
          "type": "object",
          "required": [],
          "properties": {
            "majorVersion": {
              "description": "Major version to filter by (e.g. 7, 8). Defaults to latest",
              "type": "integer"
            }
          }
        }
      },
      "type": "function"
    }
  ],
  "messages": [
    {
      "role": "user",
      "content": "What is the latest version of Elasticsearch 8?"
    }
  ],
  "model": "unused",
  "tool_choice": "auto"
}'|jq .
{
  "choices": [
    {
      "finish_reason": "tool_calls",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "type": "function",
            "function": {
              "name": "Elasticsearch-get_latest_version",
              "arguments": "{\"majorVersion\":8}"
            },
            "id": ""
          }
        ]
      }
    }
  ],
  "created": 1741499613,
  "model": "unused",
  "system_fingerprint": "b4856-6fefc05a",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 32,
    "prompt_tokens": 206,
    "total_tokens": 238
  },
  "id": "chatcmpl-d7mNPLF5fmLGgt7VQjyWuBxrScIKzAXY",
  "timings": {
    "prompt_n": 1,
    "prompt_ms": 55.296,
    "prompt_per_token_ms": 55.296,
    "prompt_per_second": 18.08449074074074,
    "predicted_n": 32,
    "predicted_ms": 1107.194,
    "predicted_per_token_ms": 34.5998125,
    "predicted_per_second": 28.901890725563906
  }
}
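
As a stopgap until there is a server-side fix, one possible client-side shim (a sketch I'm adding purely for illustration, not something Semantic Kernel actually does) is to backfill a placeholder id on any tool call that comes back with an empty id, before handing the response to a framework that requires one:

import uuid

def backfill_tool_call_ids(response: dict) -> dict:
    # Hypothetical helper: fill in a placeholder id for any tool call the
    # server returned with an empty "id", since frameworks like Semantic
    # Kernel need a non-empty id to correlate the later "tool" message with
    # the original call.
    for choice in response.get("choices", []):
        for tool_call in choice.get("message", {}).get("tool_calls", []) or []:
            if not tool_call.get("id"):
                tool_call["id"] = "call_" + uuid.uuid4().hex
    return response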

Spring AI: llama-server returns a 500 "Failed to parse messages: Expected 'content'" error

Notes:

  • This also fails the same way with pydantic-ai
  • If you run ollama://qwen2.5:3b with llama-server via ramalama, it completes fine.

Here's the equivalent request in curl:

$ curl -sX POST http://localhost:8080/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "messages": [
      {
        "content": "What is the latest version of Elasticsearch 8?",
        "role": "user"
      },
      {
        "role": "assistant",
        "tool_calls": [
          {
            "id": "",
            "type": "function",
            "function": {
              "name": "getLatestElasticsearchVersion",
              "arguments": "{\"majorVersion\":8}"
            }
          }
        ]
      },
      {
        "content": "\"8.17.3\"",
        "role": "tool",
        "name": "getLatestElasticsearchVersion",
        "tool_call_id": ""
      }
    ],
    "model": "unused",
    "stream": false,
    "temperature": 0.0,
    "tools": [
      {
        "type": "function",
        "function": {
          "description": "Returns the latest GA version of Elasticsearch in \"X.Y.Z\" format.",
          "name": "getLatestElasticsearchVersion",
          "parameters": {
            "$schema": "https://json-schema.org/draft/2020-12/schema",
            "additionalProperties": false,
            "type": "object",
            "properties": {
              "majorVersion": {
                "type": "integer",
                "format": "int32",
                "description": "Major version to filter by (e.g. 7, 8). Defaults to latest"
              }
            },
            "required": ["majorVersion"]
          }
        }
      }
    ]
  }'|jq .
{
  "error": {
    "code": 500,
    "message": "Failed to parse messages: Expected 'content' (ref: https://github.com/ggml-org/llama.cpp/issues/8367); messages = [\n  {\n    \"content\": \"What is the latest version of Elasticsearch 8?\",\n    \"role\": \"user\"\n  },\n  {\n    \"role\": \"assistant\",\n    \"tool_calls\": [\n      {\n        \"id\": \"\",\n        \"type\": \"function\",\n        \"function\": {\n          \"name\": \"getLatestElasticsearchVersion\",\n          \"arguments\": \"{\\\"majorVersion\\\":8}\"\n        }\n      }\n    ]\n  },\n  {\n    \"content\": \"\\\"8.17.3\\\"\",\n    \"role\": \"tool\",\n    \"name\": \"getLatestElasticsearchVersion\",\n    \"tool_call_id\": \"\"\n  }\n]",
    "type": "server_error"
  }
}
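
Comparing this with the Vercel request below (which sends "content": "" on the assistant tool-call message and does not hit this error), a likely client-side workaround while the server-side parsing is being fixed is to always include an explicit content field. A minimal sketch, assuming you can intercept the message list before it is sent:

def ensure_assistant_content(messages: list) -> list:
    # Hypothetical helper: add an explicit empty "content" field to assistant
    # messages that only carry tool_calls, since the server's message parsing
    # currently rejects assistant messages without a "content" key.
    for message in messages:
        if message.get("role") == "assistant" and "content" not in message:
            message["content"] = ""
    return messages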

Vercel AI (Node.js): returns the tool content as XML in choices[0].message.content instead of completing

The last message sent to the LLM is the result of the tool call; the model should have completed the initial request rather than reformatting that same message as XML.

Notes:

  • If you run ollama://qwen2.5:3b with llama-server via ramalama, it completes fine.
$ curl -sX POST localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "unused",
  "temperature": 0,
  "messages": [
    {
      "role": "user",
      "content": "What is the latest version of Elasticsearch 8?"
    },
    {
      "role": "assistant",
      "content": "",
      "tool_calls": [
        {
          "id": "",
          "type": "function",
          "function": {
            "name": "getLatestElasticsearchVersion",
            "arguments": "{\"majorVersion\":8}"
          }
        }
      ]
    },
    {
      "role": "tool",
      "tool_call_id": "",
      "content": "\"8.17.3\""
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "getLatestElasticsearchVersion",
        "description": "Get the latest version of Elasticsearch",
        "parameters": {
          "type": "object",
          "properties": {
            "majorVersion": {
              "type": "number",
              "description": "Major version to filter by (e.g. 7, 8). Defaults to latest"
            }
          },
          "additionalProperties": false,
          "$schema": "http://json-schema.org/draft-07/schema#"
        }
      }
    }
  ],
  "tool_choice": "auto"
}'|jq .
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "<tool_response>\n\"8.17.3\"\n</tool_response>"
      }
    }
  ],
  "created": 1741500130,
  "model": "unused",
  "system_fingerprint": "b4856-6fefc05a",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 17,
    "prompt_tokens": 267,
    "total_tokens": 284
  },
  "id": "chatcmpl-stqLFsYGVG2NBoW8c5gwNSQtDKQMVbQE",
  "timings": {
    "prompt_n": 64,
    "prompt_ms": 222.386,
    "prompt_per_token_ms": 3.47478125,
    "prompt_per_second": 287.78790031746604,
    "predicted_n": 17,
    "predicted_ms": 565.839,
    "predicted_per_token_ms": 33.28464705882353,
    "predicted_per_second": 30.04388174021232
  }
}

First Bad Commit

No response

Relevant log output

@ochafik
Collaborator

ochafik commented Mar 10, 2025

@codefromthecrypt Thank you so much for the detailed & actionable report(s)!!

Semantic Kernel dotnet: fails because tool_call.id is returned empty.

Fixing in #12292

Spring AI: llama-server returns 500 failed to parse messages: Expected 'content'

Fixing in #12293

Vercel AI (node.js): returns choices[0].message.content xml of the tool content instead of completing

This one is... interesting. Turns out unsloth/Qwen2.5-Coder-32B-Instruct-128K-GGUF is doing fine here, while the 7B model struggles w/ repetitions.

You can fix this by adding a small repetition penalty:

llama-server --jinja -fa -c 0 -hf unsloth/Qwen2.5-Coder-7B-Instruct-128K-GGUF --repeat-penalty 1.2

Not sure if the need for a repetition penalty is new, but it came up in #12234 (for Qwen2.5 family) and in #12251 (for QwQ but w/ other specifics e.g. shorter context length).

@ochafik ochafik added bug Something isn't working and removed bug-unconfirmed labels Mar 10, 2025
@ochafik ochafik self-assigned this Mar 10, 2025
@codefromthecrypt
Author

@ochafik excellent, thanks for the help on this! I confirmed that your repeat-penalty technique sorted out my Vercel config, too.

@ochafik
Collaborator

ochafik commented Mar 10, 2025

@codefromthecrypt Hopefully all fixed now (except the need for the penalty, that is); please let me know of any further issues you may find!

@ochafik ochafik closed this as completed Mar 10, 2025
@codefromthecrypt
Author

Thanks tons! One last question I suspect I could ask deep research about, but I'd prefer hearing from you: how did you come upon the repeat-penalty solution for tool calls? By accident, or is there a practice doc somewhere I can look at? Does this help solve other matters besides missed tool calls? And what downsides are there to adding it (besides knowing you have to)?

@ochafik
Collaborator

ochafik commented Mar 11, 2025

@codefromthecrypt Hah, human to human knowledge transfer? Old school, I like it ;-)

@codefromthecrypt Repetition penalties came up recently in #12234 (comment) (first time I've found an acceptable use case for them, tbh). @edmcman linked to the paper that seems to have introduced them (ref), which suggests a value of 1.2, but I would assume this depends on other sampling settings. Tbh I thought tweaking this kind of parameter was a thing of the past as models got so much better. As for downsides: set the penalty too high and your model will neurotically avoid even useful repetitions.
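
For reference, a minimal sketch of the CTRL-style penalty that paper describes (not llama.cpp's exact implementation): logits of tokens already present in the generated context are pushed down before sampling, so penalty = 1.0 is a no-op and larger values increasingly discourage repeats.

def apply_repetition_penalty(logits, previous_tokens, penalty=1.2):
    # Sketch of the penalty from the linked paper: for every token id that has
    # already appeared, positive logits are divided by `penalty` and negative
    # logits are multiplied by it, making the token less likely either way.
    penalized = list(logits)
    for token_id in set(previous_tokens):
        if penalized[token_id] > 0:
            penalized[token_id] /= penalty
        else:
            penalized[token_id] *= penalty
    return penalized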

@codefromthecrypt
Author

Appreciate the reply. I will make the eval score high on this human-to-human transfer ;)

@ggerganov
Member

@codefromthecrypt Please check out my comment in #12234 (comment) and retry the Vercel test using greedy sampling instead of a repetition penalty, as I explained there. Let me know how it goes.

@codefromthecrypt
Author

@ggerganov so, unless I misunderstood your suggestion, adjusting the sampling like this doesn't prevent the tool miss:

llama-server --jinja -fa -c 0 --samplers top_k --top-k 1 -hf unsloth/Qwen2.5-Coder-7B-Instruct-128K-GGUF

Unrelated, I was able to reduce the repeat penalty from 1.2 to 1.1 and still reliably proceed

@codefromthecrypt
Author

codefromthecrypt commented Mar 11, 2025

FYI, I can confirm that on the latest llama.cpp, the Semantic Kernel and Spring AI examples work.

If you're curious to try the Vercel example yourself, get https://github.com/elastic/observability-examples/tree/main/genai-function-calling/vercel-ai, set your .env to the following, then follow any of the README instructions (Docker or npm):

OPENAI_BASE_URL=http://localhost:8080/v1
OPENAI_API_KEY=unused
CHAT_MODEL=unused
OTEL_SERVICE_NAME=genai-function-calling

The above are the three examples I can use at the ramalama meetup next week in North Carolina (https://www.meetup.com/raleigh-rhug/events/306421516/). I haven't added ramalama instructions yet, as I was still testing things out.

@ggerganov
Member

@ggerganov so, unless I misunderstood your suggestion, adjusting the sampling like this doesn't prevent the tool miss:

llama-server --jinja -fa -c 0 --samplers top_k --top-k 1 -hf unsloth/Qwen2.5-Coder-7B-Instruct-128K-GGUF
Unrelated, I was able to reduce the repeat penalty from 1.2 to 1.1 and still reliably proceed

Could you provide the curl command that I can use to reproduce the problem?

@codefromthecrypt
Author

@ggerganov here it is, run against llama-server --jinja -fa -c 0 -hf unsloth/Qwen2.5-Coder-7B-Instruct-128K-GGUF. Note that it worked once this morning, but every other time it sends back something like this:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "<tool_response>\n8.17.3\n</tool_response>"
      }
    }
  ],
--snip--

Curl command

curl -sX POST localhost:8080/v1/chat/completions   -H "Content-Type: application/json"   -d '{
  "model": "unused",
  "temperature": 0,
  "messages": [
    {
      "role": "user",
      "content": "What is the latest version of Elasticsearch 8?"
    },
    {
      "role": "assistant",
      "content": "",
      "tool_calls": [
        {
          "id": "oj1FYDDy8KuKK4JSNcrljuXoVQ8gLsPW",
          "type": "function",
          "function": {
            "name": "getLatestElasticsearchVersion",
            "arguments": "{\"majorVersion\":8}"
          }
        }
      ]
    },
    {
      "role": "tool",
      "tool_call_id": "oj1FYDDy8KuKK4JSNcrljuXoVQ8gLsPW",
      "content": "8.17.3"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "getLatestElasticsearchVersion",
        "description": "Get the latest version of Elasticsearch",
        "parameters": {
          "type": "object",
          "properties": {
            "majorVersion": {
              "type": "number",
              "description": "Major version to filter by (e.g. 7, 8). Defaults to latest"
            }
          },
          "additionalProperties": false,
          "$schema": "http://json-schema.org/draft-07/schema#"
        }
      }
    }
  ],
  "tool_choice": "auto"
}'

@ggerganov
Member

ggerganov commented Mar 13, 2025

I can't say how the Unsloth version of the model was created or what effect that has on the output, but if you try vanilla Qwen 2.5 7B Coder Instruct, it works as expected with greedy sampling:

# create Q4_K quantization
./bin/llama-quantize ../models/qwen2.5-7b-coder-instruct/ggml-model-f16.gguf ../models/qwen2.5-7b-coder-instruct/ggml-model-q4_k.gguf q4_k

# start server
./bin/llama-server --jinja -fa -c 0 -m ../models/qwen2.5-7b-coder-instruct/ggml-model-q4_k.gguf

# request from your post above:
curl -sX POST localhost:8080/v1/chat/completions   -H "Content-Type: application/json"   -d '{
  "model": "unused",
  "temperature": 0,
  "messages": [...
...

[
  {
    "finish_reason": "stop",
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "The latest version of Elasticsearch 8 is 8.17.3."
    }
  }
]

Using the Unsloth model that you linked indeed produces "content": "<tool_response>\n8.17.3\n</tool_response>", but to me this seems more likely to be an issue with the specific finetune of the model, or with the inference parameters (e.g. RoPE scaling, type, etc.) that it requires to run correctly.

@codefromthecrypt
Author

@ggerganov enlightening, thanks! For now, I will consider all the related topics closed.

Appreciate all the insights folks, I'm totally good for my conference next week.
