Misc. bug: tool call issues with hf unsloth/Qwen2.5-Coder-7B-Instruct-128K-GGUF #12279

Closed
codefromthecrypt opened this issue Mar 9, 2025 · 13 comments
Labels
bug Something isn't working

@codefromthecrypt

codefromthecrypt commented Mar 9, 2025

Name and Version

I'm running my server like this, to test #12034

llama-server --jinja -fa -c 0 -hf unsloth/Qwen2.5-Coder-7B-Instruct-128K-GGUF

Using various LLM frameworks in different languages, I couldn't get a successful tool call to complete. I've listed the errors, which vary, in the details below.

Operating systems

No response

Which llama.cpp modules do you know to be affected?

No response

Command line

Here's the version of llama.cpp:

$ llama-cli --version
version: 4856 (6fefc05a)
built with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.2.0

Problem description & steps to reproduce

I ran each tool-calling example app in this directory, capturing where each one errored via socat -v TCP-LISTEN:8080,fork TCP:localhost:8081, then re-ran the corresponding curl request for that failure.

Semantic Kernel dotnet: fails because tool_call.id is returned empty.

FYI, this was first noticed in microsoft/semantic-kernel#10842.

Here's the equivalent request in curl:

curl -sX POST localhost:8080/v1/chat/completions   -H "Content-Type: application/json"   -d '{
  "temperature": 0,
  "tools": [
    {
      "function": {
        "description": "Returns the latest GA version of Elasticsearch in \"X.Y.Z\" format.",
        "name": "Elasticsearch-get_latest_version",
        "strict": false,
        "parameters": {
          "type": "object",
          "required": [],
          "properties": {
            "majorVersion": {
              "description": "Major version to filter by (e.g. 7, 8). Defaults to latest",
              "type": "integer"
            }
          }
        }
      },
      "type": "function"
    }
  ],
  "messages": [
    {
      "role": "user",
      "content": "What is the latest version of Elasticsearch 8?"
    }
  ],
  "model": "unused",
  "tool_choice": "auto"
}'|jq .
{
  "choices": [
    {
      "finish_reason": "tool_calls",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "type": "function",
            "function": {
              "name": "Elasticsearch-get_latest_version",
              "arguments": "{\"majorVersion\":8}"
            },
            "id": ""
          }
        ]
      }
    }
  ],
  "created": 1741499613,
  "model": "unused",
  "system_fingerprint": "b4856-6fefc05a",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 32,
    "prompt_tokens": 206,
    "total_tokens": 238
  },
  "id": "chatcmpl-d7mNPLF5fmLGgt7VQjyWuBxrScIKzAXY",
  "timings": {
    "prompt_n": 1,
    "prompt_ms": 55.296,
    "prompt_per_token_ms": 55.296,
    "prompt_per_second": 18.08449074074074,
    "predicted_n": 32,
    "predicted_ms": 1107.194,
    "predicted_per_token_ms": 34.5998125,
    "predicted_per_second": 28.901890725563906
  }
}
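
As a stopgap until there is a server-side fix, one possible client-side shim (a sketch I'm adding purely for illustration, not something Semantic Kernel actually does) is to backfill a placeholder id on any tool call that comes back with an empty id, before handing the response to a framework that requires one:

import uuid

def backfill_tool_call_ids(response: dict) -> dict:
    # Hypothetical helper: fill in a placeholder id for any tool call the
    # server returned with an empty "id", since frameworks like Semantic
    # Kernel need a non-empty id to correlate the later "tool" message with
    # the original call.
    for choice in response.get("choices", []):
        for tool_call in choice.get("message", {}).get("tool_calls", []) or []:
            if not tool_call.get("id"):
                tool_call["id"] = "call_" + uuid.uuid4().hex
    return response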

Spring AI: llama-server returns a 500 "Failed to parse messages: Expected 'content'" error

Notes:

  • This also fails the same way with pydantic-ai
  • If you run ollama://qwen2.5:3b with llama-server via ramalama, it completes fine.

Here's the equivalent request in curl:

$ curl -sX POST http://localhost:8080/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "messages": [
      {
        "content": "What is the latest version of Elasticsearch 8?",
        "role": "user"
      },
      {
        "role": "assistant",
        "tool_calls": [
          {
            "id": "",
            "type": "function",
            "function": {
              "name": "getLatestElasticsearchVersion",
              "arguments": "{\"majorVersion\":8}"
            }
          }
        ]
      },
      {
        "content": "\"8.17.3\"",
        "role": "tool",
        "name": "getLatestElasticsearchVersion",
        "tool_call_id": ""
      }
    ],
    "model": "unused",
    "stream": false,
    "temperature": 0.0,
    "tools": [
      {
        "type": "function",
        "function": {
          "description": "Returns the latest GA version of Elasticsearch in \"X.Y.Z\" format.",
          "name": "getLatestElasticsearchVersion",
          "parameters": {
            "$schema": "https://json-schema.org/draft/2020-12/schema",
            "additionalProperties": false,
            "type": "object",
            "properties": {
              "majorVersion": {
                "type": "integer",
                "format": "int32",
                "description": "Major version to filter by (e.g. 7, 8). Defaults to latest"
              }
            },
            "required": ["majorVersion"]
          }
        }
      }
    ]
  }'|jq .
{
  "error": {
    "code": 500,
    "message": "Failed to parse messages: Expected 'content' (ref: https://github.com/ggml-org/llama.cpp/issues/8367); messages = [\n  {\n    \"content\": \"What is the latest version of Elasticsearch 8?\",\n    \"role\": \"user\"\n  },\n  {\n    \"role\": \"assistant\",\n    \"tool_calls\": [\n      {\n        \"id\": \"\",\n        \"type\": \"function\",\n        \"function\": {\n          \"name\": \"getLatestElasticsearchVersion\",\n          \"arguments\": \"{\\\"majorVersion\\\":8}\"\n        }\n      }\n    ]\n  },\n  {\n    \"content\": \"\\\"8.17.3\\\"\",\n    \"role\": \"tool\",\n    \"name\": \"getLatestElasticsearchVersion\",\n    \"tool_call_id\": \"\"\n  }\n]",
    "type": "server_error"
  }
}
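
Comparing this with the Vercel request below (which sends "content": "" on the assistant tool-call message and does not hit this error), a likely client-side workaround while the server-side parsing is being fixed is to always include an explicit content field. A minimal sketch, assuming you can intercept the message list before it is sent:

def ensure_assistant_content(messages: list) -> list:
    # Hypothetical helper: add an explicit empty "content" field to assistant
    # messages that only carry tool_calls, since the server's message parsing
    # currently rejects assistant messages without a "content" key.
    for message in messages:
        if message.get("role") == "assistant" and "content" not in message:
            message["content"] = ""
    return messages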

Vercel AI (Node.js): returns the tool content as XML in choices[0].message.content instead of completing

The last message sent to the LLM is the result of the tool call; the model should have completed the initial request rather than reformatting that same message as XML.

Notes:

  • If you run ollama://qwen2.5:3b with llama-server via ramalama, it completes fine.
$ curl -sX POST localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "unused",
  "temperature": 0,
  "messages": [
    {
      "role": "user",
      "content": "What is the latest version of Elasticsearch 8?"
    },
    {
      "role": "assistant",
      "content": "",
      "tool_calls": [
        {
          "id": "",
          "type": "function",
          "function": {
            "name": "getLatestElasticsearchVersion",
            "arguments": "{\"majorVersion\":8}"
          }
        }
      ]
    },
    {
      "role": "tool",
      "tool_call_id": "",
      "content": "\"8.17.3\""
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "getLatestElasticsearchVersion",
        "description": "Get the latest version of Elasticsearch",
        "parameters": {
          "type": "object",
          "properties": {
            "majorVersion": {
              "type": "number",
              "description": "Major version to filter by (e.g. 7, 8). Defaults to latest"
            }
          },
          "additionalProperties": false,
          "$schema": "http://json-schema.org/draft-07/schema#"
        }
      }
    }
  ],
  "tool_choice": "auto"
}'|jq .
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "<tool_response>\n\"8.17.3\"\n</tool_response>"
      }
    }
  ],
  "created": 1741500130,
  "model": "unused",
  "system_fingerprint": "b4856-6fefc05a",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 17,
    "prompt_tokens": 267,
    "total_tokens": 284
  },
  "id": "chatcmpl-stqLFsYGVG2NBoW8c5gwNSQtDKQMVbQE",
  "timings": {
    "prompt_n": 64,
    "prompt_ms": 222.386,
    "prompt_per_token_ms": 3.47478125,
    "prompt_per_second": 287.78790031746604,
    "predicted_n": 17,
    "predicted_ms": 565.839,
    "predicted_per_token_ms": 33.28464705882353,
    "predicted_per_second": 30.04388174021232
  }
}

First Bad Commit

No response

Relevant log output

@ochafik
Collaborator

ochafik commented Mar 10, 2025

@codefromthecrypt Thank you so much for the detailed & actionable report(s)!!

Semantic Kernel dotnet: fails because tool_call.id is returned empty.

Fixing in #12292

Spring AI: llama-server returns 500 failed to parse messages: Expected 'content'

Fixing in #12293

Vercel AI (node.js): returns choices[0].message.content xml of the tool content instead of completing

This one is... interesting. Turns out unsloth/Qwen2.5-Coder-32B-Instruct-128K-GGUF is doing fine here, while the 7B model struggles w/ repetitions.

You can fix this by adding a small repetition penalty:

llama-server --jinja -fa -c 0 -hf unsloth/Qwen2.5-Coder-7B-Instruct-128K-GGUF --repeat-penalty 1.2

Not sure if the need for a repetition penalty is new, but it came up in #12234 (for Qwen2.5 family) and in #12251 (for QwQ but w/ other specifics e.g. shorter context length).

@ochafik ochafik added bug Something isn't working and removed bug-unconfirmed labels Mar 10, 2025
@ochafik ochafik self-assigned this Mar 10, 2025
@codefromthecrypt
Author

@ochafik excellent, thanks for the help on this! I confirmed that your repeat-penalty technique sorted out my Vercel config, too.

@ochafik
Collaborator

ochafik commented Mar 10, 2025

@codefromthecrypt Hopefully all fixed now (except the need for the penalty, that is); please let me know of any further issues you may find!

@ochafik ochafik closed this as completed Mar 10, 2025
@codefromthecrypt
Author

Thanks tons! One last question I suspect I could ask deep research about, but I'd prefer hearing from you: how did you come upon the repeat-penalty solution for tool calls? By accident, or is there a practice doc somewhere I can look at? Does this help solve other matters besides missed tool calls? And what downsides are there to adding it (besides knowing you have to)?

@ochafik
Collaborator

ochafik commented Mar 11, 2025

@codefromthecrypt Hah, human to human knowledge transfer? Old school, I like it ;-)

@codefromthecrypt Repetition penalties came up recently in #12234 (comment) (first time I've found an acceptable use case for them, tbh). @edmcman linked to the paper that seems to have introduced them (ref), which suggests a value of 1.2, but I would assume this depends on other sampling settings. Tbh I thought tweaking this kind of parameter was a thing of the past as models got so much better. As for downsides: set the penalty too high and your model will neurotically avoid even useful repetitions.
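
For reference, a minimal sketch of the CTRL-style penalty that paper describes (not llama.cpp's exact implementation): logits of tokens already present in the generated context are pushed down before sampling, so penalty = 1.0 is a no-op and larger values increasingly discourage repeats.

def apply_repetition_penalty(logits, previous_tokens, penalty=1.2):
    # Sketch of the penalty from the linked paper: for every token id that has
    # already appeared, positive logits are divided by `penalty` and negative
    # logits are multiplied by it, making the token less likely either way.
    penalized = list(logits)
    for token_id in set(previous_tokens):
        if penalized[token_id] > 0:
            penalized[token_id] /= penalty
        else:
            penalized[token_id] *= penalty
    return penalized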

@codefromthecrypt
Author

Appreciate the reply. I will make the eval score high on this human-to-human transfer ;)

@ggerganov
Member

@codefromthecrypt Please check out my comment in #12234 (comment) and retry the Vercel test using greedy sampling instead of a repetition penalty, as I explained there. Let me know how it goes.

@codefromthecrypt
Author

@ggerganov so, unless I misunderstood your suggestion, adjusting the sampling like this doesn't prevent the tool miss:

llama-server --jinja -fa -c 0 --samplers top_k --top-k 1 -hf unsloth/Qwen2.5-Coder-7B-Instruct-128K-GGUF

Unrelated, I was able to reduce the repeat penalty from 1.2 to 1.1 and still reliably proceed

@codefromthecrypt
Author

codefromthecrypt commented Mar 11, 2025

FYI, I can confirm that on the latest llama.cpp, the Semantic Kernel and Spring AI examples work.

If you're curious to try the Vercel example yourself, get https://github.com/elastic/observability-examples/tree/main/genai-function-calling/vercel-ai, set your .env to the following, then follow any of the README instructions (Docker or npm):

OPENAI_BASE_URL=http://localhost:8080/v1
OPENAI_API_KEY=unused
CHAT_MODEL=unused
OTEL_SERVICE_NAME=genai-function-calling

The above are the three examples I can use at the ramalama meetup next week in North Carolina (https://www.meetup.com/raleigh-rhug/events/306421516/). I haven't added ramalama instructions yet, as I was still testing things out.

@ggerganov
Member

@ggerganov so, unless I misunderstood your suggestion, adjusting the sampling like this doesn't prevent the tool miss:

llama-server --jinja -fa -c 0 --samplers top_k --top-k 1 -hf unsloth/Qwen2.5-Coder-7B-Instruct-128K-GGUF
Unrelated, I was able to reduce the repeat penalty from 1.2 to 1.1 and still reliably proceed

Could you provide the curl command that I can use to reproduce the problem?

@codefromthecrypt
Author

@ggerganov here it is, run against llama-server --jinja -fa -c 0 -hf unsloth/Qwen2.5-Coder-7B-Instruct-128K-GGUF. Note that it worked once this morning, but every other time it sends back something like this:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "<tool_response>\n8.17.3\n</tool_response>"
      }
    }
  ],
--snip--

Curl command

curl -sX POST localhost:8080/v1/chat/completions   -H "Content-Type: application/json"   -d '{
  "model": "unused",
  "temperature": 0,
  "messages": [
    {
      "role": "user",
      "content": "What is the latest version of Elasticsearch 8?"
    },
    {
      "role": "assistant",
      "content": "",
      "tool_calls": [
        {
          "id": "oj1FYDDy8KuKK4JSNcrljuXoVQ8gLsPW",
          "type": "function",
          "function": {
            "name": "getLatestElasticsearchVersion",
            "arguments": "{\"majorVersion\":8}"
          }
        }
      ]
    },
    {
      "role": "tool",
      "tool_call_id": "oj1FYDDy8KuKK4JSNcrljuXoVQ8gLsPW",
      "content": "8.17.3"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "getLatestElasticsearchVersion",
        "description": "Get the latest version of Elasticsearch",
        "parameters": {
          "type": "object",
          "properties": {
            "majorVersion": {
              "type": "number",
              "description": "Major version to filter by (e.g. 7, 8). Defaults to latest"
            }
          },
          "additionalProperties": false,
          "$schema": "http://json-schema.org/draft-07/schema#"
        }
      }
    }
  ],
  "tool_choice": "auto"
}'

@ggerganov
Member

ggerganov commented Mar 13, 2025

I can't say how the Unsloth version of the model was created or what effect that has on the output, but if you try vanilla Qwen 2.5 7B Coder Instruct, it works as expected with greedy sampling:

# create Q4_K quantization
./bin/llama-quantize ../models/qwen2.5-7b-coder-instruct/ggml-model-f16.gguf ../models/qwen2.5-7b-coder-instruct/ggml-model-q4_k.gguf q4_k

# start server
./bin/llama-server --jinja -fa -c 0 -m ../models/qwen2.5-7b-coder-instruct/ggml-model-q4_k.gguf

# request from your post above:
curl -sX POST localhost:8080/v1/chat/completions   -H "Content-Type: application/json"   -d '{
  "model": "unused",
  "temperature": 0,
  "messages": [...
...

[
  {
    "finish_reason": "stop",
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "The latest version of Elasticsearch 8 is 8.17.3."
    }
  }
]

Using the Unsloth model that you linked indeed produces "content": "<tool_response>\n8.17.3\n</tool_response>", but to me this seems more likely to be an issue with the specific finetune of the model, or with the inference parameters (e.g. RoPE scaling, type, etc.) that it requires to run correctly.

@codefromthecrypt
Author

@ggerganov enlightening, thanks! For now, I will consider all the related topics closed.

Appreciate all the insights folks, I'm totally good for my conference next week.
