StableToolBench

Project • Server • Solvable Queries • Inference • StableToolEval • Paper • Citation

Welcome to StableToolBench. Faced with the instability of Tool Learning benchmarks, we developed this new benchmark aiming to balance the stability and reality, based on ToolBench (Qin et al., 2023).

Note for Applying ToolBench keys

Note that if you have applied a ToolBench key but did not get a response for a long time, please contact Shihao Liang ([email protected]) for further assistance.

Updates from StableToolBench-MirrorAPI

The new API simulation model, named MirrorAPI, which is trained to simulate more than 7k tools in ToolBench. You can download it from huggingface.
The new FAC evaluation for StableToolBench, which takes final answers only into account.

Updates

[2024.09.15] We found there exist some problems in the inference codes of ToolLLaMA v2 and we update model performance accordingly.
[2024.06.19] We update the OpenAI API to the newest version, which also support parallel function calling now. We also updated the model performance evaluation using gpt-4-turbo-2024-04-09, replacing gpt-4-turbo-preview, which we found may produce unstable evaluations. The inference results (run in Feb 2024) can be found on Huggingface.

Features

Based on the large scale of ToolBench, we introduce the following features to ensure the stability and reality of the benchmark:

MirrorAPI, which is a trained on real request-response pairs to stably mirror more than 7k API's behaviours.
Virtual API System, which comprises a caching system and API simulators. The caching system stores API call responses to ensure consistency, while the API simulators, powered by LLMs, are used for unavailable APIs. Note that we keep the large-scale diverse APIs environment from ToolBench.
A New Set of Solvable Queries. Query solvability is hard to determine on the fly, causing significant randomness and instability. In StableToolBench, we use state-of-the-art LLMs to determine task solvability to filter queries beforehand. We maintain the same query and answer format as ToolBench for seamless transition from it.
Stable Evaluation System: Implements a two-phase evaluation process using GPT-4 as an automatic evaluator. It involves judging the solvability of tasks and employing metrics like Solvable Pass Rate (SoPR) and Solvable Win Rate (SoWR). Starting with MirrorAPI, we also provide an end-to-end trained evaluator, which takes only input query and final answer into account and gives more stable and straightforward evaluation.

The Virtual API Server

We now provide two simulating systems, the MirrorAPI server and the GPT based caching system.

Before you run any code, please first set up the environment by running pip install -r requirements.txt.

The MirrorAPI server

Downloading Tools and Models

You need to download a set of tools to start the server. You can use either the tool set we crawled on Apr 2024, which you can download from HuggingFace or the tools for the ToolBench/StableToolBench test set, which you can download from ToolBench.

We provide two versions of model, the MirrorAPI, trained for general tool responses, and MirrorAPI-Cache, which is trained on the cache of StableToolBench for better test set tool responses. You can download them from the link above.

Starting the server

To start the server, you need to install vllm. Then you can start a model by running

vllm serve {model-path} --api-key EMPTY --port 12345 --served-model-name {model-name}

Then you need to fill the model-name, api-key and port you specified in server/config_mirrorapi.yml (or server/config_mirrorapi_cache.yml if you are running MirrorAPI-Cache), along with the tool folder you downloaded tools into. The parameters in the config files are:

api_key: The API key for VLLM model.
api_base: The API base for VLLM models. Normally http://127.0.0.1:{port}/v1
model: The {model-name} you specified in VLLM.
temperature: The temperature for LLM simulation. The default value is 0.
tools_folder: The tools environment folder path. Default to ./tools.
port: The server port to run on, default to 8080.

Then you can run python main_mirrorapi.py or python main_mirrorapi_cache.py to run the API server.

The GPT based caching system

Our Virtual API server featured two components, the API simulation system with GPT 4 Turbo and the caching system. We provide two methods to use the virtual API system: building from source and using our prebuilt Docker.

To start the server, you need to provide a cache directory and an OpenAI key.

Downloading the cache

We provide a cache to download from HuggingFace or Tsinghua Cloud. After downloading the cache, unzip the folder into the server folder and ensure the server folder contains tool_response_cache folder and tools folder. The resulting folder of server looks like:

├── /server/
│  ├── /tools/
│  │  └── ...
│  ├── /tool_response_cache/
│  │  └── ...
│  ├── config.yml
│  ├── main.py
│  ├── utils.py

Running the server directly

You need to first specify your configurations in server/config.yml before running the server. Parameters needed are:

api_key: The API key for OpenAI models.
api_base: The API base for OpenAI models if you are using Azure.
model: The OpenAI model to use. The default value is gpt-4-turbo-preview.
temperature: The temperature for LLM simulation. The default value is 0.
toolbench_url: The real ToolBench server URL. The default value is http://8.218.239.54:8080/rapidapi.
tools_folder: The tools environment folder path. Default to ./tools.
cache_folder: The cache folder path. Default to ./tool_response_cache.
is_save: A flag to indicate whether to save real and simulated responses into the cache. The new cache is saved at ./tool_response_new_cache.
port: The server port to run on, default to 8080.

Now you can run the server by running:

cd server
python main.py

The server will be run at http://localhost:{port}/virtual. To use the server, you will further need a toolbench key. You can apply one from this form.

Running the server using Docker

We provide a Dockerfile for easy deployment and consistent server environment. This allows you to run the server on various platforms that support Docker.

Prerequisites:

Docker installed: https://docs.docker.com/engine/install/

Building the Docker Image:

Navigate to your project directory in the terminal.
Build the Docker image using the following command:

docker build -t my-fastapi-server .  # Replace 'my-fastapi-server' with your desired image name
docker run -p {port}:8080 my-fastapi-server  # Replace 'my-fastapi-server' with your image name

Using the Prebuilt Docker Image

You can also use our prebuilt Docker image from Docker Hub hosted at https://hub.docker.com/repository/docker/zhichengg/stb-docker/general. Before running the docker, you will need to install docker and download the cache files as described in Building from Source. Then you can run the server using the following command:

docker pull zhichengg/stb-docker:latest
docker run -p {port}:8080 -v {tool_response_cache_path}:/app/tool_response_cache -v {tools_path}:/app/tools -e OPENAI_API_KEY= -e OPENAI_API_BASE= zhichengg/stb-docker

Remember to fill in the port, tool_response_cache_path, and tools_path with your own values. The OPENAI_API_KEY and OPENAI_API_BASE are the OpenAI API key and API base if you are using Azure. The server will be run at http://localhost:{port}/virtual.

Testing the Server

You can test the server with

import requests
import json
import os

url = 'http://0.0.0.0:8080/virtual'
data = {
    "category": "Artificial_Intelligence_Machine_Learning",
    "tool_name": "TTSKraken",
    "api_name": "List Languages",
    "tool_input": '{}',
    "strip": "truncate",
    "toolbench_key": ""
}
headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json',
}

# Make the POST request
response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.text)

Solvable Queries

The original queries are curated without considering the solvability but judging the solvability with ChatGPT on the fly will cause significant instability. Therefore, we judge the solvability of the original queries with the majority vote of gpt-4-turbo, gemini-pro and claude-2. The filtered queries are saved in solvable_queries.

Inference With Our StableToolBench Server

If you have not set up the environment, please first do so by running pip install -r requirements.txt. We currently implement all models and algorithms supported by ToolBench. We show ChatGPT (gpt-3.5-turbo-16k) with CoT as an example here. The script is also shown in inference_chatgpt_pipeline_virtual.sh. An example of the results is shown in data_example/answer.

To use ChatGPT, run:

export TOOLBENCH_KEY=""
export OPENAI_KEY=""
export OPENAI_API_BASE="" 
export PYTHONPATH=./
export GPT_MODEL="gpt-3.5-turbo-16k"
export SERVICE_URL="http://localhost:8080/virtual"
export OUTPUT_DIR="data/answer/virtual_chatgpt_cot"
group=G1_instruction
mkdir -p $OUTPUT_DIR; mkdir -p $OUTPUT_DIR/$group

python toolbench/inference/qa_pipeline_multithread.py \
    --tool_root_dir toolenv/tools \
    --backbone_model chatgpt_function \
    --openai_key $OPENAI_KEY \
    --max_observation_length 1024 \
    --method CoT@1 \
    --input_query_file solvable_queries/test_instruction/${group}.json \
    --output_answer_file $OUTPUT_DIR/$group \
    --toolbench_key $TOOLBENCH_KEY \
    --num_thread 1

StableToolEval

We follow the evaluation process of ToolBench. The difference is that we update the evaluation logic of the Pass Rate and Win Rate, resulting in the Solvable Pass Rate and Solvable Win Rate.

The first step is to prepare data. This step is the same as ToolEval in ToolBench. The following paragraph is adapted from ToolBench. To evaluate your model and method using ToolEval, you first need to prepare all the model predictions for the six test subsets. Create a directory naming with your model and method, e.g. chatgpt_cot then put each test set's predictions under the directory. The file structure of the directory should be:

├── /chatgpt_cot/
│  ├── /G1_instruction/
│  │  ├── /[email protected]
│  │  └── ...
│  ├── /G1_tool/
│  │  ├── /[email protected]
│  │  └── ...
│  ├── ...
│  ├── /G3_instruction/
│  │  ├── /[email protected]
│  │  └── ...

Then preprocess the predictions by running the following commands:

cd toolbench/tooleval
export RAW_ANSWER_PATH=../../data_example/answer
export CONVERTED_ANSWER_PATH=../../data_example/model_predictions_converted
export MODEL_NAME=virtual_chatgpt_cot
export test_set=G1_instruction

mkdir -p ${CONVERTED_ANSWER_PATH}/${MODEL_NAME}
answer_dir=${RAW_ANSWER_PATH}/${MODEL_NAME}/${test_set}
output_file=${CONVERTED_ANSWER_PATH}/${MODEL_NAME}/${test_set}.json

python convert_to_answer_format.py\
    --answer_dir ${answer_dir} \
    --method CoT@1 # DFS_woFilter_w2 for DFS \
    --output ${output_file}

Next, you can calculate the Solvable Pass Rate. Before running the process, you need to specify your evaluation OpenAI key in openai_key.json as follows:

[
    {
        "api_key": "your_openai_key",
        "api_base": "your_organization"
    },
    ...
]

Then calculate SoPR with :

cd  toolbench/tooleval
export API_POOL_FILE=../../openai_key.json
export CONVERTED_ANSWER_PATH=../../data_example/model_predictions_converted
export SAVE_PATH=../../data_example/pass_rate_results
mkdir -p ${SAVE_PATH}
export CANDIDATE_MODEL=virtual_chatgpt_cot
export EVAL_MODEL=gpt-4-turbo-preview
mkdir -p ${SAVE_PATH}/${CANDIDATE_MODEL}

python eval_pass_rate.py \
    --converted_answer_path ${CONVERTED_ANSWER_PATH} \
    --save_path ${SAVE_PATH}/${CANDIDATE_MODEL} \
    --reference_model ${CANDIDATE_MODEL} \
    --test_ids ../../solvable_queries_example/test_query_ids \
    --max_eval_threads 35 \
    --evaluate_times 3 \
    --test_set G1_instruction

Note that we use gpt-4-turbo-preview as the standard evaluation model, which provided much better stability than gpt-3.5 series models.

The result files will be stored under the ${SAVE_PATH}.

Then you can calculate the SoWR. The below example takes ChatGPT-CoT as the reference model and ChatGPT-DFS as the candidate model. Note that you need to get both model's pass rate results first.

cd  toolbench/tooleval
export API_POOL_FILE=../../openai_key.json
export CONVERTED_ANSWER_PATH=../../data_example/model_predictions_converted
export SAVE_PATH=../../data_example/preference_results
export PASS_RATE_PATH=../../data_example/pass_rate_results
export REFERENCE_MODEL=virtual_chatgpt_cot
export CANDIDATE_MODEL=virtual_chatgpt_dfs
export EVAL_MODEL=gpt-4-turbo-preview
mkdir -p ${SAVE_PATH}


python eval_preference.py \
    --converted_answer_path ${CONVERTED_ANSWER_PATH} \
    --reference_model ${REFERENCE_MODEL} \
    --output_model ${CANDIDATE_MODEL} \
    --test_ids ../../solvable_queries_example/test_query_ids/ \
    --save_path ${SAVE_PATH} \
    --pass_rate_result_path ${PASS_RATE_PATH} \
    --max_eval_threads 10 \
    --use_pass_rate true \
    --evaluate_times 3 \
    --test_set G1_instruction

The result files will be stored under the ${SAVE_PATH}.

The MirrorAPI Final-Answer-Correctness (FAC) Evaluation

To run the FAC evaluation, you need to use the converted answer stated above. Then you can run the evaluation by running the following code (also shown in run_fac_eval.sh):

cd  toolbench/tooleval

export MODEL_PATH="Your path to the FAC model"
export CONVERTED_ANSWER_PATH=../../data_example/model_predictions_converted
export SAVE_PATH=../../data_example/fac_results
mkdir -p ${SAVE_PATH}

GROUP="The group name"
CANDIDATE_MODEL="Your candidiate model"
python tool_eval.py \
                    --model_path $MODEL_PATH \
                    --evaluation_path $MODEL_FILE \
                    --output_path $SAVE_PATH/$CANDIDATE_MODEL/$GROUP.csv \
                    --ids ../../solvable_queries_example/test_query_ids/${GROUP}.json

MirrorAPI Training and Evaluation

We also publish the data and metrics used in the training and evaluation of MirrorAPI. The training and testing data can be found at huggingface. The newly created ToolBench test set used to compare real and simulated data can also be found at huggingface.

We use FastChat to perform LLM-as-a-Judge. The prompt we used can be found at Table 12 of our paper.

Model Experiments Results

MirrorAPI-Cache Based System

Solvable Pass Rate Score

We evaluate the results with gpt-4o.

Method	I1 Inst	I1 Cat	I1 Tool	I2 Cat	I2 Inst	I3 Inst	Average
ToolLLaMA v2 CoT	28.0±1.9	30.5±0.8	21.5±0.9	19.9±1.0	22.3±0.4	19.1±0.8	22.8±0.8
ToolLLaMA v2 DFS	28.4±0.9	32.5±0.8	22.2±1.0	22.8±1.5	19.2±1.6	18.6±1.5	22.9±1.4
GPT 4o mini CoT	27.8±1.4	34.9±0.3	34.2±0.5	24.5±1.0	22.3±2.7	20.8±1.5	25.9±1.7
GPT 4o mini DFS	26.8±1.4	36.4±1.6	33.1±1.1	25.8±1.7	25.8±2.7	20.2±0.8	26.4±1.6
GPT 4o CoT	33.3±2.0	35.1±0.6	33.6±0.8	32.5±1.7	29.6±1.6	27.9±3.5	32.0±2.2
GPT 4o DFS	32.7±1.9	42.3±1.3	34.6±1.3	32.8±1.5	28.3±1.3	23.0±1.3	30.9±1.7

FAC Score

Method	I1 Inst	I1 Cat	I1 Tool	I2 Cat	I2 Inst	I3 Inst	Average
ToolLLaMA v2 CoT	45.4	38.6	34.2	40.3	37.7	31.1	37.9
ToolLLaMA v2 DFS	47.9	40.5	31.0	40.3	34.0	31.1	37.5
GPT 4o mini CoT	42.3	39.9	38.0	44.4	36.8	36.1	39.6
GPT 4o mini DFS	46.0	43.8	44.3	41.1	34.9	34.4	40.8
GPT 4o CoT	45.4	43.8	44.3	54.0	45.3	32.8	44.3
GPT 4o DFS	46.6	53.6	44.9	50.0	42.5	34.4	45.3

GPT Based Caching System

Below are the main results (Inference done in Feb 2024). The win rate for each model is compared with ChatGPT-ReACT. We use gpt-4-turbo-2024-04-09 as the evaluator. Evaluation done in May 2024.

Note that the ToolLLaMA v2 performance is update on 15 Sep 2024 with the new inference codes. Legacy performance can be found here

Solvable Pass Rate:

Method	I1 Instruction	I1 Category	I1 Tool	I2 Category	I2 Instruction	I3 Instruction	Average
GPT-3.5-Turbo-0613 (CoT)	52.2±1.1	47.3±0.6	53.6±1.3	42.5±2.1	35.8±2.0	48.1±0.8	46.6±1.3
GPT-3.5-Turbo-0613 (DFS)	60.3±1.3	66.2±1.2	67.1±0.0	59.1±0.4	51.3±1.2	73.8±2.3	63.0±1.1
GPT-4-0613 (CoT)	45.5±0.4	57.4±0.3	48.8±0.7	43.0±0.7	46.5±0.9	48.1±1.5	48.2±0.8
GPT-4-0613 (DFS)	57.3±0.6	57.3±0.3	60.9±1.0	57.9±1.0	51.3±0.8	66.4±2.4	58.5±1.0
ToolLLaMA v2 (CoT)	51.8±0.4	53.1±0.6	46.4±1.2	51.6±1.1	48.9±0.4	37.2±0.8	48.2±0.8
ToolLLaMA v2 (DFS)	61.0±1.8	58.8±0.5	45.6±0.9	60.3±1.3	53.5±1.8	48.1±1.5	54.6±1.3
GPT-3.5-Turbo-1106 (CoT)	50.4±0.5	45.1±1.4	50.8±0.3	48.7±0.8	42.1±0.4	55.7±0.0	48.8±0.6
GPT-3.5-Turbo-1106 (DFS)	62.8±0.3	63.9±1.2	65.6±0.3	56.5±0.7	56.9±1.2	67.2±1.3	62.2±0.8
GPT-4-Turbo-Preview (CoT)	52.8±1.3	56.6±0.9	51.9±0.5	51.9±1.0	52.8±0.8	52.5±0.0	53.1±0.8
GPT-4-Turbo-Preview (DFS)	59.2±0.5	61.7±0.7	65.7±1.0	55.6±0.6	55.2±0.4	66.1±4.3	60.6±1.3

In this experiment, we run all models once, evaluate them three times, and take the average results.

Solvable Win Rate: (Reference model: ChatGPT-CoT)

Method	I1 Instruction	I1 Category	I1 Tool	I2 Instruction	I2 Category	I3 Instruction	Average
GPT-3.5-Turbo-0613 (DFS)	60.7	67.3	59.5	63.2	62.1	75.4	64.7
GPT-4-0613 (CoT)	54.6	58.8	58.2	75.5	60.5	62.3	61.7
GPT-4-0613 (DFS)	62.6	62.7	58.2	74.5	62.9	67.2	64.7
ToolLLaMA v2 (CoT)	41.7	45.1	32.3	52.8	46.8	26.2	40.8
ToolLLaMA v2 (DFS)	42.3	51.0	31.0	67.0	54.0	31.1	54.0
GPT-3.5-Turbo-1106 (CoT)	47.2	47.7	44.9	50.9	54.0	62.3	51.2
GPT-3.5-Turbo-1106 (DFS)	55.8	53.6	51.9	68.9	59.7	68.9	59.8
GPT-4-Turbo-Preview (CoT)	71.2	77.1	61.4	79.2	71.8	67.2	71.3
GPT-4-Turbo-Preview (DFS)	73.0	75.2	68.4	77.4	66.9	60.7	70.2
We run all models once against `GPT-3.5-Turbo-0613 + CoT` and evaluate them three times. We follow the ToolBench implementation to take the most frequent result for each query during evaluation.

Acknowledgement

We thank Jingwen Wu and Yao Li for their contributions to experiments and result presentation. We also appreciate Yile Wang and Jitao Xu for their valuable suggestions during discussions.

Citation

@misc{guo2024stabletoolbench,
      title={StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models}, 
      author={Zhicheng Guo and Sijie Cheng and Hao Wang and Shihao Liang and Yujia Qin and Peng Li and Zhiyuan Liu and Maosong Sun and Yang Liu},
      year={2024},
      eprint={2403.07714},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Name		Name	Last commit message	Last commit date
Latest commit History 57 Commits
data_example		data_example
server		server
solvable_queries		solvable_queries
solvable_queries_example		solvable_queries_example
toolbench		toolbench
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
inference_chatgpt_pipeline_virtual.sh		inference_chatgpt_pipeline_virtual.sh
legacy_results.md		legacy_results.md
openai_key.json		openai_key.json
requirements.txt		requirements.txt
run_convert_answer.sh		run_convert_answer.sh
run_fac_eval.sh		run_fac_eval.sh
run_pass_rate.sh		run_pass_rate.sh
run_preference.sh		run_preference.sh
stbicon.svg		stbicon.svg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

StableToolBench

Note for Applying ToolBench keys

Updates from StableToolBench-MirrorAPI

Updates

Features

The Virtual API Server

The MirrorAPI server

Downloading Tools and Models

Starting the server

The GPT based caching system

Downloading the cache

Running the server directly

Running the server using Docker

Using the Prebuilt Docker Image

Testing the Server

Solvable Queries

Inference With Our StableToolBench Server

StableToolEval

The MirrorAPI Final-Answer-Correctness (FAC) Evaluation

MirrorAPI Training and Evaluation

Model Experiments Results

MirrorAPI-Cache Based System

GPT Based Caching System

Acknowledgement

Citation

About

Releases

Packages

Contributors 4

Languages

License

THUNLP-MT/StableToolBench

Folders and files

Latest commit

History

Repository files navigation

StableToolBench

Note for Applying ToolBench keys

Updates from StableToolBench-MirrorAPI

Updates

Features

The Virtual API Server

The MirrorAPI server

Downloading Tools and Models

Starting the server

The GPT based caching system

Downloading the cache

Running the server directly

Running the server using Docker

Using the Prebuilt Docker Image

Testing the Server

Solvable Queries

Inference With Our StableToolBench Server

StableToolEval

The MirrorAPI Final-Answer-Correctness (FAC) Evaluation

MirrorAPI Training and Evaluation

Model Experiments Results

MirrorAPI-Cache Based System

GPT Based Caching System

Acknowledgement

Citation

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Languages

Packages