A repository for evaluating AudioLLMs in various tasks.
AudioBench: A Universal Benchmark for Audio Large Language Models
Come and view our live leaderboard on Hugging Face Space:
AudioBench Leaderboard | Huggingface Datasets | AudioLLM Paper Collection
- Mar 2025: Supported the phi_4_multimodal_instruct model and GigaSpeech 2 evaluation (Thai, Vietnamese, and Indonesian).
- Mar 2025: Supported the MMAU test set: multiple-choice questions for speech, audio, and music understanding!
- Mar 2025: AudioBench now supports over 50 datasets!
- Mar 2025: Supported the SEAME test sets (dev), a code-switching dataset for Chinese and Singapore-accented English.
- Jan 2025: The AudioBench paper is accepted to the NAACL 2025 Main Conference.
- Jan 2025: Supported 10+ MNSC (Singlish understanding) datasets; the results are updated on the leaderboard.
- Dec 2024: Supported more datasets (35) and more models (2 cascade and 3 fusion models).
- Sep 2024: Added the MuChoMusic dataset for music evaluation (multiple-choice questions).
- Aug 2024: Supported 6 speech translation datasets and updated the evaluation script for several MCQ evaluations.
- Aug 2024: The leaderboard is live. Check it out here.
- Jul 2024: We are working hard on the leaderboard and speech translation datasets. Stay tuned!
- Jul 2024: Supported all initial 26 datasets listed in the AudioBench manuscript.
- librispeech_test_clean, ASR, English, Metric: wer
- librispeech_test_other, ASR, English, Metric: wer
- common_voice_15_en_test, ASR, English, Metric: wer
- peoples_speech_test, ASR, English, Metric: wer
- gigaspeech_test, ASR, English, Metric: wer
- tedlium3_test, ASR, English, Metric: wer
- tedlium3_long_form_test, ASR, English, Long recording, Metric: wer
- earnings21_test, ASR, English, Long recording, Metric: wer
- earnings22_test, ASR, English, Long recording, Metric: wer
- aishell_asr_zh_test, ASR, Chinese, Metric: wer
- covost2_en_id_test, Speech Translation, English-Indonesian, Metric: bleu
- covost2_en_zh_test, Speech Translation, English-Chinese, Metric: bleu
- covost2_en_ta_test, Speech Translation, English-Tamil, Metric: bleu
- covost2_id_en_test, Speech Translation, Indonesian-English, Metric: bleu
- covost2_zh_en_test, Speech Translation, Chinese-English, Metric: bleu
- covost2_ta_en_test, Speech Translation, Tamil-English, Metric: bleu
- cn_college_listen_mcq_test, Speech Question Answering, Multiple Choice, Metric: llama3_70b_judge, gpt4o_judge
- slue_p2_sqa5_test, Speech Question Answering, Metric: llama3_70b_judge, gpt4o_judge
- dream_tts_mcq_test, Speech Question Answering, Multiple Choice, Metric: llama3_70b_judge, gpt4o_judge
- public_sg_speech_qa_test, Speech Question Answering, Metric: llama3_70b_judge, gpt4o_judge
- spoken_squad_test, Speech Question Answering, Metric: llama3_70b_judge, gpt4o_judge
- openhermes_audio_test, Speech Instruction, Metric: llama3_70b_judge, gpt4o_judge
- alpaca_audio_test, Speech Instruction, Metric: llama3_70b_judge, gpt4o_judge
- clotho_aqa_test, Speech Question Answering, Metric: llama3_70b_judge, gpt4o_judge
- wavcaps_qa_test, Audio Scene Question Answering, Metric: llama3_70b_judge, gpt4o_judge
- audiocaps_qa_test, Audio Scene Question Answering, Metric: llama3_70b_judge, gpt4o_judge
- wavcaps_test, Audio Scene Question Answering, Metric: llama3_70b_judge, meteor, gpt4o_judge
- audiocaps_test, Audio Scene Question Answering, Metric: llama3_70b_judge, meteor, gpt4o_judge
- iemocap_emotion_test, Emotion Recognition, Metric: llama3_70b_judge, gpt4o_judge
- meld_sentiment_test, Emotion Recognition, Metric: llama3_70b_judge, gpt4o_judge
- meld_emotion_test, Emotion Recognition, Metric: llama3_70b_judge, gpt4o_judge
- voxceleb_accent_test, Accent Recognition, Metric: llama3_70b_judge, gpt4o_judge
- voxceleb_gender_test, Gender Recognition, Metric: llama3_70b_judge, gpt4o_judge
- iemocap_gender_test, Gender Recognition, Metric: llama3_70b_judge, gpt4o_judge
- muchomusic_test, Music Understanding, Metric: llama3_70b_judge, gpt4o_judge
- imda_part1_asr_test, Singlish ASR, Metric: wer
- imda_part2_asr_test, Singlish ASR, Metric: wer
- imda_part3_30s_asr_test, Singlish ASR, Metric: wer
- imda_part4_30s_asr_test, Singlish ASR, Metric: wer
- imda_part5_30s_asr_test, Singlish ASR, Metric: wer
- imda_part6_30s_asr_test, Singlish ASR, Metric: wer
- imda_part3_30s_sqa_human_test, Singlish Speech Question Answering, Metric: llama3_70b_judge, gpt4o_judge
- imda_part4_30s_sqa_human_test, Singlish Speech Question Answering, Metric: llama3_70b_judge, gpt4o_judge
- imda_part5_30s_sqa_human_test, Singlish Speech Question Answering, Metric: llama3_70b_judge, gpt4o_judge
- imda_part6_30s_sqa_human_test, Singlish Speech Question Answering, Metric: llama3_70b_judge, gpt4o_judge
- imda_part3_30s_ds_human_test, Singlish Speech Summarization, Metric: llama3_70b_judge, gpt4o_judge
- imda_part4_30s_ds_human_test, Singlish Speech Summarization, Metric: llama3_70b_judge, gpt4o_judge
- imda_part5_30s_ds_human_test, Singlish Speech Summarization, Metric: llama3_70b_judge, gpt4o_judge
- imda_part6_30s_ds_human_test, Singlish Speech Summarization, Metric: llama3_70b_judge, gpt4o_judge
- imda_ar_sentence, Singlish, Accent Recognition, Metric: llama3_70b_judge, gpt4o_judge
- imda_ar_dialogue, Singlish, Accent Recognition, Metric: llama3_70b_judge, gpt4o_judge
- imda_gr_sentence, Singlish, Gender Recognition, Metric: llama3_70b_judge, gpt4o_judge
- imda_gr_dialogue, Singlish, Gender Recognition, Metric: llama3_70b_judge, gpt4o_judge
- seame_dev_man, English-Chinese Code-Switching, Metric: wer
- seame_dev_sge, English-Chinese Code-Switching, Metric: wer
- mmau_mini, Audio Understanding and Reasoning, Multiple Choice Questions, Metric: llama3_70b_judge, string_match, gpt4o_judge
- gigaspeech2_thai, ASR for Thai, Metric: wer
- gigaspeech2_indo, ASR for Indonesian, Metric: wer
- gigaspeech2_viet, ASR for Vietnamese, Metric: wer
- ASCEND, English-Chinese Code-Switching, Metric: wer
- fleurs, Speech Translation
- AIR-Bench, AIR-Bench tasks
How to evaluate with the supported datasets? It is as simple as replacing the DATASET and METRIC names, for example:
DATASET=librispeech_test_clean
METRIC=wer
To evaluate on your own dataset, two simple steps:
- Make a copy of one of the customized dataset loaders (example: cn_college_listen_mcq_test) and adapt it to your own dataset.
- Add a new entry in dataset.py.
- Done! A minimal sketch is shown below.
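For illustration, here is a minimal sketch of what a custom dataset loader and its registration might look like. The names used (load_my_dataset, my_dataset_test, and the sample field keys) are hypothetical; mirror an existing loader such as cn_college_listen_mcq_test for the exact structure the evaluation harness expects.

# dataset_loader_my_dataset.py -- hypothetical example modeled on the existing loaders
from datasets import load_dataset  # Hugging Face datasets library

def load_my_dataset(number_of_samples: int = -1):
    """Load a custom test split and return samples in the format the evaluator expects."""
    data = load_dataset("my_org/my_dataset", split="test")  # hypothetical dataset ID
    if number_of_samples > 0:
        data = data.select(range(number_of_samples))

    samples = []
    for item in data:
        samples.append({
            "audio": item["audio"],            # decoded audio (array + sampling rate)
            "instruction": item["question"],   # text prompt given to the AudioLLM
            "answer": item["answer"],          # reference answer used by the metric
        })
    return samples

# In dataset.py, register the new name so eval.sh can resolve it
# (illustrative; follow how the existing entries are registered):
# "my_dataset_test": load_my_dataset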
- cascade_whisper_large_v3_llama_3_8b_instruct
- cascade_whisper_large_v2_gemma2_9b_cpt_sea_lionv3_instruct
- MERaLiON-AudioLLM-Whisper-SEA-LION
- Qwen-Audio-Chat
- Qwen2-Audio-7B-Instruct
- SALMONN_7B: requires an extra git clone.
- WavLLM_fairseq: no longer supported, as inference takes too much effort.
- whisper_large_v3
- whisper_large_v2
- gemini-1.5-flash: API key needed
- gemini-2-flash: API key needed
- gpt-4o-audio: API key needed
- phi_4_multimodal_instruct
- seallms_audio_7b
- ultravox: https://huggingface.co/fixie-ai/ultravox-v0_5-llama-3_1-8b / https://www.ultravox.ai/
- llama3_s
- audio-flamingo-2
- GLM4-Voice
- Mini-Omni
- SLAM-Omni
- llama3.1-typhoon2-audio-8b-instruct: https://huggingface.co/scb10x/llama3.1-typhoon2-audio-8b-instruct
- DiVA-llama-3-v0-8b: https://huggingface.co/WillHeld/DiVA-llama-3-v0-8b
As long as a model can run inference, you can load it and collect its responses for scoring. To evaluate new models, please refer to adding_new_model; a minimal sketch of the expected wrapper is shown below.
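As a rough illustration, a wrapper only needs to turn (audio, instruction) pairs into text responses. Everything below (class name, method signature, sample fields) is an assumption, so follow adding_new_model and an existing model file for the real interface.

# my_audio_llm.py -- hypothetical wrapper sketch, not the actual AudioBench interface
class MyAudioLLM:
    """Any model qualifies as long as it maps (audio, text instruction) -> text response."""

    def __init__(self, model_path: str):
        # Load your checkpoint or API client here (transformers, fairseq, REST, ...).
        self.model_path = model_path

    def generate(self, sample: dict) -> str:
        # Assumed sample fields: sample["audio"]["array"], sample["audio"]["sampling_rate"],
        # and sample["instruction"]. Run inference and return the plain-text response;
        # the harness then scores it with the chosen metric (wer, bleu, llama3_70b_judge, ...).
        raise NotImplementedError("Replace with your model's inference call.")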
Installation with pip:
pip install -r requirements.txt
For model-as-judge evaluation, we serve the judgement model as a service via vllm on port 5000.
The example below hosts a Llama-3-70B-Instruct judge and runs the cascade Whisper + Llama-3 model for inference.
# Step 1:
# Serve the judgement model using the vLLM framework (this example uses an int4 quantized version)
# This requires 1 x 80GB GPU
bash vllm_model_judge_llama_3_70b.sh
# Step 2:
# Perform model inference and obtain the evaluation results on the second GPU
GPU=2
BATCH_SIZE=1
OVERWRITE=True
NUMBER_OF_SAMPLES=-1 # -1 means evaluating on all test samples
MODEL_NAME=Qwen2-Audio-7B-Instruct
DATASET=cn_college_listen_mcq_test
METRICS=llama3_70b_judge
bash eval.sh $DATASET $MODEL_NAME $GPU $BATCH_SIZE $OVERWRITE $METRICS $NUMBER_OF_SAMPLES
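For reference, here is a minimal sketch of how a model-as-judge metric could query the locally hosted judge, assuming the vLLM server exposes its standard OpenAI-compatible endpoint on port 5000. The prompt wording, served model name, and scoring scale are illustrative, not the exact implementation used by AudioBench.

# judge_query_example.py -- illustrative only
from openai import OpenAI

# vLLM can serve an OpenAI-compatible API; here we assume it listens on localhost:5000.
client = OpenAI(base_url="http://localhost:5000/v1", api_key="EMPTY")

def judge_response(question: str, reference: str, prediction: str) -> str:
    """Ask the hosted Llama-3-70B judge to rate a model prediction against the reference."""
    prompt = (
        "You are grading an audio-LLM answer.\n"
        f"Question: {question}\nReference answer: {reference}\nModel answer: {prediction}\n"
        "Rate the model answer from 0 to 5 and briefly justify the score."
    )
    completion = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-70B-Instruct",  # must match the name the server registers
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return completion.choices[0].message.content

print(judge_response("What did the speaker order?", "A cup of coffee.", "He ordered coffee."))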
If you find our work useful, please consider citing our paper!
@article{wang2024audiobench,
title={AudioBench: A Universal Benchmark for Audio Large Language Models},
author={Wang, Bin and Zou, Xunlong and Lin, Geyu and Sun, Shuo and Liu, Zhuohan and Zhang, Wenyu and Liu, Zhengyuan and Aw, AiTi and Chen, Nancy F},
journal={NAACL},
year={2025}
}
Email: [email protected]
- Llama3-S: When Llama Learns to Listen
- lmms-eval: https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/docs/lmms-eval-0.3.md
- More to come...
- Features
- Evaluation with audio/speech generation
- Evaluation with multiround chatbot
- Support other models as judge and report the results
- Update AISHELL from WER to CER
- Bugs
- Threading for model-as-judge evaluation
- Post-processing script for IMDA Part 4, which contains code-switching in 4 languages
- Xue Cong Tey (MMAU-mini Dataset)