
AudioLLMs/AudioBench


🔥 AudioBench 🔥


⚡ A repository for evaluating AudioLLMs on various tasks 🚀 ⚡
⚡ AudioBench: A Universal Benchmark for Audio Large Language Models 🚀 ⚡
🌟 Come and view our live leaderboard on Hugging Face Spaces 🌟

🏠 AudioBench Leaderboard | 🤗 Huggingface Datasets | 🤗 AudioLLM Paper Collection

πŸ“ Change log

  • Mar 2025: Added support for the phi_4_multimodal_instruct model and GigaSpeech 2 evaluation (Thai, Vietnamese, and Indonesian).
  • Mar 2025: Added the MMAU test set: multiple-choice questions for speech, audio, and music understanding!
  • Mar 2025: AudioBench now supports over 50 datasets!!
  • Mar 2025: Added the SEAME test sets (dev), a code-switching corpus of Mandarin and Singapore-accented English.
  • JAN 2025: The AudioBench paper is accepted to the NAACL 2025 main conference.
  • JAN 2025: Added 10+ MNSC (Singlish understanding) datasets; results are updated on the leaderboard.
  • DEC 2024: Added more datasets (35 in total) and more models (2 cascade and 3 fusion models).
  • SEP 2024: Add MuChoMusic dataset for music evaluation (multiple choice questions).
  • AUG 2024: Added six speech translation datasets and updated the evaluation scripts for several MCQ tasks.
  • AUG 2024: Leaderboard is live. Check it out here.
  • JUL 2024: We are working hard on the leaderboard and speech translation dataset. Stay tuned!
  • JUL 2024: Support all initial 26 datasets listed in the AudioBench manuscript.


Supported Evaluation Data

Evaluating on a supported dataset is as simple as it gets: just set the DATASET and METRIC names.

DATASET=librispeech_test_clean
METRIC=wer
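
For example, to score LibriSpeech test-clean with WER, the same eval.sh interface shown in the Quick Start below can be reused (the model name and GPU settings here are only illustrative):

# Illustrative run, reusing the eval.sh argument order from the Quick Start example below.
DATASET=librispeech_test_clean
MODEL_NAME=Qwen2-Audio-7B-Instruct   # any supported model
GPU=1
BATCH_SIZE=1
OVERWRITE=True
METRICS=wer
NUMBER_OF_SAMPLES=-1                 # -1 means all test samples

bash eval.sh $DATASET $MODEL_NAME $GPU $BATCH_SIZE $OVERWRITE $METRICS $NUMBER_OF_SAMPLES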

How to Evaluate on Your Own Dataset?

Three simple steps:

  1. Make a copy of one of the existing dataset loaders (for example, cn_college_listen_mcq_test) and customize it for your own dataset.
  2. Add a new entry for it in dataset.py (a rough sketch is shown below).
  3. Done!
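
As a rough sketch only (the function name, dataset ID, and field names below are hypothetical, not the repository's actual loader API; copy an existing loader such as cn_college_listen_mcq_test for the real structure), a custom loader mainly needs to return the audio together with a prompt and a reference answer for each sample:

# Hypothetical sketch of a custom dataset loader -- the real loaders in this
# repository may use different names and fields.
from datasets import load_dataset  # Hugging Face datasets

def my_custom_dataset_test():
    """Load a custom test set and map it to (audio, instruction, answer) samples."""
    ds = load_dataset("my_org/my_audio_dataset", split="test")  # hypothetical dataset ID
    samples = []
    for row in ds:
        samples.append({
            "audio": row["audio"],             # audio array or file path
            "instruction": row["question"],    # prompt given to the AudioLLM
            "answer": row["reference_answer"], # gold answer used by the metric
        })
    return samples

Once the loader works, register its name in dataset.py so that DATASET=my_custom_dataset_test resolves to it.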

Supported Models

How to evaluate your own models?

As long as the model can run inference, you can load it and generate responses for evaluation. To evaluate new models, please refer to adding_new_model.
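
The adding_new_model guide documents the actual interface; as a minimal, hypothetical sketch (the class and method names below are illustrative, not the repository's real base class), a wrapper only has to turn an audio sample plus a text instruction into a text response:

# Hypothetical model wrapper sketch -- follow adding_new_model for the actual
# interface expected by the evaluation scripts.
class MyAudioLLM:
    def __init__(self, model_path: str):
        # Load your checkpoint or API client here (transformers, vllm, ...).
        self.model_path = model_path

    def generate(self, audio, instruction: str) -> str:
        """Return the model's text response for one audio sample and one prompt."""
        # Replace with real inference; the evaluator only needs the response text,
        # which is then scored by WER or a model-as-judge metric.
        return "model response"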

🔧 Installation

Installation with pip:

pip install -r requirements.txt

⏩ Quick Start

For model-as-judge evaluation, we serve the judge model as a service via vLLM on port 5000.

The example below hosts a Llama-3-70B-Instruct judge and runs the cascade Whisper + Llama-3 model.

# Step 1:
# Serve the judge model using the vLLM framework (this example uses an int4-quantized version)
# This requires one 80GB GPU
bash vllm_model_judge_llama_3_70b.sh

# Step 2:
# Run model inference and compute the evaluation results on the second GPU
GPU=2
BATCH_SIZE=1
OVERWRITE=True
NUMBER_OF_SAMPLES=-1 # -1 means evaluate on all test samples

MODEL_NAME=Qwen2-Audio-7B-Instruct

DATASET=cn_college_listen_mcq_test
METRICS=llama3_70b_judge

bash eval.sh $DATASET $MODEL_NAME $GPU $BATCH_SIZE $OVERWRITE $METRICS $NUMBER_OF_SAMPLES

📖 Citation

If you find our work useful, please consider citing our paper!

@article{wang2024audiobench,
  title={AudioBench: A Universal Benchmark for Audio Large Language Models},
  author={Wang, Bin and Zou, Xunlong and Lin, Geyu and Sun, Shuo and Liu, Zhuohan and Zhang, Wenyu and Liu, Zhengyuan and Aw, AiTi and Chen, Nancy F},
  journal={NAACL},
  year={2025}
}

To submit your model to the leaderboard

Email: [email protected]

Researchers, companies or groups that are using AudioBench:

To-Do List

  • Features
    • Evaluation with audio/speech generation
    • Evaluation with multiround chatbot
    • Support other model-as-judge options and report the results
    • Update AI-SHELL from WER to CER
  • Bugs
    • Threads of model-as-judge
    • Post-processing script for IMDA PART4, which contains code-switching across 4 languages.

Contributors

  • Xue Cong Tey (MMAU-mini Dataset)
