A repository for evaluating AudioLLMs in various tasks.
AudioBench: A Universal Benchmark for Audio Large Language Models
Come and view our live leaderboard on Hugging Face Space:
AudioBench Leaderboard | Huggingface Datasets | AudioLLM Paper Collection
- Mar 2025: Supported the phi_4_multimodal_instruct model and GigaSpeech 2 evaluation (Thai, Vietnamese, and Indonesian).
- Mar 2025: Supported the MMAU test set: multiple-choice questions for speech, audio, and music understanding!
- Mar 2025: AudioBench now supports over 50 datasets!
- Mar 2025: Supported the SEAME test sets (dev), a code-switching dataset for Chinese and Singapore-accented English.
- Jan 2025: The AudioBench paper is accepted to the NAACL 2025 Main Conference.
- Jan 2025: Supported 10+ MNSC (Singlish understanding) datasets; the results are updated on the leaderboard.
- Dec 2024: Supported more datasets (35) and more models (2 cascade and 3 fusion models).
- Sep 2024: Added the MuChoMusic dataset for music evaluation (multiple-choice questions).
- Aug 2024: Supported 6 speech translation datasets and updated the evaluation script for several MCQ evaluations.
- Aug 2024: The leaderboard is live. Check it out here.
- Jul 2024: We are working hard on the leaderboard and speech translation datasets. Stay tuned!
- Jul 2024: Supported all initial 26 datasets listed in the AudioBench manuscript.
- librispeech_test_clean, ASR, English, Metric: wer
- librispeech_test_other, ASR, English, Metric: wer
- common_voice_15_en_test, ASR, English, Metric: wer
- peoples_speech_test, ASR, English, Metric: wer
- gigaspeech_test, ASR, English, Metric: wer
- tedlium3_test, ASR, English, Metric: wer
- tedlium3_long_form_test, ASR, English, Long recording, Metric: wer
- earnings21_test, ASR, English, Long recording, Metric: wer
- earnings22_test, ASR, English, Long recording, Metric: wer
- aishell_asr_zh_test, ASR, Chinese, Metric: wer
- covost2_en_id_test, Speech Translation, English-Indonesian, Metric: bleu
- covost2_en_zh_test, Speech Translation, English-Chinese, Metric: bleu
- covost2_en_ta_test, Speech Translation, English-Tamil, Metric: bleu
- covost2_id_en_test, Speech Translation, Indonesian-English, Metric: bleu
- covost2_zh_en_test, Speech Translation, Chinese-English, Metric: bleu
- covost2_ta_en_test, Speech Translation, Tamil-English, Metric: bleu
- cn_college_listen_mcq_test, Speech Question Answering, Multiple Choice, Metric: llama3_70b_judge, gpt4o_judge
- slue_p2_sqa5_test, Speech Question Answering, Metric: llama3_70b_judge, gpt4o_judge
- dream_tts_mcq_test, Speech Question Answering, Multiple Choice, Metric: llama3_70b_judge, gpt4o_judge
- public_sg_speech_qa_test, Speech Question Answering, Metric: llama3_70b_judge, gpt4o_judge
- spoken_squad_test, Speech Question Answering, Metric: llama3_70b_judge, gpt4o_judge
- openhermes_audio_test, Speech Instruction, Metric: llama3_70b_judge, gpt4o_judge
- alpaca_audio_test, Speech Instruction, Metric: llama3_70b_judge, gpt4o_judge
- clotho_aqa_test, Speech Question Answering, Metric: llama3_70b_judge, gpt4o_judge
- wavcaps_qa_test, Audio Scene Question Answering, Metric: llama3_70b_judge, gpt4o_judge
- audiocaps_qa_test, Audio Scene Question Answering, Metric: llama3_70b_judge, gpt4o_judge
- wavcaps_test, Audio Scene Question Answering, Metric: llama3_70b_judge, meteor, gpt4o_judge
- audiocaps_test, Audio Scene Question Answering, Metric: llama3_70b_judge, meteor, gpt4o_judge
- iemocap_emotion_test, Emotion Recognition, Metric: llama3_70b_judge, gpt4o_judge
- meld_sentiment_test, Emotion Recognition, Metric: llama3_70b_judge, gpt4o_judge
- meld_emotion_test, Emotion Recognition, Metric: llama3_70b_judge, gpt4o_judge
- voxceleb_accent_test, Accent Recognition, Metric: llama3_70b_judge, gpt4o_judge
- voxceleb_gender_test, Gender Recognition, Metric: llama3_70b_judge, gpt4o_judge
- iemocap_gender_test, Gender Recognition, Metric: llama3_70b_judge, gpt4o_judge
- muchomusic_test, Music Understanding, Metric: llama3_70b_judge, gpt4o_judge
- imda_part1_asr_test, Singlish ASR, Metric: wer
- imda_part2_asr_test, Singlish ASR, Metric: wer
- imda_part3_30s_asr_test, Singlish ASR, Metric: wer
- imda_part4_30s_asr_test, Singlish ASR, Metric: wer
- imda_part5_30s_asr_test, Singlish ASR, Metric: wer
- imda_part6_30s_asr_test, Singlish ASR, Metric: wer
- imda_part3_30s_sqa_human_test, Singlish Speech Question Answering, Metric: llama3_70b_judge, gpt4o_judge
- imda_part4_30s_sqa_human_test, Singlish Speech Question Answering, Metric: llama3_70b_judge, gpt4o_judge
- imda_part5_30s_sqa_human_test, Singlish Speech Question Answering, Metric: llama3_70b_judge, gpt4o_judge
- imda_part6_30s_sqa_human_test, Singlish Speech Question Answering, Metric: llama3_70b_judge, gpt4o_judge
- imda_part3_30s_ds_human_test, Singlish Speech Summarization, Metric: llama3_70b_judge, gpt4o_judge
- imda_part4_30s_ds_human_test, Singlish Speech Summarization, Metric: llama3_70b_judge, gpt4o_judge
- imda_part5_30s_ds_human_test, Singlish Speech Summarization, Metric: llama3_70b_judge, gpt4o_judge
- imda_part6_30s_ds_human_test, Singlish Speech Summarization, Metric: llama3_70b_judge, gpt4o_judge
- imda_ar_sentence, Singlish, Accent Recognition, Metric: llama3_70b_judge, gpt4o_judge
- imda_ar_dialogue, Singlish, Accent Recognition, Metric: llama3_70b_judge, gpt4o_judge
- imda_gr_sentence, Singlish, Gender Recognition, Metric: llama3_70b_judge, gpt4o_judge
- imda_gr_dialogue, Singlish, Gender Recognition, Metric: llama3_70b_judge, gpt4o_judge
- seame_dev_man, English-Chinese Code-Switching, Metric: wer
- seame_dev_sge, English-Chinese Code-Switching, Metric: wer
- mmau_mini, Audio Understanding and Reasoning, Multiple Choice Questions, Metric: llama3_70b_judge, string_match, gpt4o_judge
- gigaspeech2_thai, ASR for Thai, Metric: wer
- gigaspeech2_indo, ASR for Indonesian, Metric: wer
- gigaspeech2_viet, ASR for Vietnamese, Metric: wer
- ASCEND, English-Chinese Code-Switching, Metric: wer
- fleurs, Speech Translation
- AIR-Bench, AIR-Bench tasks
How to evaluate with the supported datasets? It is as simple as replacing the DATASET and METRIC names, for example:
DATASET=librispeech_test_clean
METRIC=wer
To evaluate on your own dataset, two simple steps:
- Make a copy of one of the customized dataset loaders (example: cn_college_listen_mcq_test) and adapt it to your own dataset.
- Add a new entry in dataset.py.
- Done! A minimal sketch is shown below.
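For illustration, here is a minimal sketch of what a custom dataset loader and its registration might look like. The names used (load_my_dataset, my_dataset_test, and the sample field keys) are hypothetical; mirror an existing loader such as cn_college_listen_mcq_test for the exact structure the evaluation harness expects.

# dataset_loader_my_dataset.py -- hypothetical example modeled on the existing loaders
from datasets import load_dataset  # Hugging Face datasets library

def load_my_dataset(number_of_samples: int = -1):
    """Load a custom test split and return samples in the format the evaluator expects."""
    data = load_dataset("my_org/my_dataset", split="test")  # hypothetical dataset ID
    if number_of_samples > 0:
        data = data.select(range(number_of_samples))

    samples = []
    for item in data:
        samples.append({
            "audio": item["audio"],            # decoded audio (array + sampling rate)
            "instruction": item["question"],   # text prompt given to the AudioLLM
            "answer": item["answer"],          # reference answer used by the metric
        })
    return samples

# In dataset.py, register the new name so eval.sh can resolve it
# (illustrative; follow how the existing entries are registered):
# "my_dataset_test": load_my_dataset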
- cascade_whisper_large_v3_llama_3_8b_instruct
- cascade_whisper_large_v2_gemma2_9b_cpt_sea_lionv3_instruct
- MERaLiON-AudioLLM-Whisper-SEA-LION
- Qwen-Audio-Chat
- Qwen2-Audio-7B-Instruct
- SALMONN_7B: requires an extra git clone.
- WavLLM_fairseq: no longer supported, as inference takes too much effort.
- whisper_large_v3
- whisper_large_v2
- gemini-1.5-flash: API key needed
- gemini-2-flash: API key needed
- gpt-4o-audio: API key needed
- phi_4_multimodal_instruct
- seallms_audio_7b
- ultravox: https://huggingface.co/fixie-ai/ultravox-v0_5-llama-3_1-8b / https://www.ultravox.ai/
- llama3_s
- audio-flamingo-2
- GLM4-Voice
- Mini-Omni
- SLAM-Omni
- llama3.1-typhoon2-audio-8b-instruct: https://huggingface.co/scb10x/llama3.1-typhoon2-audio-8b-instruct
- DiVA-llama-3-v0-8b: https://huggingface.co/WillHeld/DiVA-llama-3-v0-8b
As long as a model can run inference, you can load it and collect its responses for scoring. To evaluate new models, please refer to adding_new_model; a minimal sketch of the expected wrapper is shown below.
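As a rough illustration, a wrapper only needs to turn (audio, instruction) pairs into text responses. Everything below (class name, method signature, sample fields) is an assumption, so follow adding_new_model and an existing model file for the real interface.

# my_audio_llm.py -- hypothetical wrapper sketch, not the actual AudioBench interface
class MyAudioLLM:
    """Any model qualifies as long as it maps (audio, text instruction) -> text response."""

    def __init__(self, model_path: str):
        # Load your checkpoint or API client here (transformers, fairseq, REST, ...).
        self.model_path = model_path

    def generate(self, sample: dict) -> str:
        # Assumed sample fields: sample["audio"]["array"], sample["audio"]["sampling_rate"],
        # and sample["instruction"]. Run inference and return the plain-text response;
        # the harness then scores it with the chosen metric (wer, bleu, llama3_70b_judge, ...).
        raise NotImplementedError("Replace with your model's inference call.")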
Installation with pip:
pip install -r requirements.txt
For model-as-judge evaluation, we serve the judgement model as a service via vllm on port 5000.
The example below hosts a Llama-3-70B-Instruct judge and runs the cascade Whisper + Llama-3 model for inference.
# Step 1:
# Serve the judgement model using the vLLM framework (this example uses an int4 quantized version)
# This requires 1 x 80GB GPU
bash vllm_model_judge_llama_3_70b.sh
# Step 2:
# Perform model inference and obtain the evaluation results on the second GPU
GPU=2
BATCH_SIZE=1
OVERWRITE=True
NUMBER_OF_SAMPLES=-1 # -1 means evaluating on all test samples
MODEL_NAME=Qwen2-Audio-7B-Instruct
DATASET=cn_college_listen_mcq_test
METRICS=llama3_70b_judge
bash eval.sh $DATASET $MODEL_NAME $GPU $BATCH_SIZE $OVERWRITE $METRICS $NUMBER_OF_SAMPLES
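For reference, here is a minimal sketch of how a model-as-judge metric could query the locally hosted judge, assuming the vLLM server exposes its standard OpenAI-compatible endpoint on port 5000. The prompt wording, served model name, and scoring scale are illustrative, not the exact implementation used by AudioBench.

# judge_query_example.py -- illustrative only
from openai import OpenAI

# vLLM can serve an OpenAI-compatible API; here we assume it listens on localhost:5000.
client = OpenAI(base_url="http://localhost:5000/v1", api_key="EMPTY")

def judge_response(question: str, reference: str, prediction: str) -> str:
    """Ask the hosted Llama-3-70B judge to rate a model prediction against the reference."""
    prompt = (
        "You are grading an audio-LLM answer.\n"
        f"Question: {question}\nReference answer: {reference}\nModel answer: {prediction}\n"
        "Rate the model answer from 0 to 5 and briefly justify the score."
    )
    completion = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-70B-Instruct",  # must match the name the server registers
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return completion.choices[0].message.content

print(judge_response("What did the speaker order?", "A cup of coffee.", "He ordered coffee."))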
If you find our work useful, please consider citing our paper!
@article{wang2024audiobench,
title={AudioBench: A Universal Benchmark for Audio Large Language Models},
author={Wang, Bin and Zou, Xunlong and Lin, Geyu and Sun, Shuo and Liu, Zhuohan and Zhang, Wenyu and Liu, Zhengyuan and Aw, AiTi and Chen, Nancy F},
journal={NAACL},
year={2025}
}
Email: [email protected]
- Llama3-S: When Llama Learns to Listen
- lmms-eval: https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/docs/lmms-eval-0.3.md
- More to come...
- Features
- Evaluation with audio/speech generation
- Evaluation with multiround chatbot
- Support other models as judge and report the results
- Update AISHELL from WER to CER
- Bugs
- Threading for model-as-judge evaluation
- Post-processing script for IMDA Part 4, which contains code-switching in 4 languages
- Xue Cong Tey (MMAU-mini Dataset)