How do multi-modality LLMs perform on low-level computer vision?
Paper | Project Page | Github | Data (LLVisionQA) | Data (LLDescribe) | 质衡 (Chinese-Q-Bench)
The proposed Q-Bench includes three realms for low-level vision: perception (A1), description (A2), and assessment (A3).
- For perception (A1) and description (A2), we collect two benchmark datasets, LLVisionQA and LLDescribe.
- We are open to submission-based evaluation for these two tasks. The details for submission are as follows.
- For assessment (A3), as we use public datasets, we provide abstract evaluation code that anyone can use to test arbitrary MLLMs.
Our latest experiments suggest that GPT-4V reaches entry-level human performance on general low-level visual perception, marking a new era for low-level visual perception and understanding!
Here is the comparison of GPT-4V and non-expert humans on the test set of Task A1 (Perception).
Participant Name | yes-or-no | what | how | distortion | others | in-context distortion | in-context others | overall |
---|---|---|---|---|---|---|---|---|
GPT-4V | 0.7792 | 0.7918 | 0.6268 | 0.7058 | 0.7303 | 0.7466 | 0.7795 | 0.7336 (+0.1142 to best open-source) |
human-1 | 0.8248 | 0.7939 | 0.6029 | 0.7562 | 0.7208 | 0.7637 | 0.7300 | 0.7431 (+0.0095 to GPT-4V) |
human-2-senior | 0.8431 | 0.8894 | 0.7202 | 0.7965 | 0.7947 | 0.8390 | 0.8707 | 0.8174 (+0.0838 to GPT-4V) |
Human-1 is an untrained ordinary person, while human-2-senior is a trained (yet still non-expert) person. GPT-4V is observed to be on par with human-1, but there is still room to go before it surpasses human-2-senior.
We sincerely hope that one day open-source models can also reach that level (or even better), and we believe that day is coming soon. Try to challenge and beat it!
New on Oct. 15! For users with a poor connection to Hugging Face, we have also provided a GitHub-release version of all datasets. Please see our release as an alternative data source.
Important! We have released the datasets for these two tasks so that everyone can test on local machines and directly submit results. Please refer to the data release notes and example code to smoothly test on these data.
Please email [email protected] to submit your results in JSON format.
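The required schema is documented in the data release notes; purely as an illustration (the question IDs and field layout below are hypothetical assumptions, not the official format), a results file could be assembled like this:

```python
import json

# Hypothetical layout: one predicted choice per question identifier.
# Check the data release notes for the exact schema before submitting.
results = {
    "llvisionqa_q0001": "A",
    "llvisionqa_q0002": "C",
}

with open("my_submission.json", "w") as f:
    json.dump(results, f, indent=2, ensure_ascii=False)
```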
Alternatively, to submit your model to Q-Bench (A1/A2), you can prepare a Hugging Face/GitHub repo of your MLLM (with a README so that we can run it) that implements the following single ability:

- Generate text outputs based on multi-modality inputs (image + text).

Specifically, it should have two important methods: `embed_image_and_text` (to allow multi-modality inputs) and `generate` (for dialog).
We recommend wrapping the call to your MLLM in the following format:
```python
from PIL import Image
from my_mllm_model import Model, Tokenizer, embed_image_and_text  # [REPLACE with YOUR MLLM here]

model, tokenizer = Model(), Tokenizer()

prompt = '[ANY_PROMPT]'
image = Image.open("image_for_query.jpg")
input_embeds = embed_image_and_text(image, prompt)  # joint embedding of the image and the text prompt
generated_texts = tokenizer.batch_decode(model.generate(input_embeds=input_embeds))[0]
```
Optional: If you would like to test with close-set inference (PPL-based) for the perception task (A1), you will also need to implement the generative loss for your model, as follows:

```python
loss = model(input_embeds=input_embeds, labels=input_ids).loss.item()
```
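For reference, here is a minimal sketch of how such a loss can drive close-set (PPL-based) inference, assuming the same placeholder `Model`/`Tokenizer` interface as above (the question, the candidate answers, and the `.input_ids` access are illustrative assumptions): for each candidate choice, append it to the question, compute the generative loss, and pick the lowest-loss (i.e., most likely) choice.

```python
import torch
from PIL import Image
from my_mllm_model import Model, Tokenizer, embed_image_and_text  # placeholder interface, as above

model, tokenizer = Model(), Tokenizer()
image = Image.open("image_for_query.jpg")

question = "How is the clarity of this image?"
choices = ["High", "Medium", "Low"]  # illustrative candidate answers

losses = []
for choice in choices:
    # Append one candidate answer to the question and score it with the generative loss.
    prompt = f"{question}\nAnswer: {choice}"
    input_embeds = embed_image_and_text(image, prompt)
    input_ids = tokenizer(prompt).input_ids  # assumption: the tokenizer exposes ids this way
    with torch.no_grad():
        # Note: aligning labels with image tokens (e.g. masking them with -100) is model-dependent.
        losses.append(model(input_embeds=input_embeds, labels=input_ids).loss.item())

# Lower loss = lower perplexity = more likely answer under the model.
predicted = choices[losses.index(min(losses))]
```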
We further provide a demo implementation of IDEFICS, Hugging Face's open-source MLLM, for simple question answering (A1) and description (A2). See the example on how to run the demo, and provide a similar one for submission-based evaluation.
Please email [email protected] to submit your model if you are outside mainland China. Please email [email protected] to submit your model if you are inside mainland China.
A snapshot of the LLVisionQA benchmark dataset for MLLM low-level perception ability is as follows. See the leaderboard here.
We measure the answer accuracy of MLLMs (provided with the question and all choices) as the metric here.
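For reference, a minimal sketch of how this accuracy can be computed once per-question predictions are collected (the file names and the `{question_id: choice}` layout are illustrative assumptions, not the dataset's exact schema):

```python
import json

# Hypothetical files: model predictions and ground-truth answers, both keyed by question id.
with open("predictions.json") as f:
    preds = json.load(f)   # e.g. {"q0001": "A", ...}
with open("ground_truth.json") as f:
    gts = json.load(f)     # e.g. {"q0001": "A", ...}

correct = sum(preds.get(qid, "").strip().lower() == ans.strip().lower() for qid, ans in gts.items())
print(f"Overall accuracy: {correct / len(gts):.4f}")
```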
A snapshot of the LLDescribe benchmark dataset for MLLM low-level description ability is as follows. See the leaderboard here.
We measure the completeness, precision, and relevance of MLLM descriptions as the metric here.
An exciting ability of MLLMs: they are able to predict quantitative scores for IQA!
Similar to the above, as long as a model (based on a causal language model) has the two methods `embed_image_and_text` (to allow multi-modality inputs) and `forward` (for computing logits), Image Quality Assessment (IQA) with the model can be achieved as follows:
```python
from PIL import Image
from my_mllm_model import Model, Tokenizer, embed_image_and_text

model, tokenizer = Model(), Tokenizer()

prompt = "##User: Rate the quality of the image.\n" \
         "##Assistant: The quality of the image is"  # This line can be modified based on the MLLM's default behaviour.

good_idx, poor_idx = tokenizer(["good", "poor"]).tolist()

image = Image.open("image_for_iqa.jpg")
input_embeds = embed_image_and_text(image, prompt)
output_logits = model(input_embeds=input_embeds).logits[0, -1]
q_pred = (output_logits[[good_idx, poor_idx]] / 100).softmax(0)[0]  # softmax over the "good"/"poor" logits
```
*Note that you can modify the second line of the prompt based on your model's default format, e.g. for Shikra, "##Assistant: The quality of the image is" is modified to "##Assistant: The answer is". It is also okay if your MLLM first answers something like "Ok, I would like to help! The image quality is"; just use that as the second line of the prompt.
We further provide a full implementation of IDEFICS on IQA. See example on how to run IQA with this MLLM. Other MLLMs can also be modified in the same way for use in IQA.
We have prepared JSON format human opinion scores (MOS) for the seven IQA databases as evaluated in our benchmark.
Please see IQA_databases for details.
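Once per-image predictions (e.g. `q_pred` from the snippet above) are collected, a standard way to compare them against the released human opinion scores is via SRCC/PLCC. Below is a minimal sketch that assumes the MOS JSON maps image filenames to scores; please verify the actual structure against the released files in IQA_databases.

```python
import json
from scipy.stats import spearmanr, pearsonr

def evaluate_iqa(predicted_scores, mos_path):
    """Compare predicted scores against human opinion scores (MOS).

    predicted_scores: dict mapping image filename -> model score (e.g. q_pred above).
    mos_path: path to a MOS JSON file, assumed here to map image filename -> MOS value
              (check the released files in IQA_databases for the exact structure).
    """
    with open(mos_path) as f:
        mos = json.load(f)
    names = sorted(set(mos) & set(predicted_scores))
    gt = [mos[n] for n in names]
    pred = [predicted_scores[n] for n in names]
    srcc = spearmanr(gt, pred).correlation
    plcc = pearsonr(gt, pred)[0]
    return srcc, plcc
```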
Moved to leaderboards. Please click to see details.
Please contact any of the first authors of this paper for queries.
- Haoning Wu, [email protected], @teowu
- Zicheng Zhang, [email protected], @zzc-1998
- Erli Zhang, [email protected], @ZhangErliCarl
If you find our work interesting, please feel free to cite our paper:
```bibtex
@article{wu2023qbench,
    title={Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision},
    author={Wu, Haoning and Zhang, Zicheng and Zhang, Erli and Chen, Chaofeng and Liao, Liang and Wang, Annan and Li, Chunyi and Sun, Wenxiu and Yan, Qiong and Zhai, Guangtao and Lin, Weisi},
    year={2023},
}
```