This repository contains the implementation of LLaVolta, an Efficient Large Language and Vision Assistant.
@inproceedings{chen2024efficient,
title={Efficient large multi-modal models via visual context compression},
author={Chen, Jieneng and Ye, Luoxin and He, Ju and Wang, Zhao-Yang and Khashabi, Daniel and Yuille, Alan},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024}
}



Note: the code was developed on Ubuntu 20.04/22.04 with CUDA 12.1. It is built on top of LLaVA, so installation closely follows the original LLaVA repository:
- Clone this repository and navigate to the LLaVolta folder
git clone https://github.com/Beckschen/LLaVolta
cd LLaVolta
- Install Package
conda create -n llavolta python=3.10 -y
conda activate llavolta
pip install --upgrade pip
pip install -e .
- Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation --no-cache-dir
cd llava/eval
tar xvf table.tar
cd ../..
- Download the training data for both pretraining and fine-tuning from the original LLaVA repository.
- Set the necessary path variables: ROOT_DATA, ROOT_WEIGHT, and ROOT_LOG (optional); see the sketch after the training command below.
- Begin training using the scripts. We provide four examples: 4stage, heavy_compression, light_compression, and reproduce.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/train-$NAME.sh
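Below is a minimal sketch of how the three path variables could be defined; the paths are placeholders, and depending on the script you may instead need to edit these variables directly at the top of each script.
# Placeholder paths -- adapt to your own setup
export ROOT_DATA=/path/to/data        # training data and eval assets
export ROOT_WEIGHT=/path/to/weights   # model checkpoints
export ROOT_LOG=/path/to/logs         # optional: training/eval logs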
Run the scripts under scripts/v1_5/eval/$NAME, where NAME is the name of the checkpoint. We provide four examples: 4stage, heavy_compression, light_compression, and reproduce.
For all provided scripts, first fill in the necessary path variables: ROOT_DATA, ROOT_WEIGHT, and ROOT_LOG (optional).
- VQAv2: Download test2015 and put it under $ROOT_DATA/eval/vqav2 (see the sketch below).
- Multi-GPU inference.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/vqav2.sh
- Submit the results to the evaluation server.
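A rough sketch of the download step, assuming the test2015 images are the standard COCO test2015 split; the URL is illustrative, so follow the official VQAv2 instructions if it changes.
mkdir -p $ROOT_DATA/eval/vqav2 && cd $ROOT_DATA/eval/vqav2
wget http://images.cocodataset.org/zips/test2015.zip   # COCO test2015 images used by VQAv2
unzip test2015.zip                                      # creates test2015/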
- GQA: Download the data and evaluation scripts following the official instructions and put them under $ROOT_DATA/eval/gqa/data (see the sketch below). You may need to modify eval.py because of assets missing from the GQA v1.2 release.
- Multi-GPU inference.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/gqa.sh
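A sketch of the assumed layout; the directory names inside data/ depend on the official GQA downloads, so adjust as needed.
mkdir -p $ROOT_DATA/eval/gqa/data && cd $ROOT_DATA/eval/gqa/data
# Unpack the official GQA questions, images, and evaluation script here, e.g.:
#   images/   questions/   eval.py
# eval.py may need small edits because some assets are missing from the v1.2 release.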
- VizWiz: Download test.json and extract test.zip to test. Put them under $ROOT_DATA/eval/vizwiz (see the sketch below).
- Single-GPU inference.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/vizwiz.sh
- Submit the results to the evaluation server: $ROOT_DATA/eval/vizwiz/answers_upload.
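A sketch of the placement step; the filenames follow the step above, and the download source is the official VizWiz site.
mkdir -p $ROOT_DATA/eval/vizwiz && cd $ROOT_DATA/eval/vizwiz
mv /path/to/test.json .
unzip /path/to/test.zip -d test   # images should end up under test/ (adjust if the archive already nests a test/ folder)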
- ScienceQA: Under $ROOT_DATA/eval/scienceqa, download images, pid_splits.json, and problems.json from the data/scienceqa folder of the ScienceQA repo (see the sketch below).
- Single-GPU inference and evaluate.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/sqa.sh
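A sketch of the placement step, assuming the assets are available from data/scienceqa in a local checkout of the ScienceQA repo; the checkout path is a placeholder.
mkdir -p $ROOT_DATA/eval/scienceqa
cp -r /path/to/ScienceQA/data/scienceqa/images        $ROOT_DATA/eval/scienceqa/
cp /path/to/ScienceQA/data/scienceqa/pid_splits.json  $ROOT_DATA/eval/scienceqa/
cp /path/to/ScienceQA/data/scienceqa/problems.json    $ROOT_DATA/eval/scienceqa/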
- TextVQA: Download TextVQA_0.5.1_val.json and the images, and extract them to $ROOT_DATA/eval/textvqa (see the sketch below).
- Single-GPU inference and evaluate.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/textvqa.sh
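A rough download sketch; the URLs are taken from the public TextVQA release and may change, so verify them against the official site.
mkdir -p $ROOT_DATA/eval/textvqa && cd $ROOT_DATA/eval/textvqa
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
unzip train_val_images.zip   # extracts the train/val images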
- POPE: Download coco from POPE and put it under $ROOT_DATA/eval/pope (see the sketch below).
- Single-GPU inference and evaluate.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/pope.sh
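A sketch of the placement step, assuming a local checkout of the POPE repo; the location of the coco annotation files inside that checkout is an assumption.
mkdir -p $ROOT_DATA/eval/pope
# Copy the coco split of POPE (the *_pope_*.json annotation files) into place:
cp -r /path/to/POPE/output/coco $ROOT_DATA/eval/pope/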
- MME: Download the data following the official instructions.
- Put the downloaded images into MME_Benchmark_release_version.
- Put the official eval_tool and MME_Benchmark_release_version under $ROOT_DATA/eval/MME (see the sketch below).
- Single-GPU inference and evaluate.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/mme.sh
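A sketch of the final layout; the folder names follow the steps above, and the source paths are placeholders.
mkdir -p $ROOT_DATA/eval/MME
mv /path/to/eval_tool                       $ROOT_DATA/eval/MME/
mv /path/to/MME_Benchmark_release_version   $ROOT_DATA/eval/MME/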
- MMBench: Download mmbench_dev_20230712.tsv and put it under $ROOT_DATA/eval/mmbench (see the sketch below).
- Single-GPU inference.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/mmbench.sh
- Submit the results to the evaluation server: $ROOT_DATA/eval/mmbench/answers_upload/mmbench_dev_20230712.
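A sketch of the placement step; obtain the TSV from the official MMBench release, and note that the Chinese dev split below follows the same pattern with its own TSV.
mkdir -p $ROOT_DATA/eval/mmbench
mv /path/to/mmbench_dev_20230712.tsv $ROOT_DATA/eval/mmbench/
# After inference, the files to upload are written under $ROOT_DATA/eval/mmbench/answers_upload/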
- MMBench-CN: Download mmbench_dev_cn_20231003.tsv and put it under $ROOT_DATA/eval/mmbench.
- Single-GPU inference.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/mmbench_cn.sh
- Submit the results to the evaluation server: $ROOT_DATA/eval/mmbench/answers_upload/mmbench_dev_cn_20231003.
- SEED-Bench: Follow the official instructions to download the images and the videos. Put the images under $ROOT_DATA/eval/seed_bench/SEED-Bench-image (see the sketch below). Note that we only use the image subset to evaluate LLaVolta.
- Multi-GPU inference and evaluate.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/seed.sh
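A sketch of the placement step, assuming the downloaded images have been merged into a single SEED-Bench-image folder as in the official SEED-Bench instructions.
mkdir -p $ROOT_DATA/eval/seed_bench
mv /path/to/SEED-Bench-image $ROOT_DATA/eval/seed_bench/   # image subset only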
- LLaVA-Bench-in-the-Wild: Extract the contents of llava-bench-in-the-wild to $ROOT_DATA/eval/llava-bench-in-the-wild (see the sketch below).
- Single-GPU inference and evaluate.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/llavabench.sh
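A sketch of the extraction step; the local source folder is a placeholder for however you obtained the benchmark release.
mkdir -p $ROOT_DATA/eval/llava-bench-in-the-wild
cp -r /path/to/llava-bench-in-the-wild/* $ROOT_DATA/eval/llava-bench-in-the-wild/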
- MM-Vet: Extract mm-vet.zip to $ROOT_DATA/eval/mmvet (see the sketch below).
- Single-GPU inference.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/mmvet.sh
- Evaluate the predictions in $ROOT_DATA/eval/mmvet/results using the official Jupyter notebook.
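A sketch of the extraction step; the location of mm-vet.zip is a placeholder.
mkdir -p $ROOT_DATA/eval/mmvet
unzip /path/to/mm-vet.zip -d $ROOT_DATA/eval/mmvet
# After running mmvet.sh, score the predictions in $ROOT_DATA/eval/mmvet/results with the official MM-Vet notebook.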
Luoxin Ye (@feiyu12138) is the primary contributor to the codebase. We have archived the project here to maintain a clean and organized code style.