
[NeurIPS 2024] Efficient Large Multi-modal Models via Visual Context Compression

This repository contains the implementation of LLaVolta, an Efficient Large Language and Vision Assistant.

(teaser figure)


Instantiation of LLaVolta schemes:

(figure)

Accelerate and Boost LLaVA:

(figure)

Accelerate and Boost VideoLLaVA:

(figure)

Install

Note: the code was developed on Ubuntu 20.04/22.04 with CUDA 12.1. Our code builds on LLaVA, so installation closely follows the original LLaVA repository:

  1. Clone this repository and navigate to the LLaVolta folder
git clone https://github.com/Beckschen/LLaVolta
cd LLaVolta
  2. Install the package
conda create -n llavolta python=3.10 -y
conda activate llavolta
pip install --upgrade pip 
pip install -e .
  3. Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation --no-cache-dir
cd llava/eval
tar xvf table.tar
cd ../..
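
Optionally, as a quick sanity check (a minimal sketch, not part of the original setup steps), you can confirm that the environment activates, that PyTorch sees the GPU, and that flash-attn imports cleanly:

conda activate llavolta
python -c "import torch, flash_attn; print('CUDA available:', torch.cuda.is_available())"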

Efficient Training

  1. Download the training data for both pretraining and fine-tuning from the original LLaVA repository.
  2. Set the necessary path variables: ROOT_DATA, ROOT_WEIGHT, and ROOT_LOG (optional); see the sketch after this list.
  3. Begin training using the scripts. We provide four examples: 4stage, heavy_compression, light_compression, and reproduce.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/train-$NAME.sh
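
For step 2, a minimal sketch of setting the path variables in the shell before launching training (the paths are placeholders; depending on the script, you may instead need to fill the variables in at the top of each script):

export ROOT_DATA=/path/to/data        # placeholder
export ROOT_WEIGHT=/path/to/weights   # placeholder
export ROOT_LOG=/path/to/logs         # optional, placeholder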

Evaluation

Run the scripts under scripts/v1_5/eval/$NAME, where NAME is the checkpoint name. We provide four examples: 4stage, heavy_compression, light_compression, and reproduce.

For all provided scripts, please first fill in the necessary path variables: ROOT_DATA, ROOT_WEIGHT, and ROOT_LOG (optional).

VQAv2

  1. Download test2015 and put it under $ROOT_DATA/eval/vqav2 (see the sketch below).
  2. Multi-GPU inference.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/vqav2.sh
  3. Submit the results to the evaluation server.
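
A minimal sketch of step 1, assuming the test2015 images were downloaded as test2015.zip (the archive name is an assumption):

mkdir -p $ROOT_DATA/eval/vqav2
unzip test2015.zip -d $ROOT_DATA/eval/vqav2   # should yield $ROOT_DATA/eval/vqav2/test2015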

GQA

  1. Download the data and evaluation scripts following the official instructions and put them under $ROOT_DATA/eval/gqa/data. You may need to modify eval.py due to missing assets in the GQA v1.2 release.
  2. Multi-GPU inference.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/gqa.sh

VizWiz

  1. Download test.json and extract test.zip to test. Put them under $ROOT_DATA/eval/vizwiz (see the sketch below).
  2. Single-GPU inference.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/vizwiz.sh
  3. Submit the results to the evaluation server: $ROOT_DATA/eval/vizwiz/answers_upload.
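
A minimal sketch of step 1, assuming test.json and test.zip were downloaded to the current directory:

mkdir -p $ROOT_DATA/eval/vizwiz
cp test.json $ROOT_DATA/eval/vizwiz/
unzip test.zip -d $ROOT_DATA/eval/vizwiz/test   # adjust if the archive already contains a top-level test/ folder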

ScienceQA

  1. Under $ROOT_DATA/eval/scienceqa, download images, pid_splits.json, problems.json from the data/scienceqa folder of the ScienceQA repo (see the sketch below).
  2. Single-GPU inference and evaluate.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/sqa.sh
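
A minimal sketch of step 1, with placeholder paths for the downloaded ScienceQA assets:

mkdir -p $ROOT_DATA/eval/scienceqa
cp -r /path/to/downloaded/images $ROOT_DATA/eval/scienceqa/                                       # placeholder path
cp /path/to/downloaded/pid_splits.json /path/to/downloaded/problems.json $ROOT_DATA/eval/scienceqa/   # placeholder paths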

TextVQA

  1. Download TextVQA_0.5.1_val.json and the images, and extract them to $ROOT_DATA/eval/textvqa (see the sketch below).
  2. Single-GPU inference and evaluate.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/textvqa.sh
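
A minimal sketch of step 1, assuming the downloaded files sit in the current directory (the image archive name is an assumption):

mkdir -p $ROOT_DATA/eval/textvqa
cp TextVQA_0.5.1_val.json $ROOT_DATA/eval/textvqa/
unzip train_val_images.zip -d $ROOT_DATA/eval/textvqa   # image archive name is an assumption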

POPE

  1. Download coco from POPE and put it under $ROOT_DATA/eval/pope (see the sketch below).
  2. Single-GPU inference and evaluate.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/pope.sh
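
A minimal sketch of step 1, assuming a local clone of the POPE repository (the output/coco location inside that repo is an assumption):

mkdir -p $ROOT_DATA/eval/pope
cp -r /path/to/POPE/output/coco $ROOT_DATA/eval/pope/coco   # placeholder path to the POPE clone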

MME

  1. Download the data following the official instructions.
  2. Download the images to MME_Benchmark_release_version.
  3. Put the official eval_tool and MME_Benchmark_release_version under $ROOT_DATA/eval/MME (see the sketch below).
  4. Single-GPU inference and evaluate.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/mme.sh
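
A minimal sketch of step 3, with placeholder paths for the downloaded assets:

mkdir -p $ROOT_DATA/eval/MME
cp -r /path/to/eval_tool /path/to/MME_Benchmark_release_version $ROOT_DATA/eval/MME/   # placeholder paths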

MMBench

  1. Download mmbench_dev_20230712.tsv and put it under $ROOT_DATA/eval/mmbench (see the sketch below).
  2. Single-GPU inference.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/mmbench.sh
  3. Submit the results to the evaluation server: $ROOT_DATA/eval/mmbench/answers_upload/mmbench_dev_20230712.
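
A minimal sketch of step 1, assuming the .tsv was downloaded to the current directory; after step 2, the file to submit appears under the answers_upload path from step 3:

mkdir -p $ROOT_DATA/eval/mmbench
cp mmbench_dev_20230712.tsv $ROOT_DATA/eval/mmbench/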

MMBench-CN

  1. Download mmbench_dev_cn_20231003.tsv and put under $ROOT_DATA/eval/mmbench.
  2. Single-GPU inference.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/mmbench_cn.sh
  3. Submit the results to the evaluation server: $ROOT_DATA/eval/mmbench/answers_upload/mmbench_dev_cn_20231003.

SEED-Bench

  1. Follow the official instructions to download the images and the videos. Put the images under $ROOT_DATA/eval/seed_bench/SEED-Bench-image (see the sketch below). Note that we only use the image subset to evaluate LLaVolta.
  2. Multi-GPU inference and evaluate.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/seed.sh
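
A minimal sketch of step 1 for the image subset, with a placeholder path for the downloaded images:

mkdir -p $ROOT_DATA/eval/seed_bench
cp -r /path/to/SEED-Bench-image $ROOT_DATA/eval/seed_bench/SEED-Bench-image   # placeholder path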

LLaVA-Bench-in-the-Wild

  1. Extract contents of llava-bench-in-the-wild to $ROOT_DATA/eval/llava-bench-in-the-wild.
  2. Single-GPU inference and evaluate.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/llavabench.sh

MM-Vet

  1. Extract mm-vet.zip to $ROOT_DATA/eval/mmvet (see the sketch below).
  2. Single-GPU inference.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/mmvet.sh
  3. Evaluate the predictions in $ROOT_DATA/eval/mmvet/results using the official jupyter notebook.
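
A minimal sketch of step 1, assuming mm-vet.zip was downloaded to the current directory:

mkdir -p $ROOT_DATA/eval/mmvet
unzip mm-vet.zip -d $ROOT_DATA/eval/mmvet   # adjust if the archive already contains a top-level mm-vet/ folder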

Citing LLaVolta

@inproceedings{chen2024efficient,
  title={Efficient large multi-modal models via visual context compression},
  author={Chen, Jieneng and Ye, Luoxin and He, Ju and Wang, Zhao-Yang and Khashabi, Daniel and Yuille, Alan},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024}
}

Acknowledgement

Luoxin Ye (@feiyu12138) is the primary contributor to the codebase. We have archived the project here to maintain a clean and organized code style.
