
[NeurIPS 2024] Efficient Large Multi-modal Models via Visual Context Compression

This repository contains the implementation of LLaVolta, an Efficient Large Language and Vision Assistant.

(teaser figure)


Instantiation of LLaVolta schemes:

(figure)

Accelerate and Boost LLaVA:

(figure)

Accelerate and Boost VideoLLaVA:

(figure)

Install

Note: the code was developed on Ubuntu 20.04/22.04 with CUDA 12.1. Our code builds on LLaVA, so installation closely follows the original LLaVA repository:

  1. Clone this repository and navigate to the LLaVolta folder
git clone https://github.com/Beckschen/LLaVolta
cd LLaVolta
  2. Install the package
conda create -n llavolta python=3.10 -y
conda activate llavolta
pip install --upgrade pip 
pip install -e .
  3. Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation --no-cache-dir
cd llava/eval
tar xvf table.tar
cd ../..
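
Optionally, as a quick sanity check (a minimal sketch, not part of the original setup steps), you can confirm that the environment activates, that PyTorch sees the GPU, and that flash-attn imports cleanly:

conda activate llavolta
python -c "import torch, flash_attn; print('CUDA available:', torch.cuda.is_available())"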

Efficient Training

  1. Download the training data for both pretraining and fine-tuning from the original LLaVA repository.
  2. Set the necessary path variables: ROOT_DATA, ROOT_WEIGHT, and ROOT_LOG (optional); see the sketch after this list.
  3. Begin training using the scripts. We provide four examples: 4stage, heavy_compression, light_compression, and reproduce.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/train-$NAME.sh
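
For step 2, a minimal sketch of setting the path variables in the shell before launching training (the paths are placeholders; depending on the script, you may instead need to fill the variables in at the top of each script):

export ROOT_DATA=/path/to/data        # placeholder
export ROOT_WEIGHT=/path/to/weights   # placeholder
export ROOT_LOG=/path/to/logs         # optional, placeholder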

Evaluation

Run the scripts under scripts/v1_5/eval/$NAME, where NAME is the checkpoint name. We provide four examples: 4stage, heavy_compression, light_compression, and reproduce.

For all provided scripts, please first fill in the necessary path variables: ROOT_DATA, ROOT_WEIGHT, and ROOT_LOG (optional).

VQAv2

  1. Download test2015 and put it under $ROOT_DATA/eval/vqav2 (see the sketch below).
  2. Multi-GPU inference.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/vqav2.sh
  3. Submit the results to the evaluation server.
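
A minimal sketch of step 1, assuming the test2015 images were downloaded as test2015.zip (the archive name is an assumption):

mkdir -p $ROOT_DATA/eval/vqav2
unzip test2015.zip -d $ROOT_DATA/eval/vqav2   # should yield $ROOT_DATA/eval/vqav2/test2015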

GQA

  1. Download the data and evaluation scripts following the official instructions and put them under $ROOT_DATA/eval/gqa/data. You may need to modify eval.py due to missing assets in the GQA v1.2 release.
  2. Multi-GPU inference.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/gqa.sh

VizWiz

  1. Download test.json and extract test.zip to test. Put them under $ROOT_DATA/eval/vizwiz (see the sketch below).
  2. Single-GPU inference.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/vizwiz.sh
  3. Submit the results to the evaluation server: $ROOT_DATA/eval/vizwiz/answers_upload.
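
A minimal sketch of step 1, assuming test.json and test.zip were downloaded to the current directory:

mkdir -p $ROOT_DATA/eval/vizwiz
cp test.json $ROOT_DATA/eval/vizwiz/
unzip test.zip -d $ROOT_DATA/eval/vizwiz/test   # adjust if the archive already contains a top-level test/ folder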

ScienceQA

  1. Under $ROOT_DATA/eval/scienceqa, download images, pid_splits.json, problems.json from the data/scienceqa folder of the ScienceQA repo (see the sketch below).
  2. Single-GPU inference and evaluate.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/sqa.sh
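
A minimal sketch of step 1, with placeholder paths for the downloaded ScienceQA assets:

mkdir -p $ROOT_DATA/eval/scienceqa
cp -r /path/to/downloaded/images $ROOT_DATA/eval/scienceqa/                                       # placeholder path
cp /path/to/downloaded/pid_splits.json /path/to/downloaded/problems.json $ROOT_DATA/eval/scienceqa/   # placeholder paths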

TextVQA

  1. Download TextVQA_0.5.1_val.json and the images, and extract them to $ROOT_DATA/eval/textvqa (see the sketch below).
  2. Single-GPU inference and evaluate.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/textvqa.sh
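
A minimal sketch of step 1, assuming the downloaded files sit in the current directory (the image archive name is an assumption):

mkdir -p $ROOT_DATA/eval/textvqa
cp TextVQA_0.5.1_val.json $ROOT_DATA/eval/textvqa/
unzip train_val_images.zip -d $ROOT_DATA/eval/textvqa   # image archive name is an assumption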

POPE

  1. Download coco from POPE and put it under $ROOT_DATA/eval/pope (see the sketch below).
  2. Single-GPU inference and evaluate.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/pope.sh
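
A minimal sketch of step 1, assuming a local clone of the POPE repository (the output/coco location inside that repo is an assumption):

mkdir -p $ROOT_DATA/eval/pope
cp -r /path/to/POPE/output/coco $ROOT_DATA/eval/pope/coco   # placeholder path to the POPE clone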

MME

  1. Download the data following the official instructions.
  2. Download the images to MME_Benchmark_release_version.
  3. Put the official eval_tool and MME_Benchmark_release_version under $ROOT_DATA/eval/MME (see the sketch below).
  4. Single-GPU inference and evaluate.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/mme.sh
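
A minimal sketch of step 3, with placeholder paths for the downloaded assets:

mkdir -p $ROOT_DATA/eval/MME
cp -r /path/to/eval_tool /path/to/MME_Benchmark_release_version $ROOT_DATA/eval/MME/   # placeholder paths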

MMBench

  1. Download mmbench_dev_20230712.tsv and put it under $ROOT_DATA/eval/mmbench (see the sketch below).
  2. Single-GPU inference.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/mmbench.sh
  3. Submit the results to the evaluation server: $ROOT_DATA/eval/mmbench/answers_upload/mmbench_dev_20230712.
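
A minimal sketch of step 1, assuming the .tsv was downloaded to the current directory; after step 2, the file to submit appears under the answers_upload path from step 3:

mkdir -p $ROOT_DATA/eval/mmbench
cp mmbench_dev_20230712.tsv $ROOT_DATA/eval/mmbench/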

MMBench-CN

  1. Download mmbench_dev_cn_20231003.tsv and put under $ROOT_DATA/eval/mmbench.
  2. Single-GPU inference.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/mmbench_cn.sh
  3. Submit the results to the evaluation server: $ROOT_DATA/eval/mmbench/answers_upload/mmbench_dev_cn_20231003.

SEED-Bench

  1. Follow the official instructions to download the images and the videos. Put the images under $ROOT_DATA/eval/seed_bench/SEED-Bench-image (see the sketch below). Note that we only use the image subset to evaluate LLaVolta.
  2. Multi-GPU inference and evaluate.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/seed.sh
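
A minimal sketch of step 1 for the image subset, with a placeholder path for the downloaded images:

mkdir -p $ROOT_DATA/eval/seed_bench
cp -r /path/to/SEED-Bench-image $ROOT_DATA/eval/seed_bench/SEED-Bench-image   # placeholder path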

LLaVA-Bench-in-the-Wild

  1. Extract contents of llava-bench-in-the-wild to $ROOT_DATA/eval/llava-bench-in-the-wild.
  2. Single-GPU inference and evaluate.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/llavabench.sh

MM-Vet

  1. Extract mm-vet.zip to $ROOT_DATA/eval/mmvet (see the sketch below).
  2. Single-GPU inference.
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/mmvet.sh
  3. Evaluate the predictions in $ROOT_DATA/eval/mmvet/results using the official jupyter notebook.
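
A minimal sketch of step 1, assuming mm-vet.zip was downloaded to the current directory:

mkdir -p $ROOT_DATA/eval/mmvet
unzip mm-vet.zip -d $ROOT_DATA/eval/mmvet   # adjust if the archive already contains a top-level mm-vet/ folder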

Citing LLaVolta

@inproceedings{chen2024efficient,
  title={Efficient large multi-modal models via visual context compression},
  author={Chen, Jieneng and Ye, Luoxin and He, Ju and Wang, Zhao-Yang and Khashabi, Daniel and Yuille, Alan},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024}
}

Acknowledgement

Luoxin Ye (@feiyu12138) is the primary contributor to the codebase. We have archived the project here to maintain a clean and organized code style.
