
[TMLR] SOLO: A Single Transformer for Scalable
Vision-Language Modeling

πŸ“ƒ Paper β€’ πŸ€— Model (SOLO-7B)

We present SOLO, a single Transformer architecture for unified vision-language modeling. SOLO accepts both raw image patches (in pixels) and text as inputs, without using a separate pre-trained vision encoder.
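The core idea of consuming raw pixels can be sketched as a simple patchify step: the image is split into fixed-size patches whose flattened pixel values become input tokens, with no pre-trained vision encoder in between. The function below is an illustrative sketch, not the repository's code; SOLO's actual patch size and embedding details are defined in the repo.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened raw-pixel patches.

    Illustrative only -- the real patch size and the linear embedding
    applied afterwards live in the SOLO codebase, not here.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    # (H, W, C) -> (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (N, p*p*C)
    return (
        image.reshape(h // patch, patch, w // patch, patch, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(-1, patch * patch * c)
    )

# Example: a 224x224 RGB image with 32x32 patches -> 49 patch tokens,
# each a 32*32*3 = 3072-dimensional raw-pixel vector.
tokens = patchify(np.zeros((224, 224, 3), dtype=np.uint8), 32)
print(tokens.shape)  # (49, 3072)
```

Each row can then be projected by a learned linear layer and interleaved with text tokens in the single Transformer.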

TODO Roadmap

  βœ… Release the instruction tuning data mixture

  βœ… Release the code for instruction tuning

  βœ… Release the pre-training code

  βœ… Release the SOLO model πŸ€— Model (SOLO-7B)

  βœ… Paper on arxiv πŸ“ƒ Paper

Setup

Clone Repo

git clone https://github.com/Yangyi-Chen/SOLO
git submodule update --init --recursive

Setup Environment for Data Processing

conda env create -f environment.yml
conda activate solo

OR simply

pip install -r requirements.txt

SOLO Inference with Huggingface

Check scripts/notebook/demo.ipynb for an example of performing inference on the model.
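If you prefer a plain script over the notebook, loading the model with `transformers` might look like the sketch below. The hub id `YangyiYY/SOLO-7B` is an assumption taken from the 🤗 model link above and may differ; treat `scripts/notebook/demo.ipynb` as the authoritative inference recipe.

```python
# Sketch: loading SOLO with Hugging Face transformers in a plain script.
# ASSUMPTION: the hub id "YangyiYY/SOLO-7B" is inferred from the model
# link above; see scripts/notebook/demo.ipynb for the authoritative usage.

MODEL_ID = "YangyiYY/SOLO-7B"  # assumed Hugging Face hub id

def load_solo(model_id: str = MODEL_ID):
    # Imported lazily so the sketch can be read without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, trust_remote_code=True, device_map="auto"
    )
    return tokenizer, model

if __name__ == "__main__":
    tokenizer, model = load_solo()
    prompt = "Describe the image:"  # image-token handling follows the demo notebook
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```

How image patches are packed into the prompt is model-specific, so follow the notebook for the exact preprocessing.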

Pre-Training

Please refer to PRETRAIN_GUIDE.md for details on how to perform pre-training. The following table documents the pre-training data statistics:

Instruction Fine-Tuning

Please refer to SFT_GUIDE.md for details on how to perform instruction fine-tuning. The following table documents the instruction fine-tuning data statistics:

Citation

If you use or extend our work, please consider citing our paper.

@article{chen2024single,
  title={A Single Transformer for Scalable Vision-Language Modeling},
  author={Chen, Yangyi and Wang, Xingyao and Peng, Hao and Ji, Heng},
  journal={arXiv preprint arXiv:2407.06438},
  year={2024}
}
