📃 Paper • 🤗 Model (SOLO-7B)
We present SOLO, a single Transformer architecture for unified vision-language modeling.
SOLO accepts both raw image patches (in pixels) and text as inputs, without using a separate pre-trained vision encoder.
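For intuition, below is a minimal, self-contained sketch of the single-Transformer idea: images are split into fixed-size pixel patches, linearly projected into the same embedding space as text tokens, and the concatenated sequence is processed by one shared Transformer. All module names, shapes, and hyperparameters are illustrative assumptions, not the actual SOLO implementation; see the code in this repo for the real architecture.

```python
# Illustrative sketch only: shapes, names, and the patchify/projection scheme
# are assumptions, not the actual SOLO code. It shows one Transformer
# consuming raw image patches alongside text tokens, with no vision encoder.
import torch
import torch.nn as nn

class UnifiedVLM(nn.Module):
    def __init__(self, vocab_size=32000, d_model=256, patch_size=32, n_layers=2):
        super().__init__()
        self.patch_size = patch_size
        # Linear projection of flattened RGB patches -- no separate vision encoder.
        self.patch_proj = nn.Linear(3 * patch_size * patch_size, d_model)
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def patchify(self, images):
        # images: (B, 3, H, W) -> (B, num_patches, 3 * P * P)
        B, C, H, W = images.shape
        P = self.patch_size
        patches = images.unfold(2, P, P).unfold(3, P, P)         # (B, C, H/P, W/P, P, P)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)
        return patches

    def forward(self, images, text_ids):
        img_tokens = self.patch_proj(self.patchify(images))      # (B, N_img, D)
        txt_tokens = self.tok_emb(text_ids)                      # (B, N_txt, D)
        seq = torch.cat([img_tokens, txt_tokens], dim=1)         # one shared sequence
        return self.lm_head(self.blocks(seq))

# Toy usage: a 224x224 image becomes 49 patch tokens, followed by 16 text tokens.
model = UnifiedVLM()
logits = model(torch.rand(1, 3, 224, 224), torch.randint(0, 32000, (1, 16)))
print(logits.shape)  # torch.Size([1, 65, 32000])
```

The sketch uses a bidirectional encoder stack purely for brevity, whereas SOLO is a causal language model; the point is that a single set of Transformer weights processes both modalities in one sequence.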
✅ Release the instruction tuning data mixture
✅ Release the code for instruction tuning
✅ Release the pre-training code
✅ Release the SOLO model: 🤗 Model (SOLO-7B)
✅ Paper on arXiv: 📃 Paper
```bash
git clone https://github.com/Yangyi-Chen/SOLO
git submodule update --init --recursive

conda env create -f environment.yml
conda activate solo
```

Or simply:

```bash
pip install -r requirements.txt
```
Check `scripts/notebook/demo.ipynb` for an example of performing inference with the model.
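If you prefer a quick script over the notebook, the snippet below is a hypothetical sketch: the Hugging Face repo ID and the `trust_remote_code` loading path are placeholders/assumptions, and image preprocessing (patchification into raw pixels) is omitted; follow `scripts/notebook/demo.ipynb` for the actual calls.

```python
# Hypothetical sketch, not the notebook's exact code: the repo ID and the
# trust_remote_code loading path are assumptions -- see
# scripts/notebook/demo.ipynb for the actual preprocessing and generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "YangyiYY/SOLO-7B"  # placeholder; use the checkpoint linked above

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Text-only prompt as a smoke test; image inputs are prepared as raw pixel
# patches by the utilities demonstrated in the demo notebook.
inputs = tokenizer("Describe what a vision-language model does.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```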
Please refer to PRETRAIN_GUIDE.md for more details about how to perform pre-training. The following table documents the data statistics in pre-training:
Please refer to SFT_GUIDE.md for more details about how to perform instruction fine-tuning. The following table documents the data statistics in instruction fine-tuning:
If you use or extend our work, please consider citing our paper.
```bibtex
@article{chen2024single,
  title={A Single Transformer for Scalable Vision-Language Modeling},
  author={Chen, Yangyi and Wang, Xingyao and Peng, Hao and Ji, Heng},
  journal={arXiv preprint arXiv:2407.06438},
  year={2024}
}
```