Siyu Jiao1, Gengwei Zhang2, Yinlong Qian3, Jiancheng Huang3, Yao Zhao1,
Humphrey Shi4, Lin Ma3, Yunchao Wei1, Zequn Jie3
1 BJTU, 2 UTS, 3 Meituan, 4 Georgia Tech
This work challenges the residual prediction paradigm in visual autoregressive modeling and presents FlexVAR, a new Flexible Visual AutoRegressive image generation paradigm. FlexVAR facilitates autoregressive learning with ground-truth prediction, enabling each step to independently produce plausible images. This simple, intuitive approach swiftly learns visual distributions and makes the generation process more flexible and adaptable. Trained solely on low-resolution images (
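To make the contrast with residual-prediction VAR concrete, below is a minimal, hypothetical PyTorch-style sketch of how the per-scale training target differs: standard VAR supervises each scale with the residual left over after accumulating earlier scales, whereas FlexVAR supervises each scale directly with the ground truth resized to that scale, so every step can be decoded into a plausible image on its own. The helper names, latent shapes, and interpolation modes are illustrative assumptions, not the repository's actual implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch (not the repo's API): per-step training targets.
# `gt_latent` stands for a ground-truth latent map of an image, shape (B, C, 16, 16).
def build_targets(gt_latent, patch_nums=(1, 2, 3, 4, 5, 7, 10, 13, 16)):
    resize = lambda x, pn: F.interpolate(x, size=(pn, pn), mode="area")
    recon = torch.zeros_like(gt_latent)           # running reconstruction at full scale
    residual_targets, gt_targets = [], []         # VAR-style vs. FlexVAR-style targets
    for pn in patch_nums:
        gt_k = resize(gt_latent, pn)              # ground truth at this scale
        res_k = gt_k - resize(recon, pn)          # VAR: predict only what is still missing
        residual_targets.append(res_k)
        gt_targets.append(gt_k)                   # FlexVAR: predict the ground truth directly
        recon = recon + F.interpolate(res_k, size=gt_latent.shape[-2:], mode="bicubic")
    return residual_targets, gt_targets
```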
- Install `torch>=2.0.0`.
- Install other pip packages via `pip3 install -r requirements.txt`.
- Prepare the ImageNet dataset: assume ImageNet is at `/path/to/imagenet`. It should look like this (a sanity-check snippet follows this list):

  ```
  /path/to/imagenet/
    train/
      n01440764/
        many_images.JPEG
        ...
      n01443537/
        many_images.JPEG
        ...
    val/
      n01440764/
        ILSVRC2012_val_00000293.JPEG
        ...
      n01443537/
        ILSVRC2012_val_00000236.JPEG
        ...
  ```

  NOTE: The arg `--data_path=/path/to/imagenet` should be passed to the training script.
- (Optional) Install and compile `flash-attn` and `xformers` for faster attention computation.
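If you want to double-check the dataset layout before launching training, the optional snippet below (a convenience, not part of the repository) loads both splits with `torchvision.datasets.ImageFolder`, which expects exactly this class-folder structure:

```python
# Optional sanity check for the ImageNet layout (assumes torchvision is installed).
from torchvision.datasets import ImageFolder

root = "/path/to/imagenet"  # same value you will pass via --data_path
for split in ("train", "val"):
    ds = ImageFolder(f"{root}/{split}")
    print(f"{split}: {len(ds)} images, {len(ds.classes)} classes")
# Expect 1000 classes per split; train has ~1.28M images, val has 50,000.
```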
- You need to download `FlexVAE.pth` first.
| Model | FID  | IS    | Steps | Weights                |
|:-----:|:----:|:-----:|:-----:|:-----------------------|
| d16   | 3.05 | 291.3 | 10    | FlexVARd16-epo179.pth  |
| d20   | 2.41 | 299.3 | 10    | FlexVARd20-epo249.pth  |
| d24   | 2.21 | 299.1 | 10    | FlexVARd24-epo349.pth  |
- 256x256 (default)

  For FID evaluation, use `var.autoregressive_infer_cfg` to sample 50,000 images (50 per class) and save them as PNG (not JPEG) files in a folder. Pack them into a `.npz` file (see the packing sketch after this example). Then use OpenAI's FID evaluation toolkit and the 256x256 reference ground-truth `.npz` file to evaluate FID, IS, precision, and recall. See Evaluation for details.

  For example, to evaluate our pre-trained `FlexVARd24-epo349.pth` model:

  ```shell
  # 1. Download FlexVARd24-epo349.pth.
  # 2. Put it at `pretrained/FlexVARd24-epo349.pth`.
  # 3. Evaluation
  args_infer_patch_nums="1_2_3_4_5_7_10_13_16"
  torchrun --nnodes=1 --nproc_per_node=2 --node_rank=0 eval_c2i.py --batch_size 16 --cfg 2.5 --top_k 900 \
    --maxpn 16 --infer_patch_nums $args_infer_patch_nums \
    --depth 24
  ```
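  A minimal sketch of the packing step, assuming your 50,000 sampled PNGs sit in a single folder. OpenAI's evaluator reads the image array stored in the `.npz`; the folder name and the `arr_0` key below are assumptions to verify against the toolkit you use:

  ```python
  # Hypothetical helper: pack a folder of 256x256 PNG samples into sample.npz for FID evaluation.
  import os
  import numpy as np
  from PIL import Image

  sample_dir = "samples_256"  # folder containing the 50,000 PNGs (assumed name)
  files = sorted(f for f in os.listdir(sample_dir) if f.endswith(".png"))
  imgs = np.stack([np.array(Image.open(os.path.join(sample_dir, f)).convert("RGB"),
                            dtype=np.uint8) for f in files])  # (N, 256, 256, 3), uint8
  np.savez("sample.npz", arr_0=imgs)  # evaluator toolkits typically read key "arr_0"
  print(imgs.shape)
  ```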
- Zero-shot transfer with 13 steps

  Modify `args_infer_patch_nums` to change the number of inference steps (step counts between 8 and 14 work zero-shot). For example, with 13 steps:

  ```shell
  args_infer_patch_nums="1_2_3_4_5_6_7_8_9_10_12_14_16"
  torchrun --nnodes=1 --nproc_per_node=2 --node_rank=0 eval_c2i.py --batch_size 16 --cfg 2.5 --top_k 900 \
    --maxpn 16 --infer_patch_nums $args_infer_patch_nums \
    --depth 20
  ```
- Zero-shot transfer to 512x512

  Use the 512x512 reference ground-truth `.npz` file for evaluation.

  ```shell
  args_infer_patch_nums="1_2_3_4_5_6_7_8_9_10_12_14_16_23_32"
  torchrun --nnodes=1 --nproc_per_node=2 --node_rank=0 eval_c2i.py --batch_size 16 --cfg 3.0 --top_k 900 \
    --maxpn 16 --infer_patch_nums $args_infer_patch_nums \
    --depth 20
  ```
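  The target resolution follows from the last entry of `args_infer_patch_nums`: in the examples above, a final grid of 16 yields 256x256 images and a final grid of 32 yields 512x512, i.e. each token covers a 16x16 pixel patch. A quick sketch of this bookkeeping (the 16-pixel stride is inferred from these examples, not read from the code):

  ```python
  # Map an infer_patch_nums schedule to the output resolution, assuming a 16-pixel token stride
  # (consistent with 16 -> 256x256 and 32 -> 512x512 in the commands above).
  def output_resolution(args_infer_patch_nums: str, token_stride: int = 16) -> int:
      patch_nums = [int(p) for p in args_infer_patch_nums.split("_")]
      return patch_nums[-1] * token_stride

  print(output_resolution("1_2_3_4_5_7_10_13_16"))                 # 256
  print(output_resolution("1_2_3_4_5_6_7_8_9_10_12_14_16_23_32"))  # 512
  ```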