
SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Models

A spatial-enhanced vision-language-action model trained on 1.1 million real-robot episodes. πŸ€— Purely HuggingFace-based, with concise code and efficient performance.

[πŸ“„Paper] [πŸ”₯Project Page] [πŸ“– Document] [πŸš€ Quick Start] [βœ… Performance] [πŸ€— FAQs]

[πŸ”₯Pre-train] [πŸš€ Fine-tune] [πŸŽ„Custom Dataset]


News πŸš€πŸš€πŸš€

  • 2025/01/29: We release SpatialVLA 1.0. SpatialVLA achieves state-of-the-art performance across a diverse range of evaluations and shows significantly faster inference with fewer tokens per action.
  • 2025/02/06: We release the SimplerEnv evaluation code for SpatialVLA. Please refer to DelinQu/SimplerEnv-OpenVLA, and make sure transformers >= 4.47.0.

Documents

πŸš€ Quick Start

SpatialVLA relies solely on HuggingFace Transformers πŸ€—, making deployment extremely easy. If your environment supports transformers >= 4.47.0, you can directly use the following code to load the model and run inference (about 8.5 GB of GPU memory required).

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Load the processor and model from the Hugging Face Hub.
model_name_or_path = "IPEC-COMMUNITY/spatialvla-4b-224-pt"
processor = AutoProcessor.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name_or_path, trust_remote_code=True, torch_dtype=torch.bfloat16).eval().cuda()

# Build the multimodal prompt and predict an action with the model.
image = Image.open("example.png").convert("RGB")
prompt = "What action should the robot take to pick the cup?"
inputs = processor(images=[image], text=prompt, return_tensors="pt")
generation_outputs = model.predict_action(inputs)

# Decode and un-normalize the action using the statistics of the target dataset.
actions = processor.decode_actions(generation_outputs, unnorm_key="bridge_orig/1.0.0")
print(actions)
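
As a usage example, the two calls above slot naturally into a simple closed-loop control sketch. The camera and robot helpers below (get_camera_frame, apply_action) are hypothetical placeholders for your own hardware stack, and the prompt and unnorm_key should be adapted to your setup.

# Illustrative closed-loop sketch: get_camera_frame and apply_action are hypothetical
# placeholders; only the processor/model calls mirror the snippet above.
def run_episode(model, processor, get_camera_frame, apply_action, instruction, steps=50):
    prompt = f"What action should the robot take to {instruction}?"
    for _ in range(steps):
        image = get_camera_frame()  # your camera interface, returning a PIL image
        inputs = processor(images=[image], text=prompt, return_tensors="pt")
        generation_outputs = model.predict_action(inputs)
        actions = processor.decode_actions(generation_outputs, unnorm_key="bridge_orig/1.0.0")
        apply_action(actions)       # your robot interface consumes the decoded actions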

If you want to use the model for fine-tuning or pre-training, you need to install the required packages and download the model from the Hugging Face model hub. The VLM backbone of SpatialVLA is PaliGemma 2, which requires transformers >= 4.47.0, so create a Python environment with Python >= 3.10.

conda create -n spatialvla python=3.10
conda activate spatialvla

Install the packages from the requirements.txt file. Note that we use a customised dlimp to support seed setting for reproducibility. If you run into any problems, please manually install dlimp from dlimp_custom.

pip install -r requirements.txt
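
Before training, a quick sanity check of the environment can save time; the sketch below assumes the customised fork still installs under the usual dlimp module name.

# Optional environment sanity check.
from packaging.version import Version
import transformers
import dlimp  # should resolve to the customised dlimp_custom fork

assert Version(transformers.__version__) >= Version("4.47.0"), transformers.__version__
print("transformers", transformers.__version__, "- dlimp import OK")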

🌟 Pre-train from Scratch

SpatialVLA is pre-trained on 1.1 million real-robot demonstrations from the OXE and RH20T datasets on a cluster of 64 A100 GPUs for about 10 days, using a batch size of 2048. You can pre-train the model from scratch with the following commands.

# torchrun
bash scripts/spatialvla_4b_pretrain/torchrun_pretrain.sh

# or in a slurm cluster
bash scripts/spatialvla_4b_pretrain/slurm_pretrain.sh
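
For reference, the global batch size of 2048 is the product of the number of GPUs, the per-GPU batch size, and the gradient-accumulation steps; the arithmetic sketch below is illustrative only, and the per-GPU values are assumptions rather than the settings used in the released scripts.

# Illustrative arithmetic: how a 2048 global batch can be composed.
global_batch = 2048
num_gpus = 64                # A100 GPUs used for pre-training
per_gpu_batch = 16           # hypothetical per-device batch size
grad_accum = global_batch // (num_gpus * per_gpu_batch)   # -> 2 accumulation steps
assert num_gpus * per_gpu_batch * grad_accum == global_batch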

🌟 Fine-tune from SpatialVLA

Most of our fine-tuning experiments are conducted with LoRA on 4 or 8 A100 GPUs. You can use the following scripts for either full-parameter or LoRA fine-tuning. For real-world experiments with small datasets, we recommend LoRA fine-tuning.

# full fine-tuning
bash scripts/spatialvla_4b_finetune/finetune_full.sh

# LoRA fine-tuning
bash scripts/spatialvla_4b_finetune/finetune_lora.sh
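
For intuition, LoRA fine-tuning of a HuggingFace model along these lines can be sketched with the peft library; the rank, alpha, and target module names below are illustrative assumptions, and finetune_lora.sh remains the authoritative recipe.

# Minimal PEFT-style LoRA sketch (hyperparameters and target modules are assumptions).
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

model = AutoModel.from_pretrained("IPEC-COMMUNITY/spatialvla-4b-224-pt", trust_remote_code=True)
lora_config = LoraConfig(r=32, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable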

🌟 SimplerEnv Benchmark

We release the SimplerEnv evaluation code for SpatialVLA based on DelinQu/SimplerEnv-OpenVLA. Please install the simpler_env environment by following DelinQu/SimplerEnv-OpenVLA and make sure transformers >= 4.47.0. After installing all the dependencies, you can run the evaluation with:

# under the project dir of SimplerEnv-OpenVLA/
bash scripts/run_spatialvla.sh

Note: Similar to prior work such as HPT and TraceVLA, we omit the Open Top Drawer and Place Apple task from our evaluation, since the vast majority of policies achieve scores approaching 0 on it.

πŸŽ„ Use Custom Datasets

TODO

πŸ€— Model Zoo

| Model Name | VLM Backbone | VLA Model |
| --- | --- | --- |
| SpatialVLA-4B-224-pt | google/paligemma2-3b-pt-224 | spatialvla-4b-224-pt |
| SpatialVLA-4B-mix-224-pt | google/paligemma2-3b-pt-224 | spatialvla-4b-mix-224-pt |
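
Both checkpoints are drop-in replacements in the Quick Start snippet; only the model id changes, e.g.:

model_name_or_path = "IPEC-COMMUNITY/spatialvla-4b-mix-224-pt"  # the "mix" variant from the table above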

βœ… Performance in Simulation and Real-world

Note: SimplerEnv evaluation on Google Robot tasks.

VM = Visual Matching, VA = Variant Aggregation.

| Model | Pick Coke Can (VM) | Move Near (VM) | Open/Close Drawer (VM) | #Average (VM) | Pick Coke Can (VA) | Move Near (VA) | Open/Close Drawer (VA) | #Average (VA) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RT-1 (Begin) | 2.7% | 5.0% | 13.9% | 6.8% | 2.2% | 4.0% | 6.9% | 4.2% |
| RT-1 (15%) | 71.0% | 35.4% | 56.5% | 60.2% | 81.3% | 44.6% | 26.7% | 56.2% |
| RT-1 (Converged) | 85.7% | 44.2% | 73.0% | 74.6% | 89.8% | 50.0% | 32.3% | 63.3% |
| HPT | 56.0% | 60.0% | 24.0% | 46.0% | -- | -- | 31.0% | 45.0% |
| TraceVLA | 28.0% | 53.7% | 57.0% | 42.0% | 60.0% | 56.4% | 29.4% | 39.6% |
| RT-1-X | 56.7% | 31.7% | 59.7% | 53.4% | 49.0% | 32.3% | 35.3% | 64.3% |
| RT-2-X | 78.7% | 77.9% | 25.0% | 60.7% | 82.3% | 79.2% | -- | -- |
| Octo-Base | 17.0% | 4.2% | 22.7% | 16.8% | 0.6% | 3.1% | 1.1% | 1.1% |
| OpenVLA | 16.3% | 46.2% | 35.6% | 27.7% | 54.5% | 47.7% | 17.7% | 39.8% |
| RoboVLM (zero-shot) | 72.7% | 66.3% | 26.8% | 56.3% | 68.3% | 56.0% | 8.5% | 46.3% |
| RoboVLM (fine-tuning) | 77.3% | 61.7% | 43.5% | 63.4% | 75.6% | 60.0% | 10.6% | 51.3% |
| SpatialVLA (zero-shot) | 81.0% | 69.6% | 59.3% | 71.9% | 89.5% | 71.7% | 36.2% | 68.8% |
| SpatialVLA (fine-tuning) | 86.0% | 77.9% | 57.4% | 75.1% | 88.0% | 72.7% | 41.8% | 70.7% |

Note: SimplerEnv evaluation on WidowX Robot tasks.

| Model | Put Spoon on Towel (Grasp) | Put Spoon on Towel (Success) | Put Carrot on Plate (Grasp) | Put Carrot on Plate (Success) | Stack Green Block on Yellow Block (Grasp) | Stack Green Block on Yellow Block (Success) | Put Eggplant in Yellow Basket (Grasp) | Put Eggplant in Yellow Basket (Success) | #Overall Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RT-1-X | 16.7% | 0.0% | 20.8% | 4.2% | 8.3% | 0.0% | 0.0% | 0.0% | 1.1% |
| Octo-Base | 34.7% | 12.5% | 52.8% | 8.3% | 31.9% | 0.0% | 66.7% | 43.1% | 16.0% |
| Octo-Small | 77.8% | 47.2% | 27.8% | 9.7% | 40.3% | 4.2% | 87.5% | 56.9% | 30.0% |
| OpenVLA | 4.1% | 0.0% | 33.3% | 0.0% | 12.5% | 0.0% | 8.3% | 4.1% | 1.0% |
| RoboVLM (zero-shot) | 37.5% | 20.8% | 33.3% | 25.0% | 8.3% | 8.3% | 0.0% | 0.0% | 13.5% |
| RoboVLM (fine-tuning) | 54.2% | 29.2% | 25.0% | 25.0% | 45.8% | 12.5% | 58.3% | 58.3% | 31.3% |
| SpatialVLA (zero-shot) | 25.0% | 20.8% | 41.7% | 20.8% | 58.3% | 25.0% | 79.2% | 70.8% | 34.4% |
| SpatialVLA (fine-tuning) | 20.8% | 16.7% | 29.2% | 25.0% | 62.5% | 29.2% | 100.0% | 100.0% | 42.7% |

Note: LIBERO Simulation Benchmark Results.

| Model | LIBERO-Spatial SR (↑) | LIBERO-Spatial Rank (↓) | LIBERO-Object SR (↑) | LIBERO-Object Rank (↓) | LIBERO-Goal SR (↑) | LIBERO-Goal Rank (↓) | LIBERO-Long SR (↑) | LIBERO-Long Rank (↓) | Average SR (↑) | Average Rank (↓) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Diffusion Policy from scratch | 78.3 Β± 1.1% | 5 | 92.5 Β± 0.7% | 1 | 68.3 Β± 1.2% | 5 | 50.5 Β± 1.3% | 5 | 72.4 Β± 0.7% | 5 |
| Octo fine-tuned | 78.9 Β± 1.0% | 4 | 85.7 Β± 0.9% | 4 | 84.6 Β± 0.9% | 1 | 51.1 Β± 1.3% | 4 | 75.1 Β± 0.6% | 3 |
| OpenVLA fine-tuned | 84.7 Β± 0.9% | 2 | 88.4 Β± 0.8% | 3 | 79.2 Β± 1.0% | 2 | 53.7 Β± 1.3% | 3 | 76.5 Β± 0.6% | 2 |
| TraceVLA fine-tuned | 84.6 Β± 0.2% | 3 | 85.2 Β± 0.4% | 5 | 75.1 Β± 0.3% | 4 | 54.1 Β± 1.0% | 2 | 74.8 Β± 0.5% | 4 |
| SpatialVLA fine-tuned | 88.2 Β± 0.5% | 1 | 89.9 Β± 0.7% | 2 | 78.6 Β± 0.6% | 3 | 55.5 Β± 1.0% | 1 | 78.1 Β± 0.7% | 1 |

Note: Zero-shot Robot Control Evaluation on real-world WidowX Robot.

Note: Spatial Understanding Capability Evaluation.

Note: Adapting to New Robot Setups on Franka Robot.

TODO List

  • Release pre-training / fine-tuning code for SpatialVLA series.
  • Release the code, model, and custom data of SpatialVLA.
  • Release the SimplerEnv evaluation code for SpatialVLA series.
  • Release SpatialVLA2

πŸ€— FAQs

If you encounter any issues, feel free to open an issue on GitHub or reach out through discussions. We appreciate your feedback and contributions! πŸš€

License

This project is released under the MIT license. Parts of this project contain code and models from other sources, which are subject to their respective licenses.

Citation

If you find this project useful in your research, please consider citing:

@misc{qu2025spatialvlaexploringspatialrepresentations,
      title={SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model}, 
      author={Delin Qu and Haoming Song and Qizhi Chen and Yuanqi Yao and Xinyi Ye and Yan Ding and Zhigang Wang and JiaYuan Gu and Bin Zhao and Dong Wang and Xuelong Li},
      year={2025},
      eprint={2501.15830},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2501.15830}, 
}

Acknowledgement

SpatialVLA is built with reference to the code of the following projects: InternVL, Google PaliGemma 2, Transformers, OpenVLA, and ZoeDepth. Thanks for their awesome work!
