Recent advancements have exploited diffusion models for the synthesis of either LiDAR point clouds or camera image data in driving scenarios. Despite their success in modeling the marginal distribution of single-modality data, the mutual reliance between modalities needed to describe complex driving scenes remains under-explored. To fill this gap, we propose a novel framework, X-DRIVE, to model the joint distribution of point clouds and multi-view images via a dual-branch latent diffusion model architecture. Considering the distinct geometrical spaces of the two modalities, X-DRIVE conditions the synthesis of each modality on the corresponding local regions from the other modality, ensuring better alignment and realism. To further handle the spatial ambiguity during denoising, we design the cross-modality condition module based on epipolar lines to adaptively learn the cross-modality local correspondence. Besides, X-DRIVE allows for controllable generation through multi-level input conditions, including text, bounding boxes, images, and point clouds. Extensive results demonstrate the high-fidelity synthetic results of X-DRIVE for both point clouds and multi-view images, adhering to the input conditions while ensuring reliable cross-modality consistency.
We jointly generate pairwise LiDAR-camera data with cross-modality consistency.
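As a rough illustration of the epipolar-based cross-modality condition module described above, the sketch below lets each image-pixel query attend over LiDAR features sampled along its epipolar line, so the attention weights resolve the depth ambiguity adaptively. All names here (`EpipolarCrossModalAttention`, `sample_fn`, `num_samples`) are hypothetical simplifications, not the actual implementation in this repo.

```python
# Minimal conceptual sketch of epipolar-line cross-modality conditioning
# (hypothetical and simplified; the real module in this repo differs in details).
import torch
import torch.nn as nn


class EpipolarCrossModalAttention(nn.Module):
    """Each image-pixel query attends over LiDAR features sampled along its
    epipolar line, since the pixel's depth (and thus its 3D match) is ambiguous."""

    def __init__(self, dim: int, num_heads: int = 8, num_samples: int = 16):
        super().__init__()
        self.num_samples = num_samples
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, img_queries, lidar_latents, sample_fn):
        # img_queries:   (B, N, C) image-branch latent features (queries)
        # lidar_latents: LiDAR-branch latent features
        # sample_fn:     returns (B, N, num_samples, C) LiDAR features sampled
        #                along each query's epipolar line
        B, N, C = img_queries.shape
        epipolar_feats = sample_fn(lidar_latents, self.num_samples)
        q = img_queries.reshape(B * N, 1, C)
        kv = epipolar_feats.reshape(B * N, self.num_samples, C)
        out, _ = self.attn(q, kv, kv)  # attention adaptively picks the depth
        return out.reshape(B, N, C)
```

The symmetric direction (conditioning LiDAR denoising on image features) follows the same pattern with the roles of the two branches swapped.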

- We incorporate the RangeLDM model architecture and pretrained weights into our LiDAR branch to simplify our training pipeline. We thank the authors for releasing their excellent work.
- We include the EMA model in our training pipeline (see the generic sketch after the list below).
- Training code
- DAS metric for multimodal alignment
- Visualization code
- Pretrained checkpoints & Generation of synthetic dataset
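Regarding the EMA model mentioned above, a generic exponential-moving-average weight update looks like the following; this is illustrative only, not the exact EMA wrapper used in this repo.

```python
# Generic EMA weight tracking (illustrative; the repo's EMA code may differ).
import copy
import torch


@torch.no_grad()
def update_ema(ema_model, model, decay: float = 0.9999):
    """ema = decay * ema + (1 - decay) * current_weights"""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)


# Usage inside a training loop (sketch):
# ema_model = copy.deepcopy(model).eval()
# for batch in loader:
#     loss = train_step(model, batch)   # hypothetical helper
#     update_ema(ema_model, model)
```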
Our code base is developed on the basis of MagicDrive, so our environment setup is almost the same as theirs. We appreciate their efforts in making the code open-source!
The code is tested with PyTorch==1.10.2 on A6000 GPUs. To set up the Python environment, run:
pip install -r requirements/dev.txt
We opt to install the following packages from source, with `cd ${FOLDER}; pip -vvv install .`:
```bash
# install third-party
third_party/
├── bevfusion -> based on db75150
├── diffusers -> based on v0.17.1 (afcca39)
└── xformers -> based on v0.0.19 (8bf59c9), optional
```
See the note about our xformers. If you have issues with the environment setup, please check the FAQ first.
We prepare the nuScenes dataset similarly to MagicDrive. Specifically,

- Download the nuScenes dataset from the website and put it in `./data/`. You should have these files:

```bash
data/nuscenes
├── maps
├── mini
├── samples
├── sweeps
├── v1.0-mini
└── v1.0-trainval
```

- Prepare the mmdet3d annotation files.
> **Tip:** You can download the `.pkl` files from Google Drive.
Alternatively, you can generate the mmdet3d annotation files by:
```bash
python tools/create_data.py nuscenes --root-path ./data/nuscenes \
--out-dir ./data/nuscenes_mmdet3d_2 --extra-tag nuscenes
```
You should have these files:
```bash
data/nuscenes_mmdet3d_2
├── nuscenes_dbinfos_train.pkl (-> ${bevfusion-version}/nuscenes_dbinfos_train.pkl)
├── nuscenes_gt_database (-> ${bevfusion-version}/nuscenes_gt_database)
├── nuscenes_infos_train.pkl
└── nuscenes_infos_val.pkl
```
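As a quick sanity check that the annotation files load correctly, you can inspect them with plain `pickle`; the `infos`/`metadata` keys below assume the standard mmdet3d layout.

```python
# Quick sanity check of the generated mmdet3d annotation files
# (assumes the standard mmdet3d layout with 'infos' and 'metadata' keys).
import pickle

with open("data/nuscenes_mmdet3d_2/nuscenes_infos_train.pkl", "rb") as f:
    data = pickle.load(f)

print(list(data.keys()))              # typically ['infos', 'metadata']
print(len(data["infos"]), "samples")  # number of annotated training samples
print(data["metadata"])               # e.g. {'version': 'v1.0-trainval'}
```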
We initialize our multi-modal generation model with stable-diffusion-2.1 (for the camera branch) and RangeLDM (for the LiDAR branch). Please download them and arrange the directory as follows:
```bash
X-Drive
└── pretrained
    ├── stable-diffusion-2-1-base
    └── RangeLDM-nuScenes
```
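To verify that the Stable Diffusion weights are complete and readable, you can try loading them with diffusers (a quick check only; the RangeLDM weights are consumed by the training code itself):

```python
# Quick check that the downloaded Stable Diffusion weights load with diffusers.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("pretrained/stable-diffusion-2-1-base")
print(type(pipe.unet).__name__)  # prints UNet2DConditionModel if the weights are intact
```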
Our method requires two training stages:
- In the first stage, we train the diffusion model for single-modality point cloud data, conditioned on text descriptions and 3D boxes.
Launch training (with 4xA6000 GPUs):
accelerate launch --mixed_precision fp16 --gpu_ids 0,1,2,3 --num_processes 4 tools/train.py --config-name=config_pc_RangeLDM_box +exp=pc_ldm_box runner=4gpus_pc
- In the second stage, we train the multi-modal diffusion model from the pretrained LiDAR branch and camera branch.
Launch training (with 4xA6000 GPUs):
accelerate launch --mixed_precision fp16 --gpu_ids 0,1,2,3 --num_processes 4 tools/train.py --config-name=config_multi_box +exp=multi_ldm_box runner=4gpus_multi
During training, you can check TensorBoard for logs and intermediate results.
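Since training is driven by Hydra-style configs (`--config-name` plus `+exp=`/`runner=` overrides), you can print the composed configuration before launching a run; the `config_path="configs"` below is an assumption about the repo layout, so adjust it to wherever the config files live.

```python
# Inspect the composed training config without launching training
# (config_path="configs" is an assumed location of the Hydra config files).
from hydra import compose, initialize
from omegaconf import OmegaConf

with initialize(config_path="configs"):
    cfg = compose(
        config_name="config_multi_box",
        overrides=["+exp=multi_ldm_box", "runner=4gpus_multi"],
    )
    print(OmegaConf.to_yaml(cfg))
```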
Coming soon...
If you find our paper or code helpful, please consider citing it as:
```bibtex
@article{xie2024x,
  title={X-Drive: Cross-modality consistent multi-sensor data synthesis for driving scenarios},
  author={Xie, Yichen and Xu, Chenfeng and Peng, Chensheng and Zhao, Shuqi and Ho, Nhat and Pham, Alexander T and Ding, Mingyu and Tomizuka, Masayoshi and Zhan, Wei},
  journal={arXiv preprint arXiv:2411.01123},
  year={2024}
}
```