
[ICLR 2025] X-Drive: Cross-modality Consistent Multi-Sensor Data Synthesis for Driving Scenarios

Abstract

Recent advancements have exploited diffusion models for the synthesis of either LiDAR point clouds or camera image data in driving scenarios. Despite their success in modeling the marginal distribution of single-modality data, the mutual reliance between different modalities to describe complex driving scenes remains under-explored. To fill this gap, we propose a novel framework, X-DRIVE, to model the joint distribution of point clouds and multi-view images via a dual-branch latent diffusion model architecture. Considering the distinct geometrical spaces of the two modalities, X-DRIVE conditions the synthesis of each modality on the corresponding local regions from the other modality, ensuring better alignment and realism. To further handle the spatial ambiguity during denoising, we design a cross-modality condition module based on epipolar lines to adaptively learn the cross-modality local correspondence. Besides, X-DRIVE allows for controllable generation through multi-level input conditions, including text, bounding boxes, images, and point clouds. Extensive results demonstrate the high-fidelity synthetic results of X-DRIVE for both point clouds and multi-view images, adhering to the input conditions while ensuring reliable cross-modality consistency.

Paper: https://arxiv.org/abs/2411.01123

Framework

We jointly generate pairwise LiDAR-camera data with cross-modality consistency. (See the framework figure in the repository.)

Qualitative Results

results

Improvements over the first arXiv version

  • We incorporate the RangeLDM model architecture and pretrained weights into our LiDAR branch to simplify our training pipeline. We thank the authors for releasing their excellent work.
  • We include the EMA model in our training pipeline.

Updates

  • Training code
  • DAS metric for multimodal alignment
  • Visualization code
  • Pretrained checkpoints & Generation of synthetic dataset

Getting Started

Our codebase is built on top of MagicDrive, so our environment setup is almost the same as theirs. We appreciate their effort in making the code open-source!

Environment Setup

The code is tested with PyTorch==1.10.2 on A6000 GPUs. To set up the Python environment, run:

pip install -r requirements/dev.txt
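
Before that, you may want an isolated environment with a matching PyTorch build. Here is a minimal sketch; the environment name, Python version, and CUDA 11.3 wheels are our assumptions, as the repo only states PyTorch 1.10.2:

```bash
# Hypothetical environment setup; only PyTorch 1.10.2 is stated above.
conda create -n xdrive python=3.8 -y
conda activate xdrive

# Install PyTorch 1.10.2 (CUDA 11.3 wheels assumed; pick the build matching your driver).
pip install torch==1.10.2+cu113 torchvision==0.11.3+cu113 \
    -f https://download.pytorch.org/whl/cu113/torch_stable.html
```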

We install the following packages from source, with cd ${FOLDER}; pip -vvv install .

# install third-party
third_party/
├── bevfusion -> based on db75150
├── diffusers -> based on v0.17.1 (afcca39)
└── xformers  -> based on v0.0.19 (8bf59c9), optional

See the note about our xformers. If you have issues with the environment setup, please check the FAQ first.
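
For reference, a minimal sketch of the from-source installs described above (the folder names follow the tree; no extra flags are implied):

```bash
# Install the vendored third-party packages from source.
for pkg in bevfusion diffusers xformers; do   # xformers is optional
    (cd third_party/${pkg} && pip -vvv install .)
done
```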

Data Preparation

We prepare the nuScenes dataset similarly to MagicDrive. Specifically:

  1. Download the nuScenes dataset from the official website and put it in ./data/. You should have these files:

    data/nuscenes
    ├── maps
    ├── mini
    ├── samples
    ├── sweeps
    ├── v1.0-mini
    └── v1.0-trainval
  2. Prepare the mmdet3d annotation files.

Tip

You can download the .pkl files from Google Drive.

Alternatively, you can generate the mmdet3d annotation files by running:

```bash
python tools/create_data.py nuscenes --root-path ./data/nuscenes \
  --out-dir ./data/nuscenes_mmdet3d_2 --extra-tag nuscenes
```
You should have these files:
```bash
data/nuscenes_mmdet3d_2
├── nuscenes_dbinfos_train.pkl (-> ${bevfusion-version}/nuscenes_dbinfos_train.pkl)
├── nuscenes_gt_database (-> ${bevfusion-version}/nuscenes_gt_database)
├── nuscenes_infos_train.pkl
└── nuscenes_infos_val.pkl
```
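
If you generate the files yourself, a quick sanity check that the expected annotation files are in place (a minimal sketch; the file names come from the tree above):

```bash
# Verify that the mmdet3d annotation files exist.
for f in nuscenes_infos_train.pkl nuscenes_infos_val.pkl nuscenes_dbinfos_train.pkl; do
    [ -e "data/nuscenes_mmdet3d_2/${f}" ] && echo "OK       ${f}" || echo "MISSING  ${f}"
done
[ -d data/nuscenes_mmdet3d_2/nuscenes_gt_database ] || echo "MISSING  nuscenes_gt_database/"
```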

Training Pipeline

Initial Weights

We initialize our multi-modal generation model with stable-diffusion-2.1 (for the camera branch) and RangeLDM (for the LiDAR branch). Please download them and arrange the directories as follows:

X-Drive
├── pretrained
    ├── stable-diffusion-2-1-base
    └── RangeLDM-nuScenes
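
One way to fetch the camera-branch weights is via the Hugging Face CLI; the sketch below assumes huggingface_hub is installed, and the RangeLDM checkpoint location is not specified here, so that path is only a placeholder:

```bash
# Camera branch: Stable Diffusion 2.1 base from Hugging Face.
pip install -U "huggingface_hub[cli]"
huggingface-cli download stabilityai/stable-diffusion-2-1-base \
    --local-dir pretrained/stable-diffusion-2-1-base

# LiDAR branch: place the RangeLDM nuScenes weights released by its authors here.
mkdir -p pretrained/RangeLDM-nuScenes
```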

Model Training

Our method is trained in two stages:

  1. In the first stage, we train the diffusion model for single-modality point cloud data, conditioned on text descriptions and 3D boxes.

Launch training (with 4xA6000 GPUs):

accelerate launch --mixed_precision fp16 --gpu_ids 0,1,2,3 --num_processes 4 tools/train.py --config-name=config_pc_RangeLDM_box  +exp=pc_ldm_box runner=4gpus_pc
  2. In the second stage, we train the multi-modal diffusion model from the pretrained LiDAR and camera branches.

Launch training (with 4xA6000 GPUs):

accelerate launch --mixed_precision fp16 --gpu_ids 0,1,2,3 --num_processes 4 tools/train.py --config-name=config_multi_box  +exp=multi_ldm_box runner=4gpus_multi

During training, you can monitor logs and intermediate results with TensorBoard.
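
For example, assuming the logs are written under your run's output directory (the exact path depends on your config, so treat it as a placeholder):

```bash
# Point TensorBoard at the run's log directory (placeholder path).
tensorboard --logdir ./outputs --port 6006
```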

Model Inference and Visualization

Coming soon...

Reference

If our paper or code is helpful, please consider citing it as:

@article{xie2024x,
  title={X-Drive: Cross-modality consistent multi-sensor data synthesis for driving scenarios},
  author={Xie, Yichen and Xu, Chenfeng and Peng, Chensheng and Zhao, Shuqi and Ho, Nhat and Pham, Alexander T and Ding, Mingyu and Tomizuka, Masayoshi and Zhan, Wei},
  journal={arXiv preprint arXiv:2411.01123},
  year={2024}
}
