Recent advancements have exploited diffusion models for the synthesis of either LiDAR point clouds or camera image data in driving scenarios. Despite their success in modeling the marginal distribution of single-modality data, the mutual reliance between modalities needed to describe complex driving scenes remains under-explored. To fill this gap, we propose a novel framework, X-DRIVE, to model the joint distribution of point clouds and multi-view images via a dual-branch latent diffusion model architecture. Considering the distinct geometrical spaces of the two modalities, X-DRIVE conditions the synthesis of each modality on the corresponding local regions from the other modality, ensuring better alignment and realism. To further handle the spatial ambiguity during denoising, we design the cross-modality condition module based on epipolar lines to adaptively learn the cross-modality local correspondence. Besides, X-DRIVE allows for controllable generation through multi-level input conditions, including text, bounding boxes, images, and point clouds. Extensive results demonstrate the high-fidelity synthetic results of X-DRIVE for both point clouds and multi-view images, adhering to the input conditions while ensuring reliable cross-modality consistency.
We jointly generate pairwise LiDAR-camera data with cross-modality consistency.
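As a rough illustration of the epipolar-based cross-modality condition module described above, the sketch below lets each image-pixel query attend over LiDAR features sampled along its epipolar line, so the attention weights resolve the depth ambiguity adaptively. All names here (`EpipolarCrossModalAttention`, `sample_fn`, `num_samples`) are hypothetical simplifications, not the actual implementation in this repo.

```python
# Minimal conceptual sketch of epipolar-line cross-modality conditioning
# (hypothetical and simplified; the real module in this repo differs in details).
import torch
import torch.nn as nn


class EpipolarCrossModalAttention(nn.Module):
    """Each image-pixel query attends over LiDAR features sampled along its
    epipolar line, since the pixel's depth (and thus its 3D match) is ambiguous."""

    def __init__(self, dim: int, num_heads: int = 8, num_samples: int = 16):
        super().__init__()
        self.num_samples = num_samples
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, img_queries, lidar_latents, sample_fn):
        # img_queries:   (B, N, C) image-branch latent features (queries)
        # lidar_latents: LiDAR-branch latent features
        # sample_fn:     returns (B, N, num_samples, C) LiDAR features sampled
        #                along each query's epipolar line
        B, N, C = img_queries.shape
        epipolar_feats = sample_fn(lidar_latents, self.num_samples)
        q = img_queries.reshape(B * N, 1, C)
        kv = epipolar_feats.reshape(B * N, self.num_samples, C)
        out, _ = self.attn(q, kv, kv)  # attention adaptively picks the depth
        return out.reshape(B, N, C)
```

The symmetric direction (conditioning LiDAR denoising on image features) follows the same pattern with the roles of the two branches swapped.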

- We incorporate the RangeLDM model architecture and pretrained weights into our LiDAR branch to simplify our training pipeline. We thank the authors for releasing their excellent work.
- We include the EMA model in our training pipeline (see the generic sketch after the list below).
- Training code
- DAS metric for multimodal alignment
- Visualization code
- Pretrained checkpoints & Generation of synthetic dataset
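Regarding the EMA model mentioned above, a generic exponential-moving-average weight update looks like the following; this is illustrative only, not the exact EMA wrapper used in this repo.

```python
# Generic EMA weight tracking (illustrative; the repo's EMA code may differ).
import copy
import torch


@torch.no_grad()
def update_ema(ema_model, model, decay: float = 0.9999):
    """ema = decay * ema + (1 - decay) * current_weights"""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)


# Usage inside a training loop (sketch):
# ema_model = copy.deepcopy(model).eval()
# for batch in loader:
#     loss = train_step(model, batch)   # hypothetical helper
#     update_ema(ema_model, model)
```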
Our code base is developed on the basis of MagicDrive, so our environment setup is almost the same as theirs. We appreciate their efforts in making the code open-source!
The code is tested with PyTorch==1.10.2 on A6000 GPUs. To set up the Python environment, run:
pip install -r requirements/dev.txt
We opt to install the following packages from source, with `cd ${FOLDER}; pip -vvv install .`:
```bash
# install third-party
third_party/
├── bevfusion -> based on db75150
├── diffusers -> based on v0.17.1 (afcca39)
└── xformers -> based on v0.0.19 (8bf59c9), optional
```
See the note about our xformers. If you have issues with the environment setup, please check the FAQ first.
We prepare the nuScenes dataset similarly to MagicDrive. Specifically,

- Download the nuScenes dataset from the website and put it in `./data/`. You should have these files:

```bash
data/nuscenes
├── maps
├── mini
├── samples
├── sweeps
├── v1.0-mini
└── v1.0-trainval
```

- Prepare the mmdet3d annotation files.
> **Tip:** You can download the `.pkl` files from Google Drive.
Alternatively, you can generate the mmdet3d annotation files by:
```bash
python tools/create_data.py nuscenes --root-path ./data/nuscenes \
--out-dir ./data/nuscenes_mmdet3d_2 --extra-tag nuscenes
```
You should have these files:
```bash
data/nuscenes_mmdet3d_2
├── nuscenes_dbinfos_train.pkl (-> ${bevfusion-version}/nuscenes_dbinfos_train.pkl)
├── nuscenes_gt_database (-> ${bevfusion-version}/nuscenes_gt_database)
├── nuscenes_infos_train.pkl
└── nuscenes_infos_val.pkl
```
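As a quick sanity check that the annotation files load correctly, you can inspect them with plain `pickle`; the `infos`/`metadata` keys below assume the standard mmdet3d layout.

```python
# Quick sanity check of the generated mmdet3d annotation files
# (assumes the standard mmdet3d layout with 'infos' and 'metadata' keys).
import pickle

with open("data/nuscenes_mmdet3d_2/nuscenes_infos_train.pkl", "rb") as f:
    data = pickle.load(f)

print(list(data.keys()))              # typically ['infos', 'metadata']
print(len(data["infos"]), "samples")  # number of annotated training samples
print(data["metadata"])               # e.g. {'version': 'v1.0-trainval'}
```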
We initialize our multi-modal generation model with stable-diffusion-2.1 (for the camera branch) and RangeLDM (for the LiDAR branch). Please download them and arrange the directory as follows:
```bash
X-Drive
└── pretrained
    ├── stable-diffusion-2-1-base
    └── RangeLDM-nuScenes
```
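To verify that the Stable Diffusion weights are complete and readable, you can try loading them with diffusers (a quick check only; the RangeLDM weights are consumed by the training code itself):

```python
# Quick check that the downloaded Stable Diffusion weights load with diffusers.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("pretrained/stable-diffusion-2-1-base")
print(type(pipe.unet).__name__)  # prints UNet2DConditionModel if the weights are intact
```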
Our method requires two training stages:
- In the first stage, we train the diffusion model for single-modality point cloud data, conditioned on text descriptions and 3D boxes.
Launch training (with 4xA6000 GPUs):
accelerate launch --mixed_precision fp16 --gpu_ids 0,1,2,3 --num_processes 4 tools/train.py --config-name=config_pc_RangeLDM_box +exp=pc_ldm_box runner=4gpus_pc
- In the second stage, we train the multi-modal diffusion model from the pretrained LiDAR branch and camera branch.
Launch training (with 4xA6000 GPUs):
accelerate launch --mixed_precision fp16 --gpu_ids 0,1,2,3 --num_processes 4 tools/train.py --config-name=config_multi_box +exp=multi_ldm_box runner=4gpus_multi
During training, you can check TensorBoard for logs and intermediate results.
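Since training is driven by Hydra-style configs (`--config-name` plus `+exp=`/`runner=` overrides), you can print the composed configuration before launching a run; the `config_path="configs"` below is an assumption about the repo layout, so adjust it to wherever the config files live.

```python
# Inspect the composed training config without launching training
# (config_path="configs" is an assumed location of the Hydra config files).
from hydra import compose, initialize
from omegaconf import OmegaConf

with initialize(config_path="configs"):
    cfg = compose(
        config_name="config_multi_box",
        overrides=["+exp=multi_ldm_box", "runner=4gpus_multi"],
    )
    print(OmegaConf.to_yaml(cfg))
```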
Coming soon...
If you find our paper or code helpful, please consider citing it as:
```bibtex
@article{xie2024x,
  title={X-Drive: Cross-modality consistent multi-sensor data synthesis for driving scenarios},
  author={Xie, Yichen and Xu, Chenfeng and Peng, Chensheng and Zhao, Shuqi and Ho, Nhat and Pham, Alexander T and Ding, Mingyu and Tomizuka, Masayoshi and Zhan, Wei},
  journal={arXiv preprint arXiv:2411.01123},
  year={2024}
}
```