Tony Xu, Sepehr Hosseini, Chris Anderson, Anthony Rinaldi, Rahul G. Krishnan, Anne Martel, Maged Goubran
Paper: https://arxiv.org/abs/2501.11755
This codebase contains PyTorch code and our pretrained 3DINO-ViT model for 3DINO, a self-supervised framework for training networks on unlabeled 3D medical images, developed at Sunnybrook Research Institute.
Abstract: Current self-supervised learning methods for 3D medical imaging rely on simple pretext formulations and organ- or modality-specific datasets, limiting their generalizability and scalability. We present 3DINO, a cutting-edge SSL method adapted to 3D datasets, and use it to pretrain 3DINO-ViT: a general-purpose medical imaging model, on an exceptionally large, multimodal, and multi-organ dataset of ~100,000 3D medical imaging scans from over 10 organs. We validate 3DINO-ViT using extensive experiments on numerous medical imaging segmentation and classification tasks. Our results demonstrate that 3DINO-ViT generalizes across modalities and organs, including out-of-distribution tasks and datasets, outperforming state-of-the-art methods on the majority of evaluation metrics and labeled dataset sizes. Our 3DINO framework and 3DINO-ViT will be made available to enable research on 3D foundation models or further finetuning for a wide range of medical imaging applications.
3DINO code runs on Python 3.9. Clone the codebase, then use the provided requirements.txt
file and pip to install the necessary libraries for this repo:
pip install -r requirements.txt
The 3DINO-ViT model will be released upon acceptance of the paper!
A barebones example for loading the pretrained network and applying it to an image to extract a feature vector representation is provided in this notebook.
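For orientation, the following is a minimal sketch of what feature extraction looks like. It uses a generic MONAI ViT purely as a stand-in for 3DINO-ViT, and the preprocessing values are illustrative rather than the exact pretraining pipeline; the notebook contains the supported loading code.

import torch
from monai.networks.nets import ViT  # stand-in architecture; use the notebook's loader for 3DINO-ViT
from monai.transforms import Compose, EnsureChannelFirst, LoadImage, Resize, ScaleIntensity

# Illustrative preprocessing; match the repo's transforms for real experiments.
preprocess = Compose([
    LoadImage(image_only=True),
    EnsureChannelFirst(),
    ScaleIntensity(),
    Resize((112, 112, 112)),
])

volume = preprocess("path/to/image.nii.gz").unsqueeze(0).float()  # (1, 1, 112, 112, 112)

model = ViT(in_channels=1, img_size=(112, 112, 112), patch_size=(16, 16, 16), classification=False)
# For the pretrained model, load the released teacher weights instead, e.g.:
# state_dict = torch.load("path/to/teacher_checkpoint.pth", map_location="cpu")
# model.load_state_dict(state_dict, strict=False)
model.eval()

with torch.no_grad():
    tokens, _ = model(volume)     # (1, num_patches, hidden_dim)
    feature = tokens.mean(dim=1)  # simple global feature vector: (1, hidden_dim)
print(feature.shape)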
We provide code to pretrain 3DINO on general 3D medical imaging datasets. We use MONAI for data loading, so any format that can be loaded by the LoadImage transform can be used. Datasets that we pretrained on can be found in the paper.
Datasets should be formatted as a list of dictionaries with the following format.
The image key should point to the path of the image file, the shape key should contain the shape of the image (e.g. the value of loaded_img.shape after loading), and the spacing key should contain the voxel spacing of the image in arbitrary units (but consistent within each image). The shape and spacing keys are needed for 3D random resized cropping.
[
{
"image": "path/to/image1.nii.gz",
"shape": [128, 128, 64],
"spacing": [0.5, 0.5, 1.0],
},
{
"image": "path/to/image2.nii.gz",
"shape": [256, 256, 128],
"spacing": [0.7, 0.7, 1.0],
},
...
]
Save this as a JSON file, and adjust dataset_path in the config file to point to it.
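A small helper script along the following lines can build this JSON automatically; the directory layout, file extension, and output filename below are assumptions, so adapt them to your data.

import json
from pathlib import Path
from monai.transforms import LoadImage

loader = LoadImage(image_only=True)  # handles any format supported by MONAI's readers

datalist = []
for path in sorted(Path("path/to/pretraining_images").glob("*.nii.gz")):
    img = loader(str(path))
    # MetaTensor exposes voxel spacing via .pixdim; fall back to unit spacing if unavailable
    spacing = img.pixdim.tolist() if hasattr(img, "pixdim") else [1.0, 1.0, 1.0]
    datalist.append({
        "image": str(path),
        "shape": list(img.shape),
        "spacing": spacing,
    })

with open("pretraining_datalist.json", "w") as f:
    json.dump(datalist, f, indent=2)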
Standard pretraining can be run using the following command for a single node with 4 A100-80GB GPUs:
PYTHONPATH=. python -m torch.distributed.launch --nproc_per_node 4 --master_port 29501 dinov2/train/train3d.py \
--config-file 'dinov2/configs/ssl3d_default_config.yaml' \
--output-dir 'path/to/output_dir' \
--cache-dir 'path/to/cache_dir'
The cache-dir argument is used for MONAI CacheNTransDataset caching. This dataset saves images to disk after the first few preprocessing transforms (potentially on a faster temporary storage system if training on a SLURM cluster). We found this to greatly speed up loading. Remove this argument if you do not want to use caching.
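For example, on a SLURM cluster the cache might be pointed at node-local scratch; the environment variable is cluster-dependent (SLURM_TMPDIR is common but not universal):

--cache-dir "${SLURM_TMPDIR}/3dino_cache"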
Pretraining for 125000 iterations with a ViT-Large took approximately 10 days on 4 A100-80GB GPUs.
The training code saves the teacher weights in the eval folder every 12500 iterations.
To perform high resolution adaptation on the pretrained network, use the following command.
Adjust full_pretrained_weights in the config file to point to the teacher weights saved during standard pretraining.
PYTHONPATH=. python -m torch.distributed.launch --nproc_per_node 4 --master_port 29501 dinov2/train/train3d.py \
--config-file 'dinov2/configs/train/vit3d_highres.yaml' \
--output-dir 'path/to/highres_output_dir' \
--cache-dir 'path/to/cache_dir'
High-resolution adaptation for 12500 iterations with a ViT-Large took approximately 1 day on 4 A100-80GB GPUs.
The training code regularly saves the teacher weights. To evaluate the model, first format your finetuning dataset as described below, then run the corresponding evaluation command on a single node.
Datasets should be formatted as a dict with training, validation, and test keys, each containing a list of dictionaries, as in the following format:
{
"training": [
{
"image": "path/to/train/image1.nii.gz",
"label": "path/to/train/label1.nii.gz", # or int for classification
},
...
],
"validation": [
{
"image": "path/to/val/image1.nii.gz",
"label": "path/to/val/label1.nii.gz",
},
...
],
"test": [
{
"image": "path/to/test/image1.nii.gz",
"label": "path/to/test/label1.nii.gz",
},
...
]
}
Save this as a JSON file named <dataset_name>_100_datalist.json, and adjust the base-data-dir argument below to point to the directory where it is saved.
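A script along these lines can generate the split file; the paired images/labels folder layout and dataset name below are hypothetical, so adjust them to how your data is stored.

import json
from pathlib import Path

base = Path("path/to/MyDataset")  # hypothetical layout: <split>/images and <split>/labels subfolders
splits = {}
for split in ("training", "validation", "test"):
    pairs = []
    for img in sorted((base / split / "images").glob("*.nii.gz")):
        pairs.append({
            "image": str(img),
            "label": str(base / split / "labels" / img.name),  # or an int class index for classification
        })
    splits[split] = pairs

# Filename convention expected by the evaluation scripts: <dataset_name>_100_datalist.json
with open("path/to/finetuning/jsonfile/base_dir/MyDataset_100_datalist.json", "w") as f:
    json.dump(splits, f, indent=2)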
To finetune the model for segmentation, use the following command.
There are UNETR, ViTAdapterUNETR, and Linear segmentation heads available. ViTAdapterUNETR uses our 3D adaptation of the original ViT-Adapter module, which injects spatial information into pretrained ViTs. This module was also used in the original DINOv2 when performing segmentation tasks.
PYTHONPATH=. python dinov2/eval/segmentation3d.py \
--config-file 'dinov2/configs/train/vit3d_highres.yaml' \
--output-dir 'path/to/output_dir' \
--pretrained-weights 'path/to/eval/training_12499/teacher_checkpoint.pth' \
--dataset-name 'BraTS' \
--dataset-percent 100 \
--base-data-dir 'path/to/finetuning/jsonfile/base_dir' \
--segmentation-head 'ViTAdapterUNETR' \
--epochs 100 \
--epoch-length 300 \
--eval-iters 600 \
--warmup-iters 3000 \
--image-size 112 \
--batch-size 2 \
--num-workers 20 \
--learning-rate 1e-4 \
--cache-dir 'path/to/cache_dir' \
--resize-scale 1.0
Adding new segmentation datasets for finetuning requires the following steps (a rough transform/loss sketch follows the list):
- Create a new dataset name and save the JSON file as described above.
- Create training and validation transforms for the dataset in dinov2/eval/segmentation_3d/augmentations.py.
- Create new evaluation metrics for the dataset in dinov2/eval/segmentation_3d/metrics.py.
- Create a loss function to train the network in dinov2/eval/segmentation3d.py (line 214).
- Create a new segmentation dataset in dinov2/data/loaders.py.
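As a starting point, the sketch below shows what the transforms and loss for a new multi-class segmentation dataset might look like. It uses standard MONAI components with illustrative keys, spacing, and crop sizes; it is not the repo's existing code, so mirror the style of the datasets already defined in augmentations.py and segmentation3d.py.

from monai.losses import DiceCELoss
from monai.transforms import (
    Compose, CropForegroundd, EnsureChannelFirstd, LoadImaged,
    NormalizeIntensityd, RandCropByPosNegLabeld, RandFlipd, Spacingd,
)

# Training transforms for a hypothetical new dataset (validation transforms would
# drop the random crops/flips).
train_transforms = Compose([
    LoadImaged(keys=["image", "label"]),
    EnsureChannelFirstd(keys=["image", "label"]),
    Spacingd(keys=["image", "label"], pixdim=(1.0, 1.0, 1.0), mode=("bilinear", "nearest")),
    NormalizeIntensityd(keys="image", nonzero=True, channel_wise=True),
    CropForegroundd(keys=["image", "label"], source_key="image"),
    RandCropByPosNegLabeld(keys=["image", "label"], label_key="label",
                           spatial_size=(112, 112, 112), pos=1, neg=1, num_samples=2),
    RandFlipd(keys=["image", "label"], prob=0.5, spatial_axis=0),
])

# Example loss for multi-class segmentation with integer labels
# (cf. dinov2/eval/segmentation3d.py, line 214).
loss_fn = DiceCELoss(to_onehot_y=True, softmax=True)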
To finetune the model for classification, use the following command.
PYTHONPATH=. python dinov2/eval/linear3d.py \
--config-file 'dinov2/configs/train/vit3d_highres.yaml' \
--output-dir 'path/to/output_dir' \
--pretrained-weights 'path/to/eval/training_12499/teacher_checkpoint.pth' \
--dataset-name 'COVID-CT-MD' \
--dataset-percent 100 \
--base-data-dir 'path/to/finetuning/jsonfile/base_dir' \
--epochs 100 \
--epoch-length 125 \
--save-checkpoint-frequency 50 \
--eval-period-iterations 50 \
--image-size 112 \
--batch-size 32 \
--num-workers 10 \
--dataset-seed 0 \
--cache-dir 'path/to/cache_dir'
Adding new classification datasets for finetuning requires the following steps (a rough transform sketch follows the list):
- Create a new dataset name and save the JSON file as described above.
- Create training and validation transforms for the dataset in dinov2/data/transforms.py.
- Create a new classification dataset in dinov2/data/loaders.py.
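As with segmentation, the sketch below shows roughly what transforms for a new classification dataset could look like; the intensity window and image size are illustrative (e.g. for a CT dataset), not the repo's existing settings.

from monai.transforms import (
    Compose, EnsureChannelFirstd, LoadImaged, Resized, ScaleIntensityRanged,
)

# Validation transforms for a hypothetical CT classification dataset; training
# transforms would typically add light augmentation such as random flips.
val_transforms = Compose([
    LoadImaged(keys="image"),
    EnsureChannelFirstd(keys="image"),
    ScaleIntensityRanged(keys="image", a_min=-1000, a_max=1000, b_min=0.0, b_max=1.0, clip=True),
    Resized(keys="image", spatial_size=(112, 112, 112)),
])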
We provide example code to generate unsupervised visualizations of an input image. First, run the following command using the pretrained model to generate 3D representations of the image (vis-type can be mhsa or pca), then use the provided notebook to visualize the outputs:
PYTHONPATH=. python dinov2/eval/vis_pca.py \
--config-file 'dinov2/configs/train/vit3d_highres.yaml' \
--output-dir 'path/to/output_vis_dir' \
--pretrained-weights 'path/to/eval/training_12499/teacher_checkpoint.pth' \
--image-path 'path/to/image.nii.gz' \
--vis-type 'mhsa' \
--input-type 'full_image'
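For intuition, the PCA visualization amounts to projecting per-patch ViT features onto their top three principal components and mapping them to RGB. The sketch below illustrates the idea with dummy features (assuming numpy and scikit-learn are available); it is not the repo's exact implementation.

import numpy as np
from sklearn.decomposition import PCA

# Placeholder for the per-patch features produced by the model: (num_patches, embed_dim),
# where num_patches = (D/p) * (H/p) * (W/p) for patch size p.
tokens = np.random.default_rng(0).random((7 * 7 * 7, 1024), dtype=np.float32)

pca = PCA(n_components=3)
components = pca.fit_transform(tokens)  # (num_patches, 3)
components = (components - components.min(axis=0)) / (np.ptp(components, axis=0) + 1e-8)
rgb_volume = components.reshape(7, 7, 7, 3)  # low-resolution RGB volume to upsample and overlay
print(rgb_volume.shape)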
3DINO code and 3DINO-ViT weights are released under the CC BY-NC-ND 4.0 license.
✅ You MAY:
- Use this framework for academic, research, and educational purposes.
- Share or redistribute the original, unmodified version of this framework with proper attribution as detailed below.
❌ You MAY NOT:
- Use this framework for commercial purposes (as defined below).
- Modify, adapt, or create derivative works based on this framework.
- Distribute a modified version of this framework.
For full license details, refer to the official CC BY-NC-ND 4.0 License.
By Commercial Purposes, we mean that this framework may not be used:
- By for-profit entities for internal research, product development, or services.
- In industry-funded or corporate-sponsored research.
- As part of commercially funded academic projects without prior approval.
- In any project where the results will be used for monetary gain (e.g., patent filings, proprietary software development, licensing to industry).
If you are unsure whether your use qualifies as non-commercial, contact [email protected].
This software is provided "as is" without warranty of any kind. Sunnybrook Research Institute makes no representations or guarantees regarding its accuracy, reliability, performance, or suitability for any particular purpose. Users assume full responsibility for its use and application.
For inquiries regarding permissions, exceptions, or licensing, contact [email protected].
This repo builds upon the excellent work from the original DINOv2 and ViT-Adapter for 2D natural images.
If you find this repository useful or use 3DINO-ViT in your research, please consider giving a star and citing the following paper:
@misc{xu2025generalizable3dframeworkmodel,
title={A generalizable 3D framework and model for self-supervised learning in medical imaging},
author={Tony Xu and Sepehr Hosseini and Chris Anderson and Anthony Rinaldi and Rahul G. Krishnan and Anne L. Martel and Maged Goubran},
year={2025},
eprint={2501.11755},
archivePrefix={arXiv},
primaryClass={eess.IV},
url={https://arxiv.org/abs/2501.11755},
}