Tony Xu, Sepehr Hosseini, Chris Anderson, Anthony Rinaldi, Rahul G. Krishnan, Anne Martel, Maged Goubran
Paper: https://arxiv.org/abs/2501.11755
This codebase contains PyTorch code and our pretrained 3DINO-ViT model for 3DINO, a self-supervised framework for training networks on unlabeled 3D medical images, developed at Sunnybrook Research Institute.
Abstract: Current self-supervised learning methods for 3D medical imaging rely on simple pretext formulations and organ- or modality-specific datasets, limiting their generalizability and scalability. We present 3DINO, a cutting-edge SSL method adapted to 3D datasets, and use it to pretrain 3DINO-ViT: a general-purpose medical imaging model, on an exceptionally large, multimodal, and multi-organ dataset of ~100,000 3D medical imaging scans from over 10 organs. We validate 3DINO-ViT using extensive experiments on numerous medical imaging segmentation and classification tasks. Our results demonstrate that 3DINO-ViT generalizes across modalities and organs, including out-of-distribution tasks and datasets, outperforming state-of-the-art methods on the majority of evaluation metrics and labeled dataset sizes. Our 3DINO framework and 3DINO-ViT will be made available to enable research on 3D foundation models or further finetuning for a wide range of medical imaging applications.
3DINO code runs on Python 3.9. Clone the codebase, then use the provided requirements.txt
file and pip to install the necessary libraries for this repo:
pip install -r requirements.txt
The 3DINO-ViT model will be released upon acceptance of the paper!
A barebones example for loading the pretrained network and applying it to an image to extract a feature vector representation is provided in this notebook.
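For orientation, the following is a minimal sketch of what feature extraction looks like. It uses a generic MONAI ViT purely as a stand-in for 3DINO-ViT, and the preprocessing values are illustrative rather than the exact pretraining pipeline; the notebook contains the supported loading code.

import torch
from monai.networks.nets import ViT  # stand-in architecture; use the notebook's loader for 3DINO-ViT
from monai.transforms import Compose, EnsureChannelFirst, LoadImage, Resize, ScaleIntensity

# Illustrative preprocessing; match the repo's transforms for real experiments.
preprocess = Compose([
    LoadImage(image_only=True),
    EnsureChannelFirst(),
    ScaleIntensity(),
    Resize((112, 112, 112)),
])

volume = preprocess("path/to/image.nii.gz").unsqueeze(0).float()  # (1, 1, 112, 112, 112)

model = ViT(in_channels=1, img_size=(112, 112, 112), patch_size=(16, 16, 16), classification=False)
# For the pretrained model, load the released teacher weights instead, e.g.:
# state_dict = torch.load("path/to/teacher_checkpoint.pth", map_location="cpu")
# model.load_state_dict(state_dict, strict=False)
model.eval()

with torch.no_grad():
    tokens, _ = model(volume)     # (1, num_patches, hidden_dim)
    feature = tokens.mean(dim=1)  # simple global feature vector: (1, hidden_dim)
print(feature.shape)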
We provide code to pretrain 3DINO on general 3D medical imaging datasets. We use MONAI for data loading, so any format that can be loaded by the LoadImage transform can be used. Datasets that we pretrained on can be found in the paper.
Datasets should be formatted as a list of dictionaries with the following format.
The image key should point to the path of the image file, the shape key should contain the shape of the image (e.g. the value of loaded_img.shape after loading), and the spacing key should contain the voxel spacing of the image in arbitrary units (but consistent within each image). The shape and spacing keys are needed for 3D random resized cropping.
[
{
"image": "path/to/image1.nii.gz",
"shape": [128, 128, 64],
"spacing": [0.5, 0.5, 1.0],
},
{
"image": "path/to/image2.nii.gz",
"shape": [256, 256, 128],
"spacing": [0.7, 0.7, 1.0],
},
...
]
Save this as a JSON file, and adjust dataset_path in the config file to point to it.
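A small helper script along the following lines can build this JSON automatically; the directory layout, file extension, and output filename below are assumptions, so adapt them to your data.

import json
from pathlib import Path
from monai.transforms import LoadImage

loader = LoadImage(image_only=True)  # handles any format supported by MONAI's readers

datalist = []
for path in sorted(Path("path/to/pretraining_images").glob("*.nii.gz")):
    img = loader(str(path))
    # MetaTensor exposes voxel spacing via .pixdim; fall back to unit spacing if unavailable
    spacing = img.pixdim.tolist() if hasattr(img, "pixdim") else [1.0, 1.0, 1.0]
    datalist.append({
        "image": str(path),
        "shape": list(img.shape),
        "spacing": spacing,
    })

with open("pretraining_datalist.json", "w") as f:
    json.dump(datalist, f, indent=2)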
Standard pretraining can be run using the following command for a single node with 4 A100-80GB GPUs:
PYTHONPATH=. python -m torch.distributed.launch --nproc_per_node 4 --master_port 29501 dinov2/train/train3d.py \
--config-file 'dinov2/configs/ssl3d_default_config.yaml' \
--output-dir 'path/to/output_dir' \
--cache-dir 'path/to/cache_dir'
The cache-dir argument is used for MONAI CacheNTransDataset caching. This dataset saves images to disk after the first few preprocessing transforms (potentially on a faster temporary storage system if training on a SLURM cluster). We found this to greatly speed up loading. Remove this argument if you do not want to use caching.
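For example, on a SLURM cluster the cache might be pointed at node-local scratch; the environment variable is cluster-dependent (SLURM_TMPDIR is common but not universal):

--cache-dir "${SLURM_TMPDIR}/3dino_cache"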
Pretraining for 125000 iterations with a ViT-Large took approximately 10 days on 4 A100-80GB GPUs.
The training code saves the teacher weights in the eval folder every 12500 iterations.
To perform high resolution adaptation on the pretrained network, use the following command.
Adjust full_pretrained_weights in the config file to point to the teacher weights saved during standard pretraining.
PYTHONPATH=. python -m torch.distributed.launch --nproc_per_node 4 --master_port 29501 dinov2/train/train3d.py \
--config-file 'dinov2/configs/train/vit3d_highres.yaml' \
--output-dir 'path/to/highres_output_dir' \
--cache-dir 'path/to/cache_dir'
High-resolution adaptation for 12500 iterations with a ViT-Large took approximately 1 day on 4 A100-80GB GPUs.
The training code regularly saves the teacher weights. To evaluate the model, first format your finetuning dataset as described below, then run the corresponding evaluation command on a single node.
Datasets should be formatted as a dict with training, validation, and test keys, each containing a list of dictionaries, as in the following format:
{
"training": [
{
"image": "path/to/train/image1.nii.gz",
"label": "path/to/train/label1.nii.gz", # or int for classification
},
...
],
"validation": [
{
"image": "path/to/val/image1.nii.gz",
"label": "path/to/val/label1.nii.gz",
},
...
],
"test": [
{
"image": "path/to/test/image1.nii.gz",
"label": "path/to/test/label1.nii.gz",
},
...
]
}
Save this as a JSON file named <dataset_name>_100_datalist.json, and adjust the base-data-dir argument below to point to the directory where it is saved.
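A script along these lines can generate the split file; the paired images/labels folder layout and dataset name below are hypothetical, so adjust them to how your data is stored.

import json
from pathlib import Path

base = Path("path/to/MyDataset")  # hypothetical layout: <split>/images and <split>/labels subfolders
splits = {}
for split in ("training", "validation", "test"):
    pairs = []
    for img in sorted((base / split / "images").glob("*.nii.gz")):
        pairs.append({
            "image": str(img),
            "label": str(base / split / "labels" / img.name),  # or an int class index for classification
        })
    splits[split] = pairs

# Filename convention expected by the evaluation scripts: <dataset_name>_100_datalist.json
with open("path/to/finetuning/jsonfile/base_dir/MyDataset_100_datalist.json", "w") as f:
    json.dump(splits, f, indent=2)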
To finetune the model for segmentation, use the following command.
There are UNETR, ViTAdapterUNETR, and Linear segmentation heads available. ViTAdapterUNETR uses our 3D adaptation of the original ViT-Adapter module, which injects spatial information into pretrained ViTs. This module was also used in the original DINOv2 when performing segmentation tasks.
PYTHONPATH=. python dinov2/eval/segmentation3d.py \
--config-file 'dinov2/configs/train/vit3d_highres.yaml' \
--output-dir 'path/to/output_dir' \
--pretrained-weights 'path/to/eval/training_12499/teacher_checkpoint.pth' \
--dataset-name 'BraTS' \
--dataset-percent 100 \
--base-data-dir 'path/to/finetuning/jsonfile/base_dir' \
--segmentation-head 'ViTAdapterUNETR' \
--epochs 100 \
--epoch-length 300 \
--eval-iters 600 \
--warmup-iters 3000 \
--image-size 112 \
--batch-size 2 \
--num-workers 20 \
--learning-rate 1e-4 \
--cache-dir 'path/to/cache_dir' \
--resize-scale 1.0
Adding new segmentation datasets for finetuning requires the following steps (a rough transform/loss sketch follows the list):
- Create a new dataset name and save the JSON file as described above.
- Create training and validation transforms for the dataset in dinov2/eval/segmentation_3d/augmentations.py.
- Create new evaluation metrics for the dataset in dinov2/eval/segmentation_3d/metrics.py.
- Create a loss function to train the network in dinov2/eval/segmentation3d.py (line 214).
- Create a new segmentation dataset in dinov2/data/loaders.py.
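As a starting point, the sketch below shows what the transforms and loss for a new multi-class segmentation dataset might look like. It uses standard MONAI components with illustrative keys, spacing, and crop sizes; it is not the repo's existing code, so mirror the style of the datasets already defined in augmentations.py and segmentation3d.py.

from monai.losses import DiceCELoss
from monai.transforms import (
    Compose, CropForegroundd, EnsureChannelFirstd, LoadImaged,
    NormalizeIntensityd, RandCropByPosNegLabeld, RandFlipd, Spacingd,
)

# Training transforms for a hypothetical new dataset (validation transforms would
# drop the random crops/flips).
train_transforms = Compose([
    LoadImaged(keys=["image", "label"]),
    EnsureChannelFirstd(keys=["image", "label"]),
    Spacingd(keys=["image", "label"], pixdim=(1.0, 1.0, 1.0), mode=("bilinear", "nearest")),
    NormalizeIntensityd(keys="image", nonzero=True, channel_wise=True),
    CropForegroundd(keys=["image", "label"], source_key="image"),
    RandCropByPosNegLabeld(keys=["image", "label"], label_key="label",
                           spatial_size=(112, 112, 112), pos=1, neg=1, num_samples=2),
    RandFlipd(keys=["image", "label"], prob=0.5, spatial_axis=0),
])

# Example loss for multi-class segmentation with integer labels
# (cf. dinov2/eval/segmentation3d.py, line 214).
loss_fn = DiceCELoss(to_onehot_y=True, softmax=True)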
To finetune the model for classification, use the following command.
PYTHONPATH=. python dinov2/eval/linear3d.py \
--config-file 'dinov2/configs/train/vit3d_highres.yaml' \
--output-dir 'path/to/output_dir' \
--pretrained-weights 'path/to/eval/training_12499/teacher_checkpoint.pth' \
--dataset-name 'COVID-CT-MD' \
--dataset-percent 100 \
--base-data-dir 'path/to/finetuning/jsonfile/base_dir' \
--epochs 100 \
--epoch-length 125 \
--save-checkpoint-frequency 50 \
--eval-period-iterations 50 \
--image-size 112 \
--batch-size 32 \
--num-workers 10 \
--dataset-seed 0 \
--cache-dir 'path/to/cache_dir'
Adding new classification datasets for finetuning requires the following steps (a rough transform sketch follows the list):
- Create a new dataset name and save the JSON file as described above.
- Create training and validation transforms for the dataset in dinov2/data/transforms.py.
- Create a new classification dataset in dinov2/data/loaders.py.
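As with segmentation, the sketch below shows roughly what transforms for a new classification dataset could look like; the intensity window and image size are illustrative (e.g. for a CT dataset), not the repo's existing settings.

from monai.transforms import (
    Compose, EnsureChannelFirstd, LoadImaged, Resized, ScaleIntensityRanged,
)

# Validation transforms for a hypothetical CT classification dataset; training
# transforms would typically add light augmentation such as random flips.
val_transforms = Compose([
    LoadImaged(keys="image"),
    EnsureChannelFirstd(keys="image"),
    ScaleIntensityRanged(keys="image", a_min=-1000, a_max=1000, b_min=0.0, b_max=1.0, clip=True),
    Resized(keys="image", spatial_size=(112, 112, 112)),
])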
We provide example code to generate unsupervised visualizations of an input image. First, run the following command using the pretrained model to generate 3D representations of the image (vis-type can be mhsa or pca), then use the provided notebook to visualize the outputs:
PYTHONPATH=. python dinov2/eval/vis_pca.py \
--config-file 'dinov2/configs/train/vit3d_highres.yaml' \
--output-dir 'path/to/output_vis_dir' \
--pretrained-weights 'path/to/eval/training_12499/teacher_checkpoint.pth' \
--image-path 'path/to/image.nii.gz' \
--vis-type 'mhsa' \
--input-type 'full_image'
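For intuition, the PCA visualization amounts to projecting per-patch ViT features onto their top three principal components and mapping them to RGB. The sketch below illustrates the idea with dummy features (assuming numpy and scikit-learn are available); it is not the repo's exact implementation.

import numpy as np
from sklearn.decomposition import PCA

# Placeholder for the per-patch features produced by the model: (num_patches, embed_dim),
# where num_patches = (D/p) * (H/p) * (W/p) for patch size p.
tokens = np.random.default_rng(0).random((7 * 7 * 7, 1024), dtype=np.float32)

pca = PCA(n_components=3)
components = pca.fit_transform(tokens)  # (num_patches, 3)
components = (components - components.min(axis=0)) / (np.ptp(components, axis=0) + 1e-8)
rgb_volume = components.reshape(7, 7, 7, 3)  # low-resolution RGB volume to upsample and overlay
print(rgb_volume.shape)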
3DINO code and 3DINO-ViT weights are released under the CC BY-NC-ND 4.0 license.
✅ You MAY:
- Use this framework for academic, research, and educational purposes.
- Share or redistribute the original, unmodified version of this framework with proper attribution as detailed below.
❌ You MAY NOT:
- Use this framework for commercial purposes (as defined below).
- Modify, adapt, or create derivative works based on this framework.
- Distribute a modified version of this framework.
For full license details, refer to the official CC BY-NC-ND 4.0 License.
By Commercial Purposes, we mean that this framework may not be used:
- By for-profit entities for internal research, product development, or services.
- In industry-funded or corporate-sponsored research.
- As part of commercially funded academic projects without prior approval.
- In any project where the results will be used for monetary gain (e.g., patent filings, proprietary software development, licensing to industry).
If you are unsure whether your use qualifies as non-commercial, contact [email protected].
This software is provided "as is" without warranty of any kind. Sunnybrook Research Institute makes no representations or guarantees regarding its accuracy, reliability, performance, or suitability for any particular purpose. Users assume full responsibility for its use and application.
For inquiries regarding permissions, exceptions, or licensing, contact [email protected].
This repo builds upon the excellent work from the original DINOv2 and ViT-Adapter for 2D natural images.
If you find this repository useful or use 3DINO-ViT in your research, please consider giving a star and citing the following paper:
@misc{xu2025generalizable3dframeworkmodel,
title={A generalizable 3D framework and model for self-supervised learning in medical imaging},
author={Tony Xu and Sepehr Hosseini and Chris Anderson and Anthony Rinaldi and Rahul G. Krishnan and Anne L. Martel and Maged Goubran},
year={2025},
eprint={2501.11755},
archivePrefix={arXiv},
primaryClass={eess.IV},
url={https://arxiv.org/abs/2501.11755},
}