Introduction

RAD-DINO is a vision transformer that encodes chest X-rays, trained with the self-supervised learning method DINOv2. It was developed by Microsoft Health Futures and is described in the paper "RAD-DINO: Exploring Scalable Medical Image Encoders Beyond Text Supervision."

Architecture

RAD-DINO is fine-tuned from the dinov2-base checkpoint. It serves as a vision backbone that can be integrated into other models for downstream tasks such as image classification, segmentation, clustering, image retrieval, and report generation.
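
Beyond the pooled CLS embedding, the backbone's patch tokens can serve dense tasks such as segmentation. Below is a minimal sketch, assuming the Hugging Face transformers API used later in this guide and a square patch grid; the blank placeholder image is illustrative only:

      from PIL import Image
      import torch
      from transformers import AutoImageProcessor, AutoModel

      repo = "microsoft/rad-dino"
      model = AutoModel.from_pretrained(repo)
      processor = AutoImageProcessor.from_pretrained(repo)

      # Blank placeholder; substitute a real chest X-ray.
      image = Image.new("RGB", (512, 512))

      inputs = processor(images=image, return_tensors="pt")
      with torch.inference_mode():
          outputs = model(**inputs)

      # Token 0 is the CLS summary; the rest are patch tokens.
      patch_tokens = outputs.last_hidden_state[:, 1:]  # (batch, n_patches, hidden)
      batch, n_patches, hidden = patch_tokens.shape
      side = int(n_patches ** 0.5)                     # assumes a square patch grid
      feature_map = patch_tokens.reshape(batch, side, side, hidden).permute(0, 3, 1, 2)
      # feature_map: (batch, hidden, side, side), usable as a dense backbone output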

Training

RAD-DINO was trained on 882,775 images from five public, deidentified chest X-ray datasets. Training ran on NVIDIA A100 GPUs on Azure Machine Learning with a batch size of 40 images per GPU. The released checkpoint differs from the one described in the paper: it was trained on public data only and was selected at 35,000 training iterations.

Guide: Running Locally

  1. Install Dependencies

    • Ensure you have Python and pip installed.
    • Install the necessary libraries:
      pip install torch transformers einops requests pillow
  2. Download and Preprocess Image

    • Download a sample chest X-ray image with a small Python helper, as sketched below.
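      A minimal sketch of such a downloader, assuming requests and Pillow are installed; the URL is a hypothetical placeholder:

      import requests
      from PIL import Image

      def download_sample_image(url: str) -> Image.Image:
          """Fetch an image over HTTP and return it as a PIL image."""
          # Some hosts reject requests that lack a User-Agent header.
          headers = {"User-Agent": "rad-dino-demo"}
          response = requests.get(url, headers=headers, stream=True)
          response.raise_for_status()
          return Image.open(response.raw)

      # Hypothetical URL: substitute a chest X-ray you are licensed to use.
      image = download_sample_image("https://example.com/chest-xray.jpg")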
  3. Model Setup

    • Load the RAD-DINO model and image processor from the Hugging Face Model Hub:
      from transformers import AutoModel, AutoImageProcessor
      repo = "microsoft/rad-dino"
      model = AutoModel.from_pretrained(repo)
      processor = AutoImageProcessor.from_pretrained(repo)
      
  4. Image Encoding

    • Preprocess the image and run it through the model to obtain the CLS embedding, as in the sketch below.
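      A minimal sketch, continuing from the setup above (model, processor, and image already defined); for DINOv2-style models the pooled output corresponds to the CLS token:

      import torch

      inputs = processor(images=image, return_tensors="pt")
      with torch.inference_mode():
          outputs = model(**inputs)

      # CLS embedding summarizing the whole image.
      cls_embedding = outputs.pooler_output  # shape: (batch_size, hidden_size)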
  5. Consider Cloud GPUs

    • For faster inference, consider cloud GPUs such as those offered by Azure or AWS; the sketch below shows how to move the model and inputs onto a GPU when one is available.
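      A minimal sketch, continuing from the encoding step; a GPU is used only if the model and inputs are moved to it:

      import torch

      device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
      model = model.to(device)
      inputs = {k: v.to(device) for k, v in inputs.items()}
      with torch.inference_mode():
          outputs = model(**inputs)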

License

RAD-DINO is distributed under the MSRLA (Microsoft Research License Agreement). For more details, see the license file.
