DFN5B-CLIP-ViT-H-14-378


Introduction

The DFN5B-CLIP-ViT-H-14-378 model is a CLIP (Contrastive Language-Image Pre-training) model trained on DFN-5B. DFN-5B was built with Data Filtering Networks (DFNs), small networks trained to select high-quality samples from large uncurated collections; the resulting dataset contains 5 billion images filtered from a pool of 43 billion uncurated image-text pairs.

Architecture

  • Model Type: Contrastive Image-Text, Zero-Shot Image Classification.
  • Dataset: DFN-5B.
  • Samples Seen: 39 billion samples at 224x224 resolution plus 5 billion samples at 384x384 resolution.
  • Conversion: The model was converted from JAX checkpoints to PyTorch, making it compatible with OpenCLIP.

Training

The model was trained on the 5 billion image-text pairs filtered from the 43 billion uncurated pairs described above. The released checkpoint was converted to PyTorch and can be used with OpenCLIP for image and text encoding. The model is intended primarily for zero-shot image classification and has been evaluated across a broad range of zero-shot benchmarks.

Guide: Running Locally

Basic Steps

  1. Install Dependencies:

    • Ensure Python is installed.
    • Install PyTorch, OpenCLIP, and Pillow:
      pip install torch open_clip_torch pillow
      
  2. Load and Preprocess the Image:

    • Use an image URL to load and preprocess the image with PIL and OpenCLIP's preprocessing functions.
  3. Tokenize Text Labels:

    • Use OpenCLIP's tokenizer to prepare text labels for comparison against the image.
  4. Encode and Compare:

    • Use the model to encode both the image and text features.
    • Normalize the features and compute the probability of each text label matching the image (a worked example covering steps 2-4 follows this list).
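
Below is a minimal sketch of steps 2-4 using OpenCLIP's create_model_from_pretrained and get_tokenizer helpers, assuming the checkpoint is available on the Hugging Face Hub as apple/DFN5B-CLIP-ViT-H-14-378; the image path and label list are placeholders for illustration.

  import torch
  import torch.nn.functional as F
  from PIL import Image
  from open_clip import create_model_from_pretrained, get_tokenizer

  # Load the model and its preprocessing transform (Hub id assumed).
  model, preprocess = create_model_from_pretrained('hf-hub:apple/DFN5B-CLIP-ViT-H-14-378')
  tokenizer = get_tokenizer('ViT-H-14')  # tokenizer for the ViT-H-14 architecture
  model.eval()

  # Step 2: load and preprocess the image ("image.jpg" is a placeholder path).
  image = preprocess(Image.open('image.jpg').convert('RGB')).unsqueeze(0)

  # Step 3: tokenize candidate text labels (example labels only).
  labels = ["a photo of a dog", "a photo of a cat", "a photo of a donut"]
  text = tokenizer(labels)

  # Step 4: encode both modalities, normalize, and compute matching probabilities.
  with torch.no_grad():
      image_features = F.normalize(model.encode_image(image), dim=-1)
      text_features = F.normalize(model.encode_text(text), dim=-1)
      probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

  for label, p in zip(labels, probs[0].tolist()):
      print(f"{label}: {p:.3f}")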

Suggested Cloud GPUs

Use cloud services such as AWS, Google Cloud Platform, or Azure to access high-performance GPUs for faster inference, especially when processing large batches of images with a model of this size. A short device-placement sketch follows.
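
Continuing from the example above, a minimal sketch of moving the model and inputs onto a GPU when one is available:

  # Place the model and input tensors on a CUDA device if present,
  # falling back to CPU otherwise.
  device = "cuda" if torch.cuda.is_available() else "cpu"
  model = model.to(device)
  image = image.to(device)
  text = text.to(device)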

License

The DFN5B-CLIP-ViT-H-14-378 model is licensed under the Apple Sample Code License. Review the license text for the specific terms governing usage, distribution, and modification.
