DFN2B-CLIP-ViT-L-14-39B


Introduction

The DFN2B-CLIP-ViT-L-14-39B is a CLIP (Contrastive Language-Image Pre-training) model developed by Apple. It uses Data Filtering Networks (DFNs) to select high-quality training pairs from uncurated web data. The model is designed for contrastive image-text zero-shot image classification and was trained on DFN-2B, a set of 2 billion image-text pairs selected from a pool of 12.8 billion.
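As a schematic illustration of how contrastive zero-shot classification works, the sketch below scores one image embedding against a set of text embeddings (one per candidate class prompt) by cosine similarity. The tensors are placeholders standing in for encoder outputs; a complete loading and inference example appears in the guide further down.

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings standing in for model.encode_image / model.encode_text outputs.
# CLIP ViT-L/14 models typically use a 768-dimensional joint embedding space.
image_embedding = torch.randn(1, 768)   # one image
text_embeddings = torch.randn(3, 768)   # one row per candidate class prompt

# Normalize so the dot product becomes a cosine similarity.
image_embedding = F.normalize(image_embedding, dim=-1)
text_embeddings = F.normalize(text_embeddings, dim=-1)

# Zero-shot prediction: the class whose prompt embedding is most similar to the image.
similarities = image_embedding @ text_embeddings.T        # shape (1, 3)
probs = (100.0 * similarities).softmax(dim=-1)
predicted_class = probs.argmax(dim=-1)
```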

Architecture

The model follows the standard CLIP architecture, pairing a ViT-L/14 image encoder with a text encoder trained contrastively on image-text pairs. Originally developed in JAX with AXLearn, it has been converted to PyTorch for compatibility with the OpenCLIP framework. The training dataset, DFN-2B, is filtered with DFNs to improve data quality.
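Because the converted weights target OpenCLIP, they can be loaded through OpenCLIP's Hugging Face Hub integration. The snippet below is a minimal sketch; the `hf-hub:apple/DFN2B-CLIP-ViT-L-14-39B` identifier is assumed from the model name and should be verified against the published repository.

```python
from open_clip import create_model_from_pretrained, get_tokenizer

# Hub identifier assumed from the model name; verify it against the published repository.
model, preprocess = create_model_from_pretrained("hf-hub:apple/DFN2B-CLIP-ViT-L-14-39B")
tokenizer = get_tokenizer("ViT-L-14")  # standard OpenCLIP tokenizer for ViT-L/14 models
model.eval()
```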

Training

The DFN2B-CLIP-ViT-L-14-39B model was trained on 2 billion image-text pairs filtered from the 12.8 billion pairs of the CommonPool-12.8B dataset. The pairs were selected using the DFN methodology described in the research paper "Data Filtering Networks" (arXiv:2309.17425).
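The sketch below is a toy illustration of the filtering idea, not Apple's implementation: a separately trained filtering network scores each image-text pair by how well the caption matches the image, and only the best-aligned pairs are kept for training. The keep fraction here roughly mirrors the 2B-of-12.8B ratio described above.

```python
import torch
import torch.nn.functional as F

def filter_pairs(image_embeds: torch.Tensor,
                 text_embeds: torch.Tensor,
                 keep_fraction: float = 0.15) -> torch.Tensor:
    """Toy data-filtering step: score each image-text pair with a filtering
    network's embeddings and keep only the best-aligned pairs.
    Returns the indices of the retained pairs."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # Per-pair alignment score: cosine similarity between an image and its caption.
    scores = (image_embeds * text_embeds).sum(dim=-1)
    k = max(1, int(keep_fraction * scores.numel()))
    return scores.topk(k).indices

# Example with random embeddings standing in for the filtering network's outputs.
kept = filter_pairs(torch.randn(1000, 768), torch.randn(1000, 768))
```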

Guide: Running Locally

  1. Clone the repository and navigate to the model directory.
  2. Install the required dependencies, including PyTorch and OpenCLIP (the open_clip_torch package).
  3. Load the model through OpenCLIP's PyTorch API.
  4. Test the model on a few sample image-text pairs to confirm the setup works, as in the sketch after this list.
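The end-to-end sketch below follows these steps under the same assumptions as the earlier snippets: the hub identifier is inferred from the model name, and the image path is a placeholder.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

# Hub identifier assumed from the model name; the image path is a placeholder.
model, preprocess = create_model_from_pretrained("hf-hub:apple/DFN2B-CLIP-ViT-L-14-39B")
tokenizer = get_tokenizer("ViT-L-14")
model.eval()

labels = ["a diagram", "a dog", "a cat"]
image = preprocess(Image.open("example.jpg")).unsqueeze(0)
text = tokenizer(labels)

with torch.no_grad():
    image_features = F.normalize(model.encode_image(image), dim=-1)
    text_features = F.normalize(model.encode_text(text), dim=-1)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, p in zip(labels, probs.squeeze(0).tolist()):
    print(f"{label}: {p:.3f}")
```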

For enhanced performance, consider using cloud GPU services such as AWS EC2, Google Cloud Platform, or Azure.

License

The model is distributed under the Apple Sample Code License. Please refer to the LICENSE file for detailed terms and conditions.
