DFN5B-CLIP-ViT-H-14-378
Introduction
The DFN5B-CLIP-ViT-H-14-378 model is a CLIP (Contrastive Language-Image Pre-training) model trained on DFN-5B. Data Filtering Networks (DFNs) are small networks used to score and filter large, uncurated datasets; for this model, a DFN selected 5 billion image-text pairs from a pool of 43 billion uncurated pairs.
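As a rough illustration of the filtering idea (not Apple's released code), a small scoring network embeds each candidate image-text pair and only the highest-scoring pairs are kept for training the large CLIP model. All names and the keep fraction below are illustrative:

```python
import torch
import torch.nn.functional as F

def filter_pairs(image_features, text_features, keep_fraction=0.12):
    """Keep the top-scoring fraction of image-text pairs.

    image_features, text_features: (N, D) embeddings produced by a small
    pretrained scoring model (the "filtering network").
    """
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    # Cosine similarity between each image and its paired caption.
    scores = (image_features * text_features).sum(dim=-1)
    k = max(1, int(keep_fraction * scores.numel()))
    return scores.topk(k).indices  # indices of pairs retained for training

# Toy example: filtering 43B pairs down to ~5B corresponds to a keep
# fraction of roughly 5/43; here we only demonstrate the mechanics.
img = torch.randn(1000, 512)
txt = torch.randn(1000, 512)
kept = filter_pairs(img, txt, keep_fraction=5 / 43)
print(f"kept {kept.numel()} of 1000 pairs")
```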
Architecture
- Model Type: Contrastive Image-Text, Zero-Shot Image Classification.
- Dataset: DFN-5B.
- Samples Seen: 39 billion at 224x224 resolution plus 5 billion at 384x384 resolution.
- Conversion: The model was converted from JAX checkpoints to PyTorch, making it compatible with OpenCLIP.
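Because the checkpoint is OpenCLIP-compatible, it can be loaded directly from the Hugging Face Hub. The hub identifier below is inferred from the model name, so treat it as an assumption; a minimal loading sketch:

```python
import open_clip

# Assumed Hub identifier based on the model name; adjust if the repo differs.
model, preprocess = open_clip.create_model_from_pretrained(
    "hf-hub:apple/DFN5B-CLIP-ViT-H-14-378"
)
tokenizer = open_clip.get_tokenizer("ViT-H-14")
model.eval()  # inference only
```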
Training
Training used the 5 billion image-text pairs selected by the filtering network from the 43 billion uncurated candidates. The released checkpoint, converted from JAX to PyTorch, loads through OpenCLIP for image and text encoding, and it delivers strong zero-shot image classification accuracy across a wide range of evaluation datasets.
Guide: Running Locally
Basic Steps
- Install Dependencies:
  - Ensure Python is installed.
  - Install PyTorch and OpenCLIP: pip install torch open_clip_torch
- Load and Preprocess the Image:
  - Load the image (for example, from a URL) with PIL and apply OpenCLIP's preprocessing transform.
- Tokenize Text Labels:
  - Use OpenCLIP's tokenizer to prepare the candidate text labels that will be compared against the image.
- Encode and Compare:
  - Use the model to encode both the image and the text.
  - Normalize the features and compute the probability of each label matching the image (see the complete sketch after this list).
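Putting the steps together, the sketch below loads the model, preprocesses an image fetched from a URL, tokenizes a few candidate labels, and computes label probabilities. The hub identifier, image URL, and label set are placeholders rather than values from the original guide:

```python
import torch
import requests
from PIL import Image
import open_clip

# Assumed Hub identifier based on the model name.
model, preprocess = open_clip.create_model_from_pretrained(
    "hf-hub:apple/DFN5B-CLIP-ViT-H-14-378"
)
tokenizer = open_clip.get_tokenizer("ViT-H-14")
model.eval()

# 1. Load and preprocess the image (placeholder URL).
url = "https://example.com/cat.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
image = preprocess(image).unsqueeze(0)  # (1, 3, 378, 378)

# 2. Tokenize the candidate text labels.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text = tokenizer(labels)

# 3. Encode, normalize, and compare.
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Softmax over scaled cosine similarities gives label probabilities.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs.squeeze(0).tolist())))
```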
Suggested Cloud GPUs
ViT-H-14 is a large model, so cloud GPUs from providers such as AWS, Google Cloud Platform, or Azure can substantially speed up encoding, especially when processing large batches of images or long label lists.
License
The DFN5B-CLIP-ViT-H-14-378 model is released under the Apple Sample Code License. Review the license for its terms on usage, distribution, and modification before using the model.