DFN5B-CLIP-ViT-H-14-378


Introduction

The DFN5B-CLIP-ViT-H-14-378 model is a CLIP (Contrastive Language-Image Pre-training) model trained on DFN-5B. DFN-5B was built with Data Filtering Networks (DFNs), small networks trained to select high-quality samples from large uncurated collections; the resulting dataset contains 5 billion images filtered from a pool of 43 billion uncurated image-text pairs.

Architecture

  • Model Type: Contrastive Image-Text, Zero-Shot Image Classification.
  • Dataset: DFN-5B.
  • Samples Seen: 39 billion samples at 224x224 resolution plus 5 billion samples at 384x384 resolution.
  • Conversion: The model was converted from JAX checkpoints to PyTorch, making it compatible with OpenCLIP.

Training

The model was trained on the 5 billion image-text pairs filtered from the 43 billion uncurated pairs described above. The released checkpoint was converted to PyTorch and can be used with OpenCLIP for image and text encoding. The model is intended primarily for zero-shot image classification and has been evaluated across a broad range of zero-shot benchmarks.

Guide: Running Locally

Basic Steps

  1. Install Dependencies:

    • Ensure Python is installed.
    • Install PyTorch, OpenCLIP, and Pillow:
      pip install torch open_clip_torch pillow
      
  2. Load and Preprocess the Image:

    • Use an image URL to load and preprocess the image with PIL and OpenCLIP's preprocessing functions.
  3. Tokenize Text Labels:

    • Use OpenCLIP's tokenizer to prepare text labels for comparison against the image.
  4. Encode and Compare:

    • Use the model to encode both the image and text features.
    • Normalize the features and compute the probability of each text label matching the image (a worked example covering steps 2-4 follows this list).
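
Below is a minimal sketch of steps 2-4 using OpenCLIP's create_model_from_pretrained and get_tokenizer helpers, assuming the checkpoint is available on the Hugging Face Hub as apple/DFN5B-CLIP-ViT-H-14-378; the image path and label list are placeholders for illustration.

  import torch
  import torch.nn.functional as F
  from PIL import Image
  from open_clip import create_model_from_pretrained, get_tokenizer

  # Load the model and its preprocessing transform (Hub id assumed).
  model, preprocess = create_model_from_pretrained('hf-hub:apple/DFN5B-CLIP-ViT-H-14-378')
  tokenizer = get_tokenizer('ViT-H-14')  # tokenizer for the ViT-H-14 architecture
  model.eval()

  # Step 2: load and preprocess the image ("image.jpg" is a placeholder path).
  image = preprocess(Image.open('image.jpg').convert('RGB')).unsqueeze(0)

  # Step 3: tokenize candidate text labels (example labels only).
  labels = ["a photo of a dog", "a photo of a cat", "a photo of a donut"]
  text = tokenizer(labels)

  # Step 4: encode both modalities, normalize, and compute matching probabilities.
  with torch.no_grad():
      image_features = F.normalize(model.encode_image(image), dim=-1)
      text_features = F.normalize(model.encode_text(text), dim=-1)
      probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

  for label, p in zip(labels, probs[0].tolist()):
      print(f"{label}: {p:.3f}")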

Suggested Cloud GPUs

Use cloud services such as AWS, Google Cloud Platform, or Azure to access high-performance GPUs for faster inference, especially when processing large batches of images with a model of this size. A short device-placement sketch follows.
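
Continuing from the example above, a minimal sketch of moving the model and inputs onto a GPU when one is available:

  # Place the model and input tensors on a CUDA device if present,
  # falling back to CPU otherwise.
  device = "cuda" if torch.cuda.is_available() else "cpu"
  model = model.to(device)
  image = image.to(device)
  text = text.to(device)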

License

The DFN5B-CLIP-ViT-H-14-378 model is licensed under the Apple Sample Code License. Review the license text for the specific terms governing usage, distribution, and modification.
