CLIP-ViT-B-32-laion2B-s34B-b79K
Introduction
The CLIP ViT-B/32 model was trained on the LAION-2B English subset of the LAION-5B dataset using the OpenCLIP framework. Developed by Romain Beaumont on the stability.ai cluster, it is intended for zero-shot image classification and related tasks.
Architecture
The model uses a Vision Transformer (ViT) architecture with a B/32 configuration. It is part of the OpenCLIP initiative, which extends the original CLIP model by OpenAI. The architecture supports tasks such as zero-shot image classification, image and text retrieval, and more.
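As an illustration, the sketch below instantiates the ViT-B/32 architecture with these weights through OpenCLIP and compares one image embedding with one text embedding, the basis for retrieval and zero-shot tasks. The pretrained tag laion2b_s34b_b79k is an assumption derived from the model name, and photo.jpg plus the caption are placeholders.

```python
import torch
import open_clip
from PIL import Image

# Instantiate the ViT-B/32 architecture and load the LAION-2B weights via OpenCLIP.
# The pretrained tag "laion2b_s34b_b79k" is an assumption derived from the model name.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Encode one image and one caption into the shared embedding space
# ("photo.jpg" and the caption are placeholders).
image = preprocess(Image.open("photo.jpg")).unsqueeze(0)
text = tokenizer(["a photo of a golden retriever"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Cosine similarity between the normalized embeddings drives retrieval and zero-shot classification.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
print((image_features @ text_features.T).item())
```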
Training
Training Data
The model was trained on the English subset of the LAION-5B dataset, comprising 2 billion samples. The dataset is uncurated, and users should exercise caution due to the potential for encountering disturbing content.
Training Procedure
Training was conducted using the stability.ai cluster, with detailed logs available in the training notes and wandb logs.
Evaluation
Evaluation was performed using the LAION CLIP Benchmark suite. The model achieves 66.6% zero-shot top-1 accuracy on ImageNet-1k.
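For reference, the sketch below shows how a zero-shot top-1 number of this kind is typically computed: class names are turned into text prompts, images and prompts are embedded, and the closest prompt is taken as the prediction. This is not the benchmark suite itself; the dataset path, prompt template, and pretrained tag are placeholders, and dataset.classes must hold human-readable class names for the prompts to be meaningful.

```python
import torch
import open_clip
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

# Load the model; the pretrained tag is an assumption derived from the model name.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Placeholder dataset path; folder names are used as class labels here.
dataset = ImageFolder("/path/to/imagenet/val", transform=preprocess)
loader = DataLoader(dataset, batch_size=64, num_workers=4)

# Build the zero-shot classifier from one text prompt per class.
prompts = [f"a photo of a {name}" for name in dataset.classes]
with torch.no_grad():
    text_features = model.encode_text(tokenizer(prompts))
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

correct = total = 0
with torch.no_grad():
    for images, targets in loader:
        image_features = model.encode_image(images)
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        preds = (image_features @ text_features.T).argmax(dim=-1)
        correct += (preds == targets).sum().item()
        total += targets.numel()

print(f"zero-shot top-1 accuracy: {correct / total:.3f}")
```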
Guide: Running Locally
To run this model locally, follow these steps:
- Install Dependencies: Ensure you have Python and PyTorch installed, then install the necessary libraries:
  pip install transformers open_clip_torch
- Load the Model: Use Hugging Face's transformers library to load the model.
- Run Inference: Prepare your images and candidate labels, then run inference with the model (see the sketch after this list).
- Utilize Cloud GPUs: For efficient processing, consider cloud-based GPU services such as AWS, Google Cloud, or Azure.
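The following is a minimal inference sketch using the transformers API. The repository id laion/CLIP-ViT-B-32-laion2B-s34B-b79K is assumed from the model name, and image.jpg plus the candidate labels are placeholders for your own data.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Repo id assumed to be laion/CLIP-ViT-B-32-laion2B-s34B-b79K on the Hugging Face Hub.
model = CLIPModel.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K")
processor = CLIPProcessor.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K")
model.eval()

# Placeholder image and candidate labels for zero-shot classification.
image = Image.open("image.jpg")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```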
License
This model is licensed under the MIT License, allowing for open use and modification under the terms of the license.