CLIP-ViT-B-32-laion2B-s34B-b79K
Introduction
The CLIP ViT-B/32 model was trained on the LAION-2B English subset of the LAION-5B dataset using the OpenCLIP framework. Developed by Romain Beaumont on the stability.ai cluster, it is intended for zero-shot image classification and related tasks.
Architecture
The model uses a Vision Transformer (ViT) architecture with a B/32 configuration. It is part of the OpenCLIP initiative, which extends the original CLIP model by OpenAI. The architecture supports tasks such as zero-shot image classification, image and text retrieval, and more.
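As an illustration, the sketch below instantiates the ViT-B/32 architecture with these weights through OpenCLIP and compares one image embedding with one text embedding, the basis for retrieval and zero-shot tasks. The pretrained tag laion2b_s34b_b79k is an assumption derived from the model name, and photo.jpg plus the caption are placeholders.

```python
import torch
import open_clip
from PIL import Image

# Instantiate the ViT-B/32 architecture and load the LAION-2B weights via OpenCLIP.
# The pretrained tag "laion2b_s34b_b79k" is an assumption derived from the model name.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Encode one image and one caption into the shared embedding space
# ("photo.jpg" and the caption are placeholders).
image = preprocess(Image.open("photo.jpg")).unsqueeze(0)
text = tokenizer(["a photo of a golden retriever"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Cosine similarity between the normalized embeddings drives retrieval and zero-shot classification.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
print((image_features @ text_features.T).item())
```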
Training
Training Data
The model was trained on the English subset of the LAION-5B dataset, comprising 2 billion samples. The dataset is uncurated, and users should exercise caution due to the potential for encountering disturbing content.
Training Procedure
Training was conducted using the stability.ai cluster, with detailed logs available in the training notes and wandb logs.
Evaluation
Evaluation was performed using the LAION CLIP Benchmark suite. The model achieves 66.6% zero-shot top-1 accuracy on ImageNet-1k.
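For reference, the sketch below shows how a zero-shot top-1 number of this kind is typically computed: class names are turned into text prompts, images and prompts are embedded, and the closest prompt is taken as the prediction. This is not the benchmark suite itself; the dataset path, prompt template, and pretrained tag are placeholders, and dataset.classes must hold human-readable class names for the prompts to be meaningful.

```python
import torch
import open_clip
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

# Load the model; the pretrained tag is an assumption derived from the model name.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Placeholder dataset path; folder names are used as class labels here.
dataset = ImageFolder("/path/to/imagenet/val", transform=preprocess)
loader = DataLoader(dataset, batch_size=64, num_workers=4)

# Build the zero-shot classifier from one text prompt per class.
prompts = [f"a photo of a {name}" for name in dataset.classes]
with torch.no_grad():
    text_features = model.encode_text(tokenizer(prompts))
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

correct = total = 0
with torch.no_grad():
    for images, targets in loader:
        image_features = model.encode_image(images)
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        preds = (image_features @ text_features.T).argmax(dim=-1)
        correct += (preds == targets).sum().item()
        total += targets.numel()

print(f"zero-shot top-1 accuracy: {correct / total:.3f}")
```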
Guide: Running Locally
To run this model locally, follow these steps:
- Install Dependencies: Ensure you have Python and PyTorch installed, then install the necessary libraries:
  pip install transformers open_clip_torch
- Load the Model: Use Hugging Face's transformers library to load the model.
- Run Inference: Prepare your images and candidate labels, then run inference with the model (see the sketch after this list).
- Utilize Cloud GPUs: For efficient processing, consider cloud-based GPU services such as AWS, Google Cloud, or Azure.
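The following is a minimal inference sketch using the transformers API. The repository id laion/CLIP-ViT-B-32-laion2B-s34B-b79K is assumed from the model name, and image.jpg plus the candidate labels are placeholders for your own data.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Repo id assumed to be laion/CLIP-ViT-B-32-laion2B-s34B-b79K on the Hugging Face Hub.
model = CLIPModel.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K")
processor = CLIPProcessor.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K")
model.eval()

# Placeholder image and candidate labels for zero-shot classification.
image = Image.open("image.jpg")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```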
License
This model is licensed under the MIT License, allowing for open use and modification under the terms of the license.