CLIP-ViT-H-14-laion2B-s32B-b79K
laion / CLIP ViT-H/14 - LAION-2B
Introduction
The CLIP ViT-H/14 model is a zero-shot image classification model trained on the LAION-2B English subset using OpenCLIP. It is intended for research use, to help researchers understand and explore zero-shot, arbitrary image classification.
Architecture
The model utilizes the Vision Transformer (ViT) H/14 architecture and is part of the OpenCLIP initiative, which is an open-source project aimed at replicating and improving upon the original OpenAI CLIP model.
Training
- Training Data: The model was trained on the 2 billion sample English subset of the LAION-5B dataset, which is uncurated and crawled from public internet sources. Users are advised to use this dataset cautiously due to potentially disturbing content.
- Training Procedure: Model training was conducted using Stability AI's computational resources. Additional training notes and performance logs are available for in-depth insights.
- Evaluation: The model was evaluated on the VTAB+ benchmark suite for classification and on COCO and Flickr for retrieval, achieving 78.0% zero-shot top-1 accuracy on ImageNet-1k.
Guide: Running Locally
To run the model locally, you will need to set up an environment using Python and install dependencies such as Hugging Face Transformers and OpenCLIP.
Steps
- Environment Setup: Ensure you have Python installed. Set up a virtual environment.
- Install Dependencies: Use pip to install the necessary libraries:
pip install transformers open_clip_torch
- Load the Model: Use the Hugging Face Transformers library to load and use the model.
- Run Inference: Implement code to perform zero-shot image classification (see the sketch after these steps).
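A minimal inference sketch using OpenCLIP is shown below. It loads the checkpoint directly from the Hugging Face Hub; the image path example.jpg and the candidate labels are placeholders to replace with your own data.

import torch
import open_clip
from PIL import Image

# Load the model, preprocessing transforms, and tokenizer from the Hugging Face Hub
model, _, preprocess = open_clip.create_model_and_transforms(
    "hf-hub:laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
)
tokenizer = open_clip.get_tokenizer("hf-hub:laion/CLIP-ViT-H-14-laion2B-s32B-b79K")
model.eval()

# Placeholder image and candidate labels -- replace with your own
image = preprocess(Image.open("example.jpg")).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize embeddings so the dot product is a cosine similarity
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    # Scaled softmax over candidate labels gives zero-shot probabilities
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probabilities:", text_probs)

The label with the highest probability is the zero-shot prediction. If a GPU is available, moving the model and tensors to CUDA speeds up inference considerably for a model of this size.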
Suggested Cloud GPUs
Consider using cloud platforms such as AWS, Google Cloud, or Azure, which provide GPU instances for efficient model execution.
License
The model is licensed under the MIT License, which permits use, distribution, and modification, provided attribution is given to the original authors.