CLIP-ViT-g-14-laion2B-s12B-b42K
Introduction
CLIP ViT-g/14 is a CLIP model trained with the OpenCLIP framework on LAION-2B, the English subset of LAION-5B. It is intended for zero-shot image classification and text-image retrieval, among other applications, and was developed primarily for research into the capabilities of large-scale multi-modal models.
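To make "zero-shot image classification" concrete, the sketch below shows how a CLIP-style model typically scores an image against candidate captions: both embeddings are normalized, compared by cosine similarity, and the scaled similarities are passed through a softmax. The three-dimensional vectors are toy stand-ins for real encoder outputs, and the `logit_scale` of 100.0 is an assumed temperature, not a value taken from this model card.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Return one probability per candidate caption for a single image.

    logit_scale=100.0 is an assumed temperature, not a documented value.
    """
    img = l2_normalize(image_emb)
    sims = [sum(a * b for a, b in zip(img, l2_normalize(t))) for t in text_embs]
    return softmax([logit_scale * s for s in sims])


# Toy embeddings standing in for real image and text encoder outputs.
labels = ["cat", "dog"]
image_embedding = [0.9, 0.1, 0.0]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

probs = zero_shot_probs(image_embedding, text_embeddings)
best_label = labels[probs.index(max(probs))]
print(best_label)
```

Text-image retrieval uses the same similarity scores, ranking a collection of images against one caption (or the reverse) instead of taking a softmax over labels.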
Architecture
The model uses a Vision Transformer (ViT) image encoder in the "g/14" configuration. As in other CLIP models, the image encoder is paired with a text encoder so that images and captions are embedded in a shared space. Both encoders were trained on image-text pairs drawn from LAION-5B.
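In ViT naming, the number after the slash conventionally gives the side length of each square image patch in pixels, so "g/14" suggests 14x14 patches. The short calculation below shows how many patch tokens that yields per image. The 224-pixel input resolution is an assumption made for illustration and is not stated in this document.

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches per side, total patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


# 14 comes from the "/14" in the model name; 224 is an assumed input resolution.
per_side, total = patch_grid(image_size=224, patch_size=14)
print(per_side, total)  # 16 256
```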
Training
Training Data
The model was trained on the 2-billion-sample English subset of the LAION-5B dataset. The dataset is uncurated and consists of a wide variety of internet-sourced content. Because it is unfiltered, it is recommended for research purposes.
Training Procedure
Training ran on the stability.ai compute cluster. Detailed training notes and logs are available through the provided links, and evaluation was performed with the LAION CLIP Benchmark suite.
Guide: Running Locally
To run the CLIP ViT-g/14 model locally, follow these steps:
- Install Dependencies: Ensure you have Python and PyTorch installed. Install the required libraries, such as Hugging Face Transformers and OpenCLIP.
- Download the Model: Access and download the model files from the Hugging Face model hub.
- Run Inference: Use the provided scripts or create your own to run inference with the model. Ensure you have a compatible dataset for testing.
- GPU Recommendation: For optimal performance, it is recommended to use cloud GPUs, such as those provided by AWS, Google Cloud, or Azure.
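The steps above can be sketched in Python with the OpenCLIP library (installable as `open_clip_torch`, alongside `torch` and `pillow`). This is a minimal sketch, not the model authors' script: the Hub repository name in `MODEL_ID` is inferred from the model title and should be treated as an assumption, and the "a photo of a ..." prompt template and the scaling factor of 100.0 are illustrative choices. The heavy imports sit inside the function so that the weights are only downloaded when it is actually called.

```python
MODEL_ID = "hf-hub:laion/CLIP-ViT-g-14-laion2B-s12B-b42K"  # assumed Hub repository name


def build_prompts(labels):
    """Wrap each class name in a simple caption template."""
    return [f"a photo of a {label}" for label in labels]


def classify_image(image_path, labels, model_id=MODEL_ID):
    """Return {label: probability} for one image; downloads weights on first use."""
    # Heavy imports are kept local so this module loads without them installed.
    import torch
    import open_clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _, preprocess = open_clip.create_model_and_transforms(model_id)
    tokenizer = open_clip.get_tokenizer(model_id)
    model = model.to(device).eval()

    # Preprocess the image and tokenize one caption per candidate label.
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0).to(device)
    text = tokenizer(build_prompts(labels)).to(device)

    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        # Normalize so the matrix product below yields cosine similarities.
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)[0]

    return dict(zip(labels, probs.tolist()))
```

A call such as `classify_image("example.jpg", ["cat", "dog"])`, where `example.jpg` is a hypothetical local file, would return one probability per label. The function falls back to CPU when no GPU is available, though a model of this size will likely run slowly there.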
License
The CLIP ViT-g/14 model is released under the MIT license, which permits use, modification, and redistribution provided the copyright and license notice are retained.