CLIP-ViT-L-14-laion2B-s32B-b82K
Introduction
The CLIP-ViT-L-14-laion2B-s32B-b82K model is a variant of CLIP trained on the LAION-2B English subset of the larger LAION-5B dataset. It is intended for zero-shot image classification and other image-text retrieval tasks, and was trained with the OpenCLIP software.
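As a brief illustration, the weights can be pulled directly from the Hugging Face Hub through OpenCLIP's hf-hub syntax. This is a minimal sketch, assuming the repository id laion/CLIP-ViT-L-14-laion2B-s32B-b82K and an installed open_clip_torch package:

```python
import open_clip

# Load the pretrained model and its preprocessing transform straight from the Hub
# (repository id assumed; adjust if the weights are hosted elsewhere).
model, preprocess = open_clip.create_model_from_pretrained(
    "hf-hub:laion/CLIP-ViT-L-14-laion2B-s32B-b82K"
)
tokenizer = open_clip.get_tokenizer("hf-hub:laion/CLIP-ViT-L-14-laion2B-s32B-b82K")
```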
Architecture
The architecture follows OpenAI's CLIP design, pairing a ViT-L/14 Vision Transformer image encoder with a transformer text encoder. It also incorporates training-stability techniques such as Normformer-style modifications and scaled cosine attention to cope with training at this scale.
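OpenCLIP ships a JSON configuration describing the ViT-L-14 architecture (embedding dimension, vision-tower and text-tower hyperparameters). A small sketch for inspecting it, assuming the get_model_config helper exposed by open_clip:

```python
import json
import open_clip

# Print the built-in architecture definition for ViT-L-14.
config = open_clip.get_model_config("ViT-L-14")
print(json.dumps(config, indent=2))
```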
Training
Training was conducted on the JUWELS Booster supercomputer, using 384 A100 GPUs across 160 virtual epochs and roughly 32 billion samples seen in total. A combination of float16 and float32 precision was used to mitigate issues that arose during training, such as loss spikes and NaN values. The training data is the 2-billion-sample English subset of LAION-5B; because the dataset is uncurated, it may contain disturbing content and should be handled with caution.
Guide: Running Locally
To run the model locally, follow these steps:
- Set up the Environment: Ensure you have Python and necessary libraries installed, such as PyTorch and OpenCLIP.
- Clone Repository: Clone the OpenCLIP repository from GitHub.
- Download Model Weights: Obtain the model weights from the Hugging Face Model Hub.
- Run the Model: Use the scripts provided in the repository, or the OpenCLIP Python API, to load the model and perform inference (see the sketch below).
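The following sketch shows a complete zero-shot classification pass with the OpenCLIP Python API. The pretrained tag laion2b_s32b_b82k, the image path cat.jpg, and the candidate labels are assumptions for illustration:

```python
import torch
import open_clip
from PIL import Image

# Build the model, preprocessing transform, and tokenizer
# (pretrained tag assumed to correspond to this checkpoint).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="laion2b_s32b_b82k"
)
tokenizer = open_clip.get_tokenizer("ViT-L-14")
model.eval()

# Prepare one image and a set of candidate labels.
image = preprocess(Image.open("cat.jpg")).unsqueeze(0)  # hypothetical local file
text = tokenizer(["a photo of a cat", "a photo of a dog", "a photo of a car"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize embeddings; softmax over label similarities gives zero-shot scores.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)
```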
For efficient computation, it is recommended to use cloud-based solutions with NVIDIA A100 GPUs, such as AWS or Google Cloud Platform.
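If a GPU such as an A100 is available, inference can be run on it in half precision. A sketch, assuming CUDA is present and using the precision and device arguments of create_model_and_transforms:

```python
import torch
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"

# Request fp16 weights on the GPU to reduce memory use and speed up inference;
# fall back to fp32 on CPU, where fp16 is generally not practical.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14",
    pretrained="laion2b_s32b_b82k",
    precision="fp16" if device == "cuda" else "fp32",
    device=device,
)
```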
License
This model is distributed under the MIT License, allowing for wide-ranging use and modification within the bounds of the license terms.