CLIP-GmP-ViT-L-14
zer0int
Introduction
The CLIP-GmP-ViT-L-14 model is a fine-tuned version of OpenAI's CLIP ViT-L/14, designed for improved zero-shot image classification. It leverages Geometric Parametrization (GmP) and a custom loss function with label smoothing to enhance performance.
Architecture
The model architecture is built upon CLIP ViT-L/14, whose transformer blocks contain multi-layer perceptron (MLP) sub-layers composed of linear layers with GELU activations. In the GmP version, these linear layers are replaced with GeometricLinear layers, which decompose each weight matrix into a radial component (magnitude) and an angular component (direction), so the directionality and magnitude of the weight vectors are preserved as separately learned parameters. This modification is applied to both the image and text transformer blocks.
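For intuition, here is a minimal sketch of such a geometrically parametrized linear layer in PyTorch. The class name GeometricLinear, the parameter names r and theta, and the initialization are illustrative assumptions and may differ from the actual implementation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricLinear(nn.Module):
    """Linear layer whose weight is parametrized as magnitude (r) times direction (theta).

    Illustrative sketch of Geometric Parametrization (GmP); names and details are assumed.
    """
    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        # Start from a standard linear weight, then split it into r and theta.
        weight = torch.empty(out_features, in_features)
        nn.init.kaiming_uniform_(weight, a=5 ** 0.5)
        self.theta = nn.Parameter(weight.clone())                 # angular component (direction)
        self.r = nn.Parameter(weight.norm(dim=1, keepdim=True))   # radial component (per-row magnitude)
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Reconstruct the effective weight: unit-norm direction scaled by its magnitude.
        weight = self.r * F.normalize(self.theta, dim=1)
        return F.linear(x, weight, self.bias)
```
Because the effective weight is simply r times the normalized direction, such a layer can be folded back into a standard linear layer after training, which is consistent with the checkpoint loading via the stock CLIPModel class in the guide below.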
Training
The fine-tuning process involves techniques like Geometric Parametrization, activation value manipulation, and a custom loss function with label smoothing. These methods collectively improve the model's ImageNet/ObjectNet accuracy to approximately 0.91, surpassing the original pre-trained model's accuracy of ~0.84.
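As a rough illustration of a label-smoothed contrastive objective (not the model's exact training code), the sketch below applies PyTorch's built-in label smoothing to CLIP's standard symmetric image-text loss; the function name and the smoothing value of 0.1 are assumptions.
```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          logit_scale: torch.Tensor,
                          label_smoothing: float = 0.1) -> torch.Tensor:
    """Symmetric CLIP contrastive loss with label smoothing (illustrative sketch)."""
    # Cosine-similarity logits between every image and every text in the batch.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # The matching image-text pairs lie on the diagonal of the logit matrix.
    targets = torch.arange(image_features.size(0), device=image_features.device)

    # Label smoothing spreads a little probability mass onto the non-matching pairs.
    loss_i = F.cross_entropy(logits_per_image, targets, label_smoothing=label_smoothing)
    loss_t = F.cross_entropy(logits_per_text, targets, label_smoothing=label_smoothing)
    return (loss_i + loss_t) / 2
```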
Guide: Running Locally
- Setup Environment: Ensure Python and the necessary libraries (such as transformers and torch) are installed.
- Install Hugging Face Transformers: Run pip install transformers to install the library.
- Load the Model (a zero-shot usage sketch follows this list):
```python
from transformers import CLIPModel, CLIPProcessor, CLIPConfig

model_id = "zer0int/CLIP-GmP-ViT-L-14"
config = CLIPConfig.from_pretrained(model_id)
model = CLIPModel.from_pretrained(model_id)
```
- GPU Recommendation: For improved performance, consider using cloud GPU services like AWS EC2, Google Cloud, or Azure.
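As referenced in the Load the Model step, here is a minimal zero-shot classification sketch. The image path and candidate labels are hypothetical, and it assumes the repository provides standard CLIP processor files so that CLIPProcessor.from_pretrained works.
```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "zer0int/CLIP-GmP-ViT-L-14"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)  # assumes standard CLIP processor files in the repo

# Hypothetical image path and candidate labels for illustration.
image = Image.open("example.jpg")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the similarity of the image to each text prompt.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```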
License
The model is released under the MIT License, allowing for wide usage and modification. For full license details, refer to the license information in OpenAI's CLIP repository.