CLIP-GmP-ViT-L-14

zer0int

Introduction

CLIP-GmP-ViT-L-14 is a fine-tuned version of OpenAI's CLIP ViT-L/14, aimed at improved zero-shot image classification. It applies Geometric Parametrization (GmP) and a custom loss function to boost performance over the original checkpoint.

Architecture

The model architecture is built on CLIP ViT-L/14's multi-layer perceptron (MLP) blocks, which in the original model use linear layers with GELU activations. In the GmP version, these linear layers are replaced with GeometricLinear layers, which decompose each weight vector into a radial (magnitude) component and an angular (direction) component, preserving both properties of the original weights. This modification is applied to the MLPs in both the image and text transformer blocks.
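
The exact layer implementation is not reproduced here, but a minimal sketch of such a radial/angular decomposition, assuming the layer is called GeometricLinear and that both components are learned as separate parameters, might look like the following:

    import math
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Sketch of a GmP-style linear layer: the weight matrix is stored as a
    # per-row magnitude r (radial) and a direction theta (angular), and the
    # effective weight is re-composed on each forward pass.
    class GeometricLinear(nn.Module):
        def __init__(self, in_features, out_features, bias=True):
            super().__init__()
            weight = torch.empty(out_features, in_features)
            nn.init.kaiming_uniform_(weight, a=math.sqrt(5))
            norms = weight.norm(dim=1, keepdim=True)
            self.r = nn.Parameter(norms)               # radial component (magnitude)
            self.theta = nn.Parameter(weight / norms)  # angular component (direction)
            self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

        def forward(self, x):
            # Re-compose the weight from magnitude and (re-normalized) direction.
            weight = self.r * F.normalize(self.theta, dim=1)
            return F.linear(x, weight, self.bias)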

Training

Fine-tuning combines Geometric Parametrization, activation value manipulation, and a custom loss function with label smoothing. Together, these raise the model's ImageNet/ObjectNet accuracy to approximately 0.91, compared to roughly 0.84 for the original pre-trained model.
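
The training code is not reproduced here, but one plausible reading of "custom loss function with label smoothing" is a label-smoothed version of the standard symmetric CLIP contrastive loss. The function name and smoothing value below are illustrative assumptions, not taken from the model card:

    import torch
    import torch.nn.functional as F

    def clip_loss_with_label_smoothing(image_features, text_features, logit_scale, smoothing=0.1):
        # Cosine-similarity logits between every image and every text in the batch.
        image_features = F.normalize(image_features, dim=-1)
        text_features = F.normalize(text_features, dim=-1)
        logits = logit_scale * image_features @ text_features.t()
        # Matched pairs lie on the diagonal; label smoothing softens the targets.
        targets = torch.arange(logits.size(0), device=logits.device)
        loss_img = F.cross_entropy(logits, targets, label_smoothing=smoothing)
        loss_txt = F.cross_entropy(logits.t(), targets, label_smoothing=smoothing)
        return (loss_img + loss_txt) / 2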

Guide: Running Locally

  1. Set Up the Environment: Ensure Python and PyTorch (torch) are installed.
  2. Install Hugging Face Transformers: Run pip install transformers to install the library.
  3. Load the Model (a full zero-shot classification sketch follows this list):
    from transformers import CLIPModel, CLIPProcessor

    model_id = "zer0int/CLIP-GmP-ViT-L-14"
    model = CLIPModel.from_pretrained(model_id)
    processor = CLIPProcessor.from_pretrained(model_id)

  4. GPU Recommendation: For improved performance, consider using cloud GPU services like AWS EC2, Google Cloud, or Azure.
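
Once the model and processor are loaded, a zero-shot classification pass follows the standard Hugging Face CLIP API. The image URL and candidate labels below are placeholders; any image and label set will work:

    import torch
    import requests
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model_id = "zer0int/CLIP-GmP-ViT-L-14"
    model = CLIPModel.from_pretrained(model_id)
    processor = CLIPProcessor.from_pretrained(model_id)

    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    labels = ["a photo of a cat", "a photo of a dog"]

    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Higher probability = better match between the image and the label.
    probs = outputs.logits_per_image.softmax(dim=-1)
    print(dict(zip(labels, probs[0].tolist())))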

License

The model is released under the MIT License, allowing for wide usage and modification. For full license details, refer to OpenAI's CLIP documentation.
