LongCLIP-GmP-ViT-L-14
zer0int
Introduction
LongCLIP-GmP-ViT-L-14 is an advanced version of the original Long-CLIP model, fine-tuned by the user zer0int. The model enhances zero-shot image classification capabilities using a refined architecture that improves upon the accuracy and efficiency of previous iterations, especially when dealing with longer text inputs.
Architecture
The model is built on the Long-CLIP framework and integrates Geometric Parametrization (GmP) into its multi-layer perceptron (MLP) layers. GmP decomposes each weight vector into a radial component (its magnitude) and an angular component (its direction), so that the two are represented separately while the original weight remains recoverable. This modification enhances the model's performance on datasets with varied and complex characteristics, such as ImageNet/ObjectNet.
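The radial/angular split described above can be illustrated with a minimal sketch in plain Python. It assumes the decomposition is applied per row of a weight matrix and that every row is non-zero; the function names `to_geometric` and `from_geometric` are hypothetical and are not taken from the model's repository.

```python
import math

def to_geometric(weight):
    """Split each row of a weight matrix into a magnitude r and a unit direction theta.

    Assumes every row is non-zero (illustrative sketch only).
    """
    radii, directions = [], []
    for row in weight:
        r = math.sqrt(sum(v * v for v in row))   # radial component: L2 norm of the row
        radii.append(r)
        directions.append([v / r for v in row])  # angular component: unit vector
    return radii, directions

def from_geometric(radii, directions):
    """Rebuild the weight matrix as r * theta / ||theta|| for each row."""
    rebuilt = []
    for r, theta in zip(radii, directions):
        norm = math.sqrt(sum(v * v for v in theta))  # re-normalise in case theta has drifted
        rebuilt.append([r * v / norm for v in theta])
    return rebuilt

weight = [[3.0, 4.0], [1.0, 0.0]]
radii, directions = to_geometric(weight)
rebuilt = from_geometric(radii, directions)
```

In a trainable layer, `r` and `theta` would be the learned parameters and the effective weight would be rebuilt from them on each forward pass; that part is an assumption about how such a layer is typically wired, not a description of this model's code.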
Training
The model has been fine-tuned using a custom loss function with label smoothing, which provides performance gains, especially in scenarios prone to overfitting. Fine-tuning raised ImageNet/ObjectNet accuracy from roughly 0.81 for the original model to 0.89. The training process leverages a diverse, high-quality dataset, and the scripts and resources for further fine-tuning are available on GitHub.
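To make the idea of label smoothing concrete, here is a minimal sketch of a generic label-smoothed cross-entropy in plain Python. It is not the author's custom loss, whose exact form is not given here; the function name `smoothed_cross_entropy` and the default `smoothing=0.1` are assumptions chosen for illustration.

```python
import math

def smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross-entropy against a smoothed target distribution.

    The target class gets probability 1 - smoothing + smoothing / n,
    every other class gets smoothing / n (n = number of classes).
    """
    n = len(logits)
    m = max(logits)
    # Numerically stable log-softmax.
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    off_value = smoothing / n
    on_value = 1.0 - smoothing + off_value
    return -sum(
        (on_value if i == target else off_value) * lp
        for i, lp in enumerate(log_probs)
    )

confident = [10.0, 0.0, 0.0]
plain_loss = smoothed_cross_entropy(confident, target=0, smoothing=0.0)
smooth_loss = smoothed_cross_entropy(confident, target=0, smoothing=0.1)
```

With smoothing enabled, a very confident correct prediction still incurs a noticeable loss, which discourages the over-confident outputs that tend to accompany overfitting.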
Guide: Running Locally
- Setup Environment: Ensure you have Python and PyTorch installed. Clone the GitHub repository for scripts and configuration files.
- Install Dependencies: Use pip install transformers and pip install safetensors to install the necessary libraries.
- Download Model: Load the model using the Hugging Face Transformers library:
    from transformers import CLIPModel, CLIPProcessor
    model_id = "zer0int/LongCLIP-GmP-ViT-L-14"
    model = CLIPModel.from_pretrained(model_id)
    processor = CLIPProcessor.from_pretrained(model_id)
- Adjust Tokenization: Implement the proper integration for handling 248 tokens as outlined in the README.
- Inference: Execute inference scripts to classify images using the model.
- Hardware Suggestion: Use a hardware accelerator, such as an AWS EC2 P3 instance, a Google Cloud TPU, or a local NVIDIA GPU, to handle the computational requirements effectively.
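The steps above can be tied together in a rough sketch of zero-shot classification. Several parts are assumptions rather than the author's documented procedure: overriding `max_position_embeddings` to 248 on the text config is one plausible way to handle the longer context, the helper names `classify` and `rank_labels` are hypothetical, and Pillow is assumed to be installed for image loading. Check the README for the authoritative tokenization setup.

```python
def rank_labels(labels, probs):
    """Pair each label with its probability and sort from most to least likely."""
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)

def classify(image_path, labels,
             model_id="zer0int/LongCLIP-GmP-ViT-L-14", max_tokens=248):
    """Zero-shot classify one image against a list of text labels (sketch)."""
    # Heavy dependencies are imported lazily so the helpers can be defined without them.
    import torch
    from PIL import Image
    from transformers import CLIPConfig, CLIPModel, CLIPProcessor

    # Assumption: widen the text position embeddings to the 248-token context.
    config = CLIPConfig.from_pretrained(model_id)
    config.text_config.max_position_embeddings = max_tokens
    model = CLIPModel.from_pretrained(model_id, config=config)
    processor = CLIPProcessor.from_pretrained(model_id)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(
        text=labels, images=image, return_tensors="pt",
        padding="max_length", max_length=max_tokens, truncation=True,
    )
    with torch.no_grad():
        # logits_per_image has shape (num_images, num_labels); take the only image.
        logits = model(**inputs).logits_per_image[0]
    probs = logits.softmax(dim=-1).tolist()
    return rank_labels(labels, probs)
```

A call might look like `classify("photo.jpg", ["a photo of a cat", "a photo of a dog"])`, where `photo.jpg` is a placeholder path; the first entry of the returned list is the most likely label.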
License
The LongCLIP-GmP-ViT-L-14 model is based on the pre-trained CLIP model by OpenAI, which is distributed under the MIT License.