CLIP-SAE-ViT-L-14
zer0int
Introduction
The CLIP-SAE-ViT-L-14 model by zer0int is a fine-tuned version of the original OpenAI CLIP model. It uses Sparse Autoencoder (SAE)-informed adversarial training to improve performance, particularly on zero-shot image classification tasks.
Architecture
The model is based on the OpenAI CLIP architecture, specifically using the ViT-L-14 variant. The fine-tuning process incorporates a Sparse Autoencoder to improve adversarial robustness and maintain high accuracy on image classification benchmarks such as ImageNet and ObjectNet.
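Because the fine-tune keeps the base ViT-L/14 architecture, its key dimensions can be read off the configuration of the original OpenAI checkpoint. A minimal sketch using the Hugging Face transformers library; openai/clip-vit-large-patch14 refers to the base model here, not the fine-tune itself:

```python
from transformers import CLIPConfig

# Configuration of the base ViT-L/14 CLIP model (same architecture as the fine-tune).
config = CLIPConfig.from_pretrained("openai/clip-vit-large-patch14")

# Vision tower: ViT-Large over 224x224 images split into 14x14-pixel patches.
print(config.vision_config.patch_size)         # 14
print(config.vision_config.image_size)         # 224
print(config.vision_config.hidden_size)        # 1024
print(config.vision_config.num_hidden_layers)  # 24

# Text tower width and shared image-text embedding dimension.
print(config.text_config.hidden_size)          # 768
print(config.projection_dim)                   # 768
```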
Training
The model achieves 89% accuracy on ImageNet/ObjectNet, surpassing the original OpenAI pre-trained model's 84.5%. The training process and scripts are available in zer0int's GitHub repository. The model also performs well on linear probing tasks, which can be evaluated with the LAION-AI/CLIP_benchmark toolkit.
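Linear probing trains a simple classifier on frozen image embeddings. The sketch below illustrates that protocol with scikit-learn rather than the CLIP_benchmark tooling itself; the repository id zer0int/CLIP-SAE-ViT-L-14 and the toy random data are assumptions for illustration, not details taken from the model card.

```python
# Minimal linear-probe sketch on frozen CLIP image embeddings (not the
# CLIP_benchmark tool itself). The repository id below is an assumption;
# check the Hugging Face model page for the exact name.
import numpy as np
import torch
from PIL import Image
from sklearn.linear_model import LogisticRegression
from transformers import CLIPModel, CLIPProcessor

model_id = "zer0int/CLIP-SAE-ViT-L-14"  # assumed id
model = CLIPModel.from_pretrained(model_id).eval()
# Preprocessing is unchanged from the base model, so its processor can be reused.
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Toy stand-in data: replace with a real labelled dataset (e.g. ImageNet)
# to reproduce linear-probe results.
images = [Image.fromarray(np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8))
          for _ in range(32)]
labels = np.random.randint(0, 2, size=32)

with torch.no_grad():
    pixel_values = processor(images=images, return_tensors="pt").pixel_values
    feats = model.get_image_features(pixel_values=pixel_values)  # frozen embeddings
    feats = feats / feats.norm(dim=-1, keepdim=True)             # L2-normalise

clf = LogisticRegression(max_iter=1000).fit(feats.numpy(), labels)
print("train accuracy:", clf.score(feats.numpy(), labels))
```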
Guide: Running Locally
- Setup Environment: Install the required libraries, primarily the transformers library from Hugging Face.
- Download Model: Use the provided safetensors files from the Hugging Face model page.
- Inference Script: Implement or use an existing script from the GitHub repository to run the model on your data; a minimal example is sketched after this list.
- GPU Recommendation: For optimal performance, consider using cloud-based GPUs such as those from AWS, Google Cloud, or Azure.
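The steps above can be boiled down to a short inference script. Below is a minimal zero-shot classification sketch with transformers, assuming the fine-tune is published in the standard Hugging Face CLIP format under an id like zer0int/CLIP-SAE-ViT-L-14 (verify the exact name on the model page); the image URL and candidate labels are placeholders.

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "zer0int/CLIP-SAE-ViT-L-14"  # assumed repository id; verify on the model page
device = "cuda" if torch.cuda.is_available() else "cpu"

model = CLIPModel.from_pretrained(model_id).to(device).eval()
# If the fine-tune repo does not ship processor files, fall back to the base
# "openai/clip-vit-large-patch14" processor; the preprocessing is unchanged.
processor = CLIPProcessor.from_pretrained(model_id)

# Placeholder image and candidate labels.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True).to(device)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity logits, softmaxed into per-label probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```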
License
This model is released under the MIT License, allowing for broad usage with minimal restrictions.