CLIP-SAE-ViT-L-14

zer0int

Introduction

The CLIP-SAE-ViT-L-14 model by zer0int is a fine-tuned version of the original OpenAI CLIP ViT-L/14. It uses Sparse Autoencoder (SAE)-informed adversarial training to improve performance, particularly on zero-shot image classification tasks.

Architecture

The model is based on the OpenAI CLIP architecture, specifically the ViT-L/14 variant. The fine-tuning process incorporates a Sparse Autoencoder to improve adversarial robustness while maintaining high accuracy on image classification benchmarks such as ImageNet and ObjectNet.
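As a quick sanity check, the ViT-L/14 layout can be confirmed from the model configuration via transformers. This is a minimal sketch; the repo id below is an assumption, so verify the exact name on the Hugging Face model page.

```python
from transformers import CLIPConfig

# Assumed repo id -- verify against the Hugging Face model page.
config = CLIPConfig.from_pretrained("zer0int/CLIP-SAE-ViT-L-14")

# Standard ViT-L/14 vision tower: 1024-dim hidden states, 24 layers, 14x14 patches.
print("vision hidden size:", config.vision_config.hidden_size)
print("vision layers:     ", config.vision_config.num_hidden_layers)
print("patch size:        ", config.vision_config.patch_size)
```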

Training

The model achieves 89% accuracy on ImageNet/ObjectNet, surpassing the original OpenAI pre-trained model's 84.5%. The training process and scripts are available in zer0int's GitHub repository. The model also works well with LAION-AI/CLIP_benchmark for linear-probe evaluation.
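LAION-AI/CLIP_benchmark ships its own CLI for linear probing; as a rough illustration of what that evaluation does, the sketch below fits a logistic-regression probe on frozen CLIP image features. The repo id, file paths, and labels are placeholders, not part of the original model card.

```python
import torch
from PIL import Image
from sklearn.linear_model import LogisticRegression
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "zer0int/CLIP-SAE-ViT-L-14"  # assumed repo id -- check the model page
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)

def embed(paths):
    """Encode image files into L2-normalized CLIP image features."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1).numpy()

# Placeholder splits -- replace with a real labeled dataset.
train_paths, train_labels = ["cat_01.jpg", "dog_01.jpg"], [0, 1]
test_paths, test_labels = ["cat_02.jpg", "dog_02.jpg"], [0, 1]

# The linear probe: a simple classifier trained on frozen CLIP features.
probe = LogisticRegression(max_iter=1000).fit(embed(train_paths), train_labels)
print("linear-probe accuracy:", probe.score(embed(test_paths), test_labels))
```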

Guide: Running Locally

  1. Set up the environment: Install the required libraries, primarily transformers from Hugging Face.
  2. Download the model: Use the provided safetensors files from the Hugging Face model page, or let from_pretrained fetch them automatically.
  3. Run inference: Implement your own script or use the existing scripts from the GitHub repository; a minimal zero-shot example is sketched after this list.
  4. GPU recommendation: For optimal performance, consider cloud-based GPUs such as those from AWS, Google Cloud, or Azure.
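The snippet below is a minimal end-to-end sketch of steps 1-3 using the transformers zero-shot workflow; the repo id, image path, and label prompts are assumptions to be replaced with your own.

```python
# Step 1: pip install transformers torch pillow
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Step 2: download the weights from the Hub (repo id assumed -- check the model page).
MODEL_ID = "zer0int/CLIP-SAE-ViT-L-14"
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)

# Step 3: zero-shot classification of a single image against text prompts.
image = Image.open("example.jpg").convert("RGB")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax gives class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```

For step 4, a CUDA-capable GPU speeds up batch inference: move the model and inputs to the device with `.to("cuda")` before the forward pass.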

License

This model is released under the MIT License, allowing for broad usage with minimal restrictions.
