LongCLIP-SAE-ViT-L-14

zer0int

Introduction

LongCLIP-SAE-ViT-L-14 is a machine learning model designed for zero-shot image classification. It builds on the CLIP architecture and adds sparse-autoencoder (SAE)-informed adversarial training to improve performance and robustness, particularly in out-of-distribution scenarios.

Architecture

The model extends the capabilities of the original CLIP, whose text encoder accepts at most 77 tokens with an effective length of only about 20 tokens, by supporting longer text inputs of up to 248 tokens. The LongCLIP architecture uses a ViT-L/14 vision backbone and has been fine-tuned using SAE-informed adversarial training, which enhances its ability to handle diverse and challenging visual tasks.
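The sketch below illustrates the context-length difference in practice. It uses the standard openai/clip-vit-large-patch14 tokenizer only to count tokens; the 77- and 248-token limits come from the original CLIP and LongCLIP designs rather than from this card, so treat them as assumptions to verify against the release.

```python
# Minimal sketch: comparing the standard CLIP token limit with LongCLIP's
# extended context. The 248-token figure is an assumption based on the
# LongCLIP design, not a value taken from this card.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

# A deliberately long caption that exceeds the stock CLIP limit.
long_caption = " ".join(["a red apple on a wooden table,"] * 20)

ids = tokenizer(long_caption)["input_ids"]
print(f"Caption length: {len(ids)} tokens")
print(f"Fits standard CLIP (77 tokens)?  {len(ids) <= 77}")
print(f"Fits LongCLIP (248 tokens)?      {len(ids) <= 248}")
```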

Training

Training details and code for LongCLIP-SAE-ViT-L-14 can be found on GitHub. The model is fine-tuned with sparse-autoencoder-informed adversarial training to improve its zero-shot image classification capabilities. It is specifically optimized for use with the HunyuanVideo text-to-video model via the ComfyUI-HunyuanVideo-Nyan node.

Guide: Running Locally

  1. Installation: Clone the repository from GitHub and install the necessary dependencies.
  2. Download Model: Obtain the model weights from Hugging Face.
  3. Configuration: Set up the environment to use the model with compatible nodes like ComfyUI-HunyuanVideo-Nyan.
  4. Execution: Run your desired zero-shot image classification tasks using the model, as shown in the sketch below.
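A minimal zero-shot classification sketch follows. It assumes the weights are published in Hugging Face Transformers format under the repo ID zer0int/LongCLIP-SAE-ViT-L-14 and load through the standard CLIPModel/CLIPProcessor classes; the repo ID, the local image path, and the 248-token text length are assumptions, so adjust them to the actual release.

```python
# Hypothetical usage sketch, not the official example. The repo ID and
# 248-token max_length are assumptions; verify against the actual release.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

repo_id = "zer0int/LongCLIP-SAE-ViT-L-14"  # assumed Hugging Face repo ID
model = CLIPModel.from_pretrained(repo_id)
processor = CLIPProcessor.from_pretrained(repo_id)

image = Image.open("example.jpg")  # path to a local image (placeholder)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]

# LongCLIP accepts longer captions than stock CLIP; pad to the extended length.
inputs = processor(
    text=labels,
    images=image,
    return_tensors="pt",
    padding="max_length",
    max_length=248,  # assumed LongCLIP context length
)

with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores -> probabilities over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```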

For optimal performance, consider using cloud GPU services such as AWS, Google Cloud, or Azure.

License

The LongCLIP-SAE-ViT-L-14 model and its associated resources are subject to the licensing terms provided by its creator, zer0int. Please review the specific license details on the project's GitHub repository before use.
