LongCLIP-SAE-ViT-L-14
by zer0int
Introduction
LongCLIP-SAE-ViT-L-14 is a machine learning model designed for zero-shot image classification. It builds on the CLIP architecture and uses sparse autoencoder (SAE)-informed adversarial training to improve performance and robustness, particularly in out-of-distribution scenarios.
Architecture
The model extends the original CLIP text encoder, whose 77-token maximum input has an effective length of only about 20 tokens, to a maximum input of 248 tokens. The LongCLIP architecture uses a ViT-L/14 backbone and has been fine-tuned with SAE-informed adversarial training, which improves its ability to handle diverse and challenging visual tasks.
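For illustration, the sketch below shows one way such a checkpoint might be loaded with Hugging Face transformers' standard CLIP classes and used to embed a caption longer than the original 77-token limit. The repository id, the assumption that the checkpoint is compatible with `CLIPModel`, and the explicit 248-position-embedding adjustment are assumptions based on the description above, not confirmed usage instructions.

```python
# Hedged sketch: loading a LongCLIP-style checkpoint with transformers' CLIP classes.
# The repo id and CLIPModel compatibility are assumptions; the 248-token figure
# follows the LongCLIP design described above.
import torch
from transformers import CLIPConfig, CLIPModel, CLIPProcessor

model_id = "zer0int/LongCLIP-SAE-ViT-L-14"  # assumed Hugging Face repo id

# LongCLIP's text encoder uses 248 position embeddings instead of CLIP's 77,
# so the text config is adjusted before the weights are loaded.
config = CLIPConfig.from_pretrained(model_id)
config.text_config.max_position_embeddings = 248

model = CLIPModel.from_pretrained(model_id, config=config).eval()
processor = CLIPProcessor.from_pretrained(model_id)

# A long caption that a standard 77-token CLIP tokenizer would truncate.
caption = ("A wide-angle photograph of a crowded night market, "
           "with steam rising from food stalls, strings of paper lanterns, "
           "and a cat sleeping on a stack of wooden crates in the foreground.")

inputs = processor(text=[caption], return_tensors="pt",
                   padding="max_length", max_length=248, truncation=True)
with torch.no_grad():
    text_features = model.get_text_features(**inputs)

print(text_features.shape)  # expected: torch.Size([1, 768]) for ViT-L/14
```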
Training
Training details and code for LongCLIP-SAE-ViT-L-14 can be found on GitHub. The model is fine-tuned with sparse autoencoder guidance and adversarial techniques to improve zero-shot image classification. It is also optimized for use as a text encoder with HunyuanVideo via the ComfyUI-HunyuanVideo-Nyan node.
Guide: Running Locally
- Installation: Clone the repository from GitHub and install the necessary dependencies.
- Download Model: Obtain the model weights from Hugging Face.
- Configuration: Set up the environment to use the model with compatible nodes like ComfyUI-HunyuanVideo-Nyan.
- Execution: Run your desired image classification tasks using the model (a minimal end-to-end sketch follows this list).
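The following is a minimal end-to-end sketch of the download and execution steps, assuming the weights load through transformers' CLIP classes. The repository id, image path, and label prompts are illustrative placeholders.

```python
# Hedged sketch: download the weights from Hugging Face and run zero-shot
# classification. Repo id, image path, and labels are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "zer0int/LongCLIP-SAE-ViT-L-14"  # assumed Hugging Face repo id
# If the checkpoint's config does not already reflect the 248-token text encoder,
# adjust it as in the earlier sketch before loading.
model = CLIPModel.from_pretrained(model_id).eval()   # downloads and caches the weights
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # replace with your own image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into
# probabilities over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```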
For optimal performance, consider using cloud GPU services such as AWS, Google Cloud, or Azure.
License
The LongCLIP-SAE-ViT-L-14 model and its associated resources are subject to the licensing terms provided by its creator, zer0int. Please review the specific license details on the project's GitHub repository before use.