CLIP Variants
Introduction
The CLIP Variants project offers converted versions of the CLIP model originally developed by OpenAI, which was designed to improve robustness in computer vision tasks and to generalize to arbitrary image classification in a zero-shot manner. This repository provides the CLIP models converted to different data types and split architectures.
Architecture
The CLIP model is a multimodal neural network that processes both image and text data. The original models have been split into two parts, a visual model and a textual model, and each is provided in several data types, such as float16, qint8, and quint8, in the ONNX (Open Neural Network Exchange) format.
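As a rough sketch of how the split models can be consumed, the snippet below loads a visual variant with ONNX Runtime and prints its input and output signatures. The file name follows the repository's naming scheme, but the tensor names, shapes, and dtypes it reports depend on the particular export and should be inspected rather than assumed.

```python
import onnxruntime as ort

# Load one of the split ONNX graphs (visual tower only).
# The file name follows the repository's naming pattern; adjust the path as needed.
session = ort.InferenceSession(
    "clip-vit-base-patch32-visual-float16.onnx",
    providers=["CPUExecutionProvider"],
)

# Inspect the graph signature instead of hard-coding tensor names,
# since they depend on how the model was exported.
for inp in session.get_inputs():
    print("input:", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)
```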
Training
The original CLIP models were trained by OpenAI with an emphasis on robustness in computer vision tasks. The converted models in this repository have not undergone extensive testing; brief tests showed that the float16 versions produce outputs similar to the original float32 versions, while the qint8 and quint8 versions show a decrease in similarity.
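A simple way to reproduce this kind of check, assuming you already have embeddings for the same input from two variants (for example the original float32 model and a converted float16 one), is to compare them with cosine similarity. The variable names below are purely illustrative.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    a = a.astype(np.float32).ravel()
    b = b.astype(np.float32).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical usage: emb_fp32 and emb_fp16 are embeddings of the same
# input produced by the float32 and float16 variants, respectively.
# A value close to 1.0 indicates the conversion preserved the output.
# print(cosine_similarity(emb_fp32, emb_fp16))
```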
Guide: Running Locally
To run CLIP Variants locally:
- Install ONNX Runtime: Ensure ONNX Runtime is installed in your environment.
- Download the Model: Choose the appropriate model variant (e.g., `clip-vit-base-patch32-visual-float16.onnx`) and download it.
- Run Example Code: Use the provided `example.py` to test the model: `python example.py` (a minimal standalone sketch is also shown after this list).
- Hardware Recommendations: Utilize cloud GPUs such as NVIDIA's V100 or A100 for optimal performance, especially for larger model variants.
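If you want a self-contained alternative to `example.py`, the sketch below runs a visual variant end to end with ONNX Runtime, NumPy, and Pillow. The preprocessing is a simplified version of the standard CLIP recipe (a direct 224x224 resize plus CLIP normalization constants rather than resize-and-center-crop), and the expected input layout, dtype, and number of outputs are assumptions to verify against the actual exported graph.

```python
import numpy as np
import onnxruntime as ort
from PIL import Image

# Normalization constants used by OpenAI's CLIP preprocessing.
MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def preprocess(path: str) -> np.ndarray:
    """Load an image and convert it to a 1x3x224x224 NCHW tensor."""
    image = Image.open(path).convert("RGB").resize((224, 224), Image.BICUBIC)
    pixels = np.asarray(image, dtype=np.float32) / 255.0
    pixels = (pixels - MEAN) / STD
    return pixels.transpose(2, 0, 1)[np.newaxis, ...]

session = ort.InferenceSession(
    "clip-vit-base-patch32-visual-float16.onnx",
    providers=["CPUExecutionProvider"],
)
inp = session.get_inputs()[0]

# Match the dtype the exported graph expects: float16 variants may
# require float16 inputs (an assumption to verify via inp.type).
dtype = np.float16 if "float16" in inp.type else np.float32
pixels = preprocess("cat.jpg").astype(dtype)  # "cat.jpg" is a placeholder path

# Assumes the graph has a single output holding the image embedding.
(embedding,) = session.run(None, {inp.name: pixels})
print(embedding.shape)
```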
License
The conversion code is licensed under the MIT License. The original models retain the same license as the OpenAI CLIP models. The author of this repository has no affiliation with OpenAI.