CLIP Variants

mlunar

Introduction

The CLIP Variants project offers converted versions of the CLIP model originally developed by OpenAI, which was designed to improve robustness in computer vision tasks and to generalize to arbitrary image classification in a zero-shot manner. This repository provides the CLIP models converted to different data types and split by architecture.

Architecture

The CLIP model is a multimodal neural network architecture that processes both image and text data. The original models have been split into two parts: a visual encoder and a textual encoder. The models are available in several data types, such as float16, qint8, and quint8, and are distributed in the ONNX (Open Neural Network Exchange) format.
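
As a minimal illustration, the sketch below loads the two encoders with ONNX Runtime and inspects the inputs and outputs they declare. The visual file name matches the example given in the guide further down; the textual file name is inferred from it and may differ in the actual repository.

    # Sketch: load the visual and textual ONNX encoders and list their
    # declared inputs/outputs. File names are assumptions based on the
    # naming scheme clip-vit-base-patch32-{visual,textual}-float16.onnx.
    import onnxruntime as ort

    visual = ort.InferenceSession("clip-vit-base-patch32-visual-float16.onnx")
    textual = ort.InferenceSession("clip-vit-base-patch32-textual-float16.onnx")

    for name, session in [("visual", visual), ("textual", textual)]:
        inputs = [(i.name, i.shape, i.type) for i in session.get_inputs()]
        outputs = [(o.name, o.shape, o.type) for o in session.get_outputs()]
        print(name, "inputs:", inputs)
        print(name, "outputs:", outputs)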

Training

The original CLIP models were trained by OpenAI for robustness in computer vision tasks. The converted models in this repository have not undergone extensive testing; brief tests showed that the float16 versions produce outputs close to those of the original float32 versions, whereas the qint8 and quint8 versions show a decrease in similarity.
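
A check of this kind can be reproduced roughly as follows. This is a sketch, not the repository's test code: it assumes a float32 variant is available alongside the float16 one, that the visual encoders accept a 1x3x224x224 image tensor, and it uses random pixel data rather than CLIP's real preprocessing pipeline.

    # Sketch: compare the float16 visual encoder against the float32 one on
    # the same input and report cosine similarity of the resulting embeddings.
    import numpy as np
    import onnxruntime as ort

    sess32 = ort.InferenceSession("clip-vit-base-patch32-visual-float32.onnx")
    sess16 = ort.InferenceSession("clip-vit-base-patch32-visual-float16.onnx")

    # Dummy batch of one 224x224 RGB image in NCHW layout (an assumption
    # about the expected input shape of the ViT-B/32 visual encoder).
    pixels = np.random.rand(1, 3, 224, 224).astype(np.float32)

    name32 = sess32.get_inputs()[0].name
    name16 = sess16.get_inputs()[0].name

    emb32 = sess32.run(None, {name32: pixels})[0]
    # Assumes the float16 graph expects float16 inputs; cast back for comparison.
    emb16 = sess16.run(None, {name16: pixels.astype(np.float16)})[0].astype(np.float32)

    cos = np.dot(emb32.ravel(), emb16.ravel()) / (
        np.linalg.norm(emb32) * np.linalg.norm(emb16)
    )
    print(f"cosine similarity between float32 and float16 embeddings: {cos:.6f}")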

Guide: Running Locally

To run CLIP Variants locally:

  1. Install ONNX Runtime: Ensure ONNX Runtime is installed in your environment.
  2. Download the Model: Choose the appropriate model variant (e.g., clip-vit-base-patch32-visual-float16.onnx) and download it.
  3. Run Example Code: Use the provided example.py to test the model (a minimal end-to-end sketch also follows this list).
    python example.py
    
  4. Hardware Recommendations: Utilize cloud GPUs such as NVIDIA's V100 or A100 for optimal performance, especially for larger model variants.
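
For a rough picture of how the pieces fit together, the sketch below performs zero-shot classification with the converted encoders. It is not the repository's example.py: it borrows preprocessing and tokenization from Hugging Face's CLIPProcessor (assuming the ONNX graphs expect the same inputs as the original CLIP implementation), reads input names from the sessions rather than hard-coding them, assumes the textual graph takes a single input_ids tensor, and uses cat.jpg as a placeholder image path.

    # Sketch: zero-shot classification with the converted ONNX encoders.
    import numpy as np
    import onnxruntime as ort
    from PIL import Image
    from transformers import CLIPProcessor

    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    visual = ort.InferenceSession("clip-vit-base-patch32-visual-float16.onnx")
    textual = ort.InferenceSession("clip-vit-base-patch32-textual-float16.onnx")

    labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
    batch = processor(text=labels, images=Image.open("cat.jpg"),
                      return_tensors="np", padding="max_length")

    # Run each encoder; dtype casts assume the float16 graphs take float16
    # pixels and int64 token ids, which may differ in practice.
    img_emb = visual.run(None, {
        visual.get_inputs()[0].name: batch["pixel_values"].astype(np.float16)
    })[0].astype(np.float32)
    txt_emb = textual.run(None, {
        textual.get_inputs()[0].name: batch["input_ids"].astype(np.int64)
    })[0].astype(np.float32)

    # Cosine similarities between the image and each label, then softmax.
    img_emb /= np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt_emb /= np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = 100.0 * img_emb @ txt_emb.T
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    for label, p in zip(labels, probs[0]):
        print(f"{label}: {p:.3f}")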

License

The conversion code is licensed under the MIT License. The original models retain the same license as the OpenAI CLIP models. The author of this repository has no affiliation with OpenAI.
