LLM2CLIP-Openai-L-14-336

microsoft

Introduction

LLM2CLIP is a novel approach that leverages large language models (LLMs) to enhance the capabilities of the CLIP model. The method fine-tunes an LLM so that its output embeddings become highly discriminative over captions, which lets training use longer and more complex captions than CLIP's original text encoder can handle. This advancement significantly boosts performance on cross-modal tasks and enables state-of-the-art cross-lingual capabilities.

Architecture

LLM2CLIP extends CLIP by using a fine-tuned LLM as a teacher model that refines CLIP's visual encoder. Because the LLM stands in for the standard CLIP text encoder (which is limited to 77 tokens), training can use longer, more intricate captions. The model is pre-trained on datasets such as CC3M, CC12M, YFCC15M, and a subset of Recap-DataComp-1B.
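The wiring implied by this description can be sketched as follows. This is a minimal illustration, not the official implementation: the module names (LLM2CLIPSketch, visual_proj, text_proj), the mean-pooling of LLM hidden states, and the projection dimension are all assumptions made for the example.

    # Minimal sketch of the LLM2CLIP wiring (hypothetical module names).
    # The caption-fine-tuned LLM is a frozen teacher; CLIP's vision tower
    # and two projection heads are trained to align the two spaces.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LLM2CLIPSketch(nn.Module):
        def __init__(self, clip_visual, llm, vis_dim, llm_dim, embed_dim=1280):
            super().__init__()
            self.visual = clip_visual          # e.g. CLIP ViT-L/14-336 vision tower (trainable)
            self.llm = llm.eval()              # caption-fine-tuned LLM (frozen)
            for p in self.llm.parameters():
                p.requires_grad = False
            self.visual_proj = nn.Linear(vis_dim, embed_dim)
            self.text_proj = nn.Linear(llm_dim, embed_dim)  # light adapter over LLM outputs

        def forward(self, pixel_values, input_ids, attention_mask):
            img = self.visual_proj(self.visual(pixel_values))
            with torch.no_grad():
                hidden = self.llm(input_ids, attention_mask=attention_mask).last_hidden_state
            txt = self.text_proj(hidden.mean(dim=1))  # pooled caption embedding
            return F.normalize(img, dim=-1), F.normalize(txt, dim=-1)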

Training

Training proceeds in two stages. First, the LLM is fine-tuned in caption space with contrastive learning, which distills its textual capabilities into its output embeddings. The fine-tuned LLM is then frozen and acts as a teacher for the CLIP visual encoder. The method has demonstrated significant gains, including a 16.5% improvement over the previous state-of-the-art model on retrieval tasks.
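Concretely, the contrastive objective is the standard symmetric InfoNCE loss used by CLIP. Below is a minimal sketch; the function name and temperature value are illustrative, and image_emb / text_emb stand for L2-normalized batches of image and caption embeddings.

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
        # (B, B) cosine-similarity matrix; diagonal entries are matched pairs
        logits = image_emb @ text_emb.t() / temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        loss_i2t = F.cross_entropy(logits, targets)      # image -> caption direction
        loss_t2i = F.cross_entropy(logits.t(), targets)  # caption -> image direction
        return (loss_i2t + loss_t2i) / 2

In the caption-space stage, the same kind of objective is applied between caption embeddings, treating different captions of the same image as positives, which is what sharpens the discriminability of the LLM's output space.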

Guide: Running Locally

To run LLM2CLIP locally, follow these steps:

  1. Setup Environment: Ensure you have Python and PyTorch installed. Install the transformers library from Hugging Face.
  2. Download Model: Use the Hugging Face transformers library to download the model. Because the checkpoint ships custom modeling code, pass trust_remote_code=True:
    from transformers import AutoModel
    # trust_remote_code is required: the checkpoint defines a custom architecture
    model = AutoModel.from_pretrained("microsoft/LLM2CLIP-Openai-L-14-336", trust_remote_code=True)
    
  3. Process Images: Use PIL and CLIPImageProcessor from transformers to prepare image inputs (see the combined example after this list).
  4. Inference: Ensure you have a compatible GPU and CUDA installed. Recommended cloud GPUs include NVIDIA Tesla T4 or V100.
  5. Run Script: Use the code samples, such as the sketch below, to run image embedding and retrieval tasks.
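The following sketch ties steps 2-5 together for image embedding. It follows the usage pattern on the model card; get_image_features comes from the repository's custom modeling code (hence trust_remote_code=True) and may change between releases, and example.jpg is a placeholder path.

    import torch
    from PIL import Image
    from transformers import AutoModel, CLIPImageProcessor

    # Image preprocessing matches the underlying OpenAI ViT-L/14-336 backbone.
    processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
    model = AutoModel.from_pretrained(
        "microsoft/LLM2CLIP-Openai-L-14-336",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,  # the checkpoint ships custom architecture code
    ).to("cuda").eval()

    image = Image.open("example.jpg")  # placeholder: use your own image
    pixels = processor(images=image, return_tensors="pt").pixel_values.to("cuda")

    with torch.no_grad(), torch.autocast("cuda"):
        image_features = model.get_image_features(pixels)
    print(image_features.shape)

For text embeddings and full retrieval, the model card pairs this checkpoint with the caption-fine-tuned LLM (accessed through the LLM2Vec wrapper); see the upstream repository for that end-to-end example.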

License

LLM2CLIP is released under the Apache-2.0 License, which allows for use, distribution, and modification with proper attribution.
