LLM2CLIP-Llama-3-8B-Instruct-CC-Finetuned

microsoft

Introduction
LLM2CLIP is a novel approach that leverages large language models (LLMs) to enhance the capabilities of CLIP (Contrastive Language–Image Pretraining). By fine-tuning the LLM in the caption space with contrastive learning, LLM2CLIP improves textual discriminability and allows longer, more complex captions to be incorporated, overcoming the limitations of the original CLIP text encoder. This method significantly boosts performance on cross-modal tasks and can transform a CLIP model trained only on English data into a top-performing cross-lingual model.

Architecture
LLM2CLIP extends the existing CLIP framework by integrating a large language model (LLM) as a teaching mechanism for CLIP's visual encoder. The model utilizes datasets such as CC3M, CC12M, YFCC15M, and a 30M subset of Recap-DataComp-1B for pretraining. This architecture enables significant improvements in text-to-image retrieval tasks by enhancing the visual and textual embedding spaces.
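
The exact module layout is defined in the GitHub repository; the sketch below only illustrates the dual-encoder composition described above, with the fine-tuned LLM kept frozen as the text teacher while a learnable adapter and the CLIP vision tower map both modalities into a shared embedding space. The class and attribute names (LLM2CLIPSketch, text_adapter, and so on) and the dimensions are illustrative assumptions, not the repository's actual API.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LLM2CLIPSketch(nn.Module):
        """Illustrative dual encoder: frozen LLM text teacher + trainable CLIP vision tower."""

        def __init__(self, llm, vision_encoder, llm_dim=4096, vision_dim=1024, embed_dim=1280):
            super().__init__()
            self.llm = llm.eval()                      # fine-tuned LLM, frozen while teaching CLIP
            for p in self.llm.parameters():
                p.requires_grad = False
            self.vision_encoder = vision_encoder       # CLIP ViT, updated during training
            self.text_adapter = nn.Linear(llm_dim, embed_dim)    # learnable projection for LLM outputs
            self.visual_proj = nn.Linear(vision_dim, embed_dim)  # projection for visual features

        def encode_text(self, input_ids, attention_mask):
            # Mean-pool the LLM's last hidden states into a single caption embedding.
            out = self.llm(input_ids=input_ids, attention_mask=attention_mask,
                           output_hidden_states=True)
            hidden = out.hidden_states[-1]
            mask = attention_mask.unsqueeze(-1).float()
            pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1)
            return F.normalize(self.text_adapter(pooled), dim=-1)

        def encode_image(self, pixel_values):
            feats = self.vision_encoder(pixel_values)
            return F.normalize(self.visual_proj(feats), dim=-1)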

Training
The training process involves fine-tuning the LLM in the caption space with contrastive learning, extracting its textual capabilities into the output embeddings. The fine-tuned LLM then serves as a teacher for CLIP's visual encoder, allowing the model to handle longer, more complex captions and improving performance on cross-modal tasks.
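
For concreteness, the snippet below sketches the symmetric contrastive (InfoNCE-style) objective commonly used for this kind of image–caption alignment; the repository's training scripts define the actual loss, batch construction, and temperature, so the function name and defaults here are assumptions.

    import torch
    import torch.nn.functional as F

    def clip_style_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
        """Symmetric InfoNCE over an (N, D) batch of L2-normalized embeddings.

        Matching image/caption pairs share the same row index; every other row
        in the batch serves as a negative.
        """
        logits = image_embeds @ text_embeds.t() / temperature   # (N, N) similarity matrix
        targets = torch.arange(logits.size(0), device=logits.device)
        loss_i2t = F.cross_entropy(logits, targets)              # image -> text direction
        loss_t2i = F.cross_entropy(logits.t(), targets)          # text -> image direction
        return (loss_i2t + loss_t2i) / 2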

Guide: Running Locally

  1. Install Requirements: Ensure transformers, torch, and Pillow (PIL) are installed in your Python environment.
  2. Clone the Repository: Download the model files and necessary scripts from the GitHub repository.
  3. Set Up Environment:
    • Use CUDA_VISIBLE_DEVICES to specify GPU usage.
    • Load the model and processor using AutoModel and CLIPImageProcessor.
  4. Processing Images and Text: Use the processor to prepare images and the tokenizer for text captions.
  5. Inference: Use the provided scripts to perform image-to-text retrieval tasks; a minimal end-to-end sketch follows this list.
  6. Cloud GPUs: Consider using cloud services like AWS EC2, Google Cloud, or Azure for access to powerful GPUs for faster processing.
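
Putting steps 1–5 together, a minimal end-to-end sketch (after installing the requirements from step 1, e.g. pip install torch transformers pillow) might look like the following. The image path, the companion CLIP image-processor checkpoint, and the get_image_features/get_text_features helpers are assumptions made for illustration; the scripts in the GitHub repository show the exact loading and retrieval code for this checkpoint.

    import os
    import torch
    from PIL import Image
    from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

    os.environ["CUDA_VISIBLE_DEVICES"] = "0"          # step 3: pin the GPU to use
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Step 3: load the model and processor (trust_remote_code usage and the
    # image-processor checkpoint are assumptions; check the repository scripts).
    model_id = "microsoft/LLM2CLIP-Llama-3-8B-Instruct-CC-Finetuned"
    model = AutoModel.from_pretrained(model_id, torch_dtype=torch.bfloat16,
                                      trust_remote_code=True).to(device).eval()
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

    # Step 4: prepare one image and a few candidate captions.
    image = Image.open("example.jpg")                  # hypothetical local image
    captions = ["a dog playing in the snow", "a plate of pasta", "a city skyline at night"]
    pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device, torch.bfloat16)
    text_inputs = tokenizer(captions, padding=True, return_tensors="pt").to(device)

    # Step 5: retrieval by cosine similarity (assumes CLIP-style feature helpers).
    with torch.no_grad():
        image_embeds = model.get_image_features(pixel_values)
        text_embeds = model.get_text_features(**text_inputs)
        image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
        text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
        scores = (image_embeds @ text_embeds.T).squeeze(0)

    print("Best caption:", captions[scores.argmax().item()])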

License
The LLM2CLIP model is available under the Apache 2.0 license, allowing for open usage and modification in both academic and commercial settings.
