LLM2CLIP-EVA02-L-14-336
Introduction
LLM2CLIP is a model developed by Microsoft that extends CLIP's capabilities using large language models (LLMs). The approach fine-tunes the LLM in caption space with contrastive learning, improving the textual discriminability of its output embeddings. This transforms CLIP into a state-of-the-art cross-lingual model and enhances performance on both long-text and short-text retrieval tasks.
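The caption-space contrastive objective can be pictured as a standard symmetric InfoNCE-style loss over caption embeddings. The sketch below is illustrative only; the tensor names and temperature value are assumptions, not Microsoft's actual training code.

```python
# Illustrative only: a symmetric InfoNCE-style contrastive loss of the kind used
# to sharpen the textual discriminability of an LLM's caption embeddings.
# Tensor names and the temperature are assumptions, not LLM2CLIP's exact code.
import torch
import torch.nn.functional as F

def caption_contrastive_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """emb_a / emb_b: (batch, dim) embeddings of paired captions (or caption/image pairs)."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.t() / temperature                     # (batch, batch) similarities
    targets = torch.arange(emb_a.size(0), device=emb_a.device)   # positives on the diagonal
    # Matching pairs are positives; every other pair in the batch is a negative.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```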
Architecture
The model uses a vision foundation model as its feature backbone and is pretrained on datasets such as CC3M, CC12M, YFCC15M, and a subset of Recap-DataComp-1B. A fine-tuned LLM serves as a teacher for CLIP's visual encoder, enabling longer and more complex captions than the original CLIP text encoder can handle.
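Conceptually, the pieces fit together as in the sketch below: a vision backbone, a frozen caption-fine-tuned LLM used as the text encoder, and small projection heads mapping both into a shared embedding space. Module names and dimensions are placeholders, not the actual LLM2CLIP implementation.

```python
# Conceptual wiring of an LLM2CLIP-style model, with placeholder modules and
# dimensions; the real implementation lives in Microsoft's repository.
import torch
import torch.nn as nn

class LLM2CLIPSketch(nn.Module):
    def __init__(self, vision_backbone: nn.Module, llm_text_encoder: nn.Module,
                 vision_dim: int = 1024, text_dim: int = 4096, embed_dim: int = 1280):
        super().__init__()
        self.vision_backbone = vision_backbone        # e.g. an EVA02 ViT-L/14-336 trunk
        self.llm_text_encoder = llm_text_encoder      # caption-contrastively fine-tuned LLM
        for p in self.llm_text_encoder.parameters():  # the LLM acts as a frozen teacher
            p.requires_grad_(False)
        self.visual_proj = nn.Linear(vision_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)  # small adapter over LLM features

    def encode_image(self, pixel_values: torch.Tensor) -> torch.Tensor:
        return self.visual_proj(self.vision_backbone(pixel_values))

    def encode_text(self, token_ids: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                          # text features come from the frozen LLM
            feats = self.llm_text_encoder(token_ids)
        return self.text_proj(feats)
```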
Training
Training uses the LLM's textual capabilities, extracted through contrastive learning, to improve the CLIP model's visual encoder. The result is a substantial improvement on cross-modal tasks, outperforming the previous state-of-the-art EVA02 model by 16.5% on retrieval tasks.
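A single training step therefore looks roughly like the following, reusing the loss function and model sketch above: the frozen LLM supplies caption features, and only the visual encoder and projection heads receive gradients. This is a simplified illustration, not the released training code.

```python
# Simplified training step: the frozen LLM supplies caption features; only the
# visual side (plus projections) is updated. Names are placeholders.
import torch

def train_step(model, optimizer, pixel_values, caption_ids, temperature=0.07):
    image_emb = model.encode_image(pixel_values)   # gradients flow through the visual encoder
    text_emb = model.encode_text(caption_ids)      # LLM teacher itself stays frozen
    loss = caption_contrastive_loss(image_emb, text_emb, temperature)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```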
Guide: Running Locally
To run LLM2CLIP locally:
- Environment Setup: Ensure you have a suitable environment with PyTorch and other dependencies installed. A CUDA-enabled GPU is recommended.
- Clone Repository: Clone the GitHub repository.
- Load Model: Use the provided script to load the model, preprocess images, and encode text.
- Run Inference: Execute the inference script to process an image and compare it against candidate text captions; a minimal sketch follows this list.
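The steps above combine roughly as follows, assuming the checkpoint is published on the Hugging Face Hub as microsoft/LLM2CLIP-EVA02-L-14-336 with custom remote code. The method name get_image_features and the text-encoding path are assumptions; follow the repository's example script for the exact interface, since text features are produced by the companion fine-tuned LLM encoder.

```python
# Minimal inference sketch, assuming the checkpoint exists on the Hugging Face
# Hub under "microsoft/LLM2CLIP-EVA02-L-14-336" with trust_remote_code support.
# Method names on the loaded model are assumptions; check the repository script.
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

model_name = "microsoft/LLM2CLIP-EVA02-L-14-336"   # assumed Hub identifier
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16,
                                  trust_remote_code=True).eval().cuda()

image = Image.open("example.jpg")
pixel_values = (processor(images=image, return_tensors="pt")
                .pixel_values.to(torch.bfloat16).cuda())

with torch.no_grad():
    image_features = model.get_image_features(pixel_values)   # assumed method name
    # Text features must come from the fine-tuned LLM text encoder shipped with
    # the repository; here they are assumed to be precomputed as `text_features`.
    # probs = (image_features @ text_features.T).softmax(dim=-1)
```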
For improved performance, consider using cloud GPUs such as AWS EC2 instances with NVIDIA GPUs or Google Cloud's GPU offerings.
License
LLM2CLIP is licensed under the Apache-2.0 license, allowing for wide usage and modification under specified terms.