M-CLIP/XLM-Roberta-Large-Vit-L-14
Introduction
The Multilingual-CLIP model extends OpenAI's CLIP model to support multiple languages. This model provides the multilingual text encoder, while the matching image encoder can be accessed via OpenAI's CLIP repository. It supports 48 languages, enabling multilingual text-image tasks such as cross-lingual retrieval.
Architecture
Multilingual-CLIP pairs the XLM-Roberta-Large text encoder with the ViT-L-14 image encoder. The text encoder maps multilingual text into the same embedding space that the image encoder maps images into, which enables cross-modal applications such as text-to-image retrieval.
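Conceptually, the text side can be pictured as the XLM-Roberta-Large transformer followed by a pooling step and a linear projection into the 768-dimensional embedding space used by the ViT-L/14 image encoder. The sketch below only illustrates that idea; the class name, pooling choice, and layer layout are illustrative assumptions, not the multilingual_clip implementation.

  from torch import nn
  import transformers

  # Conceptual sketch only (not the multilingual_clip internals): a multilingual
  # text tower that mean-pools XLM-R token states and projects them with a
  # linear layer into the 768-dimensional CLIP ViT-L/14 embedding space.
  class MultilingualTextTower(nn.Module):
      def __init__(self, base_model="xlm-roberta-large", clip_dim=768):
          super().__init__()
          self.encoder = transformers.AutoModel.from_pretrained(base_model)
          self.projection = nn.Linear(self.encoder.config.hidden_size, clip_dim)

      def forward(self, input_ids, attention_mask):
          hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask)[0]
          mask = attention_mask.unsqueeze(-1)                    # (batch, seq, 1)
          pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean over real tokens
          return self.projection(pooled)                         # (batch, clip_dim)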
Training
Details about the training process and datasets used for Multilingual-CLIP are outlined in the model card on the project's GitHub page. In short, training leverages large multilingual caption datasets to align the text encoder with the CLIP embedding space, while the pretrained image encoder is used as-is.
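One common way to achieve this kind of alignment is teacher learning (distillation): the multilingual encoder is trained so that its embedding of a translated caption matches the embedding the original CLIP text encoder produces for the English caption. The sketch below only illustrates such an objective; the function, variable names, and data are hypothetical and not the project's training code.

  import torch
  import torch.nn.functional as F

  def alignment_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
      # MSE between the multilingual (student) embedding of a translated caption
      # and the frozen CLIP text (teacher) embedding of the English original.
      return F.mse_loss(student_emb, teacher_emb)

  # Hypothetical training step, where student(...) is the multilingual text tower
  # and teacher(...) is the frozen English CLIP text encoder:
  #   student_emb = student(translated_captions)        # shape (batch, 768)
  #   teacher_emb = teacher(english_captions).detach()  # no gradients to the teacher
  #   loss = alignment_loss(student_emb, teacher_emb)
  #   loss.backward(); optimizer.step()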
Guide: Running Locally
To run the model locally, follow these steps:
- Install Required Packages:

  pip install multilingual-clip
  pip install git+https://github.com/openai/CLIP.git
- Extract Text Embeddings:

  from multilingual_clip import pt_multilingual_clip
  import transformers

  texts = [
      'Three blind horses listening to Mozart.',
      'Älgen är skogens konung!',
      'Wie leben Eisbären in der Antarktis?',
      'Вы знали, что все белые медведи левши?'
  ]
  model_name = 'M-CLIP/XLM-Roberta-Large-Vit-L-14'

  # Load the multilingual text encoder and its tokenizer.
  model = pt_multilingual_clip.MultilingualCLIP.from_pretrained(model_name)
  tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

  # Tokenize and encode the texts into the shared embedding space.
  embeddings = model.forward(texts, tokenizer)
  print("Text features shape:", embeddings.shape)
- Extract Image Embeddings:

  import torch
  import clip
  import requests
  from PIL import Image

  device = "cuda" if torch.cuda.is_available() else "cpu"

  # Load the matching ViT-L/14 image encoder from OpenAI's CLIP package.
  model, preprocess = clip.load("ViT-L/14", device=device)

  # Download and preprocess a sample image.
  url = "http://images.cocodataset.org/val2017/000000039769.jpg"
  image = Image.open(requests.get(url, stream=True).raw)
  image = preprocess(image).unsqueeze(0).to(device)

  with torch.no_grad():
      image_features = model.encode_image(image)
  print("Image features shape:", image_features.shape)
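Once both snippets have run, the text and image features live in the same embedding space and can be compared directly. The continuation below is a sketch that reuses the variables texts, embeddings, image_features, and device from the steps above; it normalizes both feature sets, computes cosine similarities, and prints a ranking of the captions for the sample image.

  import torch.nn.functional as F

  # Bring both feature sets onto the same device and dtype before comparing
  # (CLIP returns float16 features on GPU, the text encoder returns float32).
  text_features = embeddings.to(device).to(image_features.dtype)

  text_features = F.normalize(text_features, dim=-1)
  image_features = F.normalize(image_features, dim=-1)

  # Cosine similarity between the image and every caption; higher means a better match.
  similarities = (image_features @ text_features.T).squeeze(0)
  for text, score in zip(texts, similarities.tolist()):
      print(f"{score:.3f}  {text}")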
Cloud GPUs such as those from AWS, Google Cloud, or Azure are recommended for optimal performance.
License
For licensing details, refer to the respective GitHub repositories of Multilingual-CLIP and OpenAI's CLIP.