XLM-Roberta-Large-Vit-L-14

M-CLIP

Introduction

The Multilingual-CLIP model extends OpenAI's CLIP to support multiple languages. This release provides the multilingual text encoder; the matching image encoder is loaded from OpenAI's CLIP repository. The text encoder covers 48 languages, enabling cross-lingual text-image tasks such as text-to-image retrieval.

Architecture

Multilingual-CLIP combines the XLM-Roberta-Large text encoder with the ViT-L-14 image encoder. The text encoder maps multilingual text into the same embedding space that the image encoder produces for images, which enables cross-modal applications such as text-to-image retrieval.
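
To make the structure more concrete, the sketch below shows roughly how such a text tower can be assembled: XLM-Roberta-Large, mean pooling over token states, and a linear projection into CLIP's 768-dimensional embedding space. The class and attribute names are illustrative assumptions, not the library's actual code; in practice the packaged MultilingualCLIP class used in the guide below handles this.

    import torch
    import transformers

    class MultilingualTextEncoder(torch.nn.Module):
        """Illustrative sketch of the text tower: XLM-R -> mean pooling -> linear projection."""
        def __init__(self, backbone='xlm-roberta-large', clip_dim=768):
            super().__init__()
            self.transformer = transformers.AutoModel.from_pretrained(backbone)
            # Project XLM-R's hidden size down to the CLIP ViT-L/14 embedding width
            self.projection = torch.nn.Linear(self.transformer.config.hidden_size, clip_dim)

        def forward(self, texts, tokenizer):
            tokens = tokenizer(texts, padding=True, return_tensors='pt')
            hidden = self.transformer(**tokens).last_hidden_state      # [batch, seq, hidden]
            mask = tokens['attention_mask'].unsqueeze(-1)
            pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)      # average over real tokens
            return self.projection(pooled)                             # [batch, clip_dim]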

Training

Details about the training procedure and the datasets used for Multilingual-CLIP are given in the model card on the project's GitHub page. In short, the multilingual text encoder is trained on large sets of translated captions so that its sentence embeddings line up with those of the original CLIP model, while the image encoder is used as released by OpenAI.
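
As a rough illustration of that alignment objective (an assumption based on the teacher-learning setup described by the M-CLIP authors, not code from the repository), the encoder's embedding of a translated caption is pushed toward the frozen CLIP text embedding of the original English caption:

    import torch

    def alignment_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
        # Mean-squared error between the multilingual (student) embedding and the
        # frozen CLIP text (teacher) embedding of the corresponding English caption.
        return torch.nn.functional.mse_loss(student_emb, teacher_emb.detach())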

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install Required Packages:

    pip install multilingual-clip
    pip install git+https://github.com/openai/CLIP.git
    
  2. Extract Text Embeddings:

    from multilingual_clip import pt_multilingual_clip
    import transformers
    
    # Example captions in English, Swedish, German, and Russian
    texts = [
        'Three blind horses listening to Mozart.',
        'Älgen är skogens konung!',
        'Wie leben Eisbären in der Antarktis?',
        'Вы знали, что все белые медведи левши?'
    ]
    model_name = 'M-CLIP/XLM-Roberta-Large-Vit-L-14'
    
    # Load the multilingual text encoder and its tokenizer
    model = pt_multilingual_clip.MultilingualCLIP.from_pretrained(model_name)
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
    
    # Tokenize and encode all sentences in one call; returns one embedding per sentence
    embeddings = model.forward(texts, tokenizer)
    print("Text features shape:", embeddings.shape)
    
  3. Extract Image Embeddings:

    import torch
    import clip
    import requests
    from PIL import Image
    
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Load the matching CLIP image encoder and its preprocessing transform
    model, preprocess = clip.load("ViT-L/14", device=device)
    
    # Download a sample COCO image and prepare it as a single-image batch
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    image = preprocess(image).unsqueeze(0).to(device)
    
    # Encode the image without tracking gradients
    with torch.no_grad():
        image_features = model.encode_image(image)
    
    print("Image features shape:", image_features.shape)
    
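
Because both encoders target the same embedding space, the outputs of steps 2 and 3 can be compared directly. The sketch below assumes the two snippets above were run in the same Python session (so texts, embeddings, image_features, and device are still defined) and ranks the example sentences against the downloaded image by cosine similarity.

    import torch

    with torch.no_grad():
        # Bring both embedding sets onto the same device and dtype
        text_emb = embeddings.to(device).float()
        img_emb = image_features.float()

        # L2-normalize so that dot products are cosine similarities
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

        # One similarity score per caption; higher means a better match to the image
        scores = (text_emb @ img_emb.T).squeeze(-1)

    for text, score in zip(texts, scores.tolist()):
        print(f"{score:.3f}  {text}")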

Cloud GPUs such as those from AWS, Google Cloud, or Azure are recommended for optimal performance.

License

For licensing details, refer to the respective GitHub repositories of Multilingual-CLIP and OpenAI's CLIP.
