XLM-Roberta-Large-Vit-B-16Plus

M-CLIP

Introduction

The Multilingual-CLIP XLM-Roberta-Large-Vit-B-16Plus model extends OpenAI's English CLIP text encoder to multiple other languages. The checkpoint contains the multilingual text encoder along with instructions for retrieving the corresponding image model, ViT-B-16Plus. The pair can be used to extract text and image embeddings for multilingual tasks.

Architecture

The model pairs a multilingual text encoder based on XLM-Roberta-Large with the ViT-B-16Plus image encoder, which is obtained through the open_clip repository. The text encoder supports 48 languages.
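
For a quick look at how the text tower is put together, the encoder can be loaded through the multilingual_clip package (installed in the guide below) and printed. This is a minimal sketch rather than part of the official usage instructions; the printout should show the underlying XLM-Roberta-Large transformer followed by a linear projection into the shared CLIP embedding space.

    from multilingual_clip import pt_multilingual_clip
    
    # Load only the multilingual text encoder; the ViT-B-16Plus image tower
    # is obtained separately through open_clip (see the guide below).
    text_model = pt_multilingual_clip.MultilingualCLIP.from_pretrained(
        'M-CLIP/XLM-Roberta-Large-Vit-B-16Plus'
    )
    
    # Printing the module lists its submodules: the XLM-Roberta-Large encoder
    # and the final linear projection into the CLIP embedding space.
    print(text_model)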

Training

Training focused on extending the language coverage of OpenAI's original text encoders; the released model has not been extensively evaluated on specific downstream tasks. Details about the training process and datasets are available in the model card.

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install Prerequisites:

    pip install multilingual-clip
    pip install open_clip_torch
    
  2. Extract Text Embeddings:

    from multilingual_clip import pt_multilingual_clip
    import transformers
    
    # Example captions in English, Swedish, German, and Russian.
    texts = [
        'Three blind horses listening to Mozart.',
        'Älgen är skogens konung!',  # Swedish: "The moose is the king of the forest!"
        'Wie leben Eisbären in der Antarktis?',  # German: "How do polar bears live in the Antarctic?"
        'Вы знали, что все белые медведи левши?'  # Russian: "Did you know that all polar bears are left-handed?"
    ]
    model_name = 'M-CLIP/XLM-Roberta-Large-Vit-B-16Plus'
    
    model = pt_multilingual_clip.MultilingualCLIP.from_pretrained(model_name)
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
    
    # Tokenize and encode all captions in one call; one embedding per caption.
    embeddings = model.forward(texts, tokenizer)
    print("Text features shape:", embeddings.shape)
    
  3. Extract Image Embeddings (a sketch combining the text and image features for similarity scoring follows this list):

    import torch
    import open_clip
    import requests
    from PIL import Image
    
    device = "cuda" if torch.cuda.is_available() else "cpu"
    
    # Load the matching image encoder and its preprocessing transform from open_clip.
    model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-16-plus-240', pretrained="laion400m_e32")
    model.to(device)
    model.eval()  # inference mode
    
    # Sample COCO image.
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    image = preprocess(image).unsqueeze(0).to(device)
    
    with torch.no_grad():
        image_features = model.encode_image(image)
    
    print("Image features shape:", image_features.shape)
    
  4. Cloud GPUs: Consider using cloud services with GPU support such as AWS, Google Cloud, or Azure for efficient processing, especially when working with large datasets or models.
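
Both encoders project into the same embedding space, so text and image features can be compared directly with cosine similarity. The sketch below puts steps 2 and 3 together and ranks two example captions against the sample COCO image; the captions and the variable names text_model and image_model are illustrative and not taken from the model card.

    import torch
    import open_clip
    import requests
    import transformers
    from PIL import Image
    from multilingual_clip import pt_multilingual_clip
    
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model_name = 'M-CLIP/XLM-Roberta-Large-Vit-B-16Plus'
    
    # Multilingual text encoder and tokenizer (as in step 2).
    text_model = pt_multilingual_clip.MultilingualCLIP.from_pretrained(model_name)
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
    
    # Matching image encoder from open_clip (as in step 3).
    image_model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-16-plus-240', pretrained="laion400m_e32")
    image_model.to(device)
    image_model.eval()
    
    texts = [
        'Two cats sleeping on a couch.',  # illustrative caption
        'Three blind horses listening to Mozart.',
    ]
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = preprocess(Image.open(requests.get(url, stream=True).raw)).unsqueeze(0).to(device)
    
    with torch.no_grad():
        text_features = text_model.forward(texts, tokenizer).to(device)
        image_features = image_model.encode_image(image)
    
    # Normalize and score each caption against the image with cosine similarity.
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    scores = (image_features @ text_features.T).squeeze(0)
    
    for caption, score in zip(texts, scores.tolist()):
        print(f"{score:.3f}  {caption}")

Higher scores indicate captions that better match the image.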

License

The model and its components are distributed under their respective licenses; refer to the model card and the associated repositories for specific licensing details.
