XLM-Roberta-Large-Vit-B-32

M-CLIP

Introduction

The Multilingual-CLIP XLM-Roberta-Large-Vit-B-32 model extends the capabilities of OpenAI's CLIP to multiple languages by pairing a multilingual text encoder with CLIP's image encoder. The image model used with this text encoder is ViT-B-32. The model supports 48 languages, providing text-image representations suited to multilingual contexts.

Architecture

The model architecture integrates the XLM-Roberta-Large text encoder with the ViT-B-32 image encoder. It retains the strengths of OpenAI's CLIP while adapting the text side to handle multilingual input. The text encoder maps sentences from the supported languages into the same embedding space as the image encoder's image features, enabling cross-modal tasks such as text-image matching and retrieval.
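
Because both encoders project into a shared embedding space, a caption in any supported language and an image can be scored against each other with a simple cosine similarity. The snippet below is a minimal sketch of that mechanism only: it uses random placeholder tensors in place of real encoder outputs (the guide further down shows how to obtain them) and assumes the shared space is 512-dimensional, the output size of this ViT-B-32 pairing.

    import torch
    
    # Illustration only: random tensors stand in for real encoder outputs.
    text_embedding = torch.randn(1, 512)      # one multilingual sentence
    image_embeddings = torch.randn(3, 512)    # three candidate images
    
    # L2-normalise so that the dot product equals cosine similarity.
    text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)
    image_embeddings = image_embeddings / image_embeddings.norm(dim=-1, keepdim=True)
    
    # Higher scores indicate images that better match the sentence.
    scores = text_embedding @ image_embeddings.T
    print(scores)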

Training

Training details and data specifics for the Multilingual-CLIP model are available in the extended documentation on the GitHub repository. The model's performance in tasks like text-to-image retrieval has been evaluated, showing competitive results across various languages.
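
As an illustration of how such an evaluation is typically scored (this is not the repository's evaluation code), the sketch below computes recall@k for text-to-image retrieval, assuming the text embedding at index i describes the image embedding at the same index and that both sets are already L2-normalised.

    import torch
    
    def recall_at_k(text_emb, image_emb, k=1):
        # text_emb[i] is assumed to describe image_emb[i]; both are L2-normalised.
        scores = text_emb @ image_emb.T                # (N, N) similarity matrix
        topk = scores.topk(k, dim=-1).indices          # top-k image indices per text
        targets = torch.arange(len(text_emb)).unsqueeze(-1)
        return (topk == targets).any(dim=-1).float().mean().item()
    
    # Random placeholders standing in for real encoder outputs.
    text_emb = torch.nn.functional.normalize(torch.randn(100, 512), dim=-1)
    image_emb = torch.nn.functional.normalize(torch.randn(100, 512), dim=-1)
    print(recall_at_k(text_emb, image_emb, k=5))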

Guide: Running Locally

Basic Steps

  1. Install Required Packages:
    pip install multilingual-clip
    pip install git+https://github.com/openai/CLIP.git
    
  2. Extract Text Embeddings:
    from multilingual_clip import pt_multilingual_clip
    import transformers
    
    texts = ['Example sentence in language X.']
    model_name = 'M-CLIP/XLM-Roberta-Large-Vit-B-32'
    
    # Load the multilingual text encoder and its tokenizer.
    model = pt_multilingual_clip.MultilingualCLIP.from_pretrained(model_name)
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
    
    # forward() tokenizes the sentences and returns one embedding per sentence.
    embeddings = model.forward(texts, tokenizer)
    
  3. Extract Image Features (compared against the text embeddings in the sketch after these steps):
    import torch
    import clip
    from PIL import Image
    import requests
    
    device = "cuda" if torch.cuda.is_available() else "cpu"
    
    # Load the original CLIP ViT-B/32 image encoder and its preprocessing pipeline.
    model, preprocess = clip.load("ViT-B/32", device=device)
    
    # Fetch an image (replace IMAGE_URL with a real URL) and preprocess it into a batch of one.
    image = Image.open(requests.get("IMAGE_URL", stream=True).raw)
    image = preprocess(image).unsqueeze(0).to(device)
    
    with torch.no_grad():
        image_features = model.encode_image(image)
    
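This follow-up is not part of the original steps; it is a minimal sketch showing how the `embeddings` from step 2 and the `image_features` from step 3 could be compared, assuming both land in the same 512-dimensional space. It moves both tensors to the CPU in full precision, normalises them, and computes one cosine similarity per input sentence.

    import torch
    
    # Continues from steps 2 and 3: `embeddings` holds the multilingual text
    # embeddings, `image_features` the ViT-B/32 image features (both 512-d).
    text_features = embeddings.detach().cpu().float()
    image_feats = image_features.detach().cpu().float()
    
    # Normalise so that the dot product is a cosine similarity.
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    
    # One score per input sentence; higher means a closer text-image match.
    similarity = text_features @ image_feats.T
    print(similarity)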

Suggested Cloud GPUs

Consider using cloud GPUs from providers such as AWS, GCP, or Azure to speed up inference, especially for large datasets.

License

The model and associated code are distributed under the licenses specified in the respective repositories. Ensure compliance with OpenAI's and Hugging Face's licensing terms when using the models and code.
