M-CLIP

M-BERT Base ViT-B

Introduction

M-BERT Base ViT-B is a multilingual text encoder trained to align with the embedding space of the CLIP text encoder that is paired with the ViT-B/32 vision encoder. It is based on BERT-base-multilingual, fine-tuned on 69 languages.

Architecture

The model uses a BERT-base-multilingual architecture whose output is tuned to align with the embedding space of the CLIP text encoder, and it is used together with the ViT-B/32 vision encoder. The underlying BERT model was pre-trained on 100 languages, while the alignment fine-tuning covers 69 of them.
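
The card does not show the exact module layout. As a rough illustration of this kind of text tower, the sketch below pairs a multilingual BERT backbone with a linear projection into the 512-dimensional embedding space used by CLIP ViT-B/32; the class name, pooling choice, and checkpoint identifier are assumptions, not the repository's implementation.

    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    class MultilingualTextEncoder(nn.Module):
        """Illustrative text tower: multilingual BERT plus a linear projection into CLIP space."""

        def __init__(self, clip_dim: int = 512):
            super().__init__()
            self.tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')
            self.bert = AutoModel.from_pretrained('bert-base-multilingual-cased')
            # Map BERT's 768-dim hidden states onto the CLIP ViT-B/32 embedding width.
            self.projection = nn.Linear(self.bert.config.hidden_size, clip_dim)

        def forward(self, sentences):
            batch = self.tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
            hidden = self.bert(**batch).last_hidden_state          # (batch, tokens, 768)
            mask = batch['attention_mask'].unsqueeze(-1)            # (batch, tokens, 1)
            pooled = (hidden * mask).sum(1) / mask.sum(1)           # mean over real tokens only
            return self.projection(pooled)                          # (batch, 512)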

Training

Training data pairs were created by sampling 40K sentences per language from the combined image captions of the GCC, MSCOCO, and VizWiz datasets and translating them with AWS Translate. Translation quality may therefore vary across the 69 languages.
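
The card describes only the data, not the alignment objective itself. A minimal sketch of one plausible setup, following the teacher-student recipe used in the Multilingual-CLIP project, is shown below: the multilingual student is trained to reproduce the frozen CLIP text encoder's embedding of the original English caption for each translated sentence, here with an MSE loss. Function and argument names are hypothetical.

    import torch
    import torch.nn.functional as F

    def alignment_step(student, clip_text_encoder, optimizer, english_captions, translated_captions):
        # Teacher targets: embeddings of the original English captions from the frozen CLIP text encoder.
        with torch.no_grad():
            targets = clip_text_encoder(english_captions)

        # Student predictions: embeddings of the AWS-translated captions from the multilingual encoder.
        predictions = student(translated_captions)

        # Pull the student's embedding space toward the CLIP text embedding space.
        loss = F.mse_loss(predictions, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()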

Guide: Running Locally

  1. Clone the Repository:

    • Clone the Multilingual-CLIP GitHub repository and run the remaining steps from its root directory, since the model is loaded from the repository's src package.
  2. Install Dependencies:

    • Ensure the necessary Python libraries are installed; this typically includes PyTorch and the Hugging Face transformers package.
  3. Load the Model:

    from src import multilingual_clip
    
    # Load the multilingual text encoder (run from the root of the cloned repository).
    model = multilingual_clip.load_model('M-BERT-Base-ViT')
    
    # Example sentences in Swedish, German, and Russian:
    # "The moose is the king of the forest!", "How do polar bears live in Antarctica?",
    # "Did you know that all polar bears are left-handed?"
    embeddings = model(['Älgen är skogens konung!', 'Wie leben Eisbären in der Antarktis?', 'Вы знали, что все белые медведи левши?'])
    print(embeddings.shape)
    # Expected output: torch.Size([3, 512]), one 512-dim CLIP-space embedding per sentence
    
  4. Use Cloud GPUs:

    • For optimal performance, consider using cloud-based GPUs such as those offered by AWS, GCP, or Azure.
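
Once the text encoder is loaded, its embeddings can be compared against image features from the paired CLIP ViT-B/32 vision encoder for cross-lingual image-text retrieval. The snippet below is an illustrative sketch, not part of the original card: it assumes OpenAI's clip package is installed, that it is run from the cloned repository root, and that a local image file (dog.jpg) exists.

    import clip
    import torch
    from PIL import Image
    from src import multilingual_clip

    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    # Text embedding from the multilingual encoder (Swedish: "A dog playing in the snow").
    text_model = multilingual_clip.load_model('M-BERT-Base-ViT')
    text_emb = text_model(['En hund som leker i snön'])

    # Image embedding from the paired CLIP ViT-B/32 vision encoder.
    clip_model, preprocess = clip.load('ViT-B/32', device=device)
    image = preprocess(Image.open('dog.jpg')).unsqueeze(0).to(device)
    with torch.no_grad():
        image_emb = clip_model.encode_image(image).float().cpu()

    # Cosine similarity between the multilingual caption and the image.
    print(torch.nn.functional.cosine_similarity(text_emb, image_emb))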

License

For license details, refer to the Multilingual-CLIP GitHub repository.
