M-CLIP M-BERT-Distil-40 Model
Introduction
The M-BERT-Distil-40 model is a multilingual text encoder for CLIP, covering 40 languages. It is a distilbert-base-multilingual model fine-tuned so that its embedding space aligns with the original CLIP text encoder, and it is intended to be paired with CLIP's ResNet-50x4 vision encoder.
Architecture
The model is built on the distilbert-base-multilingual architecture and fitted into the CLIP framework by aligning its text embeddings with the visual representations produced by CLIP's ResNet-50x4 image encoder; its 640-dimensional output matches that encoder's image embeddings. It supports 40 languages, a subset of the roughly 100 languages covered during the multilingual pre-training phase.
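As an illustration of how the two towers fit together, the sketch below pairs the multilingual text encoder with CLIP's ResNet-50x4 image encoder and ranks captions against an image by cosine similarity. It assumes the OpenAI clip package is installed alongside the repository code; the photo.jpg path and the captions are placeholders, and the multilingual_clip call mirrors the snippet in the guide further down.

    import torch
    import clip
    from PIL import Image
    from src import multilingual_clip

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Vision tower: CLIP's ResNet-50x4 image encoder (640-d image embeddings).
    clip_model, preprocess = clip.load("RN50x4", device=device)

    # Text tower: M-BERT-Distil-40 (640-d multilingual text embeddings).
    text_model = multilingual_clip.load_model('M-BERT-Distil-40')

    image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
    captions = ['Älgen är skogens konung!',               # Swedish
                'Wie leben Eisbären in der Antarktis?']   # German

    with torch.no_grad():
        image_emb = clip_model.encode_image(image).float().cpu()
        text_emb = text_model(captions)

    # Both towers embed into the same 640-d space, so cosine similarity ranks the captions.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    print(image_emb @ text_emb.T)  # shape [1, 2]: one score per caption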
Training
Training data was created by sampling 40k sentences from datasets such as GCC, MSCOCO, and VizWiz and machine-translating them into each target language with AWS Translate, yielding roughly 40k sentence pairs per language. The model was then fine-tuned so that its embeddings of the translated sentences match the embeddings produced by the original CLIP text encoder. The quality of the machine translations has not been systematically analyzed.
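The repository contains the actual training code; the sketch below only illustrates the teacher-student idea under some assumptions: the original CLIP text encoder acts as a frozen teacher on the English sentence, a multilingual DistilBERT student (with mean pooling and a linear projection, both assumed here) encodes the translated sentence, and an MSE loss pulls the two embeddings together. The example sentence pair is hypothetical.

    import torch
    import torch.nn as nn
    import clip
    from transformers import AutoModel, AutoTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Frozen teacher: the original CLIP text encoder (640-d outputs for RN50x4).
    teacher, _ = clip.load("RN50x4", device=device)
    for p in teacher.parameters():
        p.requires_grad_(False)

    # Student: multilingual DistilBERT plus an (assumed) linear projection into CLIP's text space.
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")
    student = AutoModel.from_pretrained("distilbert-base-multilingual-cased").to(device)
    projection = nn.Linear(768, 640).to(device)

    optimizer = torch.optim.Adam(list(student.parameters()) + list(projection.parameters()), lr=1e-5)
    loss_fn = nn.MSELoss()

    english = ["A moose standing in a forest clearing."]  # original caption (hypothetical)
    translated = ["Älgen står i en skogsglänta."]         # its machine translation (hypothetical)

    # Teacher embedding of the English sentence.
    with torch.no_grad():
        target = teacher.encode_text(clip.tokenize(english).to(device)).float()

    # Student embedding of the translated sentence (mean pooling is an assumption).
    tokens = tokenizer(translated, return_tensors="pt", padding=True, truncation=True).to(device)
    pooled = student(**tokens).last_hidden_state.mean(dim=1)
    prediction = projection(pooled)

    # Pull the student's embedding toward the teacher's.
    loss = loss_fn(prediction, target)
    loss.backward()
    optimizer.step()
    print(loss.item())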
Guide: Running Locally
To run the model locally:

- Clone the Repository: Download the necessary code and additional model weights from the Multilingual-CLIP GitHub repository.
- Install Dependencies: Ensure you have the required libraries, such as PyTorch and any other dependencies listed in the repository.
- Load the Model: Use the following snippet to load the text encoder and compute embeddings (a short similarity follow-up appears after this list):

      from src import multilingual_clip

      # Load the multilingual text encoder.
      model = multilingual_clip.load_model('M-BERT-Distil-40')

      # Embed sentences written in three different languages.
      embeddings = model(['Älgen är skogens konung!',                  # Swedish: "The moose is the king of the forest!"
                          'Wie leben Eisbären in der Antarktis?',      # German: "How do polar bears live in the Antarctic?"
                          'Вы знали, что все белые медведи левши?'])   # Russian: "Did you know that all polar bears are left-handed?"

      print(embeddings.shape)  # Yields: torch.Size([3, 640])

- Cloud GPU Suggestion: For optimal performance on large batches, consider using cloud GPU services such as AWS EC2, Google Cloud Platform, or Azure.
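Once embeddings are computed, they can be L2-normalized and compared with cosine similarity; since all sentences share the same 640-dimensional space, semantically related sentences in different languages should score higher against each other. The short sketch below only assumes the embeddings tensor produced by the snippet above:

    import torch.nn.functional as F

    # L2-normalize each 640-d embedding, then compute pairwise cosine similarities.
    normalized = F.normalize(embeddings, dim=-1)
    similarity = normalized @ normalized.T
    print(similarity)  # 3x3 matrix; higher values indicate closer meaning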
License
For licensing information, refer to the GitHub repository where the model's code and resources are hosted.