M-CLIP: M-BERT Base ViT-B
Introduction
M-BERT Base ViT-B is a multilingual text encoder tuned to align with the embedding space of the CLIP text encoder that accompanies the ViT-B/32 vision encoder. It is based on a BERT-base-multilingual model fine-tuned for 69 languages.
Architecture
The text encoder uses a BERT-base-multilingual architecture, tuned so that its output aligns with the embedding space of the CLIP text encoder, and it is paired with the CLIP ViT-B/32 vision encoder. The underlying multilingual BERT is pre-trained on 100 languages, while the alignment fine-tuning covers 69 of them.
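Because the text encoder shares CLIP's embedding space, its sentence embeddings can be compared directly against image features from the CLIP ViT-B/32 vision encoder. The sketch below illustrates this idea, assuming the OpenAI clip package and the Multilingual-CLIP repository code are both installed; the image path and the example captions are placeholders, not part of the original model card.

```python
import clip                        # OpenAI CLIP package
import torch
from PIL import Image

from src import multilingual_clip  # text encoder from the Multilingual-CLIP repository

# Multilingual text encoder aligned with CLIP's embedding space.
text_model = multilingual_clip.load_model('M-BERT-Base-ViT')

# Standard CLIP ViT-B/32 vision encoder and its image preprocessing.
device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

# Encode one image and two non-English captions, then rank the captions by
# cosine similarity in the shared embedding space.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder image path
captions = ['En hund som leker i snön',   # Swedish: "A dog playing in the snow"
            'Ein Teller mit Nudeln']      # German: "A plate of noodles"

with torch.no_grad():
    image_features = clip_model.encode_image(image).float().cpu()
    text_features = text_model(captions)

similarity = torch.nn.functional.cosine_similarity(image_features, text_features)
print(similarity)  # the better-matching caption gets the higher score
```

The same pattern extends to zero-shot classification by applying a softmax over the similarities of a set of label prompts.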
Training
Training data pairs were created by sampling 40K sentences per language from the combined image captions of GCC, MSCOCO, and VizWiz and translating them with AWS Translate. Translation quality may therefore vary across the 69 languages.
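The translation step can be approximated with the AWS Translate API. The snippet below is only a sketch of the idea, not the authors' actual pipeline: the captions, language codes, and pairing scheme are illustrative assumptions, and it presumes the boto3 package plus configured AWS credentials.

```python
import boto3

# Illustrative English captions standing in for samples from GCC / MSCOCO / VizWiz.
english_captions = [
    "A dog playing in the snow.",
    "A plate of pasta on a wooden table.",
]

target_languages = ["sv", "de", "ru"]   # a few of the 69 fine-tuning languages

translate = boto3.client("translate")   # assumes AWS credentials are configured

# Build (translated sentence, original English sentence) pairs: the English
# caption is embedded by the original CLIP text encoder, and the multilingual
# encoder is tuned to map the translation to (approximately) the same point.
pairs = []
for lang in target_languages:
    for caption in english_captions:
        result = translate.translate_text(
            Text=caption,
            SourceLanguageCode="en",
            TargetLanguageCode=lang,
        )
        pairs.append((result["TranslatedText"], caption))

print(pairs[:2])
```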
Guide: Running Locally
- Clone the Repository: Download the code and the additional linear weights from the Multilingual-CLIP GitHub repository.
- Install Dependencies: Ensure you have the necessary Python libraries installed; this typically includes PyTorch and other relevant packages.
- Load the Model (a quick cross-lingual sanity check follows this list):
```python
from src import multilingual_clip

model = multilingual_clip.load_model('M-BERT-Base-ViT')
embeddings = model(['Älgen är skogens konung!',
                    'Wie leben Eisbären in der Antarktis?',
                    'Вы знали, что все белые медведи левши?'])
print(embeddings.shape)  # torch.Size([3, 512]) -- one CLIP-space embedding per sentence
```
- Use Cloud GPUs: For optimal performance, consider using cloud-based GPUs such as those offered by AWS, GCP, or Azure.
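Once the model loads, a quick way to check the multilingual alignment is to embed the same sentence in several languages and compare the embeddings; translations should land close together in the shared space. This is a minimal sketch using only the API shown above; the sentences are illustrative.

```python
import torch
from src import multilingual_clip

model = multilingual_clip.load_model('M-BERT-Base-ViT')

sentences = [
    'Two dogs are playing in the snow.',    # English
    'Zwei Hunde spielen im Schnee.',        # German translation of the same sentence
    'Två hundar leker i snön.',             # Swedish translation of the same sentence
    'A plate of pasta on a wooden table.',  # unrelated sentence for contrast
]

with torch.no_grad():
    embeddings = model(sentences)

# Pairwise cosine similarities: the three translations should score noticeably
# higher with each other than with the unrelated caption.
normed = torch.nn.functional.normalize(embeddings, dim=-1)
print(normed @ normed.T)
```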
License
For license details, refer to the Multilingual-CLIP GitHub repository.