clip-ViT-B-32-multilingual-v1
sentence-transformers
Introduction
The clip-ViT-B-32-multilingual-v1 is a multilingual model developed by Sentence Transformers, based on OpenAI's CLIP ViT-B/32 architecture. It maps text in over 50 languages and images into a shared dense vector space, making it suitable for tasks such as multilingual image search and zero-shot image classification.
Architecture
The model architecture consists of:
- A Transformer module using a DistilBertModel with a maximum sequence length of 128.
- A Pooling layer with mean token pooling.
- A Dense layer reducing features from 768 to 512 dimensions, with an identity activation function.
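This stack can be checked locally. A minimal sketch, assuming sentence-transformers is installed: loading the text model and printing it lists the Transformer, Pooling, and Dense modules in order.

```python
from sentence_transformers import SentenceTransformer

# Load the multilingual text encoder and print its module stack:
# Transformer (DistilBERT, max_seq_length=128) -> Pooling (mean) -> Dense (768 -> 512)
text_model = SentenceTransformer("sentence-transformers/clip-ViT-B-32-multilingual-v1")
print(text_model)
print("Output dimension:", text_model.get_sentence_embedding_dimension())  # expected: 512
```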
Training
The model was trained using Multilingual Knowledge Distillation. The teacher model is the original CLIP-ViT-B-32 text encoder, and the student model is a multilingual DistilBERT. Using parallel (translated) sentence pairs, the student is trained so that its embeddings of a sentence and its translations match the teacher's embedding of the source sentence, aligning the vector spaces across languages. Training covered parallel data for 50+ languages; because the underlying multilingual DistilBERT was pretrained on more than 100 languages, the model can also embed text in further languages, though with potentially lower quality.
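The following is a minimal sketch of this distillation setup using the sentence-transformers v2.x training utilities; the parallel corpus path (parallel-sentences.tsv, tab-separated source/translation pairs) and the hyperparameters are illustrative assumptions, not the original training configuration.

```python
from torch import nn
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Teacher: the original CLIP text encoder (fixed during distillation).
teacher = SentenceTransformer("clip-ViT-B-32")

# Student: multilingual DistilBERT + mean pooling + 768 -> 512 projection (identity activation).
word_embedding = models.Transformer("distilbert-base-multilingual-cased", max_seq_length=128)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="mean")
dense = models.Dense(
    in_features=pooling.get_sentence_embedding_dimension(),
    out_features=512,  # CLIP ViT-B/32 embedding size
    activation_function=nn.Identity(),
)
student = SentenceTransformer(modules=[word_embedding, pooling, dense])

# Parallel data: each line "source_sentence<TAB>translation"; the student is trained with an
# MSE loss to reproduce the teacher's embedding of the source sentence for both columns.
train_data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
train_data.load_data("parallel-sentences.tsv")  # hypothetical path to parallel data
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=32)
train_loss = losses.MSELoss(model=student)

student.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```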
Guide: Running Locally
- Installation: pip install -U sentence-transformers
- Usage (a sketch follows this list):
  - Import required libraries and models.
  - Load images and texts for encoding.
  - Use the model to encode images and texts.
  - Compute cosine similarities between text and image embeddings.
- Cloud GPUs: For enhanced performance, consider cloud-based GPUs such as those offered by Google Colab, AWS, or Azure (a second sketch below shows GPU placement).
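A minimal sketch of the usage steps above: images are encoded with the original clip-ViT-B-32 model and texts with the multilingual model; the image file names are hypothetical.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# Image encoder: original CLIP model. Text encoder: the multilingual model.
img_model = SentenceTransformer("clip-ViT-B-32")
text_model = SentenceTransformer("sentence-transformers/clip-ViT-B-32-multilingual-v1")

# Load and encode images (hypothetical local files).
img_embeddings = img_model.encode([Image.open("two_dogs_in_snow.jpg"), Image.open("cat.jpg")])

# Encode multilingual text queries.
texts = ["Two dogs in the snow", "Zwei Hunde im Schnee", "Un chat"]
text_embeddings = text_model.encode(texts)

# Cosine similarities: one row per text, one column per image.
cos_scores = util.cos_sim(text_embeddings, img_embeddings)
print(cos_scores)
```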
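A brief sketch of GPU placement on such an instance, assuming a CUDA device is available; the batch size is an illustrative choice.

```python
import torch
from sentence_transformers import SentenceTransformer

# Use the GPU when present (e.g. on a Colab, AWS, or Azure GPU instance), else fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
text_model = SentenceTransformer("sentence-transformers/clip-ViT-B-32-multilingual-v1", device=device)
embeddings = text_model.encode(["Ein Beispielsatz"], batch_size=64, device=device)
print(embeddings.shape)  # (1, 512)
```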
License
The model is licensed under the Apache-2.0 License, allowing for both personal and commercial use.