clip-ViT-B-32-multilingual-v1
sentence-transformers
Introduction
The clip-ViT-B-32-multilingual-v1 is a multilingual model developed by Sentence Transformers, based on OpenAI's CLIP ViT-B/32 architecture. It maps text in over 50 languages and images into a shared dense vector space, making it suitable for tasks such as multilingual image search and zero-shot image classification.
Architecture
The model architecture consists of:
- A Transformer module using a DistilBertModel with a maximum sequence length of 128.
- A Pooling layer with mean token pooling.
- A Dense layer reducing features from 768 to 512 dimensions, with an identity activation function.
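This stack can be checked locally. A minimal sketch, assuming sentence-transformers is installed: loading the text model and printing it lists the Transformer, Pooling, and Dense modules in order.

```python
from sentence_transformers import SentenceTransformer

# Load the multilingual text encoder and print its module stack:
# Transformer (DistilBERT, max_seq_length=128) -> Pooling (mean) -> Dense (768 -> 512)
text_model = SentenceTransformer("sentence-transformers/clip-ViT-B-32-multilingual-v1")
print(text_model)
print("Output dimension:", text_model.get_sentence_embedding_dimension())  # expected: 512
```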
Training
The model was trained using Multilingual Knowledge Distillation. The teacher model is the original CLIP-ViT-B-32 text encoder, and the student model is a multilingual DistilBERT. Using parallel (translated) sentence pairs, the student is trained so that its embeddings of a sentence and its translations match the teacher's embedding of the source sentence, aligning the vector spaces across languages. Training covered parallel data for 50+ languages; because the underlying multilingual DistilBERT was pretrained on more than 100 languages, the model can also embed text in further languages, though with potentially lower quality.
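The following is a minimal sketch of this distillation setup using the sentence-transformers v2.x training utilities; the parallel corpus path (parallel-sentences.tsv, tab-separated source/translation pairs) and the hyperparameters are illustrative assumptions, not the original training configuration.

```python
from torch import nn
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Teacher: the original CLIP text encoder (fixed during distillation).
teacher = SentenceTransformer("clip-ViT-B-32")

# Student: multilingual DistilBERT + mean pooling + 768 -> 512 projection (identity activation).
word_embedding = models.Transformer("distilbert-base-multilingual-cased", max_seq_length=128)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="mean")
dense = models.Dense(
    in_features=pooling.get_sentence_embedding_dimension(),
    out_features=512,  # CLIP ViT-B/32 embedding size
    activation_function=nn.Identity(),
)
student = SentenceTransformer(modules=[word_embedding, pooling, dense])

# Parallel data: each line "source_sentence<TAB>translation"; the student is trained with an
# MSE loss to reproduce the teacher's embedding of the source sentence for both columns.
train_data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
train_data.load_data("parallel-sentences.tsv")  # hypothetical path to parallel data
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=32)
train_loss = losses.MSELoss(model=student)

student.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```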
Guide: Running Locally
- Installation: pip install -U sentence-transformers
- Usage (a sketch follows this list):
  - Import required libraries and models.
  - Load images and texts for encoding.
  - Use the model to encode images and texts.
  - Compute cosine similarities between text and image embeddings.
- Cloud GPUs: For enhanced performance, consider cloud-based GPUs such as those offered by Google Colab, AWS, or Azure (a second sketch below shows GPU placement).
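A minimal sketch of the usage steps above: images are encoded with the original clip-ViT-B-32 model and texts with the multilingual model; the image file names are hypothetical.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# Image encoder: original CLIP model. Text encoder: the multilingual model.
img_model = SentenceTransformer("clip-ViT-B-32")
text_model = SentenceTransformer("sentence-transformers/clip-ViT-B-32-multilingual-v1")

# Load and encode images (hypothetical local files).
img_embeddings = img_model.encode([Image.open("two_dogs_in_snow.jpg"), Image.open("cat.jpg")])

# Encode multilingual text queries.
texts = ["Two dogs in the snow", "Zwei Hunde im Schnee", "Un chat"]
text_embeddings = text_model.encode(texts)

# Cosine similarities: one row per text, one column per image.
cos_scores = util.cos_sim(text_embeddings, img_embeddings)
print(cos_scores)
```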
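A brief sketch of GPU placement on such an instance, assuming a CUDA device is available; the batch size is an illustrative choice.

```python
import torch
from sentence_transformers import SentenceTransformer

# Use the GPU when present (e.g. on a Colab, AWS, or Azure GPU instance), else fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
text_model = SentenceTransformer("sentence-transformers/clip-ViT-B-32-multilingual-v1", device=device)
embeddings = text_model.encode(["Ein Beispielsatz"], batch_size=64, device=device)
print(embeddings.shape)  # (1, 512)
```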
License
The model is licensed under the Apache-2.0 License, allowing for both personal and commercial use.