clip-ViT-B-32-multilingual-v1

sentence-transformers

Introduction

clip-ViT-B-32-multilingual-v1 is a multilingual text model from Sentence Transformers, built to complement OpenAI's CLIP ViT-B/32. It maps text in 50+ languages into the same dense vector space as the CLIP ViT-B/32 image encoder, making it suitable for tasks such as multilingual image search and zero-shot image classification.

Architecture

The model architecture consists of:

  • A Transformer module using a DistilBertModel with a maximum sequence length of 128.
  • A Pooling layer using mean token pooling.
  • A Dense layer reducing features from 768 to 512 dimensions, with an identity activation function.
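Below is a minimal sketch of how such a module stack can be assembled with the sentence-transformers models API. The base checkpoint name is an assumption for illustration; it is not necessarily the exact configuration used to build this model.

    import torch
    from sentence_transformers import SentenceTransformer, models

    # Sketch of the stack described above: DistilBERT -> mean pooling -> Dense 768 -> 512.
    # 'distilbert-base-multilingual-cased' is an illustrative base checkpoint.
    word_emb = models.Transformer('distilbert-base-multilingual-cased', max_seq_length=128)
    pooling = models.Pooling(word_emb.get_word_embedding_dimension(), pooling_mode_mean_tokens=True)
    dense = models.Dense(
        in_features=pooling.get_sentence_embedding_dimension(),  # 768
        out_features=512,
        activation_function=torch.nn.Identity(),
    )
    model = SentenceTransformer(modules=[word_emb, pooling, dense])
    print(model)  # shows Transformer -> Pooling -> Dense modules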

Training

The model was trained with Multilingual Knowledge Distillation. The teacher is the original CLIP-ViT-B-32 text encoder, and the student is a multilingual DistilBERT. Using parallel (translated) sentence data, the student is trained so that a sentence and its translations all map close to the teacher's embedding, aligning the vector spaces across languages. Training used parallel data for 50+ languages; because the underlying multilingual DistilBERT covers 100+ languages, other languages may also work, though with reduced quality.
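The sketch below illustrates this distillation setup with the sentence-transformers MSELoss: the teacher's embedding of the English sentence serves as the regression target for both the English sentence and its translation. The parallel pairs, student base checkpoint, and hyperparameters are illustrative assumptions, not the actual training data or recipe.

    import torch
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses, models

    # Teacher: the original (English-only) CLIP text encoder.
    teacher = SentenceTransformer('clip-ViT-B-32')

    # Student: multilingual DistilBERT with a Dense head so its output matches the
    # teacher's 512-dimensional space (same stack as in the Architecture section).
    word_emb = models.Transformer('distilbert-base-multilingual-cased', max_seq_length=128)
    pooling = models.Pooling(word_emb.get_word_embedding_dimension())
    dense = models.Dense(pooling.get_sentence_embedding_dimension(), 512,
                         activation_function=torch.nn.Identity())
    student = SentenceTransformer(modules=[word_emb, pooling, dense])

    # Illustrative parallel data: (English sentence, translation) pairs.
    parallel_pairs = [
        ("A dog plays in the snow", "Ein Hund spielt im Schnee"),
        ("Two children ride their bikes", "Dos niños montan en bicicleta"),
    ]

    # The teacher embedding of the English side is the target for BOTH the English
    # sentence and its translation, pulling all languages into one shared space.
    train_examples = []
    for en, trg in parallel_pairs:
        target = teacher.encode(en)
        train_examples.append(InputExample(texts=[en], label=target))
        train_examples.append(InputExample(texts=[trg], label=target))

    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
    train_loss = losses.MSELoss(model=student)
    student.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)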

Guide: Running Locally

  1. Installation:

    pip install -U sentence-transformers
    
  2. Usage (see the sketch after this list):

    • Import the required libraries and load the image and text models.
    • Load the images and texts to be encoded.
    • Encode images with the CLIP image model and texts with this model.
    • Compute cosine similarities between the text and image embeddings.
  3. Cloud GPUs:

    • For enhanced performance, consider using cloud-based GPUs such as those offered by Google Colab, AWS, or Azure.
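The following is a minimal usage sketch. Images are encoded with the original clip-ViT-B-32 model and texts with this model, since both share one vector space; the image filename and captions are placeholders.

    from PIL import Image
    from sentence_transformers import SentenceTransformer, util

    # Image encoder (original CLIP) and multilingual text encoder share one vector space.
    img_model = SentenceTransformer('clip-ViT-B-32')
    text_model = SentenceTransformer('sentence-transformers/clip-ViT-B-32-multilingual-v1')

    # Load an example image (path is illustrative).
    image = Image.open('two_dogs_in_snow.jpg')

    texts = [
        "Two dogs in the snow",
        "Zwei Hunde im Schnee",    # German
        "Dos perros en la nieve",  # Spanish
        "Una casa junto al mar",   # Spanish, unrelated caption
    ]

    img_emb = img_model.encode([image])
    text_emb = text_model.encode(texts)

    # Cosine similarity between the image and each caption.
    cos_scores = util.cos_sim(img_emb, text_emb)
    print(cos_scores)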

License

The model is licensed under the Apache-2.0 License, allowing for both personal and commercial use.
