distiluse-base-multilingual-cased-v2

sentence-transformers

Introduction

The distiluse-base-multilingual-cased-v2 model is a part of the Sentence Transformers library, designed to map sentences and paragraphs to a 512-dimensional dense vector space. This can be used for tasks such as clustering and semantic search. It supports 50 languages, including English, Spanish, and Chinese, among others.

Architecture

The model architecture consists of three main components:

  • Transformer Layer: Utilizes a DistilBertModel with a maximum sequence length of 128 tokens.
  • Pooling Layer: Aggregates the token embeddings into a single fixed-size sentence vector using mean pooling.
  • Dense Layer: Reduces the feature dimension from 768 to 512 using a Tanh activation function.
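The three stages above can be sketched numerically. The snippet below is an illustrative NumPy mock-up, not the real model: the token embeddings and dense-layer weights are random stand-ins, and only the shapes (128 tokens, 768 hidden dimensions, 512 output dimensions) and operations (mean pooling, then a Tanh-activated projection) mirror the actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the model's real parameters (random, purely illustrative).
seq_len, hidden, out_dim = 128, 768, 512
token_embeddings = rng.normal(size=(seq_len, hidden))  # Transformer output
W = rng.normal(size=(hidden, out_dim)) * 0.02          # Dense layer weights
b = np.zeros(out_dim)                                  # Dense layer bias

# Pooling layer: mean over token embeddings -> one 768-d sentence vector.
pooled = token_embeddings.mean(axis=0)

# Dense layer: project 768 -> 512 with a Tanh activation.
sentence_embedding = np.tanh(pooled @ W + b)

print(sentence_embedding.shape)  # (512,)
```

Note that because of the Tanh activation, every component of the final 512-dimensional embedding lies in (-1, 1).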

Training

The model is based on the Sentence-BERT framework, which trains Siamese BERT networks to produce semantically meaningful sentence embeddings. For detailed evaluation results, refer to the Sentence Embeddings Benchmark.
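In a Siamese setup, both sentences pass through the same encoder, and embeddings are compared with cosine similarity; training pushes similar pairs toward high scores and dissimilar pairs toward low ones. The short sketch below shows the cosine comparison only, using tiny hand-picked vectors in place of real 512-dimensional embeddings.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, in [-1, 1]."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-d vectors standing in for sentence embeddings.
a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 0.0, 1.0])   # identical to a -> similarity 1.0
c = np.array([-1.0, 0.5, 0.0])  # points elsewhere -> low similarity

print(cosine_similarity(a, b))  # 1.0
print(cosine_similarity(a, c))
```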

Guide: Running Locally

To use the distiluse-base-multilingual-cased-v2 model locally, follow these steps:

  1. Install Sentence Transformers:
    pip install -U sentence-transformers
    
  2. Load and Use the Model:
    from sentence_transformers import SentenceTransformer
    
    sentences = ["This is an example sentence", "Each sentence is converted"]
    
    model = SentenceTransformer('sentence-transformers/distiluse-base-multilingual-cased-v2')
    embeddings = model.encode(sentences)  # array of shape (2, 512)
    print(embeddings)
    
  3. Hardware Recommendation: For optimal performance, especially with large datasets, consider using cloud GPU services such as AWS EC2 with GPU support, Google Cloud Platform, or Azure.
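Once sentences are encoded, semantic search reduces to ranking corpus embeddings by cosine similarity against a query embedding. The sketch below does this with plain NumPy; the 4-dimensional vectors are toy stand-ins for the 512-dimensional output of `model.encode`.

```python
import numpy as np

def semantic_search(query_emb, corpus_embs):
    """Rank corpus rows by cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    scores = c @ q                     # cosine similarity per corpus entry
    return np.argsort(-scores), scores  # best match first

# Toy 4-d embeddings standing in for the model's 512-d output.
corpus = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
])
query = np.array([1.0, 0.05, 0.0, 0.0])

ranking, scores = semantic_search(query, corpus)
print(ranking)  # indices ordered from most to least similar
```

In practice, replace the toy arrays with `model.encode(...)` outputs; normalizing once and using a matrix product keeps the search fast even for large corpora.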

License

The distiluse-base-multilingual-cased-v2 model is released under the Apache 2.0 license, allowing for both commercial and non-commercial use, modification, and distribution.
