use-cmlm-multilingual

sentence-transformers

Introduction

The use-cmlm-multilingual model is a PyTorch port of the universal-sentence-encoder-cmlm/multilingual-base-br model and maps sentences from 109 languages into a shared vector space. It is based on LaBSE and performs well on a range of downstream tasks.

Architecture

The model uses the following architecture:

  • A Transformer layer wrapping a BertModel, with a maximum sequence length of 256 and no lowercasing.
  • A Pooling layer that computes the mean of the token embeddings, producing 768-dimensional sentence embeddings.
  • A Normalize layer that scales each sentence embedding to unit length.
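
The pooling and normalization steps can be illustrated in isolation. Below is a minimal NumPy sketch using dummy token embeddings; in the real model these would come from the BertModel transformer layer, and the shapes and values here are purely illustrative:

```python
import numpy as np

# Dummy batch: 2 sentences, 4 tokens each, 768-dim token embeddings
# (in the real model these are produced by the BertModel layer).
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(2, 4, 768))

# Attention mask: 1 for real tokens, 0 for padding.
attention_mask = np.array([[1, 1, 1, 0],
                           [1, 1, 1, 1]])

# Mean pooling: average the token embeddings, ignoring padded positions.
mask = attention_mask[:, :, None]               # shape (2, 4, 1)
summed = (token_embeddings * mask).sum(axis=1)  # shape (2, 768)
counts = mask.sum(axis=1)                       # shape (2, 1)
sentence_embeddings = summed / counts

# Normalize layer: scale each sentence embedding to unit L2 norm.
normalized = sentence_embeddings / np.linalg.norm(
    sentence_embeddings, axis=1, keepdims=True)

print(normalized.shape)  # (2, 768)
```

After normalization, every sentence embedding has length 1, which is what makes dot-product similarity equivalent to cosine similarity downstream.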

Training

This model uses the sentence-transformers framework for multilingual sentence-similarity tasks: sentences in any of the supported languages are embedded into the same vector space, enabling efficient feature extraction, inference, and cross-lingual similarity comparison.

Guide: Running Locally

To use this model locally, follow these steps:

  1. Install the sentence-transformers library:

    pip install -U sentence-transformers
    
  2. Import and initialize the model in your Python script:

    from sentence_transformers import SentenceTransformer
    
    # Sentences to embed
    sentences = ["This is an example sentence", "Each sentence is converted"]
    
    # Load the model (weights are downloaded on first use)
    model = SentenceTransformer('sentence-transformers/use-cmlm-multilingual')
    
    # Compute one embedding per sentence and print the result
    embeddings = model.encode(sentences)
    print(embeddings)
    
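Because the model L2-normalizes its outputs, cosine similarity between two embeddings reduces to a plain dot product. The sketch below shows the computation with made-up unit vectors standing in for the output of model.encode above:

```python
import numpy as np

# Stand-in for model.encode(...) output: two unit-normalized 768-dim vectors.
rng = np.random.default_rng(1)
embeddings = rng.normal(size=(2, 768))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# For unit-length vectors, cosine similarity is just the dot product.
similarity_matrix = embeddings @ embeddings.T
print(similarity_matrix)
```

Each diagonal entry is 1.0 (every sentence is identical to itself); off-diagonal entries give the similarity between sentence pairs.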

For large datasets or latency-sensitive workloads, consider running the model on a GPU, for example a cloud GPU instance from AWS, Google Cloud, or Azure.

License

The use-cmlm-multilingual model is released under the Apache 2.0 license, allowing users to freely use, modify, and distribute the model while complying with the license terms.
