use-cmlm-multilingual
Introduction
The use-cmlm-multilingual model is a PyTorch implementation of the universal-sentence-encoder-cmlm/multilingual-base-br model, designed to map sentences from 109 languages into a shared vector space. It is based on LaBSE and performs well on various downstream tasks.
Architecture
The model uses the following architecture:
- A Transformer module (a BertModel) with a maximum sequence length of 256 and lower casing disabled.
- A Pooling layer that computes the mean of the token embeddings, producing 768-dimensional sentence embeddings.
- A Normalization layer that scales each sentence embedding to unit L2 norm.
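As a rough illustration (not the model's actual internals), the pooling and normalization steps can be sketched with NumPy, using hypothetical token embeddings in place of the transformer's output:

```python
import numpy as np

# Hypothetical token embeddings for one sentence: 5 tokens x 768 dims
# (in the real model these come from the BertModel transformer layer)
rng = np.random.default_rng(0)
token_embeddings = rng.standard_normal((5, 768))

# Pooling layer: mean over the token axis -> one 768-dim sentence vector
sentence_embedding = token_embeddings.mean(axis=0)

# Normalization layer: scale the vector to unit L2 norm
sentence_embedding = sentence_embedding / np.linalg.norm(sentence_embedding)

print(sentence_embedding.shape)  # (768,)
```

Because of the final normalization step, every sentence embedding the model produces lies on the unit sphere.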
Training
This model leverages the capabilities of sentence-transformers to enable multilingual sentence similarity tasks. It is designed to effectively handle sentence embedding tasks, allowing for efficient feature extraction and inference across multiple languages.
Guide: Running Locally
To use this model locally, follow these steps:
- Install the sentence-transformers library:

  pip install -U sentence-transformers

- Import and initialize the model in your Python script:

  from sentence_transformers import SentenceTransformer

  sentences = ["This is an example sentence", "Each sentence is converted"]
  model = SentenceTransformer('sentence-transformers/use-cmlm-multilingual')
  embeddings = model.encode(sentences)
  print(embeddings)
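Since the model L2-normalizes its outputs, cosine similarity between two embeddings reduces to a dot product. A minimal sketch, using hypothetical unit vectors as stand-ins for the output of model.encode:

```python
import numpy as np

# Hypothetical stand-ins for model.encode(...) output: two 768-dim
# vectors scaled to unit length, as the real model produces
rng = np.random.default_rng(1)
emb = rng.standard_normal((2, 768))
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

# With unit vectors, the dot product IS the cosine similarity
similarity = emb @ emb.T
print(similarity[0, 1])  # similarity between the two sentences
```

The same dot-product trick applies across languages, which is what makes the shared vector space useful for cross-lingual retrieval.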
For optimal performance, consider using cloud GPUs (for example on AWS, Google Cloud, or Azure) when working with large datasets or computationally intensive tasks.
License
The use-cmlm-multilingual model is released under the Apache 2.0 license, allowing users to freely use, modify, and distribute the model while complying with the license terms.