distiluse-base-multilingual-cased-v2
sentence-transformers

Introduction
The distiluse-base-multilingual-cased-v2 model is part of the Sentence Transformers library and maps sentences and paragraphs to a 512-dimensional dense vector space. The resulting embeddings can be used for tasks such as clustering and semantic search. The model supports 50 languages, including English, Spanish, and Chinese, among others.
Architecture
The model architecture consists of three main components:
- Transformer Layer: a DistilBertModel with a maximum sequence length of 128 tokens.
- Pooling Layer: aggregates the token embeddings into a single sentence embedding using mean pooling.
- Dense Layer: Reduces the feature dimension from 768 to 512 using a Tanh activation function.
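The three stages above can be sketched as a minimal numpy pipeline. This is illustrative only: the token count, random weights, and values are assumptions standing in for the model's actual parameters and outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the transformer output: one 768-d vector per token
# for a 10-token sentence (illustrative values, not real outputs).
token_embeddings = rng.standard_normal((10, 768))

# Pooling layer: mean over the token axis -> one 768-d sentence vector.
sentence_embedding = token_embeddings.mean(axis=0)

# Dense layer: project 768 -> 512 and apply tanh
# (random weights stand in for the trained ones).
W = rng.standard_normal((768, 512)) * 0.02
b = np.zeros(512)
output = np.tanh(sentence_embedding @ W + b)

print(output.shape)  # (512,)
```

Because of the tanh activation, every component of the final 512-dimensional vector lies in (-1, 1).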
Training
The model is based on the Sentence-BERT framework, which trains Siamese BERT-networks: the same encoder embeds both sentences of a pair, and the embeddings are compared directly. It has been evaluated on the Sentence Embeddings Benchmark, which demonstrates its effectiveness in producing meaningful sentence embeddings; detailed evaluation results are available there.
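In the Siamese setup, a common objective scores each sentence pair with cosine similarity and regresses that score toward a gold label. A minimal sketch of that scoring step, using toy vectors in place of real embeddings and an assumed gold label:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" for one sentence pair (illustrative values).
emb_a = np.array([0.2, 0.9, -0.1])
emb_b = np.array([0.25, 0.8, 0.0])

score = cosine_similarity(emb_a, emb_b)

# In Sentence-BERT's regression objective, the score is compared
# against a human-annotated similarity label via an MSE loss.
gold = 0.9  # hypothetical label for illustration
loss = (score - gold) ** 2
print(round(score, 3), round(loss, 4))
```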
Guide: Running Locally
To use the distiluse-base-multilingual-cased-v2
model locally, follow these steps:
- Install Sentence Transformers:

```bash
pip install -U sentence-transformers
```
- Load and Use the Model:

```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('sentence-transformers/distiluse-base-multilingual-cased-v2')
embeddings = model.encode(sentences)
print(embeddings)
```
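Once embeddings are computed, semantic search reduces to ranking corpus embeddings by similarity to a query embedding. A minimal sketch of that ranking step, with toy vectors standing in for `model.encode(...)` outputs (the values and dimensions below are assumptions for illustration):

```python
import numpy as np

# Toy stand-ins for encoded corpus sentences (3 sentences, 4 dims).
corpus_embeddings = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.8, 0.2, 0.0],
    [0.1, 0.1, 0.9, 0.1],
])
# Toy stand-in for the encoded query.
query_embedding = np.array([1.0, 0.0, 0.1, 0.0])

def normalize(x):
    # Scale vectors to unit length so dot product = cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

scores = normalize(corpus_embeddings) @ normalize(query_embedding)
best = int(np.argmax(scores))
print(best, scores)  # index of the most similar corpus sentence
```

With real embeddings, the same dot-product ranking works because the vectors the model produces are meaningful under cosine similarity.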
- Hardware Recommendation: For optimal performance, especially with large datasets, consider using cloud GPU services such as AWS EC2 with GPU support, Google Cloud Platform, or Azure.
License
The distiluse-base-multilingual-cased-v2
model is released under the Apache 2.0 license, allowing for both commercial and non-commercial use, modification, and distribution.