xlm-r-base-en-ko-nli-ststb
sentence-transformers
Introduction
The sentence-transformers/xlm-r-base-en-ko-nli-ststb model is part of the Sentence Transformers library and maps sentences and paragraphs to a 768-dimensional dense vector space. It can be used for tasks such as clustering or semantic search. However, the model is deprecated because it produces low-quality sentence embeddings; users are advised to choose one of the recommended models listed on SBERT.net.
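To make the semantic-search use case concrete, the following is a minimal sketch of ranking a small corpus by cosine similarity to a query. The 3-dimensional vectors are made-up stand-ins for the 768-dimensional embeddings the model would produce, and the helper names `cosine_similarity` and `rank_by_similarity` are illustrative, not part of the library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, corpus_vecs):
    """Return (index, score) pairs, most similar first."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors (assumption): real embeddings would come from model.encode(...)
corpus_embeddings = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.7, 0.3, 0.1],
]
query_embedding = [1.0, 0.0, 0.0]

ranking = rank_by_similarity(query_embedding, corpus_embeddings)
print(ranking)  # corpus indices ordered 0, 2, 1 with their scores
```

Clustering works on the same principle: embeddings that point in similar directions are grouped together.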
Architecture
The model employs a SentenceTransformer architecture with two main components:
- Transformer: Uses an XLMRobertaModel with a maximum sequence length of 128 tokens.
- Pooling: Applies mean pooling over the token embeddings to produce a single fixed-size sentence embedding.
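As an illustration of the pooling step, the sketch below averages token vectors while skipping padding positions. It uses plain Python lists and toy 2-dimensional token embeddings, both assumptions made for readability; the real model pools 768-dimensional vectors with tensor operations. The function name `masked_mean_pool` is hypothetical.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1; ignore padding (mask 0)."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for j, value in enumerate(vector):
                totals[j] += value
    # Guard against division by zero when every position is padding
    count = max(sum(attention_mask), 1e-9)
    return [t / count for t in totals]


# Three real tokens followed by one padding token (toy values)
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 1, 0]

sentence_embedding = masked_mean_pool(tokens, mask)
print(sentence_embedding)  # [3.0, 4.0]
```

The padding vector `[100.0, 100.0]` has no effect on the result, which is why the attention mask has to be part of the averaging.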
Training
This model was trained with the Sentence Transformers framework. Training followed the Siamese network approach to generating sentence embeddings, as detailed in the publication "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks" by Nils Reimers and Iryna Gurevych.
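The Siamese idea can be sketched in a few lines: both sentences of a pair pass through the same encoder, and the cosine similarity of the two outputs is compared with a gold similarity score. The example below uses the mean-squared-error regression objective described in the Sentence-BERT paper. The letter-count `toy_encode` function is a deliberately crude, hypothetical stand-in for the transformer plus pooling, and the source does not state which objective or data were used for this particular model.

```python
import math
from collections import Counter


def toy_encode(sentence):
    """Hypothetical encoder: 26-dim letter-count vector (stands in for the real model)."""
    counts = Counter(ch for ch in sentence.lower() if "a" <= ch <= "z")
    return [float(counts.get(chr(ord("a") + i), 0)) for i in range(26)]


def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def siamese_mse_loss(encode, pairs):
    """Mean squared error between cosine(encode(a), encode(b)) and the gold score.

    The same `encode` function is applied to both sentences; that sharing
    is what makes the setup "Siamese".
    """
    errors = []
    for sent_a, sent_b, gold in pairs:
        predicted = cosine_similarity(encode(sent_a), encode(sent_b))
        errors.append((predicted - gold) ** 2)
    return sum(errors) / len(errors)


# Toy pairs with gold similarity scores in [0, 1] (made-up values)
pairs = [
    ("a cat sits", "a cat sits", 1.0),
    ("a cat sits", "xyz", 0.0),
]
loss = siamese_mse_loss(toy_encode, pairs)
print(loss)
```

In real training the encoder's weights would be updated to reduce this loss; here the encoder is fixed and the code only evaluates the objective.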
Guide: Running Locally
To run the model locally, follow these steps:
- Install the Sentence Transformers library:
    pip install -U sentence-transformers
- Load and use the model with Sentence Transformers:
    from sentence_transformers import SentenceTransformer

    sentences = ["This is an example sentence", "Each sentence is converted"]

    model = SentenceTransformer('sentence-transformers/xlm-r-base-en-ko-nli-ststb')
    embeddings = model.encode(sentences)
    print(embeddings)
- Alternatively, use Hugging Face Transformers:
    from transformers import AutoTokenizer, AutoModel
    import torch

    # Mean pooling: use the attention mask so padding tokens do not affect the average
    def mean_pooling(model_output, attention_mask):
        token_embeddings = model_output[0]  # first element holds all token embeddings
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    sentences = ['This is an example sentence', 'Each sentence is converted']

    # Load the tokenizer and model from the Hugging Face Hub
    tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/xlm-r-base-en-ko-nli-ststb')
    model = AutoModel.from_pretrained('sentence-transformers/xlm-r-base-en-ko-nli-ststb')

    # Tokenize the sentences
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings without tracking gradients
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Pool token embeddings into one embedding per sentence
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    print("Sentence embeddings:")
    print(sentence_embeddings)
- Cloud GPUs: Consider using cloud services such as AWS, Google Cloud, or Azure for GPU support to speed up processing, especially for large datasets.
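For larger datasets, whether on a local or a cloud GPU, a common pattern is to pick the device once and encode in fixed-size batches. The following is a sketch under stated assumptions: `pick_device` and `encode_in_batches` are hypothetical helper names, the batch size of 32 is arbitrary, and `model` can be any object with an `encode(list_of_sentences)` method, such as a `SentenceTransformer` instance.

```python
def pick_device():
    """Return "cuda" when PyTorch sees a GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def encode_in_batches(model, sentences, batch_size=32):
    """Encode sentences chunk by chunk and collect the embeddings in one list."""
    embeddings = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        embeddings.extend(model.encode(batch))
    return embeddings
```

With the library installed, usage might look like `model = SentenceTransformer('sentence-transformers/xlm-r-base-en-ko-nli-ststb', device=pick_device())` followed by `embeddings = encode_in_batches(model, sentences)`. `SentenceTransformer.encode` also accepts its own `batch_size` argument, so the manual loop is mainly useful when results need to be written out incrementally.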
License
The model is released under the Apache 2.0 License.