distilbert-base-nli-mean-tokens
Introduction
The distilbert-base-nli-mean-tokens model is part of the Sentence Transformers library. It maps sentences and paragraphs to a 768-dimensional dense vector space, which is useful for tasks such as clustering and semantic search. Note, however, that this model is deprecated because it produces low-quality sentence embeddings; recommended alternatives are listed at SBERT.net.
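As a quick illustration of the embedding space, here is a minimal sketch that compares two sentences by cosine similarity using the library's util.cos_sim helper; the example sentences are invented for illustration.

```python
from sentence_transformers import SentenceTransformer, util

# Load the (deprecated) model; see SBERT.net for stronger alternatives.
model = SentenceTransformer('sentence-transformers/distilbert-base-nli-mean-tokens')

# Encode two sentences into 768-dimensional vectors.
embeddings = model.encode(["A man is eating food.", "A man is eating a piece of bread."])

# Cosine similarity between the two vectors; higher means more similar.
print(util.cos_sim(embeddings[0], embeddings[1]))
```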
Architecture
The model architecture consists of a Transformer component based on DistilBertModel with a maximum sequence length of 128, followed by a pooling layer that applies mean pooling over the token embeddings to produce a sentence embedding.
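A minimal sketch of an equivalent two-module pipeline, assembled with the library's models API (the distilbert-base-uncased backbone is an assumption here, used only to illustrate the structure):

```python
from sentence_transformers import SentenceTransformer, models

# Transformer module: DistilBERT backbone, truncating inputs at 128 tokens.
# The base checkpoint name is an assumption for this sketch.
word_embedding_model = models.Transformer('distilbert-base-uncased', max_seq_length=128)

# Pooling module: mean-pool the token embeddings into one 768-dimensional vector.
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode='mean',
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
print(model)  # Transformer(max_seq_length=128) followed by Pooling(mean)
```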
Training
This model was trained by the Sentence Transformers team as described in the publication "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks". Training uses a Siamese network structure to produce sentence embeddings tailored to semantic similarity tasks.
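For orientation, the SBERT paper trains on NLI sentence pairs with a softmax classification head over the paired sentence embeddings. The following is a rough sketch of that setup using the library's legacy fit API; the starting checkpoint, the two example pairs, and their labels are all invented for illustration and are not the actual training data.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Hypothetical starting checkpoint; mean pooling is added automatically.
model = SentenceTransformer('distilbert-base-uncased')

# Toy NLI-style pairs with integer labels for the three NLI classes
# (e.g. contradiction / entailment / neutral).
train_examples = [
    InputExample(texts=['A man is eating.', 'A man is starving.'], label=0),
    InputExample(texts=['A man is eating.', 'A person consumes food.'], label=1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# Softmax head over the paired sentence embeddings, as in the SBERT paper.
train_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```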
Guide: Running Locally
To run the distilbert-base-nli-mean-tokens model locally, follow these steps:
- Install Sentence Transformers:

```
pip install -U sentence-transformers
```
- Load and Use the Model:

```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('sentence-transformers/distilbert-base-nli-mean-tokens')
embeddings = model.encode(sentences)
print(embeddings)
```
- Alternative Using Hugging Face Transformers:

```python
from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: average the token embeddings, using the attention mask
# so that padding tokens are ignored.
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element holds all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


sentences = ['This is an example sentence', 'Each sentence is converted']

# Load the model and tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/distilbert-base-nli-mean-tokens')
model = AutoModel.from_pretrained('sentence-transformers/distilbert-base-nli-mean-tokens')

# Tokenize, run the model, then mean-pool the token embeddings.
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
For better performance, consider using cloud GPUs like those offered by AWS, GCP, or Azure.
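As a small sketch of GPU usage, the device argument of SentenceTransformer selects where the model runs (the 'cuda' device name assumes an NVIDIA GPU with a CUDA-enabled PyTorch build):

```python
import torch
from sentence_transformers import SentenceTransformer

# Use a GPU when one is available, otherwise fall back to the CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = SentenceTransformer('sentence-transformers/distilbert-base-nli-mean-tokens', device=device)
embeddings = model.encode(["This is an example sentence"], batch_size=32)
```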
License
The model is released under the Apache 2.0 License. Use of the model should adhere to the terms outlined in this license.