paraphrase mpnet base v2
sentence-transformers
Introduction
paraphrase-mpnet-base-v2 is a model from the Sentence-Transformers library that maps sentences and paragraphs to a 768-dimensional dense vector space. The resulting embeddings can be used for tasks such as clustering and semantic search.
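To make the semantic search use case concrete, here is a minimal sketch that ranks candidates by cosine similarity to a query. The three-dimensional vectors are made-up stand-ins for the model's 768-dimensional embeddings, and cosine_similarity and rank_by_similarity are illustrative helper names, not part of the library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, corpus_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real sentence embeddings (assumption).
corpus = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.3, 0.1],
]
query = [1.0, 0.0, 0.0]

ranking = rank_by_similarity(query, corpus)
print(ranking)
```

With real embeddings, the same ranking step would run over the vectors returned by the model instead of the toy lists.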
Architecture
The model is based on the MPNet architecture and consists of two main components:
- Transformer: Configured with a maximum sequence length of 512 and lowercasing disabled (do_lower_case=False), so the casing of the input is preserved.
- Pooling Layer: Utilizes mean pooling to generate sentence embeddings from token embeddings.
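As a small worked example of the pooling step, the sketch below averages token vectors while skipping padded positions. It uses plain Python lists and a hypothetical helper name, mean_pool, purely for illustration; the actual model operates on tensors.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                totals[j] += value
    count = max(count, 1)  # avoid division by zero for an all-padding input
    return [t / count for t in totals]


# Two real tokens followed by one padding token (toy values, assumption).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)  # [2.0, 3.0]
```

The padded third vector has no effect on the result, which is why the attention mask is passed to the pooling step.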
Training
The model follows the Sentence-BERT approach described in the paper "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks" by Reimers and Gurevych. Sentence-BERT fine-tunes a pretrained transformer in a Siamese setup, in which the same network encodes both sentences of a pair, so that the resulting embeddings can be compared directly, for example with cosine similarity.
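The Siamese idea can be sketched in a few lines: both sentences pass through the same encoder, and training pushes the cosine similarity of the two outputs toward a target score. The sketch below uses the cosine-similarity regression objective described in the Sentence-BERT paper; this section does not say which objective this particular model was trained with, and toy_encoder is a hypothetical stand-in for the transformer plus pooling.

```python
import math


def toy_encoder(sentence, weights):
    """Hypothetical encoder: vowel counts scaled by the shared weights."""
    text = sentence.lower()
    return [w * text.count(ch) for ch, w in zip("aeiou", weights)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def siamese_loss(sentence_a, sentence_b, target, weights):
    """Squared error between the predicted cosine similarity and the target."""
    # The same weights are used for both branches; this sharing is what
    # makes the setup Siamese.
    u = toy_encoder(sentence_a, weights)
    v = toy_encoder(sentence_b, weights)
    return (cosine(u, v) - target) ** 2


shared_weights = [1.0, 1.0, 1.0, 1.0, 1.0]
loss_same = siamese_loss("a cat sat", "a cat sat", target=1.0, weights=shared_weights)
print(loss_same)
```

In real training the loss would be backpropagated through the shared encoder; here the weights stay fixed and only the forward computation is shown.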
Guide: Running Locally
To use the model locally, follow these steps:
- Install the Sentence-Transformers library:

    pip install -U sentence-transformers
- Use with Sentence-Transformers:

    from sentence_transformers import SentenceTransformer

    sentences = ["This is an example sentence", "Each sentence is converted"]

    # Load the pretrained model (downloaded on first use)
    model = SentenceTransformer('sentence-transformers/paraphrase-mpnet-base-v2')

    # Encode each sentence into a 768-dimensional vector
    embeddings = model.encode(sentences)
    print(embeddings)
- Use with Hugging Face Transformers (without the Sentence-Transformers library, so pooling is applied manually):

    from transformers import AutoTokenizer, AutoModel
    import torch


    # Mean pooling: average the token embeddings, using the attention mask
    # so that padding tokens do not contribute
    def mean_pooling(model_output, attention_mask):
        token_embeddings = model_output[0]  # first element holds all token embeddings
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


    sentences = ['This is an example sentence', 'Each sentence is converted']

    # Load the tokenizer and model from the Hugging Face Hub
    tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-mpnet-base-v2')
    model = AutoModel.from_pretrained('sentence-transformers/paraphrase-mpnet-base-v2')

    # Tokenize the sentences
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings without tracking gradients
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Pool the token embeddings into one embedding per sentence
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    print("Sentence embeddings:")
    print(sentence_embeddings)
- Cloud GPUs: For more demanding tasks, consider using a cloud GPU service such as AWS EC2, Google Cloud, or Azure for faster processing.
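When a GPU is available, locally or on a cloud instance, encoding can be moved onto it. The following is a minimal sketch, assuming PyTorch and Sentence-Transformers are installed; pick_device and encode_sentences are hypothetical helper names, and the batch size of 32 is an arbitrary starting point rather than a recommendation.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def encode_sentences(sentences, batch_size=32):
    """Encode sentences on the best available device."""
    # Imported lazily so that pick_device() can be used on its own.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(
        "sentence-transformers/paraphrase-mpnet-base-v2",
        device=pick_device(),
    )
    return model.encode(sentences, batch_size=batch_size, show_progress_bar=False)


print(pick_device())
```

Calling encode_sentences(["This is an example sentence"]) downloads the model on first use. Larger batch sizes tend to help on a GPU until memory becomes the limit, so the value is worth tuning for the hardware at hand.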
License
The paraphrase-mpnet-base-v2 model is licensed under the Apache License 2.0, which permits use, modification, and distribution subject to the conditions stated in the license.