PubMedBERT Base Embeddings
NeuML
Introduction
PubMedBERT Embeddings is fine-tuned from PubMedBERT-base to map sentences and paragraphs to a 768-dimensional dense vector space. It is tailored for tasks such as clustering and semantic search, particularly over medical literature. The model is fine-tuned with sentence-transformers on a dataset of PubMed title-abstract pairs.
Architecture
The model architecture includes a SentenceTransformer
with the following components:
- Transformer Model: BertModel with a maximum sequence length of 512 and case sensitivity set to false.
- Pooling Layer: Configured for mean pooling over token embeddings, which aggregates the token embeddings into a single sentence embedding.
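For illustration, the sketch below shows how an equivalent Transformer plus mean-pooling stack could be assembled with sentence-transformers; in practice, loading the published model directly (see the guide below) builds this composition automatically.

```python
from sentence_transformers import SentenceTransformer, models

# Transformer module: BertModel weights, 512-token sequence limit, no extra lower-casing
word_embedding_model = models.Transformer(
    "neuml/pubmedbert-base-embeddings", max_seq_length=512, do_lower_case=False
)

# Pooling module: mean-pool token embeddings into one sentence embedding
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(), pooling_mode="mean"
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```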
Training
The model was trained with the following parameters (a comparable setup is sketched in code after the list):
- DataLoader: Batch size of 24 with a random sampler.
- Loss Function: MultipleNegativesRankingLoss with a scale of 20.0 and cosine similarity.
- Training Parameters: 1 epoch, evaluation every 500 steps, AdamW optimizer with a learning rate of 2e-05, a WarmupLinear scheduler with 10,000 warmup steps, and a weight decay of 0.01.
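The snippet below is a hedged sketch of how a comparable fine-tuning run could be set up with the classic sentence-transformers fit API; the base checkpoint and training examples shown here are assumptions, not the exact values used for this model.

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses, util

# Hypothetical title-abstract pairs; the real training data is PubMed title-abstract pairs.
train_examples = [
    InputExample(texts=["Example article title", "Example article abstract text ..."]),
    # ...
]

# Assumed PubMedBERT-base starting checkpoint; substitute the checkpoint actually used.
model = SentenceTransformer("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")

# Batch size of 24 with random sampling
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=24)

# MultipleNegativesRankingLoss with scale 20.0 and cosine similarity
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0, similarity_fct=util.cos_sim)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    evaluation_steps=500,  # only takes effect if an evaluator is supplied
    scheduler="WarmupLinear",
    warmup_steps=10000,
    optimizer_params={"lr": 2e-05},  # AdamW is the default optimizer
    weight_decay=0.01,
)
```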
Guide: Running Locally
Steps
1. Install the required libraries:
- txtai
- sentence-transformers
- transformers
- torch
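These can typically be installed with pip, for example `pip install txtai sentence-transformers transformers torch`.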
2. Using txtai:

```python
import txtai

# Create an embeddings database backed by this model
embeddings = txtai.Embeddings(path="neuml/pubmedbert-base-embeddings", content=True)

# Index the documents to search over
embeddings.index(documents())

# Run a semantic search
results = embeddings.search("query to run")
```
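Note that documents() is not defined in the snippet above; it stands for whatever iterable supplies the records to index. A hypothetical example, assuming simple (id, text, tags) tuples:

```python
# Hypothetical helper yielding records as (id, text, tags) tuples
def documents():
    data = [
        "Metformin is a first-line therapy for type 2 diabetes.",
        "Statins lower LDL cholesterol and reduce cardiovascular risk.",
    ]
    for uid, text in enumerate(data):
        yield (uid, text, None)
```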
3. Using SentenceTransformers:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("neuml/pubmedbert-base-embeddings")

# Encode sentences into 768-dimensional embeddings
embeddings = model.encode(["This is an example sentence", "Each sentence is converted"])
```
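As a follow-up, the resulting embeddings can be compared with cosine similarity using the sentence-transformers util helpers (an illustrative sketch, not part of the original example):

```python
from sentence_transformers import util

# Cosine similarity between the two example embeddings (returns a 1 x 1 tensor)
score = util.cos_sim(embeddings[0], embeddings[1])
print(score)
```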
4. Using Hugging Face Transformers:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Mean pooling: average token embeddings, weighted by the attention mask
def meanpooling(output, mask):
    embeddings = output[0]  # last hidden state
    mask = mask.unsqueeze(-1).expand(embeddings.size()).float()
    return torch.sum(embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)

tokenizer = AutoTokenizer.from_pretrained("neuml/pubmedbert-base-embeddings")
model = AutoModel.from_pretrained("neuml/pubmedbert-base-embeddings")

inputs = tokenizer(
    ["This is an example sentence", "Each sentence is converted"],
    padding=True,
    truncation=True,
    return_tensors="pt",
)

with torch.no_grad():
    output = model(**inputs)

embeddings = meanpooling(output, inputs["attention_mask"])
```
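For a quick sanity check, the mean-pooled embeddings can be compared directly with PyTorch; this mirrors the SentenceTransformers example above and is only an illustrative sketch:

```python
import torch.nn.functional as F

# Cosine similarity between the two mean-pooled sentence embeddings
similarity = F.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(similarity.item())
```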
Cloud GPUs
Consider using cloud GPU services like AWS EC2, Google Cloud, or Azure for efficient processing, especially for large datasets.
License
This model is licensed under the Apache-2.0 License.