PubMedBERT Base Embeddings

NeuML

Introduction

PubMedBERT Embeddings is a model fine-tuned from PubMedBERT-base that maps sentences and paragraphs to a 768-dimensional dense vector space. It is tailored for tasks such as clustering and semantic search, particularly over medical literature. The model is fine-tuned with sentence-transformers on a dataset of PubMed title-abstract pairs.

Architecture

The model architecture includes a SentenceTransformer with the following components:

  • Transformer Model: BertModel with a maximum sequence length of 512 and do_lower_case set to false.
  • Pooling Layer: Mean pooling over token embeddings, aggregating them into a single sentence embedding (see the sketch below).
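
A minimal sketch of how such an architecture is assembled with sentence-transformers; the base checkpoint name is an assumption, since the card only states that the model derives from PubMedBERT-base:

    from sentence_transformers import SentenceTransformer, models

    # Assumed base checkpoint (not confirmed by the card)
    base = models.Transformer("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract", max_seq_length=512)
    pooling = models.Pooling(base.get_word_embedding_dimension(), pooling_mode="mean")
    model = SentenceTransformer(modules=[base, pooling])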

Training

The model was trained using the following parameters:

  • DataLoader: Batch size of 24 with a random sampler.
  • Loss Function: MultipleNegativesRankingLoss with a scale of 20.0 and cosine similarity.
  • Training Parameters: 1 epoch, evaluation every 500 steps, and the AdamW optimizer with a learning rate of 2e-05. The scheduler is WarmupLinear with 10,000 warmup steps and a weight decay of 0.01. A sketch of this setup follows the list.
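
Below is a minimal, non-authoritative sketch of this setup using the sentence-transformers fit API; the base checkpoint name and the placeholder pairs are assumptions for illustration only:

    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    # Assumed base checkpoint; the pairs below stand in for PubMed title-abstract pairs
    model = SentenceTransformer("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract")
    pairs = [("Paper title", "Paper abstract"), ("Another title", "Another abstract")]
    examples = [InputExample(texts=[title, abstract]) for title, abstract in pairs]
    dataloader = DataLoader(examples, shuffle=True, batch_size=24)

    # In-batch negatives ranking loss; cosine similarity is the default similarity function
    loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

    model.fit(
        train_objectives=[(dataloader, loss)],
        epochs=1,
        evaluation_steps=500,
        scheduler="WarmupLinear",
        warmup_steps=10000,
        optimizer_params={"lr": 2e-5},  # AdamW is the default optimizer
        weight_decay=0.01,
    )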

Guide: Running Locally

Steps

  1. Install Required Libraries:

    • txtai
    • sentence-transformers
    • transformers
    • torch
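
    For example, with pip (a typical invocation; versions are not pinned by the card):

    pip install txtai sentence-transformers transformers torch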
  2. Using txtai:

    import txtai

    # content=True stores the document text alongside the vectors
    embeddings = txtai.Embeddings(path="neuml/pubmedbert-base-embeddings", content=True)

    # Index an iterable of documents (strings or (id, text, tags) tuples)
    embeddings.index(["Example medical sentence", "Another abstract to index"])
    results = embeddings.search("query to run")
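
    # With content=True, each result is a dict with "id", "text" and "score" keys.
    # txtai also supports SQL-style queries over the index, e.g.:
    # embeddings.search("select id, text, score from txtai where similar('query to run')")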
    
  3. Using SentenceTransformers:

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("neuml/pubmedbert-base-embeddings")
    # encode returns one 768-dimensional vector per input sentence
    embeddings = model.encode(["This is an example sentence", "Each sentence is converted"])
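
    # Optional follow-up (illustrative): compare the two embeddings with cosine similarity
    from sentence_transformers import util
    score = util.cos_sim(embeddings[0], embeddings[1])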
    
  4. Using Hugging Face Transformers:

    import torch
    from transformers import AutoTokenizer, AutoModel

    def meanpooling(output, mask):
        # Average the token embeddings, weighted by the attention mask
        embeddings = output[0]
        mask = mask.unsqueeze(-1).expand(embeddings.size()).float()
        return torch.sum(embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)

    tokenizer = AutoTokenizer.from_pretrained("neuml/pubmedbert-base-embeddings")
    model = AutoModel.from_pretrained("neuml/pubmedbert-base-embeddings")

    inputs = tokenizer(["This is an example sentence", "Each sentence is converted"], padding=True, truncation=True, return_tensors="pt")

    with torch.no_grad():
        output = model(**inputs)

    # Mean pool token embeddings into fixed-size sentence embeddings
    embeddings = meanpooling(output, inputs["attention_mask"])
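
    # Each row of `embeddings` is one 768-dimensional sentence vector
    print(embeddings.shape)  # torch.Size([2, 768])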
    

Cloud GPUs

Consider using cloud GPU services like AWS EC2, Google Cloud, or Azure for efficient processing, especially for large datasets.

License

This model is licensed under the Apache-2.0 License.
