SapBERT-from-PubMedBERT-fulltext

cambridgeltl

Introduction

SapBERT is a biomedical entity representation model built on PubMedBERT. It is designed to improve semantic understanding in the biomedical domain, particularly for medical entity linking, i.e. mapping entity mentions to concepts in a biomedical ontology.

Architecture

SapBERT is initialized from the BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext checkpoint and trained on synonyms from the UMLS ontology. Its self-alignment pretraining reshapes the representation space of biomedical entities so that different surface forms of the same concept, such as synonyms, map to nearby embeddings.
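
As a quick illustration of what this aligned space provides, the hedged snippet below compares [CLS] embeddings of a synonym pair against an unrelated term; the example strings are illustrative, and the model is run on CPU for simplicity:

    import torch
    import torch.nn.functional as F
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext")
    model = AutoModel.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext")

    def embed(name):
        toks = tokenizer(name, return_tensors="pt")
        with torch.no_grad():
            return model(**toks)[0][:, 0, :]  # [CLS] embedding, shape (1, 768)

    syn = F.cosine_similarity(embed("covid-19"), embed("Coronavirus infection"))
    far = F.cosine_similarity(embed("covid-19"), embed("high fever"))
    print(float(syn), float(far))  # the synonym pair should score higher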

Training

The model is trained on the UMLS 2020AA release, which covers over four million biomedical concepts and their synonyms. Training aligns the representations of names that share a concept so that fine-grained semantic relationships are captured in the embedding space; a simplified sketch of this objective follows.
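
Conceptually, the objective pulls together names that share a UMLS concept ID (CUI) and pushes apart names of different concepts. The sketch below uses a simplified InfoNCE-style contrastive loss for illustration only; the actual SapBERT training uses a multi-similarity loss with online hard-pair mining, and the pairing scheme here is a hypothetical simplification:

    import torch
    import torch.nn.functional as F

    def alignment_loss(emb_a, emb_b, temperature=0.07):
        # emb_a[i] and emb_b[i] are embeddings of two synonyms of concept i
        emb_a = F.normalize(emb_a, dim=-1)
        emb_b = F.normalize(emb_b, dim=-1)
        logits = emb_a @ emb_b.T / temperature   # pairwise cosine similarities
        targets = torch.arange(emb_a.size(0))    # the matching synonym is the positive
        return F.cross_entropy(logits, targets)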

Guide: Running Locally

To extract embeddings using SapBERT, follow these steps:

  1. Setup Environment: Ensure you have torch, transformers, numpy, and tqdm installed.
  2. Load Model and Tokenizer:
    from transformers import AutoTokenizer, AutoModel
    # Load the tokenizer and model; .cuda() moves the model to the GPU
    # (drop .cuda() to run on CPU)
    tokenizer = AutoTokenizer.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext")
    model = AutoModel.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext").cuda()
    
  3. Prepare Input Data: Create a list of biomedical entity names you wish to process.
  4. Process Data: Encode the names in batches and feed them to the model to obtain [CLS] embeddings (a nearest-neighbour linking example follows this list).
    import numpy as np
    import torch
    from tqdm.auto import tqdm
    
    all_names = ["covid-19", "Coronavirus infection", "high fever", "Tumor of posterior wall of oropharynx"]
    bs = 128  # batch size
    all_embs = []
    for i in tqdm(np.arange(0, len(all_names), bs)):
        # Tokenize one batch of entity names, padded/truncated to 25 tokens
        toks = tokenizer.batch_encode_plus(all_names[i:i+bs], padding="max_length", max_length=25, truncation=True, return_tensors="pt")
        toks_cuda = {k: v.cuda() for k, v in toks.items()}
        with torch.no_grad():  # inference only; no gradients needed
            cls_rep = model(**toks_cuda)[0][:, 0, :]  # [CLS] token representation
        all_embs.append(cls_rep.cpu().numpy())
    
    all_embs = np.concatenate(all_embs, axis=0)
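
Once the embeddings are extracted, a typical use is nearest-neighbour entity linking: embed a query mention and retrieve the closest candidate name by cosine similarity. The sketch below continues from the variables defined in step 4 (tokenizer, model, all_names, all_embs); the query string is a made-up example.

    import numpy as np
    import torch

    query = "covid infection"  # hypothetical mention to link
    toks = tokenizer([query], padding="max_length", max_length=25, truncation=True, return_tensors="pt")
    toks_cuda = {k: v.cuda() for k, v in toks.items()}
    with torch.no_grad():
        query_emb = model(**toks_cuda)[0][:, 0, :].cpu().numpy()

    def l2_normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    # Cosine similarity between the query and every candidate name
    sims = (l2_normalize(query_emb) @ l2_normalize(all_embs).T)[0]
    print(all_names[int(sims.argmax())])  # nearest candidate name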
    

Suggestion for Cloud GPUs

For efficient processing, especially with large datasets, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.

License

SapBERT is distributed under the Apache 2.0 license, allowing for both personal and commercial use with proper attribution.
