cambridgeltl/SapBERT-from-PubMedBERT-fulltext
Introduction
SapBERT is a model for biomedical entity representation, initialized from PubMedBERT. It is designed to improve semantic understanding in the biomedical domain, particularly for tasks such as entity linking.
Architecture
SapBERT builds upon the microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext model and is trained on UMLS data. It uses self-alignment pretraining to reshape the representation space of biomedical entities, optimizing for tasks that require understanding relations between entities, such as synonymy.
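As a rough sketch of that idea (not the actual training code; the SapBERT paper uses a multi-similarity loss with online hard pair mining over UMLS synonyms), self-alignment pulls embeddings of synonymous names together and pushes non-synonyms apart:

```python
import torch
import torch.nn.functional as F

# Illustrative sketch only: random tensors stand in for encoder
# outputs; the real objective is a multi-similarity loss over
# mined UMLS synonym pairs.
anchor = torch.randn(4, 768, requires_grad=True)    # entity-name embeddings
positive = torch.randn(4, 768, requires_grad=True)  # embeddings of their synonyms

pos_sim = F.cosine_similarity(anchor, positive)                  # synonym pairs
neg_sim = F.cosine_similarity(anchor, positive.roll(1, dims=0))  # mismatched pairs

# Pull synonyms together; push mismatches below a margin of 0.5.
loss = (1 - pos_sim).mean() + F.relu(neg_sim - 0.5).mean()
loss.backward()  # in real training this would update the BERT encoder
```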
Training
The model is trained on the UMLS 2020AA release, which covers more than four million biomedical concepts. Training aligns the representations of synonymous entity names so that fine-grained semantic relationships between concepts are captured effectively.
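For intuition only (the real preprocessing follows the SapBERT paper), the training data can be pictured as (CUI, name) pairs grouped by UMLS concept, where all names sharing a CUI are treated as synonyms. The CUIs below are placeholders, not real UMLS identifiers:

```python
# Placeholder CUIs for illustration; real identifiers come from UMLS 2020AA.
umls_synonyms = [
    ("C0000001", "covid-19"),
    ("C0000001", "COVID-19 virus disease"),
    ("C0000002", "high fever"),
    ("C0000002", "hyperpyrexia"),
]
```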
Guide: Running Locally
To extract embeddings using SapBERT, follow these steps:
- Setup Environment: Ensure you have `torch`, `transformers`, and `numpy` installed.
- Load Model and Tokenizer:
```python
from transformers import AutoTokenizer, AutoModel

# Load the SapBERT checkpoint and move the model to GPU.
tokenizer = AutoTokenizer.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext")
model = AutoModel.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext").cuda()
```
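If no GPU is available, the `.cuda()` call can be replaced with device-agnostic placement; a minimal variant (assuming the same checkpoint) might look like the following. Note the batching loop below assumes a GPU, so its tensors would need to be moved with `.to(device)` accordingly.

```python
import torch

# Fall back to CPU when CUDA is unavailable.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModel.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext").to(device)
```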
- Prepare Input Data: Create a list of biomedical entity names you wish to process.
- Process Data: Batch-encode the names and feed them to the model to obtain embeddings:
```python
import numpy as np
from tqdm.auto import tqdm

all_names = ["covid-19", "Coronavirus infection", "high fever",
             "Tumor of posterior wall of oropharynx"]

bs = 128        # batch size
all_embs = []
for i in tqdm(np.arange(0, len(all_names), bs)):
    # Tokenize a batch of entity names with fixed-length padding.
    toks = tokenizer.batch_encode_plus(all_names[i:i + bs],
                                       padding="max_length",
                                       max_length=25,
                                       truncation=True,
                                       return_tensors="pt")
    toks_cuda = {k: v.cuda() for k, v in toks.items()}
    # Take the [CLS] token representation as the entity embedding.
    cls_rep = model(**toks_cuda)[0][:, 0, :]
    all_embs.append(cls_rep.cpu().detach().numpy())

all_embs = np.concatenate(all_embs, axis=0)
```
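With the embeddings in hand, a simple entity-linking step is nearest-neighbour search over the [CLS] vectors. The query string and helper below are illustrative, not part of the original guide:

```python
import torch

query = "cardiopathy"  # hypothetical mention to link
toks = tokenizer(query, return_tensors="pt")
with torch.no_grad():
    q_emb = model(**{k: v.cuda() for k, v in toks.items()})[0][:, 0, :].cpu().numpy()

# Rank candidate names by cosine similarity (normalized dot product).
def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

scores = l2_normalize(q_emb) @ l2_normalize(all_embs).T
best = int(scores[0].argmax())
print(all_names[best], float(scores[0][best]))
```

For large dictionaries, an approximate nearest-neighbour index (e.g., FAISS) would typically replace the dense matrix product.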
Suggestion for Cloud GPUs
For efficient processing, especially with large datasets, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.
License
SapBERT is distributed under the Apache 2.0 license, allowing for both personal and commercial use with proper attribution.