cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR
Introduction
SapBERT is a model developed to generate domain-specialized representations suitable for cross-lingual biomedical entity linking. It is trained using the Unified Medical Language System (UMLS) dataset and employs xlm-roberta-base as the foundational model. This model is especially useful for tasks requiring multilingual understanding in the biomedical domain.
Architecture
SapBERT uses xlm-roberta-base as its base model, a transformer pre-trained for multilingual tasks. The [CLS] token is used as the representation of the input, yielding one fixed-size vector per entity name for downstream tasks.
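As a quick illustration, the [CLS] embedding can be read off the first position of the model's last hidden state. This is a minimal sketch, assuming `tokenizer` and `model` are loaded as in the guide below (without the `.cuda()` call, for a CPU-only example):

```python
import torch

inputs = tokenizer("covid-19", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
cls_embedding = outputs.last_hidden_state[:, 0, :]  # (1, hidden_size) [CLS] vector
```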
Training
SapBERT is trained on the UMLS 2020AB release, a comprehensive resource of biomedical concepts and their names in many languages. Starting from a multilingual pre-trained transformer, training aligns the representations of names that refer to the same concept, which is what equips the model for cross-lingual entity linking.
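Concretely, self-alignment pretraining treats names that share a UMLS concept ID (CUI) as positive pairs and names of different concepts as negatives, training with a metric-learning objective (a multi-similarity loss in the SapBERT paper). The sketch below is an illustrative implementation of that loss, not the repository's training code; the hyperparameter values are placeholders:

```python
import torch

def multi_similarity_loss(embs, cuis, alpha=2.0, beta=50.0, margin=0.5):
    # embs: (N, d) L2-normalised name embeddings; cuis: (N,) integer concept labels
    sim = embs @ embs.t()                          # pairwise cosine similarities
    same = cuis.unsqueeze(0) == cuis.unsqueeze(1)  # True where two names share a CUI
    pos = same.clone()
    pos.fill_diagonal_(False)                      # a name is not its own positive
    neg = ~same
    pos_term = torch.log1p((torch.exp(-alpha * (sim - margin)) * pos.float()).sum(1)) / alpha
    neg_term = torch.log1p((torch.exp(beta * (sim - margin)) * neg.float()).sum(1)) / beta
    return (pos_term + neg_term).mean()
```

The actual training additionally mines hard positive and negative pairs within each minibatch; see the SapBERT repository for the full procedure.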
Guide: Running Locally
To run SapBERT locally, follow these steps:
- Install Dependencies: Ensure you have Python, PyTorch, and the Transformers library installed (the example script below also uses NumPy and tqdm).
- Load the Model and Tokenizer:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR")
model = AutoModel.from_pretrained("cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR").cuda()
```

The `.cuda()` call assumes a GPU is available; drop it to run on CPU.
- Prepare the Data: Replace `all_names` with your list of entity names.
- Extract Embeddings: Use the following script to convert entity names into embeddings (a nearest-neighbour linking sketch follows these steps):

```python
import numpy as np
import torch
from tqdm import tqdm

bs = 128  # batch size during inference
# your list of entity names
all_names = ["covid-19", "Coronavirus infection", "high fever",
             "Tumor of posterior wall of oropharynx"]

all_embs = []
for i in tqdm(np.arange(0, len(all_names), bs)):
    # tokenize and encode one batch
    toks = tokenizer.batch_encode_plus(all_names[i:i + bs],
                                       padding="max_length",
                                       max_length=25,
                                       truncation=True,
                                       return_tensors="pt")
    toks_cuda = {k: v.cuda() for k, v in toks.items()}
    with torch.no_grad():
        cls_rep = model(**toks_cuda)[0][:, 0, :]  # [CLS] representation
    all_embs.append(cls_rep.cpu().numpy())

all_embs = np.concatenate(all_embs, axis=0)
```
- Cloud GPUs: For optimal performance, especially with large datasets, a cloud GPU service such as AWS, Azure, or Google Cloud is recommended.
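With the embeddings computed, entity linking reduces to nearest-neighbour search against the embedded names. This is a minimal sketch, assuming `tokenizer`, `model`, `all_names`, and `all_embs` from the steps above; the query string is a made-up mention:

```python
import numpy as np
import torch

def embed(names):
    # single-batch version of the encoding loop above
    toks = tokenizer.batch_encode_plus(names, padding="max_length", max_length=25,
                                       truncation=True, return_tensors="pt")
    toks_cuda = {k: v.cuda() for k, v in toks.items()}
    with torch.no_grad():
        return model(**toks_cuda)[0][:, 0, :].cpu().numpy()

query_emb = embed(["covid19 infection"])   # hypothetical input mention
scores = query_emb @ all_embs.T            # dot-product similarity against all names
top3 = scores[0].argsort()[::-1][:3]       # indices of the three closest names
print([all_names[i] for i in top3])
```

Normalising the embeddings first turns the dot product into cosine similarity; either ranking works for a quick check.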
License
For more details about licensing, refer to the SapBERT GitHub repository. The UMLS dataset's licensing terms must also be considered when using the model.