SapBERT-UMLS-2020AB-all-lang-from-XLMR

cambridgeltl

Introduction

SapBERT is a model that produces domain-specialized representations for cross-lingual biomedical entity linking. It is trained on the Unified Medical Language System (UMLS) 2020AB release and uses xlm-roberta-base as its base model, making it well suited to tasks that require multilingual understanding in the biomedical domain.

Architecture

SapBERT builds on xlm-roberta-base, a transformer pre-trained on text in 100 languages. The final-layer embedding of the [CLS] token (position 0 of the output) serves as the representation of the input entity name, so feature extraction requires only a single forward pass.
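
As a minimal sketch, the [CLS] vector can be read directly off the model's first output (model and tokenizer loading mirrors the guide below):

    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR")
    model = AutoModel.from_pretrained("cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR")

    toks = tokenizer("covid-19", return_tensors="pt")
    cls_embedding = model(**toks)[0][:, 0, :]  # first token of the last hidden state
    print(cls_embedding.shape)  # torch.Size([1, 768]) for xlm-roberta-base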

Training

SapBERT is trained on the UMLS 2020AB release, a large-scale resource of biomedical concepts and their synonyms across many languages. Training aligns the representations of synonymous entity names, and starting from a multilingual pre-trained transformer lets that alignment carry across languages.
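
The full training pipeline is available in the SapBERT GitHub repository. As a simplified illustration of the self-alignment idea (synonyms of the same UMLS concept are pulled together in embedding space), the sketch below uses an InfoNCE-style contrastive loss; the synonym pairs, the loss choice, and the temperature are illustrative assumptions, not the exact training objective:

    import torch
    import torch.nn.functional as F
    from transformers import AutoTokenizer, AutoModel

    # start from the multilingual base model, as SapBERT does
    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    model = AutoModel.from_pretrained("xlm-roberta-base")

    # hypothetical synonym pairs sharing a UMLS concept; real training mines these from UMLS 2020AB
    pairs = [("covid-19", "COVID19 infection"), ("high fever", "hyperpyrexia")]

    def embed(names):
        toks = tokenizer(names, padding=True, truncation=True, max_length=25, return_tensors="pt")
        return model(**toks)[0][:, 0, :]  # [CLS] embeddings

    emb_a = embed([a for a, _ in pairs])
    emb_b = embed([b for _, b in pairs])

    # InfoNCE-style objective: each name's nearest neighbour in the batch should be its own synonym
    logits = F.normalize(emb_a, dim=-1) @ F.normalize(emb_b, dim=-1).T / 0.05  # temperature assumed
    labels = torch.arange(len(pairs))
    loss = F.cross_entropy(logits, labels)
    loss.backward()  # an optimizer step over the model parameters would follow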

Guide: Running Locally

To run SapBERT locally, follow these steps:

  1. Install Dependencies: Ensure you have Python, PyTorch, and the Transformers library installed (for example, pip install torch transformers numpy tqdm; numpy and tqdm are used by the embedding script in step 4).

  2. Load the Model and Tokenizer:

    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR")
    # .cuda() moves the model to the GPU; omit it to run on CPU
    model = AutoModel.from_pretrained("cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR").cuda()
    
  3. Prepare the Data: Replace all_names with your list of entity names.

  4. Extract Embeddings: Use the script below to convert entity names into embeddings (a nearest-neighbour linking sketch follows after this list):

    import numpy as np
    import torch
    from tqdm import tqdm

    # batch size during inference
    bs = 128
    # your list of entity names
    all_names = ["covid-19", "Coronavirus infection", "high fever", "Tumor of posterior wall of oropharynx"]

    all_embs = []
    for i in tqdm(np.arange(0, len(all_names), bs)):
        # tokenize the current batch of names
        toks = tokenizer.batch_encode_plus(all_names[i:i+bs],
                                           padding="max_length",
                                           max_length=25,
                                           truncation=True,
                                           return_tensors="pt")
        toks_cuda = {k: v.cuda() for k, v in toks.items()}
        # the [CLS] token of the last hidden state is the entity embedding
        with torch.no_grad():
            cls_rep = model(**toks_cuda)[0][:, 0, :]
        all_embs.append(cls_rep.cpu().numpy())

    all_embs = np.concatenate(all_embs, axis=0)
    
  5. Cloud GPUs: For optimal performance, especially for large datasets, using a cloud GPU service like AWS, Azure, or Google Cloud is recommended.
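
Once all_embs has been computed (step 4), linking a query mention to its closest dictionary name is a nearest-neighbour search over the embeddings. A minimal sketch using cosine similarity, reusing the tokenizer, model, all_names, and all_embs from the guide; the query string is a hypothetical example:

    import numpy as np
    import torch

    # embed a single query mention with the same model
    query = "fever"  # hypothetical query mention
    toks = tokenizer([query], padding="max_length", max_length=25,
                     truncation=True, return_tensors="pt")
    toks_cuda = {k: v.cuda() for k, v in toks.items()}
    with torch.no_grad():
        query_emb = model(**toks_cuda)[0][:, 0, :].cpu().numpy()

    # cosine similarity between the query and every dictionary entry
    norm = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = norm(query_emb) @ norm(all_embs).T
    best = int(sims.argmax())
    print(all_names[best])  # closest dictionary name to the query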

License

For more details about licensing, refer to the SapBERT GitHub repository. The UMLS dataset's licensing terms must also be considered when using the model.
