RoBERTa-base Biomedical Clinical ES

PlanTL-GOB-ES

Introduction
The RoBERTa-base Biomedical Clinical ES model is a Spanish-language model designed for biomedical and clinical text. It uses the RoBERTa architecture pretrained on a specialized corpus, making it suitable out of the box for masked language modeling and, with fine-tuning, for downstream NLP tasks such as Named Entity Recognition (NER) and text classification.

Architecture
This model uses the RoBERTa architecture, a transformer encoder pretrained with a robustly optimized BERT-style masked language modeling objective. Tokenization relies on Byte-Pair Encoding (BPE) with a vocabulary of 52,000 tokens.
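
As a quick check of the tokenizer described above, the vocabulary size can be inspected directly. A minimal sketch, assuming the checkpoint is published under the PlanTL-GOB-ES namespace on the Hugging Face Hub:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/roberta-base-biomedical-clinical-es")

    # The BPE vocabulary described above should report 52,000 entries.
    print(tokenizer.vocab_size)  # expected: 52000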

Training
The model was pretrained on a curated Spanish biomedical-clinical corpus assembled from sources such as clinical notes, medical publications, and patents. After cleaning to ensure quality, the corpus contained over 1 billion tokens. Pretraining ran on 16 NVIDIA V100 GPUs for 48 hours, using the Adam optimizer with a peak learning rate of 0.0005.
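
The exact pretraining script is not part of this card. For orientation, the following is a minimal sketch of a comparable masked-language-modeling setup using the Hugging Face Trainer; the batch size, epoch count, and toy corpus are assumptions, and only the peak learning rate and vocabulary size come from the card:

    from datasets import Dataset
    from transformers import (
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        RobertaConfig,
        RobertaForMaskedLM,
        Trainer,
        TrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/roberta-base-biomedical-clinical-es")

    # RoBERTa-base configuration with the 52,000-token vocabulary described above.
    config = RobertaConfig(vocab_size=52000)
    model = RobertaForMaskedLM(config)

    # Toy stand-in corpus; the real corpus is the >1B-token biomedical-clinical collection.
    texts = ["El paciente presenta hipertensión arterial.", "Se pauta tratamiento antibiótico."]
    dataset = Dataset.from_dict({"text": texts}).map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
        batched=True,
        remove_columns=["text"],
    )

    # Dynamic masking, as in the original RoBERTa pretraining recipe.
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

    args = TrainingArguments(
        output_dir="roberta-base-biomedical-clinical-es-pretrain",
        learning_rate=5e-4,              # peak learning rate reported in the card
        per_device_train_batch_size=8,   # assumption: batch size is not given in the card
        num_train_epochs=1,              # assumption: the card reports wall-clock time, not epochs
    )

    # Trainer's default optimizer is AdamW, close to the Adam setup reported above.
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=dataset,
        data_collator=collator,
    )
    trainer.train()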

Guide: Running Locally

  1. Install the Transformers library by Hugging Face: pip install transformers
  2. Load the model and tokenizer, then run the fill-mask pipeline:
    from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

    tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/roberta-base-biomedical-clinical-es")
    model = AutoModelForMaskedLM.from_pretrained("PlanTL-GOB-ES/roberta-base-biomedical-clinical-es")

    # Pass the tokenizer explicitly; it is required when a model object is given.
    unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)

    # Spanish: "The only personal history worth noting was the arterial <mask>."
    # The model should rank "hipertensión" (hypertension) among its top fills.
    result = unmasker("El único antecedente personal a reseñar era la <mask> arterial.")
    print(result)  # a list of dicts with 'score', 'token_str', and 'sequence'
    
  3. For enhanced performance, consider using cloud GPUs such as those available from AWS, Google Cloud, or Azure.
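
As noted in the introduction, the checkpoint can also serve as the backbone for fine-tuning on downstream tasks such as NER. A minimal sketch of attaching a token-classification head; the entity label set here is hypothetical and must be replaced with the tags of an actual annotated dataset:

    from transformers import AutoModelForTokenClassification, AutoTokenizer

    # Hypothetical tag set, for illustration only.
    labels = ["O", "B-DISEASE", "I-DISEASE"]

    tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/roberta-base-biomedical-clinical-es")
    model = AutoModelForTokenClassification.from_pretrained(
        "PlanTL-GOB-ES/roberta-base-biomedical-clinical-es",
        num_labels=len(labels),
        id2label=dict(enumerate(labels)),
        label2id={label: i for i, label in enumerate(labels)},
    )

    # The classification head is randomly initialized; the model must be
    # fine-tuned on labelled data before its entity predictions are meaningful.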

License
The model is distributed under the Apache License 2.0, allowing for wide usage and modification with proper attribution.
