BioBERTpt (pucpr/biobertpt-clin) - Clinical Named Entity Recognition
Introduction
BioBERTpt is a Portuguese neural language model designed for clinical named entity recognition. It is based on the BERT architecture and was trained on clinical notes and biomedical literature in Portuguese. The model aims to improve natural language processing (NLP) tasks, especially named-entity recognition (NER), in the clinical domain, using electronic health records from Brazilian hospitals.
Architecture
BioBERTpt is initialized with BERT-Multilingual-Cased and further trained on domain-specific data. This includes clinical narratives and biomedical literature, allowing the model to capture specialized vocabulary and context used in Brazilian Portuguese healthcare settings.
Training
The BioBERTpt model was built via transfer learning from a multilingual BERT model, with continued training on clinical narratives and biomedical papers. This resulted in improved performance on NER tasks for Portuguese text: evaluated on two annotated corpora, the model outperformed baseline BERT models with a 2.72% increase in F1-score on 11 of the 13 assessed entities.
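For reference, entity-level F1 is the harmonic mean of precision and recall over predicted entity spans. The sketch below illustrates the metric with purely hypothetical true-positive/false-positive/false-negative counts (not figures from the paper):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Compute F1 from entity-level counts (hypothetical example values)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for one entity type, baseline vs. domain-adapted model
baseline_f1 = f1_score(tp=80, fp=25, fn=30)
biobertpt_f1 = f1_score(tp=85, fp=20, fn=25)
print(f"baseline: {baseline_f1:.3f}, biobertpt: {biobertpt_f1:.3f}")
```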
Guide: Running Locally
To use the BioBERTpt model locally, follow these steps:
1. Install the Transformers library:

   ```bash
   pip install transformers
   ```
2. Load the model and tokenizer:

   ```python
   from transformers import AutoTokenizer, AutoModel

   tokenizer = AutoTokenizer.from_pretrained("pucpr/biobertpt-clin")
   model = AutoModel.from_pretrained("pucpr/biobertpt-clin")
   ```
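As a sketch of what the loaded checkpoint provides, the snippet below extracts contextual token embeddings for a short Portuguese clinical sentence. The example sentence is invented, and the 768-dimensional hidden size is the standard BERT-base assumption:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("pucpr/biobertpt-clin")
model = AutoModel.from_pretrained("pucpr/biobertpt-clin")

# Hypothetical clinical sentence: "Patient presented chest pain."
inputs = tokenizer("Paciente apresentou dor no peito.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per input token (BERT-base hidden size is 768)
embeddings = outputs.last_hidden_state
print(embeddings.shape)
```

These embeddings can then feed a downstream token-classification head for NER fine-tuning.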
3. Use a cloud GPU (optional):
   For efficient computation, consider cloud services such as AWS, Google Cloud, or Azure to access GPUs.
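Whether the GPU is local or cloud-hosted, a minimal device-selection sketch (assuming PyTorch is installed) looks like this:

```python
import torch

# Fall back to CPU when no CUDA device is visible
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on {device}")

# After loading, move the model once, then send each batch of inputs
# to the same device before calling the model:
# model.to(device)
# inputs = {k: v.to(device) for k, v in inputs.items()}
```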
License
This research was partially funded by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES), Finance Code 001. For usage permissions, refer to the original publication or contact the authors.