raynardj/ner-disease-ncbi-bionlp-bc5cdr-pubmed
Introduction
The ner-disease-ncbi-bionlp-bc5cdr-pubmed model is designed for token classification, specifically recognizing disease mentions in biomedical text. It is built on a RoBERTa architecture pre-trained on PubMed and fine-tuned on the NCBI Disease and BC5CDR datasets for use in bioinformatics applications.
Architecture
The model leverages the RoBERTa architecture, pre-trained on PubMed data. It is specifically tailored for Named Entity Recognition (NER) tasks, classifying tokens into predefined categories, such as "Disease" and "O" (Other). The labels do not include prefixes like 'B-' or 'I-', simplifying the classification process.
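To verify the label set locally, the model configuration can be inspected directly. The snippet below is a minimal sketch, assuming the Hub repository id given in this card.

```python
from transformers import AutoConfig

# Load only the configuration to inspect the label mapping;
# the repository id is the one referenced throughout this card.
config = AutoConfig.from_pretrained("raynardj/ner-disease-ncbi-bionlp-bc5cdr-pubmed")
print(config.id2label)  # expected to contain labels such as "Disease" and "O"
```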
Training
The model was trained on the NCBI Disease and BC5CDR datasets. It uses a token classification approach to identify disease names within text, starting from a RoBERTa model pre-trained on PubMed literature.
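As a quick illustration of the kind of training data involved, the NCBI Disease corpus can be browsed through the datasets library. The dataset id below is an assumption about how the corpus is mirrored on the Hugging Face Hub, not a statement about how this particular model was trained.

```python
from datasets import load_dataset

# "ncbi_disease" is assumed to be the Hub id of the NCBI Disease corpus;
# each example provides whitespace-split tokens and token-level NER tags.
ncbi = load_dataset("ncbi_disease", split="train")
example = ncbi[0]
print(example["tokens"])
print(example["ner_tags"])  # integer tags marking disease mentions
```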
Guide: Running Locally
To use the model locally, follow these steps:
- Install the Transformers library: ensure the transformers library is installed.

  ```bash
  pip install transformers
  ```
- Load the pre-trained model:

  ```python
  from transformers import pipeline

  PRETRAINED = "raynardj/ner-disease-ncbi-bionlp-bc5cdr-pubmed"
  ner = pipeline(task="ner", model=PRETRAINED, tokenizer=PRETRAINED)
  ```
- Run the NER pipeline:

  ```python
  results = ner("Your text", aggregation_strategy="first")
  ```
- Clean and organize the outputs (an illustrative entity_table sketch follows this list):

  ```python
  import pandas as pd
  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained(PRETRAINED)

  # Define the cleaning function as provided in the original model card,
  # then use the entity_table function to get a structured DataFrame.
  entity_table(ner)(YOUR_VERY_CONTENTFUL_TEXT)
  ```
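The entity_table helper referenced above is defined in the original model card and is not reproduced here. The following is an illustrative sketch of how such a helper could be written, assuming the standard fields returned by the token-classification pipeline (entity_group, score, word, start, end).

```python
import pandas as pd

def entity_table(ner_pipeline):
    """Wrap an NER pipeline so it returns a pandas DataFrame of entities.

    Illustrative sketch only; the helper in the original model card may differ.
    """
    def run(text):
        # aggregation_strategy="first" merges sub-word pieces into whole words
        results = ner_pipeline(text, aggregation_strategy="first")
        # Each entry is a dict with entity_group, score, word, start, end
        return pd.DataFrame(results)
    return run

# Example (hypothetical input text):
# df = entity_table(ner)("Patients with cystic fibrosis were enrolled in the study.")
```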
For enhanced performance, consider using cloud GPU services like AWS, Google Cloud, or Azure to accelerate processing.
License
The model is released under the Apache 2.0 License, which allows for free use, modification, and distribution, provided that the conditions of the license are met.