ner gene dna rna jnlpba pubmed
raynardj
Introduction
The ner-gene-dna-rna-jnlpba-pubmed model is a token classification model designed for Named Entity Recognition (NER) tasks, specifically to identify genes, gene products, and related biological entities. It is built on the RoBERTa architecture, pretrained on PubMed data, and fine-tuned with the JNLPBA dataset.
Architecture
The model uses the RoBERTa transformer architecture with a token classification head: each token in the input is assigned one of a fixed set of classes, such as DNA, RNA, protein, cell line, or cell type.
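To make the per-token decision concrete, here is a minimal sketch of how a token classifier turns one score vector per token into a label. The label order in `ID2LABEL`, the example tokens, and the scores are all hand-written assumptions for illustration; the real mapping is stored in the model's configuration and real scores come from the network.

```python
import math

# Hypothetical label order, for illustration only.
ID2LABEL = ["O", "DNA", "RNA", "cell_line", "cell_type", "protein"]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_tokens(tokens, logits):
    """Pick the highest-probability class for every token."""
    results = []
    for token, row in zip(tokens, logits):
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((token, ID2LABEL[best], probs[best]))
    return results


# Made-up scores, one row per token and one column per class.
tokens = ["IL-2", "gene", "expression"]
logits = [
    [0.1, 3.2, 0.3, 0.0, 0.2, 1.0],
    [0.2, 2.9, 0.1, 0.1, 0.0, 0.4],
    [4.0, 0.3, 0.1, 0.0, 0.1, 0.2],
]

for token, label, prob in classify_tokens(tokens, logits):
    print(f"{token}\t{label}\t{prob:.2f}")
```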
Training
The model was fine-tuned on the JNLPBA dataset, starting from a PubMed-pretrained RoBERTa model. The label mapping covers the categories DNA, RNA, protein, cell_line, and cell_type, each with a numerical identifier. The 'B-' and 'I-' prefixes in the dataset's labels are dropped to simplify the classification task, so the beginning and the continuation of an entity share one label.
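The prefix handling described above can be sketched as follows. This is an illustration and not the author's preprocessing code: the helper names `strip_bio_prefix` and `build_label2id` are hypothetical, the `O` tag for tokens outside any entity is assumed from the usual BIO scheme, and the ids produced here need not match the model's actual mapping.

```python
def strip_bio_prefix(tag):
    """Drop a leading 'B-' or 'I-' so both map to the bare entity class."""
    if tag.startswith(("B-", "I-")):
        return tag[2:]
    return tag


def build_label2id(tags):
    """Assign ids to the prefix-free labels, reserving 0 for 'O' (assumed)."""
    labels = sorted({strip_bio_prefix(t) for t in tags} - {"O"})
    mapping = {"O": 0}
    for index, label in enumerate(labels, start=1):
        mapping[label] = index
    return mapping


# Hand-written BIO tags covering the five entity classes.
bio_tags = [
    "B-DNA", "I-DNA", "O", "B-protein", "I-protein",
    "B-cell_type", "B-cell_line", "B-RNA",
]

label2id = build_label2id(bio_tags)
print(label2id)
print([label2id[strip_bio_prefix(t)] for t in bio_tags])
```

With the prefixes removed, the classifier chooses among six classes instead of eleven, at the cost of no longer marking where one entity ends and an adjacent entity of the same class begins.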
Guide: Running Locally
To run the model locally, follow these steps:
- Install the necessary Python libraries, including transformers and pandas.
- Load the model and tokenizer using the transformers library:
    from transformers import pipeline
    PRETRAINED = "raynardj/ner-gene-dna-rna-jnlpba-pubmed"
    ner = pipeline(task="ner", model=PRETRAINED, tokenizer=PRETRAINED)
- Use the model for NER tasks with the text of your choice:
    ner("Your text", aggregation_strategy="first")
- To enhance the output readability, use the provided clean_output and entity_table functions.
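As an illustration of what post-processing the pipeline output can look like, the sketch below flattens aggregated results into plain rows. It is not the clean_output or entity_table function mentioned above: `to_entity_rows` and its `min_score` parameter are hypothetical, and the sample entities are hand-written in the shape that the transformers NER pipeline returns when an aggregation strategy is set (keys `entity_group`, `score`, `word`, `start`, `end`), not real model output.

```python
def to_entity_rows(entities, min_score=0.0):
    """Turn aggregated NER results into flat rows, dropping low-score hits."""
    rows = []
    for ent in entities:
        score = float(ent["score"])
        if score < min_score:
            continue
        rows.append({
            "entity": ent["word"].strip(),
            "label": ent["entity_group"],
            "score": round(score, 3),
            "start": ent["start"],
            "end": ent["end"],
        })
    return rows


text = "The IL-2 gene is expressed in T cells"

# Hand-written sample, not produced by the model.
sample = [
    {"entity_group": "DNA", "score": 0.98, "word": " IL-2 gene", "start": 4, "end": 13},
    {"entity_group": "cell_type", "score": 0.41, "word": " T cells", "start": 30, "end": 37},
]

for row in to_entity_rows(sample, min_score=0.5):
    print(row)
```

The resulting list of dictionaries can be passed directly to `pandas.DataFrame(rows)` when a tabular view is preferred.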
For faster training and inference, a GPU is recommended, for example through a cloud service such as AWS EC2, Google Cloud Platform, or Azure.
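Whether the machine is local or in the cloud, the pipeline only uses a GPU if it is told to. The sketch below shows one way to select the device, assuming a PyTorch backend; `pick_device` and `build_ner` are hypothetical helper names, and the imports are done lazily so the selection logic also runs where torch or transformers is not installed.

```python
def pick_device():
    """Return 0 (first CUDA GPU) when one is usable, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        return -1  # no PyTorch available, fall back to CPU
    return 0 if torch.cuda.is_available() else -1


def build_ner(model_name="raynardj/ner-gene-dna-rna-jnlpba-pubmed"):
    """Create the NER pipeline on the selected device (downloads the model)."""
    from transformers import pipeline

    return pipeline(
        task="ner",
        model=model_name,
        tokenizer=model_name,
        device=pick_device(),
    )


print("device index:", pick_device())
```

Calling `ner = build_ner()` then gives a pipeline that can be used exactly as in the steps above.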
License
The model is distributed under the Apache 2.0 License, allowing for both personal and commercial use with minimal restrictions.