ner-gene-dna-rna-jnlpba-pubmed

raynardj

Introduction

The ner-gene-dna-rna-jnlpba-pubmed model is a token classification model designed for Named Entity Recognition (NER) tasks, specifically to identify genes, gene products, and related biological entities. It is built on the RoBERTa architecture, pretrained on PubMed data, and fine-tuned with the JNLPBA dataset.

Architecture

The model uses the RoBERTa transformer architecture. A token classification pipeline assigns each token to one of the predefined classes, such as DNA, RNA, protein, cell line, and cell type.
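As a rough illustration of what token classification means, the sketch below picks the highest-scoring class for each token. The label order, the "O" (non-entity) class, and the scores are all made up for this example and are not taken from the model; the real mapping is stored in the model's configuration.

```python
# Hypothetical label order, for illustration only; the real model's
# id-to-label mapping may differ.
LABELS = ["O", "DNA", "RNA", "protein", "cell_line", "cell_type"]


def classify_tokens(tokens, scores):
    """Pair each token with the label whose score is highest."""
    predictions = []
    for token, row in zip(tokens, scores):
        best = max(range(len(row)), key=row.__getitem__)
        predictions.append((token, LABELS[best]))
    return predictions


# Made-up scores: one row per token, one column per label.
tokens = ["IL-2", "gene", "expression"]
scores = [
    [0.1, 2.3, 0.2, 0.9, 0.0, 0.1],
    [0.2, 1.8, 0.1, 0.4, 0.0, 0.0],
    [3.0, 0.1, 0.1, 0.2, 0.0, 0.1],
]
predictions = classify_tokens(tokens, scores)
```

In practice the transformers pipeline performs this step internally, so the function above is only a conceptual aid.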

Training

The model was fine-tuned on the JNLPBA dataset, starting from a PubMed-pretrained RoBERTa model. The label mapping covers categories such as DNA, RNA, protein, cell_line, and cell_type, each with a numerical identifier. The 'B-' and 'I-' prefixes are dropped from the data labels to simplify the classification task.
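The prefix removal described above can be sketched as follows. This is a minimal illustration, not the author's preprocessing code: the function name strip_bio_prefix, the example tags, and the numeric ids are assumptions, and the ids shown do not reflect the model's actual label mapping.

```python
def strip_bio_prefix(tag):
    """Drop a leading 'B-' or 'I-' from a tag; leave other tags unchanged."""
    if tag.startswith(("B-", "I-")):
        return tag[2:]
    return tag


# Example tags in BIO format (made up for illustration).
bio_tags = ["B-DNA", "I-DNA", "O", "B-protein", "B-cell_line", "I-cell_line"]
plain_tags = [strip_bio_prefix(t) for t in bio_tags]

# Illustrative numeric ids, assigned in order of first appearance.
# The model's real label-to-id mapping may use different numbers.
label2id = {}
for tag in plain_tags:
    label2id.setdefault(tag, len(label2id))
```

Collapsing 'B-DNA' and 'I-DNA' into a single 'DNA' class reduces the number of classes the model has to separate, at the cost of no longer marking where one entity ends and an adjacent entity of the same type begins.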

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install the necessary Python libraries, including transformers and pandas.
  2. Load the model and tokenizer using the transformers library:
    from transformers import pipeline

    # Hugging Face Hub ID of the model; the same ID is used for the tokenizer
    PRETRAINED = "raynardj/ner-gene-dna-rna-jnlpba-pubmed"
    ner = pipeline(task="ner", model=PRETRAINED, tokenizer=PRETRAINED)
    
  3. Use the model for NER tasks with the text of your choice:
    # "first" groups sub-word tokens into words, tagging each word by its first token
    ner("Your text", aggregation_strategy="first")
    
  4. To make the output easier to read, use the provided clean_output and entity_table functions.
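As a rough idea of what post-processing the pipeline output can look like, the sketch below turns aggregated results into plain rows. It is not the clean_output or entity_table code mentioned in step 4; summarize_entities, its min_score parameter, and the sample data are hypothetical. The only assumption about transformers is that an aggregated "ner" pipeline returns dictionaries with the keys entity_group, score, word, start, and end.

```python
def summarize_entities(entities, min_score=0.0):
    """Keep entities scoring at least min_score and return simplified rows."""
    rows = []
    for ent in entities:
        if ent["score"] < min_score:
            continue
        rows.append({
            "entity": ent["entity_group"],
            "text": ent["word"].strip(),  # remove stray leading whitespace
            "start": ent["start"],
            "end": ent["end"],
        })
    return rows


# Made-up output in the shape produced with an aggregation strategy,
# for the text "The IL-2 gene is active in T cells".
sample_output = [
    {"entity_group": "DNA", "score": 0.98, "word": " IL-2 gene", "start": 4, "end": 13},
    {"entity_group": "cell_type", "score": 0.41, "word": " T cells", "start": 27, "end": 34},
]
rows = summarize_entities(sample_output, min_score=0.5)
```

With the real model, the result of ner("Your text", aggregation_strategy="first") would take the place of sample_output, and the returned rows can be passed to pandas.DataFrame for a tabular view.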

For efficient training and inference, using cloud GPU services such as AWS EC2, Google Cloud Platform, or Azure is recommended.

License

The model is distributed under the Apache 2.0 License, allowing for both personal and commercial use with minimal restrictions.
