ner-chemical-bionlp-bc5cdr-pubmed
by raynardj

Introduction
The NER-CHEMICAL-BIONLP-BC5CDR-PUBMED model is designed for token classification tasks, specifically for recognizing chemical entities in biomedical texts. It is based on a RoBERTa model pretrained on PubMed articles and fine-tuned using the BioNLP and BC5CDR datasets. This model is optimized for extracting chemical-related information, aiding in bioinformatics research.
Architecture
The model utilizes the RoBERTa architecture, a variant of the BERT model, known for its robust performance on natural language processing tasks. It is trained to classify tokens into predefined categories relevant to chemical and bioinformatics research. The model simplifies token classification by eliminating the 'B-', 'I-' prefixes in its labels, supporting a streamlined entity recognition process.
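The simplified label scheme can be illustrated with a minimal sketch; the tag names below are illustrative examples, not the model's actual label vocabulary:

```python
# Minimal sketch: collapsing BIO-style tags into prefix-free labels.
# The tag names are illustrative, not the model's actual label set.

def strip_bio_prefix(tag: str) -> str:
    """Drop a leading 'B-' or 'I-' marker, leaving the bare entity label."""
    if tag.startswith(("B-", "I-")):
        return tag[2:]
    return tag

bio_tags = ["O", "B-Chemical", "I-Chemical", "O", "B-Chemical"]
plain_tags = [strip_bio_prefix(t) for t in bio_tags]
print(plain_tags)  # ['O', 'Chemical', 'Chemical', 'O', 'Chemical']
```

Dropping the prefixes means the model only has to distinguish entity types, not span boundaries, at the cost of not separating adjacent entities of the same type.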
Training
The model was fine-tuned on the BioNLP and BC5CDR datasets, focusing on identifying chemical entities. It uses a customized training approach that avoids computing cross-entropy loss on trailing subword tokens: only the first subword of each word receives a label, which improves prediction accuracy for chemical names and terms.
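The first-subword labeling trick can be sketched as follows. The hard-coded `word_ids` list stands in for a fast tokenizer's `word_ids()` output, and -100 is the index that PyTorch's cross-entropy loss ignores by default:

```python
# Sketch of first-subword labeling. Only the first subword of each word
# keeps the word's label; special tokens and trailing subwords get -100,
# which cross-entropy loss ignores by default.

IGNORE_INDEX = -100

def align_labels(word_labels, word_ids):
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:                  # special tokens like <s> / </s>
            aligned.append(IGNORE_INDEX)
        elif wid != previous:            # first subword of a new word
            aligned.append(word_labels[wid])
        else:                            # trailing subword
            aligned.append(IGNORE_INDEX)
        previous = wid
    return aligned

# "aspirin" split into two subwords -> only the first keeps label 1
word_labels = [0, 1]                     # e.g. 0 = O, 1 = Chemical
word_ids = [None, 0, 1, 1, None]         # <s> took aspi rin </s>
print(align_labels(word_labels, word_ids))  # [-100, 0, 1, -100, -100]
```

At training time the -100 positions contribute nothing to the loss, so the model is never penalized for its predictions on word fragments.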
Guide: Running Locally
- Install Dependencies
  Ensure you have Python installed, then run:

  ```
  pip install forgebox
  ```

- Load and Use the Model
  Import the necessary library and load the model:

  ```
  from forgebox.hf.train import NERInference
  ner = NERInference.from_pretrained("raynardj/ner-chemical-bionlp-bc5cdr-pubmed")
  ```

- Make Predictions
  Use the model to predict entities in your text:

  ```
  a_df = ner.predict(["text1", "text2"])
  ```

- Cloud GPU Suggestion
  For better performance, especially when dealing with large datasets, consider using cloud-based GPUs. Providers like AWS, GCP, or Azure offer scalable GPU solutions.
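Once token-level predictions are in hand, consecutive tokens sharing the same non-"O" label can be merged into entity spans. Here is a minimal, self-contained sketch; the tokens and labels are mock data, not actual `predict` output, whose exact format depends on the forgebox library:

```python
# Sketch: merging consecutive non-"O" token predictions into entity spans.
# The tokens and labels below are mock data; the structure returned by
# NERInference.predict depends on the forgebox library version.

def group_entities(tokens, labels):
    spans, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab == "O":
            if current:
                spans.append(current)
                current = None
        elif current and current[1] == lab:
            current = (current[0] + " " + tok, lab)  # extend current span
        else:
            if current:
                spans.append(current)
            current = (tok, lab)                     # start a new span
    if current:
        spans.append(current)
    return spans

tokens = ["Aspirin", "inhibits", "cyclooxygenase", ",", "unlike", "naproxen"]
labels = ["Chemical", "O", "O", "O", "O", "Chemical"]
print(group_entities(tokens, labels))
# [('Aspirin', 'Chemical'), ('naproxen', 'Chemical')]
```

Note that without B-/I- prefixes, two distinct entities of the same type appearing back-to-back would be merged into one span; for chemical mentions in running text this is rarely an issue.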
License
This model is released under the Apache 2.0 License, allowing for both personal and commercial use, modification, and distribution, with proper attribution to the original authors.