postagger bio portuguese

pucpr-br

Introduction

POS-Tagger Bio Portuguese is a part-of-speech tagger obtained by fine-tuning BioBERTpt(all) on the MacMorpho corpus. It performs token classification and is intended for clinical and biomedical text in Brazilian Portuguese.

Architecture

  • Model Base: BioBERTpt(all)
  • Language: Portuguese
  • Corpus: MacMorpho
  • Task: Token Classification

Training

The model was fine-tuned for 10 epochs and reached an overall F1-score of 0.9818. The reported metrics are as follows:

  • Accuracy: 0.9818
  • Macro Average: Precision 0.95, Recall 0.94, F1 0.94
  • Weighted Average: Precision 0.98, Recall 0.98, F1 0.98
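The gap between the macro and weighted averages usually means that a few infrequent tags score lower than the frequent ones. The sketch below shows how the two averages are computed. The per-tag scores and supports are hypothetical and chosen only for illustration; they are not the model's actual per-class results, and the helper name is not part of any library.

```python
def macro_and_weighted_f1(per_class):
    """Return (macro_f1, weighted_f1) from a mapping tag -> (f1, support)."""
    f1_scores = [f1 for f1, _ in per_class.values()]
    total_support = sum(support for _, support in per_class.values())
    # Macro: every tag counts equally, regardless of how often it occurs.
    macro = sum(f1_scores) / len(f1_scores)
    # Weighted: each tag contributes in proportion to its number of tokens.
    weighted = sum(f1 * support for f1, support in per_class.values()) / total_support
    return macro, weighted


# Hypothetical per-tag results: two frequent tags and one rare, harder tag.
example = {"N": (0.99, 8000), "V": (0.98, 1500), "IN": (0.80, 500)}
macro, weighted = macro_and_weighted_f1(example)
print(f"macro={macro:.3f} weighted={weighted:.3f}")
```

In this toy case the rare tag pulls the macro average down while barely moving the weighted one, which is the same pattern seen in the reported figures.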

Training Parameters

  • Number of Classes: 27
  • Total Epochs: 30 (configured maximum)
  • Early Stopping Epochs: 12
  • Batch Size: 32
  • Learning Rate: 1e-5
  • Early Stopping Patience: 3
  • Max Sequence Length: 200
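As a rough illustration of how these settings interact, the sketch below collects them in a configuration object and applies patience-based early stopping to a list of per-epoch validation scores. It assumes that Total Epochs acts as an upper bound and that training stops after 3 consecutive epochs without improvement. The class, field, and function names are hypothetical, not taken from the original training script.

```python
from dataclasses import dataclass


@dataclass
class TrainConfig:
    # Values taken from the parameter list above; field names are illustrative.
    num_classes: int = 27
    max_epochs: int = 30
    batch_size: int = 32
    learning_rate: float = 1e-5
    patience: int = 3
    max_length: int = 200


def epochs_run(val_scores, config):
    """Return the number of epochs completed before training stops."""
    best = float("-inf")
    stale = 0  # consecutive epochs without a new best score
    for epoch, score in enumerate(val_scores[: config.max_epochs], start=1):
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= config.patience:
                return epoch
    return min(len(val_scores), config.max_epochs)


# Hypothetical validation F1 per epoch: the best score is reached at epoch 4.
scores = [0.90, 0.95, 0.97, 0.98, 0.98, 0.979, 0.978, 0.99]
print(epochs_run(scores, TrainConfig()))
```

Under these assumptions a run can end well before the epoch limit, which would be consistent with a reported training length shorter than 30 epochs.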

Guide: Running Locally

  1. Setup Environment: Ensure you have Python installed. Use a virtual environment to manage dependencies.
  2. Install Dependencies: Use pip to install necessary libraries such as PyTorch and Transformers.
  3. Download Model: Access the model via the Hugging Face Model Hub.
  4. Run Inference: Load the model in your script and pass input text for token classification.
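The steps above can be sketched as follows, assuming `torch` and `transformers` are installed with pip. The model identifier `pucpr-br/postagger-bio-portuguese` is inferred from the title and organization name on this page and should be checked against the Hugging Face Model Hub. The helper names are hypothetical. The pipeline is used without aggregation, so it returns one prediction per word piece, and a small helper joins `##` continuation pieces back into words.

```python
def load_tagger(model_id="pucpr-br/postagger-bio-portuguese"):
    """Build a token-classification pipeline (model id is an assumption)."""
    # Imported lazily so the pure helper below works without transformers installed.
    from transformers import pipeline

    return pipeline("token-classification", model=model_id)


def merge_wordpieces(predictions):
    """Join '##' continuation pieces onto the previous word, keeping the first piece's tag."""
    words = []
    for pred in predictions:
        piece, tag = pred["word"], pred["entity"]
        if piece.startswith("##") and words:
            prev_word, prev_tag = words[-1]
            words[-1] = (prev_word + piece[2:], prev_tag)
        else:
            words.append((piece, tag))
    return words


def tag_text(text, tagger=None):
    """Tag a sentence and return a list of (word, tag) pairs."""
    tagger = tagger or load_tagger()
    return merge_wordpieces(tagger(text))


# Example call (downloads the model on first use):
# print(tag_text("O paciente recebeu alta hospitalar."))
```

Keeping the first piece's tag for each word is one simple convention among several. Inputs longer than the 200-token training length may need to be split into shorter segments before tagging.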

Cloud GPUs: Consider using cloud services like AWS, GCP, or Azure to leverage GPU resources for faster processing.

License

The study and model development were partially funded by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES), Finance Code 001.

For further questions, visit the NLP Portuguese Chunking GitHub repository.
