POS-Tagger Bio Portuguese (pucpr-br)
Introduction
The POS-Tagger Bio Portuguese is a fine-tuned model based on the BioBERTpt(all) architecture, trained on the MacMorpho corpus for Portuguese part-of-speech tagging. It is intended for token classification tasks, with a focus on clinical and biomedical text in Brazilian Portuguese.
Architecture
- Model Base: BioBERTpt(all)
- Language: Portuguese
- Corpus: MacMorpho
- Task: Token Classification
Training
The model was fine-tuned over 10 epochs, achieving an overall F1-score of 0.9818. Evaluation metrics are as follows:
- Accuracy: 0.9818
- Macro Average: Precision 0.95, Recall 0.94, F1 0.94
- Weighted Average: Precision 0.98, Recall 0.98, F1 0.98
Training Parameters
- Number of Classes: 27
- Total Epochs: 30
- Early Stopping Epochs: 12
- Batch Size: 32
- Learning Rate: 1e-5
- Early Stopping Patience: 3
- Max Sequence Length: 200
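As a rough illustration, the hyperparameters above can be expressed with the Hugging Face Trainer API as in the sketch below. The base model id, output directory, and the use of the Trainer API itself are assumptions, and the authors' actual training script may differ; dataset loading, tokenization, and label mapping are omitted.

```python
# Hypothetical sketch: the hyperparameters above expressed with the
# Hugging Face Trainer API. Base model id and output directory are assumptions.
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    TrainingArguments,
)

NUM_LABELS = 27       # number of POS classes
MAX_SEQ_LENGTH = 200  # applied when tokenizing the MacMorpho sentences

tokenizer = AutoTokenizer.from_pretrained("pucpr/biobertpt-all")
model = AutoModelForTokenClassification.from_pretrained(
    "pucpr/biobertpt-all", num_labels=NUM_LABELS
)

training_args = TrainingArguments(
    output_dir="postagger-bio-portuguese",  # assumed output directory
    num_train_epochs=30,                    # total epochs; early stopping may end sooner
    per_device_train_batch_size=32,         # batch size
    learning_rate=1e-5,                     # learning rate
    evaluation_strategy="epoch",            # named "eval_strategy" in newer transformers
    save_strategy="epoch",
    load_best_model_at_end=True,            # required by EarlyStoppingCallback
)

# Stop training when the validation metric fails to improve for 3 evaluations.
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)

# training_args and early_stopping would then be passed to a Trainer together
# with tokenized MacMorpho train/eval splits (dataset preparation omitted here).
```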
Guide: Running Locally
- Setup Environment: Ensure you have Python installed. Use a virtual environment to manage dependencies.
- Install Dependencies: Use pip to install the necessary libraries, such as PyTorch and Transformers.
- Download Model: Access the model via the Hugging Face Model Hub.
- Run Inference: Load the model in your script and pass input text for token classification, as sketched below.
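A minimal inference sketch using the transformers pipeline API is shown below. It assumes PyTorch and Transformers are already installed; the Hub model id is an assumption based on the model name, and the example sentence is purely illustrative.

```python
# Minimal inference sketch with the transformers pipeline API.
# The model id below is an assumption; adjust it to the actual
# Hugging Face Hub repository if it differs.
from transformers import pipeline

MODEL_ID = "pucpr-br/postagger-bio-portuguese"  # assumed Hub id

tagger = pipeline(
    "token-classification",
    model=MODEL_ID,
    tokenizer=MODEL_ID,
)

text = "O paciente apresentou febre alta e dor abdominal."
for token in tagger(text):
    # Each entry contains the token text, the predicted POS tag, and a score.
    print(token["word"], token["entity"], round(token["score"], 3))
```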
Cloud GPUs: Consider using cloud services like AWS, GCP, or Azure to leverage GPU resources for faster processing.
Funding
The study and model development were partially funded by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES), Finance Code 001.
For further questions, visit the NLP Portuguese Chunking GitHub repository.