legal-bert-base-cased-ptbr
Introduction
legal-bert-base-cased-ptbr is a Portuguese language model for the legal domain, based on the BERTimbau base model. It was pre-trained with a fill-mask (masked language modeling) objective on a variety of Portuguese legal documents and is intended to support NLP research on legal texts, computer law, and legal technology applications.
Architecture
The model is built on the BERT architecture, specifically the BERTimbau base model. It is trained for fill-mask prediction, making it suitable for processing and understanding legal texts in Portuguese.
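As a quick way to confirm the architecture, the model configuration can be inspected. Since the model derives from BERTimbau base, standard BERT-base dimensions (12 layers, hidden size 768, 12 attention heads) are expected; these values are our assumption based on that lineage, not figures stated in the card.

```python
from transformers import AutoConfig

# Inspect the configuration; the expected BERT-base dimensions in the
# comments below are assumptions based on the BERTimbau base lineage.
config = AutoConfig.from_pretrained("dominguesm/legal-bert-base-cased-ptbr")
print(config.model_type)           # "bert"
print(config.num_hidden_layers)    # expected: 12
print(config.hidden_size)          # expected: 768
print(config.num_attention_heads)  # expected: 12
```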
Training
The pre-training corpus included a variety of legal documents provided by the Brazilian Supreme Federal Tribunal. Here are the key statistics from the training process:
- Number of examples: 353,435
- Number of epochs: 3
- Batch size per device: 4
- Total training batch size: 32
- Gradient accumulation steps: 1
- Total optimization steps: 33,135
- Training loss: 0.6108
- Evaluation loss: 0.4725
- Perplexity: 1.6040
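These figures are internally consistent: the optimization step count equals ceil(353,435 / 32) × 3, and the reported perplexity is the exponential of the evaluation loss. A minimal check in Python:

```python
import math

examples, total_batch_size, epochs = 353_435, 32, 3

# Optimization steps: steps per epoch (ceil of examples / batch size) times epochs.
steps = math.ceil(examples / total_batch_size) * epochs
print(steps)  # 33135

# Perplexity of a masked language model is exp(average cross-entropy loss).
print(f"{math.exp(0.4725):.4f}")  # 1.6040
```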
Guide: Running Locally
To use the legal-bert-base-cased-ptbr model locally, follow these steps:
- Install the Transformers library: ensure you have the `transformers` library installed.

  ```bash
  pip install transformers
  ```
- Load the model and tokenizer:

  ```python
  from transformers import AutoTokenizer, AutoModel

  tokenizer = AutoTokenizer.from_pretrained("dominguesm/legal-bert-base-cased-ptbr")
  model = AutoModel.from_pretrained("dominguesm/legal-bert-base-cased-ptbr")
  ```
- Optional: use with a fill-mask pipeline (a usage sketch follows this list):

  ```python
  from transformers import pipeline

  fill_mask = pipeline("fill-mask", model="dominguesm/legal-bert-base-cased-ptbr")
  ```
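As a quick sanity check, a masked sentence can be passed to the pipeline. The Portuguese example below is illustrative, chosen by us rather than taken from the model card:

```python
# Illustrative sentence (our assumption, not from the model card):
# "The defendant was ordered to pay [MASK] for moral damages."
predictions = fill_mask("O réu foi condenado ao pagamento de [MASK] por danos morais.")

# Each prediction is a dict with the candidate token and its score.
for p in predictions:
    print(p["token_str"], f"{p['score']:.4f}")
```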
Consider using cloud GPUs such as those provided by AWS, Google Cloud, or Azure for more efficient processing, especially for large-scale tasks.
License
The model is licensed under the Creative Commons Attribution 4.0 International (cc-by-4.0).