legal-bert-base-cased-ptbr

dominguesm

Introduction

legal-bert-base-cased-ptbr is a Portuguese-language model for the legal domain, based on the BERTimbau base model. It was pre-trained with a fill-mask (masked language modeling) objective on a collection of Brazilian legal documents and is intended to support NLP research on legal texts, computer law, and legal technology applications.

Architecture

The model uses the BERT base architecture and is initialized from the BERTimbau base checkpoint. Because it retains a masked-language-modeling head, it can be used directly for fill-mask prediction as well as for downstream processing of Portuguese legal text.
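As a quick sanity check on the architecture, the checkpoint's configuration can be inspected with the standard transformers API. A minimal sketch (the expected values in the comments assume a standard BERT base configuration):

    from transformers import AutoConfig

    # Fetch only the configuration, not the full model weights
    config = AutoConfig.from_pretrained("dominguesm/legal-bert-base-cased-ptbr")
    print(config.model_type)         # expected: bert
    print(config.num_hidden_layers)  # expected: 12 for BERT base
    print(config.hidden_size)        # expected: 768 for BERT base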

Training

The pre-training corpus consisted of legal documents provided by Brazil's Federal Supreme Court (Supremo Tribunal Federal). Key statistics from the training run:

  • Number of examples: 353,435
  • Number of epochs: 3
  • Batch size per device: 4
  • Total training batch size: 32
  • Gradient accumulation steps: 1
  • Total optimization steps: 33,135
  • Training loss: 0.6108
  • Evaluation loss: 0.4725
  • Perplexity: 1.6040
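These figures are internally consistent: a total batch size of 32 with 4 examples per device and no gradient accumulation implies 8 devices, and the perplexity is the exponential of the evaluation loss. A minimal sketch of the arithmetic:

    import math

    examples, epochs, total_batch = 353_435, 3, 32

    # Optimization steps: batches per epoch (rounded up) times epochs
    print(math.ceil(examples / total_batch) * epochs)  # 33135

    # Perplexity is exp(cross-entropy loss) on the evaluation set
    print(math.exp(0.4725))  # ~1.604, matching the reported perplexity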

Guide: Running Locally

To use the legal-bert-base-cased-ptbr model locally, follow these steps:

  1. Install the Transformers Library: Ensure you have the transformers library installed.

    pip install transformers
    
  2. Load the Model and Tokenizer:

    from transformers import AutoTokenizer, AutoModelForMaskedLM
    
    tokenizer = AutoTokenizer.from_pretrained("dominguesm/legal-bert-base-cased-ptbr")
    # AutoModelForMaskedLM includes the masked-language-modeling head used for fill-mask
    model = AutoModelForMaskedLM.from_pretrained("dominguesm/legal-bert-base-cased-ptbr")
    
  3. Optional: Use with a Pipeline (a usage example follows these steps):

    from transformers import pipeline
    
    fill_mask = pipeline('fill-mask', model="dominguesm/legal-bert-base-cased-ptbr")
    

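Once the pipeline is created, it can be queried with a sentence containing the tokenizer's [MASK] token. The Portuguese sentence below is illustrative, not taken from the model card:

    resultado = fill_mask("O réu foi [MASK] pelo tribunal.")

    # Each prediction is a dict with the filled sequence, token, and score
    for predicao in resultado:
        print(predicao["token_str"], round(predicao["score"], 4))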
Consider using cloud GPUs such as those provided by AWS, Google Cloud, or Azure for more efficient processing, especially for large-scale tasks.

License

The model is licensed under the Creative Commons Attribution 4.0 International (cc-by-4.0).
