Legal-BERTimbau Base
Introduction
Legal_BERTimbau is a fine-tuned BERT model based on BERTimbau, a model for Brazilian Portuguese. The original BERTimbau achieves state-of-the-art results on tasks such as Named Entity Recognition, Sentence Textual Similarity, and Recognizing Textual Entailment. Legal_BERTimbau adapts it to the legal domain by running one additional pre-training epoch over 30,000 legal documents.
Architecture
Legal_BERTimbau is available in two architectures:
- BERT-Base: 12 layers with 110 million parameters.
- BERT-Large: 24 layers with 335 million parameters.
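Both sizes load the same way; only the checkpoint name differs. A minimal sketch of selecting between them — the base checkpoint name comes from the guide below, while `rufimelo/Legal-BERTimbau-large` is an assumption inferred from the same naming scheme:

```python
from transformers import AutoModelForMaskedLM

# Base: 12 layers, ~110M parameters (checkpoint name used in the guide below).
base = AutoModelForMaskedLM.from_pretrained("rufimelo/Legal-BERTimbau-base")

# Large: 24 layers, ~335M parameters (checkpoint name is an assumption,
# inferred from the base checkpoint's naming convention).
large = AutoModelForMaskedLM.from_pretrained("rufimelo/Legal-BERTimbau-large")

print(base.config.num_hidden_layers)   # 12
print(large.config.num_hidden_layers)  # 24
```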
Training
The model was fine-tuned on 30,000 Portuguese legal documents, adapting the underlying language model to the vocabulary and phrasing of the legal domain.
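The model card does not include the training script; the sketch below illustrates the standard recipe for this kind of domain-adaptive masked-language-model fine-tuning, assuming the corpus is a list of plain-text strings and starting from the original BERTimbau checkpoint (neuralmind/bert-base-portuguese-cased). It is an illustration, not the authors' exact setup:

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the original BERTimbau checkpoint.
tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
model = AutoModelForMaskedLM.from_pretrained("neuralmind/bert-base-portuguese-cased")

# Placeholder corpus; in practice this would be the 30,000 legal documents.
legal_documents = ["O advogado apresentou recurso para o juiz."]

encodings = tokenizer(legal_documents, truncation=True, max_length=512)
train_dataset = [{"input_ids": ids} for ids in encodings["input_ids"]]

# Randomly mask 15% of tokens, the standard BERT masking rate.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="legal-bertimbau", num_train_epochs=1),
    train_dataset=train_dataset,
    data_collator=collator,
)
trainer.train()
```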
Guide: Running Locally
- Installation: Ensure Python and PyTorch are installed, then install the transformers library from Hugging Face:

```bash
pip install transformers torch
```
- Usage: Load the model and tokenizer using the code below:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("rufimelo/Legal-BERTimbau-base")
model = AutoModelForMaskedLM.from_pretrained("rufimelo/Legal-BERTimbau-base")
```
- Prediction: Use the model for masked language modeling:

```python
from transformers import pipeline

pipe = pipeline('fill-mask', model=model, tokenizer=tokenizer)
pipe('O advogado apresentou [MASK] para o juíz')
```
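The pipeline returns a ranked list of candidate fills. Continuing from the snippet above, each entry carries the predicted token and a confidence score:

```python
# Inspect the top candidates for the [MASK] position.
for candidate in pipe('O advogado apresentou [MASK] para o juíz'):
    print(candidate['token_str'], round(candidate['score'], 3))
```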
- Embeddings: Generate contextual token embeddings:

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained('rufimelo/Legal-BERTimbau-base')

# Reuses the tokenizer loaded in the Usage step above.
input_ids = tokenizer.encode('O advogado apresentou recurso para o juíz',
                             return_tensors='pt')
with torch.no_grad():
    outs = model(input_ids)
encoded = outs[0][0, 1:-1]  # token embeddings, excluding [CLS] and [SEP]
```
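`encoded` holds one vector per token. If a single sentence-level vector is needed, mean pooling over the token dimension is a common convention (continuing from the snippet above; this step is not prescribed by the model card itself):

```python
# Average the token embeddings into one sentence vector.
sentence_embedding = encoded.mean(dim=0)
print(sentence_embedding.shape)  # torch.Size([768]) for the base model
```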
For optimal performance, consider using cloud-based GPUs such as those offered by AWS, Google Cloud, or Azure.
License
The Legal_BERTimbau model is licensed under the MIT License, permitting use, modification, and distribution with proper attribution.