neuralmind/bert-large-portuguese-cased

Introduction
BERTimbau Large is a pretrained BERT model specifically designed for Brazilian Portuguese, achieving state-of-the-art results in Named Entity Recognition, Sentence Textual Similarity, and Recognizing Textual Entailment. The model is available in two sizes: Base and Large.
Architecture
The BERTimbau model comes in two variations:
- BERT-Base: 12 layers, 110 million parameters.
- BERT-Large: 24 layers, 335 million parameters.
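These figures can be checked directly against the published checkpoints. The sketch below is a minimal verification script and assumes both model IDs are available from the neuralmind organization on the Hugging Face Hub:

```python
from transformers import AutoConfig, AutoModel

# Assumption: both checkpoints are hosted under the neuralmind organization.
for name in ['neuralmind/bert-base-portuguese-cased',
             'neuralmind/bert-large-portuguese-cased']:
    config = AutoConfig.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f'{name}: {config.num_hidden_layers} layers, {n_params / 1e6:.0f}M parameters')
```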
Training
BERTimbau Large was trained on brWaC (Brazilian Web as Corpus), a large corpus of Brazilian Portuguese web text. The model can be used for masked language modeling and for producing contextual embeddings, as shown in the guide below.
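Because the vocabulary was built from Portuguese text, the cased tokenizer typically keeps common Portuguese words intact rather than splitting them into many subword pieces. A minimal sketch to inspect this, using the same checkpoint as the guide below:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('neuralmind/bert-large-portuguese-cased',
                                          do_lower_case=False)
# Inspect how a Portuguese sentence is split into WordPiece tokens.
print(tokenizer.tokenize('Tinha uma pedra no meio do caminho.'))
```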
Guide: Running Locally
- Install Transformers: Ensure you have the transformers library installed:

```bash
pip install transformers
```
- Load Model and Tokenizer:

```python
from transformers import AutoTokenizer, AutoModelForPreTraining

# Keep do_lower_case=False: the checkpoint is cased.
tokenizer = AutoTokenizer.from_pretrained('neuralmind/bert-large-portuguese-cased', do_lower_case=False)
model = AutoModelForPreTraining.from_pretrained('neuralmind/bert-large-portuguese-cased')
```
- Masked Language Modeling Example:

```python
from transformers import pipeline

# Build the pipeline from the checkpoint name so it loads a masked-LM head.
pipe = pipeline('fill-mask', model='neuralmind/bert-large-portuguese-cased', tokenizer=tokenizer)
result = pipe('Tinha uma [MASK] no meio do caminho.')
```
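Each element of result is a dictionary containing the filled-in sequence, the predicted token (token_str), and a confidence score, so the top suggestions can be inspected directly.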
- Use BERT for Embeddings:

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained('neuralmind/bert-large-portuguese-cased')
input_ids = tokenizer.encode('Tinha uma pedra no meio do caminho.', return_tensors='pt')

with torch.no_grad():
    outs = model(input_ids)
    # Last hidden states, dropping the [CLS] and [SEP] tokens.
    encoded = outs[0][0, 1:-1]
```
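The snippet above yields one 1024-dimensional vector per token (the hidden size of the Large model). If a single sentence-level vector is needed, one common option, shown here as an assumption rather than part of the original card, is mean pooling over the token embeddings:

```python
# Hypothetical follow-up: average the token vectors into one sentence embedding.
sentence_embedding = encoded.mean(dim=0)  # shape: (1024,)
```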
To leverage the full potential of BERTimbau Large, it is recommended to use cloud GPUs such as those offered by AWS, Google Cloud, or Azure.
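On such a machine the same embedding code runs unchanged after moving the model and inputs to the GPU; a minimal sketch:

```python
import torch

# Move the model and inputs to a GPU when one is available.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)

with torch.no_grad():
    outs = model(input_ids.to(device))
```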
License
BERTimbau Large is released under the MIT license, allowing for wide usage and modification.