bert-large-portuguese-cased

neuralmind

Introduction

BERTimbau is a pretrained BERT model for Brazilian Portuguese that achieves state-of-the-art results on three downstream tasks: Named Entity Recognition, Sentence Textual Similarity, and Recognizing Textual Entailment. It is available in two sizes, Base and Large; this page describes the Large model.

Architecture

The BERTimbau model comes in two variations:

  • BERT-Base: 12 layers, 110 million parameters.
  • BERT-Large: 24 layers, 335 million parameters.
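
If you want to double-check these numbers against the checkpoint itself, here is a minimal sketch, assuming the neuralmind/bert-large-portuguese-cased checkpoint name used in the guide below:

    from transformers import AutoConfig, AutoModel
    
    # Read the Large checkpoint's configuration: 24 hidden layers, hidden size 1024.
    config = AutoConfig.from_pretrained('neuralmind/bert-large-portuguese-cased')
    print(config.num_hidden_layers, config.hidden_size)
    
    # Count parameters; this prints roughly 335 million for the Large variant.
    model = AutoModel.from_pretrained('neuralmind/bert-large-portuguese-cased')
    print(sum(p.numel() for p in model.parameters()))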

Training

BERTimbau Large was pretrained on brWaC (Brazilian Web as Corpus), a large corpus of Brazilian Portuguese web text. The pretrained model can be used directly for masked language modeling and for producing contextual embeddings, as shown in the guide below.

Guide: Running Locally

  1. Install Transformers: Ensure you have the transformers library installed:

    pip install transformers
    
  2. Load Model and Tokenizer:

    from transformers import AutoTokenizer, AutoModelForPreTraining
    
    # do_lower_case=False preserves casing, which this cased model expects.
    tokenizer = AutoTokenizer.from_pretrained('neuralmind/bert-large-portuguese-cased', do_lower_case=False)
    model = AutoModelForPreTraining.from_pretrained('neuralmind/bert-large-portuguese-cased')
    
  3. Masked Language Modeling Example:

    from transformers import pipeline
    
    # The fill-mask pipeline needs a model with a masked-LM head, so load it
    # from the checkpoint name rather than reusing the pretraining model above.
    pipe = pipeline('fill-mask', model='neuralmind/bert-large-portuguese-cased', tokenizer=tokenizer)
    result = pipe('Tinha uma [MASK] no meio do caminho.')
    
  4. Use BERT for Embeddings:

    import torch
    from transformers import AutoModel
    
    model = AutoModel.from_pretrained('neuralmind/bert-large-portuguese-cased')
    input_ids = tokenizer.encode('Tinha uma pedra no meio do caminho.', return_tensors='pt')
    
    with torch.no_grad():
        outs = model(input_ids)
        # outs[0] is the last hidden state; slice off the [CLS] and [SEP] tokens.
        encoded = outs[0][0, 1:-1]
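
As a follow-up to steps 3 and 4, here is a small sketch of how the outputs can be consumed. The fields on each prediction come from the standard fill-mask pipeline output; mean pooling is just one illustrative way to collapse token embeddings into a single sentence vector, not something prescribed by this model.

    # Step 3 follow-up: each prediction is a dict with 'token_str' and 'score'.
    for prediction in result[:3]:
        print(prediction['token_str'], round(prediction['score'], 4))
    
    # Step 4 follow-up: illustrative mean pooling of the token embeddings
    # into one sentence vector of size 1024.
    sentence_embedding = encoded.mean(dim=0)
    print(sentence_embedding.shape)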
    

Because the Large model has 335 million parameters, inference and fine-tuning are much faster on a GPU; if no local GPU is available, cloud GPUs such as those offered by AWS, Google Cloud, or Azure are recommended.
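
For reference, a minimal sketch of GPU inference, assuming PyTorch with a CUDA device and the same checkpoint as above:

    import torch
    from transformers import AutoModel, AutoTokenizer
    
    # Fall back to CPU if no CUDA device is available.
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    tokenizer = AutoTokenizer.from_pretrained('neuralmind/bert-large-portuguese-cased', do_lower_case=False)
    model = AutoModel.from_pretrained('neuralmind/bert-large-portuguese-cased').to(device)
    model.eval()
    
    # Tokenize on CPU, then move the tensors to the same device as the model.
    inputs = tokenizer('Tinha uma pedra no meio do caminho.', return_tensors='pt').to(device)
    with torch.no_grad():
        last_hidden = model(**inputs).last_hidden_state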

License

BERTimbau Large is released under the MIT license, allowing for wide usage and modification.
