dccuchile/bert-base-spanish-wwm-cased

Introduction

BETO is a BERT model trained on a large Spanish corpus using the Whole Word Masking technique. It is comparable in size to BERT-Base and is available in both cased and uncased versions. BETO has been benchmarked against Multilingual BERT and other non-BERT-based models on Spanish language tasks.

Architecture

BETO employs a vocabulary of roughly 31,000 BPE subwords constructed with SentencePiece and was trained for 2 million steps, making it a general-purpose encoder for a wide range of Spanish language processing tasks.
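
As a quick sanity check, the vocabulary can be inspected directly from the released tokenizer (a minimal sketch; the example word and the exact token count are illustrative):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('dccuchile/bert-base-spanish-wwm-cased')

    # The vocabulary holds roughly 31,000 entries, including special tokens.
    print(len(tokenizer))

    # Long or rare Spanish words are split into BPE subword pieces;
    # continuation pieces carry the '##' prefix in BERT-style vocabularies.
    print(tokenizer.tokenize('inconstitucionalidad'))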

Training

BETO was trained with the Whole Word Masking technique, and checkpoints are available for both TensorFlow and PyTorch. Training was supported by Adereso and the Millennium Institute for Foundational Research on Data, with compute provided through Google's TensorFlow Research Cloud program.
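
The key idea of Whole Word Masking is that when a word is split into several subword pieces, all of its pieces are masked together rather than independently. A minimal sketch of that grouping logic, assuming a BERT-style '##' continuation prefix (illustrative only, not the authors' training code):

    import random

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('dccuchile/bert-base-spanish-wwm-cased')

    def whole_word_mask(text, mask_prob=0.15):
        """Mask every subword piece of each randomly selected word."""
        tokens = tokenizer.tokenize(text)
        # Group token indices into whole words: '##' pieces continue the previous word.
        words, current = [], []
        for i, tok in enumerate(tokens):
            if tok.startswith('##') and current:
                current.append(i)
            else:
                if current:
                    words.append(current)
                current = [i]
        if current:
            words.append(current)
        # Replace all pieces of a selected word with the mask token.
        for word in words:
            if random.random() < mask_prob:
                for i in word:
                    tokens[i] = tokenizer.mask_token
        return tokens

    print(whole_word_mask('La interoperabilidad es fundamental.'))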

Guide: Running Locally

  1. Installation: Ensure Python is installed and set up a virtual environment.
  2. Dependencies: Install the Hugging Face Transformers library along with a backend such as PyTorch (required by BertModel).
    pip install transformers torch
    
  3. Download Model: Use the Hugging Face model hub to download BETO.
    from transformers import BertTokenizer, BertModel
    tokenizer = BertTokenizer.from_pretrained('dccuchile/bert-base-spanish-wwm-cased')
    model = BertModel.from_pretrained('dccuchile/bert-base-spanish-wwm-cased')
    
  4. Inference: Tokenize input text and run inference with BETO, as shown in the sketch after this list.
  5. Cloud GPUs: Consider using cloud services like Google Colab or AWS for GPU support if running extensive tasks.
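
A minimal sketch of step 4, assuming a PyTorch backend; the example sentences and the fill-mask query are illustrative, not part of the original guide:

    import torch
    from transformers import BertModel, BertTokenizer, pipeline

    tokenizer = BertTokenizer.from_pretrained('dccuchile/bert-base-spanish-wwm-cased')
    model = BertModel.from_pretrained('dccuchile/bert-base-spanish-wwm-cased')
    model.eval()

    # Tokenize a Spanish sentence and run a forward pass for contextual embeddings.
    inputs = tokenizer('BETO es un modelo BERT entrenado en español.', return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)

    # As a masked language model, BETO also works with the fill-mask pipeline.
    fill_mask = pipeline('fill-mask', model='dccuchile/bert-base-spanish-wwm-cased')
    for prediction in fill_mask('Todos los caminos llevan a [MASK].'):
        print(prediction['token_str'], prediction['score'])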

License

The BETO model is best described by the CC BY 4.0 license, though users should verify the compatibility of the original datasets' licenses, particularly for commercial use.
