bert base spanish wwm uncased

dccuchile

Introduction

BETO is a BERT model trained on a large Spanish corpus. It is similar in size to BERT-Base and was trained with the Whole Word Masking technique. The model is available in cased and uncased versions, with weights for both TensorFlow and PyTorch. BETO is benchmarked against Multilingual BERT and other non-BERT models on several Spanish language tasks.

Architecture

BETO uses the BERT architecture with a vocabulary of approximately 31,000 BPE subwords built with SentencePiece. The model was trained for 2 million steps with Whole Word Masking, in which all subword pieces of a selected word are masked together rather than independently. The cased and uncased versions share this architecture and differ in how they treat letter case.
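To make the Whole Word Masking idea concrete, here is a minimal sketch in plain Python. It is not BETO's training code. It assumes subword continuation pieces are marked with a "##" prefix (the convention of BERT-style tokenizers), and it simplifies the real procedure by always substituting [MASK]. The function name whole_word_mask and its parameters are hypothetical.

```python
import random
from typing import List


def whole_word_mask(tokens: List[str], mask_prob: float = 0.15, seed: int = 0) -> List[str]:
    """Mask whole words: if a word is selected, every subword piece of it is masked.

    Assumption: a piece starting with "##" continues the previous word.
    Simplification: selected pieces are always replaced by "[MASK]".
    """
    rng = random.Random(seed)

    # Group token indices into words.
    words: List[List[int]] = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    masked = list(tokens)
    for word in words:
        # One random draw per word, not per subword piece.
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = "[MASK]"
    return masked
```

The key design point is the single random draw per word: a word such as "des ##arrollo" is either fully masked or left intact, so the model cannot recover a masked piece from its unmasked sibling.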

Training

The model was trained with support from Adereso and the Millennium Institute for Foundational Research on Data. Google also supported the training through the TensorFlow Research Cloud program. The resulting models were benchmarked on several Spanish language tasks.

Guide: Running Locally

  1. Installation: Install Python, the Hugging Face Transformers library, and a deep learning backend. The loading code below uses the PyTorch model classes, so PyTorch is needed as well.

    pip install transformers torch
    
  2. Model Loading: Load the model using the Transformers library.

    from transformers import AutoModel, AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")
    model = AutoModel.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")
    
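  3. Inference: Use the model for tasks such as fill-mask or sentence classification. AutoModel returns only the encoder's hidden states, so fill-mask needs a masked-language-modeling head (for example AutoModelForMaskedLM), and sentence classification needs a classification head fine-tuned on labeled data.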

  4. Cloud GPUs: For more intensive tasks, consider using cloud GPU services such as AWS, Google Cloud Platform, or Azure to speed up processing.
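As an illustration of the inference step, the sketch below runs fill-mask through the Transformers pipeline API. It assumes the checkpoint uses the standard BERT mask token [MASK] and that the pipeline returns dictionaries with token_str and score keys. The helper names summarize_predictions and fill_mask are hypothetical, and the first call downloads the model weights.

```python
from typing import Dict, List, Tuple


def summarize_predictions(predictions: List[Dict], top_k: int = 3) -> List[Tuple[str, float]]:
    """Reduce fill-mask pipeline output to (token, rounded score) pairs, best first."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"].strip(), round(float(p["score"]), 4)) for p in ranked[:top_k]]


def fill_mask(sentence: str, top_k: int = 3) -> List[Tuple[str, float]]:
    """Predict the [MASK] token in a Spanish sentence with BETO (uncased)."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model="dccuchile/bert-base-spanish-wwm-uncased")
    # For a sentence with one [MASK], the pipeline returns a list of dicts.
    return summarize_predictions(unmasker(sentence, top_k=top_k), top_k)
```

A call such as fill_mask("la capital de chile es [MASK].") would return the three highest-scoring candidate tokens with their probabilities. The input is lowercased here only to match the uncased checkpoint.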

License

The authors state that the CC BY 4.0 license best describes their intentions for the work. However, some of the datasets used for training may have licenses that are not compatible with CC BY 4.0, especially for commercial use. Users should check the licenses of the original resources to confirm they meet their requirements.
