dccuchile/bert-base-spanish-wwm-uncased

Introduction
BETO is a BERT model specifically trained on a large Spanish corpus. It is comparable in size to BERT-Base and utilizes the Whole Word Masking technique. The model is available in both cased and uncased versions and offers weights for TensorFlow and PyTorch. BETO's performance is benchmarked against Multilingual BERT and other non-BERT models across several Spanish language tasks.
Architecture
BETO employs the BERT architecture with a vocabulary of approximately 31,000 BPE subwords built using SentencePiece. The model was trained for 2 million steps with the Whole Word Masking technique, in which all subword pieces of a word are masked together rather than individually. Both cased and uncased versions are provided to cover different use cases.
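As a quick way to see the vocabulary described above, the tokenizer can be loaded and inspected. The snippet below is a minimal sketch; the exact number it reports may differ slightly from the approximate figure given here.

  from transformers import AutoTokenizer

  # Load the uncased BETO tokenizer from the Hugging Face Hub
  tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")

  # Size of the subword vocabulary (roughly 31,000 entries)
  print(tokenizer.vocab_size)

  # Example: split a Spanish sentence into subword pieces
  print(tokenizer.tokenize("BETO es un modelo de lenguaje entrenado en español."))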
Training
The model was trained using resources provided by Adereso and the Millennium Institute for Foundational Research on Data, with additional support from Google through the TensorFlow Research Cloud program. The trained model was then benchmarked extensively to confirm competitive performance on various Spanish language tasks.
Guide: Running Locally
- Installation: Ensure you have Python and the Hugging Face Transformers library installed.

  pip install transformers
- Model Loading: Load the tokenizer and model using the Transformers library.

  from transformers import AutoModel, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")
  model = AutoModel.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")
- Inference: Use the model for tasks such as fill-mask or sentence classification (see the fill-mask sketch after this list).
- Cloud GPUs: For more intensive tasks, consider using cloud GPU services such as AWS, Google Cloud Platform, or Azure to speed up processing.
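For the inference step above, a fill-mask call through the pipeline API looks roughly like the following. This is a minimal sketch; the example sentence is only illustrative.

  from transformers import pipeline

  # Build a fill-mask pipeline on top of the uncased BETO checkpoint
  fill_mask = pipeline("fill-mask", model="dccuchile/bert-base-spanish-wwm-uncased")

  # BETO uses BERT's [MASK] token; the sentence is an illustrative example
  for prediction in fill_mask("Quiero un café con [MASK]."):
      print(prediction["token_str"], round(prediction["score"], 3))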
License
The work is best described by the CC BY 4.0 license. However, the licensing of some of the source datasets is ambiguous, particularly with respect to commercial use, so users should verify the licenses of the original resources to ensure they meet their requirements.