MyanBERTa
Introduction
MyanBERTa is a BERT-based pre-trained language model specifically designed for the Myanmar language. It was developed by UCSYNLP and trained on a large dataset of word-segmented Myanmar text.
Architecture
The model is based on the BERT architecture and uses a byte-level BPE tokenizer. The tokenizer's vocabulary comprises 30,522 subword units learned from word-segmented Myanmar text. MyanBERTa was pre-trained on a dataset of over 5.9 million sentences and 136 million words.
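A quick way to inspect the tokenizer is to load it through the Transformers library. The following is a minimal sketch, assuming the checkpoint is published on the Hugging Face Hub as UCSYNLP/MyanBERTa; substitute the actual repository id if it differs.

```python
from transformers import AutoTokenizer

# Assumed Hub repository id; replace it if the checkpoint is hosted under a different name.
tokenizer = AutoTokenizer.from_pretrained("UCSYNLP/MyanBERTa")

# Vocabulary size should be on the order of the 30,522 subword units noted above.
print(tokenizer.vocab_size)

# Byte-level BPE segmentation of a short, word-segmented Myanmar phrase (placeholder text).
print(tokenizer.tokenize("မြန်မာ ဘာသာ"))
```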
Training
MyanBERTa was pre-trained for 528,000 steps on word-segmented Myanmar sentences sourced from various web and corpus datasets. The model supports tasks such as fill-mask and is compatible with PyTorch and the Hugging Face Transformers library.
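The fill-mask task can be exercised directly through the Transformers pipeline API. This is a minimal sketch, not an official example: it assumes the Hub id UCSYNLP/MyanBERTa, and the masked sentence is only an illustrative placeholder.

```python
from transformers import pipeline

# Assumed Hub repository id; adjust if the model is published under a different name.
fill_mask = pipeline("fill-mask", model="UCSYNLP/MyanBERTa")

# Build the input with the tokenizer's own mask token so the example works
# whichever mask string ("[MASK]" or "<mask>") the checkpoint uses.
text = f"မြန်မာ {fill_mask.tokenizer.mask_token} ဘာသာစကား"
for prediction in fill_mask(text):
    print(prediction["token_str"], round(prediction["score"], 4))
```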
Guide: Running Locally
- Setup Environment: Install Python and ensure you have access to a package manager such as `pip`.
- Install Dependencies: Run `pip install torch transformers` to install PyTorch and Hugging Face Transformers.
- Download Model: Access the model through Hugging Face's Model Hub and download it to your local environment.
- Load Model: Use the Transformers library to load the model and tokenizer in your script.
- Inference: Execute inference tasks using the model, such as fill-mask (see the sketch after this list).
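The steps above can be combined into a short script. The following is a sketch under stated assumptions rather than an official example: it assumes the Hub id UCSYNLP/MyanBERTa and uses a placeholder word-segmented Myanmar sentence.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "UCSYNLP/MyanBERTa"  # assumed Hub repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
model.eval()

# Placeholder word-segmented Myanmar sentence with one masked position.
text = f"ဒီ စာအုပ် က {tokenizer.mask_token} ပါ"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Find the masked position and print the top-5 candidate subwords.
mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_positions[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```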
Suggested Cloud GPUs
Consider using cloud GPU services such as AWS EC2, Google Cloud AI Platform, or Azure Machine Learning for faster training and inference.
License
MyanBERTa is released under the Apache License 2.0, allowing for both personal and commercial use while requiring proper attribution.