MyanBERTa

UCSYNLP

Introduction

MyanBERTa is a BERT-based pre-trained language model specifically designed for the Myanmar language. It was developed by UCSYNLP and trained on a large dataset of word-segmented Myanmar text.

Architecture

The model is based on the BERT architecture and uses a byte-level BPE tokenizer. The tokenizer was trained on word-segmented Myanmar text and produces a vocabulary of 30,522 subword units. MyanBERTa was pre-trained on a dataset of over 5.9 million sentences and 136 million words.
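
As a brief illustration of how the tokenizer is used, the sketch below loads it with the Transformers library and splits a short Myanmar phrase into byte-level BPE subwords. The Hub identifier UCSYNLP/MyanBERTa and the example phrase are assumptions for illustration; substitute the actual repository name if it differs.

    from transformers import AutoTokenizer

    # Assumed Hugging Face Hub identifier; adjust if the model is hosted under a different name.
    MODEL_ID = "UCSYNLP/MyanBERTa"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    # The vocabulary size should reflect the 30,522 subword units described above.
    print(tokenizer.vocab_size)

    # Split a short word-segmented Myanmar phrase into byte-level BPE subwords.
    tokens = tokenizer.tokenize("မြန်မာ ဘာသာ")
    print(tokens)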

Training

MyanBERTa was pre-trained for 528,000 steps on word-segmented Myanmar sentences sourced from web crawls and existing corpora. The model supports tasks such as fill-mask and is compatible with PyTorch and the Hugging Face Transformers library.

Guide: Running Locally

  1. Setup Environment: Install Python and ensure you have access to a package manager like pip.
  2. Install Dependencies: Run pip install torch transformers to install PyTorch and Hugging Face Transformers.
  3. Download Model: Access the model through Hugging Face's Model Hub and download it to your local environment.
  4. Load Model: Use the Transformers library to load the model and tokenizer in your script.
  5. Inference: Run inference tasks with the model, such as fill-mask (see the sketch after this list).
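
The following minimal sketch ties these steps together, assuming the model is available on the Hugging Face Hub under the identifier UCSYNLP/MyanBERTa and that its tokenizer defines a mask token; the Myanmar example sentence is illustrative only.

    from transformers import pipeline

    # Assumed Hub identifier; replace with the actual repository name if it differs.
    fill_mask = pipeline("fill-mask", model="UCSYNLP/MyanBERTa")

    # Replace one position in a word-segmented Myanmar sentence with the
    # tokenizer's mask token and ask the model for the most likely completions.
    masked_sentence = f"မြန်မာ {fill_mask.tokenizer.mask_token} ဘာသာ"
    for prediction in fill_mask(masked_sentence):
        print(prediction["token_str"], prediction["score"])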

Suggested Cloud GPUs

Consider using cloud-based GPU services like AWS EC2, Google Cloud AI Platform, or Azure Machine Learning to leverage powerful hardware for faster training and inference.

License

MyanBERTa is released under the Apache License 2.0, allowing for both personal and commercial use while requiring proper attribution.
