GKLMIP / BERT-KHMER-BASE-UNCASED-TOKENIZED

Introduction

The BERT-KHMER-BASE-UNCASED-TOKENIZED model is designed for natural language processing tasks in the Khmer language. It is based on the BERT architecture, supports the fill-mask task through the Transformers library with PyTorch, and can be deployed to inference endpoints.

Architecture

This model utilizes the BERT architecture, which is a bidirectional transformer model pre-trained on a large corpus of text. It is specifically adapted for the Khmer language and is uncased, meaning it does not differentiate between uppercase and lowercase text.
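
If the checkpoint is hosted on the Hugging Face Hub (the model ID below is an assumption; a local path to the cloned repository works as well), the architecture can be verified by loading its configuration, as in this minimal sketch:

    # Inspect the checkpoint's configuration to confirm the BERT architecture.
    from transformers import AutoConfig

    config = AutoConfig.from_pretrained("GKLMIP/bert-khmer-base-uncased-tokenized")
    print(config.model_type)         # architecture family, e.g. "bert"
    print(config.num_hidden_layers)  # depth of the transformer encoder
    print(config.hidden_size)        # hidden-state dimension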

Training

Details of the training process are provided in the associated research paper (see the citation under License below). The model was pre-trained on a large corpus of Khmer text to optimize performance on downstream Khmer language tasks.

Guide: Running Locally

To run this model locally, follow these steps:

  1. Clone the repository from GitHub:
    git clone https://github.com/GKLMIP/Pretrained-Models-For-Khmer
  2. Install the necessary dependencies (the transformers and torch packages), preferably in a virtual environment.
  3. Load the model with the Transformers library on the PyTorch backend.
  4. Run tasks such as fill-mask with the loaded model, as shown in the sketch below.
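
A minimal fill-mask sketch covering steps 3 and 4 follows. It assumes the checkpoint is published on the Hugging Face Hub under the ID GKLMIP/bert-khmer-base-uncased-tokenized (pass a local directory instead if you load the weights from the cloned repository), and the Khmer example sentence is purely illustrative:

    # Fill-mask with the Transformers pipeline; the Hub model ID is an
    # assumption, and the "tokenized" variant is assumed to expect
    # word-segmented (space-separated) Khmer input.
    from transformers import pipeline

    fill_mask = pipeline(
        "fill-mask",
        model="GKLMIP/bert-khmer-base-uncased-tokenized",
    )

    # Mask one word of a Khmer sentence ("I ... the Khmer language") and
    # print the model's top completions with their scores.
    sentence = f"ខ្ញុំ {fill_mask.tokenizer.mask_token} ភាសាខ្មែរ"
    for prediction in fill_mask(sentence, top_k=5):
        print(prediction["token_str"], round(prediction["score"], 4))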

For optimal performance, especially during training or intensive inference tasks, consider using cloud GPUs provided by platforms such as AWS, Google Cloud, or Azure.

License

The model and its associated resources are available under the terms specified in the GitHub repository. Users are encouraged to cite the research paper if the model is used in academic or professional projects:

@article{jiang2021khmer,
  author  = "Jiang, Shengyi and Fu, Sihui and Lin, Nankai and Fu, Yingwen",
  title   = "Pre-trained Models and Evaluation Data for the Khmer Language",
  year    = "2021",
  journal = "Tsinghua Science and Technology",
}
