BERT-Base-Thai
Introduction
BERT-Base-Thai is a Thai-language adaptation of Google's BERT model, pre-trained specifically on Thai data to improve Thai text representation. Because Thai is written without spaces between words, word segmentation is non-trivial; the model addresses this challenge with a tokenization approach dedicated to Thai text processing.
Architecture
The model is based on the BERT-Base architecture and has been specifically pre-trained on Thai text data. Unlike the original multilingual BERT, BERT-Base-Thai is tailored to handle the intricacies of the Thai language, including unique tokenization and sentence segmentation requirements.
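If the checkpoint is published on the Hugging Face Hub (the id monsoon-nlp/bert-base-thai below is an assumption), the BERT-Base dimensions can be confirmed directly from its configuration:

```python
# Minimal sketch: inspect the model configuration.
# The Hub id "monsoon-nlp/bert-base-thai" is an assumption.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("monsoon-nlp/bert-base-thai")
print(config.num_hidden_layers)    # 12 layers in BERT-Base
print(config.hidden_size)          # 768 hidden units
print(config.num_attention_heads)  # 12 attention heads
```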
Training
Training data was sourced from a Thai Wikipedia dump, with sentence segmentation performed using heuristic methods. The model uses SentencePiece for tokenization, a subword method that does not require prior word segmentation. For pre-training, the data was converted into BERT's required input format and the model was trained on a Tesla K80 GPU for 1 million steps; a snapshot at 0.8 million steps was found to yield better results on downstream tasks.
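As an illustration of the heuristic sentence-segmentation step, here is a minimal sketch using PyThaiNLP's sent_tokenize; the exact heuristics applied to the Wikipedia dump are not documented here, so this is a stand-in rather than the original pipeline:

```python
# Sketch: heuristic Thai sentence segmentation with PyThaiNLP.
# This stands in for the (unspecified) heuristics used on the
# Wikipedia dump; it is not the original preprocessing code.
from pythainlp.tokenize import sent_tokenize

text = "สวัสดีครับ วันนี้อากาศดีมาก เราไปเดินเล่นกันเถอะ"
sentences = sent_tokenize(text)
print(sentences)  # list of sentence strings
```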
Guide: Running Locally
Pre-tokenization
- Install the necessary packages: `pip install pythainlp six sentencepiece python-crfsuite`
- Clone the repository: `git clone https://github.com/ThAIKeras/bert`
- Download the .vocab and .model files from the provided links or the ThAIKeras repository; a quick load check is sketched below.
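A short sanity check can confirm that the downloaded SentencePiece files load and that raw Thai text tokenizes without prior word segmentation. The filenames below follow the naming used in the ThAIKeras repository and are an assumption:

```python
# Sanity check: load the downloaded SentencePiece model and encode raw
# Thai text with no prior word segmentation. Filenames are assumed to
# follow the ThAIKeras repository's naming.
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("th.wiki.bpe.op25000.model")
print("vocab size:", sp.GetPieceSize())
print(sp.EncodeAsPieces("สวัสดีครับ"))  # subword pieces from the raw string
```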
Tokenization Setup
- Initialize the ThaiTokenizer class with the downloaded vocab and model files.
- Tokenize text with the tokenizer, pre-segmenting sentences first if needed, as in the sketch below.
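A minimal setup sketch follows, assuming ThaiTokenizer is importable from the cloned repository's tokenization module and accepts vocab_file and spm_file paths:

```python
# Sketch: set up ThaiTokenizer from the cloned ThAIKeras/bert repo.
# The import path and constructor arguments are assumptions based on
# the repository's tokenization module; run this from inside the repo.
from tokenization import ThaiTokenizer

tokenizer = ThaiTokenizer(
    vocab_file="th.wiki.bpe.op25000.vocab",
    spm_file="th.wiki.bpe.op25000.model",
)

tokens = tokenizer.tokenize("สวัสดีครับ วันนี้อากาศดีมาก")
print(tokens)
```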
Cloud GPUs
For efficient training and evaluation, consider using cloud-based GPU services such as AWS EC2 with Tesla K80 or other high-performance GPU options.
License
The model and its associated code are available under a license that permits free usage and modification, provided that proper attribution is given and any derived work is distributed under the same terms.