BERT-Base-Thai

monsoon-nlp

Introduction

BERT-Base-Thai is a Thai-language adaptation of Google's BERT model, pre-trained specifically on Thai data to improve text representation. It addresses the challenge of Thai word segmentation (Thai is written without spaces between words), offering a dedicated solution for Thai text processing.

Architecture

The model follows the BERT-Base architecture and was pre-trained specifically on Thai text data. Unlike the original multilingual BERT, BERT-Base-Thai is tailored to the specifics of the Thai language, including its tokenization and sentence segmentation requirements.
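
For quick feature extraction, the checkpoint can be loaded as a standard BERT encoder. A minimal sketch, assuming the model is published on the Hugging Face Hub as monsoon-nlp/bert-base-thai and that its bundled tokenizer loads through AutoTokenizer; if it does not, use the ThaiTokenizer setup under "Guide: Running Locally" below.

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Hub id taken from this card's header; tokenizer compatibility is an assumption.
    MODEL_ID = "monsoon-nlp/bert-base-thai"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModel.from_pretrained(MODEL_ID)

    # Encode a Thai sentence and extract contextual token embeddings.
    inputs = tokenizer("โรงเรียนของเราน่าอยู่", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    features = outputs.last_hidden_state  # shape: (1, seq_len, 768)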

Training

Training data was sourced from a Thai Wikipedia dump, with sentences segmented heuristically. Tokenization uses SentencePiece, which operates on raw text and therefore requires no prior word segmentation. The corpus was then converted into BERT's standard pre-training input format, and the model was trained on a Tesla K80 GPU for 1 million steps; a snapshot at 0.8 million steps was found to give better results on downstream tasks.
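
The preprocessing pipeline is not fully specified in this card, but the two ideas above can be illustrated in a few lines: pythainlp provides a heuristic sentence splitter, and a trained SentencePiece model encodes raw Thai text directly. This is a sketch only; the file name follows the BPEmb-style naming used in the ThAIKeras repository and may differ from your download.

    import sentencepiece as spm
    from pythainlp.tokenize import sent_tokenize

    # Heuristic sentence segmentation (illustrative; the original corpus
    # preparation may have used different heuristics).
    sentences = sent_tokenize("ข้อความภาษาไทยที่ต้องการแบ่งเป็นประโยค")

    # SentencePiece encodes each raw sentence directly -- no word
    # segmentation step is needed beforehand.
    sp = spm.SentencePieceProcessor()
    sp.Load("th.wiki.bpe.op25000.model")  # assumed filename; use your download
    for sentence in sentences:
        print(sp.EncodeAsPieces(sentence))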

Guide: Running Locally

Pre-tokenization

  1. Install the required packages:
    pip install pythainlp six sentencepiece python-crfsuite
    
  2. Clone the repository:
    git clone https://github.com/ThAIKeras/bert
    
  3. Download the .vocab and .model files from the provided links or the ThAIKeras repository.

Tokenization Setup

  1. Initialize the ThaiTokenizer class with the downloaded vocab and model files.
  2. Tokenize text with the tokenizer, pre-segmenting sentences first if needed (see the sketch below).
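
A minimal sketch of the two steps above, assuming ThaiTokenizer is importable from the cloned ThAIKeras repository (the module path below is hypothetical; adjust it to your checkout) and using BPEmb-style file names as placeholders for the downloaded files:

    # ThaiTokenizer lives in the cloned ThAIKeras/bert repository;
    # the import path here is a placeholder for wherever the class is defined.
    from bert.th_tokenization import ThaiTokenizer

    tokenizer = ThaiTokenizer(vocab_file="th.wiki.bpe.op25000.vocab",
                              spm_file="th.wiki.bpe.op25000.model")

    # Pre-segment text into sentences first if it contains more than one.
    tokens = tokenizer.tokenize("โรงเรียนของเราน่าอยู่")
    ids = tokenizer.convert_tokens_to_ids(tokens)  # assumed helper, mirroring BERT tokenizers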

Cloud GPUs

For efficient training and evaluation, consider a cloud GPU service such as an AWS EC2 instance with a Tesla K80 (matching the original training setup) or a more recent high-performance GPU.

License

The model and its associated code are released under a license that permits free use and modification, provided proper attribution is given and any derived work is distributed under the same terms.
