KR-BERT char16424
snunlp
Introduction
KR-BERT is a Korean-specific, small-scale BERT model developed by the Computational Linguistics Lab at Seoul National University. It offers performance comparable to or better than existing models such as Multilingual BERT, KorBERT, and KoBERT. The model is detailed in the paper "KR-BERT: A Small-Scale Korean-Specific Language Model."
Architecture
KR-BERT is available in two forms: a character-based model and a sub-character-based model. It uses a BidirectionalWordPiece tokenizer, which considers both forward and backward matching directions when segmenting text. This is important for Korean because Hangul syllable characters can themselves be decomposed into sub-characters (graphemes).
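Because this syllable-to-grapheme decomposition is canonical in Unicode, the sub-character view can be previewed with NFD normalization in Python. The sketch below only illustrates the idea; it is not the model's own BidirectionalWordPiece tokenization code.

```python
import unicodedata

def to_subchars(text: str) -> str:
    """Decompose Hangul syllable blocks into their constituent jamo.

    NFD (canonical decomposition) splits a precomposed syllable such as
    '한' into the conjoining graphemes 'ᄒ' + 'ᅡ' + 'ᆫ'; non-Hangul
    characters pass through unchanged.
    """
    return unicodedata.normalize("NFD", text)

print(list(to_subchars("한국어")))
# ['ᄒ', 'ᅡ', 'ᆫ', 'ᄀ', 'ᅮ', 'ᆨ', 'ᄋ', 'ᅥ']
```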
Training
The KR-BERT models were trained with different vocabularies and parameter counts. The character-based model has a vocabulary of 16,424 tokens and 99,265,066 parameters. The sub-character model decomposes Korean text into graphemes before training, allowing it to learn from smaller meaningful units. Both models were evaluated on tasks such as the NAVER Sentiment Movie Corpus (NSMC), where they achieved competitive accuracy, demonstrating the effectiveness of both the character and sub-character approaches.
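These figures can be sanity-checked by loading the character model with the Hugging Face transformers library, assuming the checkpoint is published on the Hub under snunlp/KR-BERT-char16424 (the id suggested by this card's title). Note that Hub ids of this form require a newer transformers release than the 2.1.1 pinned in the guide below.

```python
from transformers import BertTokenizer, BertModel

# Hub id taken from this card's title; treat it as an assumption and
# check the snunlp repository if loading fails.
tokenizer = BertTokenizer.from_pretrained("snunlp/KR-BERT-char16424")
model = BertModel.from_pretrained("snunlp/KR-BERT-char16424")

print(tokenizer.vocab_size)  # expected: 16424 for the character-based model
print(sum(p.numel() for p in model.parameters()))  # roughly 99M parameters
```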
Guide: Running Locally
Basic Steps
- Install Dependencies: Ensure you have Python and the necessary libraries installed, including `transformers==2.1.1` and `tensorflow<2.0`.
- Download Models: Obtain the pre-trained models and place them in the appropriate directories (`models` for TensorFlow and `pretrained` for PyTorch).
- Preprocess Data: Use the provided code for tokenizing text into sub-characters if using the sub-character model (see the sketch after this list).
- Train and Evaluate:
  - For PyTorch: Run `python3 train.py --subchar {True, False} --tokenizer {bert, ranked}`.
  - For TensorFlow: Use `python3 run_classifier.py` with specified arguments for tasks like NSMC.
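For the preprocessing step above, a minimal sketch of sub-character decomposition for NSMC-style data follows. The TSV layout (id, document, label, with a header row) and the file handling are illustrative assumptions; prefer the preprocessing code that ships with the repository.

```python
import csv
import unicodedata

def preprocess_nsmc(in_path: str, out_path: str) -> None:
    """Decompose review text into sub-characters for an NSMC-style TSV.

    Assumes an (id, document, label) column layout with a header row;
    both are illustrative assumptions, not the repository's format.
    """
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8", newline="") as fout:
        reader = csv.reader(fin, delimiter="\t")
        writer = csv.writer(fout, delimiter="\t")
        writer.writerow(next(reader))  # copy the header row unchanged
        for row_id, document, label in reader:
            # NFD splits each Hangul syllable into its conjoining jamo.
            writer.writerow([row_id, unicodedata.normalize("NFD", document), label])
```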
Cloud GPUs
For efficient training and evaluation, consider using cloud-based GPU services, such as AWS EC2, Google Cloud Platform, or Azure, which can significantly speed up computation.
License
The KR-BERT models and associated code are available for use, but specific licensing details are not provided in the documentation. Please refer to the model's repository or contact the authors for further information.