kcbert-base

beomi

Introduction

KcBERT (Korean Comments BERT) is a BERT model pre-trained on Korean comments and replies collected from Naver News. Informal Korean text of this kind often contains slang, typos, and colloquial expressions rarely found in formal writing, and KcBERT handles these nuances because its pretraining data reflects such real-world conditions.

Architecture

KcBERT is available in two configurations, Base and Large, both using the standard BERT architecture. The Base model has 12 hidden layers, 12 attention heads, and a hidden size of 768, while the Large model has 24 hidden layers, 16 attention heads, and a hidden size of 1024. Both models use a vocabulary size of 30,000.
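These hyperparameters can be read directly from the configuration shipped with each checkpoint. The sketch below uses transformers' AutoConfig; the attribute names are the standard BERT config fields, and the expected values follow the figures above.

    from transformers import AutoConfig

    # Inspect the published configuration of the Base checkpoint; swap in
    # "beomi/kcbert-large" to see the Large configuration.
    config = AutoConfig.from_pretrained("beomi/kcbert-base")
    print(config.num_hidden_layers)    # 12
    print(config.num_attention_heads)  # 12
    print(config.hidden_size)          # 768
    print(config.vocab_size)           # 30000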

Training

KcBERT was trained on comments collected from popular Naver News articles between 2019 and 2020. The dataset was preprocessed to keep Korean characters, English letters, special characters, and emojis. Training was run on a TPU v3-8, taking roughly 2.5 days for the Base model and about 5 days for the Large model. The pretraining loss dropped quickly in the early stages and stabilized after 400k steps.
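The preprocessing described above amounts to character filtering plus normalization of repeated characters, which is common for noisy comment text. The sketch below is only an approximation of such a cleaning step, assuming the emoji==0.6.0 and soynlp packages pinned in the installation step further down; it is not the exact pretraining pipeline.

    import re
    import emoji                                    # emoji==0.6.0 exposes UNICODE_EMOJI
    from soynlp.normalizer import repeat_normalize

    # Keep spaces, basic punctuation, digits, Latin letters, Hangul, and emojis;
    # replace everything else with a space, then collapse character repeats
    # (e.g. "ㅋㅋㅋㅋㅋ" -> "ㅋㅋ").
    emoji_chars = re.escape(''.join(emoji.UNICODE_EMOJI.keys()))
    keep_pattern = re.compile(f'[^ .,?!0-9a-zA-Zㄱ-ㅣ가-힣{emoji_chars}]+')

    def clean(text: str) -> str:
        text = keep_pattern.sub(' ', text)
        text = repeat_normalize(text, num_repeats=2)
        return text.strip()

    print(clean("이 기사 진짜 대박이네ㅋㅋㅋㅋㅋ!! 👍"))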

Guide: Running Locally

To use KcBERT locally, follow these steps:

  1. Install Dependencies:

    pip install transformers==3.0.1 torch==1.8.0 emoji==0.6.0 soynlp==0.0.493
    
  2. Load the Model:

    from transformers import AutoTokenizer, AutoModelWithLMHead
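    # AutoModelWithLMHead loads the checkpoint with its masked-LM pretraining head
    # (valid for the pinned transformers==3.0.1; newer releases use AutoModelForMaskedLM).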
    
    # For Base Model
    tokenizer = AutoTokenizer.from_pretrained("beomi/kcbert-base")
    model = AutoModelWithLMHead.from_pretrained("beomi/kcbert-base")
    
    # For Large Model
    tokenizer = AutoTokenizer.from_pretrained("beomi/kcbert-large")
    model = AutoModelWithLMHead.from_pretrained("beomi/kcbert-large")
    
  3. Fine-tuning and Inference: Fine-tune the model on task-specific datasets for downstream tasks such as sentiment analysis or named entity recognition, or run it directly for masked-token prediction (a minimal fill-mask sketch follows this list).

  4. Cloud GPUs: Consider using cloud services like Google Colab with GPUs or TPUs for training to reduce computation time.
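
As a quick check of step 3, the pretrained masked-LM head can be exercised directly through the fill-mask pipeline. This is a minimal sketch; the example sentence and its completions are purely illustrative.

    from transformers import pipeline

    # Predict the [MASK] token with the Base checkpoint; each result carries the
    # completed sequence and its score.
    fill_mask = pipeline("fill-mask", model="beomi/kcbert-base")
    for result in fill_mask("오늘 날씨가 정말 [MASK]"):
        print(result["sequence"], result["score"])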

License

KcBERT is released under the Apache 2.0 License, allowing free use, modification, and distribution of the model and its components.
