KLUE BERT Base

Introduction

KLUE BERT Base is a pre-trained BERT model developed specifically for the Korean language as part of the Korean Language Understanding Evaluation (KLUE) Benchmark. It is a transformer-based language model licensed under cc-by-sa-4.0. The model aims to facilitate various natural language processing tasks in Korean.

Architecture

KLUE BERT Base follows the BERT Base transformer encoder architecture. Its tokenizer uses a morpheme-based subword tokenization approach with a 32k vocabulary: the subword vocabulary is built to respect morpheme boundaries, so no morphological analyzer is required at inference time, which improves both usability and speed.
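
As a quick illustration, the sketch below loads the tokenizer and splits a short Korean sentence into subwords; it assumes the Hugging Face Transformers library is installed, and the example sentence is arbitrary:

    from transformers import AutoTokenizer

    # Load the 32k morpheme-based subword vocabulary shipped with the model.
    tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")

    # No separate morphological analyzer is needed at inference time.
    print(tokenizer.tokenize("한국어 자연어 처리는 재미있습니다."))
    print(tokenizer.vocab_size)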

Training

The model was trained on a diverse set of Korean corpora totaling approximately 62GB, including MODU, CC-100-Kor, NAMUWIKI, NEWSCRAWL, and PETITION. Preprocessing involved filtering out noisy and non-Korean text and tokenizing with the morpheme-based method described above. Full training details are given in the associated research paper.
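
As a purely illustrative sketch (not the authors' actual preprocessing pipeline), a simple language filter of this kind could keep only documents whose characters are mostly Hangul; the thresholds below are arbitrary placeholders:

    import re

    HANGUL = re.compile(r"[가-힣]")

    def looks_korean(text: str, min_ratio: float = 0.3, min_chars: int = 20) -> bool:
        # Heuristic: keep text that is long enough and mostly Hangul.
        stripped = re.sub(r"\s", "", text)
        if len(stripped) < min_chars:
            return False
        return len(HANGUL.findall(stripped)) / len(stripped) >= min_ratio

    docs = [
        "이 문장은 한국어 문서이므로 전처리 단계에서 그대로 유지됩니다.",
        "This line contains no Korean text and would be filtered out.",
    ]
    print([d for d in docs if looks_korean(d)])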

Guide: Running Locally

To run KLUE BERT Base locally, follow these basic steps:

  1. Install Transformers Library: Ensure you have the Hugging Face Transformers library installed.

    pip install transformers
    
  2. Load the Model and Tokenizer:

    from transformers import AutoModel, AutoTokenizer
    
    model = AutoModel.from_pretrained("klue/bert-base")
    tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")
    
  3. Use the Model: Apply the model to KLUE Benchmark tasks such as semantic textual similarity or named entity recognition, or to masked-token prediction as sketched below.
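
A minimal sketch of masked-token prediction with the fill-mask pipeline follows; the example sentence is illustrative only, and the predicted tokens will depend on the model:

    from transformers import pipeline

    # The checkpoint ships with a masked-language-modeling head, so it works
    # directly with the fill-mask pipeline.
    fill_mask = pipeline("fill-mask", model="klue/bert-base")

    # [MASK] marks the position the model should fill in.
    for prediction in fill_mask("대한민국의 수도는 [MASK] 입니다."):
        print(prediction["token_str"], prediction["score"])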

For better performance, consider using cloud GPUs available on platforms like AWS, GCP, or Azure.

License

The KLUE BERT Base model is distributed under the Creative Commons Attribution-ShareAlike 4.0 International License (cc-by-sa-4.0).
