KLUE BERT Base
Introduction
KLUE BERT Base is a pre-trained BERT model developed specifically for the Korean language as part of the Korean Language Understanding Evaluation (KLUE) Benchmark. It is a transformer-based language model released under the cc-by-sa-4.0 license and is intended to support a wide range of Korean natural language processing tasks.
Architecture
KLUE BERT Base follows the standard BERT Base transformer architecture. It uses a morpheme-based subword tokenizer with a 32k vocabulary, designed to reflect morpheme boundaries without requiring a morphological analyzer at inference time, which improves both usability and speed.
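As a quick way to see the tokenizer in action (assuming the Hugging Face Transformers library is installed; the example sentence below is arbitrary), you can load it and inspect its vocabulary size and subword output:
from transformers import AutoTokenizer

# Load the KLUE BERT Base tokenizer (morpheme-based subword vocabulary)
tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")
print(tokenizer.vocab_size)  # roughly 32k entries

# Tokenize an arbitrary Korean sentence into subword pieces
print(tokenizer.tokenize("한국어 자연어 처리는 재미있다."))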
Training
The model was trained using a diverse set of Korean language corpora, totaling approximately 62GB. The corpora include MODU, CC-100-Kor, NAMUWIKI, NEWSCRAWL, and PETITION. The preprocessing involved filtering noisy and non-Korean text and tokenizing using a new morpheme-based method. The training details are elaborated in the associated research paper.
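The exact filtering rules are described in the paper; purely as an illustration of the idea, a simple Hangul-ratio heuristic such as the following (a hypothetical sketch, not the KLUE preprocessing code) can separate mostly-Korean lines from non-Korean or noisy ones:
import re

HANGUL = re.compile(r"[가-힣]")

def looks_korean(line, min_ratio=0.3):
    # Heuristic: keep lines whose non-whitespace characters are mostly Hangul.
    # min_ratio is an illustrative threshold, not a value used by KLUE.
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return False
    hangul = sum(1 for c in chars if HANGUL.match(c))
    return hangul / len(chars) >= min_ratio

print(looks_korean("한국어 문장입니다."))      # True
print(looks_korean("This line is English."))  # False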
Guide: Running Locally
To run KLUE BERT Base locally, follow these basic steps:
- Install Transformers Library: Ensure you have the Hugging Face Transformers library installed.
pip install transformers
- Load the Model and Tokenizer:
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("klue/bert-base")
tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")
- Use the Model: Apply the model to tasks such as semantic textual similarity or named entity recognition, as defined in the KLUE Benchmark (see the minimal sketch after this guide).
For better performance, consider using cloud GPUs available on platforms like AWS, GCP, or Azure.
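Below is a minimal sketch of step 3, using the raw encoder outputs for a rough sentence-similarity comparison. The [CLS] pooling and the example sentences are illustrative assumptions; for KLUE Benchmark tasks you would normally fine-tune the model on the corresponding dataset.
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("klue/bert-base")
tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")

sentences = ["오늘 날씨가 좋다.", "오늘 날씨가 맑다."]  # illustrative examples
inputs = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Use the [CLS] token embedding as a simple sentence representation
# (a common heuristic; fine-tuning on KLUE-STS would normally be preferred).
cls = outputs.last_hidden_state[:, 0]
similarity = torch.nn.functional.cosine_similarity(cls[0], cls[1], dim=0)
print(similarity.item())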
License
The KLUE BERT Base model is distributed under the Creative Commons Attribution-ShareAlike 4.0 International License (cc-by-sa-4.0).