KcELECTRA base v2022
beomi
Introduction
KcELECTRA is a Korean language model designed for noisy, user-generated text, such as comments and replies collected from Naver News. It improves on previous models like KcBERT by expanding both the training dataset and the vocabulary. The model can be loaded by name through Hugging Face's Transformers library, with no separate manual download of model files.
Architecture
KcELECTRA is based on the ELECTRA architecture and is specifically trained to handle noisy text with frequent colloquial expressions and typos. The model was trained from scratch using a unique tokenizer and a large dataset of comments and replies. It is optimized for tasks involving Korean text, outperforming many existing Korean language models.
Training
The model was trained on comments and replies collected from 2019 to early 2021, totaling approximately 17.3 GB of text. The text was cleaned to retain Korean, English, special characters, and emojis. Preprocessing steps included removing duplicates, filtering out short responses, and handling profanity. The tokenizer was trained with BertWordPieceTokenizer at a vocabulary size of 30,000, and model training took about 10 days on a TPU v3-8.
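The deduplication and short-response filtering steps described above can be sketched as follows. This is a minimal illustration, not the author's actual pipeline: the function name `filter_comments` and the `min_chars` threshold of 10 are hypothetical, and profanity handling is left out because the source does not describe how it was done.

```python
from typing import Iterable, List


def filter_comments(comments: Iterable[str], min_chars: int = 10) -> List[str]:
    """Drop exact duplicates and very short comments, keeping first-seen order.

    `min_chars` is an assumed threshold chosen for illustration only.
    """
    seen = set()
    kept = []
    for comment in comments:
        text = comment.strip()
        if len(text) < min_chars:  # too short to keep
            continue
        if text in seen:  # exact duplicate of an earlier comment
            continue
        seen.add(text)
        kept.append(text)
    return kept


sample = [
    "이 기사 정말 유익하네요 감사합니다",
    "ㅋㅋ",
    "이 기사 정말 유익하네요 감사합니다",
    "댓글 분위기가 생각보다 차분하네요",
]
print(filter_comments(sample))
```

On this sample, the two-character comment and the repeated comment are dropped, leaving the first and last entries.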
Guide: Running Locally
- Requirements:
  - Install pytorch ~= 1.8.0
  - Install transformers ~= 4.11.3
  - Install emoji ~= 0.6.0
  - Install soynlp ~= 0.0.493
- Model Usage:

      from transformers import AutoTokenizer, AutoModel

      tokenizer = AutoTokenizer.from_pretrained("beomi/KcELECTRA-base")
      model = AutoModel.from_pretrained("beomi/KcELECTRA-base")
- Preprocessing: Use the provided clean function to preprocess text data for better performance:

      import re
      import emoji
      from soynlp.normalizer import repeat_normalize

      def clean(x):
          # Replace every character outside the allowed set with a space
          pattern = re.compile(f'[^ .,?!/@$%~%·∼()\x00-\x7Fㄱ-ㅣ가-힣]+')
          url_pattern = re.compile(
              r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)')
          x = pattern.sub(' ', x)
          # Remove emojis (needs an emoji release that provides replace_emoji; 0.6.0 does not)
          x = emoji.replace_emoji(x, replace='')
          x = url_pattern.sub('', x)  # Remove URLs
          x = x.strip()
          x = repeat_normalize(x, num_repeats=2)  # Shorten repeated characters
          return x
- Suggested Hardware: Use cloud GPUs for better performance, such as those available through Google Cloud Platform.
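The loading step in the guide can be extended into a small helper that turns sentences into fixed-size vectors. The sketch below is one possible way to do this, not a method prescribed by the model card: the helper name `embed`, the `max_length` of 128, and the choice of the first-token ([CLS]) vector as a sentence representation are all assumptions made for illustration.

```python
from typing import List


def embed(texts: List[str], model_name: str = "beomi/KcELECTRA-base"):
    """Return one first-token ([CLS]) vector per input text (hypothetical helper)."""
    # Imported lazily so the heavy dependencies load only when the helper is called.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # disable dropout for inference

    # max_length=128 is an assumed cap for short comments, not a documented setting.
    batch = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors="pt",
    )
    with torch.no_grad():  # no gradients needed for feature extraction
        outputs = model(**batch)

    # last_hidden_state has shape (batch, seq_len, hidden_size); position 0 is [CLS].
    return outputs.last_hidden_state[:, 0, :]
```

Calling `embed(["첫 번째 댓글", "두 번째 댓글"])` downloads the weights on first use and should return a tensor with one row per input. Text passed in would normally go through the clean function from the Preprocessing step first.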
License
The KcELECTRA model is licensed under the MIT License. For academic citations, please refer to the provided citation format.