KcELECTRA-base
Introduction
KcELECTRA is a Korean-specific pretrained ELECTRA model developed to handle noisy, user-generated text such as comments and replies on news articles. Unlike many Korean models trained on well-curated text like Wikipedia and news articles, KcELECTRA is optimized for informal text, including slang and typos.
Architecture
KcELECTRA is based on the ELECTRA architecture, which uses a generator-discriminator model for efficient training. It improves over previous models like KcBERT by expanding the dataset and vocabulary size, leading to better performance on various downstream tasks.
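To make the generator-discriminator idea concrete, the sketch below loads the checkpoint as an ELECTRA discriminator and scores each token as original or replaced. This is a minimal illustration, assuming the published beomi/KcELECTRA-base weights include the discriminator prediction head, and the Korean sample sentence is made up; for typical downstream use, the AutoModel loading shown in the guide below is sufficient.

import torch
from transformers import AutoTokenizer, ElectraForPreTraining

tokenizer = AutoTokenizer.from_pretrained("beomi/KcELECTRA-base")
discriminator = ElectraForPreTraining.from_pretrained("beomi/KcELECTRA-base")

# The discriminator emits one logit per token; positive values mean the token
# is predicted to have been replaced by the generator during pretraining.
inputs = tokenizer("이 영화 진짜 재밌다 ㅋㅋㅋ", return_tensors="pt")
with torch.no_grad():
    logits = discriminator(**inputs).logits

predicted_replaced = (logits > 0).long()  # 1 = predicted replaced, 0 = original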
Training
KcELECTRA was trained from scratch on Korean comments and replies collected from Naver News. The training corpus consists of 17GB of text comprising over 180 million sentences. Preprocessing of the data included retaining emojis, reducing duplicated character strings, and removing inappropriate content. The model was trained on a TPU v3-8 over 10 days and achieved significant performance improvements over existing models.
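As a rough illustration of the duplicate string reduction step, soynlp's repeat_normalize collapses long runs of repeated characters; the sample string below is made up.

from soynlp.normalizer import repeat_normalize

# Collapse repeated characters down to at most two repeats.
print(repeat_normalize("진짜 웃기다ㅋㅋㅋㅋㅋㅋ", num_repeats=2))
# 진짜 웃기다ㅋㅋ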
Guide: Running Locally
Requirements
- PyTorch: ~1.8.0
- Transformers: ~4.11.3
- Emoji: ~0.6.0
- Soynlp: ~0.0.493
Steps
- Install the required libraries:
pip install torch transformers emoji soynlp
- Use KcELECTRA with Hugging Face Transformers:
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("beomi/KcELECTRA-base")
model = AutoModel.from_pretrained("beomi/KcELECTRA-base")
- Preprocessing example (an end-to-end usage sketch follows after these steps):
import re

import emoji
from soynlp.normalizer import repeat_normalize

# Keep spaces, basic punctuation, ASCII, and Korean characters; everything else is stripped.
pattern = re.compile('[^ .,?!/@$%~％·∼()\x00-\x7Fㄱ-ㅣ가-힣]+')
# Match URLs so they can be removed entirely.
url_pattern = re.compile(r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)')

def clean(text):
    text = pattern.sub(' ', text)                 # drop disallowed characters
    text = emoji.replace_emoji(text, replace='')  # remove emojis (replace_emoji needs a recent emoji release)
    text = url_pattern.sub('', text)              # strip URLs
    text = text.strip()
    text = repeat_normalize(text, num_repeats=2)  # collapse repeated characters, e.g. ㅋㅋㅋㅋ -> ㅋㅋ
    return text
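Putting the steps together, the following is a minimal end-to-end sketch that assumes the tokenizer, model, and clean() defined above; the sample comment is made up.

import torch

raw_comment = "이 영화 진짜 재밌다 ㅋㅋㅋㅋㅋ https://example.com"  # hypothetical user comment
text = clean(raw_comment)  # normalize the comment before tokenization

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)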
Cloud GPU Recommendation
For large-scale training or fine-tuning, consider using cloud services like Google Cloud Platform (GCP) that offer access to TPUs, which can significantly speed up the training process.
License
KcELECTRA is licensed under the MIT License, allowing broad use and distribution. Be sure to include proper attribution when using the model in your projects.