tunib/electra-ko-en-base

Introduction
TUNiB-Electra is a set of pre-trained bilingual Korean–English models developed to support language processing in both languages. The models are trained on a substantial corpus of Korean and English text, giving them robust performance across a variety of language tasks.
Architecture
The TUNiB-Electra models are based on the ELECTRA architecture, designed for efficient language representation learning. They are bilingual and leverage a balanced corpus of Korean and English, distinguishing them from previous monolingual models.
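Because ELECTRA checkpoints are pre-trained as discriminators (replaced-token detection), the model can be loaded with the discriminator head to score whether each token looks "replaced" by a generator. This is a minimal sketch, assuming the published checkpoint includes the discriminator head, as is typical for ELECTRA releases on the Hugging Face Hub:

```python
# Sketch of ELECTRA's replaced-token-detection objective. Assumes the
# tunib/electra-ko-en-base checkpoint ships the discriminator head
# (typical for ELECTRA base checkpoints on the Hub).
import torch
from transformers import AutoTokenizer, ElectraForPreTraining

tokenizer = AutoTokenizer.from_pretrained("tunib/electra-ko-en-base")
model = ElectraForPreTraining.from_pretrained("tunib/electra-ko-en-base")

inputs = tokenizer("The quick brown fox jumps over the lazy dog",
                   return_tensors="pt")
with torch.no_grad():
    # One logit per token; a positive logit suggests "replaced"
    logits = model(**inputs).logits

predictions = (logits > 0).long().squeeze(0).tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(list(zip(tokens, predictions)))
```

For most downstream uses (classification, feature extraction) the plain encoder loaded via `AutoModel`, as shown in the guide below, is what you want; the discriminator head is mainly of interest for inspecting the pre-training objective.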
Training
The models were trained using a large dataset comprising 100 GB of Korean text sourced from blogs, comments, news, and web novels, alongside English texts. This extensive training data enables the models to perform well on both Korean and English language tasks.
Guide: Running Locally
To use the TUNiB-Electra model locally, follow these steps:
- Install the Transformers Library

  Ensure you have the `transformers` library installed. You can do this using pip:

  ```bash
  pip install transformers
  ```
- Load the Model and Tokenizer

  Use the following Python code to load the model and tokenizer:

  ```python
  from transformers import AutoModel, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained('tunib/electra-ko-en-base')
  model = AutoModel.from_pretrained('tunib/electra-ko-en-base')
  ```
- Tokenize Text

  Tokenize Korean or English text using the tokenizer:

  ```python
  tokens = tokenizer.tokenize("Your text here")
  ```
- Cloud GPUs

  For training or deploying models at scale, consider using cloud GPU services such as AWS EC2, Google Cloud Platform, or Azure for efficient processing.
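The steps above can be combined into a short end-to-end feature-extraction sketch that also moves the model to a GPU when one is available. The example sentence is arbitrary:

```python
# End-to-end sketch: load the model, run a forward pass, and read out
# contextual token embeddings. Uses a GPU if one is available.
import torch
from transformers import AutoModel, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("tunib/electra-ko-en-base")
model = AutoModel.from_pretrained("tunib/electra-ko-en-base").to(device).eval()

# Korean or English input both work, since the tokenizer covers both.
inputs = tokenizer("안녕하세요. Nice to meet you.", return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

# Contextual embeddings, shape: (batch, seq_len, hidden_size)
embeddings = outputs.last_hidden_state
print(embeddings.shape)
```

Pooling these token embeddings (e.g. mean pooling, or taking the `[CLS]` position) gives a sentence-level vector suitable for downstream classifiers.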
License
The TUNiB-Electra models are released under an open license, allowing for widespread use and adaptation in various applications. Make sure to review the specific terms and conditions provided with the model files.