tunib/electra-ko-en-base

Introduction

TUNiB-Electra is a set of pre-trained bilingual models combining Korean and English, developed to enhance language processing capabilities across these languages. These models are trained on a substantial corpus of Korean and English text, providing robust performance on various language tasks.

Architecture

The TUNiB-Electra models are based on the ELECTRA architecture, which pre-trains a discriminator to detect replaced tokens rather than predicting masked ones, making pre-training more sample-efficient than standard masked language modeling. Unlike previous monolingual models, they are trained on a balanced corpus of Korean and English.
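As a rough illustration of replaced-token detection, the discriminator can be queried directly through the `ElectraForPreTraining` class in `transformers`. This is a minimal sketch; whether this particular checkpoint ships the pre-training head weights is an assumption, not something the card confirms.

```python
import torch
from transformers import AutoTokenizer, ElectraForPreTraining

# Load the tokenizer and the ELECTRA discriminator.
# Assumption: the checkpoint includes the pre-training (discriminator) head.
tokenizer = AutoTokenizer.from_pretrained("tunib/electra-ko-en-base")
discriminator = ElectraForPreTraining.from_pretrained("tunib/electra-ko-en-base")

inputs = tokenizer("The quick brown fox jumps over the lazy dog.",
                   return_tensors="pt")
with torch.no_grad():
    logits = discriminator(**inputs).logits  # one real-valued score per token

# A positive logit means the discriminator flags that token as "replaced"
predictions = (logits > 0).long().squeeze(0).tolist()
print(predictions)
```

One prediction is produced per input token, including the special tokens added by the tokenizer.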

Training

The models were trained using a large dataset comprising 100 GB of Korean text sourced from blogs, comments, news, and web novels, alongside English texts. This extensive training data enables the models to perform well on both Korean and English language tasks.

Guide: Running Locally

To use the TUNiB-Electra model locally, follow these steps:

  1. Install Transformers Library
    Ensure you have the transformers library installed. You can do this using pip:

    pip install transformers
    
  2. Load the Model and Tokenizer
    Use the following Python code to load the model and tokenizer:

    from transformers import AutoModel, AutoTokenizer

    # Downloads (on first use) and caches the tokenizer and encoder weights
    tokenizer = AutoTokenizer.from_pretrained('tunib/electra-ko-en-base')
    model = AutoModel.from_pretrained('tunib/electra-ko-en-base')
    
  3. Tokenize Text
    Tokenize Korean or English text using the tokenizer:

    tokens = tokenizer.tokenize("Your text here")
    
  4. Cloud GPUs
    For training or deploying models at scale, consider using cloud GPU services such as AWS EC2, Google Cloud Platform, or Azure for efficient processing.
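The steps above can be combined into a single end-to-end sketch. The model name `tunib/electra-ko-en-base` comes from the guide; the sample sentences are purely illustrative:

```python
from transformers import AutoModel, AutoTokenizer

# Load the bilingual tokenizer and ELECTRA encoder (downloads on first run)
tokenizer = AutoTokenizer.from_pretrained("tunib/electra-ko-en-base")
model = AutoModel.from_pretrained("tunib/electra-ko-en-base")

# Batch one Korean and one English sentence together
sentences = ["안녕하세요.", "Hello, world."]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

# last_hidden_state has shape (batch, sequence_length, hidden_size)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```

The resulting per-token hidden states can then be pooled or fed into a task-specific head for downstream fine-tuning.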

License

The TUNiB-Electra models are released under an open license, allowing for widespread use and adaptation in various applications. Make sure to review the specific terms and conditions provided with the model files.
