KoBART-base-v2

gogamza

Introduction

KoBART-base-v2 is a Korean encoder-decoder language model based on BART, suitable for tasks such as feature extraction. It was pre-trained on over 40GB of Korean text with a text-infilling objective, in which spans of the input are corrupted and the model learns to reconstruct the original text from the noisy version.
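
As a rough illustration of the text-infilling objective (a hedged sketch; the Korean sentence below is a hypothetical example, not taken from the training data):

    # A contiguous span of tokens is replaced by a single <mask> token and the model
    # is trained to regenerate the original sentence from the corrupted one.
    original  = "40GB 이상의 한국어 텍스트로 학습된 모델입니다."   # hypothetical example sentence
    corrupted = "40GB 이상의 <mask> 모델입니다."                  # span collapsed into one mask token
    # During pre-training the encoder reads `corrupted` and the decoder reconstructs `original`.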

Architecture

The model is an encoder-decoder language model, consisting of:

  • 124M parameters
  • 6 layers for both encoder and decoder
  • 16 attention heads
  • 3072-dimensional feed-forward networks
  • 768 hidden dimensions
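
These hyperparameters can be checked against the published configuration; a minimal sketch using the Transformers BartConfig (the expected values are those listed above):

    from transformers import BartConfig
    
    # Download the configuration for gogamza/kobart-base-v2 and print the key dimensions
    config = BartConfig.from_pretrained('gogamza/kobart-base-v2')
    print(config.encoder_layers, config.decoder_layers)                    # expected: 6 6
    print(config.encoder_attention_heads, config.decoder_attention_heads)  # expected: 16 16
    print(config.encoder_ffn_dim, config.decoder_ffn_dim)                  # expected: 3072 3072
    print(config.d_model)                                                  # expected: 768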

Training

Training Data

The model was trained on a diverse set of Korean-language data, including:

  • 5 million sentences from Korean Wikipedia
  • Additional data from news, books, and the "Modu Corpus v1.0" (conversations, news, etc.)
  • The Blue House National Petition corpus

The vocabulary contains 30,000 tokens, with emojis and emoticons added so that tokens frequent in conversational text are recognized.
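
A quick way to inspect the vocabulary size and see how emoticons are segmented is to load the released tokenizer directly (a sketch; the exact tokenization depends on the published vocabulary):

    from transformers import PreTrainedTokenizerFast
    
    tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-base-v2')
    print(tokenizer.vocab_size)                 # reported as 30,000 above
    print(tokenizer.tokenize("좋아요 ㅎㅎ 😀"))   # see how emoticons and emojis are segmented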

Training Procedure

The training utilized the Character BPE tokenizer from the Hugging Face tokenizers package.
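
For reference, a Character BPE tokenizer of this kind can be trained with the tokenizers package roughly as follows (a hedged sketch; the corpus path and special-token list are placeholders, not the actual KoBART setup):

    from tokenizers import CharBPETokenizer
    
    tokenizer = CharBPETokenizer()
    tokenizer.train(
        files=["corpus.txt"],          # placeholder path to a plain-text Korean corpus
        vocab_size=30000,              # matches the vocabulary size reported above
        special_tokens=["<s>", "</s>", "<pad>", "<unk>", "<mask>"],  # placeholder special tokens
    )
    tokenizer.save("char_bpe_tokenizer.json")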

Guide: Running Locally

To run KoBART-base-v2 locally, follow these steps:

  1. Install dependencies: Make sure you have Python and the Hugging Face Transformers library installed.

    pip install transformers
    
  2. Load the model and tokenizer:

    from transformers import PreTrainedTokenizerFast, BartModel
    
    # Load the KoBART tokenizer and the BART encoder-decoder weights from the Hugging Face Hub
    tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-base-v2')
    model = BartModel.from_pretrained('gogamza/kobart-base-v2')
    
  3. Run the model: Use the tokenizer and model to extract features from Korean text, for example by taking the model's hidden states as sketched below.
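
A minimal feature-extraction sketch that reuses the tokenizer and model loaded in step 2 (the Korean sentence is just an illustrative input):

    import torch
    
    text = "한국어 문장의 특징을 추출합니다."   # hypothetical example sentence
    enc = tokenizer(text, return_tensors='pt')
    
    with torch.no_grad():
        outputs = model(input_ids=enc['input_ids'], attention_mask=enc['attention_mask'])
    
    # Decoder-side features, shape (batch_size, sequence_length, 768);
    # encoder-side features are available as outputs.encoder_last_hidden_state.
    features = outputs.last_hidden_state
    print(features.shape)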

For enhanced performance, especially on large datasets, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.
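
If a CUDA-capable GPU is available, inference can be moved onto it; a minimal sketch that reuses the model and enc objects from the step above:

    import torch
    
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model = model.to(device)
    enc = {k: v.to(device) for k, v in enc.items()}   # move the tokenized inputs to the same device
    
    with torch.no_grad():
        outputs = model(input_ids=enc['input_ids'], attention_mask=enc['attention_mask'])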

License

KoBART-base-v2 is distributed under the MIT License, allowing for wide usage and modification.
