KoBART-base-v2

gogamza

Introduction

KoBART-base-v2 is a Korean encoder-decoder language model based on BART, suitable for tasks such as feature extraction. It was pre-trained on over 40GB of Korean text with a text-infilling objective, in which spans of the input are corrupted and the model learns to reconstruct the original text from the noisy version.
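
As a rough illustration of the text-infilling objective (a hedged sketch; the Korean sentence below is a hypothetical example, not taken from the training data):

    # A contiguous span of tokens is replaced by a single <mask> token and the model
    # is trained to regenerate the original sentence from the corrupted one.
    original  = "40GB 이상의 한국어 텍스트로 학습된 모델입니다."   # hypothetical example sentence
    corrupted = "40GB 이상의 <mask> 모델입니다."                  # span collapsed into one mask token
    # During pre-training the encoder reads `corrupted` and the decoder reconstructs `original`.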

Architecture

The model is an encoder-decoder language model, consisting of:

  • 124M parameters
  • 6 layers for both encoder and decoder
  • 16 attention heads
  • 3072-dimensional feed-forward networks
  • 768 hidden dimensions
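
These hyperparameters can be checked against the published configuration; a minimal sketch using the Transformers BartConfig (the expected values are those listed above):

    from transformers import BartConfig
    
    # Download the configuration for gogamza/kobart-base-v2 and print the key dimensions
    config = BartConfig.from_pretrained('gogamza/kobart-base-v2')
    print(config.encoder_layers, config.decoder_layers)                    # expected: 6 6
    print(config.encoder_attention_heads, config.decoder_attention_heads)  # expected: 16 16
    print(config.encoder_ffn_dim, config.decoder_ffn_dim)                  # expected: 3072 3072
    print(config.d_model)                                                  # expected: 768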

Training

Training Data

The model was trained on a diverse set of Korean-language data, including:

  • 5 million sentences from Korean Wikipedia
  • Additional data from news, books, and the "Modu Corpus v1.0" (conversations, news, etc.)
  • The Blue House National Petition corpus

The vocabulary contains 30,000 tokens, with emojis and emoticons added so that tokens frequent in conversational text are recognized.
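
A quick way to inspect the vocabulary size and see how emoticons are segmented is to load the released tokenizer directly (a sketch; the exact tokenization depends on the published vocabulary):

    from transformers import PreTrainedTokenizerFast
    
    tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-base-v2')
    print(tokenizer.vocab_size)                 # reported as 30,000 above
    print(tokenizer.tokenize("좋아요 ㅎㅎ 😀"))   # see how emoticons and emojis are segmented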

Training Procedure

The training utilized the Character BPE tokenizer from the Hugging Face tokenizers package.
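
For reference, a Character BPE tokenizer of this kind can be trained with the tokenizers package roughly as follows (a hedged sketch; the corpus path and special-token list are placeholders, not the actual KoBART setup):

    from tokenizers import CharBPETokenizer
    
    tokenizer = CharBPETokenizer()
    tokenizer.train(
        files=["corpus.txt"],          # placeholder path to a plain-text Korean corpus
        vocab_size=30000,              # matches the vocabulary size reported above
        special_tokens=["<s>", "</s>", "<pad>", "<unk>", "<mask>"],  # placeholder special tokens
    )
    tokenizer.save("char_bpe_tokenizer.json")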

Guide: Running Locally

To run KoBART-base-v2 locally, follow these steps:

  1. Install dependencies: Make sure you have Python and the Hugging Face Transformers library installed.

    pip install transformers
    
  2. Load the model and tokenizer:

    from transformers import PreTrainedTokenizerFast, BartModel
    
    # Load the KoBART tokenizer and the BART encoder-decoder weights from the Hugging Face Hub
    tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-base-v2')
    model = BartModel.from_pretrained('gogamza/kobart-base-v2')
    
  3. Run the model: Use the tokenizer and model to extract features from Korean text, for example by taking the model's hidden states as sketched below.
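
A minimal feature-extraction sketch that reuses the tokenizer and model loaded in step 2 (the Korean sentence is just an illustrative input):

    import torch
    
    text = "한국어 문장의 특징을 추출합니다."   # hypothetical example sentence
    enc = tokenizer(text, return_tensors='pt')
    
    with torch.no_grad():
        outputs = model(input_ids=enc['input_ids'], attention_mask=enc['attention_mask'])
    
    # Decoder-side features, shape (batch_size, sequence_length, 768);
    # encoder-side features are available as outputs.encoder_last_hidden_state.
    features = outputs.last_hidden_state
    print(features.shape)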

For enhanced performance, especially on large datasets, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.
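
If a CUDA-capable GPU is available, inference can be moved onto it; a minimal sketch that reuses the model and enc objects from the step above:

    import torch
    
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model = model.to(device)
    enc = {k: v.to(device) for k, v in enc.items()}   # move the tokenized inputs to the same device
    
    with torch.no_grad():
        outputs = model(input_ids=enc['input_ids'], attention_mask=enc['attention_mask'])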

License

KoBART-base-v2 is distributed under the MIT License, allowing for wide usage and modification.
