BART Base Cantonese
Introduction
The BART-BASE-CANTONESE model is a Cantonese adaptation of the BART base model, developed by Ayaka through second-stage pre-training on the LIHKG dataset. It starts from the fnlp/bart-base-chinese model, and training was supported by Google's TPU Research Cloud.
Architecture
The architecture is based on BART (Bidirectional and Auto-Regressive Transformers), adapted to Cantonese through additional pre-training. The model is intended for fill-mask tasks and uses a BERT-style tokenizer (loaded via BertTokenizer in the guide below) to process Cantonese text.
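As a quick illustration of the tokenizer, the snippet below loads it through BertTokenizer (the same class used in the guide further down) and tokenizes a Cantonese sentence; the character-level split mentioned in the comment is the typical behaviour of Chinese BERT-style tokenizers, stated here as an assumption rather than a documented property of this model.
from transformers import BertTokenizer
# The checkpoint name matches the one used in the guide below.
tokenizer = BertTokenizer.from_pretrained('Ayaka/bart-base-cantonese')
# Chinese BERT-style tokenizers typically split text into single characters,
# so the sentence is expected to become per-character tokens with [MASK] kept intact.
print(tokenizer.tokenize('我激動到[MASK]唔着'))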
Training
- Optimizer: Stochastic Gradient Descent (SGD) with a learning rate of 0.03 and Adaptive Gradient Clipping at 0.1 (a sketch of this setup follows the list).
- Dataset: 172,937,863 sentences, padded or truncated to 64 tokens each.
- Batch Size: 640
- Epochs: 7 epochs with an additional 61,440 steps.
- Training Time: 44 hours on a Google Cloud TPU v4-16 instance.
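Adaptive Gradient Clipping is not built into PyTorch, and the exact training code is not given here, so the following is only a minimal sketch of how the listed hyperparameters (SGD at 0.03, AGC at 0.1, 64-token sequences) could fit together. The parameter-wise adaptive_gradient_clip helper, the toy sentences, and the reconstruction-style loss are illustrative assumptions, not the author's pipeline.
import torch
from transformers import BertTokenizer, BartForConditionalGeneration

tokenizer = BertTokenizer.from_pretrained('Ayaka/bart-base-cantonese')
model = BartForConditionalGeneration.from_pretrained('Ayaka/bart-base-cantonese')

def adaptive_gradient_clip(params, clip_factor=0.1, eps=1e-3):
    # Parameter-wise AGC: cap each gradient norm at clip_factor times the parameter norm.
    for p in params:
        if p.grad is None:
            continue
        w_norm = p.detach().norm().clamp(min=eps)
        g_norm = p.grad.detach().norm()
        max_norm = clip_factor * w_norm
        if g_norm > max_norm:
            p.grad.mul_(max_norm / (g_norm + 1e-6))

optimizer = torch.optim.SGD(model.parameters(), lr=0.03)

# Toy batch: sentences padded or truncated to 64 tokens, mirroring the dataset setup above.
batch = tokenizer(['聽日就要返香港', '我好開心'],
                  padding='max_length', truncation=True, max_length=64,
                  return_tensors='pt')
labels = batch['input_ids'].clone()
labels[labels == tokenizer.pad_token_id] = -100  # ignore padding positions in the loss

outputs = model(input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask'],
                labels=labels)
outputs.loss.backward()
adaptive_gradient_clip(model.parameters(), clip_factor=0.1)
optimizer.step()
optimizer.zero_grad()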
Guide: Running Locally
- Installation: Ensure you have PyTorch and the Transformers library installed.
- Import Model:
from transformers import BertTokenizer, BartForConditionalGeneration, Text2TextGenerationPipeline
tokenizer = BertTokenizer.from_pretrained('Ayaka/bart-base-cantonese')
model = BartForConditionalGeneration.from_pretrained('Ayaka/bart-base-cantonese')
- Generate Text:
text2text_generator = Text2TextGenerationPipeline(model, tokenizer)
output = text2text_generator('聽日就要返香港,我激動到[MASK]唔着', max_length=50, do_sample=False)
# The tokenizer emits space-separated characters, so strip the spaces for readable output.
print(output[0]['generated_text'].replace(' ', ''))
- Cloud GPUs: For optimal performance, consider using cloud GPUs from providers such as Google Cloud or AWS; the sketch below shows one way to place the pipeline on a GPU.
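If a GPU is available, the pipeline from the guide can be placed on it through the standard device argument of Transformers pipelines; the sketch below assumes a single-GPU machine (device index 0) and falls back to CPU otherwise.
import torch
from transformers import BertTokenizer, BartForConditionalGeneration, Text2TextGenerationPipeline

tokenizer = BertTokenizer.from_pretrained('Ayaka/bart-base-cantonese')
model = BartForConditionalGeneration.from_pretrained('Ayaka/bart-base-cantonese')

# Use the first CUDA device when present, otherwise run on CPU (device=-1).
device = 0 if torch.cuda.is_available() else -1
text2text_generator = Text2TextGenerationPipeline(model, tokenizer, device=device)

output = text2text_generator('聽日就要返香港,我激動到[MASK]唔着', max_length=50, do_sample=False)
print(output[0]['generated_text'].replace(' ', ''))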
License
No license is specified for the model. Users should avoid applying it for any purpose that might infringe copyright.