BART Base Cantonese
Introduction
The BART-BASE-CANTONESE model is a Cantonese adaptation of the BART base model, developed by Ayaka through second-stage pre-training on the LIHKG dataset. It starts from the fnlp/bart-base-chinese model, and training was supported by Google's TPU Research Cloud.
Architecture
The architecture is based on BART (Bidirectional and Auto-Regressive Transformers), adapted to Cantonese through additional pre-training. The model is intended for fill-mask tasks and uses a BERT-style tokenizer (loaded via BertTokenizer in the guide below) to process Cantonese text.
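As a quick illustration of the tokenizer, the snippet below loads it through BertTokenizer (the same class used in the guide further down) and tokenizes a Cantonese sentence; the character-level split mentioned in the comment is the typical behaviour of Chinese BERT-style tokenizers, stated here as an assumption rather than a documented property of this model.
from transformers import BertTokenizer
# The checkpoint name matches the one used in the guide below.
tokenizer = BertTokenizer.from_pretrained('Ayaka/bart-base-cantonese')
# Chinese BERT-style tokenizers typically split text into single characters,
# so the sentence is expected to become per-character tokens with [MASK] kept intact.
print(tokenizer.tokenize('我激動到[MASK]唔着'))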
Training
- Optimizer: Stochastic Gradient Descent (SGD) with a learning rate of 0.03 and Adaptive Gradient Clipping at 0.1 (a sketch of this setup follows the list).
- Dataset: 172,937,863 sentences, padded or truncated to 64 tokens each.
- Batch Size: 640
- Epochs: 7 epochs with an additional 61,440 steps.
- Training Time: 44 hours on a Google Cloud TPU v4-16 instance.
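Adaptive Gradient Clipping is not built into PyTorch, and the exact training code is not given here, so the following is only a minimal sketch of how the listed hyperparameters (SGD at 0.03, AGC at 0.1, 64-token sequences) could fit together. The parameter-wise adaptive_gradient_clip helper, the toy sentences, and the reconstruction-style loss are illustrative assumptions, not the author's pipeline.
import torch
from transformers import BertTokenizer, BartForConditionalGeneration

tokenizer = BertTokenizer.from_pretrained('Ayaka/bart-base-cantonese')
model = BartForConditionalGeneration.from_pretrained('Ayaka/bart-base-cantonese')

def adaptive_gradient_clip(params, clip_factor=0.1, eps=1e-3):
    # Parameter-wise AGC: cap each gradient norm at clip_factor times the parameter norm.
    for p in params:
        if p.grad is None:
            continue
        w_norm = p.detach().norm().clamp(min=eps)
        g_norm = p.grad.detach().norm()
        max_norm = clip_factor * w_norm
        if g_norm > max_norm:
            p.grad.mul_(max_norm / (g_norm + 1e-6))

optimizer = torch.optim.SGD(model.parameters(), lr=0.03)

# Toy batch: sentences padded or truncated to 64 tokens, mirroring the dataset setup above.
batch = tokenizer(['聽日就要返香港', '我好開心'],
                  padding='max_length', truncation=True, max_length=64,
                  return_tensors='pt')
labels = batch['input_ids'].clone()
labels[labels == tokenizer.pad_token_id] = -100  # ignore padding positions in the loss

outputs = model(input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask'],
                labels=labels)
outputs.loss.backward()
adaptive_gradient_clip(model.parameters(), clip_factor=0.1)
optimizer.step()
optimizer.zero_grad()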
Guide: Running Locally
- Installation: Ensure you have PyTorch and the Transformers library installed.
- Import Model:
from transformers import BertTokenizer, BartForConditionalGeneration, Text2TextGenerationPipeline
tokenizer = BertTokenizer.from_pretrained('Ayaka/bart-base-cantonese')
model = BartForConditionalGeneration.from_pretrained('Ayaka/bart-base-cantonese')
- Generate Text:
text2text_generator = Text2TextGenerationPipeline(model, tokenizer)
output = text2text_generator('聽日就要返香港,我激動到[MASK]唔着', max_length=50, do_sample=False)
# The tokenizer emits space-separated characters, so strip the spaces for readable output.
print(output[0]['generated_text'].replace(' ', ''))
- Cloud GPUs: For optimal performance, consider using cloud GPUs from providers such as Google Cloud or AWS; the sketch below shows one way to place the pipeline on a GPU.
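If a GPU is available, the pipeline from the guide can be placed on it through the standard device argument of Transformers pipelines; the sketch below assumes a single-GPU machine (device index 0) and falls back to CPU otherwise.
import torch
from transformers import BertTokenizer, BartForConditionalGeneration, Text2TextGenerationPipeline

tokenizer = BertTokenizer.from_pretrained('Ayaka/bart-base-cantonese')
model = BartForConditionalGeneration.from_pretrained('Ayaka/bart-base-cantonese')

# Use the first CUDA device when present, otherwise run on CPU (device=-1).
device = 0 if torch.cuda.is_available() else -1
text2text_generator = Text2TextGenerationPipeline(model, tokenizer, device=device)

output = text2text_generator('聽日就要返香港,我激動到[MASK]唔着', max_length=50, do_sample=False)
print(output[0]['generated_text'].replace(' ', ''))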
License
No license is specified for the model. Users should avoid applying it for any purpose that might infringe copyright.