GuwenBERT Base
Introduction
GuwenBERT is a RoBERTa-based model pre-trained on Classical Chinese texts. It is suited to downstream tasks on literary Chinese such as sentence segmentation, punctuation restoration, and named entity recognition.
Architecture
The architecture follows RoBERTa, a transformer encoder pre-trained with masked language modeling, adapted here to Classical Chinese in its ancient and literary forms.
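The published checkpoint can be inspected to confirm this backbone. The snippet below is a minimal sketch using the Transformers auto classes; the commented values assume a standard base-size RoBERTa configuration and are not taken from the source above.

from transformers import AutoConfig

# Load the published configuration and check the backbone type and size.
config = AutoConfig.from_pretrained("ethanyt/guwenbert-base")
print(config.model_type)         # expected to report a RoBERTa backbone
print(config.num_hidden_layers)  # 12 layers for a base-size model (assumption)
print(config.hidden_size)        # 768 hidden units for a base-size model (assumption)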
Training
Training Data
GuwenBERT is trained on the Daizhige dataset, comprising 15,694 books in Classical Chinese, covering disciplines such as Buddhism, Confucianism, and Medicine. The dataset includes 1.7 billion characters, with traditional characters converted to simplified ones, and a vocabulary size of 23,292.
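As a quick sanity check, the vocabulary size can be read from the published tokenizer files. The snippet below is a small sketch; the simplified-character example sentence is illustrative.

from transformers import AutoTokenizer

# The model card reports a 23,292-entry vocabulary over simplified characters.
tokenizer = AutoTokenizer.from_pretrained("ethanyt/guwenbert-base")
print(len(tokenizer))  # vocabulary size, expected to be around 23,292

# Inputs should use simplified characters, matching the converted training corpus.
print(tokenizer.tokenize("天下大乱"))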
Training Procedure
The model is initialized from hfl/chinese-roberta-wwm-ext and pre-trained with a two-step strategy: first, only the word embeddings are updated during masked language modeling (MLM); then all parameters are updated. Training ran on 4 V100 GPUs for 120,000 steps, with a batch size of 2,048 and a sequence length of 512, using the Adam optimizer with learning-rate warmup and decay.
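The listing below is a minimal sketch of that two-step schedule, not the authors' training script: the optimizer hyperparameters and the corpus-loading steps are placeholders, and the real setup also swaps in the 23,292-entry Classical Chinese vocabulary.

import torch
from transformers import AutoModelForMaskedLM

# Start from the Chinese RoBERTa-wwm-ext checkpoint, as described above.
model = AutoModelForMaskedLM.from_pretrained("hfl/chinese-roberta-wwm-ext")

# Step 1: freeze everything except the input word embeddings and train with MLM.
for param in model.parameters():
    param.requires_grad = False
embeddings = model.get_input_embeddings()
embeddings.weight.requires_grad = True
optimizer = torch.optim.Adam([embeddings.weight], lr=1e-4)  # placeholder hyperparameters
# ... run MLM updates on the Classical Chinese corpus ...

# Step 2: unfreeze all parameters and continue MLM pre-training.
for param in model.parameters():
    param.requires_grad = True
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # placeholder hyperparameters
# ... continue MLM updates until the step budget is reached ...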
Guide: Running Locally
To run GuwenBERT locally, follow these steps:
- Install the Transformers library:
pip install transformers
- Load the model and tokenizer:
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ethanyt/guwenbert-base")
model = AutoModel.from_pretrained("ethanyt/guwenbert-base")
- Inference: Use the model for inference tasks, such as filling masked tokens in Classical Chinese texts, as shown in the sketch below.
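The following is a minimal fill-mask sketch; the example sentence is illustrative (the opening of 《桃花源记》 with the dynasty name masked), and reading the mask token from the tokenizer avoids hard-coding it.

from transformers import pipeline

# Build a fill-mask pipeline around the published checkpoint.
fill_mask = pipeline("fill-mask", model="ethanyt/guwenbert-base")

# Mask the first character of the sentence and ask the model to restore it.
text = fill_mask.tokenizer.mask_token + "太元中，武陵人捕鱼为业。"
for prediction in fill_mask(text):
    print(prediction["token_str"], round(prediction["score"], 4))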
For optimal performance, consider using cloud GPU services from providers like AWS, Google Cloud, or Azure.
License
GuwenBERT is released under the Apache 2.0 License, allowing for broad use in commercial and non-commercial applications.