guwenbert large
ethanyt
Introduction
GuwenBERT is a RoBERTa-based model pre-trained on Classical Chinese texts. It is designed for tasks such as sentence breaking, punctuation, and named entity recognition. The model can be fine-tuned for specific applications using additional data.
Architecture
GuwenBERT uses the RoBERTa architecture, tailored for Classical Chinese language processing. It is initialized with the hfl/chinese-roberta-wwm-ext-large model and trained further on a domain-specific dataset.
Training
Training Data
The model is trained on the Daizhige dataset, which consists of 15,694 Classical Chinese books covering subjects such as Buddhism, Confucianism, and History. The dataset contains about 1.7 billion characters, and 76% of the texts are punctuated. All traditional characters are converted to simplified characters, and the vocabulary size is 23,292.
Training Procedure
The training process involves a two-step strategy:
- The model learns masked language modeling (MLM) with only word embeddings updated until convergence.
- All parameters are updated during further training.
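The two steps above can be sketched as toggling which parameters receive gradients. The snippet below is only an illustration, not the original training code: it assumes PyTorch and the transformers library, uses a tiny randomly initialized RoBERTa (the small sizes are arbitrary, only the vocabulary size comes from this document), and the helper names set_stage and trainable_names are hypothetical.

```python
from transformers import RobertaConfig, RobertaForMaskedLM


def set_stage(model, stage):
    """Stage 1: only word embeddings are trainable. Stage 2: everything is."""
    for name, param in model.named_parameters():
        param.requires_grad = (stage == 2) or ("word_embeddings" in name)


def trainable_names(model):
    """List the names of parameters that would be updated by an optimizer."""
    return [name for name, param in model.named_parameters() if param.requires_grad]


# Tiny stand-in model so the sketch runs without downloading any weights.
# Only vocab_size is taken from the document; the other sizes are arbitrary.
config = RobertaConfig(
    vocab_size=23292,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
)
model = RobertaForMaskedLM(config)

set_stage(model, 1)
stage_one = trainable_names(model)   # word embeddings only

set_stage(model, 2)
stage_two = trainable_names(model)   # every parameter

print(len(stage_one), "trainable tensors in step 1;", len(stage_two), "in step 2")
```

In a real run, the MLM training loop would execute between the two set_stage calls, and the optimizer would be built from the parameters that currently require gradients.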
Training is conducted on 4 V100 GPUs over 120,000 steps, with a batch size of 2,048 and a sequence length of 512. The optimizer is Adam with the following configuration: a learning rate of 1e-4, adam-betas of (0.9, 0.98), adam-eps of 1e-6, weight decay of 0.01, learning rate warmup for 5,000 steps, and linear decay.
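To make the warmup and decay schedule concrete, here is a minimal sketch of the learning rate as a function of the training step. It assumes that warmup starts from zero and that the linear decay reaches zero at step 120,000; the source states neither, so treat both as assumptions. The function name learning_rate is hypothetical.

```python
PEAK_LR = 1e-4          # learning rate stated in the training setup
WARMUP_STEPS = 5_000    # linear warmup length
TOTAL_STEPS = 120_000   # total training steps


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay.

    Assumption: the rate starts at 0 and decays back to 0 at TOTAL_STEPS.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(TOTAL_STEPS - step, 0)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 2_500, 5_000, 62_500, 120_000):
    print(f"step {step:>7,}: lr = {learning_rate(step):.2e}")
```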
Eval Results
The model achieved second place in the "Gulian Cup" Ancient Books Named Entity Recognition evaluation, with a precision of 83.88%, recall of 85.39%, and an F1 score of 84.63%.
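The reported F1 score is consistent with the harmonic mean of the reported precision and recall, which can be checked with a few lines of arithmetic:

```python
precision = 83.88  # percent, as reported
recall = 85.39     # percent, as reported

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 = {f1:.2f}%")  # prints "F1 = 84.63%"
```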
Guide: Running Locally
- Installation: Ensure Python and PyTorch are installed. Install the transformers library using pip:
    pip install transformers
- Model Loading:
    from transformers import AutoTokenizer, AutoModel
    tokenizer = AutoTokenizer.from_pretrained("ethanyt/guwenbert-large")
    model = AutoModel.from_pretrained("ethanyt/guwenbert-large")
- Inference: Use the tokenizer and model for tasks like fill-mask or other language processing tasks.
- Hardware Recommendation: For efficient processing, using a cloud GPU such as NVIDIA V100 or A100 is recommended, especially for large-scale data.
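As an illustration of the fill-mask use mentioned in the guide, the sketch below wraps the transformers fill-mask pipeline in a small helper. The helper name top_predictions and the "{mask}" placeholder convention are hypothetical; the mask token is read from the tokenizer rather than hard-coded. Calling the helper downloads the checkpoint from the Hugging Face Hub, so it needs network access and enough memory for a large model.

```python
def top_predictions(text_template: str, k: int = 5):
    """Return the k most likely fillers for the masked position.

    text_template must contain exactly one "{mask}" placeholder, which is
    replaced with the tokenizer's own mask token before prediction.
    """
    # Imported lazily so that defining the helper does not load the library.
    from transformers import pipeline

    # Downloads the checkpoint on first use.
    fill = pipeline("fill-mask", model="ethanyt/guwenbert-large")
    text = text_template.format(mask=fill.tokenizer.mask_token)

    # With a single mask, the pipeline returns a list of dicts that include
    # the predicted token string and its score.
    results = fill(text, top_k=k)
    return [(r["token_str"], r["score"]) for r in results]
```

For example, top_predictions("{mask}太元中，武陵人捕鱼为业。") would return up to five candidate characters for the first position together with their scores.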
License
GuwenBERT is licensed under the Apache-2.0 License. This allows for both personal and commercial use, with proper attribution and without warranty.