GuwenBERT-base

ethanyt

Introduction

GuwenBERT is a RoBERTa-based model pre-trained on Classical Chinese texts. It is suited to downstream tasks on literary Chinese such as sentence breaking, punctuation restoration, and named entity recognition.

Architecture

GuwenBERT follows the standard RoBERTa architecture, a Transformer encoder pre-trained with masked language modeling, and is adapted to Classical Chinese, including ancient and literary forms of the language.
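
To sanity-check the architecture, the hosted configuration can be inspected directly; a minimal sketch, assuming the standard transformers AutoConfig API (the printed fields are ordinary RobertaConfig attributes, and the values are whatever the Hub configuration contains):

    from transformers import AutoConfig

    # Load the hosted configuration and report the RoBERTa settings
    config = AutoConfig.from_pretrained("ethanyt/guwenbert-base")
    print(config.model_type)  # model family
    print(config.num_hidden_layers, config.hidden_size, config.vocab_size)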

Training

Training Data

GuwenBERT is trained on the Daizhige dataset, which comprises 15,694 books of Classical Chinese covering disciplines such as Buddhism, Confucianism, and Medicine, for a total of 1.7 billion characters. All traditional characters are converted to simplified characters, and the tokenizer vocabulary contains 23,292 tokens.

Training Procedure

The model is initialized from hfl/chinese-roberta-wwm-ext and pre-trained with a two-stage strategy: in the first stage, only the word embeddings are updated during masked language modeling (MLM); in the second stage, all parameters are updated. Training ran on 4 V100 GPUs for 120,000 steps with a batch size of 2,048 and a sequence length of 512, using the Adam optimizer with learning-rate warmup followed by decay.
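
A minimal sketch of the two-stage strategy, assuming the standard transformers API (this is illustrative, not the authors' training code, and it omits the remapping of the embedding matrix to the 23,292-token Classical Chinese vocabulary):

    from transformers import AutoModelForMaskedLM

    # Initialize from the Chinese RoBERTa-wwm-ext checkpoint
    model = AutoModelForMaskedLM.from_pretrained("hfl/chinese-roberta-wwm-ext")

    # Stage 1: freeze everything except the input word embeddings, then run MLM training
    for param in model.parameters():
        param.requires_grad = False
    model.get_input_embeddings().weight.requires_grad = True
    # ... MLM training loop (e.g. with transformers.Trainer) ...

    # Stage 2: unfreeze all parameters and continue MLM pre-training
    for param in model.parameters():
        param.requires_grad = True
    # ... continue MLM training ...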

Guide: Running Locally

To run GuwenBERT locally, follow these steps:

  1. Install Transformers Library:

    pip install transformers
    
  2. Load Model and Tokenizer:

    from transformers import AutoTokenizer, AutoModel
    
    tokenizer = AutoTokenizer.from_pretrained("ethanyt/guwenbert-base")
    model = AutoModel.from_pretrained("ethanyt/guwenbert-base")
    
  3. Inference: Use the model for inference tasks, such as filling masked tokens in Classical Chinese texts (see the sketch below).
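
A minimal fill-mask sketch (the Classical Chinese example sentence is illustrative; the mask token is read from the loaded tokenizer rather than hard-coded):

    from transformers import pipeline

    # Build a fill-mask pipeline around GuwenBERT
    fill_mask = pipeline("fill-mask", model="ethanyt/guwenbert-base")

    # Mask the first character of a Classical Chinese sentence
    text = fill_mask.tokenizer.mask_token + "太元中，武陵人捕鱼为业。"
    print(fill_mask(text))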

For optimal performance, consider using cloud GPU services from providers like AWS, Google Cloud, or Azure.

License

GuwenBERT is released under the Apache 2.0 License, allowing for broad use in commercial and non-commercial applications.
