GuwenBERT-Large

ethanyt

Introduction

GuwenBERT is a RoBERTa-based model pre-trained on Classical Chinese texts. It is designed for downstream tasks such as sentence segmentation, punctuation restoration, and named entity recognition, and can be fine-tuned for specific applications with additional data.

Architecture

GuwenBERT-Large uses the RoBERTa architecture, adapted to Classical Chinese text. Its weights are initialized from the hfl/chinese-roberta-wwm-ext-large model and then further pre-trained on a domain-specific corpus.
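
To verify the architecture locally, one can read the published configuration. A minimal sketch, assuming only that the config is available on the Hugging Face Hub; the printed values are whatever the config reports, not claims made here:

    from transformers import AutoConfig

    # Read the model's configuration from the Hugging Face Hub.
    config = AutoConfig.from_pretrained("ethanyt/guwenbert-large")

    # Print the architecture family, depth, and hidden size as published.
    print(config.model_type, config.num_hidden_layers, config.hidden_size)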

Training

Training Data

The model is trained on the Daizhige dataset, which comprises 15,694 Classical Chinese books spanning subjects such as Buddhism, Confucianism, and History. The corpus contains about 1.7 billion characters, roughly 76% of which are punctuated text. All traditional characters are converted to simplified characters, and the vocabulary size is 23,292.
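
The card does not say which tool performed the traditional-to-simplified conversion; OpenCC is a common choice and appears below purely as an illustrative assumption:

    from opencc import OpenCC  # e.g. pip install opencc-python-reimplemented

    # Convert traditional characters to simplified ("t2s" profile).
    # OpenCC is an assumed tool choice, not one named by the card.
    converter = OpenCC("t2s")
    print(converter.convert("無邊落木蕭蕭下"))  # -> 无边落木萧萧下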

Training Procedure

The training process involves a two-step strategy:

  1. Masked language modeling (MLM) is trained with only the word embeddings updated, until convergence.
  2. All parameters are then unfrozen and updated during further training.

Training is conducted on 4 V100 GPUs for 120,000 steps, with a batch size of 2,048 and a sequence length of 512. The optimizer is Adam with a learning rate of 1e-4, betas of (0.9, 0.98), epsilon of 1e-6, and weight decay of 0.01; the learning rate is warmed up over the first 5,000 steps and then decays linearly.
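
A minimal PyTorch sketch of the two-step strategy, assuming the transformers API; the actual training code is not published in this card, and the embedding-resize step and the elided MLM loop are assumptions:

    from transformers import AutoModelForMaskedLM

    # Initialize from the Chinese RoBERTa checkpoint named in the card.
    model = AutoModelForMaskedLM.from_pretrained("hfl/chinese-roberta-wwm-ext-large")

    # The card reports a 23,292-token vocabulary; resizing the embedding
    # matrix is one way to accommodate it (assumption).
    model.resize_token_embeddings(23292)

    # Step 1: freeze everything except the word embeddings, then train
    # with the MLM objective until convergence.
    for param in model.parameters():
        param.requires_grad = False
    model.get_input_embeddings().weight.requires_grad = True

    # ... MLM training loop over the corpus goes here ...

    # Step 2: unfreeze all parameters and continue MLM training with the
    # Adam settings listed above.
    for param in model.parameters():
        param.requires_grad = True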

Eval Results

The model achieved second place in the "Gulian Cup" Ancient Books Named Entity Recognition evaluation, with a precision of 83.88%, recall of 85.39%, and an F1 score of 84.63%.
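
As a sanity check, the F1 score is the harmonic mean of precision and recall: F1 = 2PR / (P + R) = (2 × 0.8388 × 0.8539) / (0.8388 + 0.8539) ≈ 0.8463, matching the reported 84.63%.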

Guide: Running Locally

  1. Installation: Ensure Python and PyTorch are installed. Install the transformers library using pip:

    pip install transformers
    
  2. Model Loading:

    from transformers import AutoTokenizer, AutoModel
    
    tokenizer = AutoTokenizer.from_pretrained("ethanyt/guwenbert-large")
    model = AutoModel.from_pretrained("ethanyt/guwenbert-large")
    
  3. Inference: use the tokenizer and model for fill-mask and other language processing tasks, as in the sketch below.
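
     A minimal sketch of fill-mask inference, assuming the transformers pipeline API; the example sentence is illustrative, and the mask token is read from the model's own tokenizer rather than hardcoded:

    from transformers import pipeline

    # Build a fill-mask pipeline backed by GuwenBERT-Large.
    fill_mask = pipeline("fill-mask", model="ethanyt/guwenbert-large")

    # Read the mask token from the tokenizer instead of hardcoding it.
    mask = fill_mask.tokenizer.mask_token

    # Illustrative sentence with one character masked.
    for pred in fill_mask(f"晋太元中，武陵人捕鱼为{mask}。"):
        print(pred["token_str"], round(pred["score"], 4))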

  4. Hardware Recommendation: a cloud GPU such as an NVIDIA V100 or A100 is recommended for efficient processing, especially on large-scale data.

License

GuwenBERT is licensed under the Apache-2.0 License. This allows for both personal and commercial use, with proper attribution and without warranty.
