macbert4csc-base-chinese

by shibing624

Introduction

The macbert4csc-base-chinese model is a text2text-generation model designed for Chinese spelling correction (CSC). It is based on the MacBERT architecture and achieves state-of-the-art results on the SIGHAN2015 test set for Chinese spelling correction.

Architecture

MacBERT is an enhanced version of BERT that incorporates improvements such as:

  • Whole Word Masking (WWM): Masks entire words rather than subword units.
  • N-gram Masking: Masks contiguous sequences of tokens.
  • Sentence-Order Prediction (SOP): Replaces BERT's next-sentence prediction with a task that predicts whether two consecutive sentences appear in their original order.
  • MLM as Correction: Instead of the artificial [MASK] token, masked positions are filled with similar words, so pre-training resembles error correction and the mismatch between pre-training (which sees [MASK]) and fine-tuning (which never does) is reduced.

The model's neural architecture remains compatible with BERT, allowing MacBERT to replace BERT in applications without architectural changes.
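The "MLM as correction" idea above can be sketched with a toy masking routine. This is an illustration only: the similar-word table is invented, and real MacBERT pre-training masks around 15% of n-grams chosen at random, using a word-similarity toolkit to pick replacements.

```python
# Invented similar-word table, for illustration only.
SIMILAR = {"watch": "see", "good": "fine", "movie": "film"}

def mlm_as_correction(tokens):
    """Corrupt tokens with similar words instead of [MASK].

    The label at each corrupted position is the original token,
    so the model learns to 'correct' the substituted word.
    """
    corrupted = [SIMILAR.get(t, t) for t in tokens]
    labels = [t if t in SIMILAR else None for t in tokens]  # None = ignored
    return corrupted, labels

print(mlm_as_correction(["watch", "a", "good", "movie"]))
```

The key point is that the corrupted input contains plausible wrong words rather than [MASK] placeholders, which is exactly the shape of the downstream spelling-correction task.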

Training

The model was fine-tuned on the SIGHAN+Wang271K Chinese spelling correction dataset, which combines the SIGHAN13, 14, and 15 datasets with additional data generated by dimmywang. Because MacBERT's pre-training already resembles error correction, the pre-trained checkpoint transfers well to the spelling-correction objective.
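A CSC training example pairs a misspelled sentence with its correction, and character-level labels mark which positions differ. The snippet below is a hand-made illustration of this structure, not an actual record from the dataset:

```python
def detection_labels(src, tgt):
    """Character-level labels: 1 where the source character is wrong."""
    assert len(src) == len(tgt), "CSC pairs are equal-length by construction"
    return [int(s != t) for s, t in zip(src, tgt)]

src = "今天新情很好"  # misspelled: 新 should be 心
tgt = "今天心情很好"  # reference correction
print(detection_labels(src, tgt))  # -> [0, 0, 1, 0, 0, 0]
```

Note that CSC assumes equal-length source and target sentences (substitution errors only), which is why a simple per-character comparison suffices.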

Guide: Running Locally

  1. Installation: Ensure you have Python and PyTorch installed. Install the necessary libraries:

    pip install transformers
    pip install pycorrector
    
  2. Model Setup: Use the pycorrector library to load the model:

    from pycorrector.macbert.macbert_corrector import MacBertCorrector
    
    m = MacBertCorrector("shibing624/macbert4csc-base-chinese")
    # Correct a sentence with a typo (新情 should be 心情);
    # the result contains the corrected text and error details.
    corrected_text = m.correct("今天新情很好")
    print(corrected_text)
    
  3. Using Transformers: Alternatively, load the model directly with transformers. The fine-tuned model predicts a corrected token at every position, so a single forward pass followed by an argmax yields the corrected sentence:

    import torch
    from transformers import BertTokenizer, BertForMaskedLM
    
    tokenizer = BertTokenizer.from_pretrained("shibing624/macbert4csc-base-chinese")
    model = BertForMaskedLM.from_pretrained("shibing624/macbert4csc-base-chinese")
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    
    inputs = tokenizer("今天新情很好", return_tensors="pt").to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Most likely token at each position; skip [CLS]/[SEP] when decoding
    pred_ids = logits.argmax(dim=-1)[0]
    print(tokenizer.decode(pred_ids, skip_special_tokens=True).replace(" ", ""))
    
  4. Cloud GPUs: For faster inference, consider using cloud services like AWS, Google Cloud, or Azure that provide GPU support.

License

The macbert4csc-base-chinese model is licensed under the Apache-2.0 License, allowing free use, distribution, and modification.
