macbert4csc-base-chinese
shibing624
Introduction
The MACBERT4CSC-BASE-CHINESE model is a text-to-text generation model designed for Chinese spelling correction. It is based on the MacBERT architecture and achieves state-of-the-art results on the SIGHAN2015 test data for Chinese spelling correction tasks.
Architecture
MacBERT is an enhanced version of BERT that incorporates improvements such as:
- Whole Word Masking (WWM): Masks entire words rather than subword units.
- N-gram Masking: Masks contiguous sequences of tokens.
- Sentence-Order Prediction (SOP): A pre-training task predicting the order of sentences.
- MLM as Correction (Mac): Replaces the selected tokens with similar words instead of the [MASK] token, and trains the model to recover the original tokens.
The model's neural architecture remains compatible with BERT, allowing MacBERT to replace BERT in applications without architectural changes.
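The difference between MLM as Correction and ordinary [MASK]-based masking can be illustrated with a toy sketch. The SIMILAR table and the mac_mask function below are hypothetical names made up for this example, and the selection logic is deliberately simplified (one probability per word, no n-gram spans, no random or unchanged replacements), so this is not MacBERT's actual pre-training code.

```python
import random

# Hypothetical similar-word table. MacBERT obtains similar words from a
# synonym resource; this tiny dictionary only imitates that idea.
SIMILAR = {"喜欢": "喜爱", "工作": "职业"}


def mac_mask(words, similar, mask_prob=0.15, rng=None):
    """Corrupt a word sequence in the spirit of MLM as Correction.

    Returns (corrupted, labels). A selected word is replaced by a similar
    word rather than by "[MASK]"; labels holds the original word at each
    replaced position and None everywhere else.
    """
    rng = rng or random.Random()
    corrupted, labels = [], []
    for word in words:
        if word in similar and rng.random() < mask_prob:
            corrupted.append(similar[word])  # similar word, not "[MASK]"
            labels.append(word)              # the model must restore this
        else:
            corrupted.append(word)
            labels.append(None)              # position is not predicted
    return corrupted, labels


# mask_prob=1.0 forces every word found in the table to be replaced.
corrupted, labels = mac_mask(["我", "喜欢", "这份", "工作"], SIMILAR, mask_prob=1.0)
print(corrupted)  # ['我', '喜爱', '这份', '职业']
print(labels)     # [None, '喜欢', None, '工作']
```

Because the corrupted input contains real words rather than [MASK] symbols, the pre-training input looks like the text seen at fine-tuning time, and the objective already resembles correcting a wrong word in place.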
Training
The model was trained on the SIGHAN+Wang271K Chinese spelling correction dataset, which combines the SIGHAN13, 14, and 15 datasets with additional data generated by dimmywang. Fine-tuning starts from MacBERT's pre-trained weights, so the model carries over what MacBERT learned from its pre-training tasks.
Guide: Running Locally
- Installation: Ensure you have Python and PyTorch installed. Install the necessary libraries:

      pip install transformers
      pip install pycorrector
- Model Setup: Use the pycorrector library to load the model:

      from pycorrector.macbert.macbert_corrector import MacBertCorrector

      # Load the fine-tuned checkpoint (downloaded on first use).
      m = MacBertCorrector("shibing624/macbert4csc-base-chinese")
      # "新情" is a misspelling of "心情" (mood).
      corrected_text = m.correct("今天新情很好")
      print(corrected_text)
- Using Transformers: Alternatively, load the model using transformers:

      import torch
      from transformers import BertTokenizer, BertForMaskedLM

      # Use a GPU when one is available, otherwise fall back to the CPU.
      device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

      tokenizer = BertTokenizer.from_pretrained("shibing624/macbert4csc-base-chinese")
      model = BertForMaskedLM.from_pretrained("shibing624/macbert4csc-base-chinese")
      model.to(device)
- Cloud GPUs: For faster inference, consider using cloud services like AWS, Google Cloud, or Azure that provide GPU support.
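The Transformers step above only loads the model. One way to turn it into corrections, sketched here under some assumptions, is to take the highest-scoring token at every position and compare the result with the input. The function names correct_texts and get_errors are hypothetical helpers introduced for this example. The sketch also assumes the input is plain Chinese text in which each character maps to one token; English words or digits can split into word pieces and break the character alignment.

```python
def get_errors(original, corrected):
    """List (wrong_char, right_char, position) for each differing character.

    Assumes substitution-only correction, so both strings are compared
    position by position; any extra trailing characters are ignored.
    """
    return [
        (o, c, i)
        for i, (o, c) in enumerate(zip(original, corrected))
        if o != c
    ]


def correct_texts(texts, model_name="shibing624/macbert4csc-base-chinese"):
    """Sketch of masked-LM inference: keep the arg-max token per position."""
    # Imported here so that get_errors stays usable without torch installed.
    import torch
    from transformers import BertTokenizer, BertForMaskedLM

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertForMaskedLM.from_pretrained(model_name).to(device).eval()

    inputs = tokenizer(texts, padding=True, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = model(**inputs).logits  # (batch, seq_len, vocab_size)

    results = []
    for text, ids in zip(texts, logits.argmax(dim=-1)):
        decoded = tokenizer.decode(ids, skip_special_tokens=True).replace(" ", "")
        corrected = decoded[:len(text)]  # drop predictions at padded positions
        results.append((corrected, get_errors(text, corrected)))
    return results


# The diff helper works on any pair of strings, no model required.
print(get_errors("今天新情很好", "今天心情很好"))  # [('新', '心', 2)]
```

Calling correct_texts(["今天新情很好"]) downloads the checkpoint on first use and returns, for each input, the corrected string together with its list of edits. Loading the model inside the function keeps the sketch short; for repeated calls it would be better to load the tokenizer and model once and reuse them.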
License
The MACBERT4CSC-BASE-CHINESE model is licensed under the Apache-2.0 License, allowing free use, distribution, and modification.