roberta base word chinese cluecorpussmall
uer
Introduction
The Chinese word-based RoBERTa models are a series of language models pre-trained with UER-py; they can also be pre-trained with TencentPretrain. Word-based models produce shorter input sequences than character-based models, which makes them faster, and they are reported to give better results. The models are released so that the results can be reproduced with publicly available datasets and tools.
Architecture
The models use the RoBERTa architecture and come in five sizes: Tiny, Mini, Small, Medium, and Base. They are pre-trained on the CLUECorpusSmall dataset, with Google's SentencePiece used for word segmentation. TencentPretrain, which builds on UER-py, supports models with more than one billion parameters and extends the toolkit to a multimodal pre-training framework.
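As an illustration of how the five sizes might map to checkpoints, the sketch below builds a Hugging Face repository ID from a size name. The naming pattern is extrapolated from the Medium identifier used in the guide and from the title of this card; the Tiny, Mini, and Small IDs and the `model_id` helper itself are assumptions, so check them on the Hub before relying on them.

```python
# Sizes named in the text. The repository naming pattern is an assumption
# extrapolated from the Medium identifier shown in the guide.
SIZES = ("tiny", "mini", "small", "medium", "base")


def model_id(size: str) -> str:
    """Return the assumed Hub repository ID for a given model size."""
    key = size.lower()
    if key not in SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {SIZES}")
    return f"uer/roberta-{key}-word-chinese-cluecorpussmall"


if __name__ == "__main__":
    for size in SIZES:
        print(model_id(size))
```

Keeping the size as a parameter makes it easy to swap a smaller checkpoint in for quick experiments and a larger one for final runs.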
Training
Pre-training runs in two stages on the CLUECorpusSmall dataset, using UER-py on Tencent Cloud. The first stage trains for 1,000,000 steps with a sequence length of 128; the second stage trains for 250,000 steps with a sequence length of 512. The corpus is segmented into words with SentencePiece before training.
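The two-stage schedule can be written down as data, which makes the step counts easy to check. This is a minimal sketch and not UER-py code: the `Stage` class and both helpers are hypothetical, and treating the two stages as one continuous step count is a simplification for illustration (the stages may well be launched as separate runs).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    seq_length: int  # maximum number of tokens per training sequence
    steps: int       # number of training steps in this stage


# Values taken from the description above.
SCHEDULE = (
    Stage(seq_length=128, steps=1_000_000),
    Stage(seq_length=512, steps=250_000),
)


def total_steps(schedule) -> int:
    """Sum the training steps over all stages."""
    return sum(stage.steps for stage in schedule)


def stage_for_step(schedule, step: int) -> Stage:
    """Return the stage that a 0-based global step falls into."""
    if step < 0:
        raise ValueError("step must be non-negative")
    boundary = 0
    for stage in schedule:
        boundary += stage.steps
        if step < boundary:
            return stage
    raise ValueError("step is beyond the end of the schedule")


if __name__ == "__main__":
    print(total_steps(SCHEDULE))
    print(stage_for_step(SCHEDULE, 1_000_000).seq_length)
```

Under this simplification, one fifth of all steps (250,000 of 1,250,000) are spent on the longer 512-token sequences.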
Guide: Running Locally
- Install Dependencies: Ensure you have Python and the transformers library installed.
- Load the Model: Use the following code to load the model for masked language modeling:

    from transformers import pipeline
    unmasker = pipeline('fill-mask', model='uer/roberta-medium-word-chinese-cluecorpussmall')

- Run Inference: Call the unmasker on a sentence that contains the [MASK] token to get the predicted words.
- Training: To pre-train a model yourself, follow the UER-py pre-training procedure.
- Convert Model: Convert the model into Hugging Face's format after pre-training.
For efficient training and inference, consider using cloud GPUs such as those available on AWS, Google Cloud, or Azure.
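The load and inference steps of the guide can be sketched as follows. The `pipeline('fill-mask', ...)` call and the model ID come from the guide; the helper names, the example sentence, and the output formatting are illustrative assumptions. The mask string is read from the tokenizer rather than hard-coded.

```python
MODEL_ID = "uer/roberta-medium-word-chinese-cluecorpussmall"


def load_unmasker(model_id: str = MODEL_ID):
    """Build a fill-mask pipeline; the model is downloaded on first use."""
    # Imported here so the formatting helper below works without transformers.
    from transformers import pipeline
    return pipeline("fill-mask", model=model_id)


def format_predictions(predictions, top_k: int = 3):
    """Turn fill-mask results into 'token (score)' strings, best first."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [f"{p['token_str']} ({p['score']:.3f})" for p in ranked[:top_k]]


def demo():
    """Run one prediction; requires transformers and network access."""
    unmasker = load_unmasker()
    mask = unmasker.tokenizer.mask_token  # avoids hard-coding the mask string
    # Illustrative sentence: "<mask> 's capital is Beijing."
    results = unmasker(f"{mask}的首都是北京。")
    for line in format_predictions(results):
        print(line)
```

Calling `demo()` downloads the checkpoint and prints the highest-scoring candidates for the masked word. `format_predictions` only relies on the `score` and `token_str` keys that the fill-mask pipeline returns for a single-mask input.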
License
The models and code are provided under the Apache License 2.0. Users are encouraged to review the license for compliance and usage rights.