roberta-base-wwm-chinese-cluecorpussmall

Introduction

The Chinese Whole Word Masking RoBERTa models are a set of six pre-trained language models released through UER-py. They are designed for Chinese language tasks and use whole word masking, in which all sub-word tokens of a segmented Chinese word are masked together, to improve masked language modeling. The six models range in size from Tiny to Large, and they can also be pre-trained with TencentPretrain, a framework that extends UER-py to support models with over one billion parameters.

Architecture

The models belong to the RoBERTa family and apply whole word masking tailored to Chinese text. The six configurations range from Tiny (2 layers, hidden size 128) up to Large (24 layers, hidden size 1024). All sizes share the standard BERT-style encoder architecture, so the same tokenizer, training recipe, and downstream usage apply across configurations.
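
As a quick sanity check, the layer count and hidden size of a given checkpoint can be read directly from its published configuration. Below is a minimal sketch using the Tiny checkpoint referenced in the guide further down; other sizes follow the same pattern:

    from transformers import AutoConfig

    # Only the configuration file is downloaded, not the model weights
    config = AutoConfig.from_pretrained('uer/roberta-tiny-wwm-chinese-cluecorpussmall')

    # Expected output for the Tiny model: 2 layers, hidden size 128
    print(config.num_hidden_layers, config.hidden_size)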

Training

Training was conducted using the CLUECorpusSmall dataset through UER-py on Tencent Cloud. The procedure involved two stages:

  • Stage 1: Pre-training with a sequence length of 128 for 1,000,000 steps.
  • Stage 2: Continued pre-training with a sequence length of 512 for an additional 250,000 steps.

The same hyperparameters were used across all model sizes, and the jieba tool was used for Chinese word segmentation to determine the word boundaries for whole word masking (see the sketch below).
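
The sketch below illustrates how jieba segmentation defines the word boundaries used for whole word masking; the sentence is only an illustrative example, and the actual masking is applied inside the pre-training pipeline.

    import jieba  # word segmentation tool used to define whole-word boundaries

    sentence = "北京是中国的首都。"
    words = jieba.lcut(sentence)
    # Typical segmentation: ['北京', '是', '中国', '的', '首都', '。']
    # Under whole word masking, all sub-word tokens belonging to one
    # segmented word (e.g. both characters of '北京') are masked together.
    print(words)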

Guide: Running Locally

  1. Installation: Install the transformers library from Hugging Face.

    pip install transformers
    
  2. Usage: Load the model using the pipeline function for masked language modeling.

    from transformers import pipeline
    unmasker = pipeline('fill-mask', model='uer/roberta-tiny-wwm-chinese-cluecorpussmall')
    # Top candidates for [MASK]; the expected completion here is "中" (北京是中国的首都。)
    unmasker("北京是[MASK]国的首都。")
    
  3. PyTorch & TensorFlow: For feature extraction, load the model with BertTokenizer and BertModel in PyTorch, or BertTokenizer and TFBertModel in TensorFlow (see the sketch after this list).

  4. Cloud GPUs: For efficient model training and inference, consider using cloud services that offer GPU support, such as AWS, Google Cloud, or Azure.
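
A minimal PyTorch sketch for step 3, assuming the Base checkpoint follows the same naming pattern as the Tiny model used above; for TensorFlow, swap in TFBertModel and return_tensors='tf'.

    from transformers import BertTokenizer, BertModel

    # Assumed checkpoint name; substitute the model size you intend to use
    model_name = 'uer/roberta-base-wwm-chinese-cluecorpussmall'
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertModel.from_pretrained(model_name)

    text = "北京是中国的首都。"
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)  # last_hidden_state holds the token-level features
    print(output.last_hidden_state.shape)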

License

The models and the associated code are released through the UER-py project, with additional support from TencentPretrain. Refer to the respective repositories for the specific licensing terms.
