GPT2-Chinese-CLUECorpusSmall

Introduction

GPT2-Chinese-CLUECorpusSmall is a family of Chinese text generation models pre-trained with UER-py on the CLUECorpusSmall dataset. It is released in several configurations: GPT2-distil, GPT2, GPT2-medium, GPT2-large, and GPT2-xlarge. The pre-trained weights can be downloaded from the Hugging Face Hub.

Architecture

The architecture is based on the GPT2 model, with different configurations to suit various computational needs:

  • GPT2-distil: 6 layers, hidden size 768.
  • GPT2: 12 layers, hidden size 768.
  • GPT2-medium: 24 layers, hidden size 1024.
  • GPT2-large: 36 layers, hidden size 1280.
  • GPT2-xlarge: 48 layers, hidden size 1600.

The GPT2-xlarge model is pre-trained with TencentPretrain, a framework that extends UER-py to support models with over one billion parameters as well as multimodal pre-training.
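
As a rough illustration of how these configurations map onto Transformers' GPT2Config, the sketch below sets only the layer counts and hidden sizes listed above; the head counts are assumed to follow the standard GPT-2 sizes, and vocab_size is left at the GPT2Config default, which differs from the Chinese vocabulary these models actually use.

    from transformers import GPT2Config

    # Illustrative mapping only: n_layer and n_embd come from the list above;
    # n_head values assume the standard GPT-2 sizes and are not taken from the
    # released configuration files.
    configs = {
        "gpt2-distil": GPT2Config(n_layer=6,  n_embd=768,  n_head=12),
        "gpt2":        GPT2Config(n_layer=12, n_embd=768,  n_head=12),
        "gpt2-medium": GPT2Config(n_layer=24, n_embd=1024, n_head=16),
        "gpt2-large":  GPT2Config(n_layer=36, n_embd=1280, n_head=20),
        "gpt2-xlarge": GPT2Config(n_layer=48, n_embd=1600, n_head=25),
    }

    print(configs["gpt2-medium"])  # 24 layers, hidden size 1024, 16 attention heads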

Training

The training data used is the CLUECorpusSmall dataset. Training involves two stages:

  1. Stage 1: Pre-train with a sequence length of 128 for 1,000,000 steps.
  2. Stage 2: Continue pre-training with a sequence length of 1024 for 250,000 steps.

For GPT2-xlarge, DeepSpeed is used for efficient training and checkpoint management. The pre-trained weights are then converted into Hugging Face's format so they can be used with Transformers.
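
Because the second stage extends the sequence length to 1024, the converted checkpoints are expected to expose a 1024-token context window. A quick sanity check, assuming the gpt2-distil checkpoint used in the guide below:

    from transformers import GPT2LMHeadModel

    model = GPT2LMHeadModel.from_pretrained("uer/gpt2-distil-chinese-cluecorpussmall")
    # n_positions is the maximum sequence length covered by the position embeddings.
    print(model.config.n_positions)  # expected: 1024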

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install Transformers and PyTorch:

    pip install transformers torch
    
  2. Load the Model and Tokenizer:

    from transformers import BertTokenizer, GPT2LMHeadModel, TextGenerationPipeline
    
    # These models use a Chinese BERT-style vocabulary, so the tokenizer is a BertTokenizer rather than a GPT2Tokenizer.
    tokenizer = BertTokenizer.from_pretrained("uer/gpt2-distil-chinese-cluecorpussmall")
    model = GPT2LMHeadModel.from_pretrained("uer/gpt2-distil-chinese-cluecorpussmall")
    text_generator = TextGenerationPipeline(model, tokenizer)
    
  3. Generate Text (a sketch with additional sampling options follows these steps):

    result = text_generator("这是很久之前的事情了", max_length=100, do_sample=True)
    print(result)
    

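As a rough sketch of how the output can be tuned, the same pipeline call accepts the standard generation keyword arguments (top_k, top_p, temperature, and so on); the values below are illustrative, not recommendations from the model card:

    result = text_generator(
        "这是很久之前的事情了",
        max_length=100,
        do_sample=True,
        top_k=50,          # sample only from the 50 most likely next tokens
        top_p=0.95,        # nucleus sampling
        temperature=1.0,   # <1.0 sharpens, >1.0 flattens the distribution
        repetition_penalty=1.2,
    )
    print(result[0]["generated_text"])
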
For optimal performance, consider using cloud GPUs such as those available via AWS or Google Cloud.
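
On a machine with a CUDA-capable GPU, whether local or in the cloud, the pipeline can be pinned to the GPU explicitly. A minimal sketch, assuming the same checkpoint as above:

    import torch
    from transformers import BertTokenizer, GPT2LMHeadModel, TextGenerationPipeline

    # -1 keeps the pipeline on the CPU; 0 selects the first CUDA device.
    device = 0 if torch.cuda.is_available() else -1

    tokenizer = BertTokenizer.from_pretrained("uer/gpt2-distil-chinese-cluecorpussmall")
    model = GPT2LMHeadModel.from_pretrained("uer/gpt2-distil-chinese-cluecorpussmall")
    text_generator = TextGenerationPipeline(model, tokenizer, device=device)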

License

The model is open source and available under the Apache 2.0 License, which permits both personal and commercial use with attribution.
