CPM-Generate

TsinghuaAI

Introduction

CPM-Generate is a Chinese Pre-trained Language Model developed by TsinghuaAI, based on the Transformer architecture. It targets Chinese NLP tasks such as conversation, essay generation, cloze tests, and language understanding. The largest CPM model has 2.6 billion parameters and was trained on 100GB of Chinese data, which made it one of the largest Chinese pre-trained language models available at the time of its release.

Architecture

CPM-Generate is an autoregressive language model built on a Transformer architecture. It comes in several sizes, including CPM-Small, CPM-Medium, and CPM-Large, which differ in parameter count and number of layers; performance improves as model size increases. The architecture uses dense attention with a maximum sequence length of 1,024 tokens, and future versions aim to incorporate sparse attention.
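
Because the context window is limited to 1,024 tokens, a long prompt has to leave room for the tokens that will be generated. The following is a minimal sketch of one way to budget for that, not part of the CPM codebase: the helper name fit_prompt and the choice to keep the most recent tokens are assumptions made for illustration.

```python
MAX_SEQ_LEN = 1024  # maximum sequence length stated for CPM-Generate


def fit_prompt(token_ids, max_new_tokens, max_seq_len=MAX_SEQ_LEN):
    """Trim a prompt so that prompt plus generated tokens fit in the window.

    Hypothetical helper: it keeps the most recent tokens, since an
    autoregressive model conditions on the left context nearest the
    position being generated.
    """
    if not 0 < max_new_tokens < max_seq_len:
        raise ValueError("max_new_tokens must be between 1 and max_seq_len - 1")
    budget = max_seq_len - max_new_tokens  # tokens available for the prompt
    return list(token_ids[-budget:])


# A 2,000-token prompt with room for 24 new tokens keeps the last 1,000 ids.
trimmed = fit_prompt(list(range(2000)), max_new_tokens=24)
print(len(trimmed), trimmed[0])
```

Prompts shorter than the budget pass through unchanged.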

Training

The model is trained on diverse text sources, including encyclopedia entries, web pages, stories, news articles, and dialogues. Training uses the Adam optimizer with a learning rate of 1.5×10⁻⁴ and a batch size of 3,072. The model is trained for 20,000 steps, the first 5,000 of which serve as a warm-up period. Training took two weeks on 64 NVIDIA V100 GPUs. Evaluation shows that larger models perform better across tasks like text classification, Chinese idiom cloze tests, and short text conversation generation.
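
The warm-up figures above can be turned into a simple schedule. This is only a sketch under stated assumptions: the text gives the peak learning rate, the step count, and the warm-up length, but not the shape of the warm-up or what happens afterwards, so the linear ramp and the constant rate after warm-up are assumptions, and the function name learning_rate is hypothetical.

```python
PEAK_LR = 1.5e-4       # learning rate stated above
WARMUP_STEPS = 5_000   # warm-up period stated above
TOTAL_STEPS = 20_000   # total training steps stated above


def learning_rate(step):
    """Return the learning rate for a zero-based training step.

    Assumption: linear warm-up to PEAK_LR, then a constant rate.
    The source does not specify a decay schedule.
    """
    if not 0 <= step < TOTAL_STEPS:
        raise ValueError("step is outside the training run")
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    return PEAK_LR


for step in (0, 2_499, 4_999, 19_999):
    print(step, learning_rate(step))
```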

Guide: Running Locally

To run CPM-Generate locally, follow these steps:

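  1. Install Dependencies: Ensure you have Python installed, then install transformers together with a PyTorch backend. The CPM tokenizer in transformers also relies on the sentencepiece and jieba packages.

    pip install transformers torch sentencepiece jieba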
    
  2. Load the Model and Tokenizer:

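    from transformers import AutoModelForCausalLM, AutoTokenizer, TextGenerationPipeline
    
    # AutoModelForCausalLM replaces the deprecated AutoModelWithLMHead class.
    tokenizer = AutoTokenizer.from_pretrained("TsinghuaAI/CPM-Generate")
    model = AutoModelForCausalLM.from_pretrained("TsinghuaAI/CPM-Generate")
    
    # Wrap the model and tokenizer in a text-generation pipeline.
    text_generator = TextGenerationPipeline(model=model, tokenizer=tokenizer)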
    
  3. Generate Text:

    text_generator('清华大学', max_length=50, do_sample=True, top_p=0.9)
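
The do_sample=True and top_p=0.9 arguments in step 3 request nucleus (top-p) sampling: at each step, only the most probable tokens whose cumulative probability reaches 0.9 remain candidates. The sketch below illustrates the idea on a toy distribution. It is not the transformers implementation, and the names top_p_filter and sample_token, as well as the toy probabilities, are made up for illustration.

```python
import random


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of top-ranked tokens whose mass reaches top_p.

    `probs` maps token -> probability. The kept probabilities are
    renormalized so that they sum to 1.
    """
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def sample_token(probs, top_p=0.9, rng=random):
    """Draw one token from the renormalized nucleus."""
    nucleus = top_p_filter(probs, top_p)
    return rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]


# Toy next-token distribution (illustrative values only).
toy = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
print(sorted(top_p_filter(toy)))  # "d" falls outside the 0.9 nucleus
print(sample_token(toy))
```

A lower top_p narrows the candidate set and makes output more predictable; a value of 1.0 samples from the full distribution.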
    

For enhanced performance, especially when handling large models, consider using cloud GPU services such as AWS, Google Cloud, or Azure.
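
To judge how much GPU memory is needed, a rough estimate of the weight footprint helps. The sketch below multiplies the 2.6 billion parameter count given in the introduction by the bytes per parameter. It covers model weights only; activations, framework overhead, and any generation cache are deliberately left out, so real usage will be higher. The helper name weight_memory_gb is hypothetical.

```python
NUM_PARAMS = 2.6e9  # parameter count stated for CPM-Generate


def weight_memory_gb(num_params, bytes_per_param):
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# 4 bytes per parameter for 32-bit floats, 2 bytes for 16-bit floats.
for label, width in (("float32", 4), ("float16", 2)):
    print(f"{label}: about {weight_memory_gb(NUM_PARAMS, width):.1f} GB")
```

Under these assumptions the weights alone take roughly 10.4 GB in 32-bit precision and 5.2 GB in 16-bit precision.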

License

CPM-Generate is licensed under the MIT License, allowing for broad use and modification while providing minimal restrictions on reuse.
