PEGASUS Base Chinese CLUECorpusSmall

Introduction

The PEGASUS-Base-Chinese-CLUECorpusSmall model is a Chinese text-to-text generation model pre-trained on the CLUECorpusSmall corpus. It was pre-trained with the UER-py toolkit; the same model can also be pre-trained with TencentPretrain, which inherits UER-py and extends it to support models with over one billion parameters and multimodal pre-training.

Architecture

The model comes in two sizes: PEGASUS-Base with 12 layers and a hidden size of 768, and PEGASUS-Large with 16 layers and a hidden size of 1024. It targets text2text generation tasks and is implemented on top of the Hugging Face transformers library, with PyTorch or TensorFlow as the backend.
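
As a quick sanity check of the base variant's dimensions, the published configuration can be inspected without downloading the weights. This is a minimal sketch that assumes the checkpoint follows the standard transformers PegasusConfig field names (d_model, encoder_layers):

    from transformers import AutoConfig

    # Load only the published configuration, not the model weights.
    config = AutoConfig.from_pretrained("uer/pegasus-base-chinese-cluecorpussmall")

    # For the base variant these should report 768 and 12 respectively.
    print(config.d_model)         # hidden size
    print(config.encoder_layers)  # number of encoder layers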

Training

The training data is the CLUECorpusSmall corpus. The model was pre-trained for 1,000,000 steps with a sequence length of 512 on Tencent Cloud. The workflow has three stages: preprocessing the corpus, pre-training the model with UER-py, and converting the pre-trained checkpoint into Hugging Face's format for deployment.
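
PEGASUS is pre-trained with gap-sentence generation (GSG): whole sentences are removed from the input and the model learns to generate them, which is why the [MASK] token in the usage example below stands in for a full sentence. The sketch that follows only illustrates the idea with a hypothetical helper; UER-py's actual preprocessing (including its sentence-selection strategy) differs in detail:

    import random

    def make_gsg_example(text, mask_token="[MASK]", seed=None):
        """Toy gap-sentence-generation example: replace one sentence with the
        mask token and use that sentence as the generation target."""
        rng = random.Random(seed)
        # Naive split on the Chinese full stop; real preprocessing tokenizes more carefully.
        sentences = [s + "。" for s in text.split("。") if s]
        target = rng.choice(sentences)
        source = "".join(mask_token if s == target else s for s in sentences)
        return source, target

    source, target = make_gsg_example("内容丰富。图片华丽。印制精美。")
    print(source)  # the chosen sentence is replaced by [MASK]
    print(target)  # the sentence the model must generate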

Guide: Running Locally

  1. Install dependencies: ensure the transformers library and a PyTorch backend are installed (e.g. pip install transformers torch).
  2. Load the model and build a generation pipeline:
    from transformers import BertTokenizer, PegasusForConditionalGeneration, Text2TextGenerationPipeline

    # The checkpoint uses a Chinese BERT vocabulary, so the tokenizer is a BertTokenizer.
    tokenizer = BertTokenizer.from_pretrained("uer/pegasus-base-chinese-cluecorpussmall")
    model = PegasusForConditionalGeneration.from_pretrained("uer/pegasus-base-chinese-cluecorpussmall")
    text2text_generator = Text2TextGenerationPipeline(model, tokenizer)

  3. Generate text: use the pipeline to generate the sentence hidden behind the [MASK] token; the call returns a list of dicts with a generated_text field.
    result = text2text_generator("内容丰富、版式设计考究、图片华丽、印制精美。[MASK]纸箱内还放了充气袋用于保护。", max_length=50, do_sample=False)
    print(result)  # e.g. [{'generated_text': '...'}]

  4. Cloud GPUs: for faster inference, consider cloud GPUs from providers such as AWS, Azure, or Google Cloud; the pipeline can be moved onto a GPU as shown in the sketch after this list.
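
If a CUDA device is available, the pipeline can be placed on it through the standard transformers device argument; a minimal sketch, reusing the model and tokenizer loaded above:

    import torch

    # device=0 selects the first CUDA GPU; -1 (the default) keeps the pipeline on the CPU.
    device = 0 if torch.cuda.is_available() else -1
    text2text_generator = Text2TextGenerationPipeline(model, tokenizer, device=device)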

License

The model and its associated toolkits are open source and free to use and modify. For detailed licensing terms, refer to the respective GitHub repositories of UER-py and TencentPretrain.
