pegasus-base-chinese-cluecorpussmall
Introduction
The PEGASUS-BASE-CHINESE-CLUECORPUSSMALL model is a Chinese text-to-text generation model pre-trained on the CLUECorpusSmall dataset. It was pre-trained with the UER-py toolkit; the same model can also be pre-trained with TencentPretrain, which extends UER-py to large-parameter models and multimodal pre-training.
Architecture
The model comes in a base version (PEGASUS-Base) with 12 layers and a hidden size of 768, and a large version (PEGASUS-Large) with 16 layers and a hidden size of 1024. It targets text-to-text generation and is used through the Transformers library with PyTorch or TensorFlow.
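As a rough illustration of these two sizes, the sketch below builds Transformers PegasusConfig objects with the stated layer counts and hidden sizes. Only the layer counts and hidden sizes come from this card; the attention-head counts and feed-forward dimensions are illustrative assumptions, so check the released config.json files for the actual values.

```python
from transformers import PegasusConfig

# PEGASUS-Base: 12 layers, hidden size 768 (from the card).
# Head counts and FFN sizes below are assumptions, not the released values.
base_config = PegasusConfig(
    encoder_layers=12,
    decoder_layers=12,
    d_model=768,
    encoder_attention_heads=12,
    decoder_attention_heads=12,
    encoder_ffn_dim=3072,
    decoder_ffn_dim=3072,
)

# PEGASUS-Large: 16 layers, hidden size 1024 (from the card).
large_config = PegasusConfig(
    encoder_layers=16,
    decoder_layers=16,
    d_model=1024,
    encoder_attention_heads=16,
    decoder_attention_heads=16,
    encoder_ffn_dim=4096,
    decoder_ffn_dim=4096,
)

print(base_config.d_model, large_config.d_model)
```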
Training
The training data consists of the CLUECorpusSmall dataset. The model was pre-trained for 1,000,000 steps using a sequence length of 512 on Tencent Cloud. The training process involves preprocessing the corpus, pre-training the model, and converting the pre-trained model into Hugging Face's format for deployment.
Guide: Running Locally
- Install dependencies: Ensure the transformers library is installed (pip install transformers).
- Load the model:

```python
from transformers import BertTokenizer, PegasusForConditionalGeneration, Text2TextGenerationPipeline

tokenizer = BertTokenizer.from_pretrained("uer/pegasus-base-chinese-cluecorpussmall")
model = PegasusForConditionalGeneration.from_pretrained("uer/pegasus-base-chinese-cluecorpussmall")
text2text_generator = Text2TextGenerationPipeline(model, tokenizer)
```

- Generate text: Use the pipeline to generate text for an input sequence containing a [MASK] gap (a pipeline-free variant is sketched after this list).

```python
text2text_generator("内容丰富、版式设计考究、图片华丽、印制精美。[MASK]纸箱内还放了充气袋用于保护。", max_length=50, do_sample=False)
```
- Cloud GPUs: For faster inference, consider running the model on a cloud GPU from a provider such as AWS, Azure, or Google Cloud.
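For reference, here is a minimal sketch of the same generation without the pipeline wrapper, calling model.generate directly. The model name and example text come from the steps above; the GPU note in the comment is optional.

```python
import torch
from transformers import BertTokenizer, PegasusForConditionalGeneration

tokenizer = BertTokenizer.from_pretrained("uer/pegasus-base-chinese-cluecorpussmall")
model = PegasusForConditionalGeneration.from_pretrained("uer/pegasus-base-chinese-cluecorpussmall")
model.eval()  # move model and tensors to a GPU (e.g. .to("cuda")) if one is available

text = "内容丰富、版式设计考究、图片华丽、印制精美。[MASK]纸箱内还放了充气袋用于保护。"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=50,
        do_sample=False,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```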
License
The model and its associated tools are open-source, enabling free use and modification. For detailed licensing information, refer to the respective GitHub repositories for UER-py and TencentPretrain.