T5-Base-Chinese-CLUECorpusSmall

Introduction

The T5-Base-Chinese-CLUECorpusSmall model is a Chinese variant of the Text-to-Text Transfer Transformer (T5), pre-trained on the CLUECorpusSmall dataset with the UER-py toolkit. It casts tasks into a unified text-to-text format, supports text-to-text generation out of the box, and can be applied to a range of Chinese NLP tasks.

Architecture

The T5 architecture casts pre-training into a unified text-to-text format: spans of the input sequence are replaced with sentinel tokens (written extra0, extra1, ... in this model), and the decoder learns to reconstruct the masked spans. The model was pre-trained first with a sequence length of 128 and then with 512, using UER-py; it can also be trained with TencentPretrain, which extends UER-py to support models with over one billion parameters.
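
To make the span-corruption format concrete, the sketch below builds a source/target pair the way T5 frames it. The corrupt_spans helper and the hand-picked span are illustrative only, not UER-py's actual preprocessing code; sentinel tokens follow this model's extra0, extra1, ... convention, as the generation example later in this card shows.

    # Illustrative only: build a T5-style span-corruption training pair.
    def corrupt_spans(tokens, spans):
        """Replace each (start, end) span with a sentinel token and
        collect the removed tokens as the target sequence."""
        source, target, cursor = [], [], 0
        for i, (start, end) in enumerate(spans):
            sentinel = f"extra{i}"
            source.extend(tokens[cursor:start])
            source.append(sentinel)
            target.append(sentinel)
            target.extend(tokens[start:end])
            cursor = end
        source.extend(tokens[cursor:])
        target.append(f"extra{len(spans)}")  # closing sentinel
        return source, target

    tokens = list("中国的首都是北京")                    # character-level, as with a BERT-style Chinese vocab
    source, target = corrupt_spans(tokens, [(6, 7)])  # mask the single character 北
    print(" ".join(source))  # 中 国 的 首 都 是 extra0 京
    print(" ".join(target))  # extra0 北 extra1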

Training

The model training involves two stages:

  • Stage 1: Pre-training for 1,000,000 steps with a sequence length of 128 using dynamic masking.
  • Stage 2: Additional 250,000 steps with a sequence length of 512.

Training was conducted on Tencent Cloud using the CLUECorpusSmall dataset with T5's span-masking objective. Dynamic masking means the corrupted spans are re-sampled each time a sequence is drawn, rather than fixed once at preprocessing time, so the same sentence yields different training pairs across epochs.
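
The sketch below illustrates that effect; the sampling scheme is deliberately simplified and is not UER-py's actual implementation. Combined with the corrupt_spans helper above, each draw produces a fresh source/target pair:

    import random

    def sample_span(seq_len, max_span_len=3):
        """Sample one (start, end) span to corrupt; a simplified
        stand-in for the real span-masking schedule."""
        start = random.randrange(seq_len - max_span_len)
        return start, start + random.randint(1, max_span_len)

    # With dynamic masking, the spans differ on every pass over the corpus:
    for step in range(3):
        print(sample_span(seq_len=128))  # e.g. (41, 43), then (7, 8), then (102, 105)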

Guide: Running Locally

  1. Install Dependencies: Ensure the transformers and torch libraries are installed:

    pip install transformers torch
    
  2. Load Model: Use the following Python script to initialize the model and tokenizer:

    from transformers import BertTokenizer, T5ForConditionalGeneration, Text2TextGenerationPipeline
    
    # This model uses a BERT-style Chinese vocabulary (with added sentinel
    # tokens), so it is loaded with BertTokenizer rather than T5Tokenizer.
    tokenizer = BertTokenizer.from_pretrained("uer/t5-base-chinese-cluecorpussmall")
    model = T5ForConditionalGeneration.from_pretrained("uer/t5-base-chinese-cluecorpussmall")
    text2text_generator = Text2TextGenerationPipeline(model, tokenizer)
    
  3. Generate Text: Run the pipeline. The token extra0 marks the span for the model to fill in; see the decoding sketch after this list:

    # The decoder returns the masked spans delimited by sentinels,
    # i.e. a generated_text of the form "extra0 ... extra1".
    text2text_generator("中国的首都是extra0京", max_length=50, do_sample=False)
    
  4. Cloud GPU: For better performance, consider using cloud-based GPUs such as AWS, Google Cloud, or Azure.
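
The pipeline returns the decoder output with the sentinel delimiters intact rather than splicing the prediction back into the prompt. Below is a small hedged sketch for recovering plain text, assuming an output of the shape shown in step 3; the parsing is ad hoc, not a transformers API:

    result = text2text_generator("中国的首都是extra0京", max_length=50, do_sample=False)
    generated = result[0]["generated_text"]  # e.g. "extra0 北 extra1"
    # Take the text between the extra0 and extra1 sentinels and drop
    # the whitespace the tokenizer inserts between Chinese characters.
    span = generated.split("extra0")[-1].split("extra1")[0].replace(" ", "")
    print("中国的首都是" + span + "京")  # 中国的首都是北京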

License

The T5-Base-Chinese-CLUECorpusSmall model is open-source and distributed under the licenses of the UER-py and TencentPretrain toolkits with which it was trained. Refer to the official repositories for detailed licensing information.
