T5-Base-Chinese-CLUECorpusSmall
Introduction
The T5-Base-Chinese-CLUECorpusSmall model is a variant of the Text-to-Text Transfer Transformer (T5), pre-trained for Chinese language tasks on the CLUECorpusSmall corpus. It was pre-trained with the UER-py toolkit and supports text-to-text generation for a variety of Chinese NLP applications.
Architecture
The T5 architecture casts every task into a unified text-to-text format and is pre-trained by replacing spans of the input sequence with sentinel tokens that the model must reconstruct. This checkpoint was pre-trained first with a sequence length of 128 and then with 512 using UER-py; the same recipe is also supported by TencentPretrain, which extends UER-py to models with over one billion parameters.
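To make the sentinel-token format concrete, here is a minimal, self-contained sketch of span corruption (an illustration, not UER-py's actual preprocessing code); the extra0, extra1, ... sentinel names follow the vocabulary this model uses:

```python
# Minimal illustration of T5-style span corruption (hypothetical helper,
# not the actual UER-py preprocessing code).
def corrupt_spans(tokens, spans):
    """Replace each (start, end) span with a sentinel token and
    collect the masked spans as the target sequence."""
    source, target = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"extra{i}"          # extra0, extra1, ... sentinel tokens
        source.extend(tokens[cursor:start])
        source.append(sentinel)
        target.append(sentinel)
        target.extend(tokens[start:end])
        cursor = end
    source.extend(tokens[cursor:])
    return source, target

# Example: mask the character "北" in "中国的首都是北京"
tokens = list("中国的首都是北京")
source, target = corrupt_spans(tokens, [(6, 7)])
print("".join(source))  # 中国的首都是extra0京
print("".join(target))  # extra0北
```

The model is trained to emit the target sequence (the masked spans, each prefixed by its sentinel) given the corrupted source sequence.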
Training
The model training involves two stages:
- Stage 1: Pre-training for 1,000,000 steps with a sequence length of 128 using dynamic masking.
- Stage 2: Additional 250,000 steps with a sequence length of 512.
Training was conducted on the CLUECorpusSmall dataset using Tencent Cloud, with a T5-style span-masking objective and dynamic masking.
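As a sketch of what dynamic masking means here (an illustrative assumption, not UER-py's implementation): mask positions are re-sampled every time a sequence is served, rather than being fixed once at preprocessing time:

```python
import random

# Sketch of dynamic masking: positions are re-drawn each time a sequence
# is served, so every epoch sees a different corruption of the same text.
# The 15% corruption rate is illustrative, not a documented hyperparameter.
def dynamic_mask_positions(seq_len, corrupt_rate=0.15):
    num_to_mask = max(1, int(seq_len * corrupt_rate))
    return sorted(random.sample(range(seq_len), num_to_mask))

# Two calls on the same 128-token sequence give different masks; runs of
# adjacent masked positions are then merged into spans, each replaced by
# a single sentinel as in the earlier corruption sketch.
print(dynamic_mask_positions(128))
print(dynamic_mask_positions(128))
```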
Guide: Running Locally
- Install dependencies: ensure you have the transformers and torch libraries installed:

```bash
pip install transformers torch
```
- Load the model: use the following Python script to initialize the model and tokenizer:

```python
from transformers import BertTokenizer, T5ForConditionalGeneration, Text2TextGenerationPipeline

tokenizer = BertTokenizer.from_pretrained("uer/t5-base-chinese-cluecorpussmall")
model = T5ForConditionalGeneration.from_pretrained("uer/t5-base-chinese-cluecorpussmall")
text2text_generator = Text2TextGenerationPipeline(model, tokenizer)
```
- Generate text: run the pipeline, marking the span to fill with the extra0 sentinel token:

```python
text2text_generator("中国的首都是extra0京", max_length=50, do_sample=False)
```
- Cloud GPUs: for better performance, consider running the model on cloud GPUs from providers such as AWS, Google Cloud, or Azure.
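For the example prompt above, the pipeline returns the predicted span prefixed by its sentinel (a completion along the lines of extra0 北, i.e. the model fills the masked character with 北). Below is a minimal, hypothetical post-processing sketch for stitching such a prediction back into the prompt; the fill_sentinels helper is illustrative and not part of the transformers API:

```python
import re

# Hypothetical helper (not part of transformers): substitute each
# predicted span back into the sentinel-marked prompt.
def fill_sentinels(prompt, generated):
    # Split generated text on sentinel markers: "extra0 北" -> ["北"]
    pieces = [p.strip() for p in re.split(r"extra\d+", generated) if p.strip()]
    for i, piece in enumerate(pieces):
        prompt = prompt.replace(f"extra{i}", piece.replace(" ", ""), 1)
    return prompt

result = text2text_generator("中国的首都是extra0京", max_length=50, do_sample=False)
print(fill_sentinels("中国的首都是extra0京", result[0]["generated_text"]))
# -> 中国的首都是北京 (assuming the model predicts "北" for the masked span)
```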
License
The model is open source and available under the respective licenses of the UER-py and TencentPretrain toolkits; refer to their official repositories for detailed licensing information.