T5-Base-Chinese-CLUECorpusSmall
Introduction
The T5-Base-Chinese-CLUECorpusSmall model is a variant of the Text-to-Text Transfer Transformer (T5), pre-trained for Chinese language tasks on the CLUECorpusSmall corpus. It was pre-trained with the UER-py toolkit and supports text-to-text generation for a variety of Chinese NLP applications.
Architecture
The T5 architecture casts every task into a unified text-to-text format and is pre-trained by replacing spans of the input sequence with sentinel tokens that the model must reconstruct. This checkpoint was pre-trained first with a sequence length of 128 and then with 512 using UER-py; the same recipe is also supported by TencentPretrain, which extends UER-py to models with over one billion parameters.
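To make the sentinel-token format concrete, here is a minimal, self-contained sketch of span corruption (an illustration, not UER-py's actual preprocessing code); the extra0, extra1, ... sentinel names follow the vocabulary this model uses:

```python
# Minimal illustration of T5-style span corruption (hypothetical helper,
# not the actual UER-py preprocessing code).
def corrupt_spans(tokens, spans):
    """Replace each (start, end) span with a sentinel token and
    collect the masked spans as the target sequence."""
    source, target = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"extra{i}"          # extra0, extra1, ... sentinel tokens
        source.extend(tokens[cursor:start])
        source.append(sentinel)
        target.append(sentinel)
        target.extend(tokens[start:end])
        cursor = end
    source.extend(tokens[cursor:])
    return source, target

# Example: mask the character "北" in "中国的首都是北京"
tokens = list("中国的首都是北京")
source, target = corrupt_spans(tokens, [(6, 7)])
print("".join(source))  # 中国的首都是extra0京
print("".join(target))  # extra0北
```

The model is trained to emit the target sequence (the masked spans, each prefixed by its sentinel) given the corrupted source sequence.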
Training
The model training involves two stages:
- Stage 1: Pre-training for 1,000,000 steps with a sequence length of 128 using dynamic masking.
- Stage 2: Additional 250,000 steps with a sequence length of 512.
Training was conducted on the CLUECorpusSmall dataset using Tencent Cloud, with a T5-style span-masking objective and dynamic masking.
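As a sketch of what dynamic masking means here (an illustrative assumption, not UER-py's implementation): mask positions are re-sampled every time a sequence is served, rather than being fixed once at preprocessing time:

```python
import random

# Sketch of dynamic masking: positions are re-drawn each time a sequence
# is served, so every epoch sees a different corruption of the same text.
# The 15% corruption rate is illustrative, not a documented hyperparameter.
def dynamic_mask_positions(seq_len, corrupt_rate=0.15):
    num_to_mask = max(1, int(seq_len * corrupt_rate))
    return sorted(random.sample(range(seq_len), num_to_mask))

# Two calls on the same 128-token sequence give different masks; runs of
# adjacent masked positions are then merged into spans, each replaced by
# a single sentinel as in the earlier corruption sketch.
print(dynamic_mask_positions(128))
print(dynamic_mask_positions(128))
```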
Guide: Running Locally
- Install dependencies: ensure you have the transformers and torch libraries installed:

```bash
pip install transformers torch
```
- Load the model: use the following Python script to initialize the model and tokenizer:

```python
from transformers import BertTokenizer, T5ForConditionalGeneration, Text2TextGenerationPipeline

tokenizer = BertTokenizer.from_pretrained("uer/t5-base-chinese-cluecorpussmall")
model = T5ForConditionalGeneration.from_pretrained("uer/t5-base-chinese-cluecorpussmall")
text2text_generator = Text2TextGenerationPipeline(model, tokenizer)
```
- Generate text: run the pipeline, marking the span to fill with the extra0 sentinel token:

```python
text2text_generator("中国的首都是extra0京", max_length=50, do_sample=False)
```
- Cloud GPUs: for better performance, consider running the model on cloud GPUs from providers such as AWS, Google Cloud, or Azure.
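For the example prompt above, the pipeline returns the predicted span prefixed by its sentinel (a completion along the lines of extra0 北, i.e. the model fills the masked character with 北). Below is a minimal, hypothetical post-processing sketch for stitching such a prediction back into the prompt; the fill_sentinels helper is illustrative and not part of the transformers API:

```python
import re

# Hypothetical helper (not part of transformers): substitute each
# predicted span back into the sentinel-marked prompt.
def fill_sentinels(prompt, generated):
    # Split generated text on sentinel markers: "extra0 北" -> ["北"]
    pieces = [p.strip() for p in re.split(r"extra\d+", generated) if p.strip()]
    for i, piece in enumerate(pieces):
        prompt = prompt.replace(f"extra{i}", piece.replace(" ", ""), 1)
    return prompt

result = text2text_generator("中国的首都是extra0京", max_length=50, do_sample=False)
print(fill_sentinels("中国的首都是extra0京", result[0]["generated_text"]))
# -> 中国的首都是北京 (assuming the model predicts "北" for the masked span)
```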
License
The model is open source and available under the respective licenses of the UER-py and TencentPretrain toolkits; refer to their official repositories for detailed licensing information.