PTT5 Base Portuguese Vocab
unicamp-dl
Introduction
PTT5 is a T5 model pretrained for Portuguese language tasks on the BrWac corpus, a large collection of Brazilian Portuguese web pages. It improves performance on Portuguese tasks such as sentence similarity and sentence entailment. The model is available in three sizes (small, base, and large) and with two vocabularies: Google's original T5 vocabulary and a custom one trained on Portuguese Wikipedia.
Architecture
PTT5 is based on the T5 architecture, which is designed for text-to-text tasks. It is available in three sizes:
- Small: 60M parameters
- Base: 220M parameters (recommended)
- Large: 740M parameters
Each size can be used with either Google's original T5 vocabulary or a custom Portuguese vocabulary.
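On the Hugging Face Hub, each size/vocabulary combination is a separate checkpoint. Only unicamp-dl/ptt5-base-portuguese-vocab is named on this card; the sketch below assumes the sibling checkpoints follow the same naming pattern, so verify a name on the Hub before depending on it.

```python
# Hypothetical helper that builds a PTT5 checkpoint name from the two choices.
# Only 'unicamp-dl/ptt5-base-portuguese-vocab' is confirmed by this card; the
# other size/vocabulary combinations are assumed to follow the same pattern.
def ptt5_checkpoint(size: str = "base", portuguese_vocab: bool = True) -> str:
    if size not in {"small", "base", "large"}:
        raise ValueError(f"unknown size: {size}")
    vocab = "portuguese-vocab" if portuguese_vocab else "t5-vocab"
    return f"unicamp-dl/ptt5-{size}-{vocab}"

print(ptt5_checkpoint())                # unicamp-dl/ptt5-base-portuguese-vocab
print(ptt5_checkpoint("large", False))  # unicamp-dl/ptt5-large-t5-vocab
```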
Training
PTT5 was pretrained using the BrWac corpus, which consists of a large collection of Portuguese web pages. The model has been validated for its effectiveness in Portuguese text generation and understanding tasks.
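T5 models are pretrained with a denoising objective in which spans of the input are replaced by sentinel tokens and the target reconstructs them. Assuming PTT5 follows this standard T5 setup (the card does not spell it out), a pretraining example looks roughly like the sketch below; the sentence is invented for illustration.

```python
# Sketch of the T5-style span-corruption format (illustrative example sentence).
original = "O PTT5 foi pré-treinado em páginas web brasileiras do corpus BrWac."

# The input replaces a span with a sentinel token...
corrupted_input = "O PTT5 foi pré-treinado em <extra_id_0> do corpus BrWac."

# ...and the target emits the masked span after the matching sentinel.
target = "<extra_id_0> páginas web brasileiras <extra_id_1>"
```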
Guide: Running Locally
- Install Transformers Library: Make sure you have the transformers library installed; the T5Tokenizer used below also requires the sentencepiece package. You can install both via pip:

```bash
pip install transformers sentencepiece
```
- Load the Model and Tokenizer: Use the following Python code to load the PTT5 model and tokenizer (a runnable end-to-end sketch follows the last step below):

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration, TFT5ForConditionalGeneration

model_name = 'unicamp-dl/ptt5-base-portuguese-vocab'
tokenizer = T5Tokenizer.from_pretrained(model_name)

# For PyTorch
model_pt = T5ForConditionalGeneration.from_pretrained(model_name)

# For TensorFlow (requires the tensorflow package)
model_tf = TFT5ForConditionalGeneration.from_pretrained(model_name)
```
- Cloud GPU Recommendation: For optimal performance, especially with the larger model sizes, consider using a cloud GPU service such as AWS, GCP, or Azure; the sketch below shows how to move the model onto a GPU when one is available.
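Tying steps 2 and 3 together, here is a minimal end-to-end sketch: it loads the PyTorch model, moves it to a GPU when one is available, and fills a masked span. The example sentence and generation settings are illustrative, and output from a pretrained (not fine-tuned) checkpoint will vary.

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = 'unicamp-dl/ptt5-base-portuguese-vocab'
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name).to(device)

# Ask the model to fill a masked span, mirroring the pretraining objective
# (the input sentence here is illustrative).
inputs = tokenizer("Eu gosto de <extra_id_0> no Brasil.", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```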
License
PTT5 is licensed under the MIT License, which permits broad reuse provided the license and copyright notice are retained.