ptt5-base-portuguese-vocab

unicamp-dl

Introduction

PTT5 is a T5 model pretrained for Portuguese on BrWaC, a large corpus of Brazilian Portuguese web pages. Compared with the original T5 checkpoints, it improves performance on Portuguese tasks such as sentence similarity and entailment. The model is available in three sizes (small, base, and large) and with two vocabularies: Google's original T5 vocabulary and a custom one trained on Portuguese Wikipedia.

Architecture

PTT5 is based on the T5 architecture, which is designed for text-to-text tasks. It is available in three sizes:

  • Small: 60M parameters
  • Base: 220M parameters (recommended)
  • Large: 740M parameters

Each size can be used with either Google's original T5 vocabulary or a custom Portuguese vocabulary.
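
The size and vocabulary choice together determine the checkpoint name. As a convenience, the sketch below assumes the Hub IDs follow the same unicamp-dl/ptt5-{size}-{vocab}-vocab pattern as this model's ID:

    from transformers import T5Tokenizer
    
    # Assumed naming pattern, generalized from unicamp-dl/ptt5-base-portuguese-vocab:
    # size in {"small", "base", "large"}; vocab "t5" selects Google's original
    # vocabulary, "portuguese" the custom one trained on Portuguese Wikipedia.
    size = "base"
    vocab = "portuguese"
    model_name = f"unicamp-dl/ptt5-{size}-{vocab}-vocab"
    tokenizer = T5Tokenizer.from_pretrained(model_name)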

Training

PTT5 was pretrained on BrWaC, a large corpus of Brazilian Portuguese web pages, and its effectiveness was validated on downstream Portuguese understanding tasks, including sentence similarity and entailment.

Guide: Running Locally

  1. Install Transformers Library:
    Make sure the transformers library is installed; the T5 tokenizer also depends on the sentencepiece package. You can install both via pip:

    pip install transformers sentencepiece
    
  2. Load the Model and Tokenizer:
    Use the following Python code to load the PTT5 model and tokenizer:

    from transformers import T5Tokenizer, T5ForConditionalGeneration, TFT5ForConditionalGeneration
    
    model_name = 'unicamp-dl/ptt5-base-portuguese-vocab'
    tokenizer = T5Tokenizer.from_pretrained(model_name)
    
    # For PyTorch (requires torch to be installed)
    model_pt = T5ForConditionalGeneration.from_pretrained(model_name)
    
    # For TensorFlow (requires tensorflow to be installed)
    model_tf = TFT5ForConditionalGeneration.from_pretrained(model_name)
    
  3. Cloud GPU Recommendation:
    For optimal performance, especially with larger models, consider using cloud GPU services such as AWS, GCP, or Azure.
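
  4. Quick Generation Check (optional):
    PTT5 checkpoints are pretrained with T5's span-corruption objective and are normally fine-tuned before downstream use, but you can sanity-check the PyTorch model by asking it to fill a masked span. This is a minimal sketch building on step 2; the input sentence is purely illustrative:

    import torch
    
    # Sentinel tokens such as <extra_id_0> mark spans for the model to fill,
    # matching the span-corruption objective used during pretraining.
    text = "O PTT5 foi treinado em um corpus de páginas web em <extra_id_0>."
    inputs = tokenizer(text, return_tensors="pt")
    
    # Use a GPU when available (see the recommendation in step 3).
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model_pt.to(device)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    with torch.no_grad():
        output_ids = model_pt.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))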

License

PTT5 is licensed under the MIT License, allowing for broad reuse with attribution.
