byt5-small-qa-squad-v1.1-portuguese

pierreguillou

Introduction

The byt5-small-qa-squad-v1.1-portuguese model is a language model fine-tuned for question answering in Portuguese on the SQuAD v1.1 dataset. Developed by Pierre Guillou, it builds upon ByT5 small, a tokenizer-free version of Google's T5 that is optimized for handling noisy text data.

Architecture

ByT5 largely follows the architecture of mT5 and was pre-trained on the mC4 dataset without any supervised training. It operates directly on UTF-8 bytes, which lets it process text without relying on a predefined subword vocabulary. This makes it particularly robust to noisy text, such as tweets.
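
As a quick illustration of the byte-level scheme, the snippet below (a minimal sketch using the upstream google/byt5-small tokenizer, which the fine-tuned model shares) shows that input ids are simply the UTF-8 bytes of the text shifted by 3, since ids 0-2 are reserved for the pad, end-of-sequence, and unknown tokens:

    from transformers import AutoTokenizer

    # Sketch: inspect ByT5's byte-level ids using the upstream
    # 'google/byt5-small' tokenizer (same byte scheme as the fine-tuned model).
    tokenizer = AutoTokenizer.from_pretrained('google/byt5-small')

    text = "olá"                        # 'á' occupies two UTF-8 bytes
    print(list(text.encode("utf-8")))   # [111, 108, 195, 161]
    print(tokenizer(text).input_ids)    # [114, 111, 198, 164, 1] -> bytes + 3, then </s> (id 1)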

Training

The model was fine-tuned on the Portuguese translation of the SQuAD v1.1 dataset provided by the Deep Learning Brasil group, with training run on Google Colab starting from the ByT5 small checkpoint. Note that the training data is unfiltered and may therefore contain biases.
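
Concretely, the fine-tuning casts extractive QA as text-to-text generation: each SQuAD example is flattened into a prompt/target pair roughly like the sketch below. The field layout mirrors the Hugging Face squad dataset format; the exact preprocessing used for this model is an assumption.

    # Hypothetical sketch of the text-to-text framing used for fine-tuning.
    # Field names follow the Hugging Face 'squad' dataset format.
    example = {
        "question": "Quando começou a pandemia de Covid-19 no mundo?",
        "context": "A pandemia de COVID-19...",
        "answers": {"text": ["..."]},
    }

    input_text  = f'question: {example["question"]} context: {example["context"]}'
    target_text = example["answers"]["text"][0]   # the answer span the model learns to generate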

Guide: Running Locally

To run the model locally, you can use the Hugging Face Transformers library. Below are the basic steps:

  1. Install Transformers and PyTorch:

    pip install transformers torch
    
  2. Using the Pipeline:

    from transformers import pipeline
    
    model_name = 'pierreguillou/byt5-small-qa-squad-v1.1-portuguese'
    nlp = pipeline("text2text-generation", model=model_name)
    
    input_text = """question: "Quando começou a pandemia de Covid-19 no mundo?" context: "A pandemia de COVID-19..."""
    result = nlp(input_text)
    print(result)
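
     If the call succeeds, result should be a list containing a single dictionary whose generated_text key holds the predicted answer, e.g. [{'generated_text': '...'}].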
    
  3. Using Auto Classes:

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    
    model_name = 'pierreguillou/byt5-small-qa-squad-v1.1-portuguese'
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    
    input_text = """question: "Quando começou a pandemia de Covid-19 no mundo?" context: "A pandemia de COVID-19..."""
    input_ids = tokenizer(input_text, return_tensors='pt').input_ids
    outputs = model.generate(input_ids, max_length=64, num_beams=1)
    result = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(result)
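
     Note that because ByT5 generates one byte at a time, max_length is measured in bytes rather than subword tokens; max_length=64 therefore caps answers at roughly 64 characters (fewer for accented Portuguese text, where some characters take two bytes), so increase it if you expect longer answers.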
    

Cloud GPU

For faster inference and training, consider running the model on a cloud GPU service such as Google Colab, AWS, or Azure.
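
On any of these services, inference can be moved to the GPU with a couple of extra lines; a minimal sketch, assuming a CUDA device is available:

    import torch
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    model_name = 'pierreguillou/byt5-small-qa-squad-v1.1-portuguese'
    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

    # Encoded inputs must live on the same device as the model.
    input_text = 'question: "Quando começou a pandemia de Covid-19 no mundo?" context: "A pandemia de COVID-19..."'
    input_ids = tokenizer(input_text, return_tensors='pt').input_ids.to(device)
    outputs = model.generate(input_ids, max_length=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))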

License

This project is licensed under the Apache-2.0 License.
