byt5 small qa squad v1.1 portuguese
Introduction
The BYT5-SMALL-QA-SQUAD-V1.1-PORTUGUESE model is a language model fine-tuned for question answering in Portuguese on the SQuAD v1.1 dataset. Developed by Pierre Guillou, it builds upon ByT5 small, a tokenizer-free version of Google's T5 that is well suited to noisy text data.
Architecture
ByT5 generally follows the architecture of mT5 and was pre-trained on the mC4 dataset without any supervised training. The model operates on byte-level tokens, which lets it process text without relying on a predefined vocabulary. This makes it particularly robust to noisy text, such as tweets.
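The idea behind byte-level tokenization can be illustrated without any library: ByT5 maps each UTF-8 byte of the input to a token id, offset by the number of reserved special tokens (pad, eos, unk). The sketch below is a simplified stand-in for the real tokenizer, not the library implementation:

```python
# Minimal sketch of ByT5-style byte-level tokenization: no vocabulary
# file is needed, because every UTF-8 byte is itself a token. Ids are
# offset by the reserved special tokens (pad=0, eos=1, unk=2).
SPECIAL_TOKENS = 3

def encode(text: str) -> list[int]:
    """Turn a string into byte-level token ids."""
    return [b + SPECIAL_TOKENS for b in text.encode("utf-8")]

def decode(ids: list[int]) -> str:
    """Invert encode(), skipping ids in the special-token range."""
    return bytes(i - SPECIAL_TOKENS for i in ids if i >= SPECIAL_TOKENS).decode("utf-8")

ids = encode("Olá")
print(ids)          # "á" occupies two UTF-8 bytes, so 3 characters yield 4 ids
print(decode(ids))  # round-trips back to "Olá"
```

Because accented Portuguese characters expand to multiple bytes, byte-level sequences are longer than subword sequences, which is one reason ByT5 inference is slower than mT5 for the same text.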
Training
The model was fine-tuned on the Portuguese version of the SQuAD v1.1 dataset provided by the Deep Learning Brasil group. Fine-tuning was performed on Google Colab starting from the ByT5 small checkpoint. Note that the training data is unfiltered and may therefore contain biases.
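The exact preprocessing pipeline is not shown in this card, but the usage examples below imply that each SQuAD-style record is flattened into a single `question: "..." context: "..."` prompt for the text2text model. A minimal sketch of that formatting step (the helper name and the context sentence are illustrative, not from the original):

```python
def format_squad_example(question: str, context: str) -> str:
    """Flatten a SQuAD-style record into the text2text prompt format
    used in this model card's examples: question: "..." context: "..."."""
    return f'question: "{question}" context: "{context}"'

prompt = format_squad_example(
    "Quando começou a pandemia de Covid-19 no mundo?",
    "A pandemia de COVID-19 começou em dezembro de 2019.",
)
print(prompt)
```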
Guide: Running Locally
To run the model locally, you can use the Hugging Face Transformers library. Below are the basic steps:
- Install Transformers and PyTorch:

```shell
pip install transformers torch
```
- Using the pipeline:

```python
from transformers import pipeline

model_name = 'pierreguillou/byt5-small-qa-squad-v1.1-portuguese'
nlp = pipeline("text2text-generation", model=model_name)

input_text = """question: "Quando começou a pandemia de Covid-19 no mundo?" context: "A pandemia de COVID-19..."""
result = nlp(input_text)
print(result)
```
- Using Auto classes:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = 'pierreguillou/byt5-small-qa-squad-v1.1-portuguese'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

input_text = """question: "Quando começou a pandemia de Covid-19 no mundo?" context: "A pandemia de COVID-19..."""
input_ids = tokenizer(input_text, return_tensors='pt').input_ids
outputs = model.generate(input_ids, max_length=64, num_beams=1)
result = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(result)
```
Cloud GPU
For enhanced performance and faster processing, consider using cloud GPU services such as Google Colab, AWS, or Azure.
License
This project is licensed under the Apache-2.0 License.