translation-pt-en-t5
unicamp-dl
Introduction
The translation-pt-en-t5 repository provides an implementation of the T5 model for Portuguese-English translation, optimized for modest hardware setups. It includes modifications to tokenization and post-processing and leverages a Portuguese pretrained model to improve translation quality. Further details are available on GitHub and in the accompanying research paper.
Architecture
The model is based on the T5 architecture, which frames translation as a text-to-text task. It is trained on several parallel corpora, including EMEA, ParaCrawl 99k, CAPES, Scielo, JRC-Acquis, and biomedical domain corpora, and translation quality is evaluated with the BLEU metric.
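As a point of reference, BLEU is commonly computed with the sacrebleu package. The snippet below is a minimal sketch of scoring a hypothesis against a reference, not the repository's evaluation script; the example sentences are illustrative only.
# Hedged sketch: corpus-level BLEU with sacrebleu (pip install sacrebleu).
# The hypothesis/reference pair is illustrative, not from the model's eval set.
import sacrebleu

hypotheses = ["I like to eat rice."]
references = [["I like eating rice."]]  # one reference stream, aligned with hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")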
Training
The training strategy starts from a Portuguese pretrained model and applies custom tokenization and post-processing to improve translation accuracy. Training on diverse corpora helps the model translate reliably across different domains.
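To make the text-to-text setup concrete, the sketch below shows how one Portuguese-English pair could be formatted and tokenized for T5-style fine-tuning. The prompt prefix mirrors the inference example in the guide below; the sequence length is an illustrative assumption, not the repository's actual training configuration.
# Hedged sketch of preparing one training pair in T5's text-to-text format.
# max_length=128 is illustrative; the real training script may differ.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unicamp-dl/translation-pt-en-t5")

source = "translate Portuguese to English: Eu gosto de comer arroz."
target = "I like to eat rice."

model_inputs = tokenizer(source, truncation=True, max_length=128, return_tensors="pt")
labels = tokenizer(target, truncation=True, max_length=128, return_tensors="pt")
model_inputs["labels"] = labels["input_ids"]  # decoder targets for seq2seq training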
Guide: Running Locally
To run the model locally, follow these steps:
- Install the Transformers library:
  pip install transformers
- Set up the model and tokenizer:
  from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

  tokenizer = AutoTokenizer.from_pretrained("unicamp-dl/translation-pt-en-t5")
  model = AutoModelForSeq2SeqLM.from_pretrained("unicamp-dl/translation-pt-en-t5")
- Create a translation pipeline (a note on generation options follows these steps):
  pten_pipeline = pipeline('text2text-generation', model=model, tokenizer=tokenizer)
  result = pten_pipeline("translate Portuguese to English: Eu gosto de comer arroz.")
  print(result)
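By default the pipeline uses the model's standard generation settings, so long sentences may come back truncated. Generation arguments such as max_length can be passed directly through the pipeline call; the snippet below is a minimal sketch assuming the pipeline created above, with max_length=256 as an illustrative value rather than a repository recommendation.
# Hedged sketch: passing generation arguments through the pipeline call.
# max_length=256 is an illustrative value, not a repository recommendation.
result = pten_pipeline(
    "translate Portuguese to English: Eu gosto de comer arroz.",
    max_length=256,
)
print(result[0]['generated_text'])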
For better throughput, consider running inference on a GPU, for example through cloud services such as AWS, Google Cloud, or Azure.
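If a cloud or local GPU is available, the pipeline can be placed on it through the device argument. The snippet below is a minimal sketch assuming the model and tokenizer from the steps above and that CUDA device 0 exists.
# Hedged sketch: run the pipeline on the first CUDA device (device=0).
# Drop the argument to fall back to CPU.
pten_pipeline = pipeline('text2text-generation', model=model, tokenizer=tokenizer, device=0)
print(pten_pipeline("translate Portuguese to English: Eu gosto de comer arroz."))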
License
Use of the repository and model is subject to the licensing terms listed on the model's Hugging Face page. Users must comply with these terms for both commercial and non-commercial use.