translation-en-pt-t5
unicamp-dl
Introduction
This repository provides a T5 implementation for English-to-Portuguese translation, optimized to run on modest hardware. It adds enhancements to tokenization and post-processing and leverages a pretrained Portuguese model to improve translation quality. For further details, see the GitHub repository and the accompanying research paper.
Architecture
The model is based on the T5 architecture for text-to-text generation, configured here for English-to-Portuguese translation. It draws on datasets such as EMEA, ParaCrawl 99k, CAPES, Scielo, JRC-Acquis, and biomedical domain corpora, and its performance is evaluated with the BLEU metric.
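The reported scores themselves live in the paper; purely as a sketch of how corpus-level BLEU can be computed, the snippet below uses the `sacrebleu` package. The sentences are invented placeholders, not the model's actual test data.

```python
import sacrebleu

# Invented example outputs and references; the real evaluation uses the
# corpora listed above.
hypotheses = [
    "Eu gosto de comer arroz.",
    "O livro está sobre a mesa.",
]
# corpus_bleu expects a list of reference streams, each parallel to hypotheses.
references = [[
    "Gosto de comer arroz.",
    "O livro está na mesa.",
]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
```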
Training
The model was trained with lightweight strategies suited to limited hardware. It builds on a pretrained Portuguese model and applies targeted optimizations to tokenization and post-processing to improve translation quality.
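The exact recipe is documented in the repository and paper; purely as an illustrative sketch, a comparable fine-tuning run could be set up with the Hugging Face `Seq2SeqTrainer`. The dataset (`opus_books`, `en-pt`), sequence lengths, and hyperparameters below are assumptions for demonstration, not the authors' configuration.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("unicamp-dl/translation-en-pt-t5")
model = AutoModelForSeq2SeqLM.from_pretrained("unicamp-dl/translation-en-pt-t5")

# Illustrative parallel corpus; substitute your own English-Portuguese pairs.
dataset = load_dataset("opus_books", "en-pt", split="train")

def preprocess(batch):
    # T5 expects the task to be stated in the input text itself.
    inputs = ["translate English to Portuguese: " + ex["en"] for ex in batch["translation"]]
    targets = [ex["pt"] for ex in batch["translation"]]
    model_inputs = tokenizer(inputs, max_length=128, truncation=True)
    labels = tokenizer(text_target=targets, max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="enpt-t5-finetuned",
    per_device_train_batch_size=8,  # modest-hardware-friendly; tune to your GPU
    num_train_epochs=1,
    learning_rate=3e-4,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```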
Guide: Running Locally
To run the model locally, follow these steps:
- Install the `transformers` library (plus `sentencepiece`, which the T5 tokenizer requires).
- Load the tokenizer and model:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("unicamp-dl/translation-en-pt-t5")
model = AutoModelForSeq2SeqLM.from_pretrained("unicamp-dl/translation-en-pt-t5")
```

- Set up a translation pipeline:

```python
enpt_pipeline = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
```

- Translate text by prefixing it with the task instruction:

```python
enpt_pipeline("translate English to Portuguese: I like to eat rice.")
```
For enhanced performance, consider using cloud GPU services like AWS, Google Cloud, or Microsoft Azure.
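If a CUDA GPU is available, the pipeline can be placed on it with the standard `device` argument of `transformers.pipeline` (0 selects the first GPU):

```python
# device=0 runs inference on the first CUDA GPU; use device=-1 (the default) for CPU.
enpt_pipeline = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    device=0,
)
```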
License
The repository does not explicitly mention a license. For usage permissions and restrictions, please refer to the original repository or contact the authors.