opus-mt-tc-big-en-pt
Helsinki-NLP
Introduction
The OPUS-MT-TC-BIG-EN-PT model is a neural machine translation model designed to translate English text into Portuguese. It is part of the OPUS-MT project, which aims to make neural machine translation models widely available and accessible. The model was originally trained using the Marian NMT framework.
Architecture
The OPUS-MT-TC-BIG-EN-PT model uses the transformer-big architecture. It supports multiple target languages, so each input sentence must begin with a language token of the form >>id<< (for example >>por<<) that selects the target language. The model is trained on data from OPUS and uses SentencePiece for tokenization.
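The language-token requirement can be sketched as a small preprocessing helper. This is a minimal sketch under stated assumptions: `add_target_token` is a hypothetical helper name (it is not part of Transformers), the token format is assumed to be `>>id<<`, and the `por` and `pob` labels are assumed from the model card's list of valid target labels, so check that list before relying on them.

```python
import re

# Matches a leading language token such as ">>por<<" (assumed format: >>id<<).
_TOKEN_RE = re.compile(r"^\s*>>[A-Za-z_]+<<")


def add_target_token(sentences, lang_id="por"):
    """Prefix each sentence with >>lang_id<< unless it already starts with a token.

    Hypothetical helper for illustration; `lang_id` must be a label the
    checkpoint actually accepts.
    """
    tagged = []
    for sentence in sentences:
        sentence = sentence.strip()
        if _TOKEN_RE.match(sentence):
            tagged.append(sentence)  # keep an explicit token chosen by the caller
        else:
            tagged.append(f">>{lang_id}<< {sentence}")
    return tagged


print(add_target_token(["Tom tried to stab me.", ">>pob<< Hello."]))
# prints ['>>por<< Tom tried to stab me.', '>>pob<< Hello.']
```

Leaving already-tagged sentences untouched lets a caller mix target variants in a single batch while still getting a default for untagged input.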
Training
The training data for this model was sourced from the OPUS project, specifically from the opusTCv20210807+bt dataset. The model was converted to PyTorch using the Hugging Face Transformers library. It achieves BLEU scores of 50.4 on the flores101-devtest dataset and 49.6 on the tatoeba-test-v2021-08-07 dataset.
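For readers unfamiliar with the metric, the following is a simplified sketch of how a corpus-level BLEU score is computed: clipped n-gram precisions up to 4-grams, combined by a geometric mean and scaled by a brevity penalty. It assumes whitespace tokenization, a single reference per sentence, and no smoothing, so it will not reproduce the scores reported above, which come from a standard evaluation tool with its own tokenization. The function names are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU on a 0-100 scale (one reference per hypothesis)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipping: an n-gram is credited at most as often as it
            # appears in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # this sketch applies no smoothing
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


score = corpus_bleu(["O gato está no sofá"], ["O gato está no tapete"])
print(round(score, 1))
```

Statistics are accumulated over the whole corpus before the precisions are taken, which is why corpus BLEU differs from an average of sentence-level scores.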
Guide: Running Locally
To run the OPUS-MT-TC-BIG-EN-PT model locally:
- Install Transformers: Install the Hugging Face Transformers library together with SentencePiece, which the Marian tokenizer requires, and PyTorch.
    pip install transformers sentencepiece torch
- Load Model and Tokenizer:
    from transformers import MarianMTModel, MarianTokenizer

    model_name = "Helsinki-NLP/opus-mt-tc-big-en-pt"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
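- Translate Text: Prefix each source sentence with the target language token.
    # ">>por<<" selects Portuguese; the model expects a token of this form on every sentence.
    src_text = [">>por<< Tom tried to stab me."]
    translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

    for t in translated:
        print(tokenizer.decode(t, skip_special_tokens=True))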
For faster inference, consider running the model on a GPU, for example through a cloud service such as AWS EC2, Google Cloud, or Azure.
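When a GPU is available, the model and its inputs need to be moved to it explicitly. The sketch below shows one way to do that while translating in batches. The helper names (`batched`, `pick_device`, `translate`) and the default batch size are illustrative assumptions, not part of the model card; the Transformers and PyTorch imports are deferred so the small helpers work even without those packages installed.

```python
def batched(items, batch_size):
    """Yield successive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def translate(sentences, model_name="Helsinki-NLP/opus-mt-tc-big-en-pt", batch_size=8):
    """Translate sentences in batches on the best available device."""
    # Imported lazily so the helpers above stay usable without these packages.
    import torch
    from transformers import MarianMTModel, MarianTokenizer

    device = pick_device()
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name).to(device).eval()

    outputs = []
    for batch in batched(sentences, batch_size):
        # Inputs must live on the same device as the model.
        inputs = tokenizer(batch, return_tensors="pt", padding=True).to(device)
        with torch.no_grad():  # inference only, no gradients needed
            generated = model.generate(**inputs)
        outputs.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return outputs


# Example call (downloads the model on first use):
# translate([">>por<< Tom tried to stab me."])
```

Batching keeps memory use bounded on long inputs; the right batch size depends on the GPU and on sentence length, so the default of 8 is only a starting point.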
License
The OPUS-MT-TC-BIG-EN-PT model is licensed under the Creative Commons Attribution 4.0 International License (cc-by-4.0).