mbart-large-50-finetuned-opus-en-pt-translation
Introduction
The mbart-large-50-finetuned-opus-en-pt-translation model, created by Narrativa, translates text from English to Portuguese. It is built on the mBART-50 architecture and fine-tuned for neural machine translation (NMT) on the OPUS-100 and OPUS Books datasets.
Architecture
mBART-50 is a multilingual Sequence-to-Sequence model employing "Multilingual Denoising Pretraining." This model supports multilingual translation by fine-tuning on multiple language directions simultaneously. It extends the original mBART model to handle 50 languages, enhancing its capability to support diverse multilingual machine translation tasks. The model's pre-training involves noising source documents using sentence shuffling and in-filling schemes, then reconstructing the original text, with 35% of words masked.
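The noising scheme can be illustrated with a short, simplified sketch in plain Python. This is not the actual pretraining code: real mBART operates on subword tokens and samples in-filling span lengths from a Poisson distribution; here, as a simplification, one contiguous span covering roughly 35% of the words is replaced per sentence.

```python
import random

MASK = "<mask>"
MASK_RATE = 0.35  # mBART masks roughly 35% of the words


def noise_document(sentences, rng):
    """Simplified mBART-style noising: shuffle sentence order, then
    replace one span of words per sentence with a single <mask> token
    (text in-filling). The model learns to reconstruct the original."""
    # 1. Sentence permutation: shuffle the order of the sentences.
    sentences = sentences[:]
    rng.shuffle(sentences)

    noised = []
    for sent in sentences:
        words = sent.split()
        n_to_mask = max(1, round(len(words) * MASK_RATE))
        start = rng.randrange(0, len(words) - n_to_mask + 1)
        # 2. Text in-filling: a whole span collapses to one <mask> token,
        #    so the model must also infer how many words are missing.
        words[start:start + n_to_mask] = [MASK]
        noised.append(" ".join(words))
    return noised


rng = random.Random(0)
doc = ["the cat sat on the mat", "it was a sunny day"]
print(noise_document(doc, rng))
```

During pretraining, the noised document is the encoder input and the original document is the decoder target, which teaches the model to generate fluent text conditioned on corrupted input.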
Training
The model was fine-tuned on the OPUS-100 dataset, an English-centric corpus covering 100 languages, with at least 10k sentence pairs available for 95 of the language pairs. Training used up to 1M sentence pairs per language pair. The fine-tuned model achieves a BLEU score of 20.61.
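BLEU measures n-gram overlap between system output and reference translations. Below is a minimal, unsmoothed corpus-BLEU helper for illustration only; published scores such as the 20.61 above are normally computed with standard tooling (e.g. sacreBLEU, with its own tokenization and smoothing), so this simplified version will not reproduce them exactly.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty, scaled to 0-100."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ng, r_ng = ngrams(h, n), ngrams(r, n)
            matches[n - 1] += sum((h_ng & r_ng).values())  # clipped counts
            totals[n - 1] += max(len(h) - n + 1, 0)
    if 0 in matches:
        return 0.0  # without smoothing, any zero precision zeroes the score
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty discourages overly short hypotheses.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)


print(corpus_bleu(["the cat sat on the mat"], ["the cat sat on the mat"]))  # 100.0
```

A perfect match scores 100; in practice, scores in the 20s are typical of usable general-domain translation systems.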
Guide: Running Locally
To run the model locally:
- Clone the Transformers repository:

  ```shell
  git clone https://github.com/huggingface/transformers.git
  ```

- Install the package:

  ```shell
  pip install -q ./transformers
  ```
- Use the following Python script to perform translations:
```python
import torch
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

ckpt = 'Narrativa/mbart-large-50-finetuned-opus-en-pt-translation'
device = 'cuda' if torch.cuda.is_available() else 'cpu'  # fall back to CPU if no GPU

tokenizer = MBart50TokenizerFast.from_pretrained(ckpt)
model = MBartForConditionalGeneration.from_pretrained(ckpt).to(device)

tokenizer.src_lang = 'en_XX'  # source language is English

def translate(text):
    inputs = tokenizer(text, return_tensors='pt').to(device)
    # Force the decoder to begin with the Portuguese language token.
    output = model.generate(inputs.input_ids,
                            attention_mask=inputs.attention_mask,
                            forced_bos_token_id=tokenizer.lang_code_to_id['pt_XX'])
    return tokenizer.decode(output[0], skip_special_tokens=True)

translate('here your English text to be translated to Portuguese...')
```
For optimal performance, it is recommended to use a cloud GPU service such as AWS EC2, Google Cloud, or Azure.
License
The model and its associated datasets are made available under the licenses provided by the original creators, and users should adhere to these when using the model in their applications.