mbart-large-50-finetuned-opus-en-pt-translation
Introduction
The mbart-large-50-finetuned-opus-en-pt-translation model, created by Narrativa, translates text from English to Portuguese. It is built on the mBART-50 architecture and fine-tuned for neural machine translation (NMT) on the OPUS-100 and OPUS Books datasets.
Architecture
mBART-50 is a multilingual Sequence-to-Sequence model employing "Multilingual Denoising Pretraining." This model supports multilingual translation by fine-tuning on multiple language directions simultaneously. It extends the original mBART model to handle 50 languages, enhancing its capability to support diverse multilingual machine translation tasks. The model's pre-training involves noising source documents using sentence shuffling and in-filling schemes, then reconstructing the original text, with 35% of words masked.
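The noising scheme can be illustrated with a short, simplified sketch in plain Python. This is not the actual pretraining code: real mBART operates on subword tokens and samples in-filling span lengths from a Poisson distribution; here, as a simplification, one contiguous span covering roughly 35% of the words is replaced per sentence.

```python
import random

MASK = "<mask>"
MASK_RATE = 0.35  # mBART masks roughly 35% of the words


def noise_document(sentences, rng):
    """Simplified mBART-style noising: shuffle sentence order, then
    replace one span of words per sentence with a single <mask> token
    (text in-filling). The model learns to reconstruct the original."""
    # 1. Sentence permutation: shuffle the order of the sentences.
    sentences = sentences[:]
    rng.shuffle(sentences)

    noised = []
    for sent in sentences:
        words = sent.split()
        n_to_mask = max(1, round(len(words) * MASK_RATE))
        start = rng.randrange(0, len(words) - n_to_mask + 1)
        # 2. Text in-filling: a whole span collapses to one <mask> token,
        #    so the model must also infer how many words are missing.
        words[start:start + n_to_mask] = [MASK]
        noised.append(" ".join(words))
    return noised


rng = random.Random(0)
doc = ["the cat sat on the mat", "it was a sunny day"]
print(noise_document(doc, rng))
```

During pretraining, the noised document is the encoder input and the original document is the decoder target, which teaches the model to generate fluent text conditioned on corrupted input.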
Training
The model was fine-tuned on the OPUS-100 dataset, an English-centric corpus covering 100 languages, with at least 10k sentence pairs available for 95 of the language pairs. Training used up to 1M sentence pairs per language pair. The fine-tuned model achieves a BLEU score of 20.61.
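BLEU measures n-gram overlap between system output and reference translations. Below is a minimal, unsmoothed corpus-BLEU helper for illustration only; published scores such as the 20.61 above are normally computed with standard tooling (e.g. sacreBLEU, with its own tokenization and smoothing), so this simplified version will not reproduce them exactly.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty, scaled to 0-100."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ng, r_ng = ngrams(h, n), ngrams(r, n)
            matches[n - 1] += sum((h_ng & r_ng).values())  # clipped counts
            totals[n - 1] += max(len(h) - n + 1, 0)
    if 0 in matches:
        return 0.0  # without smoothing, any zero precision zeroes the score
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty discourages overly short hypotheses.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)


print(corpus_bleu(["the cat sat on the mat"], ["the cat sat on the mat"]))  # 100.0
```

A perfect match scores 100; in practice, scores in the 20s are typical of usable general-domain translation systems.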
Guide: Running Locally
To run the model locally:
- Clone the Transformers repository:

  ```shell
  git clone https://github.com/huggingface/transformers.git
  ```

- Install the package:

  ```shell
  pip install -q ./transformers
  ```
- Use the following Python script to perform translations:
```python
import torch
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

ckpt = 'Narrativa/mbart-large-50-finetuned-opus-en-pt-translation'
device = 'cuda' if torch.cuda.is_available() else 'cpu'  # fall back to CPU if no GPU

tokenizer = MBart50TokenizerFast.from_pretrained(ckpt)
model = MBartForConditionalGeneration.from_pretrained(ckpt).to(device)

tokenizer.src_lang = 'en_XX'  # source language is English

def translate(text):
    inputs = tokenizer(text, return_tensors='pt').to(device)
    # Force the decoder to begin with the Portuguese language token.
    output = model.generate(inputs.input_ids,
                            attention_mask=inputs.attention_mask,
                            forced_bos_token_id=tokenizer.lang_code_to_id['pt_XX'])
    return tokenizer.decode(output[0], skip_special_tokens=True)

translate('here your English text to be translated to Portuguese...')
```
For optimal performance, it is recommended to use a cloud GPU service such as AWS EC2, Google Cloud, or Azure.
License
The model and its associated datasets are made available under the licenses provided by the original creators, and users should adhere to these when using the model in their applications.