mBART-large-50

facebook

Introduction

mBART-50 is a multilingual sequence-to-sequence model introduced to show that multilingual translation models can be created through multilingual fine-tuning of a single pre-trained model. Pre-trained with the "Multilingual Denoising Pretraining" objective, mBART-50 supports translation across 50 languages, extending the original mBART model (which covered 25 languages) with 25 additional ones.

Architecture

The model uses a sequence-to-sequence architecture trained with a multilingual denoising pretraining objective. Monolingual data from all languages is concatenated, noise is applied by shuffling sentence order and by an in-filling scheme that replaces spans of text with a mask token, and the model is trained to reconstruct the original text. A language id (LID) token identifies each language and is used as the initial token when predicting the sentence.
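To make the objective concrete, here is a toy sketch of the two noising schemes. It is illustrative only: the helper names are made up, and while the roughly 35% masking ratio and Poisson-sampled span lengths come from the mBART paper, this simplified version masks a single contiguous span per sentence.

import random

MASK = "<mask>"

def shuffle_sentences(sentences):
    # Scheme 1: randomly permute the order of the original sentences.
    shuffled = list(sentences)
    random.shuffle(shuffled)
    return shuffled

def infill(tokens, mask_ratio=0.35):
    # Scheme 2 (simplified): replace one contiguous span with a single
    # <mask> token. The real objective samples span lengths from a
    # Poisson distribution and masks about 35% of the tokens.
    n = max(1, int(len(tokens) * mask_ratio))
    start = random.randrange(len(tokens) - n + 1)
    return tokens[:start] + [MASK] + tokens[start + n:]

def make_training_pair(sentences, lang_code):
    # Source: the noised document; target: the original document.
    # Both sides use the "[lang_code] X [eos]" format described below.
    noised = " ".join(" ".join(infill(s.split())) for s in shuffle_sentences(sentences))
    original = " ".join(sentences)
    return f"{lang_code} {noised} </s>", f"{lang_code} {original} </s>"

src, tgt = make_training_pair(["The cat sat on the mat.", "It was sunny."], "en_XX")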

Training

mBART-50 requires sequences formatted with a language id token as a prefix. The text format is [lang_code] X [eos], where lang_code is the source language id for the source text and the target language id for the target text. The model can then be trained like other sequence-to-sequence models. The example below sets up the tokenizer and model for an English-to-Romanian translation pair.

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50", src_lang="en_XX", tgt_lang="ro_RO")

src_text = " UN Chief Says There Is No Military Solution in Syria"
tgt_text = "Şeful ONU declară că nu există o soluţie militară în Siria"

# The tokenizer prepends the source language code (en_XX) and appends </s>,
# producing the "[lang_code] X [eos]" format described above.
model_inputs = tokenizer(src_text, return_tensors="pt")

# Tokenize the target with the target language code (ro_RO) as the prefix.
with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_text, return_tensors="pt").input_ids

model(**model_inputs, labels=labels)  # forward pass computes the training loss
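Note that as_target_tokenizer is deprecated in recent transformers releases. A sketch of the equivalent call, assuming transformers v4.22 or later, where the target text is passed directly:

# Tokenize source and target in one call; the returned dict already
# contains "labels" built from the target text.
batch = tokenizer(src_text, text_target=tgt_text, return_tensors="pt")
loss = model(**batch).loss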

Guide: Running Locally

  1. Installation: Install the transformers library (pip install transformers) and make sure PyTorch or TensorFlow is installed on your system.
  2. Setup: Initialize the model and tokenizer, specifying the source and target languages with their codes (e.g. en_XX, ro_RO).
  3. Data Preparation: Format your data with language id tokens as described above.
  4. Training: Fine-tune the model on your own parallel data, using the example above as a template.
  5. Execution: Run the model on a local machine; a minimal inference sketch follows this list. For better performance, consider cloud GPU services such as AWS, GCP, or Azure.
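The sketch below shows inference. As an assumption for illustration, it loads the fine-tuned checkpoint facebook/mbart-large-50-many-to-many-mmt, since the plain facebook/mbart-large-50 checkpoint is intended for fine-tuning rather than out-of-the-box translation; forced_bos_token_id makes the target language code the first generated token.

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Assumption: using the many-to-many fine-tuned checkpoint for ready-made
# translation; the base "facebook/mbart-large-50" needs fine-tuning first.
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

tokenizer.src_lang = "en_XX"
encoded = tokenizer("UN Chief Says There Is No Military Solution in Syria", return_tensors="pt")

# Force the first generated token to be the target language id (ro_RO).
generated = model.generate(
    **encoded,
    forced_bos_token_id=tokenizer.lang_code_to_id["ro_RO"],
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))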

License

mBART-50 is licensed under the MIT License, allowing for flexible use in both personal and commercial projects.