m2m100_1.2B

facebook

Introduction

M2M100 1.2B is a multilingual encoder-decoder (sequence-to-sequence) model designed for many-to-many multilingual translation. It supports 100 languages, enabling direct translation across 9,900 language directions. The model was introduced in the paper "Beyond English-Centric Multilingual Machine Translation" (Fan et al., 2020), and the original code is available in the fairseq GitHub repository.

Architecture

M2M100 is a transformer-based encoder-decoder model that can translate directly between any pair of its supported languages. It uses a dedicated multilingual tokenizer, M2M100Tokenizer, which depends on the sentencepiece library for subword tokenization and marks the source and target languages with special language-code tokens.
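To make the language-marking idea concrete, here is a small illustrative sketch (not the real implementation) of how a multilingual tokenizer can tag the source language and force the target language. The `__xx__` token format mirrors M2M100's convention; the helper names are our own.

```python
# Illustrative sketch of multilingual language tokens, assuming
# "__xx__"-style codes as used by M2M100. Not the actual tokenizer.

LANG_TOKENS = {code: f"__{code}__" for code in ["en", "fr", "hi", "zh"]}

def encode_with_src_lang(text: str, src_lang: str) -> list:
    """Prepend the source-language token so the encoder knows the input language."""
    return [LANG_TOKENS[src_lang]] + text.split()

def forced_bos(tgt_lang: str) -> str:
    """The decoder is seeded with the target-language token to pick the output language."""
    return LANG_TOKENS[tgt_lang]

tokens = encode_with_src_lang("Life is like a box of chocolates.", "en")
print(tokens[0])        # __en__
print(forced_bos("fr")) # __fr__
```

In the real model, `forced_bos_token_id` plays the role of `forced_bos` here: generation starts from the target-language token, which steers decoding into that language.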

Training

The model is trained to translate between 100 languages without pivoting through English. Unlike English-centric systems, its training data covers non-English language pairs directly, which improves quality for pairs that would otherwise require two translation hops through English.
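The 9,900 figure follows directly from the language count: each of the 100 languages can translate into the other 99. A quick check:

```python
# Verify the direction count: 100 languages, each translating
# into the other 99 (same-language pairs excluded).
num_languages = 100
direction_count = num_languages * (num_languages - 1)
print(direction_count)  # 9900
```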

Guide: Running Locally

  1. Setup Environment: Ensure Python and PyTorch are installed. Install the SentencePiece library using:

    pip install sentencepiece
    
  2. Install Transformers: Use the following command to install Hugging Face's Transformers library:

    pip install transformers
    
  3. Load the Model and Tokenizer:

    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
    
    model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_1.2B")
    tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_1.2B")
    
  4. Translate Text:

    • Hindi to French:
      hi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।"  # "Life is like a box of chocolates."
      tokenizer.src_lang = "hi"
      encoded_hi = tokenizer(hi_text, return_tensors="pt")
      generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.get_lang_id("fr"))
      print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
      
    • Chinese to English:
      chinese_text = "生活就像一盒巧克力。"  # "Life is like a box of chocolates."
      tokenizer.src_lang = "zh"
      encoded_zh = tokenizer(chinese_text, return_tensors="pt")
      generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en"))
      print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
      
  5. Cloud GPU Recommendation: For large-scale translation tasks, consider using cloud-based GPU services like AWS, Google Cloud, or Azure for better performance.
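The steps above can be combined into a single runnable sketch. The `translate` helper below is our own convenience wrapper, not part of the Transformers API; the model loading and generation calls follow the documented M2M100 usage.

```python
# End-to-end sketch combining the guide's steps. `translate` is a
# hypothetical helper wrapping the standard M2M100 generation calls.

def translate(text, src_lang, tgt_lang, model, tokenizer):
    """Translate `text` from src_lang to tgt_lang with an M2M100 checkpoint."""
    tokenizer.src_lang = src_lang                  # tell the tokenizer the input language
    encoded = tokenizer(text, return_tensors="pt")
    generated = model.generate(
        **encoded,
        forced_bos_token_id=tokenizer.get_lang_id(tgt_lang),  # force the output language
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

if __name__ == "__main__":
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_1.2B")
    tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_1.2B")
    print(translate("生活就像一盒巧克力。", "zh", "en", model, tokenizer))
```

Keeping the model and tokenizer as parameters lets you load the 1.2B checkpoint once and reuse it across many translation calls, which matters given the model's size.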

License

The M2M100 model is released under the MIT License, permitting flexibility for both non-commercial and commercial use.
