facebook/m2m100_1.2B

Introduction
M2M100 1.2B is a multilingual encoder-decoder (seq-to-seq) model designed for many-to-many multilingual translation. It supports 100 languages, enabling translation across 9,900 language pairs. The model was introduced in the paper "Beyond English-Centric Multilingual Machine Translation" (Fan et al., 2020), and the original code is available in the fairseq GitHub repository.
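The pair count follows directly from the language count: every ordered (source, target) pair of distinct languages is one translation direction. A quick sanity check:

```python
# 100 supported languages; each ordered pair of distinct languages
# is one translation direction.
num_languages = 100
directions = num_languages * (num_languages - 1)
print(directions)  # 9900
```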
Architecture
M2M100 is built on a transformer encoder-decoder architecture that allows direct translation between any pair of the supported languages. It uses a multilingual tokenizer, M2M100Tokenizer, which relies on the sentencepiece library for tokenization.
Training
The model is trained to facilitate translation between 100 languages without pivoting through English. This approach allows for improved performance in translating directly between less commonly paired languages.
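To illustrate the difference (with made-up word tables, not actual model output), an English-centric system composes two hops through English, while a many-to-many model like M2M100 takes a single direct hop:

```python
# Toy word tables for illustration only -- not real model output.
hi_to_en = {"नमस्ते": "hello"}
en_to_fr = {"hello": "bonjour"}
hi_to_fr = {"नमस्ते": "bonjour"}  # direct pair, no English pivot

pivoted = en_to_fr[hi_to_en["नमस्ते"]]  # two hops: hi -> en -> fr
direct = hi_to_fr["नमस्ते"]             # one hop: hi -> fr
print(pivoted, direct)  # bonjour bonjour
```

Pivoting compounds the errors of two systems; direct training on each pair avoids that, which is why M2M100 performs better on language pairs that rarely involve English.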
Guide: Running Locally
- Set up the environment: Ensure Python and PyTorch are installed, then install the SentencePiece library:

      pip install sentencepiece
- Install Transformers: Install Hugging Face's Transformers library:

      pip install transformers
- Load the model and tokenizer:

      from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

      model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_1.2B")
      tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_1.2B")
- Translate text:
  - Hindi to French:

        hi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।"
        tokenizer.src_lang = "hi"
        encoded_hi = tokenizer(hi_text, return_tensors="pt")
        generated_tokens = model.generate(
            **encoded_hi, forced_bos_token_id=tokenizer.get_lang_id("fr")
        )
        print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
  - Chinese to English:

        chinese_text = "生活就像一盒巧克力。"
        tokenizer.src_lang = "zh"
        encoded_zh = tokenizer(chinese_text, return_tensors="pt")
        generated_tokens = model.generate(
            **encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en")
        )
        print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
- Cloud GPU recommendation: For large-scale translation tasks, consider cloud-based GPU services such as AWS, Google Cloud, or Azure for better performance.
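The translation steps above can be consolidated into a small helper. The `translate` function below is a hypothetical convenience wrapper (not part of the transformers API); the model and tokenizer are passed in so they are loaded only once and reused across calls:

```python
# Hypothetical wrapper around the steps in the guide above
# (not part of the transformers API).
def translate(text, src_lang, tgt_lang, model, tokenizer):
    """Translate text from src_lang to tgt_lang with an M2M100 model."""
    tokenizer.src_lang = src_lang                   # set source language
    encoded = tokenizer(text, return_tensors="pt")  # tokenize the input
    generated = model.generate(
        **encoded,
        # force the first generated token to the target language id
        forced_bos_token_id=tokenizer.get_lang_id(tgt_lang),
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```

After loading the model and tokenizer as shown above, this would be called as, e.g., `translate(hi_text, "hi", "fr", model, tokenizer)`. On a GPU machine, moving the model to the device with `model.to("cuda")` (and the encoded inputs likewise) speeds up generation considerably.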
License
The M2M100 model is released under the MIT License, permitting flexibility for both non-commercial and commercial use.