small100
Introduction
SMaLL-100 is a compact, fast, and massively multilingual machine translation model covering more than 10,000 language pairs. It achieves results competitive with the much larger M2M-100 model while being significantly smaller and faster, and was introduced in a paper presented at EMNLP 2022.
Architecture
The SMaLL-100 model shares its architecture and configuration with M2M-100, but it uses a modified tokenizer that adjusts how language codes are handled. Because of this modification, the tokenizer is loaded from a standalone file (tokenization_small100.py) shipped with the model rather than through the standard Transformers tokenizer classes. The model is designed to be smaller and faster than its counterparts while maintaining strong performance, particularly on low-resource languages.
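Because the tokenizer class ships as a standalone file rather than as part of the Transformers library, one way to obtain it is to download tokenization_small100.py from the model repository before importing it. The sketch below uses huggingface_hub for the download and assumes the file keeps that name in the repository and that a recent version of huggingface_hub is installed:
from huggingface_hub import hf_hub_download

# Fetch the custom tokenizer module into the working directory so that
# "from tokenization_small100 import SMALL100Tokenizer" resolves.
hf_hub_download(
    repo_id="alirezamsh/small100",
    filename="tokenization_small100.py",
    local_dir=".",
)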
Training
SMaLL-100 is trained as a sequence-to-sequence model for translation tasks. During training, the model receives the source-language text together with a target language code and learns to produce the corresponding target-language text. The model's tokenizer requires the sentencepiece library:
pip install sentencepiece
Training data is available upon request. During generation, the model uses a beam size of 5 and a maximum target length of 256.
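To make these decoding settings concrete, the sketch below passes the beam size and maximum target length explicitly to generate(); the loading code follows the guide in the next section, and the source sentence is an arbitrary placeholder:
from transformers import M2M100ForConditionalGeneration
from tokenization_small100 import SMALL100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("alirezamsh/small100")
tokenizer = SMALL100Tokenizer.from_pretrained("alirezamsh/small100", tgt_lang="fr")

src_text = "Machine translation connects people across languages."
model_inputs = tokenizer(src_text, return_tensors="pt")
# Beam size of 5 and maximum target length of 256, as stated above.
generated_tokens = model.generate(**model_inputs, num_beams=5, max_length=256)
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))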
Guide: Running Locally
To run SMaLL-100 locally, you need to set up the model and tokenizer:
- Install the required packages:
pip install transformers sentencepiece
- Load the model and tokenizer (tokenization_small100.py must be present in the working directory):
from transformers import M2M100ForConditionalGeneration
from tokenization_small100 import SMALL100Tokenizer

# The model reuses the M2M-100 architecture; only the tokenizer differs.
model = M2M100ForConditionalGeneration.from_pretrained("alirezamsh/small100")
# tgt_lang selects the target language (here: French).
tokenizer = SMALL100Tokenizer.from_pretrained("alirezamsh/small100", tgt_lang="fr")
- Prepare text for translation and generate output:
src_text = "Your text here."
model_inputs = tokenizer(src_text, return_tensors="pt")
generated_tokens = model.generate(**model_inputs)
# Decode the generated token IDs back into text.
translated_text = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
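Because the target language was fixed when the tokenizer was created, translating into another language should only require changing the tokenizer's target language and re-encoding the input. A minimal sketch, assuming the tokenizer exposes a settable tgt_lang attribute mirroring the constructor argument used above:
# Switch the target language to Spanish (tgt_lang is assumed to be a
# settable attribute mirroring the constructor argument) and re-encode.
tokenizer.tgt_lang = "es"
model_inputs = tokenizer(src_text, return_tensors="pt")
generated_tokens = model.generate(**model_inputs)
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))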
For faster inference, consider running the model on a GPU, for example on cloud instances from AWS EC2, Google Cloud Platform, or Azure.
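If a GPU is available, moving the model and its inputs onto it is standard PyTorch; a minimal sketch continuing from the snippets above:
import torch

# Run on a GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model_inputs = tokenizer(src_text, return_tensors="pt").to(device)
generated_tokens = model.generate(**model_inputs)
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))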
License
The SMaLL-100 model is licensed under the MIT License, allowing for broad use and modification.