opus-mt-tc-big-en-ar
Helsinki-NLP
Introduction
opus-mt-tc-big-en-ar is a neural machine translation model that translates text from English to Arabic. It is part of the OPUS-MT project, which aims to provide accessible translation models for many languages. The model was trained with Marian NMT and converted to PyTorch using Hugging Face's Transformers library.
Architecture
The model uses the transformer-big architecture and was trained on data from the OPUS collection. It uses SentencePiece tokenization with a vocabulary of 32,000 tokens. Because the model supports multiple target languages, each input sentence must begin with a language token of the form >>id<< (for example, >>ara<<) that selects the target language.
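The language-token requirement can be illustrated with a small helper. This is a minimal sketch, not part of the model's API: the function name add_language_token is hypothetical, and the only language id assumed here is ara, taken from the translation example in the guide.

```python
def add_language_token(sentences, lang_id="ara"):
    """Prepend the sentence-initial >>id<< token to each source sentence.

    Hypothetical helper for illustration; `lang_id` defaults to "ara",
    the id used in this guide's translation example.
    """
    prefix = f">>{lang_id}<< "
    tagged = []
    for sentence in sentences:
        # Leave sentences that already carry this token untouched.
        if sentence.startswith(prefix):
            tagged.append(sentence)
        else:
            tagged.append(prefix + sentence)
    return tagged


src_text = add_language_token(["I can't help you because I'm busy."])
print(src_text)
```

The resulting list can be passed to the tokenizer in place of hand-written strings, which avoids forgetting the token on individual sentences.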
Training
Training data for this model comes from the OPUS collection, and the training pipeline follows the procedures outlined in OPUS-MT-train. The model was evaluated on several test sets, including flores101-devtest, tatoeba-test, and tico19-test, where it achieved BLEU scores of 29.4, 20, and 30, respectively.
Guide: Running Locally
- Install Dependencies: Ensure you have Python and PyTorch installed, then use pip to install the Transformers library together with SentencePiece, which the Marian tokenizer depends on:
    pip install transformers sentencepiece
- Load the Model and Tokenizer:
    from transformers import MarianMTModel, MarianTokenizer

    model_name = "Helsinki-NLP/opus-mt-tc-big-en-ar"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
- Translate Text:
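    # Each source sentence starts with the target-language token.
    src_text = [">>ara<< I can't help you because I'm busy."]

    # Tokenize the batch, generate translations, and decode them back to text.
    translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
    for t in translated:
        print(tokenizer.decode(t, skip_special_tokens=True))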
- Use Cloud GPUs: For faster processing, consider using cloud services such as AWS, Google Cloud, or Azure, which offer GPU instances suitable for running PyTorch models.
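On a GPU instance, the steps above might be combined as follows. This is a sketch under stated assumptions rather than a reference implementation: the helper names batches and translate and the default batch_size of 8 are illustrative choices, not values from the model card. The code falls back to the CPU when no CUDA device is available.

```python
def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate(sentences, model_name="Helsinki-NLP/opus-mt-tc-big-en-ar", batch_size=8):
    """Translate sentences in batches (hypothetical helper, illustrative defaults)."""
    # Heavy imports stay inside the function so `batches` works without them.
    import torch
    from transformers import MarianMTModel, MarianTokenizer

    # Use the GPU when one is visible to PyTorch, otherwise the CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name).to(device)
    model.eval()

    results = []
    for batch in batches(sentences, batch_size):
        # Inputs must live on the same device as the model.
        inputs = tokenizer(batch, return_tensors="pt", padding=True).to(device)
        with torch.no_grad():
            output_ids = model.generate(**inputs)
        results.extend(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
    return results
```

Calling translate([">>ara<< I can't help you because I'm busy."]) downloads the model on first use and returns a list of translated strings. Batching keeps memory use bounded when translating many sentences; a suitable batch size depends on the available GPU memory.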
License
This model is licensed under the Creative Commons Attribution 4.0 International License (cc-by-4.0). This allows for use, distribution, and adaptation of the model, provided appropriate credit is given.