opus-mt-tc-big-gmq-ar
Helsinki-NLP
Introduction
The OPUS-MT-TC-BIG-GMQ-AR model is a neural machine translation model developed by the Language Technology Research Group at the University of Helsinki. It translates from North Germanic languages (gmq) to Arabic (ar) and is part of the OPUS-MT project, which aims to make translation models accessible for many languages. The model was trained with Marian NMT and then converted to PyTorch using the Hugging Face Transformers library. Because it supports multiple target languages, each input sentence must begin with a language token that selects the target.
Architecture
The model uses the "transformer-big" configuration, a larger variant of the standard Transformer encoder-decoder used for translation. It accepts Danish and Swedish as source languages and produces Arabic as the target. The target is selected by a sentence-initial language token of the form >>id<<, for example >>ara<<.
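The language-token convention described above can be sketched as a small helper. This is a minimal illustration, not part of the model's API: the function name with_language_token and its default are hypothetical, and the only token taken from this document is >>ara<<.

```python
def with_language_token(sentences, lang="ara"):
    """Prefix each sentence with a >>lang<< token unless one is already present.

    Hypothetical helper for illustration only. "ara" is the token used in this
    document's examples; other target IDs would have to be checked against
    the tokenizer's vocabulary.
    """
    token = f">>{lang}<<"
    return [s if s.startswith(">>") else f"{token} {s}" for s in sentences]


prepared = with_language_token(["Jeg elsker semitiske sprog.", "Vad handlar boken om?"])
for line in prepared:
    print(line)
```

If the tokenizer is already loaded, MarianTokenizer is expected to list the tokens it knows in its supported_language_codes attribute, which is a convenient way to check an ID before using it.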
Training
The model was trained on data from the OPUS project, specifically the opusTCv20210807 dataset. Text was pre-processed with SentencePiece using a vocabulary size of 32k. The training scripts and procedures are part of the OPUS-MT-train repository. Translation quality is reported as BLEU and chrF scores on the flores101-devtest dataset.
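To give a sense of what a character n-gram F-score such as chrF measures, here is a simplified sentence-level sketch. It is an assumption-laden illustration, not the scoring tool behind the reported numbers: the function names are hypothetical, whitespace is dropped, n-gram precision and recall are averaged over orders 1 to 6, and recall is weighted with beta = 2.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of length n, ignoring whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified character n-gram F-score in [0, 1] (illustrative only)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to contain n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


print(simple_chrf("Vad handlar boken om?", "Vad handlar boken om?"))
print(simple_chrf("Vad handlar boken om?", "Vad handlar filmen om?"))
```

Scores from this sketch are not comparable with published results. Standard evaluation tools differ in details such as how the per-order scores are combined and how the result is scaled.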
Guide: Running Locally
To run the model locally, follow these steps:
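- Install Transformers Library: Ensure that the transformers library is installed. The Marian tokenizer also needs the sentencepiece package, and the example below uses PyTorch tensors.

    pip install transformers sentencepiece torch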
- Load the Model:

    from transformers import MarianMTModel, MarianTokenizer

    # Download (or load from cache) the tokenizer and the model weights
    model_name = "Helsinki-NLP/opus-mt-tc-big-gmq-ar"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
- Translate Text:

    # Each source sentence starts with the target-language token (>>ara<<)
    src_text = [
        ">>ara<< Jeg elsker semitiske sprog.",
        ">>ara<< Vad handlar boken om?",
    ]
    translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
    for t in translated:
        print(tokenizer.decode(t, skip_special_tokens=True))
Cloud GPUs
The model runs on a CPU, but a GPU speeds up translation considerably, especially for large batches. If no local GPU is available, cloud GPU instances from providers such as AWS EC2, Google Cloud Platform, or Azure offer scalable resources for this.
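On a GPU machine, the model and its inputs have to be moved to the GPU explicitly. The following is a minimal sketch of one way to do that, assuming a model and tokenizer loaded as in the guide above. The helper names batches and translate_on_device and the default batch size of 16 are illustrative choices, not part of the model card.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate_on_device(src_text, model, tokenizer, batch_size=16):
    """Translate sentences in batches, using a CUDA GPU when one is available.

    `model` and `tokenizer` are assumed to be the MarianMTModel and
    MarianTokenizer objects loaded earlier.
    """
    import torch  # imported here so `batches` stays usable without PyTorch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    results = []
    for batch in batches(src_text, batch_size):
        # Tokenize one batch and move the input tensors to the same device
        inputs = tokenizer(batch, return_tensors="pt", padding=True).to(device)
        with torch.no_grad():  # inference only, no gradients needed
            generated = model.generate(**inputs)
        results.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return results
```

Calling translate_on_device(src_text, model, tokenizer) returns the translations as a list of strings. The batch size usually needs tuning to the available GPU memory.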
License
The OPUS-MT-TC-BIG-GMQ-AR model is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. This allows for sharing and adaptation, provided appropriate credit is given.