OPUS-MT-TC-BIG-AR-GMQ
Helsinki-NLP

Introduction
The OPUS-MT-TC-BIG-AR-GMQ is a neural machine translation model developed by the Language Technology Research Group at the University of Helsinki. It is designed for translating from Arabic to several North Germanic languages, specifically Danish, Norwegian Bokmål, and Swedish. This model is part of the OPUS-MT project, which aims to make neural machine translation models accessible for multiple languages using the Marian NMT framework.
Architecture
The model is based on the transformer-big architecture and has been converted from the original Marian NMT weights to PyTorch using the Hugging Face Transformers library. Because it covers multiple target languages, each source sentence must begin with a target-language token (for example >>swe<< for Swedish) that selects the output language.
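As a minimal sketch of how these tokens work (the exact spellings >>dan<<, >>nob<<, and >>swe<< are inferred from the target languages listed above and should be checked against the model card), selecting the output language is just a matter of prepending the right token to the Arabic source text:

# Sketch: prepend a target-language token to the source sentence.
# Token spellings are assumptions; verify them against the model card.
targets = {"Danish": ">>dan<<", "Norwegian Bokmål": ">>nob<<", "Swedish": ">>swe<<"}
source = "بكرا منشوف شو بدنا نعمل"
prefixed = [f"{token} {source}" for token in targets.values()]
print(prefixed)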
Training
- Data: The model was trained using data from opusTCv20210807, processed with SentencePiece.
- Scripts and Framework: Training utilized the OPUS-MT-train procedures and Marian NMT framework.
- Pre-processing: SentencePiece was used for tokenization.
- Evaluation Metrics: The model was evaluated using BLEU and chr-F metrics on the flores101-devtest dataset, with scores provided for different language pairs.
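The published scores come from the OPUS-MT evaluation pipeline. As a rough sketch of computing the same metrics locally, sacrebleu (installed separately with pip install sacrebleu) provides both BLEU and chrF; the hypotheses and references below are placeholders, not flores101 data:

import sacrebleu

# Placeholder model outputs and reference translations (not real flores101 data).
hypotheses = ["Det regnar i Stockholm.", "Vi ses i morgon."]
references = [["Det regnar i Stockholm.", "Vi ses imorgon."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")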
Guide: Running Locally
- Installation and Setup:
- Ensure Python and PyTorch are installed.
- Install the Hugging Face Transformers library (the Marian tokenizer also needs the sentencepiece package):
pip install transformers sentencepiece
- Running the Model:
- Load the Marian model and tokenizer:
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-tc-big-ar-gmq"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
- Prepare your source text with initial language tokens and generate translations:
# ">>swe<<" selects Swedish as the target language
src_text = [">>swe<< بكرا منشوف شو بدنا نعمل"]
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))
Cloud GPUs: For faster performance, consider using cloud GPU services like AWS, Google Cloud, or Azure.
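As a minimal sketch of running the same example on a GPU (the device-placement pattern is standard Transformers usage rather than part of the original guide, and >>dan<< as the Danish target token is an assumption):

import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-tc-big-ar-gmq"
device = "cuda" if torch.cuda.is_available() else "cpu"  # falls back to CPU if no GPU

tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name).to(device)

# ">>dan<<" is assumed to select Danish; the sentence is the same example as above.
src_text = [">>dan<< بكرا منشوف شو بدنا نعمل"]
inputs = tokenizer(src_text, return_tensors="pt", padding=True).to(device)
translated = model.generate(**inputs)
print(tokenizer.batch_decode(translated, skip_special_tokens=True))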
License
The model is released under the Creative Commons Attribution 4.0 International License (CC-BY-4.0). This allows for sharing and adaptation, provided appropriate credit is given.