opus-mt-ur-en
Helsinki-NLP

Introduction
The OPUS-MT-UR-EN model was developed by the Language Technology Research Group at the University of Helsinki. It translates text from Urdu to English using a transformer-based architecture optimized with alignment techniques. The model is part of the Tatoeba Challenge and was trained for text-to-text generation.
Architecture
The model uses a transformer architecture with pre-processing that includes text normalization and SentencePiece tokenization with a 32k vocabulary. It is a bilingual rather than multilingual system, translating from a single source language (Urdu) into a single target language (English).
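The SentencePiece model shipped with OPUS-MT is a trained segmenter with its own 32k vocabulary; as a purely illustrative sketch of the general idea — breaking words into known subword pieces — here is a toy greedy longest-match segmenter (this is not the actual SentencePiece algorithm, and the tiny `vocab` is invented):

```python
def subword_segment(word, vocab):
    """Greedily split `word` into the longest subwords found in `vocab`.

    Toy illustration only: the real model uses a trained SentencePiece
    segmenter with a 32k vocabulary, not this longest-match heuristic.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest possible piece first; fall back to a single
        # character so segmentation never fails on unseen input.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                pieces.append(piece)
                i = j
                break
    return pieces

vocab = {"trans", "lation", "form", "er"}
print(subword_segment("translation", vocab))  # ['trans', 'lation']
print(subword_segment("transformer", vocab))  # ['trans', 'form', 'er']
```

The single-character fallback mirrors why subword models avoid out-of-vocabulary failures: any string can always be decomposed into pieces the model knows.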
Training
The model was trained on the Tatoeba dataset using data up to June 17, 2020, and its weights are available for download. On the Tatoeba-test.urd.eng test set it achieves a BLEU score of 23.2 and a chrF2 score of 0.435.
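To make the reported BLEU figure concrete, here is a simplified single-reference, sentence-level BLEU sketch in pure Python (no smoothing; the 23.2 reported for this model is a corpus-level score produced by the OPUS-MT evaluation tooling, so this is only a teaching aid):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Simplified sentence-level BLEU: modified n-gram precision up to
    `max_n`, geometric mean, and a brevity penalty. No smoothing, so a
    hypothesis with zero overlap at any order scores 0."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped counts
        if overlap == 0:
            return 0.0
        precisions.append(overlap / max(sum(hyp_ngrams.values()), 1))
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(round(bleu("how are you ?", "how are you ?"), 2))  # 1.0
```

A perfect match scores 1.0 (often reported as 100); the model's 23.2 sits in the typical range for a low-resource language pair like Urdu-English.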
Guide: Running Locally
- Installation: Ensure you have Python and the necessary libraries installed, such as Hugging Face's Transformers and PyTorch or TensorFlow.
- Download Model: Retrieve the model weights from the provided link: opus-2020-06-17.zip.
- Setup Environment: Load the model using the Transformers library. Example code:
```python
from transformers import MarianMTModel, MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-ur-en')
model = MarianMTModel.from_pretrained('Helsinki-NLP/opus-mt-ur-en')

# Example translation
src_text = "آپ کیسے ہیں؟"  # "How are you?"
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
print([tokenizer.decode(t, skip_special_tokens=True) for t in translated])
```
- Testing: Use the test set available at opus-2020-06-17.test.txt to evaluate the model's performance.
- Hardware: For intensive tasks or large datasets, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure for better performance.
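Once you have generated translations for the test set, you can score them against the references. As a toy illustration of the chrF2 metric reported in the Training section (the official 0.435 comes from the OPUS-MT evaluation tooling; sentence-level scoring and exact tokenization details are simplified here):

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-grams of `text`, with spaces removed."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level chrF2: the F-score over character
    n-grams (n = 1..6), with recall weighted by beta = 2, averaged
    over the n-gram orders. Illustrative only."""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())
        precision = overlap / max(sum(hyp.values()), 1)
        recall = overlap / max(sum(ref.values()), 1)
        if precision + recall == 0:
            scores.append(0.0)
        else:
            scores.append((1 + beta**2) * precision * recall
                          / (beta**2 * precision + recall))
    return sum(scores) / max_n
```

Because it works on character rather than word n-grams, chrF is more forgiving of morphological variation than BLEU, which is one reason both metrics are reported for this model.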
License
The model is released under the Apache 2.0 License, allowing for both personal and commercial use, modification, and distribution of the model and its derivatives.