opus-mt-th-en
Helsinki-NLP
Introduction
The OPUS-MT-TH-EN model is a translation model developed by the Language Technology Research Group at the University of Helsinki. It is designed to translate text from Thai to English using a transformer architecture.
Architecture
This model uses the transformer-align architecture with pre-processing that includes normalization and SentencePiece tokenization (spm32k). It is a text-to-text generation model implemented using the Marian NMT framework.
Training
The model was trained on aligned Thai-English corpora, with pre-processing steps of normalization and SentencePiece tokenization. The training data was compiled and made available on June 17, 2020. The model reports a BLEU score of 48.1 and a chr-F score of 0.644, indicating strong translation performance.
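To give a feel for what a chr-F value such as 0.644 measures, here is a minimal sketch of a character n-gram F-score in pure Python. It is a simplified illustration, not the scoring tool used for the reported numbers: the function names `char_ngrams` and `chrf`, the choice to strip spaces, and the defaults (n-gram orders 1 to 6, beta of 2) are assumptions made for this example, so it should not be expected to reproduce official scores.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of length n, ignoring spaces (an assumption of this sketch)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis: str, reference: str, max_order: int = 6, beta: float = 2.0) -> float:
    """Simplified chr-F: average n-gram precision and recall, combined as an F-beta score."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # this order is longer than one of the strings
        overlap = sum((hyp & ref).values())  # clipped count of matching n-grams
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    # beta > 1 weights recall more heavily than precision
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
```

A hypothesis identical to its reference scores 1.0 and one sharing no characters scores 0.0; partial overlaps fall in between.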
Guide: Running Locally
To run the model locally, follow these steps:
- Install Dependencies: Ensure you have Python and the Hugging Face Transformers library installed. The code in the following steps uses PyTorch tensors, and the Marian tokenizer requires the sentencepiece package.
    pip install transformers torch sentencepiece
- Load the Model:
    from transformers import MarianMTModel, MarianTokenizer

    # Hub identifier of the Thai-to-English checkpoint
    model_name = 'Helsinki-NLP/opus-mt-th-en'
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
- Translate Text:
    src_text = ["สวัสดี"]
    # Tokenize the batch, generate output token IDs, then decode them back to text
    translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
    translated_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
    print(translated_text)
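For more than a handful of sentences, it can help to translate in fixed-size batches rather than passing everything to `generate` at once. The sketch below wraps the steps above in two helpers. The names `batched` and `translate`, the default batch size of 8, and the use of `truncation=True` are illustrative assumptions, not part of the model card; `model` and `tokenizer` are expected to be the objects loaded in the previous steps.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate(texts: Sequence[str], model, tokenizer, batch_size: int = 8) -> List[str]:
    """Translate texts batch by batch and return the decoded strings in input order."""
    results: List[str] = []
    for batch in batched(texts, batch_size):
        # Pad to the longest sentence in the batch; truncate overly long inputs
        inputs = tokenizer(list(batch), return_tensors="pt", padding=True, truncation=True)
        generated = model.generate(**inputs)
        results.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return results
```

Calling `translate(src_text, model, tokenizer)` should give the same kind of list as the step-by-step version, while keeping memory use bounded by the batch size.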
Cloud GPU Suggestion
For performance optimization, consider using cloud-based GPU services such as AWS EC2 with NVIDIA GPUs, Google Cloud Platform, or Azure ML with GPU instances to handle larger datasets and more complex translation tasks efficiently.
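On a GPU instance, the model and its inputs need to be moved to the GPU before generation. The following is a minimal sketch of that, assuming PyTorch is the backend; the helper names `pick_device` and `translate_on` are hypothetical, and `model` and `tokenizer` are the objects created in the local guide.

```python
from typing import List, Sequence


def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to place on a GPU
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def translate_on(device: str, texts: Sequence[str], model, tokenizer) -> List[str]:
    """Run translation with the model and the input tensors on the same device."""
    model = model.to(device)
    # The tokenizer output also has a .to() method that moves its tensors
    inputs = tokenizer(list(texts), return_tensors="pt", padding=True).to(device)
    generated = model.generate(**inputs)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

A typical call would be `translate_on(pick_device(), src_text, model, tokenizer)`, which falls back to the CPU on machines without a visible GPU.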
License
The OPUS-MT-TH-EN model is licensed under the Apache 2.0 License, allowing for free use, modification, and distribution under the license conditions.