opus-mt-tc-big-tr-en
Helsinki-NLP
Introduction
OPUS-MT-TC-BIG-TR-EN is a neural machine translation model that translates text from Turkish to English. It is part of the OPUS-MT project, which aims to make translation models widely accessible for many languages. The model was originally trained with Marian NMT and then converted to PyTorch using the Hugging Face Transformers library.
Architecture
The model uses the transformer-big architecture and was trained on the opusTCv20210807+bt dataset. Text is tokenized with SentencePiece using a vocabulary size of 32,000. The original model and additional information can be found in the Tatoeba Challenge repository.
Training
The model is trained on data from the OPUS project using the OPUS-MT-train procedures. It has been benchmarked on several test sets, with BLEU scores of 57.6 on Tatoeba-test-v2021-08-07 and 37.6 on flores101-devtest.
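To make the BLEU figures above easier to interpret, here is a minimal sketch of a corpus-level BLEU computation. It is a simplified illustration, not the project's evaluation pipeline: it assumes whitespace tokenization, a single reference per sentence, and no smoothing, so its numbers will not match scores produced by standard evaluation tools. The function names `ngram_counts` and `simple_corpus_bleu` are hypothetical.

```python
import math
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_corpus_bleu(hypotheses: List[str], references: List[str], max_n: int = 4) -> float:
    """Simplified BLEU on a 0-100 scale (whitespace tokens, one reference, no smoothing)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = 0
    ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Clipped matches: each n-gram counts at most as often as it occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, a single n-gram order with no matches gives a score of zero.
    if min(matches) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output that is shorter than the reference.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)
```

A score of 100 means the output reproduces the reference exactly under this tokenization; real systems score lower because valid translations can differ in wording from the reference.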
Guide: Running Locally
- Install Transformers Library: Ensure you have the transformers library installed. MarianTokenizer also requires the sentencepiece package.
    pip install transformers sentencepiece
- Load Model and Tokenizer:
    from transformers import MarianMTModel, MarianTokenizer

    model_name = "Helsinki-NLP/opus-mt-tc-big-tr-en"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
- Translate Text:
    src_text = ["Your Turkish text here."]
    # Tokenize the batch, then generate the translated token IDs
    translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
    for t in translated:
        print(tokenizer.decode(t, skip_special_tokens=True))
- Use Transformers Pipelines:
    from transformers import pipeline

    pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-tr-en")
    print(pipe("Your Turkish text here."))
- Cloud GPU Recommendation: For faster inference, consider using cloud GPU services such as AWS EC2, Google Cloud Platform, or Microsoft Azure.
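If a GPU is available, the loading and translation steps above can be combined into a batched helper along the following lines. This is a sketch under stated assumptions rather than a recommended setup: the helper names `batched` and `translate`, the batch size of 8, and the use of truncation are illustrative choices not taken from the model card. The torch and transformers imports sit inside `translate` so that the batching helper can be used on its own.

```python
from typing import Iterator, List


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate(texts: List[str], batch_size: int = 8) -> List[str]:
    """Translate Turkish sentences to English, using a GPU when one is available."""
    # Imported lazily so batched() works without these packages installed.
    import torch
    from transformers import MarianMTModel, MarianTokenizer

    model_name = "Helsinki-NLP/opus-mt-tc-big-tr-en"
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name).to(device)
    model.eval()

    results: List[str] = []
    for batch in batched(texts, batch_size):
        # Pad to the longest sentence in the batch and move the tensors to the model's device.
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True).to(device)
        with torch.no_grad():
            output_ids = model.generate(**inputs)
        results.extend(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
    return results


# Example call (downloads the model on first use):
# translate(["Merhaba, nasılsın?", "Bugün hava çok güzel."])
```

Batching keeps the GPU busy with several sentences at once. A suitable batch size depends on available memory and sentence length, so the default here is only a starting point.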
License
The OPUS-MT-TC-BIG-TR-EN model is licensed under the Creative Commons Attribution 4.0 International License (cc-by-4.0).