opus-mt-tc-big-tr-en

Helsinki-NLP

Introduction

The OPUS-MT-TC-BIG-TR-EN model is a neural machine translation model that translates text from Turkish to English. It is part of the OPUS-MT project, which aims to make translation models widely accessible for many languages. The model was originally trained with Marian NMT and then converted to PyTorch using the Hugging Face Transformers library.

Architecture

The model uses the transformer-big architecture and was trained on the opusTCv20210807+bt data release. Tokenization is performed with SentencePiece, using a vocabulary size of 32,000. The original model and additional information can be found in the Tatoeba Challenge repository.

Training

The model is trained on data from the OPUS project, following the OPUS-MT-train procedures. It has been benchmarked on several test sets, with BLEU scores that include 57.6 on Tatoeba-test-v2021-08-07 and 37.6 on flores101-devtest.
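
To make the BLEU figures above easier to interpret, here is a minimal sketch of how a corpus-level BLEU score is computed: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and scaled by a brevity penalty. It is a simplified illustration only. It assumes whitespace tokenization, a single reference per sentence, and no smoothing, and the function names are hypothetical. It is not the scorer used to produce the reported numbers, so its output is not comparable to them.

```python
import math
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count every n-gram of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[str], references: List[str], max_n: int = 4) -> float:
    """Simplified corpus BLEU on a 0-100 scale (illustrative, not a reference scorer)."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must have the same length")

    matches = [0] * max_n  # clipped n-gram matches, per order
    totals = [0] * max_n   # n-grams proposed by the hypotheses, per order
    hyp_len = 0
    ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Counter intersection keeps the minimum count, which clips
            # each hypothesis n-gram to its frequency in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any order with zero matches makes the score zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize hypotheses that are shorter than the references overall.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)
```

Calling `corpus_bleu(model_outputs, reference_translations)` on two equally long lists of sentences returns a score between 0 and 100. For numbers that can be compared with published results, a standard scoring tool should be used instead of this sketch.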

Guide: Running Locally

  1. Install Transformers Library: Install the transformers library together with sentencepiece (required by MarianTokenizer) and PyTorch.
    pip install transformers sentencepiece torch
    
  2. Load Model and Tokenizer:
    from transformers import MarianMTModel, MarianTokenizer
    
    model_name = "Helsinki-NLP/opus-mt-tc-big-tr-en"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    
  3. Translate Text:
    src_text = ["Your Turkish text here."]
    translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
    for t in translated:
        print(tokenizer.decode(t, skip_special_tokens=True))
    
  4. Use Transformers Pipelines:
    from transformers import pipeline
    pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-tr-en")
    print(pipe("Your Turkish text here."))
    
  5. Cloud GPU Recommendation: For faster inference, consider using cloud GPU services such as AWS EC2, Google Cloud Platform, or Microsoft Azure.
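
The steps above can be combined into a small batch-translation sketch that uses a GPU when one is available. The helper names `batched` and `translate_all`, the default batch size of 8, and the use of `truncation=True` are assumptions made for illustration and do not come from the model card. The model and tokenizer calls are the same ones shown in steps 2 and 3.

```python
from typing import Iterator, List


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate_all(sentences: List[str], batch_size: int = 8) -> List[str]:
    """Translate Turkish sentences to English in batches (hypothetical helper)."""
    # Imported lazily so `batched` stays usable without these packages installed.
    import torch
    from transformers import MarianMTModel, MarianTokenizer

    model_name = "Helsinki-NLP/opus-mt-tc-big-tr-en"
    # Fall back to the CPU when no CUDA device is present.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name).to(device)
    model.eval()

    results: List[str] = []
    for batch in batched(sentences, batch_size):
        # Pad to the longest sentence in the batch and move tensors to the device.
        inputs = tokenizer(
            batch, return_tensors="pt", padding=True, truncation=True
        ).to(device)
        with torch.no_grad():  # inference only, so skip gradient tracking
            generated = model.generate(**inputs)
        results.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return results
```

Calling `translate_all(["Your Turkish text here."])` downloads the model on first use and returns a list of English strings in the same order as the input. Batching bounds memory use per forward pass, and a suitable batch size depends on the available hardware.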

License

The OPUS-MT-TC-BIG-TR-EN model is licensed under the Creative Commons Attribution 4.0 International License (cc-by-4.0).
