Introduction

The OPUS-MT-EN-ZH model is a translation model developed by the Language Technology Research Group at the University of Helsinki. It translates text from English into several Chinese language varieties using a Transformer architecture. The model is released as part of the Tatoeba Challenge and is available under the Apache 2.0 license.

Architecture

  • Model Type: Transformer
  • Source Language: English (eng)
  • Target Languages: Includes Mandarin (cmn), Cantonese (yue), and others, in both simplified and traditional scripts.
  • Pre-processing: Normalization followed by SentencePiece tokenization with a vocabulary size of 32,000 (see the tokenizer sketch after this list).
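As a minimal sketch of this pre-processing step, the SentencePiece-based tokenizer that ships with the checkpoint can be inspected directly. The checkpoint name Helsinki-NLP/opus-mt-en-zh is assumed here, and the snippet requires the Transformers library installed as described in the guide below.

    from transformers import MarianTokenizer

    # Load the SentencePiece-based tokenizer bundled with the checkpoint
    # (checkpoint name assumed; it is downloaded and cached on first use).
    tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-zh")

    # Show the subword pieces produced for an English sentence.
    print(tokenizer.tokenize("Machine translation is fun."))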

Training

The model was trained on the Tatoeba dataset, with a training date of July 17, 2020. It requires a sentence-initial language token of the form >>id<< (where id is a valid target-language code) to specify the target language, as illustrated below.
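
For illustration, the input is formed by prepending the token for the desired target variety to the English source sentence. The token >>cmn_Hans<< used below (simplified-script Mandarin) is only an example ID, not the sole valid value.

    # Forming the model input with a sentence-initial language token.
    # ">>cmn_Hans<<" is an illustrative target-language ID; substitute the
    # code for the Chinese variety you actually want.
    src_text = ">>cmn_Hans<< How are you today?"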

Benchmarks

  • BLEU Score: 31.4 (Tatoeba test set, eng→zho)
  • chr-F Score: 0.268 (Tatoeba test set, eng→zho)

Guide: Running Locally

  1. Environment Setup:
    • Ensure you have Python and PyTorch installed.
    • Install the Hugging Face Transformers library using pip:
      pip install transformers
      
  2. Download Model:
    • The checkpoint (Helsinki-NLP/opus-mt-en-zh on the Hugging Face Hub) is downloaded and cached automatically the first time it is loaded with from_pretrained.
  3. Run Translation:
    • Load the model and tokenizer with the Transformers library and translate your input text (see the sketch after this list).
  4. Cloud GPU Recommendation:
    • For faster processing, consider using cloud GPUs from providers such as AWS or Google Cloud.
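
Putting these steps together, a minimal translation script might look like the sketch below. It assumes the Hugging Face checkpoint name Helsinki-NLP/opus-mt-en-zh and reuses the example language token >>cmn_Hans<< from the Training section; adapt both to your setup.

    import torch
    from transformers import MarianMTModel, MarianTokenizer

    # Step 2: the checkpoint is downloaded and cached automatically on first use.
    model_name = "Helsinki-NLP/opus-mt-en-zh"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    # Step 4 (optional): use a GPU if one is available.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)

    # Step 3: translate. ">>cmn_Hans<<" is an example target-language token.
    src_texts = [">>cmn_Hans<< The weather is nice today."]
    batch = tokenizer(src_texts, return_tensors="pt", padding=True).to(device)
    generated = model.generate(**batch)
    print(tokenizer.batch_decode(generated, skip_special_tokens=True))

If less control over decoding is needed, the same checkpoint can also be driven through the higher-level translation pipeline in Transformers.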

License

The OPUS-MT-EN-ZH model is licensed under the Apache License 2.0, allowing for broad use, modification, and distribution.
