opus-mt-en-zh
Helsinki-NLP
Introduction
OPUS-MT-EN-ZH is a machine translation model developed by the Language Technology Research Group at the University of Helsinki. It translates English text into several varieties of Chinese using a Transformer architecture. The model is part of the Tatoeba Challenge and is available under the Apache 2.0 license.
Architecture
- Model Type: Transformer
- Source Language: English (eng)
- Target Languages: Mandarin (cmn), Cantonese (yue), and other Chinese varieties, in both Simplified and Traditional script.
- Pre-processing: Involves normalization and SentencePiece tokenization with a vocabulary size of 32,000.
Training
The model was trained on the Tatoeba dataset; the release is dated July 17, 2020. A sentence-initial language token of the form >>id<<, where id is a target-language ID such as cmn or yue, selects the target language.
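As a minimal sketch of the token convention described above, the helper below prepends a >>id<< token to an English sentence. The function name add_target_token is hypothetical, and the IDs cmn and yue are the two named in this document; which other IDs the model accepts is not covered here.

```python
def add_target_token(text: str, lang_id: str) -> str:
    """Prepend a sentence-initial >>id<< token so the model knows the target language.

    `add_target_token` is an illustrative helper, not part of any library.
    """
    lang_id = lang_id.strip()
    if not lang_id:
        raise ValueError("lang_id must be a non-empty target-language ID")
    # The token comes first, followed by a single space and the source sentence.
    return f">>{lang_id}<< {text.strip()}"


if __name__ == "__main__":
    print(add_target_token("Good morning.", "cmn"))
    print(add_target_token("Good morning.", "yue"))
```

The token is part of the input text itself, so it has to be added before tokenization rather than passed as a separate argument.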
Benchmarks
- BLEU Score: 31.4
- chr-F Score: 0.268
Guide: Running Locally
- Environment Setup:
  - Ensure you have Python and PyTorch installed.
  - Install the Hugging Face Transformers library using pip:
    pip install transformers
- Download Model:
  - Download the model weights from the provided URL: opus-2020-07-17.zip.
- Run Translation:
  - Load the model using the Transformers library and perform translations using your input text.
- Cloud GPU Recommendation:
  - For faster processing, consider using cloud GPUs available from providers like AWS or Google Cloud.
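The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' reference usage: it assumes a converted checkpoint is published on the Hugging Face Hub under the id Helsinki-NLP/opus-mt-en-zh (rather than loading the zip archive directly), that the sentencepiece package is installed alongside transformers, and that cmn is an accepted target-language ID. The names prefix_sentences and translate are hypothetical helpers.

```python
from typing import List

# Assumed Hub id for the converted checkpoint; it is not stated in this guide.
MODEL_NAME = "Helsinki-NLP/opus-mt-en-zh"


def prefix_sentences(sentences: List[str], target_lang: str) -> List[str]:
    """Prepend the >>id<< target-language token to every sentence."""
    return [f">>{target_lang}<< {s}" for s in sentences]


def translate(sentences: List[str], target_lang: str = "cmn",
              model_name: str = MODEL_NAME) -> List[str]:
    """Translate English sentences; the model is downloaded on first call."""
    # Imported lazily so prefix_sentences works without the heavy dependencies.
    import torch
    from transformers import MarianMTModel, MarianTokenizer

    # Use a GPU when one is available, otherwise fall back to the CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name).to(device)
    model.eval()

    # Tokenize the prefixed batch and move the tensors to the same device.
    batch = tokenizer(prefix_sentences(sentences, target_lang),
                      return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

A call such as translate(["How are you?"], target_lang="cmn") returns a list with one translated string per input sentence. Loading the tokenizer and model once and reusing them across calls would be the natural next step for anything beyond a quick test.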
License
The OPUS-MT-EN-ZH model is licensed under the Apache License 2.0, allowing for broad use, modification, and distribution.