ElanMT-BT-ja-en
Introduction
ElanMT-BT-ja-en is a Japanese-to-English translation model developed by the ELAN MITSUA Project. The model is a fine-tuned version of ElanMT-base-ja-en, trained exclusively on openly licensed data and back-translated English Wikipedia data.
Architecture
ElanMT-BT-ja-en is based on the Marian MT 6-layer encoder-decoder Transformer architecture and uses a sentencepiece tokenizer. It is designed for Japanese-to-English translation.
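As a quick sanity check of this setup, the configuration and tokenizer can be loaded from the Hugging Face Hub. A minimal sketch, assuming the checkpoint id `Mitsua/elan-mt-bt-ja-en` (used in the guide below) and the standard Marian config attributes in `transformers`:

```python
from transformers import AutoConfig, AutoTokenizer

# Inspect the architecture described above: a 6-layer Marian encoder-decoder.
config = AutoConfig.from_pretrained('Mitsua/elan-mt-bt-ja-en')
print(config.model_type)      # expected: 'marian'
print(config.encoder_layers)  # expected: 6
print(config.decoder_layers)  # expected: 6

# The tokenizer should be the sentencepiece-based Marian tokenizer.
tokenizer = AutoTokenizer.from_pretrained('Mitsua/elan-mt-bt-ja-en')
print(type(tokenizer).__name__)
```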
Training
The model training involved multiple steps:
- A sentencepiece tokenizer was trained on a 4M-line openly licensed corpus.
- A back-translation (en-to-ja) model was trained on the same corpus for 6 epochs.
- A base translation model was trained on a 4M-line corpus.
- 20M lines of English Wikipedia were back-translated into Japanese (see the sketch after this list).
- Four ja-en models were fine-tuned on the back-translated data.
- The four models were merged to optimize validation scores.
- The merged model was further fine-tuned on a high-quality subset of the corpus.
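To make the back-translation step concrete, here is a minimal sketch: an en-to-ja model translates monolingual English text into synthetic Japanese, and the resulting (Japanese, English) pairs serve as extra ja-to-en training data. The checkpoint name `en_ja_model_name` is a placeholder, since this card does not name the project's intermediate back-translation model.

```python
from transformers import pipeline

# Placeholder: the project's actual en->ja back-translation checkpoint
# is not named in this card.
en_ja_model_name = 'your-org/your-en-ja-model'
back_translator = pipeline('translation', model=en_ja_model_name)

# Monolingual English sentences, e.g. drawn from a Wikipedia dump.
english_lines = [
    'The cat sat on the mat.',
    'Back-translation creates synthetic source sentences from target-side text.',
]

# Each (synthetic Japanese, original English) pair becomes ja->en training data.
synthetic_ja = [out['translation_text'] for out in back_translator(english_lines)]
pairs = list(zip(synthetic_ja, english_lines))
print(pairs)
```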
Training utilized datasets such as:
- Mitsua/wikidata-parallel-descriptions-en-ja
- Kyoto Free Translation Task (KFTT)
- Tatoeba
- WikiMatrix
- MDN Web Docs
- Wikimedia contenttranslation dump
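Datasets hosted on the Hugging Face Hub can be inspected with the `datasets` library. A minimal sketch for the Wikidata parallel descriptions set listed above; the split name and record layout are assumptions, so consult the dataset card:

```python
from datasets import load_dataset

# 'train' split and field names are assumptions; check the dataset card.
ds = load_dataset('Mitsua/wikidata-parallel-descriptions-en-ja', split='train')
print(ds[0])
```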
Guide: Running Locally
- Install Required Packages

```bash
pip install transformers accelerate sentencepiece
```

- Run Translation

```python
from transformers import pipeline

translator = pipeline('translation', model='Mitsua/elan-mt-bt-ja-en')
print(translator('こんにちは。私はAIです。'))  # "Hello. I am an AI."
```

- Handle Longer Texts

For longer texts, pySBD can be used for sentence segmentation:

```bash
pip install pysbd
```

```python
import pysbd

# Reuses the `translator` pipeline created above.
seg = pysbd.Segmenter(language="ja", clean=False)
txt = 'こんにちは。私はAIです。お元気ですか?'  # "Hello. I am an AI. How are you?"
print(translator(seg.segment(txt)))
```
Cloud GPUs: Using cloud GPU services like AWS, Google Cloud, or Azure can significantly speed up the translation process.
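On a machine with a CUDA GPU (such as one of the cloud instances above), the pipeline can be pinned to the GPU. A minimal sketch using the standard `device` argument of `transformers.pipeline`:

```python
from transformers import pipeline

# device=0 places the model on the first CUDA GPU; device=-1 falls back to CPU.
translator = pipeline('translation', model='Mitsua/elan-mt-bt-ja-en', device=0)
print(translator('こんにちは。私はAIです。'))  # "Hello. I am an AI."
```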
License
The model is released under the CC BY-SA 4.0 license. The ELAN MITSUA Project is not liable for any direct or indirect losses resulting from the use of this model.