ElanMT-BT-ja-en

Mitsua

Introduction

ElanMT-BT-ja-en is a Japanese-to-English translation model developed by the ELAN MITSUA Project. The model is a fine-tuned version of ElanMT-base-ja-en, trained exclusively on openly licensed data and back-translated Wikipedia data.

Architecture

ElanMT-BT-ja-en is based on the Marian MT 6-layer encoder-decoder Transformer architecture and uses a SentencePiece tokenizer. It is designed for translation from Japanese to English.

Training

The model training involved multiple steps:

  1. A SentencePiece tokenizer was trained on a 4M-line openly licensed corpus.
  2. A back-translation (en→ja) model was trained on the same corpus for 6 epochs.
  3. A base translation model was trained on a 4M-line openly licensed corpus.
  4. 20M lines of English Wikipedia were back-translated into Japanese.
  5. Four ja-en models were fine-tuned on the back-translated data.
  6. The four models were merged to maximize the validation score.
  7. The merged model was further fine-tuned on a high-quality subset of the corpus.
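The back-translation stage (steps 2, 4, and 5) can be sketched as follows. The translator and fine-tuning functions here are simple stand-ins to illustrate the data flow, not the actual Marian models or training loop:

```python
def back_translate(en_to_ja, en_sentences):
    """Step 4: turn monolingual English text into synthetic (ja, en) pairs
    by translating it to Japanese with the en->ja back-translation model."""
    return [(en_to_ja(s), s) for s in en_sentences]

def fine_tune(base_model, pairs):
    """Step 5 (placeholder): fine-tune the ja->en base model on the synthetic
    pairs. Real training would update weights; here we only record the data."""
    return {"base": base_model, "train_pairs": pairs}

# Stand-in en->ja "model": reverses the string (illustration only).
en_to_ja = lambda s: s[::-1]

wiki_en = ["Hello.", "I am an AI."]
pairs = back_translate(en_to_ja, wiki_en)
model = fine_tune("ElanMT-base-ja-en", pairs)
print(pairs)  # [('.olleH', 'Hello.'), ('.IA na ma I', 'I am an AI.')]
```

The synthetic (ja, en) pairs then supplement the scarce openly licensed parallel data, which is the point of back-translation.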

Training utilized datasets such as:

  • Mitsua/wikidata-parallel-descriptions-en-ja
  • Kyoto Free Translation Task (KFTT)
  • Tatoeba
  • WikiMatrix
  • MDN Web Docs
  • Wikimedia contenttranslation dump

Guide: Running Locally

  1. Install Required Packages

    pip install transformers accelerate sentencepiece
    
  2. Run Translation

    from transformers import pipeline
    translator = pipeline('translation', model='Mitsua/elan-mt-bt-ja-en')
    print(translator('こんにちは。私はAIです。'))
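A transformers translation pipeline returns a list of dictionaries keyed by `translation_text`. A minimal sketch of unpacking such a result (the value below is illustrative, not the model's guaranteed output):

```python
# Illustrative shape of a transformers translation pipeline result;
# the English string is an example value only.
result = [{'translation_text': 'Hello. I am an AI.'}]

texts = [r['translation_text'] for r in result]
print(texts[0])
```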
    
  3. Handle Longer Texts

     For longer texts, pySBD can be used for sentence segmentation:

    pip install pysbd
    
    import pysbd
    seg = pysbd.Segmenter(language="ja", clean=False)
    txt = 'こんにちは。私はAIです。お元気ですか?'
    print(translator(seg.segment(txt)))
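The segment-then-translate pattern above can be wrapped in a small helper. To keep this sketch self-contained, the segmenter and translator are injected as callables with simple stand-ins; in real use, substitute the pySBD `Segmenter.segment` method and the transformers pipeline:

```python
def translate_long(text, segment, translate):
    """Split `text` into sentences, translate each, and join the results.
    `segment`:   callable str -> list[str]  (e.g. pysbd Segmenter.segment)
    `translate`: callable list[str] -> list of dicts with 'translation_text'
                 keys (e.g. a transformers translation pipeline)
    """
    sentences = segment(text)
    results = translate(sentences)
    return " ".join(r["translation_text"] for r in results)

# Stand-ins for demonstration: split on '。', "translate" by tagging.
segment = lambda t: [s + "。" for s in t.split("。") if s]
translate = lambda xs: [{"translation_text": f"<en:{s}>"} for s in xs]

print(translate_long("こんにちは。私はAIです。", segment, translate))
```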
    

Cloud GPUs: Using cloud GPU services like AWS, Google Cloud, or Azure can significantly speed up the translation process.

License

The model is licensed under the CC BY-SA 4.0 license. The ELAN MITSUA Project is not liable for any direct or indirect losses resulting from the use of this model.
