ElanMT-BT-ja-en
Introduction
ElanMT-BT-ja-en is a Japanese-to-English translation model developed by the ELAN MITSUA Project. The model is a fine-tuned version of ElanMT-base-ja-en, trained exclusively on openly licensed data and back-translated English Wikipedia data.
Architecture
ElanMT-BT-ja-en is based on the Marian MT 6-layer encoder-decoder Transformer architecture and uses a sentencepiece tokenizer. It is designed for Japanese-to-English translation.
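As a quick sanity check of this setup, the configuration and tokenizer can be loaded from the Hugging Face Hub. A minimal sketch, assuming the checkpoint id `Mitsua/elan-mt-bt-ja-en` (used in the guide below) and the standard Marian config attributes in `transformers`:

```python
from transformers import AutoConfig, AutoTokenizer

# Inspect the architecture described above: a 6-layer Marian encoder-decoder.
config = AutoConfig.from_pretrained('Mitsua/elan-mt-bt-ja-en')
print(config.model_type)      # expected: 'marian'
print(config.encoder_layers)  # expected: 6
print(config.decoder_layers)  # expected: 6

# The tokenizer should be the sentencepiece-based Marian tokenizer.
tokenizer = AutoTokenizer.from_pretrained('Mitsua/elan-mt-bt-ja-en')
print(type(tokenizer).__name__)
```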
Training
The model training involved multiple steps:
- A sentencepiece tokenizer was trained on a 4M-line openly licensed corpus.
- A back-translation (en-to-ja) model was trained on the same corpus for 6 epochs.
- A base translation model was trained on a 4M-line corpus.
- 20M lines of English Wikipedia were back-translated into Japanese (see the sketch after this list).
- Four ja-en models were fine-tuned on the back-translated data.
- The four models were merged to optimize validation scores.
- The merged model was further fine-tuned on a high-quality subset of the corpus.
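To make the back-translation step concrete, here is a minimal sketch: an en-to-ja model translates monolingual English text into synthetic Japanese, and the resulting (Japanese, English) pairs serve as extra ja-to-en training data. The checkpoint name `en_ja_model_name` is a placeholder, since this card does not name the project's intermediate back-translation model.

```python
from transformers import pipeline

# Placeholder: the project's actual en->ja back-translation checkpoint
# is not named in this card.
en_ja_model_name = 'your-org/your-en-ja-model'
back_translator = pipeline('translation', model=en_ja_model_name)

# Monolingual English sentences, e.g. drawn from a Wikipedia dump.
english_lines = [
    'The cat sat on the mat.',
    'Back-translation creates synthetic source sentences from target-side text.',
]

# Each (synthetic Japanese, original English) pair becomes ja->en training data.
synthetic_ja = [out['translation_text'] for out in back_translator(english_lines)]
pairs = list(zip(synthetic_ja, english_lines))
print(pairs)
```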
Training utilized datasets such as:
- Mitsua/wikidata-parallel-descriptions-en-ja
- Kyoto Free Translation Task (KFTT)
- Tatoeba
- WikiMatrix
- MDN Web Docs
- Wikimedia contenttranslation dump
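Datasets hosted on the Hugging Face Hub can be inspected with the `datasets` library. A minimal sketch for the Wikidata parallel descriptions set listed above; the split name and record layout are assumptions, so consult the dataset card:

```python
from datasets import load_dataset

# 'train' split and field names are assumptions; check the dataset card.
ds = load_dataset('Mitsua/wikidata-parallel-descriptions-en-ja', split='train')
print(ds[0])
```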
Guide: Running Locally
- Install Required Packages

```bash
pip install transformers accelerate sentencepiece
```

- Run Translation

```python
from transformers import pipeline

translator = pipeline('translation', model='Mitsua/elan-mt-bt-ja-en')
print(translator('こんにちは。私はAIです。'))  # "Hello. I am an AI."
```

- Handle Longer Texts

For longer texts, pySBD can be used for sentence segmentation:

```bash
pip install pysbd
```

```python
import pysbd

# Reuses the `translator` pipeline created above.
seg = pysbd.Segmenter(language="ja", clean=False)
txt = 'こんにちは。私はAIです。お元気ですか?'  # "Hello. I am an AI. How are you?"
print(translator(seg.segment(txt)))
```
Cloud GPUs: Using cloud GPU services like AWS, Google Cloud, or Azure can significantly speed up the translation process.
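On a machine with a CUDA GPU (such as one of the cloud instances above), the pipeline can be pinned to the GPU. A minimal sketch using the standard `device` argument of `transformers.pipeline`:

```python
from transformers import pipeline

# device=0 places the model on the first CUDA GPU; device=-1 falls back to CPU.
translator = pipeline('translation', model='Mitsua/elan-mt-bt-ja-en', device=0)
print(translator('こんにちは。私はAIです。'))  # "Hello. I am an AI."
```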
License
The model is released under the CC BY-SA 4.0 license. The ELAN MITSUA Project is not liable for any direct or indirect losses resulting from the use of this model.