MT5-ZH-JA-EN-TRIMMED Model Documentation

Introduction

MT5-ZH-JA-EN-TRIMMED is a translation model fine-tuned from mt5-base. It supports translation between Chinese, Japanese, and English. Its vocabulary has been trimmed to roughly one third of the original size, retaining the top 85,000 tokens from the training data.

Architecture

The model is based on the mt5-base architecture, a variant of T5 designed for multilingual tasks. Its vocabulary has been reduced for efficiency, which shrinks the embedding layers while keeping coverage of the three targeted languages: Chinese, Japanese, and English.
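
As a quick sanity check, the effect of the trimming can be inspected directly. The sketch below compares the trimmed checkpoint against google/mt5-base (assumed here to be the untrimmed baseline); the token counts in the comments are approximations taken from the figures above, not exact values.

from transformers import T5Tokenizer, MT5ForConditionalGeneration

# Load the original and the trimmed tokenizers to compare vocabulary sizes.
base_tok = T5Tokenizer.from_pretrained("google/mt5-base")
trim_tok = T5Tokenizer.from_pretrained("K024/mt5-zh-ja-en-trimmed")

print(len(base_tok))  # ~250,000 tokens in the original mt5-base vocabulary
print(len(trim_tok))  # ~85,000 tokens after trimming

# The embedding matrices shrink accordingly, which is where the
# parameter savings come from.
model = MT5ForConditionalGeneration.from_pretrained("K024/mt5-zh-ja-en-trimmed")
print(model.get_input_embeddings().weight.shape)  # (vocab_size, d_model)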

Training

The model was trained using datasets from various multilingual sources, including:

  • Wikimedia datasets (en-ja, en-zh, ja-zh)
  • WikiTitles datasets (ja-en, zh-en)
  • WikiMatrix (ja-zh)
  • News Commentary (en-ja, en-zh, ja-zh)
  • TED2020 talks (en-ja, en-zh, ja-zh)

These datasets provide a rich source of parallel text data to enhance the model's translation capabilities.
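
The card does not spell out the preprocessing, but given the task prefix used in the inference example below ("ja2zh:"), each parallel pair was presumably converted into a prefixed source/target example along these lines. make_example is a hypothetical helper for illustration only, not part of the released code.

# Hypothetical sketch of turning a parallel sentence pair into a training
# example with a "<src>2<tgt>:" task prefix; the author's actual pipeline
# may differ.
def make_example(src_lang, tgt_lang, src_text, tgt_text):
  return {
    "input": f"{src_lang}2{tgt_lang}: {src_text}",
    "target": tgt_text,
  }

print(make_example("en", "zh", "Hello, world.", "你好，世界。"))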

Guide: Running Locally

To run the model locally, you can use the following Python code with the Hugging Face Transformers library:

from transformers import (
  T5Tokenizer,
  MT5ForConditionalGeneration,
  Text2TextGenerationPipeline,
)

path = "K024/mt5-zh-ja-en-trimmed"
pipe = Text2TextGenerationPipeline(
  model=MT5ForConditionalGeneration.from_pretrained(path),
  tokenizer=T5Tokenizer.from_pretrained(path),
)

sentence = "ja2zh: 吾輩は猫である。名前はまだ無い。"
res = pipe(sentence, max_length=100, num_beams=4)
print(res[0]['generated_text'])
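
The other directions between the three languages presumably follow the same "<src>2<tgt>:" prefix pattern; for example, reusing the pipeline created above:

# Assuming the other directions use the same "<src>2<tgt>:" prefix pattern:
res = pipe("en2zh: The quick brown fox jumps over the lazy dog.", max_length=100, num_beams=4)
print(res[0]['generated_text'])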

Suggested Cloud GPUs

For better performance, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure. These services offer powerful GPUs suitable for running large models efficiently.
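
On such an instance, the pipeline from the guide above can be placed on the GPU by passing a device index; a minimal sketch, assuming PyTorch with CUDA available:

import torch
from transformers import (
  T5Tokenizer,
  MT5ForConditionalGeneration,
  Text2TextGenerationPipeline,
)

path = "K024/mt5-zh-ja-en-trimmed"
# Use the first CUDA device if one is available, otherwise fall back to CPU.
device = 0 if torch.cuda.is_available() else -1
pipe = Text2TextGenerationPipeline(
  model=MT5ForConditionalGeneration.from_pretrained(path),
  tokenizer=T5Tokenizer.from_pretrained(path),
  device=device,
)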

License

The model is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). This license permits sharing and adaptation for non-commercial purposes only, provided attribution is given and derivative works are distributed under the same terms.
