K024/mt5-zh-ja-en-trimmed Model Documentation
Introduction
MT5-ZH-JA-EN-TRIMMED is a translation model fine-tuned from the mt5-base model. It supports translation between Chinese, Japanese, and English. Its vocabulary has been trimmed to roughly one-third of the original size, retaining the top 85,000 tokens seen in the training data.
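As a quick sanity check, the trimmed vocabulary is visible directly on the tokenizer. A minimal sketch, assuming the Hugging Face hub identifier K024/mt5-zh-ja-en-trimmed used later in this guide:

from transformers import T5Tokenizer

# Load the tokenizer and print its vocabulary size; for this trimmed model
# it should be on the order of 85,000 entries, versus roughly 250,000 for
# the original mt5-base vocabulary.
tokenizer = T5Tokenizer.from_pretrained("K024/mt5-zh-ja-en-trimmed")
print(len(tokenizer))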
Architecture
The model is based on the mt5-base architecture, a multilingual variant of T5. Trimming the vocabulary shrinks the embedding and output projection matrices, which account for a large share of mt5-base's parameters, so the model is smaller and more efficient while still covering the three target languages: Chinese, Japanese, and English.
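The effect of the trimming on model size can be checked by counting parameters. A minimal sketch, not taken from the original documentation:

from transformers import MT5ForConditionalGeneration

# Load the trimmed model and count its parameters; the reduced vocabulary
# makes the embedding and output matrices considerably smaller than in the
# stock mt5-base checkpoint.
model = MT5ForConditionalGeneration.from_pretrained("K024/mt5-zh-ja-en-trimmed")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")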
Training
The model was trained using datasets from various multilingual sources, including:
- Wikimedia datasets (en-ja, en-zh, ja-zh)
- Wikititles datasets (ja-en, zh-en)
- WikiMatrix (ja-zh)
- News Commentary (en-ja, en-zh, ja-zh)
- TED2020 talks (en-ja, en-zh, ja-zh)
Together, these corpora provide parallel text covering all three language pairs (zh-en, zh-ja, ja-en).
Guide: Running Locally
To run the model locally, you can use the following Python code with the Hugging Face Transformers library:
from transformers import (
    T5Tokenizer,
    MT5ForConditionalGeneration,
    Text2TextGenerationPipeline,
)

path = "K024/mt5-zh-ja-en-trimmed"
pipe = Text2TextGenerationPipeline(
    model=MT5ForConditionalGeneration.from_pretrained(path),
    tokenizer=T5Tokenizer.from_pretrained(path),
)

# The translation direction is given as a prefix on the input text. Here
# "ja2zh:" requests Japanese-to-Chinese translation; the sentence is the
# opening of Natsume Soseki's "I Am a Cat"
# ("I am a cat. As yet I have no name.").
sentence = "ja2zh: 吾輩は猫である。名前はまだ無い。"
res = pipe(sentence, max_length=100, num_beams=4)
print(res[0]['generated_text'])
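The ja2zh: prefix selects the translation direction. The other supported pairs presumably follow the same source2target pattern; the prefixes below, apart from ja2zh, are assumptions based on that pattern rather than examples from the original documentation, so verify them against the model card. The snippet reuses the pipe object from above:

# Assumed direction prefixes, following the source2target naming of the
# ja2zh example (unverified assumption).
print(pipe("en2zh: The quick brown fox jumps over the lazy dog.",
           max_length=100, num_beams=4)[0]["generated_text"])
print(pipe("zh2en: 今天天气很好。",  # "The weather is nice today."
           max_length=100, num_beams=4)[0]["generated_text"])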
Suggested Cloud GPUs
For better performance, consider using cloud GPUs from providers such as AWS, Google Cloud, or Azure, which offer instances suited to running models of this size efficiently.
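When a GPU is available, the pipeline can be placed on it with the standard device argument of the Transformers pipeline API. A minimal sketch extending the guide above:

import torch
from transformers import (
    T5Tokenizer,
    MT5ForConditionalGeneration,
    Text2TextGenerationPipeline,
)

path = "K024/mt5-zh-ja-en-trimmed"
# device=0 selects the first CUDA device; device=-1 falls back to the CPU.
device = 0 if torch.cuda.is_available() else -1
pipe = Text2TextGenerationPipeline(
    model=MT5ForConditionalGeneration.from_pretrained(path),
    tokenizer=T5Tokenizer.from_pretrained(path),
    device=device,
)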
License
The model is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). This license allows sharing and adaptation under non-commercial terms, with attribution and share-alike conditions.