unihanlm-base (microsoft)
Introduction
UnihanLM is a self-supervised, pretrained masked language model (MLM) designed for Chinese and Japanese. It leverages the shared characters between these languages through a two-stage coarse-to-fine training approach using the Unihan database. This model aims to enhance performance on both monolingual and cross-lingual tasks by exploiting the morphological similarities between Chinese and Japanese characters.
Architecture
UnihanLM employs a two-stage training process:
- Coarse-Grained Pretraining: Morphologically similar characters are clustered using the Unihan database, and these clusters replace original characters in sentences for initial training.
- Fine-Grained Pretraining: The original characters are restored from their clusters so the model refines its representations of individual characters (see the sketch after this list).
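The following is a minimal, illustrative sketch of the coarse-to-fine idea. CHAR_TO_CLUSTER is a made-up toy mapping; the actual clusters are derived from variant relations in the Unihan database, and the real model operates on token IDs rather than raw strings.

```python
# Illustrative sketch only: the coarse stage replaces morphologically related
# characters with a shared cluster ID; the fine stage restores the originals.
# CHAR_TO_CLUSTER is a hypothetical toy mapping, not the real Unihan clustering.
CHAR_TO_CLUSTER = {
    "學": "C_LEARN",    # Traditional Chinese / Kyūjitai form
    "学": "C_LEARN",    # Simplified Chinese / Shinjitai form (same cluster)
    "國": "C_COUNTRY",
    "国": "C_COUNTRY",
}

def coarsen(sentence: str) -> list[str]:
    """Coarse stage: map each character to its cluster ID when one exists."""
    return [CHAR_TO_CLUSTER.get(ch, ch) for ch in sentence]

def refine(tokens: list[str], original: str) -> list[str]:
    """Fine stage: restore the original characters for continued pretraining."""
    return [orig if tok in CHAR_TO_CLUSTER.values() else tok
            for tok, orig in zip(tokens, original)]

coarse = coarsen("中国の学校")          # ['中', 'C_COUNTRY', 'の', 'C_LEARN', '校']
fine = refine(coarse, "中国の学校")     # ['中', '国', 'の', '学', '校']
print(coarse, fine)
```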
Training
The model is trained using datasets from Chinese and Japanese Wikipedia. The training process involves a coarse-to-fine methodology, allowing the model to learn general character similarities before focusing on individual character distinctions. Detailed information on the training procedure and evaluation results can be found in the associated research paper.
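Since UnihanLM is a masked language model, pretraining follows the standard MLM recipe: a fraction of tokens is replaced by a mask token and the model is trained to recover them. Below is a minimal sketch of that objective; the 15% masking ratio and the toy mask token ID are conventional assumptions, not values stated on this card.

```python
# Minimal sketch of the masked-language-modeling objective.
# mlm_probability=0.15 and mask_token_id=4 are assumed toy values.
import torch

def mask_tokens(input_ids: torch.Tensor, mask_token_id: int,
                mlm_probability: float = 0.15):
    """Randomly mask positions; labels are -100 everywhere else, so only
    masked positions contribute to the cross-entropy loss."""
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape) < mlm_probability
    labels[~mask] = -100
    masked_inputs = input_ids.clone()
    masked_inputs[mask] = mask_token_id
    return masked_inputs, labels

ids = torch.randint(5, 1000, (1, 12))          # toy token IDs
inputs, labels = mask_tokens(ids, mask_token_id=4)
print(inputs, labels, sep="\n")
```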
Guide: Running Locally
To use UnihanLM locally:
- Install the `transformers` library from Hugging Face.
- Load the model from its page on the Hugging Face Hub.
- Follow the same usage pattern as XLM models, as in the sketch below.
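A minimal loading sketch, assuming the Hub repo id is microsoft/unihanlm-base and that the checkpoint loads with the XLM classes, as the XLM usage note above suggests:

```python
# Minimal sketch: load UnihanLM via the XLM classes from transformers.
# The repo id "microsoft/unihanlm-base" is inferred from the model card title.
from transformers import XLMTokenizer, XLMWithLMHeadModel

tokenizer = XLMTokenizer.from_pretrained("microsoft/unihanlm-base")
model = XLMWithLMHeadModel.from_pretrained("microsoft/unihanlm-base")

inputs = tokenizer("東京は日本の首都です。", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch_size, sequence_length, vocab_size)
```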
For optimal performance, consider utilizing cloud GPUs such as those offered by AWS, Google Cloud, or Azure.
License
UnihanLM is released under the Apache 2.0 License, permitting both commercial and non-commercial use.