UnihanLM base

Maintained by microsoft

Introduction

UnihanLM is a self-supervised, pretrained masked language model (MLM) for Chinese and Japanese. It exploits the large inventory of Han characters shared by the two languages through a two-stage, coarse-to-fine pretraining scheme built on the Unihan database, with the goal of improving performance on both monolingual and cross-lingual tasks.

Architecture

UnihanLM employs a two-stage training process:

  1. Coarse-Grained Pretraining: Morphologically similar characters are clustered using the Unihan database, and these clusters replace the original characters in sentences during initial training (see the sketch after this list).
  2. Fine-Grained Pretraining: The original characters are restored from clusters to refine the model's understanding of specific character representations.
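
The coarse-grained stage can be pictured as a simple character-to-cluster substitution. The sketch below is illustrative only: the mapping entries and cluster token names are assumptions, and the real clusters are derived from Unihan variant fields rather than written by hand.

```python
# A minimal sketch of the coarse-grained substitution, assuming a
# character-to-cluster mapping derived from Unihan variant fields
# (e.g. kSemanticVariant, kZVariant). The entries and token names
# below are hypothetical.
CHAR_TO_CLUSTER = {
    "學": "<c17>",  # Traditional Chinese / Japanese kyūjitai "study"
    "学": "<c17>",  # Simplified Chinese / Japanese shinjitai "study"
    "國": "<c42>",  # Traditional "country"
    "国": "<c42>",  # Simplified / shinjitai "country"
}

def to_coarse(sentence: str) -> str:
    """Replace each character with its cluster token when one exists."""
    return "".join(CHAR_TO_CLUSTER.get(ch, ch) for ch in sentence)

print(to_coarse("学校"))  # "<c17>校": 學校 and 学校 collapse to the same form
```

After this substitution, variants of the same character from Chinese and Japanese share one representation, so the model learns a common embedding for them before the fine-grained stage separates them again.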

Training

The model is trained using datasets from Chinese and Japanese Wikipedia. The training process involves a coarse-to-fine methodology, allowing the model to learn general character similarities before focusing on individual character distinctions. Detailed information on the training procedure and evaluation results can be found in the associated research paper.
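
The schedule can be summarized as two consecutive MLM phases. This is a conceptual sketch only: `mlm_step` and `to_coarse` are hypothetical callables (one masked-language-modeling update and the cluster substitution shown above), and the step counts are illustrative, not the paper's recipe.

```python
from typing import Callable, Iterable, List

def pretrain(
    mlm_step: Callable[[List[str]], float],  # one MLM update, returns loss
    to_coarse: Callable[[str], str],         # cluster substitution (see above)
    batches: Iterable[List[str]],            # batches of raw sentences
    coarse_steps: int,
    fine_steps: int,
) -> None:
    it = iter(batches)
    # Stage 1: coarse-grained MLM on cluster-substituted sentences.
    for _ in range(coarse_steps):
        mlm_step([to_coarse(s) for s in next(it)])
    # Stage 2: fine-grained MLM on the original characters.
    for _ in range(fine_steps):
        mlm_step(next(it))
```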

Guide: Running Locally

To use UnihanLM locally:

  1. Install the transformers library from Hugging Face.
  2. Load the model and tokenizer from the Hugging Face Hub.
  3. Follow the same usage pattern as XLM models (see the sketch after this list).
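
A minimal loading sketch, assuming the checkpoint is published on the Hub as microsoft/unihanlm-base (verify the exact id on the model card) and behaves like other XLM-style checkpoints in transformers:

```python
# Install first: pip install transformers torch sacremoses
# (sacremoses is required by XLM-style tokenizers in transformers)
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "microsoft/unihanlm-base"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Encode a Japanese sentence and extract contextual features, XLM-style.
inputs = tokenizer("東京で自然言語処理を研究しています。", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```

The last hidden states serve as the feature-extraction output; downstream tasks can pool them or feed them to a task-specific head.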

For faster inference and fine-tuning, consider using cloud GPUs such as those offered by AWS, Google Cloud, or Azure.

License

UnihanLM is released under the Apache 2.0 License, permitting both commercial and non-commercial use.
