BERT-LARGE-JAPANESE

tohoku-nlp

Introduction

BERT-LARGE-JAPANESE is a pre-trained model based on BERT, specifically tailored for the Japanese language. It uses word-level tokenization based on the Unidic 2.1.2 dictionary, followed by WordPiece subword tokenization, and employs whole word masking for the masked language modeling objective.
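
As a brief illustration of this two-stage tokenization, the sketch below assumes the Hugging Face model id tohoku-nlp/bert-large-japanese and an environment with fugashi and unidic-lite installed; it prints the word-then-subword pieces produced by the tokenizer.

```python
from transformers import BertJapaneseTokenizer

# Assumed model id; adjust if the model is hosted under a different name.
tokenizer = BertJapaneseTokenizer.from_pretrained("tohoku-nlp/bert-large-japanese")

# Text is first segmented into words by MeCab (Unidic), then split into
# WordPiece subwords; continuation pieces are prefixed with "##".
print(tokenizer.tokenize("自然言語処理を勉強しています。"))
```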

Architecture

The architecture of this model mirrors the original BERT-large design: 24 layers, a hidden size of 1024, and 16 attention heads.
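
For reference, a minimal sketch of these hyperparameters expressed as a transformers BertConfig is shown below; the intermediate (feed-forward) size of 4096 is the standard BERT-large value and is an assumption here, not stated above.

```python
from transformers import BertConfig

# Sketch of the BERT-large hyperparameters described in this section.
config = BertConfig(
    num_hidden_layers=24,     # 24 transformer layers
    hidden_size=1024,         # hidden state dimension
    num_attention_heads=16,   # attention heads per layer
    intermediate_size=4096,   # assumed standard BERT-large feed-forward size
    vocab_size=32768,         # vocabulary size reported in the Training section
)
print(config)
```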

Training

The model was trained on Japanese Wikipedia, using a corpus generated from the Wikipedia Cirrussearch dump of August 31, 2020. The corpus is approximately 4.0GB in size and contains around 30 million sentences. Texts were tokenized using MeCab with the Unidic 2.1.2 dictionary and then split into subwords with the WordPiece algorithm; the vocabulary consists of 32,768 tokens. Training was conducted on a Cloud TPU v3-8 instance and took about 5 days, following BERT's original training configuration, including 512 tokens per instance and 1 million training steps.

Guide: Running Locally

  1. Install Required Libraries: Make sure transformers, fugashi, and unidic-lite are installed.
  2. Download the Model: Fetch the model from the Hugging Face model hub.
  3. Set Up Tokenization: The tokenizer relies on fugashi and unidic-lite for MeCab-based word segmentation.
  4. Run Inference: Load the model with the transformers library and run your text through it, as shown in the sketch after this list.
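
A minimal end-to-end sketch, assuming the Hugging Face model id tohoku-nlp/bert-large-japanese and an environment with transformers, fugashi, and unidic-lite installed:

```python
from transformers import pipeline

# Fill-mask inference; the model and tokenizer are downloaded on first use.
fill_mask = pipeline("fill-mask", model="tohoku-nlp/bert-large-japanese")

# Predict the masked position and print the top candidates with their scores.
for prediction in fill_mask("東京は日本の[MASK]です。"):
    print(prediction["token_str"], round(prediction["score"], 4))
```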

For optimal performance, it is recommended to use cloud GPUs such as those provided by AWS, Azure, or Google Cloud.

License

The pretrained models are available under the Creative Commons Attribution-ShareAlike 3.0 License.
