BERT Base Japanese (tohoku-nlp)
Introduction
BERT Base Japanese is a pretrained BERT model designed specifically for the Japanese language. It uses word-level tokenization based on the IPA dictionary, followed by WordPiece subword tokenization.
Architecture
The architecture of the model follows the original BERT base design, featuring 12 layers, 768 hidden state dimensions, and 12 attention heads.
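As a rough illustration, these figures can be expressed as a Transformers configuration. The snippet below is a sketch built from the numbers in this card (plus the 32,000-token vocabulary from the Training section and the standard BERT-base feed-forward size), not the released configuration file itself.

```python
from transformers import BertConfig

# Sketch of a BERT-base configuration matching the figures above:
# 12 layers, 768-dimensional hidden states, 12 attention heads,
# and the 32,000-token vocabulary described in the Training section.
config = BertConfig(
    vocab_size=32000,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,  # standard BERT-base feed-forward size
)
print(config)
```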
Training
The model was trained on the Japanese Wikipedia dump as of September 1, 2019. WikiExtractor was used to extract plain text, resulting in a 2.6GB corpus of approximately 17 million sentences. MeCab, a morphological analyzer with the IPA dictionary, performed the initial word-level tokenization, followed by WordPiece for subword tokenization with a vocabulary size of 32,000. Training used configurations similar to the original BERT: 512 tokens per instance, 256 instances per batch, and 1 million training steps in total.
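To see how this two-stage tokenization behaves in practice, the sketch below loads the tokenizer through Transformers' BertJapaneseTokenizer. The Hub id tohoku-nlp/bert-base-japanese and the fugashi/ipadic dependencies are assumptions about your environment, not part of the original description.

```python
from transformers import BertJapaneseTokenizer

# Assumes the Hub id "tohoku-nlp/bert-base-japanese" and that fugashi and
# ipadic are installed so MeCab can perform the word-level segmentation.
tokenizer = BertJapaneseTokenizer.from_pretrained("tohoku-nlp/bert-base-japanese")

text = "東北大学で自然言語処理を研究しています。"
# MeCab splits the sentence into words first; WordPiece then breaks
# rarer words into subword units prefixed with "##".
print(tokenizer.tokenize(text))
```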
Guide: Running Locally
- Set Up Environment: Install Python and the required libraries: Hugging Face Transformers and either PyTorch or TensorFlow.
- Download the Model: Use the Hugging Face Model Hub to download the BERT Base Japanese model.
- Load the Model: Load the model and tokenizer in your script using the Hugging Face Transformers library.
- Run Inference: Input Japanese text to perform tasks such as fill-mask, as sketched after this list.
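A minimal end-to-end sketch of these steps is shown below. It assumes the Hub id tohoku-nlp/bert-base-japanese and a PyTorch backend, and uses the generic fill-mask pipeline rather than any script shipped with the model.

```python
from transformers import pipeline

# Minimal sketch: download the model from the Hub and run fill-mask.
# Assumes `pip install transformers torch fugashi ipadic` has been run.
fill_mask = pipeline("fill-mask", model="tohoku-nlp/bert-base-japanese")

# The tokenizer uses the standard BERT [MASK] token.
for prediction in fill_mask("東北大学で[MASK]の研究をしています。"):
    print(prediction["token_str"], prediction["score"])
```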
For optimal performance, especially during fine-tuning or large-scale inference tasks, consider using cloud GPUs such as those available on AWS, GCP, or Azure.
License
The pretrained models are available under the Creative Commons Attribution-ShareAlike 3.0 license.