indolem/indobert-base-uncased

Introduction
IndoBERT is an Indonesian adaptation of the BERT model, trained on over 220 million words drawn from Indonesian Wikipedia, news articles, and an Indonesian Web Corpus. It achieves competitive results across the Indonesian language tasks of the IndoLEM benchmark.
Architecture
IndoBERT follows the BERT-base architecture and is pre-trained specifically on Indonesian text. Within the IndoLEM benchmark, it is evaluated on tasks spanning morpho-syntax, semantics, and discourse.
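If you want to verify the architecture hyperparameters yourself, the published configuration can be inspected directly. A minimal sketch, assuming the checkpoint follows the standard BERT-base shape (12 layers, hidden size 768, 12 attention heads):

```python
from transformers import AutoConfig

# Download and print the model's configuration; for a BERT-base
# checkpoint we expect 12 layers, a hidden size of 768, and 12 heads.
config = AutoConfig.from_pretrained("indolem/indobert-base-uncased")
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
```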
Training
The model was trained for 2.4 million steps, equivalent to 180 epochs, achieving a final perplexity of 3.97 on the development set. The training dataset included:
- Indonesian Wikipedia (74 million words)
- News articles from Kompas, Tempo, and Liputan6 (55 million words)
- Indonesian Web Corpus (90 million words)
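The perplexity above is the authors' reported number. As a hedged illustration of how a masked-LM cross-entropy loss converts to a perplexity-style score, the sketch below scores a sample sentence with the LM head; because no tokens are actually masked, this yields a pseudo-perplexity, not the paper's dev-set procedure:

```python
import math

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Score an (illustrative) Indonesian sentence with the masked-LM head and
# exponentiate the mean cross-entropy loss. Because no tokens are masked,
# this is only a pseudo-perplexity, not the paper's evaluation setup.
tokenizer = AutoTokenizer.from_pretrained("indolem/indobert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("indolem/indobert-base-uncased")

inputs = tokenizer("ibu kota indonesia adalah jakarta", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, labels=inputs["input_ids"], return_dict=True)
print(f"pseudo-perplexity: {math.exp(outputs.loss.item()):.2f}")
```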
Guide: Running Locally
Steps to Load Model and Tokenizer
- Install the Transformers library. Ensure you have transformers==3.5.1 installed:

  ```
  pip install transformers==3.5.1
  ```
- Load the model and tokenizer:

  ```python
  from transformers import AutoTokenizer, AutoModel

  tokenizer = AutoTokenizer.from_pretrained("indolem/indobert-base-uncased")
  model = AutoModel.from_pretrained("indolem/indobert-base-uncased")
  ```
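Once loaded, the model can encode Indonesian text. Continuing from the loading step above, here is a minimal usage sketch; the sentence and the choice of the [CLS] hidden state as a sentence vector are illustrative assumptions, not prescribed by the model card:

```python
import torch

# Tokenize a sample sentence and take the [CLS] hidden state as a
# sentence-level embedding (a common, but not canonical, choice).
inputs = tokenizer("selamat pagi dunia", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, return_dict=True)
cls_embedding = outputs.last_hidden_state[:, 0]  # shape: (1, 768)
```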
Suggested Cloud GPUs
For faster inference and fine-tuning, consider using cloud GPU services such as the following (a device-placement sketch follows the list):
- Google Colab (provides free Tesla K80, T4, or P100 GPUs)
- AWS EC2 instances with GPU support
- Azure N-series VMs
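On any of these, standard PyTorch device placement applies. Continuing the example above, this sketch is generic PyTorch and not specific to IndoBERT:

```python
import torch

# Use a GPU when one is available (e.g. on Colab or a GPU cloud VM),
# otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
inputs = {name: tensor.to(device) for name, tensor in inputs.items()}
```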
License
IndoBERT is released under the MIT License, which permits broad use, modification, and redistribution under its terms.