IndoBERT Base Uncased (indolem/indobert-base-uncased)

Introduction

IndoBERT is a BERT model pre-trained for Indonesian on over 220 million words sourced from Indonesian Wikipedia, online news articles, and an Indonesian Web Corpus. It achieves competitive performance on the Indonesian language tasks collected in the IndoLEM benchmark.

Architecture

IndoBERT follows the standard BERT architecture and is pre-trained from scratch for Indonesian. Within IndoLEM, it is evaluated on tasks spanning morpho-syntax, semantics, and discourse.
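
The exact hyperparameters ship with the checkpoint and can be read from its configuration, as in the sketch below. The commented values are the standard BERT-base settings and are stated here as an assumption, not read from the published config:

    from transformers import AutoConfig

    # Inspect the architecture hyperparameters stored with the checkpoint.
    config = AutoConfig.from_pretrained("indolem/indobert-base-uncased")
    print(config.num_hidden_layers)    # 12 transformer layers for BERT-base (assumed)
    print(config.hidden_size)          # 768-dimensional hidden states (assumed)
    print(config.num_attention_heads)  # 12 attention heads (assumed)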

Training

The model was trained for 2.4 million steps (180 epochs), reaching a final perplexity of 3.97 on the development set; the sketch after the list below shows how such a perplexity figure relates to the masked-language-modelling loss. The training dataset included:

  • Indonesian Wikipedia (74 million words)
  • News articles from Kompas, Tempo, and Liputan6 (55 million words)
  • Indonesian Web Corpus (90 million words)
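
Perplexity here is the exponential of the masked-language-modelling cross-entropy. The following is a minimal illustration of that relationship for a single masked token; it is not the original evaluation script, and the Indonesian sentence is an invented example:

    import math
    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("indolem/indobert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("indolem/indobert-base-uncased").eval()

    # Invented example sentence: "my mother is working in the rice field".
    inputs = tokenizer("ibu saya sedang bekerja di sawah", return_tensors="pt")
    labels = torch.full_like(inputs["input_ids"], -100)   # -100 = ignored by the loss

    masked_index = 4                                       # hide one in-sentence token
    labels[0, masked_index] = inputs["input_ids"][0, masked_index]
    inputs["input_ids"][0, masked_index] = tokenizer.mask_token_id

    with torch.no_grad():
        outputs = model(**inputs, labels=labels, return_dict=True)
    print("perplexity:", math.exp(outputs.loss.item()))   # exp of the MLM loss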

Guide: Running Locally

Steps to Load Model and Tokenizer

  1. Install Transformers Library
    Ensure transformers==3.5.1 is installed (the version this model is documented with); PyTorch is also required as the backend.
    pip install transformers==3.5.1
    
  2. Load the Model and Tokenizer
    from transformers import AutoTokenizer, AutoModel
    
    tokenizer = AutoTokenizer.from_pretrained("indolem/indobert-base-uncased")
    model = AutoModel.from_pretrained("indolem/indobert-base-uncased")
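
  3. Sanity-Check with a Fill-Mask Call
    The checkpoint is published for the fill-mask task, so a quick pipeline call makes a convenient smoke test; the masked sentence below is an invented example.
    from transformers import pipeline

    # Quick fill-mask check; the input means "my mother is cooking [MASK] in the kitchen".
    fill_mask = pipeline("fill-mask", model="indolem/indobert-base-uncased")
    for prediction in fill_mask("ibu saya sedang memasak [MASK] di dapur"):
        print(prediction["sequence"], prediction["score"])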
    

Suggested Cloud GPUs

For faster inference and fine-tuning, consider cloud GPU services such as the following (a device-selection sketch follows the list):

  • Google Colab (provides free Tesla K80, T4, or P100 GPUs)
  • AWS EC2 instances with GPU support
  • Azure N-series VMs
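
If a GPU is available, moving the model and its inputs onto it is a small change. A minimal device-selection sketch, where the input sentence is an invented example:

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Fall back to CPU when no GPU is present.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained("indolem/indobert-base-uncased")
    model = AutoModel.from_pretrained("indolem/indobert-base-uncased").to(device)

    # Invented example sentence ("good morning world"); input tensors must
    # live on the same device as the model.
    inputs = tokenizer("selamat pagi dunia", return_tensors="pt")
    inputs = {name: tensor.to(device) for name, tensor in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs, return_dict=True)
    print(outputs.last_hidden_state.shape)   # (batch, sequence_length, hidden_size)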

License

The IndoBERT model is released under the MIT License, which permits broad use, modification, and redistribution.
