IndoBERT Base P1 (indobenchmark/indobert-base-p1)
Introduction
IndoBERT is a state-of-the-art language model for Indonesian based on the BERT architecture. It is designed for natural language understanding tasks in Indonesian and is pre-trained with both the masked language modeling (MLM) and next sentence prediction (NSP) objectives.
Architecture
The IndoBERT models come in several configurations: base and large, plus lite variants of each. The base model has 124.5 million parameters and the large model has 335.2 million, while the lite variants are more compact at 11.7 million (lite-base) and 17.7 million (lite-large) parameters. All models are trained on the Indo4B dataset, which comprises 23.43 GB of Indonesian text.
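As a quick sanity check, you can load a checkpoint and count its parameters locally. The snippet below is a minimal sketch; the exact number printed may differ slightly from the headline figure depending on which head weights the published checkpoint includes.

```python
from transformers import AutoModel

# Load the base (phase 1) checkpoint and count its parameters.
model = AutoModel.from_pretrained("indobenchmark/indobert-base-p1")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # roughly 124.5M for the base model
```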
Training
IndoBERT models are pre-trained using a combination of MLM and NSP objectives on a substantial corpus of Indonesian text. This training setup allows the models to capture the syntactic and semantic properties of the language effectively.
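Because the model is pre-trained with MLM, you can probe it by masking a token and asking for replacements. The sketch below uses the generic fill-mask pipeline and assumes the published checkpoint still ships the MLM head; if it does not, transformers will warn that a new head was initialized and the predictions will not be meaningful.

```python
from transformers import pipeline

# Fill-mask probe: assumes the checkpoint includes the pretraining MLM head;
# otherwise transformers warns about a newly initialized head and the output is noise.
fill_mask = pipeline("fill-mask", model="indobenchmark/indobert-base-p1")
for prediction in fill_mask("aku adalah anak [MASK]"):
    print(prediction["token_str"], round(prediction["score"], 3))
```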
Guide: Running Locally
To run IndoBERT locally, follow these steps:
- Install the Transformers Library: Ensure you have the transformers library installed in your Python environment. You can install it with pip:

  ```bash
  pip install transformers
  ```
- Load the Model and Tokenizer:

  ```python
  from transformers import BertTokenizer, AutoModel

  tokenizer = BertTokenizer.from_pretrained("indobenchmark/indobert-base-p1")
  model = AutoModel.from_pretrained("indobenchmark/indobert-base-p1")
  ```
- Extract Contextual Representation (a mean-pooled sentence-embedding variant is sketched after these steps):

  ```python
  import torch

  # Encode a sentence and sum the contextual hidden states as a quick smoke test.
  x = torch.LongTensor(tokenizer.encode('aku adalah anak [MASK]')).view(1, -1)
  print(x, model(x)[0].sum())
  ```
- Hardware Recommendations: For optimal performance, consider using cloud GPU services such as Amazon AWS EC2, Google Cloud Platform, or Microsoft Azure. These services offer GPU options that can handle the computational requirements of running large models like IndoBERT.
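As a follow-up to the contextual-representation step above, a common pattern is to mean-pool the last hidden states into a fixed-size sentence vector. This is a sketch, not part of the official IndoBERT examples; the embed helper and the mean-pooling strategy are illustrative choices.

```python
import torch
from transformers import AutoModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("indobenchmark/indobert-base-p1")
model = AutoModel.from_pretrained("indobenchmark/indobert-base-p1")
model.eval()

def embed(sentences):
    # Tokenize a batch of sentences with padding so they share one tensor.
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        last_hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden)
    # Mean-pool over real tokens only, using the attention mask to ignore padding.
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)

vectors = embed(["aku adalah anak gembala", "budi pergi ke pasar"])
print(vectors.shape)  # torch.Size([2, 768]) for the base model
```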
License
The IndoBERT model is released under the MIT license, allowing for wide usage and distribution with minimal restrictions.