IndoBERT-Large-P2
indobenchmark/indobert-large-p2
Introduction
IndoBERT-Large-P2 is a state-of-the-art language model for Indonesian, based on the BERT architecture. It is designed for tasks involving natural language understanding and is pretrained using the masked language modeling (MLM) and next sentence prediction (NSP) objectives.
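Because the model is pretrained with MLM, a quick sanity check is to predict a masked token. The snippet below is a minimal sketch assuming the published checkpoint ships with the masked-language-modeling head; if it does not, Transformers will warn that the head is newly initialized.

# Minimal MLM sanity check; assumes the checkpoint includes the MLM head.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="indobenchmark/indobert-large-p2")
for prediction in fill_mask("aku adalah anak [MASK]"):
    print(prediction["token_str"], round(prediction["score"], 4))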
Architecture
The IndoBERT model is part of the IndoBenchmark suite, featuring various model sizes and configurations. IndoBERT-Large-P2 has 335.2 million parameters and is trained on the Indo4B dataset, comprising 23.43 GB of Indonesian text.
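The architecture details can be confirmed locally from the checkpoint's configuration. The sketch below prints the standard BERT hyperparameters and a parameter count; counting only the encoder loaded by AutoModel may give a slightly different figure than the headline 335.2 million, which may include the pretraining heads.

# Inspect the config and count parameters of the downloaded checkpoint.
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained("indobenchmark/indobert-large-p2")
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)

model = AutoModel.from_pretrained("indobenchmark/indobert-large-p2")
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")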
Training
Like the other IndoBERT checkpoints, IndoBERT-Large-P2 is pretrained on Indo4B with the MLM and NSP objectives. It belongs to a family of models ranging from base to large configurations, plus lite variants with substantially fewer parameters.
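To make the MLM objective concrete, here is a hedged sketch of a single masked-language-modeling step on a toy batch using DataCollatorForLanguageModeling. This is not the original Indo4B pretraining pipeline, and the MLM head weights are reused only if they are present in the checkpoint.

# Toy MLM step; NOT the original Indo4B pretraining setup.
from transformers import BertTokenizer, BertForMaskedLM, DataCollatorForLanguageModeling

tokenizer = BertTokenizer.from_pretrained("indobenchmark/indobert-large-p2")
model = BertForMaskedLM.from_pretrained("indobenchmark/indobert-large-p2")

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
texts = ["budi pergi ke pasar", "aku adalah anak gembala"]  # toy corpus
batch = collator([tokenizer(t, truncation=True, max_length=128) for t in texts])
loss = model(**batch).loss  # cross-entropy over the masked tokens
loss.backward()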
Guide: Running Locally
Basic Steps
- Install the Transformers library: ensure that you have the Hugging Face Transformers library installed, along with PyTorch, which the examples below use.

pip install transformers torch
- Load the model and tokenizer:

from transformers import BertTokenizer, AutoModel

tokenizer = BertTokenizer.from_pretrained("indobenchmark/indobert-large-p2")
model = AutoModel.from_pretrained("indobenchmark/indobert-large-p2")
- Extract a contextual representation (a pooling sketch follows these steps):

import torch

x = torch.LongTensor(tokenizer.encode('aku adalah anak [MASK]')).view(1, -1)
print(x, model(x)[0].sum())
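The last step sums the hidden states only to show that the forward pass works. A more typical follow-up is to pool the last hidden states into a fixed-size sentence embedding; the mean-pooling strategy below is an assumption for illustration, not something the model card prescribes.

# Attention-mask-aware mean pooling into a sentence embedding (assumed strategy).
import torch
from transformers import BertTokenizer, AutoModel

tokenizer = BertTokenizer.from_pretrained("indobenchmark/indobert-large-p2")
model = AutoModel.from_pretrained("indobenchmark/indobert-large-p2")
model.eval()

inputs = tokenizer("aku adalah anak gembala", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state        # (1, seq_len, 1024)
mask = inputs["attention_mask"].unsqueeze(-1)         # (1, seq_len, 1)
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (1, 1024)
print(embedding.shape)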
Cloud GPUs
To facilitate efficient model training and inference, consider using cloud GPUs such as those offered by Google Cloud Platform, AWS, or Azure.
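As a hedged sketch of inference on such an instance, the snippet below moves the model to the GPU when one is available; the half-precision load is an optional assumption for speed, not a requirement of the model.

# GPU inference sketch; fp16 is an optional choice, not required by the model.
import torch
from transformers import AutoModel, BertTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

tokenizer = BertTokenizer.from_pretrained("indobenchmark/indobert-large-p2")
model = AutoModel.from_pretrained("indobenchmark/indobert-large-p2", torch_dtype=dtype).to(device)

inputs = tokenizer("aku adalah anak gembala", return_tensors="pt").to(device)
with torch.no_grad():
    print(model(**inputs).last_hidden_state.shape)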
License
IndoBERT-Large-P2 is released under the MIT License, allowing for wide usage and distribution.