IndoBERT-Large-P2

Introduction

IndoBERT-Large-P2 is a state-of-the-art language model for Indonesian, based on the BERT architecture. It is designed for tasks involving natural language understanding and is pretrained using the masked language modeling (MLM) and next sentence prediction (NSP) objectives.

Architecture

The IndoBERT model is part of the IndoBenchmark suite, featuring various model sizes and configurations. IndoBERT-Large-P2 has 335.2 million parameters and is trained on the Indo4B dataset, comprising 23.43 GB of Indonesian text.
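
A quick way to sanity-check the reported size locally is to load the encoder and count its parameters. The snippet below is a sketch that uses the checkpoint name from this card; the printed figure should be close to, though not necessarily exactly, 335.2 million.

    from transformers import AutoModel

    # Download the encoder weights and count the parameters
    model = AutoModel.from_pretrained("indobenchmark/indobert-large-p2")
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{n_params / 1e6:.1f}M parameters")  # roughly 335M expected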

Training

IndoBERT-Large-P2 is trained on the Indo4B dataset using the MLM and NSP objectives. It belongs to a series of pretrained IndoBERT models ranging from base to large configurations, including lite variants with reduced parameter counts.
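
Because the checkpoint was trained with an MLM objective, its masked-token predictions can be probed directly. The sketch below uses the Transformers fill-mask pipeline; it assumes the published weights include the masked-LM head (if they do not, the head would be randomly initialized and the scores meaningless).

    from transformers import pipeline

    # Predict the most likely replacements for the [MASK] token
    fill_mask = pipeline("fill-mask", model="indobenchmark/indobert-large-p2")
    for pred in fill_mask("aku adalah anak [MASK]"):
        print(pred["token_str"], round(pred["score"], 3))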

Guide: Running Locally

Basic Steps

  1. Install the Transformers Library: Ensure that the Hugging Face Transformers library is installed.

    pip install transformers
    
  2. Load the Model and Tokenizer:

    from transformers import BertTokenizer, AutoModel

    # Load the IndoBERT-Large-P2 tokenizer and encoder weights from the Hugging Face Hub
    tokenizer = BertTokenizer.from_pretrained("indobenchmark/indobert-large-p2")
    model = AutoModel.from_pretrained("indobenchmark/indobert-large-p2")
    
  3. Extract Contextual Representation (a pooling follow-up is sketched after these steps):

    import torch

    # Encode the sentence, add a batch dimension, and sum the final hidden states
    x = torch.LongTensor(tokenizer.encode('aku adalah anak [MASK]')).view(1, -1)
    print(x, model(x)[0].sum())
    
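Step 3 returns per-token hidden states. One common follow-up, shown here as an illustration rather than an official IndoBenchmark recipe, is to mean-pool those states over non-padding tokens to obtain a fixed-size sentence embedding; the snippet reuses the tokenizer and model loaded in step 2.

    import torch

    # Tokenize a sentence and mean-pool the final hidden states into one vector
    inputs = tokenizer("aku adalah anak gembala", return_tensors="pt")
    with torch.no_grad():
        last_hidden = model(**inputs).last_hidden_state   # (1, seq_len, 1024)
    mask = inputs["attention_mask"].unsqueeze(-1)          # (1, seq_len, 1)
    embedding = (last_hidden * mask).sum(1) / mask.sum(1)  # (1, 1024)
    print(embedding.shape)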

Cloud GPUs

To facilitate efficient model training and inference, consider using cloud GPUs such as those offered by Google Cloud Platform, AWS, or Azure.
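
On such an instance, the model and inputs only need to be moved onto the GPU device. The sketch below assumes CUDA is available and reuses the tokenizer and model loaded in the guide above.

    import torch

    # Move the model and a tokenized batch onto the GPU, then run inference
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    inputs = tokenizer("aku adalah anak gembala", return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    print(outputs.last_hidden_state.device)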

License

IndoBERT-Large-P2 is released under the MIT License, allowing for wide usage and distribution.
