BERT-Base Indonesian 1.5G

cahya

Introduction

The BERT-Base Indonesian 1.5G model is a BERT-base model pre-trained for the Indonesian language with a masked language modeling (MLM) objective on uncased text. It was trained on Indonesian Wikipedia and newspaper corpora and is intended as a base model for tasks such as text classification and text generation.

Architecture

The model uses the BERT-base architecture: a stack of 12 transformer encoder layers that build a contextual representation of each word from the words around it. Input text is lowercased and tokenized with the WordPiece method, using a vocabulary of 32,000 tokens.
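WordPiece splits an out-of-vocabulary word into the longest subword pieces it can find, marking word-internal pieces with a `##` prefix. A toy sketch of the greedy longest-match-first algorithm (the tiny vocabulary below is invented for illustration; the real model ships a 32,000-token vocabulary):

```python
# Toy greedy longest-match-first WordPiece tokenizer.
# VOCAB is a made-up illustration, not the model's real 32k vocabulary.
VOCAB = {"ber", "##main", "ibu", "di", "[UNK]"}

def wordpiece(word, vocab=VOCAB):
    """Split one lowercased word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # word-internal pieces get a ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

print(wordpiece("bermain"))  # ['ber', '##main']
print(wordpiece("ibu"))      # ['ibu']
```

In-vocabulary words come back whole; unknown words either split into known pieces or fall back to `[UNK]`.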

Training

The model was pre-trained on 522 MB of Indonesian Wikipedia and 1 GB of Indonesian newspaper text. The text was lowercased and tokenized with WordPiece. Model inputs are framed with special tokens: [CLS] marks the start of a sequence and [SEP] separates (and terminates) sentences, so a sentence pair becomes [CLS] Sentence A [SEP] Sentence B [SEP].
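The [CLS]/[SEP] layout can be sketched as a small helper. The token lists here are pre-split for illustration; the real tokenizer produces WordPiece pieces:

```python
def build_input(tokens_a, tokens_b=None):
    """Assemble a BERT-style input: [CLS] A [SEP] (then B [SEP] if given)."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)  # segment 0 covers sentence A
    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # segment 1 covers sentence B
    return tokens, segment_ids

tokens, segs = build_input(["ibu", "ku"], ["sedang", "bekerja"])
print(tokens)  # ['[CLS]', 'ibu', 'ku', '[SEP]', 'sedang', 'bekerja', '[SEP]']
print(segs)    # [0, 0, 0, 0, 1, 1, 1]
```

In practice the tokenizer in the Transformers library adds these special tokens automatically.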

Guide: Running Locally

To use this model locally, you can follow these steps:

  1. Install the Transformers library, along with PyTorch or TensorFlow depending on which backend you plan to use in the steps below:

    pip install transformers
    
  2. Use the model for masked language modeling:

    from transformers import pipeline
    unmasker = pipeline('fill-mask', model='cahya/bert-base-indonesian-1.5G')
    result = unmasker("Ibu ku sedang bekerja [MASK] supermarket")
    
  3. Extract features using PyTorch:

    from transformers import BertTokenizer, BertModel
    tokenizer = BertTokenizer.from_pretrained('cahya/bert-base-indonesian-1.5G')
    model = BertModel.from_pretrained('cahya/bert-base-indonesian-1.5G')
    text = "Silakan diganti dengan text apa saja."
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    
  4. Extract features using TensorFlow:

    from transformers import BertTokenizer, TFBertModel
    tokenizer = BertTokenizer.from_pretrained('cahya/bert-base-indonesian-1.5G')
    model = TFBertModel.from_pretrained('cahya/bert-base-indonesian-1.5G')
    text = "Silakan diganti dengan text apa saja."
    encoded_input = tokenizer(text, return_tensors='tf')
    output = model(encoded_input)
    
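The fill-mask pipeline in step 2 returns a list of candidate dictionaries, each with a probability score and the predicted token. The sample below is hypothetical, not actual model output; picking the best candidate can be sketched as:

```python
# Hypothetical fill-mask output; real scores and tokens come from the model.
results = [
    {"score": 0.62, "token_str": "di", "sequence": "ibu ku sedang bekerja di supermarket"},
    {"score": 0.21, "token_str": "dalam", "sequence": "ibu ku sedang bekerja dalam supermarket"},
]

def best_prediction(results):
    """Return the highest-scoring candidate from a fill-mask result list."""
    return max(results, key=lambda r: r["score"])

top = best_prediction(results)
print(top["token_str"], round(top["score"], 2))  # di 0.62
```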

For faster performance, consider using cloud GPU services such as AWS, Google Cloud, or Azure.
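The feature-extraction calls in steps 3 and 4 return token-level hidden states (`output.last_hidden_state`, of shape `[batch, tokens, hidden]`). A common way to reduce these to one fixed-size sentence vector is to average over the token axis; a framework-free sketch on made-up numbers:

```python
def mean_pool(hidden_states):
    """Average a [tokens][hidden] matrix over the token axis."""
    n, dim = len(hidden_states), len(hidden_states[0])
    return [sum(tok[d] for tok in hidden_states) / n for d in range(dim)]

# Made-up hidden states for 3 tokens with hidden size 4
# (the real BERT-base hidden size is 768).
hidden = [
    [0.0, 1.0, 2.0, 3.0],
    [2.0, 1.0, 0.0, 1.0],
    [4.0, 1.0, 1.0, 2.0],
]
print(mean_pool(hidden))  # [2.0, 1.0, 1.0, 2.0]
```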

License

The BERT-Base Indonesian 1.5G model is licensed under the MIT License, allowing for wide usage and modification with minimal restrictions.
