INDOBERTWEET 🐦

Introduction

IndoBERTweet is a large-scale pretrained language model tailored for Indonesian Twitter data. It extends an Indonesian BERT model with additional domain-specific vocabulary, improving efficiency and effectiveness in processing Twitter content.

Architecture

The model initializes the embeddings of new domain-specific vocabulary by average-pooling the BERT subword embeddings of each new word, rather than initializing them from scratch or projecting word2vec vectors. This initialization improves both efficiency and performance on social media text.
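
As a rough illustration of this initialization scheme (a minimal sketch assuming a Hugging Face tokenizer and direct access to the model's input embedding matrix; new_token_embedding is a hypothetical helper, not the authors' code):

    def new_token_embedding(word, tokenizer, embedding_matrix):
        # `embedding_matrix` is the model's input embedding weight,
        # e.g. bert_model.get_input_embeddings().weight (a torch.Tensor).
        # Split the new domain-specific word into subwords known to the
        # original general-domain BERT tokenizer.
        subword_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
        # Average-pool the existing subword embeddings to obtain the
        # initial embedding for the new vocabulary item.
        return embedding_matrix[subword_ids].mean(dim=0)

Each vector produced this way would be appended to the embedding matrix before pretraining continues on the tweet corpus.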

Training

IndoBERTweet was trained using Indonesian tweets collected over a year (December 2019 to December 2020) via Twitter's official API. The dataset comprises 409 million word tokens, significantly larger than previous datasets used for IndoBERT. Due to Twitter's policy, this data is not publicly available.

Guide: Running Locally

To use IndoBERTweet, follow these steps:

  1. Install Transformers Library:
    Ensure you have transformers==3.5.1 installed, for example via pip:
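
    pip install transformers==3.5.1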

  2. Load Model and Tokenizer:

    from transformers import AutoTokenizer, AutoModel

    # Download the uncased IndoBERTweet checkpoint from the Hugging Face Hub
    tokenizer = AutoTokenizer.from_pretrained("indolem/indobertweet-base-uncased")
    model = AutoModel.from_pretrained("indolem/indobertweet-base-uncased")
    
  3. Preprocessing (a sketch of these steps follows the list):

    • Convert all text to lowercase.
    • Replace user mentions with @USER and URLs with HTTPURL.
    • Use the emoji package to translate emoticons into text.
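
A minimal sketch of these preprocessing steps, assuming simple regular expressions for mentions and URLs (the patterns and the example tweet are illustrative, not the authors' exact pipeline) and the demojize function from the emoji package:

    import re
    import emoji  # pip install emoji

    def preprocess(tweet: str) -> str:
        tweet = tweet.lower()                              # the model is uncased
        tweet = re.sub(r"@\w+", "@USER", tweet)            # mask user mentions
        tweet = re.sub(r"https?://\S+", "HTTPURL", tweet)  # mask URLs
        return emoji.demojize(tweet)                       # e.g. 🐦 -> :bird:

    # Feed a preprocessed tweet to the model loaded in step 2
    inputs = tokenizer(preprocess("Halo semua! 🐦 https://t.co/contoh"), return_tensors="pt")
    outputs = model(**inputs)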

For faster training and inference, consider using cloud GPUs such as those offered by Google Cloud or AWS.

License

IndoBERTweet is licensed under the Apache License 2.0.
