IndoBERTweet 🐦
Model: indolem/indobertweet-base-uncased
Introduction
IndoBERTweet is a large-scale pretrained language model tailored for Indonesian Twitter data. It extends an Indonesian BERT model with additional domain-specific vocabulary, improving efficiency and effectiveness in processing Twitter content.
Architecture
The model utilizes average-pooling of BERT subword embeddings to initialize domain-specific vocabulary, rather than starting from scratch or using word2vec projections. This approach enhances performance on social media text.
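The average-pooling initialization described above can be sketched as follows. This is a minimal illustration with a toy NumPy embedding table, not the actual IndoBERTweet initialization code: each new domain-specific word's embedding is set to the mean of the embeddings of the subwords it was previously split into.

```python
import numpy as np

def init_new_embeddings(base_emb, subword_ids_per_word):
    """Initialize rows for new vocabulary items by average-pooling
    the base model's embeddings of their constituent subwords."""
    new_rows = np.stack(
        [base_emb[ids].mean(axis=0) for ids in subword_ids_per_word]
    )
    # Append the new rows to the existing embedding matrix.
    return np.vstack([base_emb, new_rows])

# Toy base embedding table: 5 subwords, embedding dimension 4.
base = np.arange(20, dtype=float).reshape(5, 4)

# Two hypothetical new words, each formerly tokenized into known subwords.
extended = init_new_embeddings(base, [[0, 1], [2, 3, 4]])
```

In the extended table, the first new row is the mean of subword rows 0 and 1, and the second is the mean of rows 2, 3, and 4, so the new words start from an informed position rather than random vectors.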
Training
IndoBERTweet was trained using Indonesian tweets collected over a year (December 2019 to December 2020) via Twitter's official API. The dataset comprises 409 million word tokens, significantly larger than previous datasets used for IndoBERT. Due to Twitter's policy, this data is not publicly available.
Guide: Running Locally
To use IndoBERTweet, follow these steps:

1. Install the Transformers library. Ensure you have `transformers==3.5.1` installed.

2. Load the model and tokenizer:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("indolem/indobertweet-base-uncased")
model = AutoModel.from_pretrained("indolem/indobertweet-base-uncased")
```

3. Preprocess the input text:
   - Convert all text to lowercase.
   - Replace user mentions with `@USER` and URLs with `HTTPURL`.
   - Use the `emoji` package to translate emoticons into text.
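The preprocessing steps above can be sketched as a single helper function. The regular expressions here are simplified assumptions for illustration, not the exact patterns used to build IndoBERTweet's training data, and the `emoji` step is skipped if the package is not installed:

```python
import re

try:
    import emoji  # optional third-party package: pip install emoji
except ImportError:
    emoji = None

def preprocess_tweet(text: str) -> str:
    """Lowercase, mask user mentions and URLs, and translate emoji,
    following the preprocessing described above."""
    text = text.lower()
    text = re.sub(r"@\w+", "@USER", text)          # mask user mentions
    text = re.sub(r"https?://\S+", "HTTPURL", text)  # mask URLs
    if emoji is not None:
        text = emoji.demojize(text)  # e.g. translate emoticons to ":bird:"
    return text
```

For example, `preprocess_tweet("Halo @JohnDoe lihat https://t.co/abc")` yields `"halo @USER lihat HTTPURL"`.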
For enhanced performance, consider using cloud GPUs such as those offered by Google Cloud or AWS.
License
IndoBERTweet is licensed under the Apache License 2.0.