vinai/bertweet-base
BERTweet: A Pre-Trained Language Model for English Tweets
Introduction
BERTweet is the first public large-scale language model pre-trained specifically for English Tweets. Trained with the RoBERTa pre-training procedure, BERTweet uses a corpus of 850 million English Tweets totaling approximately 80GB of uncompressed text: 845 million Tweets streamed from January 2012 to August 2019 plus 5 million Tweets related to the COVID-19 pandemic. Detailed information and experimental results are documented in the associated research paper.
Architecture
BERTweet uses the BERT-base architecture (12 Transformer layers, 768 hidden units, 12 attention heads) and is trained with the RoBERTa pre-training procedure, i.e., masked language modeling without the next-sentence-prediction objective. This robust training approach helps the model handle the informal, noisy language characteristic of Twitter data.
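For reference, the key architectural hyperparameters can be read directly from the model's published configuration. The following is a minimal sketch, assuming the Hugging Face transformers library is installed and the Model Hub is reachable:

```python
from transformers import AutoConfig

# Download the configuration of vinai/bertweet-base from the Hugging Face Hub
config = AutoConfig.from_pretrained("vinai/bertweet-base")

print(config.num_hidden_layers)    # number of Transformer layers
print(config.hidden_size)          # hidden dimension per token
print(config.num_attention_heads)  # attention heads per layer
print(config.vocab_size)           # size of the BPE vocabulary
```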
Training
Pre-training uses a corpus of roughly 16 billion word tokens drawn from the 850 million Tweets described above. The Tweets were collected over a broad time span and include a dedicated COVID-19 subset, which keeps the model's vocabulary and representations relevant to contemporary topics.
Guide: Running Locally
To run BERTweet locally, follow these basic steps:
- Environment Setup: Install Python together with PyTorch (or TensorFlow) and the transformers library.
- Download the Model: Pull vinai/bertweet-base from the Hugging Face Model Hub; the weights are downloaded automatically on first use.
- Load the Model: Use the transformers library in Python to load BERTweet and its tokenizer.
- Inference: Run inference on sample Tweet data to test NLP tasks such as sentiment analysis or named entity recognition, as shown in the sketch after this list.
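As a concrete starting point, the sketch below loads BERTweet with the transformers library and extracts contextual embeddings for a single Tweet. The normalization=True flag (which asks the BERTweet tokenizer to normalize raw Tweets and requires the optional emoji package) and the example Tweet are illustrative assumptions, not part of the original guide:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the tokenizer and model from the Hugging Face Hub.
# normalization=True normalizes raw Tweets before BPE segmentation
# (user mentions -> @USER, URLs -> HTTPURL, emoji -> text strings).
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True)
model = AutoModel.from_pretrained("vinai/bertweet-base")
model.eval()

tweet = "SC has first two presumptive cases of coronavirus, DHEC confirms via @user https://t.co/abc"
inputs = tokenizer(tweet, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Contextual embeddings for each (sub)token: shape [1, sequence_length, 768]
print(outputs.last_hidden_state.shape)
```

From these token-level embeddings, a task-specific head (e.g., a classifier for sentiment analysis or a tagger for named entity recognition) can be added and fine-tuned on labeled data.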
For optimal performance, consider using cloud GPUs from providers such as AWS, Google Cloud, or Azure.
License
BERTweet is released under the MIT License, which permits broad use, modification, and redistribution within the terms of the license.