TwHIN-BERT Large
Introduction
TwHIN-BERT is a socially-enriched, pre-trained language model designed for multilingual Tweet representation. It is trained using a combination of text-based self-supervision and a social objective, leveraging the social interactions present in the Twitter Heterogeneous Information Network (TwHIN). TwHIN-BERT is applicable to both natural language processing (NLP) and social recommendation tasks, outperforming similar models in these areas.
Architecture
TwHIN-BERT is available in two pre-trained model sizes: base and large. The base model contains 280 million parameters, while the large model contains 550 million parameters. Both checkpoints follow Hugging Face's standard BERT interface and can serve as drop-in replacements for existing BERT models in downstream applications.
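As a rough illustration of that compatibility, the sketch below loads either checkpoint through the standard Auto classes and reads a couple of BERT-style configuration fields; the companion ID 'Twitter/twhin-bert-base' is assumed here to name the base checkpoint.

  from transformers import AutoConfig, AutoModel, AutoTokenizer

  # Either size should slot in wherever a Hugging Face BERT encoder is expected.
  model_id = "Twitter/twhin-bert-large"  # or "Twitter/twhin-bert-base" (assumed base checkpoint ID)

  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModel.from_pretrained(model_id)

  config = AutoConfig.from_pretrained(model_id)
  print(config.hidden_size, config.num_hidden_layers)  # standard BERT-style hyperparameters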
Training
The model is pre-trained on 7 billion Tweets across more than 100 languages. It uses masked language modeling (MLM) for text-based self-supervision, complemented by a social objective derived from Twitter's social interactions, enhancing its performance on both semantic understanding and social recommendation tasks.
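As a minimal, illustrative sketch of the text side of this training (the MLM objective; the social objective is only applied during pre-training), the snippet below uses the fill-mask pipeline to predict a masked token, assuming the hosted checkpoint ships its masked-language-modeling head. The example sentence is purely illustrative.

  from transformers import pipeline

  # Masked language modeling: the model predicts the token hidden behind the mask.
  fill_mask = pipeline("fill-mask", model="Twitter/twhin-bert-large")
  masked_text = f"The weather today is {fill_mask.tokenizer.mask_token}."

  for prediction in fill_mask(masked_text):
      print(prediction["token_str"], round(prediction["score"], 3))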
Guide: Running Locally
To use TwHIN-BERT with the Hugging Face Transformers library:
- Install the Transformers library:

  pip install transformers

- Load the tokenizer and model:

  from transformers import AutoTokenizer, AutoModel

  tokenizer = AutoTokenizer.from_pretrained('Twitter/twhin-bert-large')
  model = AutoModel.from_pretrained('Twitter/twhin-bert-large')

- Tokenize input text and obtain model outputs (a sketch for pooling these outputs into a single Tweet embedding follows this list):

  inputs = tokenizer("I'm using TwHIN-BERT! #TwHIN-BERT #NLP", return_tensors="pt")
  outputs = model(**inputs)
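The outputs object holds one hidden-state vector per token. One common (though not the only) way to reduce these to a single Tweet embedding is attention-masked mean pooling; the sketch below reuses the model and inputs from the steps above and assumes mean pooling is acceptable for the task at hand.

  import torch

  # Re-run the forward pass without gradients, then mean-pool the per-token
  # hidden states, ignoring padding positions via the attention mask.
  with torch.no_grad():
      outputs = model(**inputs)

  mask = inputs["attention_mask"].unsqueeze(-1).float()
  embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
  print(embedding.shape)  # (1, hidden_size)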
For enhanced performance, consider using cloud GPU services such as AWS, GCP, or Azure.
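If a GPU is available, whether locally or from one of those providers, the model and tokenized inputs from the guide above can be moved onto it in the usual PyTorch way; a minimal sketch:

  import torch

  device = "cuda" if torch.cuda.is_available() else "cpu"
  model = model.to(device)                               # model loaded in the guide above
  inputs = {k: v.to(device) for k, v in inputs.items()}  # move the tokenized batch as well
  outputs = model(**inputs)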
License
TwHIN-BERT is licensed under the Apache 2.0 license, permitting broad use, modification, and distribution with appropriate attribution.