HeBERT
Introduction
HeBERT is a Hebrew pretrained language model based on Google's BERT architecture. It is designed for polarity analysis and emotion recognition tasks. The model utilizes a BERT-Base configuration as described by Devlin et al. (2018).
Architecture
HeBERT is built on the BERT (Bidirectional Encoder Representations from Transformers) architecture, specifically the BERT-Base model configuration. It applies this architecture to Hebrew, with a focus on sentiment (polarity) analysis and emotion recognition.
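Because HeBERT follows the BERT-Base configuration, the standard hyperparameters (12 transformer layers, hidden size 768, 12 attention heads) can be read directly from the published checkpoint. The snippet below is a minimal sketch that only assumes the avichr/heBERT checkpoint is reachable from the Hugging Face Hub:

from transformers import AutoConfig

# Inspect the architecture of the published HeBERT checkpoint.
config = AutoConfig.from_pretrained("avichr/heBERT")
print(config.num_hidden_layers)    # 12 layers (BERT-Base)
print(config.hidden_size)          # 768-dimensional hidden states
print(config.num_attention_heads)  # 12 attention heads
print(config.vocab_size)           # size of the Hebrew wordpiece vocabulary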
Training
HeBERT was trained using three datasets:
- A Hebrew version of OSCAR, comprising about 9.8 GB of data, 1 billion words, and over 20.8 million sentences.
- A Hebrew dump of Wikipedia, consisting of approximately 650 MB of data, 63 million words, and 3.8 million sentences.
- Emotion UGC data, collected from comments on articles from three major news sites between January 2020 and August 2020. This dataset contains about 150 MB of data, 7 million words, and 350K sentences, a subset of which is annotated for emotions and sentiment (polarity).
Each annotated sentence was labeled by multiple annotators, and inter-annotator reliability was validated with Krippendorff's alpha.
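For context, Krippendorff's alpha measures agreement across annotators while accounting for chance and missing ratings. The snippet below is a minimal sketch of such a reliability check using the open-source krippendorff Python package with made-up labels; it is illustrative only and not the project's actual validation code.

# pip install krippendorff
import numpy as np
import krippendorff

# Hypothetical example: 3 annotators rate 5 sentences for sentiment
# (0 = negative, 1 = neutral, 2 = positive); np.nan marks a missing rating.
ratings = np.array([
    [2, 2, 1, 0, 0],
    [2, 2, 1, 0, 1],
    [2, 1, 1, 0, np.nan],
])

# Rows are annotators, columns are sentences; sentiment labels are nominal.
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.3f}")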
Guide: Running Locally
To run HeBERT locally, follow these steps:
- Install the Transformers library:

  pip install transformers
- Load the model for masked-LM or sentiment classification (the task-specific Auto classes load the correct model heads for the pipelines):

  from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoModelForSequenceClassification, pipeline

  # Masked-LM
  tokenizer = AutoTokenizer.from_pretrained("avichr/heBERT")
  model = AutoModelForMaskedLM.from_pretrained("avichr/heBERT")
  fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

  # Sentiment classification
  tokenizer = AutoTokenizer.from_pretrained("avichr/heBERT_sentiment_analysis")
  model = AutoModelForSequenceClassification.from_pretrained("avichr/heBERT_sentiment_analysis")
  sentiment_analysis = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer, return_all_scores=True)
- Example usage:

  # Fill-mask example
  result = fill_mask("הקורונה לקחה את [MASK] ולנו לא נשאר דבר.")

  # Sentiment analysis example
  scores = sentiment_analysis("אני מתלבט מה לאכול לארוחת צהריים")
- Cloud GPUs: For faster processing, consider cloud GPU services such as AWS, GCP, or Azure.
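Putting the steps above together, a minimal end-to-end script might look like the following. Passing the model id strings directly to pipeline() is an equivalent alternative that lets the pipeline pick the task head itself; note that the exact output format (and the return_all_scores argument, which newer transformers releases replace with top_k=None) can vary across versions.

from transformers import pipeline

# Masked-LM: the pipeline loads the masked-LM head from the model id.
fill_mask = pipeline("fill-mask", model="avichr/heBERT", tokenizer="avichr/heBERT")
for pred in fill_mask("הקורונה לקחה את [MASK] ולנו לא נשאר דבר."):
    # Each prediction is a dict with the candidate token and its score.
    print(pred["token_str"], round(pred["score"], 3))

# Sentiment classification: return the score of every class for the input.
sentiment_analysis = pipeline(
    "sentiment-analysis",
    model="avichr/heBERT_sentiment_analysis",
    tokenizer="avichr/heBERT_sentiment_analysis",
    return_all_scores=True,
)
print(sentiment_analysis("אני מתלבט מה לאכול לארוחת צהריים"))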
Citation
If you use the model, please cite:
Chriqui, A., & Yahav, I. (2022). HeBERT & HebEMO: a Hebrew BERT Model and a Tool for Polarity Analysis and Emotion Recognition. INFORMS Journal on Data Science, forthcoming.
@article{chriqui2021hebert,
title={HeBERT \& HebEMO: a Hebrew BERT Model and a Tool for Polarity Analysis and Emotion Recognition},
author={Chriqui, Avihay and Yahav, Inbal},
journal={INFORMS Journal on Data Science},
year={2022}
}