HeBERT

avichr

Introduction

HeBERT is a Hebrew pretrained language model based on Google's BERT architecture. It is designed for polarity analysis and emotion recognition tasks. The model utilizes a BERT-Base configuration as described by Devlin et al. (2018).

Architecture

HeBERT is built on the architecture of BERT (Bidirectional Encoder Representations from Transformers), specifically the BERT-Base configuration, adapted to Hebrew with a focus on sentiment analysis and emotion recognition.
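
The BERT-Base layout means 12 transformer layers, 768-dimensional hidden states, and 12 attention heads. As a minimal sketch, these values can be read from the published model configuration, assuming the transformers library and the avichr/heBERT checkpoint used in the guide below:

    from transformers import AutoConfig
    
    # Fetch the published configuration for HeBERT
    config = AutoConfig.from_pretrained("avichr/heBERT")
    
    # BERT-Base layout: 12 layers, 768 hidden units, 12 attention heads
    print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)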

Training

HeBERT was trained using three datasets:

  1. A Hebrew version of OSCAR, comprising about 9.8 GB of data, 1 billion words, and over 20.8 million sentences.
  2. A Hebrew dump of Wikipedia, consisting of approximately 650 MB of data, 63 million words, and 3.8 million sentences.
  3. Emotion UGC data, gathered from comments on articles published on three major news sites between January 2020 and August 2020. This dataset contains about 150 MB of data, 7 million words, and 350K sentences, a subset of which was annotated for emotions and overall sentiment.

Each annotated sentence was labeled by multiple annotators, and inter-annotator reliability was validated with Krippendorff's alpha.
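
For illustration, Krippendorff's alpha over such multi-annotator labels can be computed with the third-party krippendorff package; the snippet below is a sketch with invented labels, not the HeBERT annotation data:

    import numpy as np
    import krippendorff  # pip install krippendorff
    
    # Rows = annotators, columns = sentences; values are nominal emotion
    # labels, with np.nan marking sentences an annotator did not label.
    reliability_data = [
        [1, 1, 0, 2, np.nan],  # annotator 1
        [1, 1, 0, 2, 2],       # annotator 2
        [1, 0, 0, 2, 2],       # annotator 3
    ]
    
    alpha = krippendorff.alpha(reliability_data=reliability_data,
                               level_of_measurement="nominal")
    print(f"Krippendorff's alpha: {alpha:.3f}")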

Guide: Running Locally

To run HeBERT locally, follow these steps:

  1. Install Transformers Library:

    pip install transformers
    
  2. Load the Model for Masked-LM or Sentiment Classification:

    from transformers import (
        AutoModelForMaskedLM,
        AutoModelForSequenceClassification,
        AutoTokenizer,
        pipeline,
    )
    
    # For Masked-LM
    tokenizer = AutoTokenizer.from_pretrained("avichr/heBERT")
    model = AutoModelForMaskedLM.from_pretrained("avichr/heBERT")
    fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
    
    # For Sentiment Classification
    tokenizer = AutoTokenizer.from_pretrained("avichr/heBERT_sentiment_analysis")
    model = AutoModelForSequenceClassification.from_pretrained("avichr/heBERT_sentiment_analysis")
    sentiment_analysis = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer, return_all_scores=True)
    
  3. Example Usage:

    # Fill Mask Example ("The coronavirus took the [MASK] and we have nothing left.")
    result = fill_mask("הקורונה לקחה את [MASK] ולנו לא נשאר דבר.")
    
    # Sentiment Analysis Example ("I am debating what to eat for lunch");
    # a sketch for interpreting 'scores' follows this list
    scores = sentiment_analysis('אני מתלבט מה לאכול לארוחת צהריים')
    
  4. Cloud GPUs: For faster inference, consider cloud GPU services such as AWS, GCP, or Azure.
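
With return_all_scores=True, the sentiment pipeline returns one list per input sentence, each containing a label/score dict for every class. A minimal sketch for picking the top label per sentence, assuming the scores variable from step 3:

    # Each entry of 'scores' is a list of {'label': ..., 'score': ...} dicts,
    # one per sentiment class; take the highest-scoring label per sentence.
    for per_sentence in scores:
        best = max(per_sentence, key=lambda d: d["score"])
        print(best["label"], round(best["score"], 3))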

License

If you use the model, please cite the following work:

Chriqui, A., & Yahav, I. (2022). HeBERT & HebEMO: a Hebrew BERT Model and a Tool for Polarity Analysis and Emotion Recognition. INFORMS Journal on Data Science, forthcoming.

@article{chriqui2021hebert,
  title={HeBERT \& HebEMO: a Hebrew BERT Model and a Tool for Polarity Analysis and Emotion Recognition},
  author={Chriqui, Avihay and Yahav, Inbal},
  journal={INFORMS Journal on Data Science},
  year={2022}
}
