alephbert base
onlplab
Introduction
ALEPHBERT is a state-of-the-art language model designed specifically for Hebrew, based on Google's BERT architecture. It aims to provide robust language understanding capabilities for Hebrew text.
Architecture
ALEPHBERT uses the BERT architecture described by Devlin et al. (2018): a bidirectional Transformer encoder that produces a contextual representation for each token of the Hebrew input.
Training
The model was trained using the Hugging Face training procedure on a DGX machine with 8 V100 GPUs. The training data included:
- OSCAR's Hebrew section with 10 GB of text, consisting of 20 million sentences.
- The Hebrew dump of Wikipedia, containing 650 MB of text and 3 million sentences.
- Hebrew Tweets from the Twitter sample stream, providing 7 GB of text and 70 million sentences.
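To get a feel for the overall corpus, the figures listed above can be tallied with a short script. This is only a sketch of the arithmetic: the per-source numbers are copied from the list, the labels and variable names are illustrative, and 1 GB is assumed to equal 1000 MB.

```python
# Corpus figures as listed above: (name, size in MB, sentences in millions).
# Assumption: 1 GB is counted as 1000 MB.
CORPORA = [
    ("OSCAR (Hebrew)", 10_000, 20),
    ("Wikipedia (Hebrew)", 650, 3),
    ("Twitter sample stream", 7_000, 70),
]

total_mb = sum(size_mb for _, size_mb, _ in CORPORA)
total_sentences = sum(sentences for _, _, sentences in CORPORA)

for name, size_mb, sentences in CORPORA:
    # Share of each source in the total sentence count.
    share = 100 * sentences / total_sentences
    print(f"{name:<22} {size_mb / 1000:>6.2f} GB  {sentences:>3} M sentences ({share:.1f}%)")

print(f"Total: {total_mb / 1000:.2f} GB, {total_sentences} M sentences")
```

By these figures the combined corpus is about 17.65 GB and 93 million sentences. Tweets contribute roughly three quarters of the sentences but only about 40% of the text by size, so tweet sentences are much shorter on average than those from OSCAR or Wikipedia.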
Training optimized the Masked Language Model (MLM) loss. The data was divided into sections by number of tokens, with the number of training epochs and the learning rate varied to optimize performance.
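As an illustration of what the Masked Language Model objective does to its input, here is a minimal sketch of BERT-style token masking in plain Python. The 15% selection rate and the 80/10/10 split (mask, random token, unchanged) come from the original BERT paper; that ALEPHBERT used the same ratios is an assumption, since the text above does not state them. The function name `mask_tokens`, the `MASK_TOKEN` constant, and the toy vocabulary are hypothetical and not part of any library.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder string; real tokenizers define their own mask token


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (inputs, labels) for one MLM training example.

    labels[i] holds the original token at positions selected for prediction
    and None elsewhere, so the loss is computed only on selected positions.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK_TOKEN)         # 80%: replace with the mask token
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(token)              # 10%: keep the original token
        else:
            labels.append(None)
            inputs.append(token)
    return inputs, labels


# Toy example: the sentence doubles as the vocabulary for random replacement.
sentence = ["שלום", "עולם", "מה", "נשמע"]
masked, targets = mask_tokens(sentence, vocab=sentence, seed=0)
print(masked)
print(targets)
```

During training the model sees `inputs` and is scored on how well it recovers the original token at each position where `labels` is not None.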
Guide: Running Locally
- Installation: Ensure you have a Python environment with PyTorch set up, then install the transformers library from Hugging Face:
    pip install transformers
- Loading the Model:
    from transformers import BertModel, BertTokenizerFast
    alephbert_tokenizer = BertTokenizerFast.from_pretrained('onlplab/alephbert-base')
    alephbert = BertModel.from_pretrained('onlplab/alephbert-base')
    alephbert.eval()  # Inference mode (disables dropout); skip this when fine-tuning
- Execution: Run your script to use ALEPHBERT for your Hebrew language tasks.
- Cloud GPUs: For enhanced performance, consider using cloud GPU services like AWS, Google Cloud, or Azure.
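Putting the loading and execution steps together, a script for extracting sentence vectors might look roughly like the sketch below. It assumes `torch` and `transformers` are installed and that the model can be downloaded on first use. The helper names `batched` and `embed_sentences`, the batch size, and the choice of the `[CLS]` position as a sentence vector are illustrative assumptions, not a recommendation from the model's authors.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_sentences(sentences, model_name="onlplab/alephbert-base", batch_size=8):
    """Return one vector per sentence, taken from the [CLS] position."""
    # Heavy dependencies are imported here so that `batched` stays usable without them.
    import torch
    from transformers import BertModel, BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained(model_name)
    model = BertModel.from_pretrained(model_name)

    # Use a GPU when one is visible to PyTorch, otherwise fall back to the CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    model.eval()  # inference mode: disables dropout

    vectors = []
    with torch.no_grad():  # no gradients are needed for inference
        for batch in batched(sentences, batch_size):
            encoded = tokenizer(
                batch, padding=True, truncation=True, return_tensors="pt"
            ).to(device)
            output = model(**encoded)
            # last_hidden_state has shape (batch, sequence_length, hidden_size);
            # index 0 along the sequence axis is the [CLS] token.
            vectors.append(output.last_hidden_state[:, 0, :].cpu())
    return torch.cat(vectors)
```

Calling `embed_sentences(["שלום עולם", "מה נשמע?"])` should return a tensor with one row per input sentence. The input list is expected to be non-empty, since `torch.cat` raises an error on an empty list.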
License
ALEPHBERT is licensed under the Apache 2.0 License, allowing for wide use with attribution.