DarijaBERT
Introduction
DarijaBERT is the first BERT model specifically designed for the Moroccan Arabic dialect, known as Darija. Developed by AIOX Lab and SI2M Lab INSEA, this open-source model aims to advance natural language processing (NLP) for Moroccan dialects.
Architecture
DarijaBERT is based on the BERT-base architecture but omits the Next Sentence Prediction (NSP) objective, relying on masked language modeling alone. It was trained on approximately 3 million sequences of Darija text, totaling around 100 million tokens.
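Because pretraining used masked language modeling only, fill-mask is the model's natural inference task. A minimal sketch using the Hugging Face Transformers `fill-mask` pipeline follows; the input sentence is an arbitrary placeholder, not an example from the original card:

```python
from transformers import pipeline

# DarijaBERT was pretrained with masked language modeling only (no NSP),
# so predicting a masked token is its native task.
fill_mask = pipeline("fill-mask", model="SI2M-Lab/DarijaBERT")

# Placeholder Darija-style input; any sentence with one [MASK] token works.
predictions = fill_mask("السلام [MASK]")

# Each prediction carries the candidate token and its score.
for p in predictions:
    print(p["token_str"], round(p["score"], 3))
```

The pipeline downloads the tokenizer and weights from the Hugging Face Hub on first use.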
Training
The model was trained on a dataset collected from three primary sources:
- Stories written in Darija, scraped from a dedicated website.
- YouTube comments from 40 Moroccan channels.
- Tweets gathered using a list of Darija-specific keywords.
Together, these sources amount to roughly 691 MB of text. Training was facilitated by resources from Google’s TensorFlow Research Cloud (TRC) program, which provided free Cloud TPUs.
Guide: Running Locally
To run DarijaBERT locally, you can use the Hugging Face Transformers library. Here are the basic steps:
- Install the Transformers library:

  pip install transformers

- Load the DarijaBERT tokenizer and model:

  from transformers import AutoTokenizer, AutoModel

  DarijaBERT_tokenizer = AutoTokenizer.from_pretrained("SI2M-Lab/DarijaBERT")
  DarijaBert_model = AutoModel.from_pretrained("SI2M-Lab/DarijaBERT")
For optimal performance, consider using cloud GPUs from platforms like AWS, Google Cloud, or Azure.
License
The model and its resources are open source, allowing researchers, industry practitioners, and the broader NLP community to use and build on them. For citation, refer to Gaanoun et al., 2023.