DarijaBERT

SI2M-Lab

Introduction

DarijaBERT is the first BERT model specifically designed for the Moroccan Arabic dialect, known as Darija. Developed by AIOX Lab and SI2M Lab INSEA, this open-source model aims to advance natural language processing (NLP) for Moroccan dialects.

Architecture

DarijaBERT is based on the BERT-base architecture but does not include the Next Sentence Prediction (NSP) objective. It was trained on approximately 3 million sequences of Darija text, totaling around 100 million tokens.

Training

The model was trained on a dataset collected from three primary sources:

  • Stories written in Darija, scraped from a dedicated website.
  • YouTube comments from 40 Moroccan channels.
  • Tweets gathered using a list of Darija-specific keywords.

Together, these sources yielded a dataset of 691MB of text. Training was supported by Google’s TensorFlow Research Cloud (TRC) program, which provided free Cloud TPUs.

Guide: Running Locally

To run DarijaBERT locally, you can use the Hugging Face Transformers library. Here are the basic steps:

  1. Install the Transformers library:

    pip install transformers
    
  2. Load the DarijaBERT model:

    from transformers import AutoTokenizer, AutoModel
    
    # Download the tokenizer and pretrained weights from the Hugging Face Hub
    DarijaBERT_tokenizer = AutoTokenizer.from_pretrained("SI2M-Lab/DarijaBERT")
    DarijaBert_model = AutoModel.from_pretrained("SI2M-Lab/DarijaBERT")
    

For optimal performance, consider using cloud GPUs from platforms like AWS, Google Cloud, or Azure.
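Because DarijaBERT is pretrained purely as a masked language model (no NSP), a quick way to verify the local setup is the Transformers fill-mask pipeline. The sketch below assumes the `SI2M-Lab/DarijaBERT` checkpoint named above; the Darija sentence is only an illustrative placeholder:

```python
from transformers import pipeline

# Build a fill-mask pipeline on top of the DarijaBERT checkpoint
fill_mask = pipeline("fill-mask", model="SI2M-Lab/DarijaBERT")

# Insert the tokenizer's mask token where a prediction is wanted
sentence = f"جاب ليا {fill_mask.tokenizer.mask_token} فالفرمة"

# The pipeline returns the top candidate tokens with their scores
predictions = fill_mask(sentence)
for p in predictions:
    print(p["token_str"], round(p["score"], 4))
```

By default the pipeline returns the five most probable fillers; pass `top_k` to change that.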

License

The model and its resources are open-source, allowing researchers, industry practitioners, and the wider NLP community to use the model and contribute to its development. For citation, refer to the article by Gaanoun et al., 2023.
