bert-base-arabertv2

aubmindlab

Introduction

AraBERT is an Arabic language model based on Google's BERT architecture, designed to improve Arabic language understanding. It comes in several versions; the v2 models add better preprocessing, and this variant applies pre-segmentation with the Farasa Segmenter. The models have been evaluated on tasks such as Sentiment Analysis, Named Entity Recognition, and Arabic Question Answering.

Architecture

AraBERT adopts the BERT-Base configuration. It introduces improved preprocessing and vocabulary adjustments, inserting spaces around punctuation marks and numbers. Because the vocabulary was built with the BertWordpieceTokenizer, the model works with both the original (slow) and Fast tokenizer implementations of the transformers library.
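As a minimal sketch (assuming the transformers library is installed), both tokenizer classes load the same WordPiece vocabulary and should produce the same tokenization; the sample sentence is arbitrary:

    from transformers import BertTokenizer, BertTokenizerFast

    # Slow (pure-Python) tokenizer
    slow_tok = BertTokenizer.from_pretrained("aubmindlab/bert-base-arabertv2")
    # Fast (Rust-backed) tokenizer built on the same WordPiece vocabulary
    fast_tok = BertTokenizerFast.from_pretrained("aubmindlab/bert-base-arabertv2")

    text = "ولن نبالغ إذا قلنا"
    print(slow_tok.tokenize(text))
    print(fast_tok.tokenize(text))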

Training

AraBERT v2 was trained on a significantly larger dataset, roughly 3.5 times more data than earlier versions, with longer training times. The training corpus comprises about 77GB of text, including the Arabic Wikipedia and the OSCAR corpus, and the models were trained on TPUs. The v2 models also incorporate the improved preprocessing and a new vocabulary to handle Arabic text more effectively.

Guide: Running Locally

To run AraBERT locally, follow these steps:

  1. Install Dependencies: Install the required libraries, including farasapy for text segmentation and the arabert package, which provides the ArabertPreprocessor used in step 3 (transformers is assumed for model loading).
    pip install farasapy arabert transformers
    
  2. Download the Model: Clone the desired AraBERT version from Hugging Face's model hub (alternatively, transformers will download the weights automatically when you call from_pretrained, as in the sketch after this list).
    git clone https://huggingface.co/aubmindlab/bert-base-arabertv2
    
  3. Preprocess Text: Use the ArabertPreprocessor class to apply the same preprocessing (including Farasa segmentation for this variant) that was used during pre-training.
    from arabert.preprocess import ArabertPreprocessor

    # The model name selects the matching preprocessing rules
    arabert_prep = ArabertPreprocessor(model_name="bert-base-arabertv2")
    processed_text = arabert_prep.preprocess("Your text here")
    
  4. Inference: Load the model in your preferred framework, such as PyTorch or TensorFlow, and run inference on the preprocessed text; a minimal sketch follows this list.

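As a minimal sketch of step 4 (assuming PyTorch and the transformers library), the model can be loaded for its pre-training objective, masked-token filling, via the fill-mask pipeline; the sample sentence is illustrative only:

    from transformers import pipeline

    # Loads the tokenizer and model weights (downloads them on first use)
    fill_mask = pipeline("fill-mask", model="aubmindlab/bert-base-arabertv2")

    # Input should first be run through ArabertPreprocessor (step 3);
    # [MASK] marks the token for the model to predict
    for pred in fill_mask("عاصمة لبنان هي [MASK] ."):
        print(pred["token_str"], pred["score"])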
For accelerated performance, consider using cloud GPUs, such as those available on Google Cloud or AWS.
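For example, a pipeline can be pinned to a GPU when one is present (a hedged sketch; device=0 selects the first CUDA device):

    import torch
    from transformers import pipeline

    # Fall back to CPU (-1) when no CUDA device is available
    device = 0 if torch.cuda.is_available() else -1
    fill_mask = pipeline("fill-mask", model="aubmindlab/bert-base-arabertv2", device=device)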

License

The AraBERT models are distributed through the Hugging Face Hub under the licensing terms stated on the model page. Users are encouraged to cite the original AraBERT paper if they use the models in their work.
