bert-base-arabertv02

aubmindlab

Introduction
AraBERT is a pre-trained language model for Arabic, based on Google's BERT architecture and developed by the AUB MIND Lab. It is released in several versions, including AraBERTv0.1 and AraBERTv1, the latter adding pre-segmentation with the Farasa Segmenter. AraBERT is evaluated on tasks such as sentiment analysis, named entity recognition, and Arabic question answering.

Architecture
AraBERT employs the BERT-Base configuration, focusing on Arabic language understanding. Versions v0.2 and v2 bring improved pre-processing and vocabulary handling; in particular, they fix issues in the wordpiece vocabulary around the separation of punctuation and numbers from surrounding words.
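The punctuation-and-number issue can be illustrated with a toy sketch. This is not the arabert package's actual code; the regex and function below are invented for illustration, but they show the idea: inserting spaces around punctuation and digits keeps wordpiece tokenization from fusing them onto neighboring words.

```python
import re

# Illustrative only (not the arabert package's implementation):
# match common Latin/Arabic punctuation and Western/Eastern Arabic digits.
PUNCT_AND_DIGITS = re.compile(r"([?!:;,.\u060C\u061B\u061F0-9\u0660-\u0669])")

def space_out(text: str) -> str:
    # Surround each punctuation mark or digit with spaces, then
    # collapse the runs of whitespace that this creates.
    spaced = PUNCT_AND_DIGITS.sub(r" \1 ", text)
    return re.sub(r"\s+", " ", spaced).strip()

print(space_out("كتاب،قلم"))  # كتاب ، قلم
```

With the punctuation spaced out, each mark becomes its own token instead of producing rare word+punctuation wordpieces.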

Training
The model was trained on an extensive Arabic corpus, leveraging resources like the unshuffled OSCAR corpus and Arabic Wikipedia. Training utilized TPU hardware for efficient processing, with versions trained on up to 3.5 times more data than previous iterations.

Guide: Running Locally

  1. Preprocessing: Use the arabert Python package to preprocess text. Install it via pip:
    pip install arabert
    
    Then, preprocess text with:
    from arabert.preprocess import ArabertPreprocessor
    
    model_name = "aubmindlab/bert-base-arabertv02"
    arabert_prep = ArabertPreprocessor(model_name=model_name)
    
    text = "ولن نبالغ إذا قلنا: إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
    processed_text = arabert_prep.preprocess(text)
    
  2. Model Download: Models are available in PyTorch, TF2, and TF1 formats. To download TF1 models:
    • Install git-lfs:
      curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
      sudo apt-get install git-lfs
      git lfs install
      
    • Clone the repository:
      git clone https://huggingface.co/aubmindlab/MODEL_NAME
      
  3. Cloud GPUs: For enhanced performance, consider using cloud GPUs from providers like Google Cloud or AWS.
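Once set up, the model is typically queried as a masked language model (fill-mask). The helper below is hypothetical, not part of the arabert package or transformers; it only shows how a fill-mask query string could be prepared after preprocessing, before being passed to a fill-mask pipeline backed by the downloaded model.

```python
def mask_word(sentence: str, target: str, mask_token: str = "[MASK]") -> str:
    """Replace one word of a (preprocessed) sentence with BERT's mask token.

    Hypothetical helper for illustration; the resulting string would be
    fed to a fill-mask pipeline using the downloaded AraBERT model.
    """
    words = sentence.split()
    return " ".join(mask_token if w == target else w for w in words)

query = mask_word("الهاتف في زمننا هذا ضروري", "ضروري")
print(query)  # الهاتف في زمننا هذا [MASK]
```

The model then scores vocabulary candidates for the `[MASK]` position and returns the most likely completions.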

License
AraBERT's development was supported by the TensorFlow Research Cloud (TFRC) program, which provided access to Cloud TPUs. The model is released under the terms described by the AUB MIND Lab and its contributors. If you use AraBERT, please cite it using the provided BibTeX entry.
