bert-large-arabertv2 (aubmindlab)
Introduction
AraBERT is a pretrained language model designed for Arabic language understanding, based on Google's BERT architecture. There are different versions, including AraBERTv0.1, AraBERTv1, and the newer AraBERTv2, which introduces better pre-processing and a new vocabulary.
Architecture
AraBERT follows Google's BERT architecture and is released as a family of models, including AraBERTv0.2 and AraBERTv2, each available in base and large configurations; this card covers the large AraBERTv2 model. The variants differ in training data and pre-processing, with AraBERTv2 trained on text pre-segmented with the Farasa segmenter for improved accuracy.
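As a quick way to confirm which configuration a given checkpoint uses, its config file can be inspected with the transformers library. The short sketch below is an illustration of that (not taken from the model card) and assumes transformers is installed.

    # Minimal sketch: inspect the configuration of bert-large-arabertv2.
    # Only the small config file is downloaded, not the full weights.
    from transformers import AutoConfig

    config = AutoConfig.from_pretrained("aubmindlab/bert-large-arabertv2")

    # A BERT-Large style configuration is expected to report 24 hidden layers,
    # a hidden size of 1024, and 16 attention heads.
    print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
    print(config.vocab_size)  # size of the AraBERTv2 WordPiece vocabulary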
Training
The AraBERT models are trained on a large Arabic corpus that includes the Arabic Wikipedia dump, the 1.5B words Arabic Corpus, and Assafir news articles. Training runs on TPUs for efficiency and covers over 420 million training sequences across multiple sequence lengths. Enhanced pre-processing is applied before tokenization to improve the learned vocabulary.
Guide: Running Locally
- Installation:
  - Install the arabert package: pip install arabert
  - Use the ArabertPreprocessor for text segmentation and cleaning.
- Preprocessing Example (a fuller end-to-end sketch follows this guide):

    from arabert.preprocess import ArabertPreprocessor

    model_name = "aubmindlab/bert-large-arabertv2"
    arabert_prep = ArabertPreprocessor(model_name=model_name)
    text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
    processed_text = arabert_prep.preprocess(text)
    print(processed_text)
- Model Download:
  - Use git-lfs or wget to download models from Hugging Face (a programmatic alternative is sketched after this guide).
  - Example command for a TF1 model:

    git clone https://huggingface.co/aubmindlab/MODEL_NAME
- Hardware Requirements:
  - Cloud accelerators such as GPUs or Google's TPUs are recommended for efficient training and inference.
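Putting the guide together, the following is a minimal end-to-end sketch of one plausible workflow (my assumption, not an official recipe from the model card): preprocess a sentence with ArabertPreprocessor, load the checkpoint through the transformers fill-mask pipeline, and place it on a GPU when one is available, as recommended above.

    # Minimal end-to-end sketch; assumes arabert, transformers, and torch are installed.
    import torch
    from arabert.preprocess import ArabertPreprocessor
    from transformers import pipeline

    model_name = "aubmindlab/bert-large-arabertv2"
    arabert_prep = ArabertPreprocessor(model_name=model_name)

    # Clean and segment the raw sentence exactly as in the preprocessing step above.
    text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
    processed_text = arabert_prep.preprocess(text)

    # Use the first GPU if available, otherwise fall back to CPU (device=-1).
    device = 0 if torch.cuda.is_available() else -1
    fill_mask = pipeline("fill-mask", model=model_name, device=device)

    # Mask the last whitespace-separated token of the preprocessed text and predict it.
    tokens = processed_text.split()
    tokens[-1] = fill_mask.tokenizer.mask_token
    for prediction in fill_mask(" ".join(tokens)):
        print(prediction["token_str"], round(prediction["score"], 3))

The fill-mask call is only a quick sanity check; for downstream tasks the same checkpoint can be loaded with AutoTokenizer and AutoModel and fine-tuned like any other BERT model.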
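As an alternative to cloning with git-lfs or fetching files with wget, the repository can also be downloaded programmatically. The snippet below uses the huggingface_hub library's snapshot_download helper, which is a suggested alternative rather than something the original guide mentions.

    # Minimal sketch; assumes the huggingface_hub package is installed.
    # Downloads every file in the repository (PyTorch weights, tokenizer files,
    # and any TF1 checkpoint files) into the local Hugging Face cache.
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(repo_id="aubmindlab/bert-large-arabertv2")
    print("Model files downloaded to:", local_dir)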
License
AraBERT and its associated resources are open for use. Users are encouraged to cite the model's creators when it is used in academic or commercial work. For inquiries, contact Wissam Antoun or Fady Baly via the email or social media links they provide.