bert-large-arabertv2 (aubmindlab)
Introduction
AraBERT is a pretrained language model designed for Arabic language understanding, based on Google's BERT architecture. There are different versions, including AraBERTv0.1, AraBERTv1, and the newer AraBERTv2, which introduces better pre-processing and a new vocabulary.
Architecture
AraBERT follows Google's BERT architecture and is released as a family of models, including AraBERTv0.2 and AraBERTv2, each available in base and large configurations; this card covers the large AraBERTv2 model. The variants differ in training data and pre-processing, with AraBERTv2 trained on text pre-segmented with the Farasa segmenter for improved accuracy.
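As a quick way to confirm which configuration a given checkpoint uses, its config file can be inspected with the transformers library. The short sketch below is an illustration of that (not taken from the model card) and assumes transformers is installed.

    # Minimal sketch: inspect the configuration of bert-large-arabertv2.
    # Only the small config file is downloaded, not the full weights.
    from transformers import AutoConfig

    config = AutoConfig.from_pretrained("aubmindlab/bert-large-arabertv2")

    # A BERT-Large style configuration is expected to report 24 hidden layers,
    # a hidden size of 1024, and 16 attention heads.
    print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
    print(config.vocab_size)  # size of the AraBERTv2 WordPiece vocabulary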
Training
The AraBERT models are trained on a large Arabic corpus that includes the Arabic Wikipedia dump, the 1.5B words Arabic Corpus, and Assafir news articles. Training runs on TPUs for efficiency and covers over 420 million training sequences across multiple sequence lengths. Enhanced pre-processing is applied before tokenization to improve the learned vocabulary.
Guide: Running Locally
- Installation:
  - Install the arabert package: pip install arabert
  - Use the ArabertPreprocessor for text segmentation and cleaning.
- Preprocessing Example (a fuller end-to-end sketch follows this guide):

    from arabert.preprocess import ArabertPreprocessor

    model_name = "aubmindlab/bert-large-arabertv2"
    arabert_prep = ArabertPreprocessor(model_name=model_name)
    text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
    processed_text = arabert_prep.preprocess(text)
    print(processed_text)
- Model Download:
  - Use git-lfs or wget to download models from Hugging Face (a programmatic alternative is sketched after this guide).
  - Example command for a TF1 model:

    git clone https://huggingface.co/aubmindlab/MODEL_NAME
- Hardware Requirements:
  - Cloud accelerators such as GPUs or Google's TPUs are recommended for efficient training and inference.
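Putting the guide together, the following is a minimal end-to-end sketch of one plausible workflow (my assumption, not an official recipe from the model card): preprocess a sentence with ArabertPreprocessor, load the checkpoint through the transformers fill-mask pipeline, and place it on a GPU when one is available, as recommended above.

    # Minimal end-to-end sketch; assumes arabert, transformers, and torch are installed.
    import torch
    from arabert.preprocess import ArabertPreprocessor
    from transformers import pipeline

    model_name = "aubmindlab/bert-large-arabertv2"
    arabert_prep = ArabertPreprocessor(model_name=model_name)

    # Clean and segment the raw sentence exactly as in the preprocessing step above.
    text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
    processed_text = arabert_prep.preprocess(text)

    # Use the first GPU if available, otherwise fall back to CPU (device=-1).
    device = 0 if torch.cuda.is_available() else -1
    fill_mask = pipeline("fill-mask", model=model_name, device=device)

    # Mask the last whitespace-separated token of the preprocessed text and predict it.
    tokens = processed_text.split()
    tokens[-1] = fill_mask.tokenizer.mask_token
    for prediction in fill_mask(" ".join(tokens)):
        print(prediction["token_str"], round(prediction["score"], 3))

The fill-mask call is only a quick sanity check; for downstream tasks the same checkpoint can be loaded with AutoTokenizer and AutoModel and fine-tuned like any other BERT model.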
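As an alternative to cloning with git-lfs or fetching files with wget, the repository can also be downloaded programmatically. The snippet below uses the huggingface_hub library's snapshot_download helper, which is a suggested alternative rather than something the original guide mentions.

    # Minimal sketch; assumes the huggingface_hub package is installed.
    # Downloads every file in the repository (PyTorch weights, tokenizer files,
    # and any TF1 checkpoint files) into the local Hugging Face cache.
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(repo_id="aubmindlab/bert-large-arabertv2")
    print("Model files downloaded to:", local_dir)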
License
AraBERT and its associated resources are open for use. Users are encouraged to cite the model's creators when it is used in academic or commercial work. For inquiries, contact Wissam Antoun or Fady Baly via the email or social media links they provide.