BERT Large Arabic (asafaya/bert-large-arabic)
Introduction
The BERT-LARGE-ARABIC model is a pretrained BERT Large language model tailored to Arabic. It is intended for downstream tasks such as offensive speech identification in social media, as described in the paper cited in the documentation.
Architecture
The model follows the BERT Large architecture and is pretrained on a corpus that combines the Arabic version of OSCAR, a recent dump of Arabic Wikipedia, and other Arabic resources. The corpus amounts to approximately 95 GB of text and contains both Modern Standard Arabic and dialectal Arabic.
Training
The pretraining corpus comprises roughly 8.2 billion words drawn from these Arabic resources. Non-Arabic words were deliberately not removed, so that tasks such as Named Entity Recognition (NER) are not harmed. The model was trained using Google's BERT GitHub repository on a TPU v3-8, with training settings adapted from the original BERT: 3 million training steps with a batch size of 128.
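For illustration only, a run at this scale launched with run_pretraining.py from Google's BERT repository would look roughly like the sketch below. The bucket paths, config file name, and TPU name are placeholders not given in the documentation; only the step count and batch size come from the description above, and the remaining flags keep the script's defaults.

# Placeholder paths and TPU name; 3M steps and batch size 128 as described above.
python run_pretraining.py \
  --input_file="gs://YOUR_BUCKET/pretraining_data/*.tfrecord" \
  --output_dir="gs://YOUR_BUCKET/bert-large-arabic" \
  --bert_config_file=bert_config.json \
  --do_train=True \
  --train_batch_size=128 \
  --num_train_steps=3000000 \
  --use_tpu=True \
  --tpu_name=YOUR_TPU_NAME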
Guide: Running Locally
To use the pretrained BERT-LARGE-ARABIC model, follow these steps:
- Install the required libraries:

  pip install torch transformers
- Load the model (a usage sketch follows these steps):

  from transformers import AutoTokenizer, AutoModelForMaskedLM

  tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-large-arabic")
  model = AutoModelForMaskedLM.from_pretrained("asafaya/bert-large-arabic")
- Utilize cloud GPUs: for optimal performance, consider cloud services such as Google Cloud, AWS, or Azure, which provide GPU acceleration.
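As a quick end-to-end check, the sketch below loads the model, moves it to a GPU when one is available, and predicts the masked token in a short Arabic sentence. The example sentence and the top-5 cutoff are illustrative choices, not taken from the documentation.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-large-arabic")
model = AutoModelForMaskedLM.from_pretrained("asafaya/bert-large-arabic")

# Use a GPU if one is available (e.g. on a cloud instance).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

# Illustrative input: "The capital of Egypt is [MASK]."
text = "عاصمة مصر هي [MASK]."
inputs = tokenizer(text, return_tensors="pt").to(device)

with torch.no_grad():
    logits = model(**inputs).logits

# Find the masked position and print the five most likely fillers.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_pos].topk(5).indices[0].tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))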
License
The documentation does not specify a particular license. Users should refer to the original repository or contact the authors for detailed licensing information.