Introduction

AraGPT2 is a pre-trained transformer model for Arabic text generation, developed by the AUB MIND Lab. It was trained on a large Arabic corpus and is released in four configurations: base, medium, large, and mega. The model is designed for text generation tasks and is compatible with the Hugging Face Transformers library.

Architecture

AraGPT2 follows the architecture of GPT-2. The models vary in size and configuration:

  • AraGPT2-base: 1024 context size, 768 embedding size, 12 heads, 12 layers, 135 million parameters.
  • AraGPT2-medium: 1024 context size, 1024 embedding size, 16 heads, 24 layers, 370 million parameters.
  • AraGPT2-large: 1024 context size, 1280 embedding size, 20 heads, 36 layers, 792 million parameters.
  • AraGPT2-mega: 1024 context size, 1536 embedding size, 25 heads, 48 layers, 1.46 billion parameters.

The models use different optimizers: LAMB for the base and medium models, and Adafactor for the large and mega models.
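
As a quick sanity check on these sizes, the configuration of any published checkpoint can be inspected with the Transformers AutoConfig API. This is a minimal sketch; it assumes the checkpoints are named aubmindlab/aragpt2-base, aubmindlab/aragpt2-medium, aubmindlab/aragpt2-large, and aubmindlab/aragpt2-mega on the Hugging Face Hub, following the naming used in the guide below.

    from transformers import AutoConfig
    
    # Load the published config for one checkpoint (swap the name to compare sizes)
    config = AutoConfig.from_pretrained("aubmindlab/aragpt2-base")
    
    print(config.n_positions)  # context size, e.g. 1024
    print(config.n_embd)       # embedding size, e.g. 768
    print(config.n_head)       # attention heads, e.g. 12
    print(config.n_layer)      # transformer layers, e.g. 12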

Training

Training used GPUs and TPUs with the TPUEstimator API. The dataset consists of 77 GB of Arabic text drawn from sources such as Arabic Wikipedia, the 1.5 billion words Arabic Corpus, and news articles. The different model sizes were trained on different hardware configurations and required significant computational resources.

Guide: Running Locally

To run AraGPT2 locally, follow these steps:

  1. Install required libraries:

    pip install transformers arabert
    
  2. Import the necessary modules and initialize the model:

    from transformers import GPT2TokenizerFast, pipeline
    # The grover-based GPT2LMHeadModel from arabert is needed for the large and
    # mega checkpoints; base and medium also work with transformers' own GPT2LMHeadModel.
    from arabert.aragpt2.grover.modeling_gpt2 import GPT2LMHeadModel
    from arabert.preprocess import ArabertPreprocessor
    
    MODEL_NAME = 'aubmindlab/aragpt2-base'
    
    # Clean the prompt with the AraBERT preprocessor before generation
    arabert_prep = ArabertPreprocessor(model_name=MODEL_NAME)
    text_clean = arabert_prep.preprocess("Your text here")
    
    model = GPT2LMHeadModel.from_pretrained(MODEL_NAME)
    tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_NAME)
    generation_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)
    
  3. Generate text:

    output = generation_pipeline(text_clean, pad_token_id=tokenizer.eos_token_id, num_beams=10, max_length=200, top_p=0.9)
    print(output[0]['generated_text'])
    
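For finer control over decoding, the same model and tokenizer can be driven directly through model.generate instead of the pipeline. This is a minimal sketch using standard Transformers generation arguments; it assumes the loaded model exposes the usual generate API, and the sampling settings shown are illustrative rather than recommended values.

    import torch
    
    # Encode the preprocessed prompt and generate with explicit decoding settings
    inputs = tokenizer(text_clean, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_length=200,
            do_sample=True,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
        )
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
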

For more intensive workloads, consider cloud GPUs (for example on Google Cloud or AWS) or TPUs to speed up training and inference.
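
On a machine with a GPU, the pipeline from the guide above can be placed on the accelerator by passing a device index, a standard Transformers pipeline argument; the sketch below assumes a single CUDA GPU at index 0.

    import torch
    
    # Use the first CUDA GPU if available, otherwise fall back to CPU (-1)
    device = 0 if torch.cuda.is_available() else -1
    generation_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer, device=device)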

License

AraGPT2 is intended for research and scientific purposes. The generated text does not represent the authors' or institutions' official stance. Ensure usage complies with ethical standards and does not propagate inappropriate content.
