AraGPT2 Medium

aubmindlab

Introduction

AraGPT2 is a pre-trained transformer model for Arabic language generation. Developed by the AUB MIND Lab, it builds on the GPT-2 architecture and was trained on a large Arabic corpus. The model is intended for research and scientific purposes, and its outputs do not reflect any official views or preferences.

Architecture

AraGPT2 follows the GPT-2 architecture with several variants:

  • Base: 12 layers, 768 embedding size, 12 heads, 135M parameters
  • Medium: 24 layers, 1024 embedding size, 16 heads, 370M parameters
  • Large: 36 layers, 1280 embedding size, 20 heads, 792M parameters
  • Mega: 48 layers, 1536 embedding size, 25 heads, 1.46B parameters

The models are compatible with the Transformers library. To keep memory usage manageable during training, the variants were trained with different optimizers, such as LAMB and Adafactor.
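
As a minimal sketch, loading a different variant instead of Medium (assuming the other checkpoints are published on the Hugging Face Hub under the same naming pattern, e.g. aubmindlab/aragpt2-base) only requires changing the checkpoint name passed to from_pretrained:

    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    # Assumed checkpoint name for illustration; this page covers 'aubmindlab/aragpt2-medium'
    checkpoint = "aubmindlab/aragpt2-base"

    tokenizer = GPT2TokenizerFast.from_pretrained(checkpoint)
    model = GPT2LMHeadModel.from_pretrained(checkpoint)
    print(f"{checkpoint}: {model.num_parameters():,} parameters")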

Training

AraGPT2 was trained on a dataset of roughly 77GB of Arabic text drawn from sources such as Arabic Wikipedia, the 1.5B-word Arabic Corpus, and Assafir news articles. Training ran on TPU hardware: a TPUv3-8 for the Medium model and a TPUv3-128 for the larger variants. The corpus was first converted into TFRecords, and the training job was then run on the TPU infrastructure.
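
The authors' data-preparation scripts are not reproduced here. As an illustrative sketch only (not the actual AraGPT2 pipeline), tokenized sequences can be serialized into TFRecords with TensorFlow's tf.train API along these lines:

    import tensorflow as tf

    def write_tfrecord(token_id_sequences, path):
        # Each example stores one tokenized sequence as an int64 feature
        with tf.io.TFRecordWriter(path) as writer:
            for ids in token_id_sequences:
                feature = {"input_ids": tf.train.Feature(int64_list=tf.train.Int64List(value=ids))}
                example = tf.train.Example(features=tf.train.Features(feature=feature))
                writer.write(example.SerializeToString())

    # Dummy token ids; real shards would come from the tokenized Arabic corpus
    write_tfrecord([[5, 12, 7, 9], [3, 8, 1]], "arabic_corpus_shard_0.tfrecord")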

Guide: Running Locally

  1. Installation: Install the necessary libraries using pip (PyTorch is also required, since the model class and pipeline below use the PyTorch backend):
    pip install transformers arabert
    
  2. Setup: Import the required modules and initialize the preprocessor, model, tokenizer, and generation pipeline:
    from transformers import GPT2TokenizerFast, pipeline, GPT2LMHeadModel
    from arabert.preprocess import ArabertPreprocessor
    
    MODEL_NAME = 'aubmindlab/aragpt2-medium'
    
    # The AraBERT preprocessor cleans and normalizes Arabic input text
    arabert_prep = ArabertPreprocessor(model_name=MODEL_NAME)
    model = GPT2LMHeadModel.from_pretrained(MODEL_NAME)
    tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_NAME)
    generation_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)
    
  3. Text Generation: Preprocess the input text and generate a continuation with the model:
    text = "Input text here."
    # Clean and normalize the input with the AraBERT preprocessor before generation
    text_clean = arabert_prep.preprocess(text)
    generated_text = generation_pipeline(text_clean, pad_token_id=tokenizer.eos_token_id,
                                         num_beams=10, max_length=200, top_p=0.9,
                                         repetition_penalty=3.0, no_repeat_ngram_size=3)[0]['generated_text']
    
  4. Cloud GPUs: For improved performance, consider using cloud GPU services like Google Cloud or AWS.
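
As an alternative to the pipeline call in step 3, the same settings can be passed directly to model.generate. This is a standard Transformers usage sketch that reuses the model, tokenizer, and text_clean objects defined above; it is not an additional requirement of AraGPT2:

    # Tokenize the preprocessed text and generate with the same decoding settings
    inputs = tokenizer(text_clean, return_tensors="pt")
    output_ids = model.generate(**inputs,
                                max_length=200, num_beams=10, top_p=0.9,
                                repetition_penalty=3.0, no_repeat_ngram_size=3,
                                pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))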

License

AraGPT2 is intended for research and scientific purposes only. Generated content does not reflect the official views of the authors or their institutions, and it should not be redistributed if it infringes on rights or violates social norms. Please cite the model as specified if you use it in your work.
