MBERT2MBERT Arabic Text Summarization

malmarjeh

Introduction

The MBERT2MBERT Arabic Text Summarization model is an abstractive text summarization tool designed for Arabic text. It is based on the BERT2BERT model architecture and is fine-tuned using mBERT weights. The model has been trained on a dataset comprising 84,764 paragraph-summary pairs to perform tasks such as Arabic text summarization, news title generation, and paraphrasing.

Architecture

The model uses a BERT2BERT architecture: an encoder-decoder model in which both the encoder and the decoder are initialized from multilingual BERT (mBERT) checkpoints, with cross-attention layers added to the decoder. This warm-started encoder-decoder setup makes the model well suited to text-to-text generation tasks such as summarization, title generation, and paraphrasing of Arabic text.
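The BERT2BERT wiring can be illustrated with the `EncoderDecoderModel` API in transformers. The sketch below builds a tiny, randomly initialized encoder-decoder pair from scratch configs purely to show the structure; the layer sizes are illustrative and are not the released model's dimensions (in practice the published checkpoint is loaded directly, as shown in the guide below).

```python
from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

# Illustrative, tiny configs -- NOT the released model's dimensions.
enc_cfg = BertConfig(vocab_size=1000, hidden_size=32, num_hidden_layers=2,
                     num_attention_heads=2, intermediate_size=64)
dec_cfg = BertConfig(vocab_size=1000, hidden_size=32, num_hidden_layers=2,
                     num_attention_heads=2, intermediate_size=64,
                     is_decoder=True, add_cross_attention=True)

# BERT2BERT: a BERT encoder paired with a BERT decoder that attends
# to the encoder's outputs via cross-attention.
config = EncoderDecoderConfig.from_encoder_decoder_configs(enc_cfg, dec_cfg)
model = EncoderDecoderModel(config=config)

print(model.config.decoder.is_decoder)           # True
print(model.config.decoder.add_cross_attention)  # True
```

The same class also supports warm-starting both halves from a pretrained checkpoint such as `bert-base-multilingual-cased`, which is how BERT2BERT models are typically initialized before fine-tuning.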

Training

The model was trained by fine-tuning mBERT weights on a dataset of 84,764 paragraph-summary pairs curated specifically for Arabic abstractive text summarization, enabling it to generate concise summaries of Arabic text. The training process is detailed further in a research paper available on ScienceDirect.

Guide: Running Locally

To run the MBERT2MBERT Arabic Text Summarization model locally, follow these steps:

  1. Setup Environment: Ensure you have Python installed and set up a virtual environment for package management.

  2. Install Libraries: Use pip to install the necessary Python libraries:

    pip install transformers arabert
    
  3. Load the Model: Utilize the code snippet provided to load and test the model:

    from transformers import BertTokenizer, AutoModelForSeq2SeqLM, pipeline
    from arabert.preprocess import ArabertPreprocessor
    
    model_name = "malmarjeh/mbert2mbert-arabic-text-summarization"
    # An empty model_name selects the preprocessor's default (non-AraBERT) cleaning
    preprocessor = ArabertPreprocessor(model_name="")
    
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    # Assign to a new name to avoid shadowing the imported `pipeline` function
    summarizer = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
    
    text = "Your Arabic text here"
    text = preprocessor.preprocess(text)
    
    result = summarizer(text,
                        pad_token_id=tokenizer.eos_token_id,
                        num_beams=3,
                        repetition_penalty=3.0,
                        max_length=200,
                        length_penalty=1.0,
                        no_repeat_ngram_size=3)[0]['generated_text']
    print(result)
    
  4. Cloud GPU Recommendation: For enhanced performance, consider using cloud GPU services such as AWS, Google Cloud, or Azure to run your model, especially for large-scale processing.
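For large-scale processing, note that mBERT's encoder accepts at most 512 subword tokens, so very long articles must be split before summarization. A minimal, hypothetical pre-chunking helper (using whitespace word counts as a rough proxy for subword length; `chunk_text` and its default are illustrative, not part of the model's API) could look like:

```python
def chunk_text(text: str, max_words: int = 350) -> list[str]:
    """Split text into word-bounded chunks of at most max_words words.

    Word count is a rough proxy for mBERT's 512-subword input limit;
    Arabic words often expand into several WordPiece tokens, hence the
    conservative default.
    """
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

# Each chunk can then be preprocessed and summarized independently,
# and the partial summaries concatenated or re-summarized.
chunks = chunk_text("word " * 800)
print(len(chunks))  # 3 chunks: 350 + 350 + 100 words
```

A more careful splitter would break on sentence boundaries rather than at a fixed word count, so that no chunk starts mid-sentence.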

License

License information is not explicitly provided in the model repository. Please refer to the repository or contact the maintainer for licensing details. For further inquiries, you can contact the author at banimarje@gmail.com.
