Summarization Arabic-English News

marefa-nlp

Introduction

The "Summarization Arabic-English News" model is designed to generate concise highlights from news articles in both Arabic and English. The model is part of the Hugging Face ecosystem and is tailored for text-to-text generation tasks using the T5 architecture.

Architecture

This model leverages the T5 (Text-to-Text Transfer Transformer) architecture, specifically fine-tuned from an Arabic T5 model developed by Abu Bakr Soliman. It operates within the Hugging Face Transformers library and is implemented using PyTorch.
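
The architecture can be verified without downloading the full weights by inspecting the published configuration. This is a minimal sketch using the standard AutoConfig API; the expected values shown in the comments are assumptions based on the T5 lineage described above:

    from transformers import AutoConfig

    # Load only the model configuration (no weights) and check its type.
    config = AutoConfig.from_pretrained("marefa-nlp/summarization-arabic-english-news")
    print(config.model_type)     # expected: "t5"
    print(config.architectures)  # e.g. ["T5ForConditionalGeneration"]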

Training

The model was fine-tuned on a dataset of Arabic and English news articles. The fine-tuning process adapted the pre-trained T5 architecture to better handle the nuances of summarizing news content, focusing on extracting key points from longer text passages.

Guide: Running Locally

To run the model locally, follow these steps:

  1. Setup Environment:

    • Install PyTorch.
    • Install necessary Python packages:
      pip3 install transformers==4.7.0 nltk==3.5 protobuf==3.15.3 sentencepiece==0.1.96
      
  2. Download Model and Tokenizer:

    • Load the model and tokenizer using the following script:
      from transformers import AutoTokenizer, AutoModelWithLMHead
      import torch
      import nltk
      nltk.download('punkt')
      from nltk.tokenize import sent_tokenize
      
      # Run on a GPU when one is available; otherwise fall back to the CPU.
      device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
      m_name = "marefa-nlp/summarization-arabic-english-news"
      
      # AutoModelWithLMHead is deprecated in newer Transformers releases but
      # works with the pinned transformers==4.7.0; AutoModelForSeq2SeqLM is
      # the modern equivalent for T5-style models.
      tokenizer = AutoTokenizer.from_pretrained(m_name)
      model = AutoModelWithLMHead.from_pretrained(m_name).to(device)
      
  3. Generate Summaries:

    • Use the get_summary function to generate summaries from input text. Its full implementation is not reproduced here; a hedged sketch is provided after this guide.
    • Example usage:
      def get_summary(text, tokenizer, model, device="cpu", num_beams=2):
          # Function implementation as provided in the original model card
          ...
      
      sample_text = "Your news article text here"
      summaries = get_summary(sample_text, tokenizer, model, device)
      for summary in summaries:
          print(summary)
      
  4. Cloud GPUs:

    • For improved performance, especially with large datasets or real-time applications, consider cloud-based GPU services such as Google Colab, AWS EC2, or Azure VMs.
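
The get_summary function referenced in step 3 is not reproduced above. The following is a minimal sketch of what such a helper might look like, assuming sentence splitting with nltk's sent_tokenize and standard beam-search generation; the input handling and generation parameters are assumptions, not the authors' published implementation:

    from nltk.tokenize import sent_tokenize  # already imported in step 2

    def get_summary(text, tokenizer, model, device="cpu", num_beams=2):
        # Hedged sketch, not the authors' published code.
        # Normalize the article into a single line of cleaned sentences.
        sentences = [s.strip() for s in sent_tokenize(text) if s.strip()]
        input_text = " ".join(sentences)

        # Encode, generate with beam search, and decode the summaries.
        input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)
        summary_ids = model.generate(
            input_ids,
            max_length=512,
            num_beams=num_beams,
            repetition_penalty=1.5,  # assumed values; tune as needed
            length_penalty=1.0,
            early_stopping=True,
        )
        return [tokenizer.decode(ids, skip_special_tokens=True) for ids in summary_ids]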

License

Use of this model and its associated code is subject to the licensing terms provided by the developers. Refer to the model's repository on Hugging Face for detailed licensing information.
