PEGASUS Multi-News (google/pegasus-multi_news)

Introduction

PEGASUS (Pre-training with Extracted Gap-sentences for Abstractive Summarization) is a model designed for summarization tasks. Developed by researchers at Google, PEGASUS is pre-trained by removing important sentences from a document ("gap sentences") and training the model to regenerate them, which sharpens its ability to produce abstractive summaries. This checkpoint, google/pegasus-multi_news, is fine-tuned on the Multi-News multi-document summarization dataset and is available in PyTorch through the Hugging Face Transformers library.

Architecture

PEGASUS is a standard Transformer encoder-decoder; its novelty lies in the pre-training objective, Gap Sentences Generation (GSG), rather than in the architecture itself. The released checkpoints are pre-trained on both the C4 and HugeNews corpora using a "Mixed & Stochastic" recipe, which samples gap sentences and applies uniform noise to sentence importance scores during training. This recipe lets PEGASUS transfer effectively to a range of downstream summarization datasets, including CNN/DailyMail and Multi-News.
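
For orientation, the configuration of this checkpoint can be inspected without downloading the weights. A minimal sketch, assuming the Transformers library is installed; the actual values are best read off the configuration itself rather than quoted here:

    from transformers import PegasusConfig

    # Fetch only the configuration, not the model weights.
    config = PegasusConfig.from_pretrained("google/pegasus-multi_news")

    # Standard Transformer encoder-decoder hyperparameters.
    print(config.encoder_layers, config.decoder_layers)  # layers per stack
    print(config.d_model)                                # hidden size
    print(config.encoder_attention_heads)                # attention heads
    print(config.max_position_embeddings)                # maximum input length in tokens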

Training

The "Mixed & Stochastic" training method includes:

  • Training on both C4 and HugeNews, with dataset mixtures weighted by the number of examples.
  • Extended training for 1.5 million steps, which is longer than the standard 500k steps, accommodating slower convergence on pretraining perplexity.
  • Uniform sampling of a gap sentence ratio between 15% and 45%.
  • Sampling of important sentences, with 20% uniform noise applied to their importance scores (illustrated in the sketch after this list).
  • Updating of the SentencePiece tokenizer to encode newline characters, improving paragraph segmentation.
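
The sentence-selection procedure can be made concrete with a toy sketch. The code below is not the actual pre-training pipeline: it scores each sentence by word overlap with the rest of the document as a crude stand-in for the ROUGE-based importance used in the paper, perturbs the scores with 20% uniform noise, and masks a uniformly sampled 15-45% of sentences, which become the generation target:

    import random

    MASK = "<mask_1>"  # sentence-level mask; the HF PEGASUS tokenizer uses <mask_1> for gap sentences

    def gap_sentence_mask(sentences, seed=0):
        """Toy gap-sentences generation (GSG) masking.

        Importance is plain word overlap with the rest of the document,
        a crude stand-in for the ROUGE-based scoring in the paper.
        """
        rng = random.Random(seed)

        def importance(i):
            words = set(sentences[i].lower().split())
            rest = set(w for j, s in enumerate(sentences) if j != i
                       for w in s.lower().split())
            return len(words & rest) / max(len(words), 1)

        # Apply 20% uniform noise to each importance score.
        scores = [importance(i) * rng.uniform(0.8, 1.2)
                  for i in range(len(sentences))]

        # Sample a gap-sentence ratio uniformly between 15% and 45%.
        ratio = rng.uniform(0.15, 0.45)
        n_gaps = max(1, round(ratio * len(sentences)))

        # Mask the highest-scoring sentences; they become the target sequence.
        masked_idx = set(sorted(range(len(sentences)),
                                key=lambda i: scores[i], reverse=True)[:n_gaps])
        inputs = [MASK if i in masked_idx else s for i, s in enumerate(sentences)]
        targets = [sentences[i] for i in sorted(masked_idx)]
        return " ".join(inputs), " ".join(targets)

    doc = ["PEGASUS is a summarization model.",
           "It is pre-trained by masking whole sentences.",
           "The model then learns to generate the masked sentences.",
           "This objective resembles abstractive summarization."]
    source, target = gap_sentence_mask(doc)
    print("input: ", source)
    print("target:", target)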

Guide: Running Locally

To run PEGASUS locally:

  1. Set Up Environment: Install Python and create a virtual environment.
  2. Install Dependencies: Use pip to install PyTorch and Hugging Face Transformers:
    pip install torch transformers
    
  3. Download Model: Download PEGASUS via the Hugging Face model hub:
    from transformers import PegasusTokenizer, PegasusForConditionalGeneration
    
    model_name = "google/pegasus-multi_news"
    tokenizer = PegasusTokenizer.from_pretrained(model_name)
    model = PegasusForConditionalGeneration.from_pretrained(model_name)
    
  4. Run Inference: Tokenize the input text and generate a summary:
    text = "Your input text here."
    # Tokenize and truncate to the model's maximum input length.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    # Beam search with 4 beams; early_stopping ends beams once they reach EOS.
    summary_ids = model.generate(**inputs, num_beams=4, max_length=60, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    print(summary)
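
Because this checkpoint is fine-tuned on Multi-News, a multi-document summarization dataset, inputs are typically several source articles concatenated into one string; the dataset joins articles with a "|||||" separator. A short sketch reusing the tokenizer and model from step 3 (the article strings are placeholders):

    # Multi-News concatenates source articles with a "|||||" separator.
    articles = [
        "First news article covering the event ...",
        "Second article on the same story from another outlet ...",
    ]
    text = " ||||| ".join(articles)

    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    summary_ids = model.generate(**inputs, num_beams=4, max_length=128)
    print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))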
    

For improved performance, consider using cloud-based GPU services such as AWS, GCP, or Azure.
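
As a minimal sketch of GPU inference, assuming PyTorch detects a CUDA device (it falls back to CPU otherwise):

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)

    # Inputs must live on the same device as the model.
    inputs = tokenizer(text, return_tensors="pt", truncation=True).to(device)
    with torch.no_grad():  # no gradients are needed for inference
        summary_ids = model.generate(**inputs, num_beams=4, max_length=60)
    print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))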

License

PEGASUS is released under the Apache 2.0 license, which permits free use, modification, and distribution of the software, provided the accompanying license and copyright notices are retained.
