google/pegasus-multi_news
Introduction
PEGASUS (Pre-training with Extracted Gap-sentences for Abstractive Summarization) is a model designed for summarization tasks. Developed by researchers at Google, PEGASUS uses a pre-training objective based on extracted gap sentences to improve its ability to generate abstractive summaries. It is implemented in PyTorch and integrated into the Hugging Face Transformers library.
Architecture
PEGASUS is a Transformer-based encoder-decoder model geared towards text summarization. This checkpoint is pre-trained on both the C4 and HugeNews datasets using a "Mixed & Stochastic" recipe, which samples the gap-sentence ratio and applies uniform noise to sentence importance scores during training. This approach helps PEGASUS perform well across summarization datasets such as CNN/DailyMail and Multi-News.
Training
The "Mixed & Stochastic" training method includes:
- Training on both C4 and HugeNews, with dataset mixtures weighted by the number of examples.
- Extended training for 1.5 million steps instead of the standard 500k, since pretraining perplexity converges more slowly.
- Uniform sampling of a gap sentence ratio between 15% and 45%.
- Sampling of important sentences with 20% uniform noise applied to their importance scores (see the sketch after this list).
- Updating of the SentencePiece tokenizer to encode newline characters, improving paragraph segmentation.
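The following is a minimal sketch, not the original training code, of how this stochastic gap-sentence selection could work. The sentence list, the importance scores, and the multiplicative reading of "20% uniform noise" are assumptions for illustration; in the paper, importance is scored with ROUGE against the rest of the document.

import random

def select_gap_sentences(sentences, importance_scores):
    # Sample the gap-sentence ratio uniformly between 15% and 45%.
    gsr = random.uniform(0.15, 0.45)
    n_gaps = max(1, round(gsr * len(sentences)))
    # Apply 20% uniform noise to the importance scores (one plausible
    # reading: scale each score by a factor drawn from [0.8, 1.2]).
    noisy = [s * random.uniform(0.8, 1.2) for s in importance_scores]
    # Mask the highest-scoring sentences as gap sentences.
    ranked = sorted(range(len(sentences)), key=lambda i: noisy[i], reverse=True)
    return sorted(ranked[:n_gaps])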
Guide: Running Locally
To run PEGASUS locally:
- Set Up Environment: Install Python and create a virtual environment.
- Install Dependencies: Use pip to install PyTorch and Hugging Face Transformers:
pip install torch transformers
- Download Model: Download PEGASUS via the Hugging Face model hub:
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

model_name = "google/pegasus-multi_news"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)
- Run Inference: Process text data and generate summaries:
text = "Your input text here."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
summary_ids = model.generate(
    inputs["input_ids"], num_beams=4, max_length=60, early_stopping=True
)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)
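Since this checkpoint is fine-tuned on Multi-News, a multi-document dataset, joining several source articles into one input the way the dataset does may improve results. The "|||||" separator below reflects the Multi-News data format; treat the exact spacing as an assumption and verify against your copy of the dataset.

# Hypothetical multi-document input, joined with the Multi-News-style
# "|||||" separator between source articles.
articles = ["Text of the first article...", "Text of the second article..."]
text = " ||||| ".join(articles)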
For faster inference, consider running on a GPU, either locally or via cloud services such as AWS, GCP, or Azure.
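A minimal sketch of moving the model and inputs onto a GPU when one is available, continuing from the snippet above:

import torch

# Select a GPU if available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Tokenized inputs must live on the same device as the model.
inputs = tokenizer(text, return_tensors="pt", truncation=True).to(device)
summary_ids = model.generate(
    inputs["input_ids"], num_beams=4, max_length=60, early_stopping=True
)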
License
PEGASUS is released under the Apache 2.0 License, which allows users to freely use, modify, and distribute the software with proper attribution to the original authors.