PEGASUS-XSum

Google

Introduction

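PEGASUS-XSum is Google's PEGASUS model fine-tuned on the XSum dataset for abstractive summarization. PEGASUS is pre-trained with gap sentence generation: whole sentences are removed from a document and the model learns to generate them from the remaining text, an objective that closely resembles summarization itself.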

Architecture

PEGASUS uses a standard Transformer encoder-decoder architecture; what sets it apart is the pre-training objective. The model is pre-trained on large corpora such as C4 and HugeNews with gap sentence generation: the most important sentences of a document are masked in the input, and the decoder is trained to generate them. The "Mixed & Stochastic" variant refines this recipe by sampling the important sentences with added noise and by updating the sentencepiece tokenizer so that it can encode newline characters.
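
To make the gap sentence idea concrete, here is a minimal sketch in plain Python. It assumes that a sentence's importance is its ROUGE-1 F1 overlap with the rest of the document, that sentences can be split on end punctuation, and that a literal `<mask_1>` string stands in for the mask token. The function names (`split_sentences`, `rouge1_f1`, `make_gsg_example`) are hypothetical and this is not the actual PEGASUS pre-training pipeline.

```python
import re
from collections import Counter

MASK = "<mask_1>"  # placeholder mask string, used here for illustration only


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between two strings (a simplified ROUGE-1)."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def make_gsg_example(text, ratio=0.3):
    """Mask the highest-scoring sentences; return (masked source, target)."""
    sents = split_sentences(text)
    # Score each sentence against the rest of the document.
    scores = [
        rouge1_f1(s, " ".join(sents[:i] + sents[i + 1:]))
        for i, s in enumerate(sents)
    ]
    k = max(1, round(ratio * len(sents)))
    ranked = sorted(range(len(sents)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:k])
    source = " ".join(MASK if i in chosen else s for i, s in enumerate(sents))
    target = " ".join(s for i, s in enumerate(sents) if i in chosen)
    return source, target
```

Calling `make_gsg_example(document, ratio=0.3)` returns the masked document and the concatenated gap sentences, which play the roles of encoder input and decoder target in this simplified picture.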

Training

The "Mixed & Stochastic" variant is pre-trained for 1.5 million steps rather than the 500,000 steps of the original configuration, because its pre-training perplexity converges more slowly. For each example, the gap sentence ratio is sampled uniformly between 15% and 45%, and 20% uniform noise is added to the sentence importance scores before the gap sentences are chosen. The resulting checkpoints are fine-tuned on XSum, CNN/DailyMail, and other summarization datasets, with reported improvements on most of them in ROUGE, the standard n-gram overlap metric for evaluating summaries.
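
The stochastic selection described above can be sketched as follows. The text does not say exactly how the 20% noise is applied, so this sketch assumes multiplicative uniform noise in the range [-20%, +20%]; the function name `sample_gap_indices` and its parameters are hypothetical.

```python
import random


def sample_gap_indices(scores, rng, min_ratio=0.15, max_ratio=0.45, noise=0.20):
    """Pick gap-sentence indices from importance scores with randomness.

    Assumption: noise is multiplicative, i.e. each score is scaled by a
    factor drawn uniformly from [1 - noise, 1 + noise].
    """
    # Sample the fraction of sentences to mask for this document.
    ratio = rng.uniform(min_ratio, max_ratio)
    k = max(1, round(ratio * len(scores)))
    # Perturb the scores so the top-k set varies between passes.
    noisy = [s * (1.0 + rng.uniform(-noise, noise)) for s in scores]
    ranked = sorted(range(len(scores)), key=lambda i: noisy[i], reverse=True)
    return sorted(ranked[:k]), ratio


rng = random.Random(0)  # fixed seed so the sketch is reproducible
indices, ratio = sample_gap_indices([0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.3], rng)
print(f"ratio={ratio:.2f}, masked sentence indices={indices}")
```

With noise, sentences whose scores are close to each other can swap ranks from one pass to the next, while a sentence that dominates the others is still selected every time.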

Guide: Running Locally

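  1. Set Up Environment: Install Python, then the Hugging Face Transformers library together with PyTorch and sentencepiece, which the PEGASUS tokenizer requires.

    pip install transformers sentencepiece torch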
    
  2. Download Model: Use the Transformers library to load the PEGASUS-XSum model.

    from transformers import PegasusForConditionalGeneration, PegasusTokenizer
    model_name = "google/pegasus-xsum"
    tokenizer = PegasusTokenizer.from_pretrained(model_name)
    model = PegasusForConditionalGeneration.from_pretrained(model_name)
    
  3. Run Inference: Prepare your text input and generate a summary.

    input_text = "Your input text here."
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, padding="longest")
    # Pass the full encoding so the attention mask is used alongside input_ids
    summary_ids = model.generate(**inputs)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    print(summary)
    
  4. Cloud GPUs: For large-scale summarization tasks or faster performance, consider using cloud GPU providers like AWS, Google Cloud, or Azure.
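
For larger workloads, the inference step can be batched and moved to a GPU. The sketch below assumes a `tokenizer` and `model` loaded as in step 2; the helper names `batched` and `summarize_all` and the default batch size are hypothetical choices, not part of the Transformers library.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def summarize_all(texts, tokenizer, model, batch_size=8, device="cpu"):
    """Summarize `texts` batch by batch; `model` must already be on `device`."""
    summaries = []
    for batch in batched(texts, batch_size):
        # Tokenize the batch and move the tensors to the model's device.
        inputs = tokenizer(
            batch, return_tensors="pt", truncation=True, padding="longest"
        ).to(device)
        summary_ids = model.generate(**inputs)
        summaries.extend(
            tokenizer.batch_decode(summary_ids, skip_special_tokens=True)
        )
    return summaries


# Example usage (requires torch and the model from step 2):
#   import torch
#   device = "cuda" if torch.cuda.is_available() else "cpu"
#   model = model.to(device)
#   summaries = summarize_all(documents, tokenizer, model, device=device)
```

The right batch size depends on input length and available GPU memory, so treat the default of 8 as a starting point to tune.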

License

The PEGASUS model is released under a license specified by Google Research. Users should check the model's page on Hugging Face for detailed licensing information.
