google/pegasus-newsroom
Introduction
PEGASUS is a model designed for abstractive summarization tasks. It was developed by Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu, and it utilizes a pre-training approach with extracted gap-sentences. The model is available on Hugging Face, and the original TensorFlow code is hosted on GitHub.
Architecture
PEGASUS employs a mixed and stochastic approach for training. The model is trained on datasets like C4 and HugeNews, using a weighted mixture based on the number of examples in each dataset. Key features include:
- Uniformly sampled gap sentence ratios between 15% and 45%.
- Stochastic sampling of important sentences with a 20% uniform noise to importance scores.
- An updated SentencePiece tokenizer capable of encoding newline characters.
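The gap-sentence sampling described above can be sketched in plain Python. This is a simplified illustration only: the importance scores and the greedy top-k selection here are assumptions, not the paper's exact sentence-scoring procedure.

```python
import random

def select_gap_sentences(sentences, scores, seed=0):
    """Pick sentence indices to mask: the gap-sentence ratio (GSR) is
    sampled uniformly in [0.15, 0.45], and 20% uniform noise is added
    to the importance scores before ranking (simplified sketch)."""
    rng = random.Random(seed)
    gsr = rng.uniform(0.15, 0.45)                 # uniformly sampled GSR
    n_gap = max(1, round(gsr * len(sentences)))
    noisy = [s * (1 + rng.uniform(-0.2, 0.2)) for s in scores]  # 20% noise
    ranked = sorted(range(len(sentences)), key=lambda i: noisy[i], reverse=True)
    return sorted(ranked[:n_gap])                 # indices of masked sentences

sentences = ["First sentence.", "Second sentence.", "Third sentence.",
             "Fourth sentence.", "Fifth sentence.", "Sixth sentence."]
scores = [0.9, 0.1, 0.7, 0.3, 0.5, 0.2]          # hypothetical importance scores
gap_idx = select_gap_sentences(sentences, scores)
```

During pre-training, the selected sentences are replaced by a mask token and the model learns to generate them, which is what makes the objective a good proxy for summarization.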
Training
The training process uses both the C4 and HugeNews datasets, with the model trained for 1.5 million steps instead of the original 500,000, allowing pre-training perplexity to converge further. The model's performance across various downstream datasets is quantified using metrics such as ROUGE scores.
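ROUGE measures n-gram overlap between a generated summary and a reference. Below is a minimal ROUGE-1 F1 computation as a simplified sketch; real evaluations typically use a library such as `rouge_score`, which adds stemming and further variants (ROUGE-2, ROUGE-L).

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """ROUGE-1 F1: clipped unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())   # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat", "the cat sat on the mat")  # ≈ 0.667
```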
Guide: Running Locally
- Setup Environment: Ensure you have Python and PyTorch installed. Optionally, set up a virtual environment for package management.
- Install Transformers Library: Run:

  ```shell
  pip install transformers
  ```

- Download Model: Use the Hugging Face model hub to download PEGASUS by executing:
  ```python
  from transformers import PegasusTokenizer, PegasusForConditionalGeneration

  tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-newsroom")
  model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-newsroom")
  ```
- Inference: Tokenize your input text and generate summaries using the model:
  ```python
  text = "Your input text here."
  inputs = tokenizer(text, return_tensors="pt", truncation=True)
  summary_ids = model.generate(inputs["input_ids"])
  print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
  ```
- GPU Recommendation: For faster inference, consider using cloud GPU services like AWS, Google Cloud, or Azure.
License
The PEGASUS model and its associated code are subject to the licensing terms specified by the authors and maintainers. Refer to the original GitHub repository and the Hugging Face model page for specific licensing details.