distilbart-tos-summarizer-tosdr (ml6team)
Introduction
The distilbart-tos-summarizer-tosdr model is designed for summarizing Terms and Conditions (T&C) documents. It is based on the sshleifer/distilbart-cnn-6-6 model and is part of an end-to-end pipeline that applies extractive summarization using Latent Semantic Analysis (LSA) before this abstractive summarization model.
Architecture
The model uses the distilbart-6-6 architecture, a distilled version of the BART transformer model. It is fine-tuned on a dataset from ToS;DR ("Terms of Service; Didn't Read"), focusing on summarizing T&C documents. The pipeline first applies extractive summarization to reduce the input size before the model performs abstractive summarization.
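The two-stage idea can be sketched without any model downloads. Note that the actual pipeline uses LSA (via the sumy library); the word-frequency scoring below is a simplified stand-in, shown only to illustrate the shape of the extractive step (score sentences, keep the top few in their original order):

```python
import re
from collections import Counter

def extractive_summary(text: str, sentences_count: int = 3) -> str:
    """Pick the top-scoring sentences by average word frequency.

    This is a simplified stand-in for the LSA step; the real pipeline
    uses sumy's LsaSummarizer instead of frequency scoring.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    # Rank sentences by score, then restore original document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:sentences_count])
    return " ".join(sentences[i] for i in keep)
```

The key design point carried over from the pipeline: extraction only selects existing sentences (reducing input length), while the abstractive model afterwards rewrites them into a fluent summary.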
Training
The model was fine-tuned on the TOSDR dataset, which provides structured data for training summarization models. The training process used extractive summarization as a preprocessing step, followed by fine-tuning the distilbart-6-6 model for abstractive summarization.
Guide: Running Locally
- Install Dependencies: Ensure you have Python installed along with the required libraries: the transformers library for the model and sumy for extractive summarization (pip install transformers sumy).
- Load the Model:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("ml6team/distilbart-tos-summarizer-tosdr")
model = AutoModelForSeq2SeqLM.from_pretrained("ml6team/distilbart-tos-summarizer-tosdr")
```

- Run Summarization: Use the provided code to perform summarization on your text input. The code sample shows how to chain extractive summarization with the abstractive model.
- Hardware Suggestions: For efficient inference, consider cloud GPUs from providers such as AWS, Google Cloud, or Azure, which offer scalable resources.
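One practical detail when chaining the two stages: BART-based models typically accept at most 1024 input tokens, so it helps to cap the extractive output before tokenization. A minimal sketch, assuming a whitespace word count as a crude proxy for subword tokens (passing truncation=True to the tokenizer remains the real safeguard):

```python
def fit_to_budget(sentences: list[str], max_tokens: int = 1024) -> str:
    """Greedily keep leading sentences until a rough token budget is hit.

    Word count is only a crude proxy for the tokenizer's subword count;
    this just avoids silently discarding large amounts of text at
    tokenization time.
    """
    kept, used = [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if used + n > max_tokens:
            break
        kept.append(sentence)
        used += n
    return " ".join(kept)
```

The trimmed string can then be passed to the tokenizer (with truncation enabled) and on to model.generate for the abstractive pass.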
License
The model and code are subject to the licensing terms of the Hugging Face repository and of any associated datasets or libraries. Review the license before use to ensure compliance.