distilbart-tos-summarizer-tosdr

ml6team

Introduction

The distilbart-tos-summarizer-tosdr model summarizes Terms and Conditions (T&C) documents. It is fine-tuned from sshleifer/distilbart-cnn-6-6 and serves as the abstractive stage of an end-to-end pipeline: Latent Semantic Analysis (LSA) first selects the most relevant sentences (extractive summarization), and this model then rewrites them into a shorter summary.

Architecture

The model uses the distilbart-6-6 architecture, a distilled version of the BART transformer with six encoder and six decoder layers. It is fine-tuned on a dataset from TOSDR to summarize T&C documents. Because T&C documents are often longer than the input a BART-style model accepts, the pipeline first applies extractive summarization to shorten the text, and the model then performs abstractive summarization on the result.
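The extractive stage described above can be sketched with sumy's LSA summarizer. This is a minimal illustration, not the authors' exact preprocessing: the function name `lsa_extract` and the default of 10 sentences are assumptions, and sumy's English tokenizer additionally requires the NLTK punkt data to be downloaded.

```python
def lsa_extract(text: str, sentence_count: int = 10, language: str = "english") -> str:
    """Return up to `sentence_count` sentences selected by LSA, joined into one string.

    `sentence_count` is an illustrative default, not a value from the model card.
    """
    if not text.strip():
        return ""  # nothing to summarize

    # Imported here so the function can be defined before sumy is installed.
    from sumy.nlp.stemmers import Stemmer
    from sumy.nlp.tokenizers import Tokenizer
    from sumy.parsers.plaintext import PlaintextParser
    from sumy.summarizers.lsa import LsaSummarizer
    from sumy.utils import get_stop_words

    # Split the raw text into sentences and words.
    parser = PlaintextParser.from_string(text, Tokenizer(language))

    # Rank sentences with LSA, ignoring common stop words.
    summarizer = LsaSummarizer(Stemmer(language))
    summarizer.stop_words = get_stop_words(language)
    selected = summarizer(parser.document, sentence_count)

    return " ".join(str(sentence) for sentence in selected)
```

The returned string is what would be handed to the abstractive model in place of the full document.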

Training

The model was fine-tuned on the TOSDR dataset, which provides structured data for training summarization models. Extractive summarization served as a preprocessing step, and the distilbart-6-6 model was then fine-tuned to produce the abstractive summaries.

Guide: Running Locally

  1. Install Dependencies: Ensure you have Python installed along with the required libraries. You will need the transformers library (with a backend such as PyTorch) for the model and sumy for extractive summarization.

  2. Load the Model:

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    
    tokenizer = AutoTokenizer.from_pretrained("ml6team/distilbart-tos-summarizer-tosdr")
    model = AutoModelForSeq2SeqLM.from_pretrained("ml6team/distilbart-tos-summarizer-tosdr")
    
  3. Run Summarization: First shorten your input text with extractive summarization (LSA, for example via sumy), then pass the shortened text to the tokenizer and model loaded in step 2 to generate the abstractive summary.

  4. Hardware Suggestions: For efficient processing, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure, which offer scalable resources for inference tasks.
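The abstractive part of steps 2 and 3 might look like the sketch below. It is an illustration under stated assumptions rather than the repository's own code: the helper names `load_model` and `generate_summary`, the beam count, and the summary length are hypothetical choices, and the 1024-token input cap reflects the usual BART input limit.

```python
MODEL_ID = "ml6team/distilbart-tos-summarizer-tosdr"


def load_model(model_id: str = MODEL_ID):
    """Load the tokenizer and model from the Hugging Face Hub (downloads on first use)."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return tokenizer, model


def generate_summary(
    text: str,
    tokenizer,
    model,
    max_input_tokens: int = 1024,   # assumed input cap for a BART-style model
    max_summary_tokens: int = 128,  # illustrative value, not from the model card
    num_beams: int = 4,             # illustrative value, not from the model card
) -> str:
    """Generate an abstractive summary of `text` with the given tokenizer and model."""
    # Truncate so that over-long inputs do not exceed the model's input limit.
    inputs = tokenizer(
        text,
        truncation=True,
        max_length=max_input_tokens,
        return_tensors="pt",
    )
    # Beam search over the decoder; returns a batch of token-id sequences.
    output_ids = model.generate(
        **inputs,
        num_beams=num_beams,
        max_length=max_summary_tokens,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

A typical call would be `tokenizer, model = load_model()` followed by `generate_summary(shortened_text, tokenizer, model)`, where `shortened_text` is the output of the extractive step. Keeping loading separate from generation lets the model be loaded once and reused for many documents.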

License

The model and code are subject to the licensing terms of the Hugging Face repository and any associated datasets or libraries used. Always review the license before use to ensure compliance.
