bart paraphrase

eugenesiow

Introduction

The BART Paraphrase model is a large sequence-to-sequence (seq2seq) model designed for text generation tasks, particularly paraphrasing. It is based on the BART architecture, which combines a bidirectional encoder and a left-to-right decoder, making it effective for tasks like natural language generation, translation, and comprehension.

Architecture

The BART model utilizes a standard seq2seq architecture, featuring:

  • A bidirectional encoder similar to BERT.
  • A left-to-right decoder similar to GPT.
  • A pretraining task that includes shuffling the order of sentences and using a novel in-filling scheme where text spans are replaced with a single mask token. This model has been fine-tuned on three paraphrase datasets: Quora, PAWS, and MSR paraphrase corpus.

Training

The BART Paraphrase model was fine-tuned on the pretrained facebook/bart-large model using datasets such as Quora, PAWS, and MSR paraphrase corpus. The training followed the procedure outlined in the simpletransformers seq2seq example, which involves leveraging the capabilities of the pre-trained BART model to enhance paraphrasing tasks.

Guide: Running Locally

To run the BART Paraphrase model locally:

  1. Install Dependencies: Ensure PyTorch and Hugging Face Transformers library are installed.
  2. Import Necessary Libraries:
    import torch
    from transformers import BartForConditionalGeneration, BartTokenizer
    
  3. Define Input Sentence:
    input_sentence = "They were there to enjoy us and they were there to pray for us."
    
  4. Load Model and Tokenizer:
    model = BartForConditionalGeneration.from_pretrained('eugenesiow/bart-paraphrase')
    tokenizer = BartTokenizer.from_pretrained('eugenesiow/bart-paraphrase')
    
  5. Prepare Input and Generate Output:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    batch = tokenizer(input_sentence, return_tensors='pt')
    generated_ids = model.generate(batch['input_ids'])
    generated_sentence = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    print(generated_sentence)
    

Consider using cloud GPUs from providers like AWS, Google Cloud, or Azure if local GPU resources are insufficient.

License

The BART Paraphrase model is licensed under the Apache 2.0 license, which allows for widespread use with minimal restrictions.

More Related APIs in Text2text Generation