Longformer-base Plagiarism Detection

jpwahle

Introduction

The Longformer-base plagiarism detection model identifies machine-paraphrased plagiarism in text. It uses the Longformer architecture and was trained on the Machine-Paraphrased Plagiarism Dataset. It is reported to detect, with high accuracy, text that has been paraphrased by tools such as SpinBot and SpinnerChief.

Architecture

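The model is based on the Longformer architecture, a transformer model optimized for processing long documents. Instead of the full self-attention of the standard transformer, in which every token attends to every other token, Longformer combines local windowed attention with global attention on a small number of selected tokens. The cost of attention therefore grows roughly linearly with sequence length rather than quadratically, which makes long documents practical to process.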
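
To make the attention pattern concrete, the following is a minimal sketch in plain Python that builds a boolean mask combining a local window with one global token. The function name `longformer_style_mask`, the window size, and the choice of position 0 as the global token are illustrative assumptions, not the library's actual implementation.

```python
def longformer_style_mask(seq_len, window, global_positions=()):
    """Return a seq_len x seq_len matrix of booleans.

    mask[i][j] is True when query position i may attend to key position j.
    Hypothetical helper for illustration only.
    """
    half = window // 2
    # Local windowed attention: each token sees neighbours within `half` positions.
    mask = [[abs(i - j) <= half for j in range(seq_len)] for i in range(seq_len)]
    # Global attention: selected tokens see everything and are seen by everything.
    for g in set(global_positions):
        for k in range(seq_len):
            mask[g][k] = True
            mask[k][g] = True
    return mask


mask = longformer_style_mask(seq_len=8, window=2, global_positions=[0])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row of the printed grid is one query token. Apart from the global row and column, the number of allowed positions per row stays constant as the sequence grows, which is where the efficiency gain comes from.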

Training

The Longformer-base model was trained on the Machine-Paraphrased Plagiarism Dataset. The associated evaluation compared various pre-trained word embedding models and machine learning classifiers. Among the approaches compared, Longformer performed best, achieving an average F1 score of 80.99%.
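
For readers unfamiliar with the metric, here is a small sketch of how an F1 score is computed from classification counts. The counts below are made-up numbers chosen for easy arithmetic; they are not taken from the model's evaluation, and how the reported average was aggregated is not specified here.

```python
def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall, computed from raw counts."""
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 80 paraphrased texts caught, 20 false alarms, 20 missed.
example_f1 = f1_score(true_pos=80, false_pos=20, false_neg=20)
print(f"F1 = {example_f1:.2%}")
```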

Guide: Running Locally

To run the Longformer-base plagiarism detection model locally:

  1. Install the Transformers Library: Ensure that the transformers library by Hugging Face is installed in your Python environment, together with PyTorch, which the model classes below depend on.

    pip install transformers torch
    
  2. Import and Load Model: Use the following code to load the model and tokenizer.

    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    
    model = AutoModelForSequenceClassification.from_pretrained("jpelhaw/longformer-base-plagiarism-detection")
    tokenizer = AutoTokenizer.from_pretrained("jpelhaw/longformer-base-plagiarism-detection")
    
  3. Tokenize Input: Prepare your input text for the model.

    text = "Your input text here."
    # Call the tokenizer directly so it returns input_ids and attention_mask as PyTorch tensors
    tokens = tokenizer(text, return_tensors="pt", truncation=True)
    
  4. Inference: Run the model to detect plagiarism.

    result = model(**tokens)
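
The model output holds raw logits in `result.logits`, one score per class, which still need to be turned into a prediction. The sketch below shows that step in plain Python on made-up logits. The label names are assumptions for illustration; the real mapping should be read from `model.config.id2label`. With the actual tensor output, `torch.softmax(result.logits, dim=-1)` does the same conversion.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a single input text (not real model output).
example_logits = [-1.2, 2.3]
# Assumed label names; check model.config.id2label for the real mapping.
id2label = {0: "original", 1: "machine-paraphrased"}

probs = softmax(example_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(f"{id2label[best]} (p = {probs[best]:.3f})")
```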
    

The model runs on a CPU, but inference on long documents is considerably faster on a GPU. If you do not have one locally, consider cloud GPUs from providers such as AWS, Google Cloud, or Azure.

License

The model and its associated resources are available under the license specified by their creators, which typically permits use in academic and research settings. Review and comply with the license terms stated by the model authors and on Hugging Face before using the model.
