Longformer-base Plagiarism Detection
jpwahle

Introduction
The Longformer-base model for plagiarism detection is designed to identify machine-paraphrased plagiarism in text. This model utilizes the Longformer architecture and has been specifically trained on the Machine-Paraphrased Plagiarism Dataset. It achieves high accuracy in detecting paraphrased text produced by tools like SpinBot and SpinnerChief.
Architecture
The model is based on the Longformer architecture, a transformer model optimized for processing long documents. It extends the standard transformer architecture by using a combination of local windowed attention and global attention, allowing efficient handling of documents with extended lengths.
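The combination of local windowed and global attention can be illustrated with a small mask-building sketch. This is a simplified, dependency-light illustration of the attention pattern, not the library's actual implementation; the window size and global positions below are arbitrary choices.

```python
import numpy as np

def longformer_attention_mask(seq_len, window, global_positions):
    """Build a boolean attention mask: each token attends to neighbours
    within +/- window//2 (local windowed attention), while tokens in
    global_positions attend to, and are attended by, every position."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    half = window // 2
    for i in range(seq_len):
        lo, hi = max(0, i - half), min(seq_len, i + half + 1)
        mask[i, lo:hi] = True          # local sliding window
    for g in global_positions:
        mask[g, :] = True              # global token attends everywhere
        mask[:, g] = True              # everyone attends to the global token
    return mask

# 16 tokens, window of 4, with one global token (e.g. a [CLS]-like position):
mask = longformer_attention_mask(seq_len=16, window=4, global_positions=[0])
print(mask.sum(), "attended pairs vs.", 16 * 16, "for full attention")
```

Because the local window has fixed width, the number of attended pairs grows roughly linearly with sequence length rather than quadratically, which is what makes long documents tractable.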
Training
The Longformer-base model was trained using the Machine-Paraphrased Plagiarism Dataset. The training involved evaluating the effectiveness of various pre-trained word embedding models and machine learning classifiers. The Longformer model demonstrated superior performance, achieving an average F1 score of 80.99%.
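As a reminder of what the reported F1 score measures, here is the standard computation from confusion-matrix counts. The counts below are made up for illustration and are not the paper's actual evaluation numbers.

```python
def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)   # fraction of flagged texts that are plagiarized
    recall = tp / (tp + fn)      # fraction of plagiarized texts that are flagged
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for one evaluation split (illustrative only):
score = f1_score(tp=820, fp=200, fn=190)
print(round(score, 4))
```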
Guide: Running Locally
To run the Longformer-base plagiarism detection model locally:
- Install the Transformers Library: Ensure that the transformers library by Hugging Face is installed in your Python environment.

  pip install transformers
- Import and Load Model: Use the following code to load the model and tokenizer.

  from transformers import AutoModelForSequenceClassification, AutoTokenizer

  model = AutoModelForSequenceClassification.from_pretrained("jpelhaw/longformer-base-plagiarism-detection")
  tokenizer = AutoTokenizer.from_pretrained("jpelhaw/longformer-base-plagiarism-detection")
- Tokenize Input: Prepare your input text for the model. Calling the tokenizer directly (rather than tokenizer.tokenize, which only returns token strings) produces the tensor inputs the model expects.

  text = "Your input text here."
  tokens = tokenizer(text, add_special_tokens=True, return_tensors="pt")

- Inference: Run the model to detect plagiarism. The predicted class is the index of the largest logit.

  result = model(**tokens)
  prediction = result.logits.argmax(dim=-1)
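The model's output contains raw logits; a minimal, dependency-free sketch of turning a two-class logit pair into a label and confidence follows. The label names and their order here are assumptions for illustration; check the model's config.id2label mapping on Hugging Face for the authoritative names.

```python
import math

def label_from_logits(logits, labels=("original", "machine-paraphrased")):
    """Softmax a pair of class logits and return the most likely label
    with its probability. The label ordering is an assumption."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]       # shift for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = max(range(len(probs)), key=probs.__getitem__)
    return labels[idx], probs[idx]

# Example with made-up logits, as might come from result.logits[0].tolist():
print(label_from_logits([-1.2, 2.3]))
```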
For optimal performance, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.
License
The model and its associated resources are available under the license specified by its creators, which typically permits use in academic and research settings. Be sure to review and comply with the license terms listed on the model's Hugging Face page before use.