bert-tiny-finetuned-enron-spam-detection (mrm8488)
Introduction
The bert-tiny-finetuned-enron-spam-detection model is a fine-tuned version of BERT-Tiny, tailored for spam detection using the SetFit/enron_spam dataset. It classifies email text as spam or ham (not spam) with high precision and recall.
Architecture
The model is based on the Google BERT-Tiny architecture, which is a smaller variant of the original BERT model. It uses only 2 layers with a hidden size of 128 and 2 attention heads, making it efficient for tasks with limited computational resources.
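These dimensions can be verified from the published checkpoint's configuration. The snippet below is a minimal sketch, assuming the Transformers library is installed and the config is pulled directly from the Hugging Face Hub.

```python
from transformers import AutoConfig

# Load the configuration of the fine-tuned checkpoint from the Hugging Face Hub
config = AutoConfig.from_pretrained("mrm8488/bert-tiny-finetuned-enron-spam-detection")

# BERT-Tiny dimensions described above
print(config.num_hidden_layers)    # 2 layers
print(config.hidden_size)          # hidden size 128
print(config.num_attention_heads)  # 2 attention heads
```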
Training
The model was trained using the following hyperparameters:
- Learning Rate: 2e-05
- Train Batch Size: 16
- Evaluation Batch Size: 32
- Seed: 42
- Optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- Learning Rate Scheduler: Linear
- Number of Epochs: 4
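The original training script is not part of the model card; the sketch below shows one way these hyperparameters could be expressed with the Hugging Face Trainer API. The base checkpoint google/bert_uncased_L-2_H-128_A-2, the max_length of 128, and the column and split names of SetFit/enron_spam are assumptions, not details from the card.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)

# Assumed BERT-Tiny base checkpoint (2 layers, hidden size 128, 2 heads)
base = "google/bert_uncased_L-2_H-128_A-2"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)

# SetFit/enron_spam; "text"/"label" columns and "train"/"test" splits are assumed
ds = load_dataset("SetFit/enron_spam")

def tokenize(batch):
    # max_length=128 is an assumption, not a documented training detail
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

ds = ds.map(tokenize, batched=True)

# Hyperparameters as listed above
args = TrainingArguments(
    output_dir="bert-tiny-finetuned-enron-spam-detection",  # placeholder
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    seed=42,
    num_train_epochs=4,
    lr_scheduler_type="linear",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
)
trainer.train()
```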
The training was conducted using the Enron spam dataset, and the model achieved the following results on the evaluation set:
- Loss: 0.0593
- Precision: 0.9851
- Recall: 0.9871
- Accuracy: 0.986
- F1 Score: 0.9861
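These figures correspond to standard binary classification metrics. A minimal sketch of a compute_metrics function that could report them when passed to the Trainer is shown below; the use of scikit-learn here is an assumption, not taken from the card.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair supplied by the Trainer
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary"
    )
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy_score(labels, preds),
    }
```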
Guide: Running Locally
To run the model locally, follow these steps:
- Install Dependencies: Ensure you have the following Python packages installed (one possible pip invocation is shown below):
  - Transformers 4.23.1
  - PyTorch 1.12.1+cu113
  - Datasets 2.6.1
  - Tokenizers 0.13.1
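  A hedged installation sketch (pip; the +cu113 build of PyTorch assumes a CUDA 11.3 environment):

  ```bash
  pip install transformers==4.23.1 datasets==2.6.1 tokenizers==0.13.1
  pip install torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
  ```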
- Clone the Repository: Clone the model repository from Hugging Face to your local machine.
- Load the Model: Use the Transformers library to load the model and tokenizer:
  ```python
  from transformers import BertTokenizer, BertForSequenceClassification

  tokenizer = BertTokenizer.from_pretrained('mrm8488/bert-tiny-finetuned-enron-spam-detection')
  model = BertForSequenceClassification.from_pretrained('mrm8488/bert-tiny-finetuned-enron-spam-detection')
  ```
- Inference: Prepare your input text, tokenize it, and run inference through the model (a hedged sketch follows).
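  A minimal sketch, reusing the tokenizer and model loaded in the previous step; the example email is hypothetical, and the spam/ham meaning of each label id should be confirmed via model.config.id2label:

  ```python
  import torch

  # Hypothetical example email
  text = "Congratulations! You have won a free prize. Click here to claim it."

  # Tokenize and run a forward pass without gradient tracking
  inputs = tokenizer(text, return_tensors="pt", truncation=True)
  with torch.no_grad():
      logits = model(**inputs).logits

  # Convert logits to a predicted class id and look up its label name
  pred = logits.argmax(dim=-1).item()
  print(pred, model.config.id2label[pred])
  ```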
For improved performance, consider using cloud GPU services such as AWS EC2, Google Cloud, or Azure.
License
The model is licensed under the Apache-2.0 license. This allows for both personal and commercial use, modification, and distribution, with proper credit to the original authors.