autonlp Gibberish Detector 492513457
madhurjindal
Introduction
This project provides a gibberish detector for English text. It is aimed at systems that rely on free-text input, such as chatbots, where it classifies each input as gibberish or non-gibberish so that nonsensical input can be filtered out before it degrades accuracy or the user experience.
Architecture
The gibberish detector is based on the DistilBERT model, fine-tuned with AutoTrain for text classification. It treats the task as multi-class classification, assigning each input text to one of four classes according to the level of gibberish detected.
Training
The model was trained using AutoNLP (the earlier name of AutoTrain), with attention to CO2 emissions: the training run emitted a total of 5.53 grams. On the validation set, the model reached an accuracy of 97.36% and a weighted F1 score of 97.36%.
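Since the validation results are reported as accuracy and weighted F1, it may help to see how a weighted F1 score is assembled from per-class results. The following is a minimal sketch in plain Python; the label strings and the toy predictions are invented for illustration and are not the model's validation data.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting each class by its support."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * count / total  # weight by the class's share of the data
    return score


# Toy data (illustrative only, not taken from the model's validation set).
y_true = ["clean", "clean", "noise", "noise", "word salad", "mild gibberish"]
y_pred = ["clean", "clean", "noise", "clean", "word salad", "mild gibberish"]
print(round(weighted_f1(y_true, y_pred), 4))
```

Because each class contributes in proportion to its support, weighted F1 can sit close to accuracy when classes are predicted about equally well, which is consistent with the two figures reported above being the same.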
Label Description
The model categorizes text into four classes:
- Noise: Strings whose individual words carry no meaning on their own.
- Word Salad: Words are meaningful individually but nonsensical collectively.
- Mild Gibberish: Contains grammatical or syntactical errors.
- Clean: Meaningful and coherent sentences.
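One way the four labels could feed an input filter, such as the chatbot use case mentioned in the introduction, is sketched below. The label strings, the `min_confidence` threshold, and the choice to let mildly garbled text through are all assumptions made for illustration; the model card does not prescribe a filtering policy.

```python
# Hypothetical label names; check the model's own label mapping before relying on them.
ACCEPTED_LABELS = {"clean", "mild gibberish"}


def accept_input(scores, min_confidence=0.5):
    """Decide whether to pass text downstream, given per-label probabilities.

    `scores` maps label name -> probability. The text is accepted when the
    most likely label is in ACCEPTED_LABELS and its probability reaches
    `min_confidence`.
    """
    top_label = max(scores, key=scores.get)
    return top_label in ACCEPTED_LABELS and scores[top_label] >= min_confidence


# Made-up probability vectors for two inputs.
coherent = {"clean": 0.91, "mild gibberish": 0.06, "word salad": 0.02, "noise": 0.01}
garbage = {"clean": 0.05, "mild gibberish": 0.10, "word salad": 0.15, "noise": 0.70}

print(accept_input(coherent))
print(accept_input(garbage))
```

A stricter application could drop "mild gibberish" from the accepted set or raise the threshold; the right trade-off depends on how costly it is to reject a legitimate but poorly typed message.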
Guide: Running Locally
Basic Steps
- Install Dependencies: Ensure Python and PyTorch are installed. Use pip to install the Transformers library:
    pip install transformers torch
- Load Model and Tokenizer: Use the transformers library to load the model and tokenizer:
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    model = AutoModelForSequenceClassification.from_pretrained("madhurjindal/autonlp-Gibberish-Detector-492513457")
    tokenizer = AutoTokenizer.from_pretrained("madhurjindal/autonlp-Gibberish-Detector-492513457")
- Inference: Tokenize input text and perform inference:
    inputs = tokenizer("Your input text here", return_tensors="pt")
    outputs = model(**inputs)
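- Prediction: Use the softmax function to determine probabilities and classify the input:
    import torch  # needed for torch.argmax below
    import torch.nn.functional as F
    # Convert logits to per-class probabilities
    probs = F.softmax(outputs.logits, dim=-1)
    # Index of the most probable class for the single input
    predicted_index = torch.argmax(probs, dim=1).item()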
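To make the prediction step concrete without downloading the model, the sketch below reproduces softmax and argmax in plain Python on made-up logits. The logit values and the index-to-label order are assumptions for illustration; with the real model, the mapping would normally be read from `model.config.id2label` rather than hard-coded.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed label order, for illustration only.
id2label = {0: "clean", 1: "mild gibberish", 2: "noise", 3: "word salad"}

# Made-up values standing in for one row of outputs.logits.
logits = [0.2, 0.1, 3.5, 0.4]

probs = softmax(logits)
# argmax: the index with the highest probability
predicted_index = max(range(len(probs)), key=probs.__getitem__)
print(id2label[predicted_index], round(probs[predicted_index], 3))
```

Softmax preserves the ordering of the logits, so taking the argmax of the logits directly gives the same predicted class; the probabilities matter mainly when a confidence threshold is applied.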
Suggested Cloud GPUs
Consider using cloud services such as AWS, Google Cloud Platform, or Azure for access to high-performance GPUs, which can accelerate the model inference process.
License
The gibberish detector is released under the MIT License, which permits broad use and modification of the codebase, provided the copyright and license notice are retained.