nvidia

Quality Classifier DeBERTa

Introduction

The Quality Classifier DeBERTa is a text classification model that assigns each document one of three quality labels: "High", "Medium", or "Low". It judges documents on factors such as accuracy, clarity, coherence, grammar, depth of information, and overall usefulness. The model is used in NVIDIA's NeMo Curator as part of its qualitative filtering module.

Architecture

The model uses the DeBERTa V3 Base architecture and accepts inputs with a context length of up to 1024 tokens.

Training

Training Data

The model was trained on 22,828 text samples from Common Crawl, each labeled as "High", "Medium", or "Low" quality by human annotators. Examples include:

  • Low Quality: Texts that are poorly structured or unclear.
  • High Quality: Well-structured texts with clear and coherent content.

Evaluation

The evaluation dataset consists of 7,128 samples on which the annotators unanimously agreed on the label. The model achieved an accuracy of 82.52% on this set, with higher precision and recall for the "Medium" and "Low" classes than for "High".
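
The metrics named above can be computed as in the following minimal sketch. The helper names (`accuracy`, `precision_recall`) and the toy labels are invented for illustration only; they are not the model's actual predictions or the authors' evaluation code.

```python
from typing import Sequence, Tuple

LABELS = ("High", "Medium", "Low")


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of samples whose predicted label matches the annotated label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision_recall(
    y_true: Sequence[str], y_pred: Sequence[str], label: str
) -> Tuple[float, float]:
    """Per-class precision (TP / predicted as label) and recall (TP / actually label)."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    actual = sum(t == label for t in y_true)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


# Toy data, invented for illustration only.
y_true = ["High", "High", "Medium", "Medium", "Medium", "Low", "Low", "Low"]
y_pred = ["High", "Medium", "Medium", "Medium", "Medium", "Low", "Low", "Medium"]

print(f"accuracy: {accuracy(y_true, y_pred):.2%}")
for label in LABELS:
    p, r = precision_recall(y_true, y_pred, label)
    print(f"{label}: precision={p:.2f} recall={r:.2f}")
```

In this toy set, one of the two "High" documents is predicted as "Medium", so recall for "High" is 0.5 even though overall accuracy is 75%. This illustrates how a class can score poorly while aggregate accuracy stays high.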

Guide: Running Locally

To run the Quality Classifier DeBERTa locally, follow these steps:

  1. Install Required Libraries:
    • Ensure you have PyTorch and Hugging Face Transformers installed.
  2. Download the Model:
    • Use the Hugging Face model hub to download the pre-trained model and tokenizer.
  3. Set Up the Model:
    • Initialize the model and tokenizer with PyTorch and Transformers.
  4. Prepare Input Data:
    • Tokenize your text samples, truncating them to the model's 1024-token context length.
  5. Run Inference:
    • Use the model to predict the quality class of your documents.

For efficient processing, consider using a cloud GPU provider such as AWS, Google Cloud, or Azure to leverage NVIDIA GPUs.
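
The steps above can be sketched as follows. This is a minimal sketch under several assumptions, not the authors' reference code: the Hub repository ID `nvidia/quality-classifier-deberta`, the wrapper class `QualityModel`, and the config keys `base_model`, `fc_dropout`, and `id2label` are not given in the text above and should be checked against the model card. `pick_labels` and `classify` are hypothetical helper names.

```python
# Step 1 (shell): pip install torch transformers huggingface_hub
from typing import Dict, List, Sequence

# Assumption: Hub repository ID; verify it against the model card.
MODEL_ID = "nvidia/quality-classifier-deberta"
MAX_LENGTH = 1024  # context length stated in the Architecture section


def pick_labels(probs: Sequence[Sequence[float]], id2label: Dict[int, str]) -> List[str]:
    """Map each row of class probabilities to the label with the highest score."""
    labels = []
    for row in probs:
        best = max(range(len(row)), key=lambda i: row[i])
        labels.append(id2label[best])
    return labels


def classify(texts: List[str]) -> List[str]:
    """Steps 2-5: download, set up, tokenize, and predict (needs network access)."""
    # Heavy imports stay local so pick_labels can be used without them.
    import torch
    from torch import nn
    from huggingface_hub import PyTorchModelHubMixin
    from transformers import AutoConfig, AutoModel, AutoTokenizer

    class QualityModel(nn.Module, PyTorchModelHubMixin):
        """Assumed wrapper: DeBERTa encoder plus a linear classification head."""

        def __init__(self, config):
            super().__init__()
            # Assumption: these config keys exist in the published config.
            self.model = AutoModel.from_pretrained(config["base_model"])
            self.dropout = nn.Dropout(config["fc_dropout"])
            self.fc = nn.Linear(self.model.config.hidden_size, len(config["id2label"]))

        def forward(self, input_ids, attention_mask):
            features = self.model(
                input_ids=input_ids, attention_mask=attention_mask
            ).last_hidden_state
            logits = self.fc(self.dropout(features))
            # Classify from the first token's representation.
            return torch.softmax(logits[:, 0, :], dim=1)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    config = AutoConfig.from_pretrained(MODEL_ID)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = QualityModel.from_pretrained(MODEL_ID).to(device)
    model.eval()

    # Step 4: tokenize, padding the batch and truncating to the context length.
    inputs = tokenizer(
        texts,
        return_tensors="pt",
        padding="longest",
        truncation=True,
        max_length=MAX_LENGTH,
    ).to(device)

    # Step 5: run inference without tracking gradients.
    with torch.no_grad():
        probs = model(inputs["input_ids"], inputs["attention_mask"])
    return pick_labels(probs.cpu().tolist(), config.id2label)


# Example call (downloads the model on first use):
# labels = classify(["A clear, well-sourced explanation of a topic.", "buy now click here"])
```

The sketch selects a GPU when `torch.cuda.is_available()` returns true and falls back to CPU otherwise, so the same code runs on a laptop or a cloud instance. Keeping the label lookup in `pick_labels` makes the mapping from probabilities to classes easy to check in isolation.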

License

This model is licensed under the Apache 2.0 License. By using the model, you agree to comply with the terms and conditions outlined in the license.
