Content Type Classifier DeBERTa Model

Introduction

The Content Type Classifier is a text classification model developed by NVIDIA that assigns each document to one of 11 distinct speech types based on its content. It is built on the DeBERTa V3 Base architecture.

Architecture

Model Architecture: DeBERTa V3 Base
Context Length: 1024 tokens
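
Inputs longer than the 1024-token context length have to be truncated or split before classification. The following is a minimal sketch of the splitting option, not part of the model's own code: `split_into_windows` and its `stride` parameter are hypothetical names introduced here, and the helper ignores the special tokens a real tokenizer adds, which also count against the limit.

```python
def split_into_windows(token_ids, max_length=1024, stride=0):
    """Split token IDs into windows of at most max_length tokens.

    stride is the number of tokens shared by consecutive windows
    (0 means no overlap). Hypothetical helper for illustration only.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")

    step = max_length - stride
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_length])
        # Stop once the current window reaches the end of the input.
        if start + max_length >= len(token_ids):
            break
    return windows


# Stand-in token IDs; a real tokenizer would produce these.
token_ids = list(range(2500))
windows = split_into_windows(token_ids)
print([len(w) for w in windows])
```

Each window can then be classified separately; how to combine the per-window predictions (for example by averaging probabilities) is a design choice the source does not specify.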

Training

The model was trained on datasets such as Jigsaw Toxic Comments, the Jigsaw Unintended Biases Dataset, the Toxigen Dataset, Common Crawl, and Wikipedia. Of the 25,000 samples that were labeled, the 19,604 on which at least two annotators agreed were used for training. The labels include categories such as Product Websites, Blogs, and News. Performance was evaluated with PR-AUC; on the agreed samples, the model reports an average AUC of 0.6192 and an accuracy of 0.6805.
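
The agreement filter described above can be sketched as a simple majority vote. This is an illustration under stated assumptions, not NVIDIA's labeling pipeline: the function names, the sample texts, and the choice of three annotators per sample are all hypothetical.

```python
from collections import Counter


def majority_label(annotations, min_votes=2):
    """Return the label chosen by at least min_votes annotators, else None."""
    if not annotations:
        return None
    label, votes = Counter(annotations).most_common(1)[0]
    return label if votes >= min_votes else None


def filter_agreed(samples, min_votes=2):
    """Keep only samples whose annotators reached the required agreement."""
    kept = []
    for text, annotations in samples:
        label = majority_label(annotations, min_votes)
        if label is not None:
            kept.append((text, label))
    return kept


# Hypothetical samples, each with three assumed annotator labels.
samples = [
    ("Buy our new blender today.", ["Product Websites", "Product Websites", "Blogs"]),
    ("Yesterday I tried a new recipe.", ["Blogs", "News", "Product Websites"]),
    ("The city council voted on Tuesday.", ["News", "News", "News"]),
]
agreed = filter_agreed(samples)
print(agreed)

# Share of labeled samples retained, using the counts reported above.
retention = 19_604 / 25_000
print(f"{retention:.1%}")
```

With the reported counts, roughly 78% of the labeled samples pass the filter; the second sample above is dropped because no label receives two votes.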

Guide: Running Locally

Environment Setup:
- Use Python 3.10 with an NVIDIA GPU of the Volta™ generation or newer (compute capability 7.0+) and CUDA 12 or above.
- Recommended operating systems: Ubuntu 22.04 or 20.04.

Installation:
- Install PyTorch and necessary libraries such as transformers.
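
The requirements above can be checked from Python before loading the model. The sketch below is one possible check, not an official script: `capability_ok` and `check_environment` are hypothetical helper names, and the code falls back gracefully when PyTorch is not installed.

```python
import sys

EXPECTED_PYTHON = (3, 10)
MIN_COMPUTE_CAPABILITY = (7, 0)  # Volta generation


def capability_ok(capability, minimum=MIN_COMPUTE_CAPABILITY):
    """True if a (major, minor) compute capability meets the minimum."""
    return tuple(capability) >= tuple(minimum)


def check_environment():
    """Collect a small report on the Python, PyTorch, and GPU setup."""
    report = {
        "python": "%d.%d" % sys.version_info[:2],
        "python_is_3_10": tuple(sys.version_info[:2]) == EXPECTED_PYTHON,
        "torch_installed": False,
        "cuda_version": None,
        "cuda_available": False,
        "gpu_capability_ok": False,
    }
    try:
        import torch
    except ImportError:
        return report

    report["torch_installed"] = True
    # CUDA version PyTorch was built against; None for CPU-only builds.
    report["cuda_version"] = torch.version.cuda
    if torch.cuda.is_available():
        report["cuda_available"] = True
        report["gpu_capability_ok"] = capability_ok(
            torch.cuda.get_device_capability(0)
        )
    return report


print(check_environment())
```

A `False` entry does not necessarily mean the model cannot run; it only flags a difference from the setup listed above.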

Model Usage:

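import torch
from torch import nn
from huggingface_hub import PyTorchModelHubMixin
from transformers import AutoModel, AutoTokenizer, AutoConfig

# PyTorchModelHubMixin supplies the from_pretrained() used below;
# a plain nn.Module does not have that method.
class CustomModel(nn.Module, PyTorchModelHubMixin):
    def __init__(self, config):
        super(CustomModel, self).__init__()
        self.model = AutoModel.from_pretrained(config["base_model"])
        self.dropout = nn.Dropout(config["fc_dropout"])
        self.fc = nn.Linear(self.model.config.hidden_size, len(config["id2label"]))

    def forward(self, input_ids, attention_mask):
        # Per-token hidden states from the DeBERTa backbone
        features = self.model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        dropped = self.dropout(features)
        outputs = self.fc(dropped)
        # Classify from the first token's position; softmax yields class probabilities
        return torch.softmax(outputs[:, 0, :], dim=1)

# Load the configuration, tokenizer, and model weights
config = AutoConfig.from_pretrained("nvidia/content-type-classifier-deberta")
tokenizer = AutoTokenizer.from_pretrained("nvidia/content-type-classifier-deberta")
model = CustomModel.from_pretrained("nvidia/content-type-classifier-deberta")
model.eval()

# Tokenize the inputs and run inference without tracking gradients
text_samples = ["Hi, great video! I am now a subscriber."]
inputs = tokenizer(text_samples, return_tensors="pt", padding="longest", truncation=True)
with torch.no_grad():
    outputs = model(inputs["input_ids"], inputs["attention_mask"])

# Map the most probable class index of each sample to its label name
predicted_classes = torch.argmax(outputs, dim=1)
predicted_labels = [config.id2label[class_idx.item()] for class_idx in predicted_classes]
print(predicted_labels)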
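
Because the model returns one probability per class, it can be useful to look at more than the single best class, especially for borderline documents. The sketch below ranks classes by probability and drops low-confidence ones. It runs on stand-in data: `top_k_labels` is a hypothetical helper, the three-entry `id2label` mapping is an assumed subset with made-up indices, and the probabilities are invented for illustration.

```python
def top_k_labels(probabilities, id2label, k=3, min_confidence=0.0):
    """Return up to k (label, probability) pairs, most probable first."""
    ranked = sorted(enumerate(probabilities), key=lambda pair: pair[1], reverse=True)
    return [
        (id2label[index], probability)
        for index, probability in ranked[:k]
        if probability >= min_confidence
    ]


# Hypothetical label mapping; the real one has 11 entries.
id2label = {0: "Blogs", 1: "News", 2: "Product Websites"}

# Invented class probabilities for two documents.
batch = [
    [0.10, 0.75, 0.15],
    [0.40, 0.35, 0.25],
]
for row in batch:
    print(top_k_labels(row, id2label, k=2, min_confidence=0.3))
```

With the real model, the rows would presumably come from `outputs.tolist()` and the mapping from `config.id2label`; the 0.3 threshold is arbitrary and would need tuning on labeled data.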

Cloud Suggestions:
- Utilize cloud GPU services like AWS EC2 with NVIDIA GPUs for better performance.

License

This model is licensed under the NVIDIA Open Model License Agreement; see the agreement for details.
