content type classifier deberta
nvidia
Introduction
The Content Type Classifier is a text classification model developed by NVIDIA that assigns each document to one of 11 distinct speech types. It is built on the DeBERTa V3 Base architecture.
Architecture
- Model Architecture: DeBERTa V3 Base
- Context Length: 1024 tokens
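Inputs longer than the 1024-token context have to be shortened somehow. The usage code in this card relies on truncation; the sketch below shows one alternative, splitting a token sequence into consecutive windows that each fit the context. The function name `split_into_windows` and the `reserved=2` allowance for special tokens are assumptions made for illustration, not part of the model card.

```python
def split_into_windows(token_ids, max_length=1024, reserved=2):
    """Split token ids into consecutive windows that fit max_length
    once `reserved` special tokens are added (assumed: 2)."""
    body = max_length - reserved
    if body <= 0:
        raise ValueError("max_length must exceed reserved")
    return [token_ids[i:i + body] for i in range(0, len(token_ids), body)]


token_ids = list(range(2500))  # stand-in for real token ids
windows = split_into_windows(token_ids)
print([len(w) for w in windows])  # [1022, 1022, 456]
```

Each window would then be classified separately; how to combine the per-window predictions is left open here.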
Training
The model was trained on datasets such as Jigsaw Toxic Comments, Jigsaw Unintended Biases Dataset, Toxigen Dataset, Common Crawl, and Wikipedia. Annotators labeled 25,000 samples; the 19,604 samples whose label was agreed upon by at least two annotators were used. The label distribution includes categories like Product Websites, Blogs, News, and more. Performance was evaluated with metrics such as PR-AUC; on the agreed samples the model reports an average AUC of 0.6192 and an accuracy of 0.6805.
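The agreement filter described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the helper `agreed_label`, the sample ids, and the choice of three annotators per sample are all assumptions, since the source states only the "at least two annotators" rule.

```python
from collections import Counter


def agreed_label(annotations, min_votes=2):
    """Return the most common label if at least `min_votes` annotators chose it, else None."""
    if not annotations:
        return None
    label, votes = Counter(annotations).most_common(1)[0]
    return label if votes >= min_votes else None


# Hypothetical annotations: three annotators per sample (count not stated in the source).
samples = {
    "doc-1": ["News", "News", "Blogs"],
    "doc-2": ["Blogs", "News", "Product Websites"],
    "doc-3": ["Blogs", "Blogs", "Blogs"],
}

# Keep only samples where some label reached the agreement threshold.
kept = {}
for doc_id, labels in samples.items():
    label = agreed_label(labels)
    if label is not None:
        kept[doc_id] = label
print(kept)  # {'doc-1': 'News', 'doc-3': 'Blogs'}

# Share of labeled samples retained, using the figures reported above.
retention = 19_604 / 25_000
print(f"{retention:.1%}")  # 78.4%
```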
Guide: Running Locally
- Environment Setup:
  - Use Python 3.10 and ensure compatibility with NVIDIA GPUs, Volta™ or higher (compute capability 7.0+), and CUDA 12 or above.
  - Recommended operating systems are Ubuntu 22.04/20.04.
- Installation:
  - Install PyTorch and necessary libraries such as transformers.
- Model Usage:
    import torch
    from torch import nn
    from transformers import AutoModel, AutoTokenizer, AutoConfig
    # Assumed fix: plain nn.Module has no from_pretrained; this mixin
    # (from huggingface_hub, a transformers dependency) provides it.
    from huggingface_hub import PyTorchModelHubMixin


    class CustomModel(nn.Module, PyTorchModelHubMixin):
        def __init__(self, config):
            super(CustomModel, self).__init__()
            self.model = AutoModel.from_pretrained(config["base_model"])
            self.dropout = nn.Dropout(config["fc_dropout"])
            self.fc = nn.Linear(self.model.config.hidden_size, len(config["id2label"]))

        def forward(self, input_ids, attention_mask):
            features = self.model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
            dropped = self.dropout(features)
            outputs = self.fc(dropped)
            # Use the first token's logits and return class probabilities
            return torch.softmax(outputs[:, 0, :], dim=1)


    # Load the config, tokenizer, and model weights
    config = AutoConfig.from_pretrained("nvidia/content-type-classifier-deberta")
    tokenizer = AutoTokenizer.from_pretrained("nvidia/content-type-classifier-deberta")
    model = CustomModel.from_pretrained("nvidia/content-type-classifier-deberta")
    model.eval()

    # Tokenize a batch of texts and run inference without tracking gradients
    text_samples = ["Hi, great video! I am now a subscriber."]
    inputs = tokenizer(text_samples, return_tensors="pt", padding="longest", truncation=True)
    with torch.no_grad():
        outputs = model(inputs["input_ids"], inputs["attention_mask"])
- Cloud Suggestions:
  - Utilize cloud GPU services like AWS EC2 with NVIDIA GPUs for better performance.
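The model usage code above stops at a tensor of class probabilities, one row per input text. A minimal sketch of turning such rows into readable labels is shown below. The helper `top_labels`, the three-class mapping, and the scores are hypothetical and for illustration only; the real model has 11 classes.

```python
def top_labels(probabilities, id2label):
    """Map each row of class probabilities to the (label, probability)
    pair of its highest-scoring class."""
    results = []
    for row in probabilities:
        best = max(range(len(row)), key=row.__getitem__)
        results.append((id2label[best], row[best]))
    return results


# Hypothetical 3-class mapping and scores, for illustration only
id2label = {0: "Blogs", 1: "News", 2: "Product Websites"}
probabilities = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
print(top_labels(probabilities, id2label))  # [('News', 0.7), ('Blogs', 0.5)]
```

With the real model, something like `top_labels(outputs.tolist(), config.id2label)` should work, assuming the loaded config exposes `id2label` with integer keys.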
License
This model is licensed under the NVIDIA Open Model License Agreement.