Domain Classifier
nvidia
Introduction
The Domain Classifier is a text classification model that assigns each document to one of 26 predefined domain classes, such as 'Sports', 'Finance', and 'Food_and_Drink'. It takes raw text as input and returns a probability for each class.
Architecture
- The model is based on the DeBERTa V3 Base architecture.
- It processes inputs with a context length of up to 512 tokens.
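Text longer than the 512-token context cannot be processed in one pass. One possible workaround, sketched below, is to split a document's token IDs into windows that fit the limit and classify each window separately. The helper name `chunk_token_ids` and the assumption that two positions are reserved for special tokens are illustrative and do not come from the model card.

```python
def chunk_token_ids(token_ids, max_len=512, num_special=2):
    """Split token IDs into consecutive windows that fit the context length.

    num_special is an assumed number of positions reserved for special
    tokens that a tokenizer adds around each window.
    """
    window = max_len - num_special
    if window <= 0:
        raise ValueError("max_len must be larger than num_special")
    # Step through the sequence in non-overlapping windows
    return [token_ids[i:i + window] for i in range(0, len(token_ids), window)]


# A made-up document of 1200 token IDs
chunks = chunk_token_ids(list(range(1200)))
print([len(chunk) for chunk in chunks])
```

How the per-window predictions are combined (for example, by averaging the class probabilities) is a separate design choice that the model card does not address.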
Training
Training Data
- The model was trained on 1 million samples from Common Crawl, labeled using Google Cloud’s Natural Language API.
- It was also trained on 500,000 curated Wikipedia articles obtained through the Wikipedia-API.
Training Steps
- The model was trained in multiple rounds on data labeled with a combination of pseudo-labels and Google Cloud's Natural Language API.
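The model card does not say how the pseudo-labels were produced. A common generic approach, shown here only as an illustration and not as the authors' method, is to keep a model's own predictions on unlabeled text when their confidence exceeds a threshold. The function name, the threshold value, and the sample data are all hypothetical.

```python
def select_pseudo_labels(predictions, threshold=0.9):
    """Keep (text, label) pairs whose predicted confidence meets the threshold.

    predictions: list of (text, label, confidence) tuples.
    The threshold of 0.9 is an arbitrary illustrative value.
    """
    return [(text, label) for text, label, conf in predictions if conf >= threshold]


# Made-up predictions on unlabeled text
predictions = [
    ("The team won the final", "Sports", 0.97),
    ("Markets closed higher today", "Finance", 0.62),
]
print(select_pseudo_labels(predictions))
```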
Guide: Running Locally
To run the Domain Classifier locally, follow the steps below. For better performance, consider using cloud GPUs.
Environment Setup
- Ensure you have Python and PyTorch installed.
- Install the Transformers library from Hugging Face.
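Before continuing, it can help to confirm that the packages the later steps import are available. The check below is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical, and the package list is inferred from the imports used in this guide.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages imported by the steps in this guide
required = ["torch", "transformers", "huggingface_hub"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```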
Model Download
- Use the Hugging Face Transformers library to download the model and tokenizer:
    from transformers import AutoTokenizer, AutoConfig

    # Fetch the model configuration (including the id2label mapping) and the tokenizer
    config = AutoConfig.from_pretrained("nvidia/domain-classifier")
    tokenizer = AutoTokenizer.from_pretrained("nvidia/domain-classifier")
Model Initialization
- Initialize the model using the provided code:
    import torch
    from torch import nn
    from transformers import AutoModel
    from huggingface_hub import PyTorchModelHubMixin

    class CustomModel(nn.Module, PyTorchModelHubMixin):
        def __init__(self, config):
            super(CustomModel, self).__init__()
            # Pretrained encoder named in the checkpoint's config
            self.model = AutoModel.from_pretrained(config["base_model"])
            self.dropout = nn.Dropout(config["fc_dropout"])
            # Linear head with one output per domain label
            self.fc = nn.Linear(self.model.config.hidden_size, len(config["id2label"]))

        def forward(self, input_ids, attention_mask):
            features = self.model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
            dropped = self.dropout(features)
            outputs = self.fc(dropped)
            # Use the first token's logits and convert them to class probabilities
            return torch.softmax(outputs[:, 0, :], dim=1)

    model = CustomModel.from_pretrained("nvidia/domain-classifier")
    model.eval()  # disable dropout for inference
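The last line of `forward` applies a softmax to the logits at the first token position, turning them into probabilities that sum to 1. The plain-Python sketch below shows that arithmetic on made-up logits for three classes. It illustrates the formula only and is not the model's implementation.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


cls_logits = [2.0, 0.5, -1.0]  # made-up logits for three classes
probs = softmax(cls_logits)
print(probs)
```

The largest logit always maps to the largest probability, so taking the argmax of the probabilities selects the same class as taking the argmax of the logits.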
Input Preparation
- Prepare text samples and tokenize them:
    text_samples = ["Sports is a popular domain", "Politics is a popular domain"]
    # Pad to the longest sample in the batch and truncate anything over the model's limit
    inputs = tokenizer(text_samples, return_tensors="pt", padding="longest", truncation=True)
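With `padding="longest"`, shorter samples in a batch are extended to the length of the longest one, and the attention mask marks which positions hold real tokens. The sketch below reproduces that idea on plain lists. The helper `pad_to_longest`, the token IDs, and the `pad_id=0` value are placeholders for illustration and are not taken from the tokenizer.

```python
def pad_to_longest(batch, pad_id=0):
    """Pad each sequence to the longest length and build matching attention masks."""
    longest = max((len(seq) for seq in batch), default=0)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return input_ids, attention_mask


# Made-up token IDs for two samples of different lengths
ids, mask = pad_to_longest([[5, 6, 7], [8]])
print(ids)   # [[5, 6, 7], [8, 0, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```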
Inference
- Run inference and get predictions:
    # Gradients are not needed for inference
    with torch.no_grad():
        outputs = model(inputs["input_ids"], inputs["attention_mask"])
    predicted_classes = torch.argmax(outputs, dim=1)
    predicted_domains = [config.id2label[class_idx.item()] for class_idx in predicted_classes.cpu().numpy()]
    print(predicted_domains)
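Because the model returns a probability for every class, it is also possible to report several candidate domains rather than only the top one. The sketch below ranks one row of probabilities and maps the indices to names. The function `top_k_domains` is hypothetical, and the three-entry `id2label` is an illustrative subset whose index order is assumed, not the model's real mapping.

```python
def top_k_domains(probs, id2label, k=2):
    """Return the k most probable (label, probability) pairs, highest first."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Illustrative subset of labels; the real mapping comes from config.id2label
id2label = {0: "Sports", 1: "Finance", 2: "Food_and_Drink"}
probs = [0.1, 0.7, 0.2]  # made-up probabilities for one document
print(top_k_domains(probs, id2label))  # [('Finance', 0.7), ('Food_and_Drink', 0.2)]
```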
Cloud GPUs
For enhanced performance, consider using cloud-based GPU resources such as AWS EC2 with NVIDIA GPUs or Google Cloud Platform's AI Platform.
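Whether the code runs on a cloud instance or a local machine, it needs to pick the GPU when one is present. The sketch below selects a device name and falls back to the CPU when PyTorch or CUDA is unavailable. The helper name `pick_device` is hypothetical.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to place on a GPU
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

With the objects from the guide above, the model and the tokenized inputs would then typically be moved with `model.to(device)` and `inputs.to(device)` before running inference.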
License
The Domain Classifier is licensed under Apache 2.0. By using the model, you agree to adhere to the terms and conditions specified in the Apache License 2.0.