domain classifier

nvidia

Introduction

The Domain Classifier is a text classification model that assigns each document to one of 26 predefined domain classes, such as 'Sports', 'Finance', and 'Food_and_Drink'. It pairs a pretrained transformer encoder with a linear classification head and returns a probability for each class.

Architecture

  • The model is based on the DeBERTa V3 Base architecture.
  • It processes inputs with a context length of up to 512 tokens.
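To make the 512-token limit concrete, here is a minimal sketch of what truncation does to an over-long sequence of token IDs. The helper name `truncate_ids` and the placeholder IDs are hypothetical and chosen only for illustration; in practice the Hugging Face tokenizer can do this for you when called with `truncation=True`.

```python
MAX_TOKENS = 512  # context length stated above

def truncate_ids(token_ids, max_len=MAX_TOKENS):
    """Cut a token-ID sequence to max_len, keeping the final (end) token."""
    if len(token_ids) <= max_len:
        return list(token_ids)
    # Keep the first max_len - 1 IDs, then re-attach the final token
    return list(token_ids[:max_len - 1]) + [token_ids[-1]]

# Hypothetical IDs: 1 = start token, 2 = end token, 100 = an ordinary token
long_doc = [1] + [100] * 1000 + [2]
short_doc = [1, 100, 100, 2]

print(len(truncate_ids(long_doc)))
print(truncate_ids(short_doc))
```

Anything past the limit never reaches the model, so a long document is effectively classified by its opening portion.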

Training

Training Data

  • The model was trained on 1 million samples from Common Crawl, labeled using Google Cloud’s Natural Language API.
  • Additionally, it utilized 500,000 curated Wikipedia articles obtained through the Wikipedia-API.

Training Steps

  • Training ran in multiple rounds, using data labeled with a combination of pseudo-labels and the Google Cloud Natural Language API.

Guide: Running Locally

To run the Domain Classifier locally, follow the steps below. A GPU is recommended for best performance; cloud GPUs are one option.

  1. Environment Setup

    • Ensure you have Python and PyTorch installed.
    • Install the Transformers library from Hugging Face.
  2. Model Download

    • Use the Hugging Face Transformers library to load the model configuration and tokenizer:
      from transformers import AutoTokenizer, AutoConfig
      config = AutoConfig.from_pretrained("nvidia/domain-classifier")
      tokenizer = AutoTokenizer.from_pretrained("nvidia/domain-classifier")
      
  3. Model Initialization

    • Initialize the model using the provided code:
      import torch
      from torch import nn
      from transformers import AutoModel
      from huggingface_hub import PyTorchModelHubMixin
      
      class CustomModel(nn.Module, PyTorchModelHubMixin):
          def __init__(self, config):
              super().__init__()
              # `config` is accessed as a dict here (base model name, dropout rate, label map)
              self.model = AutoModel.from_pretrained(config["base_model"])
              self.dropout = nn.Dropout(config["fc_dropout"])
              # Linear head: one output per domain label
              self.fc = nn.Linear(self.model.config.hidden_size, len(config["id2label"]))
      
          def forward(self, input_ids, attention_mask):
              # Encoder output has shape (batch, seq_len, hidden_size)
              features = self.model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
              dropped = self.dropout(features)
              outputs = self.fc(dropped)
              # Keep the first token's logits and convert them to class probabilities
              return torch.softmax(outputs[:, 0, :], dim=1)
      
      model = CustomModel.from_pretrained("nvidia/domain-classifier")
      model.eval()
      
  4. Input Preparation

    • Prepare text samples and tokenize them:
      text_samples = ["Sports is a popular domain", "Politics is a popular domain"]
      inputs = tokenizer(text_samples, return_tensors="pt", padding="longest", truncation=True)
      
  5. Inference

    • Run inference and get predictions:
      with torch.no_grad():  # gradients are not needed at inference time
          outputs = model(inputs["input_ids"], inputs["attention_mask"])
      predicted_classes = torch.argmax(outputs, dim=1)
      predicted_domains = [config.id2label[class_idx] for class_idx in predicted_classes.tolist()]
      print(predicted_domains)
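
The argmax in step 5 keeps only the single most likely domain. As a hedged sketch, the per-class probabilities can also be ranked to show the top few candidates for each text. The helper `top_k_domains` and the three-label map below are hypothetical stand-ins so that the example runs on its own; with the real model you would pass `outputs.tolist()` and `config.id2label` instead.

```python
def top_k_domains(prob_rows, id2label, k=2):
    """Return the k most probable (label, probability) pairs for each row."""
    results = []
    for row in prob_rows:
        # Class indices sorted from most to least probable
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        results.append([(id2label[i], row[i]) for i in ranked[:k]])
    return results

# Hypothetical label map and probabilities, for illustration only
toy_id2label = {0: "Sports", 1: "Finance", 2: "Food_and_Drink"}
toy_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]

for ranked in top_k_domains(toy_probs, toy_id2label):
    print(ranked)
```

Looking at the runner-up class can help flag documents that sit between two domains.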
      

Cloud GPUs

For enhanced performance, consider using cloud-based GPU resources such as AWS EC2 with NVIDIA GPUs or Google Cloud Platform's AI Platform.
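
On any machine with an NVIDIA GPU, cloud or local, the usual PyTorch pattern is to place the model and its inputs on the same device. The sketch below uses a small stand-in `nn.Linear` layer (with hypothetical sizes) so that it runs without downloading anything; it is an illustration of the pattern, not the classifier itself.

```python
import torch
from torch import nn

# Use the GPU when one is visible to PyTorch, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for the classifier (hypothetical sizes: 8 features, 26 classes)
toy_model = nn.Linear(8, 26).to(device)
toy_model.eval()

# Inputs must live on the same device as the model
toy_inputs = torch.randn(2, 8).to(device)

with torch.no_grad():
    probs = torch.softmax(toy_model(toy_inputs), dim=1)

print(probs.shape, probs.device.type)
```

With the real classifier, the same idea applies: call `model.to(device)` once, then move each tokenized tensor with something like `{k: v.to(device) for k, v in inputs.items()}` before running inference.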

License

The Domain Classifier is licensed under Apache 2.0. By using the model, you agree to adhere to the terms and conditions specified in the Apache License 2.0.
