Multilingual Domain Classifier

NVIDIA

Introduction

The multilingual domain classifier is a text classification model developed by NVIDIA. It classifies documents into one of 26 domain classes, such as 'Sports' or 'News'. It supports text inputs in 52 languages, including English, Arabic, French, and Japanese.

Architecture

The model is built on the DeBERTa V3 Base architecture with a context length of 512 tokens. It is implemented in PyTorch and published through the Hugging Face PyTorchModelHubMixin, which handles checkpoint saving and loading.
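Because the checkpoint is published through PyTorchModelHubMixin, inference code typically loads it via a small wrapper that places a linear classification head on top of the DeBERTa encoder. Below is a minimal sketch of such a wrapper; the config field names (base_model, fc_dropout, id2label) follow the mixin's usual pattern and are assumptions rather than a verbatim copy of the released code.

    import torch
    from torch import nn
    from transformers import AutoModel
    from huggingface_hub import PyTorchModelHubMixin

    class CustomModel(nn.Module, PyTorchModelHubMixin):
        def __init__(self, config):
            super().__init__()
            self.model = AutoModel.from_pretrained(config["base_model"])  # DeBERTa V3 Base encoder
            self.dropout = nn.Dropout(config["fc_dropout"])
            self.fc = nn.Linear(self.model.config.hidden_size, len(config["id2label"]))

        def forward(self, input_ids, attention_mask):
            features = self.model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
            # Classify from the first ([CLS]) position; softmax over the 26 domain classes.
            return torch.softmax(self.fc(self.dropout(features))[:, 0, :], dim=1)

With a wrapper like this, CustomModel.from_pretrained restores both the encoder weights and the classification head in one call.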

Training

The model was trained on 1 million Common Crawl samples, labeled using Google Cloud's Natural Language API, plus 500k Wikipedia articles. Each English sample was translated into the 51 other supported languages, yielding 52 language versions per sample. Validation evaluated each language separately, with PR-AUC as the primary metric; evaluation was run on NVIDIA V100 hardware.
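As an illustration of the per-language evaluation protocol, the sketch below computes a macro-averaged PR-AUC with scikit-learn. The data layout (one-hot labels and softmax scores collected per language) is an assumption for illustration, not the released evaluation code.

    # Illustrative per-language PR-AUC, assuming one-hot labels y_true and
    # softmax scores y_score of shape [n_samples, 26] for each language.
    from sklearn.metrics import average_precision_score

    def pr_auc_by_language(results):
        # results: dict mapping a language code to (y_true, y_score) arrays.
        return {
            lang: average_precision_score(y_true, y_score, average="macro")
            for lang, (y_true, y_score) in results.items()
        }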

Guide: Running Locally

  1. Install Dependencies: Ensure Python 3.10 and PyTorch are installed, along with the transformers and huggingface_hub libraries used below.
  2. Setup Environment: Use CUDA 12 or above on a compatible NVIDIA GPU (Volta™ or higher).
  3. Model Setup: Load the config, tokenizer, and model weights. Because the checkpoint is published via PyTorchModelHubMixin, the weights are restored through a wrapper class such as the CustomModel sketched in the Architecture section, which also recovers the classification head.
    import torch
    from transformers import AutoTokenizer, AutoConfig

    # The Hub config carries the id2label mapping used to decode predictions.
    config = AutoConfig.from_pretrained("nvidia/multilingual-domain-classifier")
    tokenizer = AutoTokenizer.from_pretrained("nvidia/multilingual-domain-classifier")
    # CustomModel is the PyTorchModelHubMixin wrapper from the Architecture section.
    model = CustomModel.from_pretrained("nvidia/multilingual-domain-classifier")
    model.eval()
    
  4. Prepare Input: Tokenize text samples for classification.
  5. Inference: Run the model on the tokenized input. The output is a batch of softmax probabilities over the 26 domains; the end-to-end sketch after this list shows how to map them back to domain names.
    inputs = tokenizer(["Example text"], return_tensors="pt", padding="longest", truncation=True)
    with torch.no_grad():
        outputs = model(inputs["input_ids"], inputs["attention_mask"])
    
  6. Cloud GPUs: For enhanced performance, consider using cloud GPU providers like AWS or Google Cloud.
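
Putting the steps together, the sketch below decodes predictions into domain names via the id2label mapping from the Hub config. The CustomModel wrapper from the Architecture section is assumed, and the sample texts are illustrative.

    # End-to-end example: tokenize, classify, and decode predicted domains.
    text_samples = ["Sports is a popular domain", "Mathematics is a popular domain"]
    inputs = tokenizer(text_samples, return_tensors="pt", padding="longest", truncation=True)
    with torch.no_grad():
        probs = model(inputs["input_ids"], inputs["attention_mask"])  # shape: [batch, 26]

    predicted = torch.argmax(probs, dim=1)
    print([config.id2label[idx.item()] for idx in predicted])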

License

The model is released under the NVIDIA Open Model License Agreement.
