Multilingual Domain Classifier
Introduction
The multilingual domain classifier is a text classification model developed by NVIDIA. It classifies documents into one of 26 domain classes, such as 'Sports' or 'News'. It supports text inputs in 52 languages, including English, Arabic, French, and Japanese.
Architecture
The model is built on the DeBERTa V3 Base architecture, with a context length of 512 tokens. It uses PyTorch and integrates with the PyTorchModelHubMixin for model management.
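As a rough illustration of this setup, here is a minimal sketch of a DeBERTa-based classifier head combined with PyTorchModelHubMixin; the class name, pooling choice, and backbone checkpoint (microsoft/deberta-v3-base) are assumptions for illustration, not NVIDIA's actual implementation:

```python
import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin
from transformers import AutoModel

class DomainClassifier(nn.Module, PyTorchModelHubMixin):
    """Hypothetical sketch: DeBERTa V3 backbone + linear head for 26 domains."""

    def __init__(self, base_model="microsoft/deberta-v3-base", num_classes=26):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_model)
        self.classifier = nn.Linear(self.backbone.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        pooled = hidden.mean(dim=1)  # mean-pool over the (up to 512) tokens
        return self.classifier(pooled)  # per-class scores, shape (batch, 26)
```

The mixin supplies save_pretrained, from_pretrained, and push_to_hub methods, which is what lets published weights be loaded directly from the Hub.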
Training
The model was trained on 1 million Common Crawl samples labeled with the Google Cloud Natural Language API, plus 500k Wikipedia articles. During training, each English sample was translated into the 51 other supported languages, yielding 52 language versions per sample. Validation evaluated each language separately, with PR-AUC as the primary metric; evaluation was run on NVIDIA V100 hardware.
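For reference, PR-AUC is the area under the precision-recall curve (average precision). A minimal sketch of a per-language evaluation loop using scikit-learn; the data layout and helper function are hypothetical, not NVIDIA's evaluation code:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def per_language_pr_auc(eval_sets):
    """eval_sets: dict mapping language code -> (y_true, y_score).

    y_true is a binary indicator matrix of shape (n_samples, n_classes);
    y_score holds the model's per-class scores of the same shape.
    Returns macro-averaged PR-AUC per language.
    """
    return {
        lang: average_precision_score(y_true, y_score, average="macro")
        for lang, (y_true, y_score) in eval_sets.items()
    }

# Hypothetical usage with two languages and 26 domain classes:
rng = np.random.default_rng(0)
sets = {lang: (rng.integers(0, 2, (100, 26)), rng.random((100, 26)))
        for lang in ["en", "fr"]}
print(per_language_pr_auc(sets))
```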
Guide: Running Locally
- Install Dependencies: Ensure Python 3.10, PyTorch, and the Hugging Face transformers library are installed.
- Setup Environment: Use CUDA 12 or above on a compatible NVIDIA GPU (Volta™ or higher).
- Model Setup:
```python
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer

# Load the config, tokenizer, and model weights from the Hugging Face Hub
config = AutoConfig.from_pretrained("nvidia/multilingual-domain-classifier")
tokenizer = AutoTokenizer.from_pretrained("nvidia/multilingual-domain-classifier")
model = AutoModel.from_pretrained("nvidia/multilingual-domain-classifier")
model.eval()  # switch to inference mode
```
- Prepare Input: Tokenize text samples for classification.
- Inference: Run the model on input text and interpret the output classes.
```python
# Tokenize the input text; truncation respects the 512-token context length
inputs = tokenizer(["Example text"], return_tensors="pt",
                   padding="longest", truncation=True)
# Run inference without gradient tracking
with torch.no_grad():
    outputs = model(inputs["input_ids"], inputs["attention_mask"])
```
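To turn the raw outputs into domain names, a short sketch; it assumes the forward pass yields per-class scores of shape (batch_size, 26) and that config.id2label maps class indices to the domain names:

```python
# Pick the highest-scoring class per input and look up its domain name
# (assumes `outputs` holds per-class scores and config.id2label exists)
predicted = torch.argmax(outputs, dim=1)
print([config.id2label[idx.item()] for idx in predicted])
```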
- Cloud GPUs: For enhanced performance, consider using cloud GPU providers like AWS or Google Cloud.
License
The model is released under the NVIDIA Open Model License Agreement.