distilrubert-tiny-cased-conversational-5k
DeepPavlov
Introduction
DistilRuBERT-tiny-cased-conversational-5k is a compact Russian conversational language model. It uses a reduced vocabulary of 5,000 tokens and a lightweight architecture of 3 layers, 264 hidden units, and 12 attention heads, for a total of 3.6 million parameters. Following earlier work such as DistilBERT, it was trained on data from OpenSubtitles, Dirty, Pikabu, and a segment of the Taiga corpus.
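To make the effect of the vocabulary size concrete, the short calculation below counts the weights in the token-embedding matrix (one row of 264 values per token) and compares them with the reported total. The 30,000-token vocabulary is a hypothetical comparison point chosen for illustration, not a figure from this description.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Number of weights in a token-embedding matrix (one row per token)."""
    return vocab_size * hidden_size


HIDDEN_SIZE = 264          # hidden units, from the model description
TOTAL_PARAMS = 3_600_000   # reported total (rounded)

small = embedding_params(5_000, HIDDEN_SIZE)
# Hypothetical comparison vocabulary; NOT a figure from the model description.
large = embedding_params(30_000, HIDDEN_SIZE)

print(f"5k vocab: {small:,} embedding weights")                     # 1,320,000
print(f"share of reported total: {small / TOTAL_PARAMS:.0%}")       # 37%
print(f"30k vocab (hypothetical): {large:,} embedding weights")     # 7,920,000
```

Under these assumptions the token embeddings alone account for roughly a third of the model, which is why shrinking the vocabulary has such a visible effect on a model this small.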
Architecture
The model is trained by distillation from a larger teacher model. Training combines a Masked Language Modeling (MLM) loss with a Kullback-Leibler (KL) divergence loss that aligns the student model's representations with the teacher model's hidden states. The result is a significantly smaller model than its predecessors, such as RuBERT and other distilled models. Reducing the vocabulary to 5K tokens shrinks the embedding table and makes processing more efficient, while keeping performance suitable for conversational tasks.
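The combination of the two losses can be sketched in plain Python for a single masked position. This is a minimal illustration, not the authors' training code: normalizing both vectors with softmax, the direction KL(teacher || student), equal vector lengths, and the weight `alpha` are all assumptions that the description above does not specify.

```python
import math


def softmax(scores):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def mlm_loss(student_logits, target_index):
    """Cross-entropy of the student's prediction for one masked token."""
    return -math.log(softmax(student_logits)[target_index])


def distillation_loss(student_scores, teacher_scores, target_index, alpha=0.5):
    """Weighted sum of MLM loss and KL(teacher || student).

    alpha is a hypothetical mixing weight, not a value from the source.
    """
    kl = kl_divergence(softmax(teacher_scores), softmax(student_scores))
    return alpha * mlm_loss(student_scores, target_index) + (1 - alpha) * kl


# Toy example: a three-way choice where the correct token has index 2.
print(distillation_loss([0.2, 0.1, 1.5], [0.0, 0.3, 2.0], target_index=2))
```

When the student's scores match the teacher's exactly, the KL term vanishes and only the MLM term remains, which is the behavior one would expect from an alignment penalty.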
Training
DistilRuBERT-tiny-cased-conversational-5k was trained for approximately 100 hours on seven NVIDIA Tesla P100-SXM2.0 16GB GPUs. To assess quality, the model was fine-tuned on several Russian datasets covering sentiment classification (RuSentiment), paraphrase detection (ParaPhraser), Named Entity Recognition (NER), and question answering. Inference performance was measured with PyTorchBenchmark on an NVIDIA GeForce GTX 1080 Ti GPU and an Intel Core i7-7700K CPU.
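For readers who want a rough latency figure on their own hardware, a simple timing harness along the following lines may be enough. It is a generic standard-library sketch, not PyTorchBenchmark and not the procedure used for the reported measurements; the function name `measure_latency` and the stand-in workload are hypothetical.

```python
import statistics
import time


def measure_latency(fn, warmup=3, runs=20):
    """Return the median wall-clock time of fn() in seconds."""
    # Warm-up calls are discarded so one-time setup costs do not skew the result.
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in workload; replace with a call that runs the model on a fixed batch.
latency = measure_latency(lambda: sum(i * i for i in range(10_000)))
print(f"median latency: {latency * 1e3:.3f} ms")
```

When timing a model on a GPU, the callable should wait for the device to finish (for example by calling `torch.cuda.synchronize()`) before returning, since CUDA kernels run asynchronously.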
Guide: Running Locally
- Install Dependencies: Ensure Python and PyTorch are installed, then install the Hugging Face Transformers library via pip:
    pip install transformers
- Load the Model: Use the Transformers library to load the model and its tokenizer:
    from transformers import AutoModel, AutoTokenizer
    model_name = "DeepPavlov/distilrubert-tiny-cased-conversational-5k"
    model = AutoModel.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
- Inference: Tokenize your input and run it through the model. AutoModel returns hidden states rather than task predictions, so obtaining predictions requires a fine-tuned task-specific head.
- Hardware Suggestions: The model is small enough to run on a CPU. For large datasets or batch sizes, a GPU, such as the cloud instances offered by AWS, Google Cloud, or Microsoft Azure, can shorten processing time.
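The inference step can be sketched as follows. The helper `embed` is a hypothetical name, and mean pooling over non-padding tokens is one common way to obtain a sentence vector, assumed here for illustration rather than prescribed by the model authors. The function downloads the model on first use and picks a GPU when one is available.

```python
def embed(texts, model_name="DeepPavlov/distilrubert-tiny-cased-conversational-5k"):
    """Return one mean-pooled embedding per input text (shape [n, hidden])."""
    # Imported inside the function so it can be defined before the
    # heavy dependencies are loaded.
    import torch
    from transformers import AutoModel, AutoTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).to(device).eval()

    # Pad to the longest text in the batch and truncate overly long inputs.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # [n, seq_len, hidden]

    # Mean pooling (an assumption): average token vectors, ignoring padding.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```

Calling `embed(["Привет, как дела?"])` should return a tensor with one row and, given the 264 hidden units described above, 264 columns. These vectors are general-purpose representations; classification or NER would still need a head fine-tuned on task data.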
License
The model and associated resources are published under arXiv.org's perpetual, non-exclusive license. For details on usage rights, refer to the paper cited in the model's publication.