sbert_punc_case_ru Model

Introduction

The SbertPuncCase model restores punctuation and letter case in Russian text. It inserts periods, commas, and question marks, and chooses a case form for each word: lowercase, capitalized (first letter uppercase), or all uppercase. Because it works on lowercase text strings, it is particularly useful for correcting the output of speech recognition systems.

Architecture

The SbertPuncCase model is based on the sbert_large_nlu_ru model. Input text is converted to lowercase and tokenized, and the model then predicts a class for each token, in the same way as a Named Entity Recognition (NER) tagger. There are 12 classes: 4 punctuation outcomes (period, comma, question mark, or no mark) combined with 3 case variants.
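The count of 12 follows if "no punctuation mark" is treated as a fourth punctuation outcome alongside the period, comma, and question mark, with each outcome paired with one of the 3 case variants. A minimal Python sketch of that label space is shown here. The label names are hypothetical and are not taken from the model's own code.

```python
from itertools import product

# Hypothetical label names; the model's real label strings may differ.
PUNCTUATION = ["O", "PERIOD", "COMMA", "QUESTION"]  # "O" = no mark after the token
CASE = ["LOWER", "CAPITALIZED", "UPPER"]            # the 3 case variants

# One class per (case, punctuation) pair: 3 * 4 = 12 classes.
LABELS = [f"{case}_{punct}" for case, punct in product(CASE, PUNCTUATION)]
LABEL2ID = {label: index for index, label in enumerate(LABELS)}
ID2LABEL = {index: label for label, index in LABEL2ID.items()}

print(len(LABELS))
```

A token classifier built this way needs a single output head with 12 logits per token, rather than two separate heads for case and punctuation.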

Training

The model was trained on text transcriptions of interviews, building on the sbert_large_nlu_ru model. At inference time, the class predicted for each token is decoded into a case form and an optional trailing punctuation mark, which together restore the text.
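The decoding step can be illustrated with a small Python sketch. It assumes one predicted (case, punctuation) pair per whole word and ignores subword tokenization, which the real model has to handle. The function and label names are hypothetical and do not reproduce the package's internals.

```python
# Hypothetical label names; the real model's labels may differ.
PUNCT_MARKS = {"O": "", "PERIOD": ".", "COMMA": ",", "QUESTION": "?"}


def apply_case(word, case):
    """Return the word in the predicted case form."""
    if case == "CAPITALIZED":
        return word[:1].upper() + word[1:]
    if case == "UPPER":
        return word.upper()
    return word  # "LOWER": leave the lowercase input unchanged


def decode(words, labels):
    """Rebuild text from lowercase words and per-word (case, punct) labels."""
    restored = [
        apply_case(word, case) + PUNCT_MARKS[punct]
        for word, (case, punct) in zip(words, labels)
    ]
    return " ".join(restored)


words = ["привет", "как", "дела"]
labels = [("CAPITALIZED", "COMMA"), ("LOWER", "O"), ("LOWER", "QUESTION")]
print(decode(words, labels))  # Привет, как дела?
```

In the sketch, punctuation is attached directly after the word it was predicted for, so no separate punctuation tokens are needed in the input.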

Guide: Running Locally

To use the SbertPuncCase model locally, follow these steps:

1. Ensure git-lfs is installed on your system.

2. Install the model using the following command:

pip install git+https://huggingface.co/kontur-ai/sbert_punc_case_ru

3. Use the model in your Python code:

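from sbert_punc_case_ru import SbertPuncCase

# Instantiate the model
model = SbertPuncCase()
# Pass lowercase text without punctuation and print the restored string
print(model.punctuate("sbert punc case расставляет точки запятые и знаки вопроса вам нравится"))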

For faster inference, consider running the model on a GPU, for example a cloud GPU from AWS, GCP, or Azure.

License

The SbertPuncCase model is licensed under the Apache 2.0 License.
