sbert_punc_case_ru
kontur-ai
Introduction
The SbertPuncCase model is designed for punctuation and case restoration in Russian text. It is capable of inserting periods, commas, and question marks, as well as determining the appropriate case for words (lowercase, capitalized, or uppercase). This model is particularly useful for text correction after speech recognition, as it operates on lowercase text strings.
Architecture
The SbertPuncCase model is based on the sbert_large_nlu_ru model. Input text is converted to lowercase and tokenized, and the model then predicts a class for each token, much as in Named Entity Recognition (NER). There are 12 classes: 4 punctuation options (period, comma, question mark, or no mark) combined with 3 case variants (lowercase, capitalized, uppercase).
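The label space described above can be sketched as a Cartesian product. This is only an illustration: the label names are hypothetical, and the count of 12 is reproduced by assuming that "no punctuation mark" counts as a fourth punctuation option (4 × 3 = 12). The model's real label strings and ordering may differ.

```python
from itertools import product

# Hypothetical label names, chosen for illustration only.
# Assumption: "NONE" (no mark after the token) is a fourth punctuation option.
PUNCTUATION = ["NONE", "PERIOD", "COMMA", "QUESTION"]
CASE = ["LOWER", "CAPITALIZE", "UPPER"]

# Each class is one (case, punctuation) pair, so a single prediction
# per token carries both decisions at once.
LABELS = [f"{case}_{punct}" for case, punct in product(CASE, PUNCTUATION)]
LABEL2ID = {label: i for i, label in enumerate(LABELS)}

print(len(LABELS))  # 12
```

Combining both decisions into one class keeps the task a plain token classification problem with a single output head, at the cost of a somewhat larger label set.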
Training
The model was trained on transcriptions of interviews, building on the sbert_large_nlu_ru model. At inference time, the class predicted for each token is decoded to restore the punctuation and case of the text.
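The decoding step can be sketched as follows. This is a simplified illustration rather than the package's actual implementation: it works on whole words instead of subword tokens, and the `decode` function and label names are hypothetical.

```python
# Hypothetical label names; the real model's label set may be encoded differently.
PUNCT_MARKS = {"NONE": "", "PERIOD": ".", "COMMA": ",", "QUESTION": "?"}
CASE_FUNCS = {
    "LOWER": str.lower,
    "CAPITALIZE": str.capitalize,
    "UPPER": str.upper,
}


def decode(words, labels):
    """Apply a (case, punctuation) label to each word and join with spaces."""
    restored = []
    for word, (case, punct) in zip(words, labels):
        restored.append(CASE_FUNCS[case](word) + PUNCT_MARKS[punct])
    return " ".join(restored)


words = ["привет", "как", "дела"]
labels = [("CAPITALIZE", "COMMA"), ("LOWER", "NONE"), ("LOWER", "QUESTION")]
print(decode(words, labels))  # Привет, как дела?
```

A real decoder would also need to merge subword tokens back into words before applying the labels, which this sketch leaves out.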
Guide: Running Locally
To use the SbertPuncCase model locally, follow these steps:
- Ensure git-lfs is installed on your system.
- Install the model using the following command:
pip install git+https://huggingface.co/kontur-ai/sbert_punc_case_ru
- Use the model in your Python code:
from sbert_punc_case_ru import SbertPuncCase
model = SbertPuncCase()
model.punctuate("sbert punc case расставляет точки запятые и знаки вопроса вам нравится")
For faster inference, consider running the model on a GPU, for example a cloud GPU instance from AWS, GCP, or Azure.
License
The SbertPuncCase model is licensed under the Apache 2.0 License.