RuBERT-Base-Cased-Sentiment
blanchefort/rubert-base-cased-sentiment
Introduction
RuBERT-Base-Cased-Sentiment is a model for sentiment analysis of short Russian texts. It is based on the DeepPavlov/rubert-base-cased-conversational model and was trained on a corpus of 351,797 texts. The model classifies each text into one of three sentiment categories: Neutral, Positive, or Negative.
Architecture
The model uses the BERT architecture, implemented in PyTorch, and is also compatible with TensorFlow and JAX. It is loaded through the transformers library and performs text classification, specifically sentiment analysis of Russian text.
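As a rough illustration of how the transformers library exposes a text-classification model, the sketch below wraps the generic `pipeline` API. The helper names `build_classifier`, `top_label`, and `classify` are hypothetical and chosen for this example only. The label strings returned depend on the model's own configuration and are not assumed here.

```python
MODEL_NAME = "blanchefort/rubert-base-cased-sentiment"


def build_classifier(model_name=MODEL_NAME):
    """Build a text-classification pipeline; weights are downloaded on first use."""
    # Imported lazily so the helpers below can be used (and tested)
    # without transformers installed until a classifier is actually built.
    from transformers import pipeline
    return pipeline("text-classification", model=model_name)


def top_label(result):
    """Extract (label, score) from a single pipeline result dict."""
    return result["label"], float(result["score"])


def classify(texts, classifier):
    """Run the classifier on a list of strings; returns one (label, score) per text."""
    return [top_label(r) for r in classifier(texts)]
```

With PyTorch and transformers installed, a call such as `classify(["Отличный сервис, спасибо!"], build_classifier())` (the text means "Great service, thank you!") should return a single label and score pair for that text.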
Training
The model was trained using several datasets:
- RuTweetCorp: A corpus built for training sentiment classifiers from microblog posts.
- RuReviews: An annotated sentiment analysis dataset for Russian product reviews.
- RuSentiment: A dataset for sentiment analysis in Russian social media.
- Reviews of medical institutions (Отзывы о медучреждениях): A dataset of reviews of medical institutions collected from prodoctorov.ru.
Guide: Running Locally
To use the model locally, follow these steps:
- Install PyTorch and Transformers: Ensure you have the PyTorch library and the transformers package installed.
- Load the Model and Tokenizer:
    import torch
    from transformers import AutoModelForSequenceClassification, BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained('blanchefort/rubert-base-cased-sentiment')
    model = AutoModelForSequenceClassification.from_pretrained('blanchefort/rubert-base-cased-sentiment', return_dict=True)
- Define the Prediction Function:
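    @torch.no_grad()  # inference only, so gradient tracking is disabled
    def predict(text):
        # Tokenize a string or a list of strings, truncating to 512 tokens
        inputs = tokenizer(text, max_length=512, padding=True, truncation=True, return_tensors='pt')
        outputs = model(**inputs)
        # Convert logits to class probabilities, then take the most likely class
        predicted = torch.nn.functional.softmax(outputs.logits, dim=1)
        predicted = torch.argmax(predicted, dim=1).numpy()
        # NumPy array of predicted class indices, one per input text
        return predicted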
- Use Cloud GPUs: For larger datasets or faster processing, consider using cloud-based GPUs like those offered by AWS, GCP, or Azure.
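The steps above can be extended to larger inputs with a minimal sketch like the following. It processes texts in batches, moves the model to a GPU when one is available, and turns class indices into label names using the mapping stored in the model configuration (`model.config.id2label`). The helper names `batched`, `ids_to_labels`, and `predict_labels`, as well as the default batch size of 32, are assumptions made for this example and are not part of the original guide.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def ids_to_labels(class_ids, id2label):
    """Map integer class ids to label names using an id -> name dict."""
    return [id2label[int(i)] for i in class_ids]


def predict_labels(texts, tokenizer, model, batch_size=32):
    """Classify a list of strings in batches; returns one label name per text."""
    import torch  # imported here so the helpers above work without PyTorch

    # Use the GPU if PyTorch can see one, otherwise stay on the CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()

    labels = []
    with torch.no_grad():
        for batch in batched(texts, batch_size):
            inputs = tokenizer(batch, max_length=512, padding=True,
                               truncation=True, return_tensors="pt").to(device)
            logits = model(**inputs).logits
            # argmax over logits gives the same class as argmax over softmax
            class_ids = logits.argmax(dim=1).tolist()
            labels.extend(ids_to_labels(class_ids, model.config.id2label))
    return labels
```

Given the `tokenizer` and `model` objects loaded earlier, `predict_labels(list_of_texts, tokenizer, model)` would return a list of label strings. The batch size can be lowered if GPU memory is limited.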
License
The model and datasets used are subject to their respective licenses. Ensure compliance with all licensing terms when using or distributing the model and datasets.