rubert-base-cased
DeepPavlov
Introduction
RuBERT is a pre-trained language model adapted specifically for Russian. It is based on the BERT architecture and is intended for tasks that require understanding Russian text. The model has 12 layers, 768 hidden units, 12 attention heads, and approximately 180 million parameters.
Architecture
RuBERT is a Russian, cased model initialized from a multilingual BERT-base model. It was trained on a large corpus consisting of the Russian part of Wikipedia and Russian news data to build a vocabulary of Russian subtokens. The model features:
- 12 layers
- 768 hidden units
- 12 attention heads
- 180 million parameters
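These figures can be checked directly against the published configuration. A minimal sketch, assuming the Hugging Face transformers library is installed and the DeepPavlov/rubert-base-cased checkpoint is used:

from transformers import AutoConfig

# Load only the model configuration (no weights are downloaded)
config = AutoConfig.from_pretrained("DeepPavlov/rubert-base-cased")

print(config.num_hidden_layers)    # 12 transformer layers
print(config.hidden_size)          # 768 hidden units
print(config.num_attention_heads)  # 12 attention heads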
Training
The model was trained on a corpus comprising the Russian part of Wikipedia and Russian news data. Training adapted the multilingual BERT-base checkpoint to Russian by building a vocabulary of Russian subtokens from this data, and used the standard Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) pre-training objectives.
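The effect of the Russian subtoken vocabulary can be seen by tokenizing a Russian sentence, where common words map to few subtokens. A minimal sketch, assuming the transformers library; the example sentence is arbitrary:

from transformers import AutoTokenizer

# The vocabulary was built from Russian Wikipedia and news data,
# so frequent Russian words are split into few subtokens
tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased")

tokens = tokenizer.tokenize("Москва - столица России.")
print(tokens)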
Guide: Running Locally
To run RuBERT locally, follow these steps:
- Install Python: Ensure Python is installed on your machine.
- Set up a virtual environment:
python -m venv rubert-env
source rubert-env/bin/activate  # On Windows use `rubert-env\Scripts\activate`
- Install Hugging Face Transformers library:
pip install transformers
- Load the RuBERT model:
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased")
model = AutoModel.from_pretrained("DeepPavlov/rubert-base-cased")
- Inference: Use the tokenizer and model for feature extraction or other NLP tasks.
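As a concrete illustration of the inference step, the following sketch extracts a sentence-level feature vector by mean-pooling the encoder's final hidden states. The pooling strategy and example sentence are illustrative choices, not an official recipe:

import torch

# Tokenize a Russian sentence and run it through the encoder
inputs = tokenizer("Привет, мир!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.last_hidden_state has shape (batch, sequence_length, 768);
# mean-pool over tokens to get a single 768-dimensional sentence vector
sentence_embedding = outputs.last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])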
For improved performance, consider using cloud GPUs from platforms like AWS, Google Cloud, or Azure.
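If a GPU is available, locally or on a cloud instance, moving the model onto it is a small change. A minimal sketch, assuming a PyTorch build with CUDA support:

import torch

# Use a GPU if one is visible, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Inputs must be moved to the same device before calling the model
inputs = tokenizer("Пример текста.", return_tensors="pt").to(device)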
License
RuBERT is made available under the licensing terms specified on its Hugging Face model card; refer to the model card for the detailed terms and conditions.