rubert-tiny2
cointegrated
Introduction
RuBERT-Tiny2 is an updated version of the RuBERT-Tiny model. It is a compact, BERT-based encoder for Russian, designed to produce high-quality sentence embeddings. Compared with its predecessor, it offers a larger vocabulary, longer supported sequences, and better sentence embeddings.
Architecture
RuBERT-Tiny2 has a vocabulary of 83,828 tokens, compared to its predecessor's 29,564. It supports sequences of up to 2048 tokens, up from 512. Its sentence embeddings approximate those of LaBSE more closely than before, and its segment embeddings are meaningful, having been tuned on the Natural Language Inference (NLI) task.
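To make the effect of these changes concrete, the sketch below estimates how many parameters the token and position embedding tables hold before and after the update. The vocabulary sizes and sequence lengths come from the text above; the hidden size of 312 is an assumption that is not stated in this section, and the helper name `embedding_table_size` is hypothetical.

```python
def embedding_table_size(num_rows, hidden_size):
    """Number of parameters in an embedding table with num_rows vectors."""
    return num_rows * hidden_size


# Assumption: hidden size of 312 (not stated in this section).
HIDDEN_SIZE = 312

# Figures quoted in the Architecture section.
OLD_VOCAB, NEW_VOCAB = 29_564, 83_828
OLD_MAX_LEN, NEW_MAX_LEN = 512, 2048

old_token_params = embedding_table_size(OLD_VOCAB, HIDDEN_SIZE)
new_token_params = embedding_table_size(NEW_VOCAB, HIDDEN_SIZE)
old_position_params = embedding_table_size(OLD_MAX_LEN, HIDDEN_SIZE)
new_position_params = embedding_table_size(NEW_MAX_LEN, HIDDEN_SIZE)

print(old_token_params, new_token_params)        # 9223968 26154336
print(old_position_params, new_position_params)  # 159744 638976
```

Under that assumption, most of the added parameters sit in the token embedding table rather than in the position embeddings.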
Training
The model is pretrained with a focus on producing accurate sentence embeddings for Russian texts. It can be used as is to generate embeddings, for example for k-nearest-neighbors (KNN) classification of short texts, or fine-tuned for a specific downstream application.
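As an illustration of the KNN use case, here is a minimal sketch of nearest-neighbor classification over sentence embeddings. It assumes the embeddings are already L2-normalized, so the dot product acts as cosine similarity. The three-dimensional vectors, the labels, and the helper `knn_predict` are hypothetical stand-ins for real model output, not part of the model's API.

```python
from collections import Counter


def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


def knn_predict(query, train_vectors, train_labels, k=3):
    """Return the majority label among the k training vectors most similar to query."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: dot(query, pair[0]),
        reverse=True,
    )
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Hypothetical unit-length 3-d vectors standing in for real sentence embeddings.
train_vectors = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.6, 0.8],
]
train_labels = ["greeting", "greeting", "weather", "weather"]

query = [0.96, 0.28, 0.0]
print(knn_predict(query, train_vectors, train_labels, k=3))  # prints "greeting"
```

In practice the training vectors and the query would come from the embedding function, and a vectorized library would replace the pure-Python loops for larger datasets.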
Guide: Running Locally
- Installation:
  Install the necessary packages:

      pip install transformers sentencepiece
- Script:
  Use the following Python script to generate embeddings:

      import torch
      from transformers import AutoTokenizer, AutoModel

      tokenizer = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny2")
      model = AutoModel.from_pretrained("cointegrated/rubert-tiny2")
      # model.cuda()  # Uncomment if using a GPU

      def embed_bert_cls(text, model, tokenizer):
          # Tokenize and move the input tensors to the model's device
          t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
          with torch.no_grad():
              model_output = model(**{k: v.to(model.device) for k, v in t.items()})
          # Take the [CLS] token's hidden state and L2-normalize it
          embeddings = model_output.last_hidden_state[:, 0, :]
          embeddings = torch.nn.functional.normalize(embeddings)
          # Return the embedding of the first input only
          return embeddings[0].cpu().numpy()

      # 'привет мир' is Russian for "hello world"
      print(embed_bert_cls('привет мир', model, tokenizer).shape)
- Alternative with Sentence Transformers:

      from sentence_transformers import SentenceTransformer

      model = SentenceTransformer('cointegrated/rubert-tiny2')
      sentences = ["привет мир", "hello world", "здравствуй вселенная"]
      embeddings = model.encode(sentences)
      print(embeddings)
- Cloud GPUs:
  For faster computation, consider a cloud GPU service such as AWS EC2, Google Cloud, or Azure.
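The script in this guide L2-normalizes the [CLS] vector before returning it. The sketch below shows why that is convenient: once vectors have unit length, a plain dot product gives their cosine similarity. The input vectors are hypothetical toy values, and the helper names are illustrative rather than part of any library.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Hypothetical raw vectors standing in for unnormalized model output.
raw_a = [3.0, 4.0, 0.0]
raw_b = [4.0, 3.0, 0.0]

unit_a = l2_normalize(raw_a)
unit_b = l2_normalize(raw_b)

# Both values are approximately 0.96.
print(cosine_similarity(raw_a, raw_b))
print(dot(unit_a, unit_b))
```

Storing normalized embeddings therefore lets similarity search use dot products alone, without recomputing norms for every comparison.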
License
RuBERT-Tiny2 is distributed under the MIT License, allowing for both commercial and non-commercial use, modification, and distribution.