rubert-tiny
cointegrated

Introduction
The rubert-tiny model is a small, distilled version of the bert-base-multilingual-cased model, optimized for Russian and English language tasks. With a size of 45 MB and 12 million parameters, it is designed for tasks where speed and efficiency are prioritized over accuracy. The model is suitable for tasks such as NER or sentiment classification in Russian, and it provides sentence representation alignment between Russian and English via its [CLS] embeddings.
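Because rubert-tiny exposes the standard BERT interface, it can be fine-tuned for Russian classification tasks with the usual transformers head classes. The snippet below is only a minimal sketch of that setup; the label count and example sentences are illustrative assumptions, not part of the original model card.

# Minimal sketch: attaching a (randomly initialized) classification head to rubert-tiny.
# num_labels and the example sentences are placeholders; fine-tuning is still required.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny")
clf = AutoModelForSequenceClassification.from_pretrained(
    "cointegrated/rubert-tiny", num_labels=2  # e.g. positive vs. negative sentiment
)

batch = tokenizer(["отличный сервис", "ужасное качество"],  # "great service" / "terrible quality"
                  padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = clf(**batch).logits  # shape (2, num_labels); meaningful only after fine-tuning
print(logits.shape)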
Architecture
The rubert-tiny model is approximately ten times smaller and faster than a base-sized BERT model. Its training combined MLM loss, a translation ranking loss, and distillation of its [CLS] embeddings from several teacher models, including LaBSE, to enhance its performance.
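As a rough illustration of the [CLS]-embedding distillation idea (not the author's actual training code; the teacher model, projection layer, and plain MSE objective are assumptions), a frozen teacher such as LaBSE provides target sentence vectors and the student's [CLS] vector is pulled toward them:

# Illustrative sketch of [CLS]-embedding distillation; not the card's exact recipe.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny")
student = AutoModel.from_pretrained("cointegrated/rubert-tiny")
proj = torch.nn.Linear(student.config.hidden_size, 768)  # 768 = assumed teacher embedding size

def distill_loss(texts, teacher_vecs):
    # teacher_vecs: precomputed sentence embeddings from a frozen teacher (e.g. LaBSE)
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    cls = student(**batch).last_hidden_state[:, 0, :]            # student [CLS] states
    student_vecs = F.normalize(proj(cls))                        # project and L2-normalize
    return F.mse_loss(student_vecs, F.normalize(teacher_vecs))   # match the teacher vectors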
Training
The model was trained on datasets such as the Yandex Translate corpus, OPUS-100, and Tatoeba. Training combined MLM loss distilled from bert-base-multilingual-cased with translation ranking and embedding distillation objectives to improve its efficiency and performance on Russian NLU tasks.
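The translation ranking component can be pictured as an in-batch contrastive objective over aligned Russian-English sentence pairs. The sketch below only illustrates that idea; the temperature, similarity measure, and batching are assumptions rather than details taken from the card.

# Illustrative in-batch translation ranking loss over aligned RU-EN sentence pairs.
import torch
import torch.nn.functional as F

def translation_ranking_loss(ru_vecs, en_vecs, temperature=0.05):
    # ru_vecs, en_vecs: (batch, dim) embeddings of parallel sentences; row i of each is a pair
    ru = F.normalize(ru_vecs, dim=-1)
    en = F.normalize(en_vecs, dim=-1)
    logits = ru @ en.T / temperature            # cosine similarity of every RU-EN combination
    targets = torch.arange(logits.size(0))      # the true translation sits on the diagonal
    # symmetric cross-entropy: each sentence must rank its own translation highest
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2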
Guide: Running Locally
To produce sentence embeddings with rubert-tiny, follow these steps:
- Install Dependencies:
  pip install transformers sentencepiece
- Load the Model:
  import torch
  from transformers import AutoTokenizer, AutoModel

  tokenizer = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny")
  model = AutoModel.from_pretrained("cointegrated/rubert-tiny")
  # model.cuda()  # Uncomment if using a GPU
- Embed Text (a cross-lingual usage example follows this list):
  def embed_bert_cls(text, model, tokenizer):
      # Tokenize, run the model, and return the L2-normalized [CLS] embedding
      t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
      with torch.no_grad():
          model_output = model(**{k: v.to(model.device) for k, v in t.items()})
      embeddings = model_output.last_hidden_state[:, 0, :]
      embeddings = torch.nn.functional.normalize(embeddings)
      return embeddings[0].cpu().numpy()

  print(embed_bert_cls('привет мир', model, tokenizer).shape)  # 'привет мир' = 'hello world'
  # (312,)
- Cloud GPUs: To enhance performance, consider using cloud GPU services such as AWS EC2, Google Cloud Platform, or Azure.
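As a usage example of the Russian-English alignment mentioned in the introduction, the [CLS] embeddings of a sentence and its translation should be close in cosine similarity. The sentence pair below is an illustrative assumption, not an example from the original card.

# Cross-lingual similarity check using embed_bert_cls from the steps above.
import numpy as np

ru = embed_bert_cls('я люблю кошек', model, tokenizer)  # "I love cats"
en = embed_bert_cls('I love cats', model, tokenizer)
print(float(np.dot(ru, en)))  # embeddings are L2-normalized, so the dot product is the cosine;
                              # expect a noticeably higher value than for unrelated sentences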
License
The rubert-tiny model is licensed under the MIT License, providing flexibility for both personal and commercial use.