Introduction

The rubert-tiny model is a small, distilled version of bert-base-multilingual-cased, optimized for Russian and English. At roughly 45 MB and 12 million parameters, it targets applications where speed and memory footprint matter more than raw accuracy. It is well suited for fine-tuning on Russian tasks such as NER or sentiment classification, and its [CLS] embeddings are aligned between Russian and English sentences, which makes them usable for cross-lingual sentence similarity.
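
As a quick illustration of the fine-tuning use case, the model can be loaded with a classification head through the standard transformers API. The sketch below assumes a hypothetical two-label sentiment task; the head is randomly initialized, so it produces meaningful predictions only after fine-tuning:

    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny")
    model = AutoModelForSequenceClassification.from_pretrained(
        "cointegrated/rubert-tiny",
        num_labels=2,  # assumed labels: negative/positive
    )
    # The classification head is untrained here; fine-tune it with Trainer
    # or a custom loop on a labeled Russian sentiment dataset.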

Architecture

The rubert-tiny model is approximately ten times smaller and faster than a base-sized BERT. It was trained with a combination of objectives: a masked language modeling (MLM) loss, a translation ranking loss over parallel Russian-English sentence pairs, and distillation of [CLS] embeddings from stronger models, including LaBSE.
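
The translation ranking objective can be read as an in-batch contrastive loss: each Russian sentence embedding should be closer to the embedding of its own English translation than to the other translations in the batch. The following is a minimal sketch of that idea, not the actual training code; the tensor shapes and the temperature value are illustrative assumptions:

    import torch
    import torch.nn.functional as F

    def translation_ranking_loss(ru_emb, en_emb, temperature=0.05):
        # ru_emb, en_emb: (batch, dim) [CLS] embeddings of parallel sentences;
        # row i of ru_emb is the translation of row i of en_emb.
        ru = F.normalize(ru_emb, dim=-1)
        en = F.normalize(en_emb, dim=-1)
        # Cosine similarity of every Russian sentence to every English one.
        logits = ru @ en.t() / temperature
        # The correct translation sits on the diagonal.
        targets = torch.arange(ru.size(0), device=ru.device)
        return F.cross_entropy(logits, targets)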

Training

The model was trained on parallel corpora such as the Yandex Translate corpus, OPUS-100, and Tatoeba. Its MLM loss was distilled from bert-base-multilingual-cased, with the ranking and embedding-distillation objectives described above applied on top, improving its efficiency and performance on Russian NLU tasks.
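
Distilling an MLM loss typically means training the student to match the teacher's predicted distribution over masked tokens rather than only the hard labels. A common formulation, shown here as an illustrative sketch rather than this model's exact recipe, takes the KL divergence between temperature-softened teacher and student logits:

    import torch.nn.functional as F

    def distilled_mlm_loss(student_logits, teacher_logits, temperature=2.0):
        # student_logits, teacher_logits: (batch, seq_len, vocab) MLM scores
        # at the same masked positions. For simplicity this assumes teacher
        # and student share a vocabulary; distilling across different
        # tokenizers would need an extra alignment step.
        s = F.log_softmax(student_logits / temperature, dim=-1)
        t = F.softmax(teacher_logits / temperature, dim=-1)
        # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
        return F.kl_div(s, t, reduction='batchmean') * temperature ** 2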

Guide: Running Locally

To produce sentence embeddings with rubert-tiny, follow these steps:

  1. Install Dependencies:

    pip install transformers sentencepiece
    
  2. Load the Model:

    import torch
    from transformers import AutoTokenizer, AutoModel
    
    tokenizer = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny")
    model = AutoModel.from_pretrained("cointegrated/rubert-tiny")
    # model.cuda()  # Uncomment if using a GPU
    
  3. Embed Text (a cross-lingual usage example follows this list):

    def embed_bert_cls(text, model, tokenizer):
        # Tokenize and move the inputs to the same device as the model.
        t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
        with torch.no_grad():
            model_output = model(**{k: v.to(model.device) for k, v in t.items()})
        # Take the [CLS] token (position 0) as the sentence representation.
        embeddings = model_output.last_hidden_state[:, 0, :]
        # L2-normalize so that dot products equal cosine similarities.
        embeddings = torch.nn.functional.normalize(embeddings)
        return embeddings[0].cpu().numpy()

    print(embed_bert_cls('привет мир', model, tokenizer).shape)
    # (312,)
    
  4. Cloud GPUs: For faster inference or fine-tuning, consider cloud GPU services such as AWS EC2, Google Cloud Platform, or Azure.
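
As a quick sanity check of the cross-lingual alignment mentioned in the introduction, a Russian sentence and its English translation should map to nearby vectors. The example below reuses the embed_bert_cls function from step 3; because the returned vectors are L2-normalized, a plain dot product gives the cosine similarity (the exact score will vary):

    import numpy as np

    ru = embed_bert_cls('привет мир', model, tokenizer)
    en = embed_bert_cls('hello world', model, tokenizer)
    # Unit-length vectors, so the dot product is the cosine similarity;
    # it should be much higher for translations than for unrelated pairs.
    print(float(np.dot(ru, en)))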

License

The rubert-tiny model is licensed under the MIT License, providing flexibility for both personal and commercial use.
