Introduction

RuBERT-Tiny2 is an updated version of the RuBERT-Tiny model: a compact, Russian BERT-based encoder designed to produce high-quality sentence embeddings. Compared to its predecessor, it offers a larger vocabulary, a longer supported sequence length, and better sentence embeddings for Russian-language tasks.

Architecture

RuBERT-Tiny2 has a vocabulary of 83,828 tokens, up from its predecessor's 29,564, and supports sequences of up to 2048 tokens instead of the previous 512. Its sentence embeddings approximate LaBSE more closely than before, and its segment embeddings are now meaningful, having been fine-tuned on a Natural Language Inference (NLI) task.
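
These figures can be checked locally by reading them off the tokenizer and model config; the snippet below is a minimal sketch that assumes the transformers library is installed and uses the cointegrated/rubert-tiny2 checkpoint.

    from transformers import AutoConfig, AutoTokenizer

    # Download only the tokenizer and config; no model weights are needed for this check
    tokenizer = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny2")
    config = AutoConfig.from_pretrained("cointegrated/rubert-tiny2")

    print(tokenizer.vocab_size)            # should match the 83,828 figure above
    print(config.max_position_embeddings)  # should match the 2048-token limit above
    print(config.hidden_size)              # embedding dimensionality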

Training

The model is pretrained with a focus on generating accurate sentence embeddings for Russian texts. It is designed for tasks like KNN classification of short texts and can be fine-tuned for specific downstream applications.
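
As an illustration of the KNN use case, the sketch below is a toy example: it assumes scikit-learn is installed and reuses the embed_bert_cls helper, model, and tokenizer from the guide below, and the texts and labels are made-up placeholders.

    from sklearn.neighbors import KNeighborsClassifier

    # Toy short-text classification with KNN over sentence embeddings.
    # embed_bert_cls, model, and tokenizer are defined in the guide below.
    texts = ["какая сегодня погода", "курс доллара вырос", "завтра будет дождь"]
    labels = ["weather", "finance", "weather"]

    X = [embed_bert_cls(t, model, tokenizer) for t in texts]

    knn = KNeighborsClassifier(n_neighbors=1, metric="cosine")
    knn.fit(X, labels)

    print(knn.predict([embed_bert_cls("ожидается снегопад", model, tokenizer)]))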

Guide: Running Locally

  1. Installation:
    Install the necessary packages (add sentence-transformers if you plan to use the alternative in step 3):

    pip install transformers sentencepiece
    
  2. Script:
    Use the following Python script to generate embeddings:

    import torch
    from transformers import AutoTokenizer, AutoModel
    
    tokenizer = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny2")
    model = AutoModel.from_pretrained("cointegrated/rubert-tiny2")
    # model.cuda()  # Uncomment if using a GPU
    
    def embed_bert_cls(text, model, tokenizer):
        # Tokenize the text and move the tensors to the model's device
        t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
        with torch.no_grad():
            model_output = model(**{k: v.to(model.device) for k, v in t.items()})
        # Take the [CLS] token representation as the sentence embedding
        embeddings = model_output.last_hidden_state[:, 0, :]
        # L2-normalize so embeddings can be compared with cosine similarity
        embeddings = torch.nn.functional.normalize(embeddings)
        return embeddings[0].cpu().numpy()

    print(embed_bert_cls('привет мир', model, tokenizer).shape)  # prints the embedding dimensionality
    
  3. Alternative with Sentence Transformers:
    Generate embeddings with the sentence-transformers library instead (a cosine-similarity example follows in step 5):

    from sentence_transformers import SentenceTransformer
    
    model = SentenceTransformer('cointegrated/rubert-tiny2')
    sentences = ["привет мир", "hello world", "здравствуй вселенная"]
    embeddings = model.encode(sentences)  # NumPy array of shape (len(sentences), embedding_dim)
    print(embeddings)
    
  4. Cloud GPUs:
    For faster inference, consider using cloud GPU services such as AWS EC2, Google Cloud, or Azure.
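
  5. Sentence Similarity:
    As a sketch of how the embeddings can be compared, the following computes pairwise cosine similarity with NumPy, reusing the Sentence Transformers example from step 3 (numpy is assumed to be installed):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('cointegrated/rubert-tiny2')
    sentences = ["привет мир", "hello world", "здравствуй вселенная"]
    embeddings = model.encode(sentences)

    # Cosine similarity between every pair of sentences
    normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    print(np.round(normalized @ normalized.T, 3))  # 3x3 similarity matrix with 1.0 on the diagonal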

License

RuBERT-Tiny2 is distributed under the MIT License, allowing for both commercial and non-commercial use, modification, and distribution.
