sbert_large_nlu_ru

ai-forever

Introduction

The SBERT_LARGE_NLU_RU model is a BERT-large model for generating sentence embeddings for Russian. It is designed to be used with the Transformers library and PyTorch. The model produces high-quality embeddings when mean pooling is applied over token embeddings, as shown in the guide below.

Architecture

The model is based on the BERT architecture, specifically a large, uncased variant for the Russian language. It supports feature extraction and text-embedding inference, making it suitable for a variety of natural language understanding tasks.
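
As a quick illustration of feature extraction, the model can be used through the Transformers feature-extraction pipeline (a minimal sketch; note that the pipeline returns token-level vectors, so mean pooling, described below, is still needed for sentence embeddings):

    from transformers import pipeline

    # Token-level feature extraction; each sentence yields one vector per token.
    extractor = pipeline("feature-extraction", model="ai-forever/sbert_large_nlu_ru")
    features = extractor("Привет! Как твои дела?")
    print(len(features[0]), len(features[0][0]))  # number of tokens, hidden size (1024 for this BERT-large model)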

Training

The model produces sentence embeddings by mean pooling over token embeddings. During pooling, the attention mask is applied so that padding tokens are excluded from the average, which improves the quality of the resulting embeddings.
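
To illustrate the idea on toy values (a minimal sketch with made-up numbers, not actual model output), masked mean pooling excludes padding positions from the average:

    import torch

    # Two real token embeddings followed by one padding position (hidden size 2 for illustration).
    token_embeddings = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])
    attention_mask = torch.tensor([[1, 1, 0]])  # the last position is padding

    mask = attention_mask.unsqueeze(-1).float()
    masked_mean = (token_embeddings * mask).sum(1) / mask.sum(1)
    print(masked_mean)  # tensor([[2., 3.]]); the padding vector [9., 9.] is ignored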

Guide: Running Locally

  1. Install Dependencies: Ensure you have Python installed and set up with the necessary libraries, such as transformers and torch.

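     A typical setup looks like the following (a minimal sketch; package versions are not pinned in the source):

    pip install transformers torch
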
  2. Load Model and Tokenizer:

    from transformers import AutoTokenizer, AutoModel
    import torch
    
    tokenizer = AutoTokenizer.from_pretrained("ai-forever/sbert_large_nlu_ru")
    model = AutoModel.from_pretrained("ai-forever/sbert_large_nlu_ru")
    
  3. Tokenize Sentences: Prepare your input sentences for the model.

    sentences = ['Привет! Как твои дела?', 'А правда, что 42 твое любимое число?']
    encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=24, return_tensors='pt')
    
  4. Compute Embeddings: Use the model to obtain sentence embeddings (a cosine-similarity sanity check is sketched after this list).

    with torch.no_grad():
        model_output = model(**encoded_input)

    def mean_pooling(model_output, attention_mask):
        # The first element of model_output holds the token embeddings (last hidden state)
        token_embeddings = model_output[0]
        # Expand the attention mask so that padding tokens are excluded from the average
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
        # Clamp to avoid division by zero for fully masked sequences
        sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        return sum_embeddings / sum_mask

    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    
  5. Cloud GPU Suggestion: For enhanced performance and efficiency, consider running this model on a cloud-based GPU service, such as AWS EC2 with NVIDIA GPUs or Google Cloud Platform's GPU instances.
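
If a GPU is available, whether locally or on one of the cloud instances suggested above, the model and inputs can be moved to it; a minimal sketch, assuming steps 2 and 3 have already been run:

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}

    with torch.no_grad():
        model_output = model(**encoded_input)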

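To sanity-check the embeddings produced in step 4, the cosine similarity between the two example sentences can be computed; a minimal sketch building on the variables defined above:

    import torch.nn.functional as F

    # Cosine similarity between the two sentence embeddings (the exact value depends on the model weights).
    similarity = F.cosine_similarity(sentence_embeddings[0], sentence_embeddings[1], dim=0)
    print(similarity.item())
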
License

The model is hosted on Hugging Face and developed by the SberDevices team. For detailed licensing information, refer to the model's repository on Hugging Face. The authors include Aleksandr Abramov, Denis Antykhov, and Ibragim Badertdinov, whose contributions can be found through their GitHub and Kaggle profiles.
