LaBSE-en-ru

cointegrated

Introduction

LaBSE-en-ru is a language model adapted from sentence-transformers/LaBSE, itself a port of Google's LaBSE model. Its vocabulary is trimmed to English and Russian tokens only, yielding a much smaller model that produces embeddings without sacrificing quality.

Architecture

The model is a streamlined version of LaBSE, with a vocabulary reduced to 10% of the original to include only English and Russian tokens. This reduction maintains the quality of the embeddings while decreasing the number of parameters to 27% of the original model.
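The reduction figures above follow from simple arithmetic on the embedding matrix. The sketch below uses approximate, publicly reported sizes for the original LaBSE (BERT-base hidden size 768, vocabulary of 501,153 tokens, roughly 471M parameters); these numbers are assumptions for illustration, not exact counts from the model files.

```python
# Back-of-the-envelope estimate of the savings from vocabulary pruning.
# All constants are approximate figures for Google's LaBSE.
HIDDEN_SIZE = 768                 # BERT-base hidden size used by LaBSE
ORIG_VOCAB = 501_153              # original LaBSE vocabulary size
PRUNED_VOCAB = ORIG_VOCAB // 10   # ~10% kept (English + Russian tokens)
ORIG_PARAMS = 471_000_000         # approximate LaBSE parameter count

# Each removed token deletes one row of the input embedding matrix.
saved = (ORIG_VOCAB - PRUNED_VOCAB) * HIDDEN_SIZE
remaining = ORIG_PARAMS - saved
print(f"remaining params: ~{remaining / 1e6:.0f}M "
      f"({remaining / ORIG_PARAMS:.0%} of the original)")
```

With these rough inputs the estimate lands near 27% of the original parameter count, matching the figure above: almost all of the savings come from the embedding matrix, while the transformer layers are untouched.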

Training

The model reuses the pre-trained LaBSE weights for feature extraction, embedding, and sentence-similarity tasks. It was adapted specifically for English and Russian, maintaining performance on those languages while reducing model size and complexity.

Guide: Running Locally

To use LaBSE-en-ru for sentence embeddings, follow these steps:

  1. Install PyTorch and Transformers:

    pip install torch transformers
    
  2. Load the Model and Tokenizer:

    import torch
    from transformers import AutoTokenizer, AutoModel
    
    tokenizer = AutoTokenizer.from_pretrained("cointegrated/LaBSE-en-ru")
    model = AutoModel.from_pretrained("cointegrated/LaBSE-en-ru")
    
  3. Prepare and Encode Sentences:

    sentences = ["Hello World", "Привет Мир"]
    encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=64, return_tensors='pt')
    
    with torch.no_grad():
        model_output = model(**encoded_input)
    embeddings = model_output.pooler_output
    embeddings = torch.nn.functional.normalize(embeddings)
    print(embeddings)
    
  4. Cloud GPU Recommendation: For resource-intensive workloads, run the model on a cloud GPU service such as Google Colab or AWS for faster processing.
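Because the embeddings from step 3 are L2-normalized, cross-lingual similarity is just a dot product. The sketch below uses a random stand-in tensor in place of the real model output so it runs without downloading the model; with actual LaBSE-en-ru embeddings, a translation pair like "Hello World" / "Привет Мир" should score close to 1.0.

```python
import torch
import torch.nn.functional as F

# Stand-in for the `embeddings` tensor from step 3: two sentences,
# 768-dimensional, L2-normalized. Replace with real model output in practice.
embeddings = F.normalize(torch.randn(2, 768), dim=1)

# For unit-length vectors, a matrix product gives cosine similarities directly.
similarity = embeddings @ embeddings.T
print(similarity)  # 2x2 matrix; diagonal entries are exactly 1.0
```

The same pattern scales to a batch of sentences: encode them all at once, normalize, and take `embeddings @ embeddings.T` to get the full pairwise similarity matrix.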

License

The model is licensed under the terms available at https://tfhub.dev/google/LaBSE/1.
