LaBSE-en-ru
cointegrated

Introduction
LaBSE-en-ru is a sentence embedding model adapted from sentence-transformers/LaBSE, itself a port of Google's LaBSE model. It restricts the vocabulary to English and Russian tokens, yielding a much smaller model without sacrificing embedding quality.
Architecture
The model is a streamlined version of LaBSE, with a vocabulary reduced to 10% of the original to include only English and Russian tokens. This reduction maintains the quality of the embeddings while decreasing the number of parameters to 27% of the original model.
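The stated 27% figure follows from the fact that the token embedding matrix dominates LaBSE's parameter count, so pruning the vocabulary shrinks the model almost proportionally. A back-of-the-envelope check, assuming the publicly documented LaBSE configuration (vocabulary of roughly 501,153 tokens, hidden size 768, about 471M total parameters; these figures are approximations, not taken from this model card):

```python
# Approximate LaBSE figures (assumptions, see lead-in above).
VOCAB_FULL = 501_153
HIDDEN = 768
TOTAL_FULL = 471_000_000

emb_full = VOCAB_FULL * HIDDEN        # embedding matrix, ~385M parameters
non_emb = TOTAL_FULL - emb_full       # transformer layers etc., left unchanged

# Keep only ~10% of the vocabulary (English and Russian tokens).
emb_pruned = int(0.10 * VOCAB_FULL) * HIDDEN
total_pruned = emb_pruned + non_emb

print(total_pruned / TOTAL_FULL)      # roughly 0.26-0.27 of the original size
```

The result lands close to the 27% quoted above, confirming that vocabulary pruning alone accounts for almost the entire reduction.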
Training
The model leverages pre-trained LaBSE architecture for feature extraction, embeddings, and sentence similarity tasks. It was refined to specifically handle English and Russian languages, maintaining performance while reducing complexity.
Guide: Running Locally
To use LaBSE-en-ru for sentence embeddings, follow these steps:
- Install PyTorch and Transformers:

  pip install torch transformers
- Load the Model and Tokenizer:

  import torch
  from transformers import AutoTokenizer, AutoModel

  tokenizer = AutoTokenizer.from_pretrained("cointegrated/LaBSE-en-ru")
  model = AutoModel.from_pretrained("cointegrated/LaBSE-en-ru")
- Prepare and Encode Sentences:

  sentences = ["Hello World", "Привет Мир"]
  encoded_input = tokenizer(sentences, padding=True, truncation=True,
                            max_length=64, return_tensors='pt')
  with torch.no_grad():
      model_output = model(**encoded_input)
  embeddings = model_output.pooler_output
  embeddings = torch.nn.functional.normalize(embeddings)
  print(embeddings)
- Cloud GPU Recommendation: Use cloud services such as Google Colab or AWS for resource-intensive tasks to leverage GPUs for faster processing.
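Because the embeddings above are L2-normalized, the cosine similarity between two sentences reduces to a plain dot product. A minimal pure-Python sketch of that comparison step, using small toy vectors in place of the model's real 768-dimensional outputs (the vector values are illustrative, not actual model outputs):

```python
import math

def normalize(v):
    """L2-normalize a vector, mirroring torch.nn.functional.normalize."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine similarity; for unit vectors this is just the dot product."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

# Toy stand-ins for the embeddings of the English and Russian sentences.
emb_en = [0.2, 0.9, 0.1]
emb_ru = [0.25, 0.85, 0.05]

print(cosine_similarity(emb_en, emb_ru))
```

With the real model, translation pairs such as "Hello World" and "Привет Мир" should likewise score close to 1.0, which is what makes LaBSE-style embeddings useful for cross-lingual sentence retrieval.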
License
The model is licensed under the terms available at https://tfhub.dev/google/LaBSE/1.