LaBSE-en-ru
cointegrated

Introduction
LaBSE-en-ru is a sentence embedding model adapted from sentence-transformers/LaBSE, itself a port of Google's LaBSE model. It restricts the vocabulary to English and Russian tokens, yielding a much smaller model without sacrificing embedding quality.
Architecture
The model is a streamlined version of LaBSE, with a vocabulary reduced to 10% of the original to include only English and Russian tokens. This reduction maintains the quality of the embeddings while decreasing the number of parameters to 27% of the original model.
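The stated 27% figure follows from the fact that the token embedding matrix dominates LaBSE's parameter count, so pruning the vocabulary shrinks the model almost proportionally. A back-of-the-envelope check, assuming the publicly documented LaBSE configuration (vocabulary of roughly 501,153 tokens, hidden size 768, about 471M total parameters; these figures are approximations, not taken from this model card):

```python
# Approximate LaBSE figures (assumptions, see lead-in above).
VOCAB_FULL = 501_153
HIDDEN = 768
TOTAL_FULL = 471_000_000

emb_full = VOCAB_FULL * HIDDEN        # embedding matrix, ~385M parameters
non_emb = TOTAL_FULL - emb_full       # transformer layers etc., left unchanged

# Keep only ~10% of the vocabulary (English and Russian tokens).
emb_pruned = int(0.10 * VOCAB_FULL) * HIDDEN
total_pruned = emb_pruned + non_emb

print(total_pruned / TOTAL_FULL)      # roughly 0.26-0.27 of the original size
```

The result lands close to the 27% quoted above, confirming that vocabulary pruning alone accounts for almost the entire reduction.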
Training
The model leverages pre-trained LaBSE architecture for feature extraction, embeddings, and sentence similarity tasks. It was refined to specifically handle English and Russian languages, maintaining performance while reducing complexity.
Guide: Running Locally
To use LaBSE-en-ru for sentence embeddings, follow these steps:
- Install PyTorch and Transformers:

  pip install torch transformers
- Load the Model and Tokenizer:

  import torch
  from transformers import AutoTokenizer, AutoModel

  tokenizer = AutoTokenizer.from_pretrained("cointegrated/LaBSE-en-ru")
  model = AutoModel.from_pretrained("cointegrated/LaBSE-en-ru")
- Prepare and Encode Sentences:

  sentences = ["Hello World", "Привет Мир"]
  encoded_input = tokenizer(sentences, padding=True, truncation=True,
                            max_length=64, return_tensors='pt')
  with torch.no_grad():
      model_output = model(**encoded_input)
  embeddings = model_output.pooler_output
  embeddings = torch.nn.functional.normalize(embeddings)
  print(embeddings)
- Cloud GPU Recommendation: Use cloud services such as Google Colab or AWS for resource-intensive tasks to leverage GPUs for faster processing.
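Because the embeddings above are L2-normalized, the cosine similarity between two sentences reduces to a plain dot product. A minimal pure-Python sketch of that comparison step, using small toy vectors in place of the model's real 768-dimensional outputs (the vector values are illustrative, not actual model outputs):

```python
import math

def normalize(v):
    """L2-normalize a vector, mirroring torch.nn.functional.normalize."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine similarity; for unit vectors this is just the dot product."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

# Toy stand-ins for the embeddings of the English and Russian sentences.
emb_en = [0.2, 0.9, 0.1]
emb_ru = [0.25, 0.85, 0.05]

print(cosine_similarity(emb_en, emb_ru))
```

With the real model, translation pairs such as "Hello World" and "Привет Мир" should likewise score close to 1.0, which is what makes LaBSE-style embeddings useful for cross-lingual sentence retrieval.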
License
The model is licensed under the terms available at https://tfhub.dev/google/LaBSE/1.