LaBSE
Introduction
LaBSE (Language-agnostic BERT Sentence Embedding) is a model designed to map sentences from 109 languages into a shared vector space, facilitating tasks like cross-lingual sentence similarity. The model is implemented with the sentence-transformers library in PyTorch.
Architecture
The LaBSE model architecture consists of several components within the SentenceTransformer framework:
- A Transformer layer based on BertModel, with a maximum sequence length of 256 and no lowercasing.
- A Pooling layer with a dimension of 768, using CLS-token pooling.
- A Dense layer with 768 input and output features and a Tanh activation function.
- A Normalize layer that L2-normalizes the embeddings.
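For readers who want to see these components concretely, the sentence-transformers modules API can assemble the same stack by hand. The sketch below is illustrative only: it wires the four modules around a placeholder multilingual BERT backbone (bert-base-multilingual-cased) and does not load LaBSE's trained weights; to get those, load 'sentence-transformers/LaBSE' directly, as shown in the guide further down.

from torch import nn
from sentence_transformers import SentenceTransformer, models

# Assemble the four-module stack described above. Illustration only:
# the backbone here is a placeholder and does NOT carry LaBSE's weights.
word_embedding_model = models.Transformer(
    "bert-base-multilingual-cased",  # placeholder backbone
    max_seq_length=256,
    do_lower_case=False,
)
pooling = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),  # 768
    pooling_mode_cls_token=True,     # CLS-token pooling, as in LaBSE
    pooling_mode_mean_tokens=False,
)
dense = models.Dense(in_features=768, out_features=768, activation_function=nn.Tanh())
normalize = models.Normalize()

model = SentenceTransformer(modules=[word_embedding_model, pooling, dense, normalize])
print(model)  # shows the Transformer -> Pooling -> Dense -> Normalize sequence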
Training
The model is a PyTorch port of the original LaBSE model, which was developed by Google in TensorFlow. It processes 109 languages, using the sentence-transformers library to encode sentences into embeddings.
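As a quick illustration of the shared embedding space, the following sketch (assuming the pretrained checkpoint can be downloaded from the model hub) encodes translations of the same sentence and compares them with cosine similarity:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

# Translations of the same sentence should receive nearby embeddings
# in the shared vector space.
sentences = ["Hello, world!", "Hallo, Welt!", "Bonjour le monde !"]
embeddings = model.encode(sentences)

# The final Normalize layer L2-normalizes the embeddings, so cosine
# similarity here is just a dot product; the off-diagonal scores
# should come out high for matching translations.
print(util.cos_sim(embeddings, embeddings))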
Guide: Running Locally
To run LaBSE locally, follow these steps:
- Install the sentence-transformers library:

  pip install -U sentence-transformers

- Use the model in your Python script:

  from sentence_transformers import SentenceTransformer

  sentences = ["This is an example sentence", "Each sentence is converted"]
  model = SentenceTransformer('sentence-transformers/LaBSE')
  embeddings = model.encode(sentences)
  print(embeddings)

- For better performance, especially with large datasets, consider using cloud GPU services such as AWS, GCP, or Azure (see the GPU sketch after this list).
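As a rough sketch of GPU usage, the example below assumes a CUDA-capable device and uses a synthetic corpus as a stand-in; the batch_size value is a tuning knob, not a recommendation:

from sentence_transformers import SentenceTransformer

# Synthetic stand-in corpus; assumes a CUDA-capable GPU is available
# (drop device="cuda" to fall back to CPU).
corpus = [f"example sentence {i}" for i in range(10_000)]

model = SentenceTransformer("sentence-transformers/LaBSE", device="cuda")
embeddings = model.encode(
    corpus,
    batch_size=128,          # tune to your GPU memory
    show_progress_bar=True,
    convert_to_numpy=True,
)
print(embeddings.shape)  # (10000, 768)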
License
The LaBSE model is distributed under the Apache 2.0 license, which permits free use, modification, and distribution, provided the license and copyright notices are preserved.