Introduction

LaBSE (Language-agnostic BERT Sentence Embedding) is a model that maps sentences from 109 languages into a shared vector space, enabling tasks such as cross-lingual sentence similarity. This port is implemented in PyTorch via the sentence-transformers library.

Architecture

The LaBSE model architecture consists of four components stacked within the SentenceTransformer framework (a sketch for inspecting this stack follows the list):

  • A Transformer layer based on BertModel with a maximum sequence length of 256 and no lowercasing.
  • A Pooling layer with a dimension of 768, using CLS token pooling.
  • A Dense layer with 768 input and output features, utilizing a Tanh activation function.
  • A Normalize layer that L2-normalizes the embeddings to unit length.
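
This stack can be verified locally. A minimal sketch, assuming the sentence-transformers/LaBSE identifier used in the guide below (the printed output is abridged and may vary by library version):

    from sentence_transformers import SentenceTransformer

    # Loading the model and printing it exposes the module stack described above
    model = SentenceTransformer('sentence-transformers/LaBSE')
    print(model)
    # Abridged output (exact format may vary by version):
    # SentenceTransformer(
    #   (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel
    #   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, ...})
    #   (2): Dense({'in_features': 768, 'out_features': 768, 'activation_function': 'torch.nn.modules.activation.Tanh'})
    #   (3): Normalize()
    # )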

Training

The model is a PyTorch port of the original LaBSE model, which Google developed and trained in TensorFlow on translation pairs using a dual-encoder translation ranking objective. The port supports the same 109 languages and uses the sentence-transformers library to encode sentences into embeddings.
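
Because the embedding space is shared across languages, translations of the same sentence should map to nearby vectors. A minimal sketch (the example sentences are illustrative; util.cos_sim is part of sentence-transformers):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer('sentence-transformers/LaBSE')

    # The same sentence in English, Spanish, and German
    sentences = [
        "How are you today?",
        "¿Cómo estás hoy?",
        "Wie geht es dir heute?",
    ]
    embeddings = model.encode(sentences)

    # The final Normalize layer makes embeddings unit-length, so cosine
    # similarity is equivalent to a dot product
    print(util.cos_sim(embeddings, embeddings))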

Guide: Running Locally

To run LaBSE locally, follow these steps:

  1. Install the sentence-transformers library:

    pip install -U sentence-transformers
    
  2. Use the model in your Python script:

    from sentence_transformers import SentenceTransformer

    # Load the multilingual LaBSE model (downloads on first use)
    model = SentenceTransformer('sentence-transformers/LaBSE')

    # Encode sentences into 768-dimensional, L2-normalized embeddings
    sentences = ["This is an example sentence", "Each sentence is converted"]
    embeddings = model.encode(sentences)
    print(embeddings)
    
  3. For better throughput, especially on large datasets, consider running on a GPU, for example via cloud services such as AWS, GCP, or Azure; a sketch of GPU-accelerated batch encoding follows this list.
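
On a machine with a CUDA GPU, encoding can be placed on the device and batched. A minimal sketch (the corpus, batch size, and device name are illustrative; batch_size and show_progress_bar are standard encode() parameters in sentence-transformers):

    from sentence_transformers import SentenceTransformer

    # Place the model on the GPU at load time
    model = SentenceTransformer('sentence-transformers/LaBSE', device='cuda')

    # Placeholder corpus; substitute your own list of strings
    large_corpus = ["An example sentence."] * 10_000

    # Batch the input and show progress while encoding
    embeddings = model.encode(
        large_corpus,
        batch_size=64,
        show_progress_bar=True,
    )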

License

The LaBSE model is distributed under the Apache 2.0 license, which permits free use, modification, and distribution, provided the license text and notices are retained.
