Cross English & German RoBERTa Sentence Transformer

T-Systems-onsite

Introduction

The Cross English & German RoBERTa Sentence Transformer is designed to compute sentence embeddings for English and German text. It enables semantic comparison of sentences through cosine similarity, aiding tasks like semantic textual similarity, semantic search, and paraphrase mining. It functions cross-lingually, allowing semantic searches in one language to retrieve results in another.

Architecture

This model builds on the xlm-roberta-base model, fine-tuned for cross-lingual performance with multilingual datasets. It utilizes the Sentence-BERT (SBERT) architecture, which employs a siamese and triplet network structure to produce semantically meaningful sentence embeddings that can be efficiently compared.
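
To make this concrete, the sketch below encodes an English sentence and its German paraphrase and compares the resulting embeddings with cosine similarity via util.cos_sim, the similarity helper shipped with sentence-transformers. The example sentences are illustrative.

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer('T-Systems-onsite/cross-en-de-roberta-sentence-transformer')

    # An English sentence and its German paraphrase are mapped into the
    # same embedding space.
    embeddings = model.encode([
        'The weather is nice today.',
        'Das Wetter ist heute schön.',
    ])

    # Paraphrases land close together, so their cosine similarity is high.
    print(util.cos_sim(embeddings[0], embeddings[1]))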

Training

The model was trained on a large-scale paraphrase dataset across 50+ languages, leveraging paraphrases from diverse sources. It was further fine-tuned using the STSbenchmark dataset for both English and German, employing a technique called multilingual finetuning with language-crossing. Optimal hyperparameters were determined through a hyperparameter search using Optuna, leading to a model that effectively handles cross-lingual semantic tasks.
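
The actual search script is not part of this model card; the sketch below only illustrates how such a search is typically wired up with Optuna. The hyperparameter names, search ranges, and the train_and_evaluate helper are placeholders, not the published training setup.

    import optuna

    def train_and_evaluate(lr, batch_size, epochs):
        # Placeholder: fine-tune the sentence transformer with these
        # hyperparameters and return a validation score, e.g. Spearman
        # correlation on the STSbenchmark dev set.
        return 0.0

    def objective(trial):
        # Hypothetical search space; the real ranges are not published here.
        lr = trial.suggest_float('lr', 1e-6, 1e-4, log=True)
        batch_size = trial.suggest_categorical('batch_size', [16, 32, 64])
        epochs = trial.suggest_int('epochs', 1, 4)
        return train_and_evaluate(lr, batch_size, epochs)

    # Maximize the validation score across trials and report the best setting.
    study = optuna.create_study(direction='maximize')
    study.optimize(objective, n_trials=30)
    print(study.best_params)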

Guide: Running Locally

  1. Install Dependencies: Ensure you have Python installed, then install the sentence-transformers package using pip:

    pip install -U sentence-transformers
    
  2. Load the Model:

    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer('T-Systems-onsite/cross-en-de-roberta-sentence-transformer')
    
  3. Run Inference: Use the model to compute embeddings and perform tasks like semantic similarity.
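
     For example, continuing from step 2, the following runs a small cross-lingual semantic search; the query and documents are illustrative, and util.semantic_search is the built-in search helper of sentence-transformers:

    from sentence_transformers import util

    # Cross-lingual search: a German query against English documents.
    query_embedding = model.encode('Wie ist das Wetter heute?', convert_to_tensor=True)
    corpus_embeddings = model.encode([
        'The weather is nice today.',
        'The train to Berlin is delayed.',
    ], convert_to_tensor=True)

    # Returns the best-matching corpus entries with cosine-similarity scores.
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=1)
    print(hits)
    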

  4. Cloud GPU: For intensive tasks, consider using cloud GPU services like AWS EC2, Google Cloud, or Azure to speed up processing.
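
     On such a machine the model can be placed on the GPU explicitly; a minimal sketch (device is a standard SentenceTransformer constructor argument):

    import torch
    from sentence_transformers import SentenceTransformer

    # Use the GPU when one is available, otherwise fall back to the CPU.
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model = SentenceTransformer('T-Systems-onsite/cross-en-de-roberta-sentence-transformer', device=device)
    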

License

This model is licensed under the MIT License; the full license text is included in the model repository.
