gte-large-en-v1.5

Alibaba-NLP

Introduction

The GTE-LARGE-EN-V1.5 model is part of the gte-v1.5 series, offering enhanced performance for text embeddings with support for long contexts of up to 8192 tokens. Built on a transformer++ encoder backbone that combines BERT with rotary position embeddings (RoPE) and gated linear units (GLU), it achieves state-of-the-art scores on the MTEB benchmark within its model size category and performs competitively on long-context retrieval tests.

Architecture

The model uses a transformer++ encoder backbone that augments BERT with rotary position embeddings (RoPE) and gated linear unit (GLU) feed-forward layers. It is designed for long-context text representation and retrieval, supporting a maximum sequence length of 8192 tokens, and targets competitive performance on both general-purpose and long-context English tasks.
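
The two added components are easiest to see in code. Below is a minimal, self-contained sketch of rotary position embeddings and a GLU-style feed-forward block; the function and class names (rope_tables, rotate_half, apply_rope, GatedFFN) and the choice of GELU as the gate activation are illustrative assumptions, not the model's actual implementation.

    import torch
    import torch.nn.functional as F

    def rope_tables(seq_len, dim, base=10000.0):
        # Per-position, per-frequency cos/sin tables used by RoPE.
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        angles = torch.arange(seq_len).float()[:, None] * inv_freq[None, :]
        angles = torch.cat((angles, angles), dim=-1)   # (seq_len, dim)
        return angles.cos(), angles.sin()

    def rotate_half(x):
        # Split the last dimension in half and rotate: (x1, x2) -> (-x2, x1).
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat((-x2, x1), dim=-1)

    def apply_rope(q, k, cos, sin):
        # Rotate query/key vectors by a position-dependent angle, so the
        # dot product q·k depends only on the relative distance between tokens.
        return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin

    class GatedFFN(torch.nn.Module):
        # GLU-style feed-forward block: one linear projection gates the other.
        def __init__(self, dim, hidden_dim):
            super().__init__()
            self.gate = torch.nn.Linear(dim, hidden_dim)
            self.up = torch.nn.Linear(dim, hidden_dim)
            self.down = torch.nn.Linear(hidden_dim, dim)

        def forward(self, x):
            return self.down(F.gelu(self.gate(x)) * self.up(x))

    # Example: rotate a toy sequence of 8 query/key vectors of width 64.
    cos, sin = rope_tables(seq_len=8, dim=64)
    q, k = torch.randn(8, 64), torch.randn(8, 64)
    q_rot, k_rot = apply_rope(q, k, cos, sin)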

Training

The GTE-LARGE-EN-V1.5 model underwent a multi-stage training process to support long contexts. Training involved:

  • Masked Language Modeling (MLM): Applied at progressively longer sequence lengths to adapt the model to long-context inputs.
  • Weak-supervised Contrastive Pre-training (CPT): Trained on large-scale, weakly labeled text pairs.
  • Supervised Contrastive Fine-tuning: Further refined on annotated text pairs (a minimal sketch of the in-batch contrastive objective follows this list).
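
Both contrastive stages pull the embeddings of matched text pairs together while pushing them away from the other passages in the batch. The sketch below shows a generic in-batch contrastive (InfoNCE-style) objective; the temperature value and tensor names are illustrative assumptions, not the exact training recipe.

    import torch
    import torch.nn.functional as F

    def in_batch_contrastive_loss(query_emb, doc_emb, temperature=0.05):
        # query_emb, doc_emb: (batch, dim) L2-normalized embeddings where
        # doc_emb[i] is the positive passage for query_emb[i]; all other
        # rows serve as in-batch negatives.
        sim = query_emb @ doc_emb.T / temperature            # (batch, batch) scaled similarities
        labels = torch.arange(sim.size(0), device=sim.device)
        return F.cross_entropy(sim, labels)                  # diagonal entries are the positives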

Guide: Running Locally

  1. Install Dependencies: Ensure transformers version >= 4.36.0 and sentence_transformers version >= 2.7.0 are installed.

    pip install "transformers>=4.36.0" "sentence_transformers>=2.7.0"
    
  2. Load the Model: Use the following Python code to load the model and generate text embeddings.

    import torch.nn.functional as F
    from transformers import AutoModel, AutoTokenizer
    
    model_path = 'Alibaba-NLP/gte-large-en-v1.5'
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    # trust_remote_code is required because the model ships custom architecture code
    model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
    
    input_texts = ["what is the capital of China?", "how to implement quick sort in python?", "Beijing", "sorting algorithms"]
    # Tokenize with padding and truncation up to the 8192-token context window
    batch_dict = tokenizer(input_texts, max_length=8192, padding=True, truncation=True, return_tensors='pt')
    outputs = model(**batch_dict)
    # CLS pooling: take the first ([CLS]) token embedding and L2-normalize it
    embeddings = F.normalize(outputs.last_hidden_state[:, 0], p=2, dim=1)
    
  3. Compute Similarity Scores: Compare the first (query) embedding against the remaining passage embeddings.

    # Scaled cosine similarities between the query embedding and the rest
    scores = (embeddings[:1] @ embeddings[1:].T) * 100
    print(scores.tolist())
    
  4. Use with Cloud GPUs: For improved performance, consider running the model on cloud GPUs through platforms such as AWS, Google Cloud, or Azure. An alternative loading path via sentence_transformers is sketched below.
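
If you prefer a higher-level interface, the sentence_transformers library handles tokenization, pooling, and normalization internally. A minimal sketch (the sentence list is illustrative):

    from sentence_transformers import SentenceTransformer
    
    # trust_remote_code is required because the model ships custom architecture code
    model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True)
    sentences = ["what is the capital of China?", "Beijing"]
    embeddings = model.encode(sentences, normalize_embeddings=True)
    print(embeddings @ embeddings.T)  # cosine similarity matrix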

License

The GTE-LARGE-EN-V1.5 model is licensed under the Apache 2.0 License.
