gte-large-en-v1.5 (Alibaba-NLP)
Introduction
The gte-large-en-v1.5 model is part of the gte-v1.5 series, offering enhanced performance for text embeddings with support for long contexts of up to 8192 tokens. Built on a transformer++ encoder backbone that combines BERT with rotary position embeddings (RoPE) and gated linear units (GLU), it achieves state-of-the-art scores on the MTEB benchmark among models of comparable size and performs well in long-context retrieval tests.
Architecture
The model uses a transformer++ encoder backbone that combines BERT with RoPE and GLU. It is designed for long-context text representation and retrieval, supporting a maximum sequence length of 8192 tokens, and targets competitive performance on English embedding and long-context retrieval tasks.
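To make the GLU component concrete, here is a minimal PyTorch sketch of a gated feed-forward block. It is an illustration only: the class name, layer names, dimensions, and choice of GELU gating are assumptions for exposition and do not mirror the released implementation.

```python
import torch
import torch.nn as nn

class GatedFeedForward(nn.Module):
    """Illustrative GLU-style feed-forward block (hypothetical, not the model's code)."""
    def __init__(self, hidden_size: int = 1024, intermediate_size: int = 4096):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size)  # gating path
        self.up_proj = nn.Linear(hidden_size, intermediate_size)    # value path
        self.down_proj = nn.Linear(intermediate_size, hidden_size)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The activated gate multiplies the up-projection element-wise
        # before projecting back down to the hidden size.
        return self.down_proj(self.act(self.gate_proj(x)) * self.up_proj(x))
```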
Training
The gte-large-en-v1.5 model underwent a multi-stage training process to support long contexts. Training involved:
- Masked Language Modeling (MLM): applied at increasing sequence lengths to prepare the backbone for long-context inputs.
- Weak-supervised Contrastive Pre-training (CPT): contrastive pre-training on large-scale, weakly supervised text pairs (a generic sketch of this objective follows the list).
- Supervised Contrastive Fine-tuning: further refinement on curated, annotated fine-tuning data.
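The contrastive stages broadly follow the standard InfoNCE objective with in-batch negatives. The sketch below is a generic illustration of that objective; the actual loss, temperature, and negative-sampling scheme used for this model may differ.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb: torch.Tensor,
                              doc_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """Generic InfoNCE loss with in-batch negatives (illustrative only).

    query_emb, doc_emb: (batch, dim) L2-normalized embeddings where row i of
    doc_emb is the positive for row i of query_emb; all other rows act as
    negatives. The temperature value is a placeholder.
    """
    logits = query_emb @ doc_emb.T / temperature           # (batch, batch) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)                 # diagonal entries are the positives
```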
Guide: Running Locally
- Install Dependencies: ensure transformers version >= 4.36.0 and sentence_transformers version >= 2.7.0 are installed:

  ```bash
  pip install "transformers>=4.36.0" "sentence_transformers>=2.7.0"
  ```
- Load the Model: use the following Python code to load the model and generate text embeddings (see also the sentence_transformers sketch after this list):

  ```python
  import torch.nn.functional as F
  from transformers import AutoModel, AutoTokenizer

  model_path = 'Alibaba-NLP/gte-large-en-v1.5'
  tokenizer = AutoTokenizer.from_pretrained(model_path)
  model = AutoModel.from_pretrained(model_path, trust_remote_code=True)

  input_texts = [
      "what is the capital of China?",
      "how to implement quick sort in python?",
      "Beijing",
      "sorting algorithms",
  ]

  # Tokenize; inputs longer than the 8192-token limit are truncated.
  batch_dict = tokenizer(input_texts, max_length=8192, padding=True,
                         truncation=True, return_tensors='pt')
  outputs = model(**batch_dict)

  # Take the [CLS] token representation and L2-normalize it.
  embeddings = F.normalize(outputs.last_hidden_state[:, 0], p=2, dim=1)
  ```
- Compute Similarity Scores: calculate similarity scores between the first (query) embedding and the remaining embeddings.

  ```python
  # Cosine similarities (embeddings are already L2-normalized), scaled by 100.
  scores = (embeddings[:1] @ embeddings[1:].T) * 100
  print(scores.tolist())
  ```
- Use with Cloud GPUs: for improved performance, consider running the model on cloud GPUs through platforms such as AWS, Google Cloud, or Azure.
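As an alternative to the transformers workflow above, the model can also be loaded through sentence_transformers (>= 2.7.0). A minimal sketch, with placeholder sentences:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# trust_remote_code is required because the model ships custom modeling code.
model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True)

sentences = ['That is a happy person', 'That is a very happy person']
embeddings = model.encode(sentences)  # returns one embedding per sentence
print(cos_sim(embeddings[0], embeddings[1]))
```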
License
The gte-large-en-v1.5 model is licensed under the Apache 2.0 License.