gte-large-en-v1.5 (Alibaba-NLP)
Introduction
The gte-large-en-v1.5 model is part of the gte-v1.5 series, offering enhanced performance for text embeddings with support for long contexts of up to 8192 tokens. Built on a transformer++ encoder backbone that combines BERT with rotary position embeddings (RoPE) and gated linear units (GLU), it achieves state-of-the-art scores on the MTEB benchmark among models of comparable size and performs well in long-context retrieval tests.
Architecture
The model uses a transformer++ encoder backbone that combines BERT with RoPE and GLU. It is designed for long-context text representation and retrieval, supporting a maximum sequence length of 8192 tokens, and targets competitive performance on English embedding and long-context retrieval tasks.
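To make the GLU component concrete, here is a minimal PyTorch sketch of a gated feed-forward block. It is an illustration only: the class name, layer names, dimensions, and choice of GELU gating are assumptions for exposition and do not mirror the released implementation.

```python
import torch
import torch.nn as nn

class GatedFeedForward(nn.Module):
    """Illustrative GLU-style feed-forward block (hypothetical, not the model's code)."""
    def __init__(self, hidden_size: int = 1024, intermediate_size: int = 4096):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size)  # gating path
        self.up_proj = nn.Linear(hidden_size, intermediate_size)    # value path
        self.down_proj = nn.Linear(intermediate_size, hidden_size)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The activated gate multiplies the up-projection element-wise
        # before projecting back down to the hidden size.
        return self.down_proj(self.act(self.gate_proj(x)) * self.up_proj(x))
```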
Training
The gte-large-en-v1.5 model underwent a multi-stage training process to support long contexts. Training involved:
- Masked Language Modeling (MLM): applied at increasing sequence lengths to prepare the backbone for long-context inputs.
- Weak-supervised Contrastive Pre-training (CPT): contrastive pre-training on large-scale, weakly supervised text pairs (a generic sketch of this objective follows the list).
- Supervised Contrastive Fine-tuning: further refinement on curated, annotated fine-tuning data.
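The contrastive stages broadly follow the standard InfoNCE objective with in-batch negatives. The sketch below is a generic illustration of that objective; the actual loss, temperature, and negative-sampling scheme used for this model may differ.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb: torch.Tensor,
                              doc_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """Generic InfoNCE loss with in-batch negatives (illustrative only).

    query_emb, doc_emb: (batch, dim) L2-normalized embeddings where row i of
    doc_emb is the positive for row i of query_emb; all other rows act as
    negatives. The temperature value is a placeholder.
    """
    logits = query_emb @ doc_emb.T / temperature           # (batch, batch) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)                 # diagonal entries are the positives
```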
Guide: Running Locally
- Install Dependencies: ensure transformers version >= 4.36.0 and sentence_transformers version >= 2.7.0 are installed:

  ```bash
  pip install "transformers>=4.36.0" "sentence_transformers>=2.7.0"
  ```
- Load the Model: use the following Python code to load the model and generate text embeddings (see also the sentence_transformers sketch after this list):

  ```python
  import torch.nn.functional as F
  from transformers import AutoModel, AutoTokenizer

  model_path = 'Alibaba-NLP/gte-large-en-v1.5'
  tokenizer = AutoTokenizer.from_pretrained(model_path)
  model = AutoModel.from_pretrained(model_path, trust_remote_code=True)

  input_texts = [
      "what is the capital of China?",
      "how to implement quick sort in python?",
      "Beijing",
      "sorting algorithms",
  ]

  # Tokenize; inputs longer than the 8192-token limit are truncated.
  batch_dict = tokenizer(input_texts, max_length=8192, padding=True,
                         truncation=True, return_tensors='pt')
  outputs = model(**batch_dict)

  # Take the [CLS] token representation and L2-normalize it.
  embeddings = F.normalize(outputs.last_hidden_state[:, 0], p=2, dim=1)
  ```
- Compute Similarity Scores: calculate similarity scores between the first (query) embedding and the remaining embeddings.

  ```python
  # Cosine similarities (embeddings are already L2-normalized), scaled by 100.
  scores = (embeddings[:1] @ embeddings[1:].T) * 100
  print(scores.tolist())
  ```
- Use with Cloud GPUs: for improved performance, consider running the model on cloud GPUs through platforms such as AWS, Google Cloud, or Azure.
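As an alternative to the transformers workflow above, the model can also be loaded through sentence_transformers (>= 2.7.0). A minimal sketch, with placeholder sentences:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# trust_remote_code is required because the model ships custom modeling code.
model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True)

sentences = ['That is a happy person', 'That is a very happy person']
embeddings = model.encode(sentences)  # returns one embedding per sentence
print(cos_sim(embeddings[0], embeddings[1]))
```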
License
The gte-large-en-v1.5 model is licensed under the Apache 2.0 License.