jina-embeddings-v2-base-en
Introduction
jina-embeddings-v2-base-en is an English monolingual embedding model designed for long-document tasks such as retrieval, semantic similarity, and text reranking. Built on a BERT architecture with ALiBi attention, it supports sequences of up to 8192 tokens. Pretrained on the C4 dataset and further fine-tuned on more than 400 million sentence pairs, it offers strong performance at a compact size of 137 million parameters.
Architecture
The model uses a symmetric, bidirectional variant of ALiBi (Attention with Linear Biases), which lets it extrapolate to sequences far longer than those seen during training. It employs a BERT-based architecture, specifically JinaBERT, pretrained on the C4 dataset. Fine-tuning used sentence pairs and hard negatives curated from diverse domains.
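To illustrate the idea, here is a minimal sketch of how a symmetric (bidirectional) ALiBi bias can be constructed: each attention head adds a fixed penalty proportional to the token distance |i - j|, with head-specific slopes forming a geometric sequence as in the ALiBi paper. The `alibi_bias` helper below is hypothetical and not the model's actual implementation.

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Symmetric (bidirectional) ALiBi: bias of -slope * |i - j| per head.

    Hypothetical sketch; the real JinaBERT code may differ in detail.
    """
    # Head-specific slopes form a geometric sequence, e.g. 2^-1 ... 2^-8 for 8 heads.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    positions = torch.arange(seq_len)
    distance = (positions[None, :] - positions[:, None]).abs()  # |i - j|, symmetric
    # Shape (num_heads, seq_len, seq_len); added to raw attention scores before softmax.
    return -slopes[:, None, None] * distance[None, :, :]
```

Because the bias depends only on token distance and contains no learned parameters, the same function applies at any sequence length, which is what enables the 512-to-8k extrapolation described below.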
Training
Training used sequences of 512 tokens, but because ALiBi encodes position through distance-based biases rather than learned embeddings, the model extrapolates to sequences of up to 8k tokens at inference time. This makes it suitable for a variety of applications, including document retrieval and other semantic tasks. At 137 million parameters, the model offers fast inference without sacrificing performance.
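To make the extrapolation concrete, the hypothetical `alibi_bias` helper sketched above produces a valid bias at any length with no new parameters:

```python
# No learned positional parameters, so the same function covers both the
# 512-token training length and the 8192-token inference limit.
train_bias = alibi_bias(num_heads=12, seq_len=512)
long_bias = alibi_bias(num_heads=12, seq_len=8192)
print(train_bias.shape)  # torch.Size([12, 512, 512])
print(long_bias.shape)   # torch.Size([12, 8192, 8192])
```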
Guide: Running Locally
To use jina-embeddings-v2-base-en locally, follow these steps:
- Install Dependencies:

  ```bash
  pip install transformers sentence-transformers
  ```
- Load the Model:

  ```python
  from transformers import AutoModel

  # trust_remote_code=True is required: the repository ships a custom encode() wrapper
  model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
  ```
- Encode Sentences (a long-document variant is sketched after this list):

  ```python
  embeddings = model.encode(['How is the weather today?', 'What is the current weather like today?'])
  ```
- Compute Similarity:

  ```python
  from numpy.linalg import norm

  # Cosine similarity between the two sentence embeddings
  cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))
  print(cos_sim(embeddings[0], embeddings[1]))
  ```
- Consider Cloud GPUs: For better performance, especially with long sequences, consider deploying on a cloud service such as AWS SageMaker.
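For long inputs, the model card documents a max_length argument on the custom encode() wrapper loaded via trust_remote_code; a minimal sketch, treating the exact signature as an assumption:

```python
# Assumes the custom encode() wrapper accepts a max_length cap, as shown on
# the model card; the input text here is a stand-in for a real document.
long_document = " ".join(["token"] * 5000)
embedding = model.encode([long_document], max_length=8192)
print(embedding.shape)  # expected (1, 768): one 768-dimensional vector per input
```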
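Since the install step also pulls in sentence-transformers, the model can alternatively be loaded through that library; a sketch, assuming a recent version that forwards trust_remote_code to the underlying transformers loader:

```python
from sentence_transformers import SentenceTransformer

# trust_remote_code support is an assumption; older sentence-transformers
# versions do not forward this flag.
st_model = SentenceTransformer('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
embeddings = st_model.encode(['How is the weather today?'])
print(embeddings.shape)  # (1, 768)
```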
License
The model is licensed under the Apache 2.0 License, allowing for both personal and commercial use.