jina-embeddings-v2-base-zh

jinaai

Introduction

The jina-embeddings-v2-base-zh model is a Chinese/English bilingual text embedding model developed by Jina AI. It supports sequences of up to 8192 tokens and is based on a BERT architecture that uses the symmetric bidirectional variant of ALiBi to handle longer sequences. The model is optimized for monolingual and cross-lingual applications, and in particular for mixed Chinese-English input.

Architecture

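The model is built on the JinaBERT architecture, an adaptation of the BERT framework. JinaBERT integrates ALiBi (Attention with Linear Biases), which replaces learned position embeddings with a penalty on attention scores that grows linearly with the distance between tokens, so the model can process longer sequences more effectively. The model belongs to the jina-embeddings-v2 family, which also includes the English models jina-embeddings-v2-small-en and jina-embeddings-v2-base-en and the German/English bilingual model jina-embeddings-v2-base-de.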
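
To make the ALiBi idea concrete, here is a minimal pure-Python sketch of a symmetric bias matrix, where the score between query position i and key position j is lowered by slope * |i - j|. The helper names `alibi_slopes` and `symmetric_alibi_bias` are hypothetical, and the geometric slope schedule is the one described in the original ALiBi paper for a power-of-two number of heads; it is an assumption that JinaBERT uses the same schedule, so treat this as an illustration rather than the model's actual implementation.

```python
def alibi_slopes(num_heads):
    """One slope per attention head: 2^(-8/n), 2^(-16/n), ..., 2^(-8).

    Assumes num_heads is a power of two (schedule from the ALiBi paper).
    """
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]


def symmetric_alibi_bias(seq_len, slope):
    """Bias added to the attention scores of one head.

    Using |i - j| makes the penalty symmetric, so tokens to the left and
    right of the query are treated alike (bidirectional attention).
    """
    return [[-slope * abs(i - j) for j in range(seq_len)]
            for i in range(seq_len)]


slopes = alibi_slopes(8)                    # 8 heads in this toy example
bias = symmetric_alibi_bias(4, slopes[0])   # 4x4 bias for the first head

for row in bias:
    print(row)
```

Because the bias depends only on the distance between positions and not on a learned per-position table, the same formula applies to any sequence length.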

Training

Details of the data and training process are documented in a technical report available at arXiv:2402.17016. The training focuses on achieving high performance in both single-language and cross-language tasks.

Guide: Running Locally

  1. Install Dependencies:

    pip install transformers sentence-transformers
    
  2. Load the Model:

    from transformers import AutoTokenizer, AutoModel
    import torch
    
    tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-zh')
    model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True, torch_dtype=torch.bfloat16)
    
  3. Encode Text:

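    # Tokenize an English and a Chinese sentence as one padded batch
    encoded_input = tokenizer(['How is the weather today?', '今天天气怎么样?'], return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        # output[0] holds per-token embeddings; pool them to get one vector per sentence
        output = model(**encoded_input)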
    
  4. Cloud GPUs: For long sequences or higher throughput, consider running the model on a cloud GPU service such as Amazon SageMaker or another provider with GPU support.
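
The model call in step 3 returns one vector per token, so a pooling step is needed to obtain a single embedding per sentence. A common choice is masked mean pooling followed by cosine similarity. The sketch below spells out that arithmetic with plain Python lists and toy numbers so it runs without downloading the model; the function names `mean_pool` and `cosine_similarity` are hypothetical, and the toy vectors stand in for real token embeddings. With real tensors, the same computation would typically be written with torch operations on `output[0]` and `encoded_input['attention_mask']`.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sentence, skipping padding (mask == 0)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                summed[i] += value
    count = max(count, 1)  # guard against a row that is all padding
    return [s / count for s in summed]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins: 3 token positions, 2 dimensions each.
# The last position of the first sentence is padding and must be ignored.
en_tokens = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
en_mask = [1, 1, 0]
zh_tokens = [[2.0, 1.0], [2.0, 1.0], [2.0, 1.0]]
zh_mask = [1, 1, 1]

en_vec = mean_pool(en_tokens, en_mask)
zh_vec = mean_pool(zh_tokens, zh_mask)
similarity = cosine_similarity(en_vec, zh_vec)
print(en_vec, zh_vec, similarity)
```

Masking matters because padded positions carry no meaning; including them in the average would shift the embedding of shorter sentences in a batch.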

License

The jina-embeddings-v2-base-zh model is licensed under the Apache-2.0 License, allowing for open use and modification while maintaining attribution to the original creators.
