jina-embeddings-v2-base-code

jinaai

Introduction

jina-embeddings-v2-base-code is a multilingual embedding model developed by Jina AI that supports English and 30 widely used programming languages. Part of the Jina Embeddings V2 series, it is designed for high performance on technical question answering and code search, and it uses a BERT-based architecture optimized for long-sequence processing.

Architecture

The model is based on a BERT architecture (JinaBERT) with a symmetric bidirectional variant of ALiBi, allowing sequence lengths of up to 8192 tokens. It was pretrained on the github-code dataset and then fine-tuned on over 150 million coding question-answer pairs and docstring/source-code pairs. With 161 million parameters, it balances efficiency and performance.

Training

Although trained with a sequence length of 512, the model extrapolates to sequences of up to 8k tokens at inference time; ALiBi makes this possible by encoding position through linear attention biases rather than learned positional embeddings, so longer inputs require no retraining. This matters for applications that embed long documents or whole source files. Mean pooling is the recommended way to reduce the token embeddings to a single vector: it averages them, weighting by the attention mask so that padding tokens do not contribute.
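
A minimal sketch of this long-context behavior, using the same Hugging Face transformers loading code as the guide below (the repeated snippet is just an illustrative long input, not from the model documentation):

    import torch
    from transformers import AutoModel, AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-code')
    model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-code', trust_remote_code=True)
    
    # An input far longer than the 512-token training length;
    # ALiBi lets the model attend over it, up to 8192 tokens
    long_code = 'def f(x):\n    return x\n' * 800
    inputs = tokenizer(long_code, truncation=True, max_length=8192, return_tensors='pt')
    
    with torch.no_grad():
        output = model(**inputs)
    print(output[0].shape)  # (1, sequence_length, hidden_size)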

Guide: Running Locally

  1. Install Dependencies:
    pip install transformers sentence-transformers
    
  2. Load the Model:
    from transformers import AutoModel, AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-code')
    # trust_remote_code is needed because the JinaBERT architecture is
    # defined by custom code shipped in the model repository
    model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-code', trust_remote_code=True)
    
  3. Perform Inference:
    import torch
    
    sentences = ['Your sentence here']
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    
    with torch.no_grad():
        model_output = model(**encoded_input)
    
    # Mean pooling: average the token embeddings, weighting by the
    # attention mask so padding tokens do not contribute
    token_embeddings = model_output[0]
    mask = encoded_input['attention_mask'].unsqueeze(-1).float()
    embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    
  4. Cloud GPUs: For larger workloads, consider running the model on a cloud GPU service such as AWS, Google Cloud, or Azure.
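
Because step 1 also installs sentence-transformers, the same model can be used through that library, which handles tokenization, inference, and mean pooling in a single call. A minimal sketch of that route (the query and code snippet below are illustrative examples):

    from sentence_transformers import SentenceTransformer
    from sentence_transformers.util import cos_sim
    
    model = SentenceTransformer('jinaai/jina-embeddings-v2-base-code', trust_remote_code=True)
    
    # Illustrative query/document pair for code search
    embeddings = model.encode([
        'How do I access the index while iterating over a list?',
        'for i, x in enumerate(items):\n    print(i, x)',
    ])
    print(cos_sim(embeddings[0], embeddings[1]))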

License

The model is released under the Apache-2.0 license, allowing for broad usage and modification.
