Visualized-BGE

BAAI

Introduction

Visualized-BGE is a universal multi-modal embedding model that incorporates image token embedding into the BGE Text Embedding framework. This allows Visualized-BGE to process multi-modal data beyond just text. It is primarily used for hybrid modal retrieval tasks, such as Multi-Modal Knowledge Retrieval and Composed Image Retrieval. The model retains the strong text embedding capabilities of the original BGE model.

Architecture

Visualized-BGE includes two primary models:

  • BAAI/bge-visualized-base-en-v1.5: based on BGE-base-en-v1.5; produces 768-dimensional embeddings for English.
  • BAAI/bge-visualized-m3: based on BGE-M3; produces 1024-dimensional multilingual embeddings.

Both models are trained on a hybrid multi-modal dataset containing over 500,000 instances built for multi-modal training.
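Once the visual_bge package is installed (see the guide below), a variant is loaded by pairing the matching BGE text backbone with its Visualized weight file. The sketch below is illustrative; the weight filenames are assumptions based on the model names and should be checked against the downloaded files.

from visual_bge.modeling import Visualized_BGE

# English variant: BGE-base-en-v1.5 backbone, 768-dimensional embeddings.
model_en = Visualized_BGE(model_name_bge="BAAI/bge-base-en-v1.5",
                          model_weight="path/to/Visualized_base_en_v1.5.pth")

# Multilingual variant: BGE-M3 backbone, 1024-dimensional embeddings.
model_m3 = Visualized_BGE(model_name_bge="BAAI/bge-m3",
                          model_weight="path/to/Visualized_m3.pth")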

Training

Visualized-BGE can be evaluated zero-shot or fine-tuned for specific retrieval tasks. Training proceeds in stages, with Stage-2 training using datasets such as VISTA-S2. Both zero-shot performance and supervised fine-tuning have been evaluated on benchmarks such as WebQA, CIRR, and ReMuQ.
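Zero-shot retrieval on such benchmarks is typically scored by encoding queries and candidates and computing recall@k over the similarity matrix. The sketch below is a minimal, generic recall@k helper, assuming the query and candidate embeddings were already produced with Visualized_BGE and are L2-normalized; the tensor names and ground-truth mapping are hypothetical.

import torch

def recall_at_k(query_embs, candi_embs, gt_index, k=5):
    # query_embs: (num_queries, dim), candi_embs: (num_candidates, dim),
    # gt_index: (num_queries,) index of the correct candidate for each query.
    sims = query_embs @ candi_embs.T                  # cosine similarity for normalized embeddings
    topk = sims.topk(k, dim=1).indices                # top-k candidate indices per query
    hits = (topk == gt_index.unsqueeze(1)).any(dim=1) # did the correct candidate appear in the top k?
    return hits.float().mean().item()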

Guide: Running Locally

Installation

  1. Clone the repository:
    git clone https://github.com/FlagOpen/FlagEmbedding.git
    cd FlagEmbedding/research/visual_bge
    pip install -e .
    
  2. Install core packages:
    pip install torchvision timm einops ftfy
    
  3. Download the model weights and pass the local path to the model_weight parameter (a download sketch follows this list).
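The weight files are hosted on the Hugging Face Hub and can be fetched programmatically. The sketch below uses huggingface_hub; the repository id BAAI/bge-visualized and the filename Visualized_base_en_v1.5.pth are assumptions and should be verified on the model page.

from huggingface_hub import hf_hub_download

# Repository id and filename are assumptions; check the model card for exact names.
weight_path = hf_hub_download(repo_id="BAAI/bge-visualized",
                              filename="Visualized_base_en_v1.5.pth")
print(weight_path)  # pass this local path to the model_weight parameter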

Running

To generate embeddings for multi-modal data, instantiate the Visualized_BGE model and use its encode method on text, images, or combined image-text inputs.

Sample code for Composed Image Retrieval:

import torch
from visual_bge.modeling import Visualized_BGE

# model_weight is the local path to the downloaded Visualized_base_en_v1.5.pth file.
model = Visualized_BGE(model_name_bge="BAAI/bge-base-en-v1.5",
                       model_weight="path/to/Visualized_base_en_v1.5.pth")
model.eval()
with torch.no_grad():
    # Composed query: a reference image plus a textual modification instruction.
    query_emb = model.encode(image="./imgs/cir_query.png", text="Make the background dark, as if the camera has taken the photo at night")
    # Candidate images are encoded without text.
    candi_emb_1 = model.encode(image="./imgs/cir_candi_1.png")
    candi_emb_2 = model.encode(image="./imgs/cir_candi_2.png")

# Similarity between the composed query and each candidate.
sim_1 = query_emb @ candi_emb_1.T
sim_2 = query_emb @ candi_emb_2.T
print(sim_1, sim_2)
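The same API covers Multi-Modal Knowledge Retrieval, where a text-only query is matched against candidates that combine an image with a caption. The sketch below is illustrative; the image paths and captions are placeholders.

import torch
from visual_bge.modeling import Visualized_BGE

model = Visualized_BGE(model_name_bge="BAAI/bge-base-en-v1.5",
                       model_weight="path/to/Visualized_base_en_v1.5.pth")
model.eval()
with torch.no_grad():
    # Text-only query against image-plus-caption candidates.
    query_emb = model.encode(text="Are there sidewalks on both sides of the Mid-Hudson Bridge?")
    candi_emb_1 = model.encode(image="./imgs/wiki_candi_1.jpg", text="A view of the Mid-Hudson Bridge over the Hudson River.")
    candi_emb_2 = model.encode(image="./imgs/wiki_candi_2.jpg", text="A suspension bridge cable in close-up.")

sim_1 = query_emb @ candi_emb_1.T
sim_2 = query_emb @ candi_emb_2.T
print(sim_1, sim_2)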

Cloud GPU Recommendation

For faster encoding, consider running the model on a cloud instance with NVIDIA GPUs, such as AWS EC2, Google Cloud Platform, or Azure.

License

Visualized-BGE is released under an open-source license. For more details, please refer to the project's GitHub repository: https://github.com/FlagOpen/FlagEmbedding.
