MMRet-large

JUNJIE99

Introduction

MegaPairs introduces a novel data-synthesis approach that uses open-domain images to construct heterogeneous KNN triplets for universal multimodal retrieval. The resulting dataset contains over 26 million triplets and is used to train the MMRet family of multimodal retrieval models, including MMRet-CLIP (base and large) and MMRet-MLLM. These models set a new state of the art in zero-shot composed image retrieval and on the Multimodal Embedding Benchmark (MMEB), demonstrating efficiency, scalability, and strong generalization.
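
For intuition, each MegaPairs triplet pairs a query image and a textual instruction with a relevant target image. The sketch below only illustrates that layout; the field names and the example values are assumptions for illustration, not the released data format.

    from dataclasses import dataclass

    @dataclass
    class MegaPairsTriplet:
        """Illustrative triplet layout: (query image, instruction text, target image)."""
        query_image: str   # path or URL of the query image
        instruction: str   # text describing how the target relates to the query
        target_image: str  # path or URL of the relevant target image

    # hypothetical example instance, reusing the assets from the inference guide below
    example = MegaPairsTriplet(
        query_image="./assets/cir_query.png",
        instruction="Make the background dark, as if the camera has taken the photo at night",
        target_image="./assets/cir_candi_1.png",
    )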

Architecture

MMRet-large, the model on this card, is an MMRet-CLIP variant built on the CLIP dual-encoder architecture, while MMRet-MLLM builds on a multimodal large language model backbone. All variants are trained on MegaPairs triplets, whose scale and open-domain diversity are central to robust retrieval performance across benchmarks.
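
As a rough sketch of the dual-encoder idea, a CLIP backbone can embed the reference image and the instruction separately and fuse them into a single query vector. The additive fusion and the openai/clip-vit-base-patch16 checkpoint below are illustrative assumptions, not the fusion or weights used by MMRet.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # generic CLIP backbone used only to illustrate composed-query encoding
    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

    def encode_composed_query(image_path: str, instruction: str) -> torch.Tensor:
        image = Image.open(image_path).convert("RGB")
        inputs = processor(text=[instruction], images=[image], return_tensors="pt", padding=True)
        with torch.no_grad():
            img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
            txt_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                             attention_mask=inputs["attention_mask"])
        fused = img_emb + txt_emb                        # simple additive fusion (assumption)
        return fused / fused.norm(dim=-1, keepdim=True)  # L2-normalize for cosine-style scoring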

Training

The MMRet models demonstrate state-of-the-art performance on zero-shot composed image retrieval. The MMRet-base model, with 149 million parameters, outperforms larger models, while MMRet-MLLM improves on the previous best results by 8.1%. Fine-tuning further enhances performance on MMEB, underscoring the models' robust generalization.
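
Training of this kind is typically contrastive: composed-query embeddings are pulled toward their target-image embeddings and pushed away from in-batch negatives. The InfoNCE-style loss below is a generic sketch of that recipe; MMRet's exact objective, temperature, and negative sampling are not specified here and are treated as assumptions.

    import torch
    import torch.nn.functional as F

    def info_nce_loss(query_emb: torch.Tensor, target_emb: torch.Tensor,
                      temperature: float = 0.05) -> torch.Tensor:
        """In-batch-negative contrastive loss (generic sketch, not MMRet's exact recipe).

        query_emb, target_emb: [batch, dim] L2-normalized embeddings;
        row i of each tensor forms a positive pair.
        """
        logits = query_emb @ target_emb.T / temperature               # [batch, batch] similarities
        labels = torch.arange(logits.size(0), device=logits.device)   # diagonal entries are positives
        return F.cross_entropy(logits, labels)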

Guide: Running Locally

  1. Install Dependencies: Ensure you have Python and PyTorch installed. Install the transformers library from Hugging Face.

    pip install torch transformers
    
  2. Load Model: Use the Hugging Face transformers library to load the MMRet models.

    import torch
    from transformers import AutoModel

    MODEL_NAME = "JUNJIE99/MMRet-base"  # or "JUNJIE99/MMRet-large"

    # trust_remote_code=True is required: encode() is provided by the model's remote code
    model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)
    model.set_processor(MODEL_NAME)  # load the matching image/text processors shipped with the model
    model.eval()
    
  3. Run Inference: Encode your images and text, and compute similarity scores.

    with torch.no_grad():
        # composed query: a reference image plus a textual modification instruction
        query = model.encode(
            images="./assets/cir_query.png",
            text="Make the background dark, as if the camera has taken the photo at night",
        )
        # candidate images to rank against the composed query
        candidates = model.encode(images=["./assets/cir_candi_1.png", "./assets/cir_candi_2.png"])
        scores = query @ candidates.T  # similarity via dot product of the embeddings
    print(scores)
    
  4. Hardware Suggestion: For efficient computation, consider using cloud GPUs such as those provided by AWS, Google Cloud, or Azure.
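
If a GPU is available, moving the model onto it before encoding is the main speed-up; input placement is handled by the model's remote encode() code, so treat the snippet below as a sketch that continues the loading step above.

    import torch

    # continue from the loading step above; use a GPU when one is available
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)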

License

The MegaPairs dataset and MMRet models are released under the MIT License. The image data used in MegaPairs comes from the Recap-Datacomp dataset, which is available under the CC BY 4.0 license.
