Introduction

MM-Embed extends NV-Embed-v1 with multimodal retrieval capabilities. It achieves state-of-the-art performance on the multimodal UniIR benchmark with an average score of 52.7, surpassing prior results, and also improves text retrieval accuracy on the MTEB benchmark. These gains come from new training strategies: modality-aware hard negative mining and continual text-to-text fine-tuning.

Architecture

Training

MM-Embed is trained with modality-aware hard negative mining, which counteracts the modality bias of multimodal LLM retrievers (for example, favoring text candidates when the task calls for images), followed by continual text-to-text fine-tuning to preserve strong text-only retrieval. Together, these steps maintain accuracy across both multimodal and text-only retrieval tasks, as sketched below.
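
As a rough illustration only (not the authors' implementation; the helper name, data fields, and scoring function below are all hypothetical), modality-aware mining can be pictured as restricting the hard-negative pool to candidates in the query's desired target modality before picking the highest-scoring wrong answers:

    # Conceptual sketch of modality-aware hard negative mining; names and the
    # scoring function are placeholders, not MM-Embed's actual code.
    from typing import Callable, Dict, List

    def mine_modality_aware_negatives(query: Dict, candidates: List[Dict],
                                      score_fn: Callable[[Dict, Dict], float],
                                      k: int = 5) -> List[Dict]:
        """Return the k hardest negatives drawn from the query's desired
        target modality, so training penalizes modality-confused retrievals."""
        pool = [c for c in candidates
                if c["id"] != query["positive_id"]                # drop the gold target
                and c["modality"] == query["target_modality"]]    # enforce modality match
        pool.sort(key=lambda c: score_fn(query, c), reverse=True)  # hardest first
        return pool[:k]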

Guide: Running Locally

  1. Install Required Packages:

    pip uninstall -y transformer-engine
    pip install torch==2.2.0
    pip install transformers==4.42.4
    pip install flash-attn==2.2.0
    pip install pillow
    
  2. Authenticate Access:

    • Obtain a Hugging Face access token from your account settings (huggingface.co → Settings → Access Tokens).
    • Execute huggingface-cli login and input your token.
  3. Download Model:

    • Use the Hugging Face model repository (nvidia/MM-Embed) to obtain the model; see the sketch after this list.
  4. Run Inference:

    • Ensure your environment supports PyTorch with NVIDIA hardware, preferably on Linux; see the sketch after this list.
  5. Suggested Cloud GPUs:

    • NVIDIA H100 for optimal performance with PyTorch.
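
Putting steps 3 and 4 together, the sketch below downloads the checkpoint and embeds a text query. It is a minimal sketch, assuming MM-Embed exposes an NV-Embed-style encode(texts, instruction=...) method; the actual interface and instruction strings may differ, so consult the model card for exact usage.

    # Minimal sketch, assuming an NV-Embed-style encode() interface; the exact
    # MM-Embed API may differ -- check the model card. Requires a prior
    # `huggingface-cli login` (step 2) and a CUDA-capable NVIDIA GPU (step 4).
    import torch
    from huggingface_hub import snapshot_download
    from transformers import AutoModel

    assert torch.cuda.is_available(), "MM-Embed expects an NVIDIA GPU"

    local_dir = snapshot_download(repo_id="nvidia/MM-Embed")  # step 3: fetch weights

    model = AutoModel.from_pretrained(local_dir, trust_remote_code=True,
                                      torch_dtype=torch.float16).to("cuda")

    # Text-to-text retrieval example; the instruction string is illustrative.
    queries = ["What is multimodal retrieval?"]
    passages = ["Multimodal retrieval matches queries to documents across text and images."]

    q_emb = model.encode(queries, instruction="Retrieve passages that answer the question.")
    p_emb = model.encode(passages, instruction="")
    print(q_emb @ p_emb.T)  # higher score = better match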

License

MM-Embed is licensed under CC-BY-NC-4.0, restricting use to non-commercial purposes. For commercial applications, refer to NVIDIA NeMo Retriever Microservices (NIMs). For detailed terms, see the license file in the model repository.
