MM-Embed
Introduction
MM-Embed is NVIDIA's extension of NV-Embed-v1 to multimodal retrieval. It reaches a state-of-the-art average score of 52.7 on the UniIR benchmark, surpassing prior results, while also improving text retrieval accuracy on the MTEB benchmark. These gains come from new training strategies, notably modality-aware hard negative mining and continual text-to-text fine-tuning.
Architecture
- Multimodal Architecture: llava-hf/llava-v1.6-mistral-7b-hf
- Text Embedding LLM: nvidia/NV-Embed-v1
Training
MM-Embed is trained with modality-aware hard negative mining, which counteracts the modality bias of multimodal LLM retrievers (e.g., favoring candidates of one modality regardless of the modality the query asks for), followed by continual text-to-text fine-tuning so that multimodal training does not degrade pure text retrieval. Together these strategies maintain accuracy across both text and multimodal retrieval tasks.
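The paper's exact mining procedure is more involved; the following is a minimal sketch of the idea only, with all function and variable names hypothetical: candidates that score highly but have the wrong modality are deliberately kept as hard negatives, so training penalizes matching on modality rather than content.

```python
import torch

def mine_modality_aware_negatives(query_emb, cand_embs, cand_modalities,
                                  target_modality, positive_idx, k=8):
    """Hypothetical sketch of modality-aware hard negative mining.

    query_emb:       (d,) L2-normalized query embedding
    cand_embs:       (N, d) L2-normalized candidate embeddings
    cand_modalities: list of N modality tags, e.g. "text" or "image"
    target_modality: modality the retrieval instruction asks for
    positive_idx:    index of the ground-truth candidate
    """
    sims = cand_embs @ query_emb                      # cosine similarities
    order = torch.argsort(sims, descending=True).tolist()
    wrong_mod, in_mod = [], []
    for i in order:
        if i == positive_idx:
            continue
        (wrong_mod if cand_modalities[i] != target_modality else in_mod).append(i)
    # Reserve part of the negative budget for high-scoring wrong-modality
    # candidates, so the model is explicitly discouraged from modality confusion.
    half = k // 2
    return wrong_mod[:half] + in_mod[:k - half]
```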
Guide: Running Locally
- Install Required Packages:

```bash
pip uninstall -y transformer-engine
pip install torch==2.2.0
pip install transformers==4.42.4
pip install flash-attn==2.2.0
pip install pillow
```
- Authenticate Access:
  - Obtain a Hugging Face access token from your account settings on huggingface.co.
  - Run `huggingface-cli login` and enter the token when prompted.
- Download Model:
  - Obtain MM-Embed from its Hugging Face model repository (a pre-download sketch follows this list).
- Run Inference:
  - Ensure your environment supports PyTorch with NVIDIA hardware, preferably on Linux (a minimal inference sketch follows this list).
- Suggested Cloud GPUs:
  - NVIDIA H100 for optimal performance with PyTorch.
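For the Download Model step, the checkpoint can be fetched ahead of time with huggingface_hub; the repo id nvidia/MM-Embed is assumed from the model name, and the call requires prior authentication via `huggingface-cli login`:

```python
from huggingface_hub import snapshot_download

# Fetch the full MM-Embed checkpoint into the local Hugging Face cache.
local_dir = snapshot_download("nvidia/MM-Embed")
print("Model downloaded to:", local_dir)
```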
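For the Run Inference step, here is a minimal sketch of loading the model on a GPU. Loading with `trust_remote_code=True` is the standard transformers pattern for repos that ship custom modeling code; the `encode` call and its input format below are assumptions for illustration, so consult the model card for the exact embedding API.

```python
import torch
from transformers import AutoModel

# Load MM-Embed together with its custom modeling code from the Hub.
model = AutoModel.from_pretrained("nvidia/MM-Embed", trust_remote_code=True)
model = model.to("cuda").eval()

# ASSUMPTION: the repo's custom code exposes an `encode` method that accepts
# text and/or image inputs; check the model card for the actual signature.
with torch.no_grad():
    emb = model.encode(
        [{"txt": "a photo of a golden retriever"}],  # hypothetical input format
        is_query=True,
    )
print(emb.shape)
```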
License
MM-Embed is licensed under CC-BY-NC-4.0, restricting use to non-commercial purposes. For commercial applications, refer to NeMo Retriever Microservices (NIMs). For detailed terms, see the CC-BY-NC-4.0 license text.