MM-Embed
Introduction
MM-Embed is NVIDIA's extension of NV-Embed-v1 to multimodal retrieval. It reaches a state-of-the-art average score of 52.7 on the UniIR benchmark, surpassing prior results, while also improving text retrieval accuracy on the MTEB benchmark. These gains come from new training strategies, notably modality-aware hard negative mining and continual text-to-text fine-tuning.
Architecture
- Multimodal Architecture: llava-hf/llava-v1.6-mistral-7b-hf
- Text Embedding LLM: nvidia/NV-Embed-v1
Training
MM-Embed is trained with modality-aware hard negative mining, which counteracts the modality bias of multimodal LLM retrievers (e.g., favoring candidates of one modality regardless of the modality the query asks for), followed by continual text-to-text fine-tuning so that multimodal training does not degrade pure text retrieval. Together these strategies maintain accuracy across both text and multimodal retrieval tasks.
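The paper's exact mining procedure is more involved; the following is a minimal sketch of the idea only, with all function and variable names hypothetical: candidates that score highly but have the wrong modality are deliberately kept as hard negatives, so training penalizes matching on modality rather than content.

```python
import torch

def mine_modality_aware_negatives(query_emb, cand_embs, cand_modalities,
                                  target_modality, positive_idx, k=8):
    """Hypothetical sketch of modality-aware hard negative mining.

    query_emb:       (d,) L2-normalized query embedding
    cand_embs:       (N, d) L2-normalized candidate embeddings
    cand_modalities: list of N modality tags, e.g. "text" or "image"
    target_modality: modality the retrieval instruction asks for
    positive_idx:    index of the ground-truth candidate
    """
    sims = cand_embs @ query_emb                      # cosine similarities
    order = torch.argsort(sims, descending=True).tolist()
    wrong_mod, in_mod = [], []
    for i in order:
        if i == positive_idx:
            continue
        (wrong_mod if cand_modalities[i] != target_modality else in_mod).append(i)
    # Reserve part of the negative budget for high-scoring wrong-modality
    # candidates, so the model is explicitly discouraged from modality confusion.
    half = k // 2
    return wrong_mod[:half] + in_mod[:k - half]
```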
Guide: Running Locally
- Install Required Packages:

```bash
pip uninstall -y transformer-engine
pip install torch==2.2.0
pip install transformers==4.42.4
pip install flash-attn==2.2.0
pip install pillow
```
- Authenticate Access:
  - Obtain a Hugging Face access token from your account settings on huggingface.co.
  - Run `huggingface-cli login` and enter the token when prompted.
- Download Model:
  - Obtain MM-Embed from its Hugging Face model repository (a pre-download sketch follows this list).
- Run Inference:
  - Ensure your environment supports PyTorch with NVIDIA hardware, preferably on Linux (a minimal inference sketch follows this list).
- Suggested Cloud GPUs:
  - NVIDIA H100 for optimal performance with PyTorch.
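For the Download Model step, the checkpoint can be fetched ahead of time with huggingface_hub; the repo id nvidia/MM-Embed is assumed from the model name, and the call requires prior authentication via `huggingface-cli login`:

```python
from huggingface_hub import snapshot_download

# Fetch the full MM-Embed checkpoint into the local Hugging Face cache.
local_dir = snapshot_download("nvidia/MM-Embed")
print("Model downloaded to:", local_dir)
```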
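For the Run Inference step, here is a minimal sketch of loading the model on a GPU. Loading with `trust_remote_code=True` is the standard transformers pattern for repos that ship custom modeling code; the `encode` call and its input format below are assumptions for illustration, so consult the model card for the exact embedding API.

```python
import torch
from transformers import AutoModel

# Load MM-Embed together with its custom modeling code from the Hub.
model = AutoModel.from_pretrained("nvidia/MM-Embed", trust_remote_code=True)
model = model.to("cuda").eval()

# ASSUMPTION: the repo's custom code exposes an `encode` method that accepts
# text and/or image inputs; check the model card for the actual signature.
with torch.no_grad():
    emb = model.encode(
        [{"txt": "a photo of a golden retriever"}],  # hypothetical input format
        is_query=True,
    )
print(emb.shape)
```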
License
MM-Embed is licensed under CC-BY-NC-4.0, restricting use to non-commercial purposes. For commercial applications, refer to NeMo Retriever Microservices (NIMs). For detailed terms, see the CC-BY-NC-4.0 license text.