NVLM-D-72B LLM Model - Open LLM List

Introduction
NVLM-D-72B is a multimodal large language model developed by NVIDIA, designed for state-of-the-art performance on both vision-language and text-only tasks. It is released openly to the community and supports applications such as optical character recognition, multimodal reasoning, and coding.

Architecture
NVLM-D-72B employs a decoder-only transformer architecture. It accepts text and image inputs: text is processed as a one-dimensional sequence and images as two-dimensional structures. The output is text in string format. The tokenizer includes special tokens for visual tasks, and the model supports multi-GPU inference.

Training
The model is trained using a combination of the Megatron-LM and Hugging Face codebases, with benchmark results provided for both. Training uses a large-scale, high-quality multimodal dataset that integrates data types such as image captions, natural images, charts, and scene descriptions. Supervised fine-tuning draws on data from diverse domains, including general knowledge, science diagrams, and mathematical reasoning.

Guide: Running Locally

Environment Setup: Use the provided Dockerfile to create a reproducible environment based on the nvcr.io/nvidia/pytorch:23.09-py3 image.
Model Loading:
- Import necessary libraries and load the model using the AutoModel.from_pretrained method from the Transformers library.
- Utilize device_map for distributing the model across multiple GPUs if available.
Inference:
- Implement the split_model function to allocate model layers across available GPUs.
- Use AutoTokenizer to tokenize the text prompt, including the special tokens that mark image content; the image itself is preprocessed separately.
- Perform inference by loading images with appropriate preprocessing and generating responses to text and image prompts.
Cloud GPUs: Consider using NVIDIA GPUs on cloud platforms like AWS or Google Cloud for optimal performance.
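
The layer-allocation step in the guide above can be sketched as follows. This is a minimal illustration, not the model card's exact code: the layer count, the module names ("language_model.model.layers", "vision_model"), and the choice to treat GPU 0 as half a GPU because it also hosts the vision encoder are all assumptions made for the example.

```python
import math


def split_model(num_layers, num_gpus,
                layer_prefix="language_model.model.layers",
                extra_on_gpu0=("vision_model",)):
    """Build a device_map dict that spreads decoder layers across GPUs.

    Assumption: GPU 0 also hosts the modules named in `extra_on_gpu0`
    (hypothetical names), so it is counted as half a GPU and receives
    roughly half as many decoder layers as the others.
    """
    if num_layers < 1 or num_gpus < 1:
        raise ValueError("num_layers and num_gpus must be positive")

    # Layers per full GPU, with GPU 0 contributing only 0.5 of a GPU.
    per_gpu = math.ceil(num_layers / (num_gpus - 0.5))
    quota = [per_gpu] * num_gpus
    quota[0] = math.ceil(per_gpu * 0.5)

    device_map = {}
    layer = 0
    for gpu, count in enumerate(quota):
        for _ in range(count):
            if layer == num_layers:  # rounding up can leave spare slots
                break
            device_map[f"{layer_prefix}.{layer}"] = gpu
            layer += 1

    # Pin the remaining (assumed) modules to GPU 0.
    for name in extra_on_gpu0:
        device_map[name] = 0
    return device_map


if __name__ == "__main__":
    # Hypothetical configuration: 80 decoder layers on 4 GPUs.
    device_map = split_model(num_layers=80, num_gpus=4)
    for gpu in range(4):
        layers = [k for k, v in device_map.items()
                  if v == gpu and k.startswith("language_model")]
        print(f"GPU {gpu}: {len(layers)} decoder layers")
```

A dictionary of this shape can be passed as the device_map argument of AutoModel.from_pretrained. A complete map must also cover every other top-level module of the checkpoint (for example the embeddings, final normalization, and output head), whose names depend on the model's code and are not shown here.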

License
The NVLM-D-72B model is released under the Creative Commons Attribution-NonCommercial 4.0 International License (CC-BY-NC-4.0), permitting non-commercial use with appropriate attribution.
