NVLM-D-72B
nvidia
Introduction
NVLM-D-72B is a multimodal large language model developed by NVIDIA, designed to achieve state-of-the-art results on both vision-language and text-only tasks. It is openly released to the community and supports applications such as optical character recognition, multimodal reasoning, and coding.
Architecture
NVLM-D-72B employs a decoder-only transformer architecture. It accepts text and image inputs, represented as one-dimensional and two-dimensional structures respectively, and outputs text as a string. Its tokenizer includes special tokens for visual tasks, and the model supports multi-GPU inference.
Training
The model is trained using a combination of the Megatron-LM and Hugging Face codebases, with benchmark results provided for both. Training uses a large-scale, high-quality multimodal dataset that integrates data types such as image captions, natural images, charts, and scene descriptions. The supervised fine-tuning data spans diverse domains, including general knowledge, science diagrams, and mathematical reasoning.
Guide: Running Locally
- Environment Setup: Use the provided Dockerfile to create a reproducible environment based on the nvcr.io/nvidia/pytorch:23.09-py3 image.
- Model Loading:
  - Import necessary libraries and load the model using the AutoModel.from_pretrained method from the Transformers library.
  - Utilize device_map for distributing the model across multiple GPUs if available.
- Inference:
  - Implement the split_model function to allocate model layers across available GPUs.
  - Use the AutoTokenizer for handling input text and image data.
  - Perform inference by loading images with appropriate preprocessing and generating responses to text and image prompts.
- Cloud GPUs: If suitable local hardware is not available, consider using NVIDIA GPUs on cloud platforms like AWS or Google Cloud.
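
The model loading and layer allocation steps above can be sketched roughly as follows. This is a minimal illustration, not the model card's own split_model function: build_device_map and load_model are hypothetical helper names, the even layer split is a simplification, and the module names (language_model.model.layers, vision_model), the layer count of 80, and the nvidia/NVLM-D-72B repository ID are assumptions that should be checked against the actual checkpoint. Only AutoModel.from_pretrained, AutoTokenizer.from_pretrained, and the device_map argument come from the Transformers library itself.

```python
import math


def build_device_map(num_layers, num_gpus,
                     layer_prefix="language_model.model.layers",
                     first_gpu_modules=()):
    """Assign decoder layers to GPU indices in contiguous, near-equal chunks.

    layer_prefix and first_gpu_modules are assumed module names; inspect the
    real model to find the correct ones.
    """
    if num_layers < 1 or num_gpus < 1:
        raise ValueError("num_layers and num_gpus must both be at least 1")

    layers_per_gpu = math.ceil(num_layers / num_gpus)
    device_map = {}
    for layer_idx in range(num_layers):
        # Integer division keeps consecutive layers on the same device.
        device_map[f"{layer_prefix}.{layer_idx}"] = layer_idx // layers_per_gpu

    # Place any non-layer modules (for example a vision encoder) on GPU 0.
    for name in first_gpu_modules:
        device_map[name] = 0
    return device_map


def load_model(model_id, device_map):
    """Load model and tokenizer; imports are deferred so the mapping logic
    above can be used without torch or transformers installed."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModel.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True,  # assumes the repository ships custom model code
        device_map=device_map,
    ).eval()
    return model, tokenizer


if __name__ == "__main__":
    # Assumed values: 80 decoder layers spread over 4 GPUs.
    mapping = build_device_map(80, 4, first_gpu_modules=("vision_model",))
    print(mapping["language_model.model.layers.0"],
          mapping["language_model.model.layers.79"])
    # model, tokenizer = load_model("nvidia/NVLM-D-72B", mapping)
```

Note that Transformers expects a hand-written device_map to cover every module of the model, so in practice the mapping would also need entries for embeddings, the final norm, the output head, and any projector modules. Passing device_map="auto" is a simpler alternative when fine-grained placement is not needed.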
License
The NVLM-D-72B model is released under the Creative Commons Attribution-NonCommercial 4.0 International License (CC-BY-NC-4.0), permitting non-commercial use with appropriate attribution.