NVLM-D-72B

nvidia

Introduction
NVLM-D-72B is a multimodal large language model developed by NVIDIA, designed for state-of-the-art performance on both vision-language and text-only tasks. It is released openly to the community and supports applications such as optical character recognition, multimodal reasoning, and coding.

Architecture
NVLM-D-72B uses a decoder-only transformer architecture. It accepts text as a one-dimensional input and images as two-dimensional inputs, and it outputs text as a string. Its tokenizer includes special tokens for visual tasks, and the model supports multi-GPU inference.

Training
The model is trained using a combination of the Megatron-LM and Hugging Face codebases, with benchmark results provided for both. Training uses a large-scale, high-quality multimodal dataset that integrates data types such as image captions, natural images, charts, and scene descriptions. The supervised fine-tuning data spans diverse domains, including general knowledge, science diagrams, and mathematical reasoning.

Guide: Running Locally

  1. Environment Setup: Use the provided Dockerfile to create a reproducible environment based on the nvcr.io/nvidia/pytorch:23.09-py3 image.
  2. Model Loading:
    • Import necessary libraries and load the model using the AutoModel.from_pretrained method from the Transformers library.
    • Pass a device_map to distribute the model across multiple GPUs if available.
  3. Inference:
    • Implement the split_model function to allocate model layers across available GPUs.
    • Use AutoTokenizer to tokenize the input text; images are prepared separately in the preprocessing step.
    • Perform inference by loading images with appropriate preprocessing and generating responses to text and image prompts.
  4. Cloud GPUs: If suitable multi-GPU hardware is not available locally, consider NVIDIA GPUs on cloud platforms such as AWS or Google Cloud.
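
The model-loading and layer-allocation steps above can be sketched as follows. This is a minimal illustration, not the model card's own code: the module paths, the 80-layer default, the half-share given to the first GPU, and the repository id nvidia/NVLM-D-72B are all assumptions to verify against the actual checkpoint before use.

```python
import math

# Assumed module paths; verify them against the checkpoint's real module tree.
LAYER_PREFIX = "language_model.model.layers"
FIRST_GPU_MODULES = (
    "vision_model",
    "mlp1",
    "language_model.model.embed_tokens",
    "language_model.model.norm",
    "language_model.lm_head",
)


def split_model(num_layers=80, world_size=1, first_gpu_share=0.5):
    """Build a device_map that spreads decoder layers over GPUs 0..world_size-1.

    GPU 0 also hosts the non-layer modules (vision encoder, embeddings, head),
    so it receives only `first_gpu_share` of a normal GPU's layer count.
    The default of 80 layers is an assumption, not a value from this document.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if not 0 < first_gpu_share <= 1:
        raise ValueError("first_gpu_share must be in (0, 1]")

    # Treat GPU 0 as a fractional GPU when computing the per-GPU layer budget.
    per_gpu = math.ceil(num_layers / (world_size - 1 + first_gpu_share))
    counts = [per_gpu] * world_size
    counts[0] = math.ceil(per_gpu * first_gpu_share)

    device_map = {name: 0 for name in FIRST_GPU_MODULES}
    layer = 0
    for gpu, count in enumerate(counts):
        for _ in range(count):
            if layer == num_layers:  # budgets are rounded up, so stop early
                break
            device_map[f"{LAYER_PREFIX}.{layer}"] = gpu
            layer += 1
    return device_map


def load_model(path="nvidia/NVLM-D-72B"):
    """Load model and tokenizer with the custom device_map (needs GPUs)."""
    # Imported here so split_model can be used without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    device_map = split_model(world_size=max(torch.cuda.device_count(), 1))
    model = AutoModel.from_pretrained(
        path,
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True,  # the checkpoint ships its own model code
        device_map=device_map,
    ).eval()
    tokenizer = AutoTokenizer.from_pretrained(
        path, trust_remote_code=True, use_fast=False
    )
    return model, tokenizer
```

With these assumed defaults and four GPUs, split_model places 12 layers on GPU 0 and up to 23 on each of the others. Giving the first GPU a smaller share leaves it memory headroom for the vision encoder and embeddings; the right share depends on the actual hardware.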

License
The NVLM-D-72B model is released under the Creative Commons Attribution-NonCommercial 4.0 International License (CC-BY-NC-4.0), permitting non-commercial use with appropriate attribution.
