NVILA-15B Model Documentation

Introduction

NVILA-15B is a visual language model (VLM) designed to efficiently and accurately process interleaved image-text data. It is part of a family of models that improve upon previous architectures by optimizing spatial and temporal resolutions and compressing visual tokens. This approach allows NVILA to handle high-resolution images and long videos effectively, reducing training costs and latency while maintaining or surpassing the accuracy of leading VLMs.

Architecture

NVILA employs a "scale-then-compress" strategy: it first raises spatial and temporal resolutions, then compresses the resulting visual tokens. This design allows the model to process high-resolution images and videos efficiently. The model's efficiency was also systematically optimized across its full lifecycle, from training and fine-tuning through deployment.
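The "scale-then-compress" idea can be sketched in a few lines: scaling yields a larger grid of visual tokens, which is then compressed back down. The sketch below uses simple 2x2 average pooling as the compression operator; this is an illustrative assumption, not NVILA's actual compression module.

```python
import numpy as np

def scale_then_compress(tokens, grid, pool=2):
    """Illustrative sketch: 'scale' has already produced a high-resolution
    grid of visual tokens; 'compress' merges each pool x pool neighborhood
    into one token by average pooling (an assumed compression operator)."""
    h, w = grid
    d = tokens.shape[-1]
    x = tokens.reshape(h, w, d)
    # Group the grid into pool x pool blocks and average each block.
    x = x.reshape(h // pool, pool, w // pool, pool, d).mean(axis=(1, 3))
    return x.reshape(-1, d)

# Example: a 32x32 grid of 1024 visual tokens compresses to 16x16 = 256.
tokens = np.random.randn(32 * 32, 64)
compressed = scale_then_compress(tokens, (32, 32))
print(compressed.shape)  # (256, 64)
```

The net effect is a 4x reduction in the token count fed to the language model, which is where the latency and training-cost savings come from.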

Training

NVILA was trained using a hybrid data collection and labeling method that combines automated and human processes. The training dataset spans diverse image, video, and text inputs, in formats such as RGB images, MP4 video, and text strings. Supported hardware for training includes the Ampere, Jetson, Hopper, and Lovelace architectures, primarily on Linux operating systems.
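A mixed-modality training record of the kind described above can be modeled with a small container type. The field names below are illustrative assumptions for this sketch, not the repository's actual data schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TrainingSample:
    """Illustrative record for one interleaved image/video/text sample.
    Field names are assumptions, not NVILA's actual schema."""
    text: str                                             # text string input
    image_paths: List[str] = field(default_factory=list)  # RGB image files
    video_path: Optional[str] = None                      # MP4 clip, if any
    label_source: str = "automated"                       # "automated" or "human"

sample = TrainingSample(
    text="Describe the clip.",
    video_path="clips/demo.mp4",
    label_source="human",
)
print(sample.video_path)  # clips/demo.mp4
```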

Guide: Running Locally

  1. Set Up Environment: Ensure that your system has the required dependencies installed, including PyTorch and TensorRT-LLM.
  2. Clone Repository: Download the NVILA-15B repository from GitHub.
  3. Install Dependencies: Use a package manager to install the necessary libraries.
  4. Download Model Weights: Obtain the pretrained weights, which are available under the CC-BY-NC-SA-4.0 license.
  5. Run Inference: Perform inference on supported hardware such as an A100 or RTX 4090, using an inference engine such as TensorRT-LLM or serving the model via Triton Inference Server.
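Once the steps above are done, the model consumes interleaved image-text input. The helper below only illustrates how such a prompt might be assembled with an `<image>` placeholder token; the placeholder name and interface are assumptions for this sketch, so check the repository for the actual prompt format.

```python
from pathlib import Path
from typing import List, Union

# Assumed placeholder token; consult the NVILA repo for the real one.
IMAGE_TOKEN = "<image>"

def build_prompt(parts: List[Union[str, Path]]) -> str:
    """Interleave text segments and image references into one prompt string.
    Path entries stand in for images and become placeholder tokens."""
    pieces = []
    for part in parts:
        pieces.append(IMAGE_TOKEN if isinstance(part, Path) else part)
    return "\n".join(pieces)

prompt = build_prompt([Path("photo.jpg"), "What is shown in this image?"])
print(prompt)
```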

Suggested Cloud GPUs: Consider a cloud instance (e.g., AWS EC2) with NVIDIA A100 GPUs for training or large-batch inference; a workstation-class RTX 4090 is also suitable for local inference.

License

  • Code License: Apache 2.0
  • Pretrained Weights License: CC-BY-NC-SA-4.0
  • Terms of Service: Non-commercial use only. Usage is additionally subject to OpenAI's Terms of Use (covering OpenAI-generated training data) and the licenses of the individual datasets. For details, refer to the LICENSE files provided in the repository.