NVILA-Lite-15B
by Efficient-Large-Model

Introduction
NVILA-Lite-15B is a visual language model (VLM) optimized for efficiency and accuracy, designed to handle interleaved image-text data at scale. This model is part of the NVILA family, which focuses on improving the efficiency of VLMs by scaling spatial and temporal resolutions and compressing visual tokens. NVILA reduces training costs, memory usage, and latency while maintaining competitive accuracy.
Architecture
NVILA employs a "scale-then-compress" approach: it first scales up spatial and temporal resolution to preserve detail in high-resolution images and long videos, then compresses the resulting visual tokens so the language model processes far fewer of them. The model is designed to optimize efficiency at every stage, from training to deployment. It runs on a range of NVIDIA hardware, including the Ampere, Hopper, and Lovelace GPU microarchitectures as well as Jetson devices, and is compatible with Linux operating systems.
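The compression half of "scale-then-compress" can be illustrated with a minimal sketch: after a vision encoder produces a grid of patch tokens, spatially adjacent tokens are merged (here by simple block averaging) so the LLM receives far fewer visual tokens. The grid size, embedding dimension, and 2x2 factor below are illustrative assumptions, not the model's actual configuration.

```python
import numpy as np

def compress_tokens(tokens: np.ndarray, factor: int = 2) -> np.ndarray:
    """Merge each factor x factor block of patch tokens into one token
    by averaging. tokens has shape (H, W, D), a grid of D-dimensional
    patch embeddings. Illustrative only; NVILA's actual token
    compressor may use a different (e.g. learned) reduction."""
    h, w, d = tokens.shape
    assert h % factor == 0 and w % factor == 0
    blocks = tokens.reshape(h // factor, factor, w // factor, factor, d)
    return blocks.mean(axis=(1, 3))  # shape (H/factor, W/factor, D)

# A scaled-up 448x448 image at patch size 14 yields a 32x32 token grid;
# 2x2 compression cuts 1024 visual tokens down to 256.
grid = np.random.rand(32, 32, 768)
compressed = compress_tokens(grid, factor=2)
print(compressed.shape)  # (16, 16, 768)
```

The trade-off is the usual one: a larger compression factor shrinks the LLM's input (and thus memory and latency) at the cost of fine-grained spatial detail.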
Training
The model was trained in November 2024 using a hybrid data collection and labeling method involving both automated and human processes. Training datasets and further details can be found in the associated paper on arXiv. The training process focuses on reducing costs and improving efficiency across various benchmarks.
Guide: Running Locally
- Setup Environment: Ensure you have a compatible Linux system and necessary dependencies installed, such as Python and PyTorch.
- Download Model: Access the NVILA-Lite-15B model from the Hugging Face model hub and download it to your local environment.
- Install Required Libraries: Use `pip` to install libraries such as `transformers` and `torch`.
- Run Inference: Use a compatible inference engine such as TensorRT-LLM or PyTorch to process input data.
- Utilize Cloud GPUs: For enhanced performance, consider using cloud GPUs such as NVIDIA A100 or RTX 4090.
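The steps above can be sketched in Python. This is a minimal sketch assuming the model exposes a standard Hugging Face interface with custom remote code; the exact entry point and input format for NVILA-Lite-15B may differ, so consult the model card for the canonical loading code. The `build_messages` helper and the `generate_content` call are illustrative assumptions.

```python
def build_messages(image_path: str, question: str) -> list:
    """Hypothetical helper: interleave one image and one text question
    as a chat-style message list (this structure is an assumption)."""
    return [{"role": "user",
             "content": [{"type": "image", "path": image_path},
                         {"type": "text", "text": question}]}]

def run_inference(question: str, image_path: str) -> str:
    # Imported lazily so the helper above stays dependency-free.
    from transformers import AutoModel  # pip install transformers torch
    # Downloads ~15B parameters; expect to need a large GPU (e.g. an A100).
    model = AutoModel.from_pretrained(
        "Efficient-Large-Model/NVILA-Lite-15B",
        trust_remote_code=True,  # the repo ships custom modeling code
        device_map="auto",
    )
    # The generation entry point is an assumption; check the model card.
    return model.generate_content(build_messages(image_path, question))

# Example usage:
# answer = run_inference("Describe this image.", "photo.jpg")
```

If `transformers` alone cannot load the checkpoint, the model card's own repository (or TensorRT-LLM, as noted above) is the fallback path.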
License
- The code for NVILA-Lite-15B is released under the Apache 2.0 license.
- Pretrained weights are available under the CC-BY-NC-SA-4.0 license, intended for non-commercial research use.
- Users must comply with OpenAI's Terms of Use and dataset licenses associated with the model's training data.