LLaVA-Video-7B-Qwen2 (lmms-lab)
Introduction
The LLaVA-Video-7B-Qwen2 model is a 7 billion parameter model designed for video-to-text tasks. It is trained on the LLaVA-Video-178K and LLaVA-OneVision datasets and is based on the Qwen2 language model, supporting a context window of 32K tokens and processing up to 64 video frames.
Architecture
- Base Model: The model architecture is built on the Qwen2 language model.
- Datasets: Utilizes a mixture of 1.6 million single-image, multi-image, and video data instances.
- Model Size: 7 billion parameters.
- Context Window: Supports up to 32K tokens.
- Frame Support: Processes a maximum of 64 frames.
Training
- Training Data: Includes LLaVA-Video-178K and LLaVA-OneVision Data.
- Training Setup: Uses 256 Nvidia Tesla A100 GPUs with PyTorch as the neural network framework.
- Precision: Trained in bfloat16 for efficient computation.
Guide: Running Locally
- Prerequisites:
  - Install the required packages:
    pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
  - Ensure you have a working installation of PyTorch and a compatible CUDA setup for GPU support (a quick check is sketched below).
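As a quick sanity check for the PyTorch/CUDA requirement above, the following minimal snippet is not specific to LLaVA; it only reports what your environment provides:
    # Verify that PyTorch is installed and can see a CUDA-capable GPU.
    import torch

    print("PyTorch version:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("GPU:", torch.cuda.get_device_name(0))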
- Loading the Model:
    from llava.model.builder import load_pretrained_model

    pretrained = "lmms-lab/LLaVA-Video-7B-Qwen2"
    model_name = "llava_qwen"
    device = "cuda"
    device_map = "auto"
    tokenizer, model, image_processor, max_length = load_pretrained_model(
        pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map
    )
    model.eval()
- Processing Video Input:
  - Use the VideoReader from decord to load and sample video frames.
  - Preprocess the sampled frames with the image_processor (see the sketch below).
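A minimal sketch of this step, assuming a local video file and uniform frame sampling. The load_video helper, video_path, and max_frames names are illustrative and not part of the LLaVA-NeXT API; image_processor and the 64-frame limit come from the sections above:
    # Sketch: read a video with decord, sample up to 64 frames uniformly,
    # and preprocess them with the model's image processor.
    import numpy as np
    import torch
    from decord import VideoReader, cpu

    def load_video(video_path, max_frames=64):
        vr = VideoReader(video_path, ctx=cpu(0))
        total_frames = len(vr)
        # Uniformly spaced frame indices across the whole clip.
        indices = np.linspace(0, total_frames - 1, num=min(max_frames, total_frames)).astype(int)
        return vr.get_batch(indices).asnumpy()  # shape: (num_frames, H, W, 3)

    frames = load_video("sample_video.mp4", max_frames=64)
    # image_processor comes from load_pretrained_model in the previous step.
    video_tensor = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"]
    video_tensor = video_tensor.to("cuda", dtype=torch.bfloat16)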
- Inference:
  - Prepare a prompt and use the model's generation capabilities to produce text outputs from video inputs (see the sketch below).
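A hedged sketch of the prompt-and-generate step, following the usage pattern shipped with the LLaVA-NeXT repository. It assumes the video_tensor from the previous sketch; the qwen_1_5 conversation template and the greedy-decoding settings are illustrative choices rather than requirements:
    # Sketch: build a chat-style prompt containing an image token, then generate.
    import copy
    import torch
    from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
    from llava.conversation import conv_templates
    from llava.mm_utils import tokenizer_image_token

    question = DEFAULT_IMAGE_TOKEN + "\nPlease describe this video in detail."
    conv = copy.deepcopy(conv_templates["qwen_1_5"])
    conv.append_message(conv.roles[0], question)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()

    input_ids = tokenizer_image_token(
        prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
    ).unsqueeze(0).to("cuda")

    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            images=[video_tensor],
            modalities=["video"],
            do_sample=False,
            max_new_tokens=512,
        )
    print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())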
- Suggested Cloud GPUs:
  - Consider using cloud services like AWS EC2 with NVIDIA Tesla A100 instances for optimal performance.
License
The LLaVA-Video-7B-Qwen2 model is licensed under the Apache 2.0 License, allowing for wide use and modification with attribution.