LLaVA-Video-7B-Qwen2 (lmms-lab)
Introduction
The LLaVA-Video-7B-Qwen2 model is a 7 billion parameter model designed for video-to-text tasks. It is trained on the LLaVA-Video-178K and LLaVA-OneVision datasets and is based on the Qwen2 language model, supporting a context window of 32K tokens and processing up to 64 video frames.
Architecture
- Base Model: The model architecture is built on the Qwen2 language model.
- Datasets: Utilizes a mixture of 1.6 million single-image, multi-image, and video data instances.
- Model Size: 7 billion parameters.
- Context Window: Supports up to 32K tokens.
- Frame Support: Processes a maximum of 64 frames.
Training
- Training Data: Includes LLaVA-Video-178K and LLaVA-OneVision Data.
- Training Setup: Uses 256 Nvidia Tesla A100 GPUs with PyTorch as the neural network framework.
- Precision: Trained in bfloat16 for efficient computation.
Guide: Running Locally
- Prerequisites:
  - Install the required packages:
    pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
  - Ensure you have a working installation of PyTorch and a compatible CUDA setup for GPU support (a quick check is sketched below).
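As a quick sanity check for the PyTorch/CUDA requirement above, the following minimal snippet is not specific to LLaVA; it only reports what your environment provides:
    # Verify that PyTorch is installed and can see a CUDA-capable GPU.
    import torch

    print("PyTorch version:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("GPU:", torch.cuda.get_device_name(0))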
- Loading the Model:
    from llava.model.builder import load_pretrained_model

    pretrained = "lmms-lab/LLaVA-Video-7B-Qwen2"
    model_name = "llava_qwen"
    device = "cuda"
    device_map = "auto"
    tokenizer, model, image_processor, max_length = load_pretrained_model(
        pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map
    )
    model.eval()
- Processing Video Input:
  - Use the VideoReader from decord to load and sample video frames.
  - Preprocess the sampled frames with the image_processor (see the sketch below).
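A minimal sketch of this step, assuming a local video file and uniform frame sampling. The load_video helper, video_path, and max_frames names are illustrative and not part of the LLaVA-NeXT API; image_processor and the 64-frame limit come from the sections above:
    # Sketch: read a video with decord, sample up to 64 frames uniformly,
    # and preprocess them with the model's image processor.
    import numpy as np
    import torch
    from decord import VideoReader, cpu

    def load_video(video_path, max_frames=64):
        vr = VideoReader(video_path, ctx=cpu(0))
        total_frames = len(vr)
        # Uniformly spaced frame indices across the whole clip.
        indices = np.linspace(0, total_frames - 1, num=min(max_frames, total_frames)).astype(int)
        return vr.get_batch(indices).asnumpy()  # shape: (num_frames, H, W, 3)

    frames = load_video("sample_video.mp4", max_frames=64)
    # image_processor comes from load_pretrained_model in the previous step.
    video_tensor = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"]
    video_tensor = video_tensor.to("cuda", dtype=torch.bfloat16)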
- Inference:
  - Prepare a prompt and use the model's generation capabilities to produce text outputs from video inputs (see the sketch below).
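A hedged sketch of the prompt-and-generate step, following the usage pattern shipped with the LLaVA-NeXT repository. It assumes the video_tensor from the previous sketch; the qwen_1_5 conversation template and the greedy-decoding settings are illustrative choices rather than requirements:
    # Sketch: build a chat-style prompt containing an image token, then generate.
    import copy
    import torch
    from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
    from llava.conversation import conv_templates
    from llava.mm_utils import tokenizer_image_token

    question = DEFAULT_IMAGE_TOKEN + "\nPlease describe this video in detail."
    conv = copy.deepcopy(conv_templates["qwen_1_5"])
    conv.append_message(conv.roles[0], question)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()

    input_ids = tokenizer_image_token(
        prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
    ).unsqueeze(0).to("cuda")

    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            images=[video_tensor],
            modalities=["video"],
            do_sample=False,
            max_new_tokens=512,
        )
    print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())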
- Suggested Cloud GPUs:
  - Consider using cloud services like AWS EC2 with NVIDIA Tesla A100 instances for optimal performance.
License
The LLaVA-Video-7B-Qwen2 model is licensed under the Apache 2.0 License, allowing for wide use and modification with attribution.