LLaVA-Video-7B-Qwen2

lmms-lab

Introduction

The LLaVA-Video-7B-Qwen2 model is a 7-billion-parameter model for video-to-text tasks. It is trained on the LLaVA-Video-178K and LLaVA-OneVision datasets, is built on the Qwen2 language model, supports a 32K-token context window, and processes up to 64 video frames.

Architecture

  • Base Model: The model architecture is built on the Qwen2 language model.
  • Datasets: Utilizes a mixture of 1.6 million single-image, multi-image, and video data instances.
  • Model Size: 7 billion parameters.
  • Context Window: Supports up to 32K tokens.
  • Frame Support: Processes a maximum of 64 frames.

Training

  • Training Data: Includes LLaVA-Video-178K and LLaVA-OneVision Data.
  • Training Setup: Trained on 256 NVIDIA A100 GPUs using PyTorch as the neural network framework.
  • Precision: Trained in bfloat16 for efficient computation.

Guide: Running Locally

  1. Prerequisites:

    • Install the required packages: pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
    • Ensure you have a working installation of PyTorch and a compatible CUDA setup for GPU support.
  2. Loading the Model:

    from llava.model.builder import load_pretrained_model

    pretrained = "lmms-lab/LLaVA-Video-7B-Qwen2"
    model_name = "llava_qwen"
    device_map = "auto"  # let the loader place weights across available GPUs

    # Returns the tokenizer, the model, the vision preprocessor, and the context length.
    tokenizer, model, image_processor, max_length = load_pretrained_model(
        pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map
    )
    model.eval()
    
  3. Processing Video Input:

    • Use the VideoReader from decord to load and sample video frames.
    • Preprocess video frames using the image_processor.
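The sampling step above can be sketched as follows. The `sample_frame_indices` helper is hypothetical (not part of the library), while the commented `VideoReader`/`get_batch` calls follow decord's public API; the commented preprocessing line assumes the `image_processor` returned by `load_pretrained_model`.

```python
import numpy as np

def sample_frame_indices(total_frames, max_frames=64):
    """Uniformly spaced frame indices, capped at the model's 64-frame limit."""
    n = min(max_frames, total_frames)
    return np.linspace(0, total_frames - 1, num=n).astype(int).tolist()

# With decord (assumed usage, "video.mp4" is a placeholder path):
#   from decord import VideoReader, cpu
#   vr = VideoReader("video.mp4", ctx=cpu(0))
#   frames = vr.get_batch(sample_frame_indices(len(vr))).asnumpy()  # (N, H, W, 3)
#   pixel_values = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"]
```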
  4. Inference:

    • Prepare a prompt and use the model's generation capabilities to produce text outputs from video inputs.
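A minimal sketch of this step, assuming the LLaVA convention of placing the vision token ahead of the question; `build_prompt` is a hypothetical helper, and the commented generation calls are an assumption based on the helpers in the LLaVA-NeXT repository (`tokenizer_image_token`, `model.generate`):

```python
def build_prompt(question, vision_token="<image>"):
    # LLaVA-style prompts put the image/video placeholder token before the question text.
    return f"{vision_token}\n{question}"

prompt = build_prompt("Describe what happens in this video.")

# Assumed generation flow, following LLaVA-NeXT examples:
#   input_ids = tokenizer_image_token(prompt, tokenizer, return_tensors="pt").unsqueeze(0).cuda()
#   output_ids = model.generate(input_ids, images=[pixel_values], modalities=["video"],
#                               do_sample=False, max_new_tokens=256)
#   answer = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
```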
  5. Suggested Cloud GPUs:

    • Consider cloud services such as AWS EC2 instances equipped with NVIDIA A100 GPUs for optimal performance.

License

The LLaVA-Video-7B-Qwen2 model is licensed under the Apache 2.0 License, allowing for wide use and modification with attribution.
