LLaVA-Video-Llama-3.1-8B

weizhiwang

Introduction

LLaVA-Video-Llama-3.1-8B is a video understanding language model that combines a visual encoder with a language model to process and comprehend video content. It can handle videos up to roughly 14 minutes long by sampling one frame out of every 30.
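
As a rough sanity check (assuming a 30 fps source video and the 800-frame input cap described under Training), keeping one frame out of every 30 amounts to roughly one sampled frame per second, so the cap corresponds to about 13-14 minutes of footage:

    source_fps = 30          # assumed source frame rate
    sampling_stride = 30     # keep one frame out of every 30
    max_input_frames = 800   # model's frame cap (see Training)

    seconds_covered = max_input_frames * sampling_stride / source_fps
    print(f"~{seconds_covered / 60:.1f} minutes")   # ~13.3 minutes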

Architecture

The model architecture consists of:

  • Visual Encoder: SigLIP-so400m-384px
  • Vision-Language Projector: Average Pooling (see the sketch after this list)
  • LLM Backbone: Built on Llama-3.1
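
The average-pooling projector compresses each frame's patch features before they reach the LLM. The sketch below is illustrative only: the SigLIP patch-grid size, the pooled 12x12 output (matching the 144 tokens per frame mentioned under Training), and the linear projection to the LLM hidden size are assumptions rather than the exact released implementation.

    import torch
    import torch.nn as nn

    class AvgPoolProjector(nn.Module):
        """Illustrative average-pooling vision-language projector (assumed shapes)."""
        def __init__(self, vision_dim=1152, llm_dim=4096, grid=27, pooled=12):
            super().__init__()
            self.grid = grid
            self.pool = nn.AdaptiveAvgPool2d((pooled, pooled))  # e.g. 27x27 patches -> 12x12 = 144 tokens
            self.proj = nn.Linear(vision_dim, llm_dim)          # map visual features to the LLM hidden size

        def forward(self, patch_feats):                         # (B, grid*grid, vision_dim)
            b, n, d = patch_feats.shape
            x = patch_feats.transpose(1, 2).reshape(b, d, self.grid, self.grid)
            x = self.pool(x)                                    # (B, d, 12, 12)
            x = x.flatten(2).transpose(1, 2)                    # (B, 144, d)
            return self.proj(x)                                 # (B, 144, llm_dim)

    # One frame's SigLIP features: 27*27 = 729 patches of width 1152 (assumed)
    frame_tokens = AvgPoolProjector()(torch.randn(1, 729, 1152))
    print(frame_tokens.shape)   # torch.Size([1, 144, 4096])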

Training

Training represents each video frame as a sequence of visual tokens and formats prompts with a specific conversation template (LLaVA-v1). The model supports up to 800 video frames as input, with each frame represented by 144 tokens.
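
A quick back-of-the-envelope check (assuming Llama-3.1's 128K-token context window) shows that this frame budget leaves ample room for the text prompt:

    max_frames = 800
    tokens_per_frame = 144
    visual_tokens = max_frames * tokens_per_frame
    print(visual_tokens)   # 115200 visual tokens, well under a 128K context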

Guide: Running Locally

  1. Install LLaVA:

    pip install git+https://github.com/Victorwz/LLaVA-Unified.git
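
    The later steps also use PyTorch, OpenCV, Pillow, and requests; if they are not already present in your environment, they can be installed with, for example:

    pip install torch opencv-python pillow requests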
    
  2. Load the Model:
    Use Python to load the pretrained model and its processor:

    import torch
    from llava.model.builder import load_pretrained_model

    # Load the checkpoint together with its tokenizer and image processor
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer, model, image_processor, context_len = load_pretrained_model("weizhiwang/LLaVA-Video-Llama-3.1-8B", None, "Video-Language-Model-Llama-3.1-8B", False, False, device=device)
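
    As a rough guide, loading an 8B-parameter model in half precision needs on the order of 16 GB of GPU memory, so make sure the selected device has enough headroom before moving on.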
    
  3. Prepare Video Input:
    Download and process video frames with OpenCV and PIL:

    import cv2
    from PIL import Image
    import requests

    url = "https://github.com/PKU-YuanGroup/Video-LLaVA/raw/main/videollava/serve/examples/sample_demo_1.mp4"

    def read_video(video_url):
        # Download the video to a temporary local file
        response = requests.get(video_url, stream=True)
        with open("tmp_video.mp4", 'wb') as f:
            for chunk in response.iter_content(chunk_size=1024):
                f.write(chunk)
        # Decode every frame and convert it from OpenCV's BGR layout to RGB for PIL
        video = cv2.VideoCapture("tmp_video.mp4")
        video_frames = []
        while video.isOpened():
            success, frame = video.read()
            if not success:
                break
            frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            pil_image = Image.fromarray(frame_rgb)
            video_frames.append(pil_image)
        video.release()
        return video_frames

    video_frames = read_video(video_url=url)
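
    The helper above keeps every decoded frame. For longer videos you may want to subsample before inference so the input stays within the 800-frame limit; the stride of 30 below mirrors the 1-in-30 sampling mentioned in the Introduction but is otherwise an assumption:

    # Keep one frame out of every 30 and cap the total at 800 frames (assumed sampling policy)
    video_frames = video_frames[::30][:800]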
    
  4. Perform Inference:
    Generate and decode text outputs from video frames:

    import torch
    from llava.conversation import conv_templates
    from llava.mm_utils import tokenizer_image_token

    # Preprocess every frame into a half-precision image tensor on the GPU
    image_tensors = [image_processor.preprocess(frame, return_tensors='pt')['pixel_values'][0].half().cuda() for frame in video_frames]

    # Build the prompt: one <image> placeholder per frame, followed by the question
    text = "\n".join(['<image>' for _ in range(len(image_tensors))]) + '\n' + "Why is this video funny?"
    conv = conv_templates["llama_3"].copy()
    conv.append_message(conv.roles[0], text)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()
    input_ids = tokenizer_image_token(prompt, tokenizer, return_tensors='pt').unsqueeze(0).cuda()

    with torch.inference_mode():
        output_ids = model.generate(input_ids, images=image_tensors, do_sample=False, max_new_tokens=512, use_cache=True)

    outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    print(outputs[0])
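
    With do_sample=False the decoding is greedy and therefore deterministic; max_new_tokens=512 caps the length of the generated answer and can be raised if responses are cut short.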
    
  5. Cloud GPUs: If a suitable local GPU is not available, consider cloud GPU providers such as AWS, Google Cloud, or Azure for faster inference.

License

This model is released under a Creative Commons license, as indicated in the metadata.
