LLaVA-Video-Llama-3.1-8B
weizhiwang/LLaVA-Video-Llama-3.1-8B
Introduction
LLaVA-Video-Llama-3.1-8B is a video understanding language model that combines a visual encoder with a language model to process and comprehend video content. It can handle videos up to 14 minutes long by sampling one frame out of every 30.
Architecture
The model architecture consists of:
- Visual Encoder: SigLIP-so400m-384px
- Vision-Language Projector: Average Pooling
- LLM Backbone: Built on Llama-3.1
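For intuition, below is a minimal, hypothetical sketch of what an average-pooling projector can look like: patch tokens from the visual encoder are pooled down to a 12×12 grid (144 tokens per frame, matching the figure in the Training section) and then mapped to the LLM's hidden size. The feature dimensions (1152 for SigLIP-so400m, 4096 for Llama-3.1-8B), the 27×27 patch grid, and the use of adaptive pooling followed by a single linear layer are illustrative assumptions, not the repository's exact implementation.

```python
# Hypothetical sketch of an average-pooling vision-language projector.
# Dimensions and layer choices are illustrative assumptions, not the released implementation.
import torch
import torch.nn as nn

class AvgPoolProjector(nn.Module):
    def __init__(self, vision_dim=1152, llm_dim=4096, out_grid=12):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(out_grid)  # 12 x 12 = 144 tokens per frame
        self.proj = nn.Linear(vision_dim, llm_dim)  # map vision features to the LLM hidden size

    def forward(self, patch_tokens):
        # patch_tokens: (batch, num_patches, vision_dim), num_patches assumed to be a perfect square
        b, n, c = patch_tokens.shape
        side = int(n ** 0.5)
        x = patch_tokens.transpose(1, 2).reshape(b, c, side, side)  # (B, C, H, W)
        x = self.pool(x)                                            # (B, C, 12, 12)
        x = x.flatten(2).transpose(1, 2)                            # (B, 144, C)
        return self.proj(x)                                         # (B, 144, llm_dim)

# Example: 729 patch tokens (assuming a 27 x 27 grid from SigLIP-so400m at 384px) -> 144 LLM tokens
frame_features = torch.randn(1, 729, 1152)
print(AvgPoolProjector()(frame_features).shape)  # torch.Size([1, 144, 4096])
```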
Training
Training treats video frames as sequences of visual tokens and uses the LLaVA-v1 conversation template to format interactions. The model accepts up to 800 video frames as input, with each frame represented by 144 tokens.
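The numbers above fit together as a quick back-of-the-envelope check; the 30 fps source frame rate is an assumption and the variable names are purely illustrative.

```python
# Back-of-the-envelope check of the visual context budget (assumes a 30 fps source video).
MAX_FRAMES = 800          # maximum number of sampled frames the model accepts
TOKENS_PER_FRAME = 144    # visual tokens per frame after the projector
SOURCE_FPS = 30           # assumed frame rate of the input video
SAMPLE_EVERY = 30         # keep one frame out of every 30, i.e. roughly one sampled frame per second

visual_tokens = MAX_FRAMES * TOKENS_PER_FRAME
covered_minutes = MAX_FRAMES * SAMPLE_EVERY / SOURCE_FPS / 60

print(f"visual tokens at the cap: {visual_tokens}")            # 115200
print(f"video covered at the cap: {covered_minutes:.1f} min")  # ~13.3 min, i.e. roughly 14 minutes
```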
Guide: Running Locally
- Install LLaVA:

  ```bash
  pip install git+https://github.com/Victorwz/LLaVA-Unified.git
  ```
- Load the Model:

  Use Python to load the pretrained model and its processor:

  ```python
  import torch
  from llava.model.builder import load_pretrained_model

  device = "cuda" if torch.cuda.is_available() else "cpu"
  tokenizer, model, image_processor, context_len = load_pretrained_model(
      "weizhiwang/LLaVA-Video-Llama-3.1-8B", None, "Video-Language-Model-Llama-3.1-8B", False, False, device=device
  )
  ```
- Prepare Video Input:

  Download and decode the video into frames with OpenCV and PIL (an optional frame-subsampling sketch follows this list):

  ```python
  import cv2
  import requests
  from PIL import Image

  url = "https://github.com/PKU-YuanGroup/Video-LLaVA/raw/main/videollava/serve/examples/sample_demo_1.mp4"

  def read_video(video_url):
      # Download the video to a temporary file
      response = requests.get(video_url)
      with open("tmp_video.mp4", "wb") as f:
          for chunk in response.iter_content(chunk_size=1024):
              f.write(chunk)

      # Decode every frame and convert from OpenCV's BGR to RGB for PIL
      video = cv2.VideoCapture("tmp_video.mp4")
      video_frames = []
      while video.isOpened():
          success, frame = video.read()
          if not success:
              break
          frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
          video_frames.append(Image.fromarray(frame_rgb))
      video.release()
      return video_frames

  video_frames = read_video(video_url=url)
  ```
- Perform Inference:

  Build the prompt, then generate and decode text output from the video frames:

  ```python
  import torch
  from llava.conversation import conv_templates
  from llava.mm_utils import tokenizer_image_token

  # Preprocess each frame into a half-precision image tensor on the GPU
  image_tensors = [
      image_processor.preprocess(frame, return_tensors='pt')['pixel_values'][0].half().cuda()
      for frame in video_frames
  ]

  # One <image> placeholder per frame, followed by the question
  text = "\n".join(['<image>' for _ in range(len(image_tensors))]) + '\n' + "Why is this video funny?"

  conv = conv_templates["llama_3"].copy()
  conv.append_message(conv.roles[0], text)
  conv.append_message(conv.roles[1], None)  # leave the assistant turn empty for generation
  prompt = conv.get_prompt()

  input_ids = tokenizer_image_token(prompt, tokenizer, return_tensors='pt').unsqueeze(0).cuda()
  with torch.inference_mode():
      output_ids = model.generate(input_ids, images=image_tensors, do_sample=False, max_new_tokens=512, use_cache=True)
  outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
  print(outputs[0])
  ```
- Cloud GPUs: Consider using cloud GPU providers such as AWS, Google Cloud, or Azure for better performance and faster processing.
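The read_video helper above returns every decoded frame, which can exceed the 800-frame cap for longer videos. The sketch below subsamples one frame out of every 30 before inference, mirroring the sampling described in the introduction; the helper name and the hard cap handling are assumptions, not part of the released code.

```python
# Hypothetical helper: keep one frame out of every 30 and cap at 800 frames,
# mirroring the sampling strategy described in the introduction.
def subsample_frames(frames, every_n=30, max_frames=800):
    sampled = frames[::every_n]
    return sampled[:max_frames]

video_frames = subsample_frames(video_frames)
```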
License
This model is released under a Creative Commons license, as indicated in the model card metadata.