InternVideo2_chat_8B_HD
OpenGVLab
Introduction
InternVideo2-Chat-8B-HD is a video-text-to-text model developed by OpenGVLab. It enriches the semantics of InternVideo2 by integrating it into a VideoLLM built from a video BLIP and a large language model (LLM). The model is trained with a progressive learning scheme. Using it requires access permission for its base LLM, Mistral-7B.
Architecture
InternVideo2-Chat-8B-HD pairs a video encoder, which is updated during training, with Mistral-7B as its base LLM. A video BLIP interfaces the encoder with open-source LLMs. The model takes video input for multimodal understanding and generates text output.
Training
InternVideo2-Chat-8B-HD is trained with a focus on enhancing its video-text interaction capabilities. The training involves updating the video encoder to improve its performance in tasks requiring video understanding. Detailed training methodologies are documented in related research papers.
Guide: Running Locally
- Permissions: Obtain access permissions for the project and the base LLM, Mistral-7B.
- Environment Setup: Set the Hugging Face user access token.
    export HF_TOKEN=hf_...
- Dependencies: Ensure transformers version 4.38.0 or higher is installed. Install additional requirements from the provided requirements.txt file.
- Inference: Use the following Python code snippet to perform inference with video input:
    import os
    import torch
    from transformers import AutoTokenizer, AutoModel

    token = os.environ['HF_TOKEN']
    tokenizer = AutoTokenizer.from_pretrained('OpenGVLab/InternVideo2_chat_8B_HD', trust_remote_code=True, use_fast=False, token=token)
    if torch.cuda.is_available():
        model = AutoModel.from_pretrained('OpenGVLab/InternVideo2_chat_8B_HD', torch_dtype=torch.bfloat16, trust_remote_code=True).cuda()
    else:
        model = AutoModel.from_pretrained('OpenGVLab/InternVideo2_chat_8B_HD', torch_dtype=torch.bfloat16, trust_remote_code=True)
    # Additional code to load and process video inputs omitted for brevity
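The snippet above stops before video loading. A common first step in video preprocessing is choosing which frames to decode. The following is a minimal sketch of uniform frame sampling, offered only as an illustration: the helper name sample_frame_indices and the choice of 8 segments are assumptions, not taken from this model's code. The preprocessing the model actually expects is defined by its remote code and model card.

```python
def sample_frame_indices(total_frames: int, num_segments: int) -> list:
    """Pick the middle frame of each of `num_segments` equal-length segments.

    Hypothetical helper for illustration; not part of the model's own API.
    """
    if total_frames <= 0 or num_segments <= 0:
        raise ValueError("total_frames and num_segments must be positive")
    segment_length = total_frames / num_segments
    indices = []
    for i in range(num_segments):
        # Center of the i-th segment, clamped to the last valid frame index.
        center = int(segment_length * (i + 0.5))
        indices.append(min(center, total_frames - 1))
    return indices


if __name__ == "__main__":
    # An 80-frame clip split into 8 segments of 10 frames each.
    print(sample_frame_indices(80, 8))  # [5, 15, 25, 35, 45, 55, 65, 75]
```

The returned indices can then be passed to whichever video reader is in use. When a clip has fewer frames than segments, some indices repeat, which keeps the output length fixed.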
Cloud GPUs: Consider using cloud GPUs to expedite processing, especially for video data, from providers like AWS, Google Cloud, or Azure.
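Before starting a large model download, locally or on a rented GPU, the environment and dependency steps in this guide can be checked with a few lines of standard-library Python. This is a sketch under stated assumptions: the function names (version_tuple, meets_minimum, preflight) are hypothetical, and the version comparison is deliberately simple, so pre-release suffixes such as ".dev0" are ignored rather than ordered properly.

```python
import os
import re
from importlib import metadata

MIN_TRANSFORMERS = "4.38.0"  # minimum version stated in this guide


def version_tuple(version: str) -> tuple:
    """Turn '4.38.0.dev0' into (4, 38, 0), dropping non-numeric suffixes."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    """Return True if `installed` is at least `minimum`."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '4.38' compares equal to '4.38.0'.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def preflight(env=None) -> list:
    """Collect human-readable problems; an empty list means the checks passed."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set")
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        problems.append("transformers is not installed")
    else:
        if not meets_minimum(installed):
            problems.append(
                f"transformers {installed} is older than {MIN_TRANSFORMERS}"
            )
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("Problem:", problem)
```

The check does not verify that the token has been granted access to the gated repositories. That only becomes visible when from_pretrained is called.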
License
InternVideo2-Chat-8B-HD is licensed under the MIT License. Users must agree not to conduct experiments that could harm human subjects.