InternVideo2_chat_8B_HD LLM Model

Introduction

InternVideo2-Chat-8B-HD is a video-text-to-text model developed by OpenGVLab. It extends the semantic capabilities of InternVideo2 by building it into a VideoLLM that combines InternVideo2 with a video BLIP and a large language model (LLM). The model is trained with a progressive learning scheme. Using it requires permission to access its base LLM, Mistral-7B.

Architecture

InternVideo2-Chat-8B-HD pairs a video encoder, which is updated during training, with Mistral-7B as its base LLM. A video BLIP handles communication between the video side and open-source LLMs. The model takes video input for multimodal understanding and generates text output.

Training

InternVideo2-Chat-8B-HD is trained with a focus on enhancing its video-text interaction capabilities. The training involves updating the video encoder to improve its performance in tasks requiring video understanding. Detailed training methodologies are documented in related research papers.

Guide: Running Locally

Permissions: Obtain access permissions for the project and the base LLM, Mistral-7B.
Environment Setup: Set your Hugging Face user access token as the HF_TOKEN environment variable:
```
export HF_TOKEN=hf_...
```
Dependencies: Ensure transformers version 4.38.0 or higher is installed. Install additional requirements from the provided requirements.txt file.
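
The setup steps above can be verified before any weights are downloaded. The following is a minimal sketch, not part of the project: the helper names (`parse_version`, `preflight`, `installed_transformers_version`) are hypothetical, and the check only covers the HF_TOKEN variable and the transformers minimum version stated above.

```python
import os
import re
from importlib import metadata


def parse_version(text):
    """Turn '4.38.0' or '4.40.0.dev0' into a 3-tuple of ints such as (4, 38, 0)."""
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad short versions such as '4.38' so tuples compare field by field.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def preflight(env, installed_version, minimum="4.38.0"):
    """Return a list of problems; an empty list means the setup looks ready."""
    problems = []
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set")
    if installed_version is None:
        problems.append("transformers is not installed")
    elif parse_version(installed_version) < parse_version(minimum):
        problems.append(
            f"transformers {installed_version} is older than {minimum}")
    return problems


def installed_transformers_version():
    """Look up the installed transformers version without importing it."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for problem in preflight(os.environ, installed_transformers_version()):
        print("preflight:", problem)
```

Running the script prints one line per problem found and prints nothing when both checks pass. It does not check whether access to the gated repositories has been granted.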

Inference: Use the following Python code snippet to perform inference with video input:

```
import os

import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = 'OpenGVLab/InternVideo2_chat_8B_HD'

# Read the access token exported in the Environment Setup step.
token = os.environ['HF_TOKEN']

# trust_remote_code=True lets transformers run the model code shipped in the repository.
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID, trust_remote_code=True, use_fast=False, token=token)

# Load the weights in bfloat16, passing the same token used for the tokenizer.
model = AutoModel.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True, token=token)

# Move the model to the GPU when one is available; otherwise it stays on the CPU.
if torch.cuda.is_available():
    model = model.cuda()

# Additional code to load and process video inputs omitted for brevity
```
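
The omitted video-loading step normally starts by deciding which frames to decode. The sketch below shows one common approach, uniform sampling of the middle frame from equal-length segments. It is an illustration only: the function name `uniform_frame_indices` and the segment count are hypothetical and are not taken from the model's own code, which may sample frames differently.

```python
def uniform_frame_indices(total_frames, num_segments):
    """Return the middle frame index of each of `num_segments` equal spans."""
    if total_frames <= 0 or num_segments <= 0:
        raise ValueError("total_frames and num_segments must be positive")
    span = total_frames / num_segments
    # Clamp to the last valid index in case rounding lands past the end.
    return [
        min(total_frames - 1, int(span * i + span / 2))
        for i in range(num_segments)
    ]


if __name__ == "__main__":
    # An 80-frame clip split into 8 segments of 10 frames each.
    print(uniform_frame_indices(80, 8))
```

The resulting indices would then be passed to a video reader to fetch the frames, which are resized, normalized, and stacked into a tensor before being handed to the model.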

Cloud GPUs: Consider using cloud GPUs from providers such as AWS, Google Cloud, or Azure to speed up processing, especially for video data.
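
When choosing a GPU, a rough estimate of the memory needed for the weights alone can help. The sketch below assumes about 8 billion parameters, as the model name suggests (the exact count is not stated here), and 2 bytes per parameter for bfloat16. The function name is hypothetical, and the estimate excludes activations, video frames, and any framework overhead.

```python
def weight_memory_gib(num_params, bytes_per_param=2):
    """Estimate memory for model weights in GiB (bfloat16 uses 2 bytes each)."""
    return num_params * bytes_per_param / 2**30


if __name__ == "__main__":
    # Assumption: roughly 8e9 parameters, inferred from the "8B" in the name.
    print(f"{weight_memory_gib(8e9):.1f} GiB for weights alone")
```

Under these assumptions the weights need roughly 15 GiB, so a GPU with more memory than that is needed to leave room for video inputs and activations.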

License

InternVideo2-Chat-8B-HD is licensed under the MIT License. Users must agree not to conduct experiments that could harm human subjects.