InternVideo2_Chat_8B_InternLM2.5

OpenGVLab

Introduction

InternVideo2_Chat_8B_InternLM2.5 is a model designed for video-text-to-text tasks. It builds on InternVideo2 by wrapping its video encoder in a VideoLLM, in which a video BLIP connector links the encoder to a large language model (LLM). The goal is richer semantic understanding of video and more natural human-model communication.

Architecture

The model uses InternVideo2 as its video encoder and integrates it with InternLM2.5-7B, which provides a 1M-token long context window. The architecture follows the progressive learning scheme of the VideoChat framework, enabling effective communication between the video encoder and the open-sourced LLM.
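The data flow can be summarized as video encoder → video BLIP connector → LLM. Below is a minimal, illustrative sketch of that composition; the class and attribute names are hypothetical and do not reflect the repository's actual implementation.

```python
# Illustrative sketch of the VideoLLM data flow described above. Names are
# hypothetical and do not match the repository's code.
import torch
import torch.nn as nn

class VideoLLMSketch(nn.Module):
    def __init__(self, video_encoder: nn.Module, video_blip: nn.Module, llm: nn.Module):
        super().__init__()
        self.video_encoder = video_encoder  # InternVideo2 backbone
        self.video_blip = video_blip        # BLIP-style connector producing query tokens
        self.llm = llm                      # InternLM2.5-7B (1M-token context window)

    def forward(self, video_frames: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # 1. Encode the sampled frames into visual features.
        visual_feats = self.video_encoder(video_frames)
        # 2. Compress the features into a small set of tokens in the LLM embedding space.
        visual_tokens = self.video_blip(visual_feats)
        # 3. Prepend the visual tokens to the text embeddings and run the LLM.
        return self.llm(torch.cat([visual_tokens, text_embeds], dim=1))
```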

Training

Training updates the video encoder and uses a video BLIP connector to mediate interaction with the LLM. Detailed training recipes are available in the VideoChat documentation, including how the model handles high-definition (HD) video inputs.
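The split between updated and frozen modules that this implies is sketched below, reusing the hypothetical VideoLLMSketch class from the Architecture section; the actual multi-stage recipe (including any PEFT/LoRA adaptation of the LLM) is defined in the VideoChat codebase.

```python
# Hypothetical illustration of which modules receive gradient updates; the real
# training schedule lives in the VideoChat code.
def configure_trainable_modules(model: VideoLLMSketch) -> None:
    # The video encoder and the video BLIP connector are updated.
    for p in model.video_encoder.parameters():
        p.requires_grad = True
    for p in model.video_blip.parameters():
        p.requires_grad = True
    # The LLM backbone is assumed frozen here (lightweight adapters, e.g. via
    # peft/LoRA, could be attached instead); this keeps the sketch simple.
    for p in model.llm.parameters():
        p.requires_grad = False
```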

Guide: Running Locally

  1. Prerequisites: Ensure you have transformers version >= 4.38.0 and peft == 0.5.0 installed.
  2. Install Dependencies: Use the requirements.txt file to install necessary Python packages.
  3. Set Up Environment:
    • Use a cloud GPU service for optimal performance, such as AWS EC2 with NVIDIA GPUs, Google Cloud Platform (GCP), or Azure.
  4. Inference with Video Input (a condensed sketch follows this list):
    • Load and sample video frames with Decord and preprocess them with PyTorch, applying the required transformations.
    • Initialize the model and tokenizer with Hugging Face's AutoTokenizer and AutoModel.
    • Run inference by passing the video tensor to the model and generating a text response.
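The sketch below condenses steps 2–4. The repository id, frame count, normalization constants, and the chat-style helper exposed by the model's remote code are assumptions drawn from typical usage of this model family; consult the model card for the exact interface.

```python
# Sketch of local inference with a video input. Dependency versions follow the
# prerequisites above (transformers >= 4.38.0, peft == 0.5.0). The repo id,
# sampling/preprocessing choices, and the chat() call are assumptions, not the
# verified API -- check the model card's example code.
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "OpenGVLab/InternVideo2_Chat_8B_InternLM2_5"  # assumed repo id

def load_video(path: str, num_frames: int = 8, size: int = 224) -> torch.Tensor:
    """Uniformly sample frames with Decord and return a (1, T, C, H, W) tensor."""
    vr = VideoReader(path, ctx=cpu(0))
    idx = np.linspace(0, len(vr) - 1, num_frames).astype(int)
    frames = torch.from_numpy(vr.get_batch(idx).asnumpy())   # (T, H, W, C), uint8
    frames = frames.permute(0, 3, 1, 2).float() / 255.0      # (T, C, H, W), float
    transform = T.Compose([
        T.Resize((size, size), antialias=True),
        T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    ])
    return transform(frames).unsqueeze(0)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True, use_fast=False)
model = AutoModel.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16,
                                  trust_remote_code=True).eval().cuda()

video_tensor = load_video("example.mp4").to(torch.bfloat16).cuda()

# The remote code of this model family typically exposes a chat-style helper;
# the argument names below are assumed and may differ in the actual repository.
response, history = model.chat(
    tokenizer, "", "Describe the video step by step.",
    media_type="video", media_tensor=video_tensor,
    chat_history=[], return_history=True,
    generation_config={"do_sample": False},
)
print(response)
```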

License

The model is released under the MIT License, allowing for broad usage and modification with minimal restrictions.