InternVideo2_chat_8B_HD

OpenGVLab

Introduction

InternVideo2-Chat-8B-HD is a video-text-to-text model developed by OpenGVLab. It enriches the semantics of InternVideo2 by incorporating it into a VideoLLM alongside a video BLIP and a large language model (LLM). The model is trained with a progressive learning scheme. Its base LLM, Mistral-7B, is access-gated, so permission must be obtained before the model can be used.

Architecture

InternVideo2-Chat-8B-HD combines three components: a video encoder, which is updated during training; a video BLIP, which bridges the video encoder and open-source LLMs; and Mistral-7B as the base LLM. The model takes video input for multimodal understanding and generates text output.

Training

InternVideo2-Chat-8B-HD is trained with a focus on enhancing its video-text interaction capabilities. The training involves updating the video encoder to improve its performance in tasks requiring video understanding. Detailed training methodologies are documented in related research papers.

Guide: Running Locally

  1. Permissions: Obtain access permissions for the project and the base LLM, Mistral-7B.

  2. Environment Setup: Set the Hugging Face user access token.

    export HF_TOKEN=hf_...
    
  3. Dependencies: Ensure transformers version 4.38.0 or higher is installed. Install additional requirements from the provided requirements.txt file.

  4. Inference: Use the following Python code snippet to perform inference with video input:

    import os
    import torch
    from transformers import AutoTokenizer, AutoModel
    
    # Read the access token exported in step 2
    token = os.environ['HF_TOKEN']
    tokenizer = AutoTokenizer.from_pretrained('OpenGVLab/InternVideo2_chat_8B_HD', trust_remote_code=True, use_fast=False, token=token)
    
    # Load the weights once, passing the same token used for the tokenizer
    model = AutoModel.from_pretrained('OpenGVLab/InternVideo2_chat_8B_HD', torch_dtype=torch.bfloat16, trust_remote_code=True, token=token)
    
    # Move the model to the GPU when one is available
    if torch.cuda.is_available():
        model = model.cuda()
    
    # Additional code to load and process video inputs omitted for brevity
    

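The inference snippet leaves video loading out. As a rough illustration of one common preprocessing step, the sketch below picks evenly spaced frame indices from a clip before the frames are decoded. The function name `uniform_frame_indices` and the segment-centre sampling rule are assumptions made for this example. They are not taken from the model's own loading code, which may sample frames differently.

```python
def uniform_frame_indices(total_frames, num_segments):
    """Return one frame index near the centre of each of num_segments equal spans.

    Hypothetical helper for illustration only; not part of the model repository.
    """
    if total_frames <= 0 or num_segments <= 0:
        raise ValueError("total_frames and num_segments must be positive")
    span = total_frames / num_segments  # length of each span, in frames
    # Take the midpoint of each span and clamp it to the last valid index
    return [min(total_frames - 1, int(span * i + span / 2))
            for i in range(num_segments)]
```

The returned indices would then be passed to whichever video decoder is in use. When the clip has fewer frames than requested segments, indices repeat rather than run out of range.
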
Cloud GPUs: Video inference is compute-intensive. Consider running it on cloud GPUs from providers such as AWS, Google Cloud, or Azure to speed up processing.
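Before starting a run, locally or on a rented GPU, the prerequisites from steps 2 and 3 can be checked programmatically. The following is a minimal sketch, assuming only the standard library. The names `preflight` and `parse_version` are hypothetical helpers introduced for this example, and the version parsing is deliberately simple (leading integers only).

```python
import os
from importlib import metadata

MIN_TRANSFORMERS = (4, 38, 0)  # minimum version named in step 3


def parse_version(text):
    """Turn '4.38.2' or '4.40.0.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:  # pad '4.38' to (4, 38, 0)
        parts.append(0)
    return tuple(parts)


def preflight(env=None, transformers_version=None):
    """Return a list of problems found; an empty list means the setup looks ready."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set")
    if transformers_version is None:
        try:
            # Look up the installed version without importing the package
            transformers_version = metadata.version("transformers")
        except metadata.PackageNotFoundError:
            transformers_version = None
    if transformers_version is None:
        problems.append("transformers is not installed")
    elif parse_version(transformers_version) < MIN_TRANSFORMERS:
        problems.append(f"transformers {transformers_version} is older than 4.38.0")
    return problems
```

Calling `preflight()` with no arguments inspects the current environment. The optional arguments exist so the check can be exercised without changing the real setup. The check does not verify that access to the gated repositories has actually been granted.
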

License

InternVideo2-Chat-8B-HD is licensed under the MIT License. Users must agree not to conduct experiments that could harm human subjects.
