InternViT-6B-448px-V2_5

OpenGVLab

Introduction

InternViT-6B-448px-V2_5 is an advanced vision transformer that improves upon its predecessor, InternViT-6B-448px-V1-5. It uses ViT incremental learning with a next-token-prediction (NTP) loss to strengthen visual feature extraction, especially in domains that are underrepresented in large-scale web datasets, such as multilingual OCR data and mathematical charts.

Architecture

The InternVL 2.5 series follows the "ViT-MLP-LLM" paradigm, integrating a newly pre-trained InternViT with various pre-trained LLMs such as InternLM 2.5 and Qwen 2.5. It uses a randomly initialized MLP projector and applies a pixel unshuffle operation that cuts the number of visual tokens to one quarter. The model supports dynamic resolution strategies for handling multi-image and video data, dividing images into 448×448 pixel tiles.
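
To make the token reduction concrete: pixel unshuffle folds each 2×2 neighborhood of the token grid into the channel dimension, so spatial resolution drops by 2× per side and the token count by 4×. Below is a minimal PyTorch sketch; the pixel_unshuffle helper and the toy tensor are illustrative, not the model's internal code (the grid sizes follow from a 448×448 tile with patch size 14, and 3200 is InternViT-6B's hidden size).

    import torch

    def pixel_unshuffle(x: torch.Tensor, factor: int = 2) -> torch.Tensor:
        """Fold a (B, H, W, C) token grid into (B, H/f, W/f, C*f*f),
        reducing the token count by factor**2."""
        b, h, w, c = x.shape
        x = x.view(b, h // factor, factor, w // factor, factor, c)
        x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
        return x.view(b, h // factor, w // factor, c * factor * factor)

    # A 448x448 tile with patch size 14 yields a 32x32 grid (1024 tokens);
    # a factor-2 unshuffle compresses it to 16x16 = 256 tokens.
    tokens = torch.randn(1, 32, 32, 3200)
    compressed = pixel_unshuffle(tokens, factor=2)
    print(compressed.shape)  # torch.Size([1, 16, 16, 12800])

The resulting channel-rich tokens are then mapped into the LLM embedding space by the MLP projector.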

Training

InternVL 2.5 extends its dynamic high-resolution training approach to multi-image and video datasets, strengthening its ability to handle diverse multimodal inputs. The training pipeline comprises three stages:

  • Stage 1: MLP Warmup - Only the MLP projector is trained while the vision encoder and language model are frozen (a freezing sketch follows this list).
  • Stage 1.5: ViT Incremental Learning (Optional) - Incremental training of the vision encoder and MLP projector, improving performance in rare domains.
  • Stage 2: Full Model Instruction Tuning - The entire model is trained on high-quality multimodal instruction datasets.
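
In practice, this staged schedule amounts to toggling which sub-modules receive gradients. A hedged sketch of the idea follows, using stand-in modules; the names, sizes, and learning rate are placeholders, not InternVL's actual training code.

    import torch
    import torch.nn as nn

    # Stand-in modules: the real pipeline uses InternViT, an MLP projector,
    # and an LLM such as InternLM 2.5 or Qwen 2.5.
    vision_encoder = nn.Linear(3200, 3200)
    mlp_projector = nn.Sequential(nn.Linear(3200, 4096), nn.GELU(), nn.Linear(4096, 4096))
    language_model = nn.Linear(4096, 4096)

    # Stage 1 (MLP warmup): freeze everything except the projector.
    for p in vision_encoder.parameters():
        p.requires_grad = False
    for p in language_model.parameters():
        p.requires_grad = False
    optimizer = torch.optim.AdamW(mlp_projector.parameters(), lr=1e-4)

    # Stage 1.5 (optional ViT incremental learning): also train the vision
    # encoder; the language model stays frozen.
    for p in vision_encoder.parameters():
        p.requires_grad = True

    # Stage 2 (full instruction tuning): unfreeze everything.
    for p in language_model.parameters():
        p.requires_grad = True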

Evaluation is conducted through image classification and semantic segmentation tasks, assessing the model's representation quality.
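A standard protocol for this kind of frozen-backbone assessment is linear probing: the encoder is frozen and only a linear head is fitted on its features. The sketch below is illustrative; the stand-in backbone and the class count are placeholders, not the actual evaluation code.

    import torch
    import torch.nn as nn

    backbone = nn.Linear(3, 3200)           # stand-in for the frozen vision encoder
    for p in backbone.parameters():
        p.requires_grad = False

    head = nn.Linear(3200, 1000)            # e.g. a 1000-class classification probe
    optimizer = torch.optim.SGD(head.parameters(), lr=0.1)
    criterion = nn.CrossEntropyLoss()

    # One training step: only the linear head is updated.
    features = backbone(torch.randn(8, 3))  # pooled image features in practice
    loss = criterion(head(features), torch.randint(0, 1000, (8,)))
    loss.backward()
    optimizer.step()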

Guide: Running Locally

To run InternViT-6B-448px-V2_5 locally:

  1. Install PyTorch and Hugging Face Transformers.
  2. Load the model and image processor, then run a forward pass (reading the outputs is shown after this list):
    import torch
    from PIL import Image
    from transformers import AutoModel, CLIPImageProcessor

    # Load the 6B vision encoder in bfloat16; trust_remote_code fetches the
    # custom InternViT modeling code from the Hub.
    model = AutoModel.from_pretrained(
        'OpenGVLab/InternViT-6B-448px-V2_5',
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True).cuda().eval()

    image = Image.open('./examples/image1.jpg').convert('RGB')

    # The bundled CLIP-style processor resizes and normalizes the image
    # to the model's 448x448 input.
    image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-448px-V2_5')

    pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
    pixel_values = pixel_values.to(torch.bfloat16).cuda()

    # Forward pass: extracts visual features for the image.
    outputs = model(pixel_values)
    
  3. Use cloud GPUs for better performance, such as those provided by AWS, Google Cloud, or Azure; the 6B encoder needs roughly 12 GB of GPU memory for its bfloat16 weights alone.
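
Once the forward pass completes, the features can be read off the returned object. The attribute names below assume a standard Transformers BaseModelOutputWithPooling; since this model uses trust_remote_code, print outputs to confirm. Continuing from the snippet above:

    # Per-token features: one embedding per visual token.
    token_features = outputs.last_hidden_state   # shape (1, num_tokens, hidden_dim)
    # A single pooled vector per image, e.g. for retrieval or linear probing.
    pooled = outputs.pooler_output               # shape (1, hidden_dim)
    print(token_features.shape, pooled.shape)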

License

This project is released under the MIT License.
