InternViT-300M-448px-V2_5
OpenGVLab
Introduction
InternViT-300M-448px-V2_5 is an advanced vision model that builds upon the existing InternViT-300M-448px architecture. It uses Vision Transformer (ViT) incremental learning with next token prediction (NTP) loss to enhance visual feature extraction, particularly in domains underrepresented in large-scale datasets like LAION-5B, including multilingual OCR and mathematical charts.
Architecture
InternVL 2.5 maintains the "ViT-MLP-LLM" architecture from its predecessors, integrating an incrementally pre-trained InternViT with various pre-trained large language models (LLMs), such as InternLM 2.5 and Qwen 2.5, using a randomly initialized MLP projector. The model includes a pixel unshuffle operation, reducing visual tokens to a quarter of their original count, and employs a dynamic resolution strategy, dividing images into 448×448 pixel tiles. New support for multi-image and video data has been added since InternVL 2.0.
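Concretely, the pixel unshuffle can be thought of as a space-to-depth operation on the grid of patch features: each 2×2 block of tokens is folded into the channel dimension, so the MLP projector sees a quarter as many tokens. The following is a minimal sketch of that operation, not the model's exact implementation:
import torch

def pixel_unshuffle(x: torch.Tensor, scale: int = 2) -> torch.Tensor:
    """Merge each scale x scale block of visual tokens into one token.

    x: (batch, height, width, channels) grid of ViT patch features.
    Returns (batch, height/scale, width/scale, channels*scale*scale),
    i.e. 4x fewer tokens for scale=2, at the cost of a wider channel dim.
    """
    b, h, w, c = x.shape
    x = x.view(b, h // scale, scale, w // scale, scale, c)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(b, h // scale, w // scale, c * scale * scale)

# A 448x448 tile with 14x14 patches gives a 32x32 grid (1,024 tokens);
# after unshuffling, the projector receives a 16x16 grid (256 tokens).
tokens = torch.randn(1, 32, 32, 1024)
print(pixel_unshuffle(tokens).shape)  # torch.Size([1, 16, 16, 4096])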
Training
Dynamic High-Resolution for Multimodal Data
InternVL 2.5 employs a dynamic high-resolution training approach for handling multi-image and video datasets. For single-image datasets, tiles are allocated to a single image for maximum resolution. Multi-image datasets distribute tiles across all images, while videos resize each frame to 448×448 pixels.
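As an illustration of the single-image case, the sketch below splits an image into 448×448 tiles using a grid chosen to roughly match its aspect ratio. The function name and tile budget are assumptions for illustration, and the released preprocessing additionally appends a global thumbnail tile, which is omitted here; for multi-image inputs the tile budget is shared across images, and video frames are simply resized to a single 448×448 tile each.
import math
from PIL import Image

TILE = 448  # base tile size used by InternVL 2.5

def split_into_tiles(image: Image.Image, max_tiles: int = 12):
    """Illustrative dynamic-resolution tiling (simplified)."""
    w, h = image.size
    aspect = w / h
    # Pick a cols x rows layout within the tile budget whose shape is
    # closest to the image's aspect ratio.
    cols, rows = min(
        ((c, r) for r in range(1, max_tiles + 1) for c in range(1, max_tiles + 1)
         if r * c <= max_tiles),
        key=lambda cr: abs(cr[0] / cr[1] - aspect),
    )
    # Resize to the chosen grid, then crop out the 448x448 tiles.
    resized = image.resize((cols * TILE, rows * TILE))
    return [
        resized.crop((x * TILE, y * TILE, (x + 1) * TILE, (y + 1) * TILE))
        for y in range(rows) for x in range(cols)
    ]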
Single Model Training Pipeline
The training pipeline consists of three stages (a sketch of the stage-wise freezing schedule follows the list):
- Stage 1: MLP Warmup - Trains only the MLP projector with the vision encoder and language model frozen, using a dynamic high-resolution training strategy.
- Stage 1.5: ViT Incremental Learning (Optional) - Optionally trains the vision encoder and MLP projector to improve handling of rare domains.
- Stage 2: Full Model Instruction Tuning - The entire model is trained on high-quality multimodal instruction datasets with strict data quality controls.
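A minimal sketch of the freezing schedule implied by these stages, assuming a wrapper model exposing vision_model, mlp1, and language_model submodules (attribute names assumed for illustration and possibly different from the released training code):
def configure_trainable(model, stage: str):
    """Toggle which submodules receive gradients in each training stage."""
    trainable = {
        'mlp_warmup': {'mlp1'},                                          # Stage 1
        'vit_incremental': {'vision_model', 'mlp1'},                     # Stage 1.5
        'instruction_tuning': {'vision_model', 'mlp1', 'language_model'} # Stage 2
    }[stage]
    for name in ('vision_model', 'mlp1', 'language_model'):
        for p in getattr(model, name).parameters():
            p.requires_grad = name in trainable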
Guide: Running Locally
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# Load the vision encoder in bfloat16 on the GPU (trust_remote_code is needed
# because the model class is defined in the repository, not in transformers).
model = AutoModel.from_pretrained(
    'OpenGVLab/InternViT-300M-448px-V2_5',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()

# Preprocess an example image into 448x448 pixel values.
image = Image.open('./examples/image1.jpg').convert('RGB')
image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-300M-448px-V2_5')
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

# Extract visual features.
outputs = model(pixel_values)
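The forward pass returns a standard transformers output object; by the usual convention its last_hidden_state field holds the per-patch visual features that a downstream MLP projector would consume (field names are worth verifying against the model's remote code).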
Cloud GPUs
For faster inference, consider running the model on cloud GPUs such as those offered by AWS, Google Cloud, or Azure.
License
This project is released under the MIT License.