InternViT-6B-448px-V2_5
OpenGVLab
Introduction
InternViT-6B-448px-V2_5 is a vision transformer that builds on its predecessor, InternViT-6B-448px-V1-5. It applies ViT incremental learning with a next-token prediction (NTP) loss to strengthen visual feature extraction, particularly in domains that are underrepresented in large-scale datasets, such as multilingual OCR data and mathematical charts.
Architecture
The InternVL 2.5 series follows the "ViT-MLP-LLM" paradigm: a newly pre-trained InternViT is connected to a pre-trained LLM, such as InternLM 2.5 or Qwen 2.5, through a randomly initialized MLP projector. A pixel unshuffle operation reduces the number of visual tokens passed to the LLM. The model supports a dynamic resolution strategy that divides images into 448×448 pixel tiles and also handles multi-image and video data.
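The token reduction and tiling described above can be illustrated with a small pure-Python sketch. It is not the model's actual implementation: the patch size of 14, the unshuffle factor of 2, the `max_tiles` default, and the grid-selection heuristic are assumptions chosen for illustration, and all function names are hypothetical.

```python
TILE = 448      # tile side length in pixels (from the text above)
PATCH = 14      # assumed ViT patch size; not stated in this document
UNSHUFFLE = 2   # assumed pixel-unshuffle factor (2x2 neighbours become one token)


def tokens_per_tile(tile=TILE, patch=PATCH, factor=UNSHUFFLE):
    """Number of visual tokens for one tile after pixel unshuffle."""
    patches_per_side = tile // patch              # 448 // 14 = 32
    merged_per_side = patches_per_side // factor  # 32 // 2 = 16
    return merged_per_side * merged_per_side


def pixel_unshuffle(grid, factor=UNSHUFFLE):
    """Space-to-depth on an H x W grid of feature vectors (lists of floats).

    Each factor x factor neighbourhood is concatenated into one longer vector,
    so the token count drops by factor**2 and the channel count grows by factor**2.
    """
    height, width = len(grid), len(grid[0])
    out = []
    for r in range(0, height, factor):
        row = []
        for c in range(0, width, factor):
            merged = []
            for dr in range(factor):
                for dc in range(factor):
                    merged.extend(grid[r + dr][c + dc])
            row.append(merged)
        out.append(row)
    return out


def choose_tile_grid(width, height, max_tiles=12):
    """Pick a (cols, rows) tile grid for an image (simplified heuristic).

    Prefers the grid whose aspect ratio is closest to the image's; ties are
    broken by how closely the grid's pixel area matches the image area.
    """
    target = width / height
    candidates = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if cols * rows <= max_tiles
    ]
    return min(
        candidates,
        key=lambda g: (
            abs(g[0] / g[1] - target),
            abs(g[0] * g[1] * TILE * TILE - width * height),
        ),
    )


if __name__ == "__main__":
    cols, rows = choose_tile_grid(896, 448)
    print("grid:", (cols, rows), "tokens:", cols * rows * tokens_per_tile())
```

Under these assumptions a single 448×448 tile yields 32×32 patches, which pixel unshuffle merges into 16×16 tokens, one quarter of the original count.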
Training
InternVL 2.5 employs a dynamic high-resolution training approach for multi-image and video datasets, enhancing its multimodal data capabilities. The training pipeline comprises three stages:
- Stage 1: MLP Warmup - Only the MLP projector is trained while the vision encoder and language model are frozen.
- Stage 1.5: ViT Incremental Learning (Optional) - Incremental training of the vision encoder and MLP projector, improving performance in rare domains.
- Stage 2: Full Model Instruction Tuning - The entire model is trained on high-quality multimodal instruction datasets.
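The stage list above amounts to a schedule of which components receive gradient updates. A minimal sketch of that schedule follows; the component names and helper functions are hypothetical, and treating the language model as frozen in Stage 1.5 is inferred from the stage description rather than stated outright.

```python
from dataclasses import dataclass

# Hypothetical component names for the three parts of the ViT-MLP-LLM stack.
COMPONENTS = ("vision_encoder", "mlp_projector", "language_model")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset
    optional: bool = False


STAGES = [
    # Only the projector is trained; encoder and LLM stay frozen.
    Stage("Stage 1: MLP Warmup", frozenset({"mlp_projector"})),
    # Encoder and projector are trained; LLM assumed frozen (inferred).
    Stage(
        "Stage 1.5: ViT Incremental Learning",
        frozenset({"vision_encoder", "mlp_projector"}),
        optional=True,
    ),
    # Everything is trained.
    Stage("Stage 2: Full Model Instruction Tuning", frozenset(COMPONENTS)),
]


def trainable_flags(stage):
    """Map each component name to True if it is trained in this stage."""
    return {name: name in stage.trainable for name in COMPONENTS}


def frozen_components(stage):
    """List the components that stay frozen in this stage."""
    return [name for name in COMPONENTS if name not in stage.trainable]


if __name__ == "__main__":
    for stage in STAGES:
        suffix = " (optional)" if stage.optional else ""
        print(f"{stage.name}{suffix}: frozen = {frozen_components(stage)}")
```

In a PyTorch training script, flags like these would typically be applied by calling `requires_grad_(flag)` on each corresponding module.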
The vision encoder's representation quality is evaluated on image classification and semantic segmentation tasks.
Guide: Running Locally
To run InternViT-6B-448px-V2_5 locally:
- Install PyTorch, Hugging Face Transformers, and Pillow (the example below imports PIL to load the image).
- Load the model and image processor:
```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# Load the vision encoder in bfloat16 and move it to the GPU in eval mode.
model = AutoModel.from_pretrained(
    'OpenGVLab/InternViT-6B-448px-V2_5',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()

# Load an example image and preprocess it into a pixel tensor.
image = Image.open('./examples/image1.jpg').convert('RGB')
image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-448px-V2_5')
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

# Run a forward pass through the vision encoder.
outputs = model(pixel_values)
```
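- The example calls .cuda(), so it needs a CUDA-capable GPU with enough memory to hold the model. If none is available locally, use a cloud GPU instance from a provider such as AWS, Google Cloud, or Azure.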
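When sizing a GPU, a rough lower bound on memory is the space taken by the weights alone. The sketch below estimates it from a parameter count and a dtype. The 6 billion figure is a nominal value taken from the "6B" in the model name, not a measured parameter count, and the estimate ignores activations and framework overhead.

```python
# Bytes used per parameter for common dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}

# Assumption: nominal parameter count implied by "6B" in the model name.
NOMINAL_PARAMS = 6_000_000_000


def weight_memory_gib(num_params, dtype="bfloat16"):
    """Memory needed for the weights alone, in GiB (activations not included)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


if __name__ == "__main__":
    for dtype in ("bfloat16", "float32"):
        print(f"{dtype}: {weight_memory_gib(NOMINAL_PARAMS, dtype):.1f} GiB")
```

With the nominal count, bfloat16 weights come to roughly 11 GiB, so a GPU needs more than that once activations are included.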
License
This project is released under the MIT License.