InternVL2_5-1B
OpenGVLab
Introduction
InternVL 2.5 is an advanced multimodal large language model (MLLM) series, building upon InternVL 2.0, with enhancements in both training/testing strategies and data quality. It combines visual and language model components for improved interaction between modalities.
Architecture
InternVL 2.5 retains the "ViT-MLP-LLM" architecture from its predecessors, integrating a pre-trained InternViT with various LLMs such as InternLM 2.5 and Qwen 2.5 using a newly initialized MLP projector. Key architectural refinements include pixel unshuffle operations to reduce visual token count and improved support for multi-image and video data.
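The pixel unshuffle operation can be sketched directly: it folds each 2×2 block of spatial tokens into the channel dimension, so a 448×448 tile that the ViT encodes as 32×32 = 1,024 patch tokens comes out as 16×16 = 256 visual tokens. The function below is a hedged re-implementation based on that description, not the repository's exact code; the tensor layout and shapes are assumptions.

```python
import torch

def pixel_unshuffle(x: torch.Tensor, scale_factor: float = 0.5) -> torch.Tensor:
    # x: (batch, height, width, channels) grid of ViT patch tokens
    # (layout assumed for illustration). Merges each r x r block of
    # spatial tokens into one token with r*r times the channels,
    # cutting the token count by a factor of r*r.
    n, h, w, c = x.shape
    r = int(1 / scale_factor)  # r = 2 for scale_factor = 0.5
    x = x.view(n, h, w // r, c * r)            # fold adjacent columns into channels
    x = x.permute(0, 2, 1, 3).contiguous()     # (n, w/r, h, c*r)
    x = x.view(n, w // r, h // r, c * r * r)   # fold adjacent rows into channels
    x = x.permute(0, 2, 1, 3).contiguous()     # (n, h/r, w/r, c*r*r)
    return x

tokens = torch.randn(1, 32, 32, 1024)  # 1,024 patch tokens per 448x448 tile
merged = pixel_unshuffle(tokens)
print(merged.shape)  # torch.Size([1, 16, 16, 4096]) -> 256 visual tokens
```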
Training
The training strategy involves dynamic high-resolution processing for multimodal datasets, a three-stage training pipeline enhancing visual perception and multimodal capabilities, and a progressive scaling strategy to align vision encoders with various LLMs. Techniques like random JPEG compression and loss reweighting are applied to enhance robustness and balance across dataset responses.
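Of these, random JPEG compression is easy to illustrate. The sketch below re-encodes a training image at a random quality level to simulate the compression artifacts of web images; the function name and quality range are assumptions rather than the project's exact implementation.

```python
import io
import random

from PIL import Image

def random_jpeg_compression(image: Image.Image,
                            quality_range: tuple = (75, 100)) -> Image.Image:
    # Re-encode the image as JPEG at a random quality, then decode it
    # again, so the model sees realistic compression artifacts during
    # training (quality range is an assumption).
    quality = random.randint(*quality_range)
    buffer = io.BytesIO()
    image.convert("RGB").save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    return Image.open(buffer).convert("RGB")
```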
Guide: Running Locally
- **Install Dependencies:**
  - Ensure your environment uses `transformers>=4.37.2` (see the version check below).
  - Install necessary libraries such as PyTorch and Hugging Face's Transformers.
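A minimal sketch of the version check (assuming the `packaging` helper, which ships with most pip-based environments):

```python
# Sketch: fail fast if the environment does not meet the version floor.
from packaging import version

import transformers

assert version.parse(transformers.__version__) >= version.parse("4.37.2"), (
    f"transformers>=4.37.2 required, found {transformers.__version__}"
)
```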
- **Model Loading:**
  - Load the model with 16-bit precision or 8-bit quantization using PyTorch, enabling settings like `bfloat16` and low memory usage, as sketched below.
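A minimal loading sketch using the standard `transformers` API. The repo id and keyword arguments follow the usual InternVL model-card pattern, but treat the details as assumptions rather than verbatim card code:

```python
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL2_5-1B"  # assumed Hugging Face repo id

# 16-bit (bfloat16) loading with reduced host-memory pressure.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,  # InternVL ships custom modeling code
).eval().cuda()

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# Alternative: 8-bit quantization (requires the bitsandbytes package).
# model = AutoModel.from_pretrained(
#     path,
#     load_in_8bit=True,
#     low_cpu_mem_usage=True,
#     trust_remote_code=True,
# ).eval()
```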
- **Multi-GPU Configuration:**
  - Use a device map to allocate model layers across multiple GPUs and prevent device mismatch errors; a sketch follows below.
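One common approach, hedged here as a sketch, pins the vision encoder, MLP projector, and embeddings to the first GPU and spreads the LLM layers across the rest, so the visual features land on the same device as the first language-model layers. The module names and layer count below are assumptions based on the InternVL architecture; verify them against `model.named_modules()`:

```python
import math

import torch
from transformers import AutoModel

def split_model(num_layers: int, num_gpus: int) -> dict:
    """Pin the vision encoder, projector, and embeddings to GPU 0 and
    spread the LLM layers across GPUs. Module names are assumptions
    based on the InternVL architecture."""
    device_map = {
        "vision_model": 0,
        "mlp1": 0,  # the MLP projector
        "language_model.model.embed_tokens": 0,
        "language_model.model.rotary_emb": 0,
        "language_model.model.norm": num_gpus - 1,
        "language_model.lm_head": num_gpus - 1,
    }
    per_gpu = math.ceil(num_layers / num_gpus)
    for i in range(num_layers):
        device_map[f"language_model.model.layers.{i}"] = min(i // per_gpu, num_gpus - 1)
    return device_map

model = AutoModel.from_pretrained(
    "OpenGVLab/InternVL2_5-1B",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map=split_model(num_layers=24, num_gpus=2),  # layer count assumed
).eval()
```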
- **Inference:**
  - Utilize dynamic preprocessing for images and videos.
  - Conduct image-text interactions and video analysis using the model's chat interface (see the example below).
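Continuing from the loading sketch above, a minimal single-image example. `model.chat` comes from the repository's `trust_remote_code` implementation; the preprocessing here resizes to a single 448×448 tile instead of reproducing the full dynamic tiling, so it is an illustration rather than the official pipeline:

```python
import torch
from PIL import Image
from torchvision import transforms

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def load_image(path: str, input_size: int = 448) -> torch.Tensor:
    # Single-tile preprocessing; the official example additionally tiles
    # large images dynamically into several 448x448 crops.
    transform = transforms.Compose([
        transforms.Resize((input_size, input_size),
                          interpolation=transforms.InterpolationMode.BICUBIC),
        transforms.ToTensor(),
        transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
    ])
    image = Image.open(path).convert("RGB")
    return transform(image).unsqueeze(0)  # (1, 3, 448, 448)

pixel_values = load_image("./example.jpg").to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)

# `<image>` marks where the visual tokens are spliced into the prompt.
question = "<image>\nPlease describe the image in detail."
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```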
- **Cloud GPU Suggestion:**
  - Consider using cloud GPU services like AWS, Google Cloud, or Azure for efficient training and inference.
License
The InternVL 2.5 series is released under the MIT License. Components such as the pre-trained Qwen2.5-0.5B-Instruct are licensed under the Apache License 2.0.