Intern V L2 26 B

OpenGVLab

Introduction

InternVL 2.0 is the latest generation in the InternVL series of multimodal large language models, offering instruction-tuned models ranging from 1 billion to 108 billion parameters. This repository hosts the InternVL2-26B model. InternVL 2.0 outperforms many open-source models and is competitive with commercial models on tasks such as document comprehension and scene text understanding, as well as on general multimodal benchmarks. It features an 8k context window and is trained on long texts, multiple images, and videos, improving upon its predecessor, InternVL 1.5.

Architecture

InternVL2-26B combines the InternViT-6B-448px-V1-5 vision encoder with the internlm2-chat-20b language model. Following the ViT-MLP-LLM design used across the InternVL series, the vision encoder's outputs are projected into the language model through an MLP connector, allowing the two components to work jointly on multimodal tasks.
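
As a quick way to verify this composition, the configuration of both sub-models can be inspected. The sketch below assumes the Hugging Face repository id OpenGVLab/InternVL2-26B and the vision_config/llm_config attribute names exposed by the model's remote code; check them against the actual repository before relying on them:

    from transformers import AutoConfig

    # trust_remote_code=True is required because InternVL2 ships a custom
    # configuration class alongside the weights.
    config = AutoConfig.from_pretrained("OpenGVLab/InternVL2-26B", trust_remote_code=True)
    print(config.vision_config)  # InternViT-6B-448px-V1-5 settings
    print(config.llm_config)     # internlm2-chat-20b settings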

Training

InternVL 2.0 is trained on extensive multimodal data and can process long text alongside multiple images and videos. The series includes instruction-tuned models at several scales, each optimized for multimodal tasks. Training focuses on strengthening document comprehension, scene text understanding, and cultural understanding.

Guide: Running Locally

To run InternVL2-26B locally:

  1. Environment Setup: Ensure transformers version 4.37.2 or higher is installed.
  2. Model Loading:
    • Load the model in FP16 or BF16 precision to reduce memory use.
    • Multi-GPU setups are supported for sharding the model when it does not fit on a single device.
  3. Inference:
    • Import the necessary libraries, such as torch and transformers.
    • Load the model with AutoModel.from_pretrained (passing trust_remote_code=True) and set it to evaluation mode.
    • Preprocess input images and videos into the pixel values the model expects.
  4. Run Inference: Use the model's chat interface to execute inference tasks, as in the sketch after this list.
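
The following is a minimal end-to-end sketch, assuming the Hugging Face repository id OpenGVLab/InternVL2-26B, the chat() helper exposed by the model's remote code, and a placeholder image path (example.jpg). The preprocessing is a simplified single-tile version of the reference pipeline, which additionally tiles large images dynamically:

    import torch
    import torchvision.transforms as T
    from PIL import Image
    from transformers import AutoModel, AutoTokenizer

    path = "OpenGVLab/InternVL2-26B"
    # chat() comes from the model's remote code, hence trust_remote_code=True.
    model = AutoModel.from_pretrained(
        path,
        torch_dtype=torch.bfloat16,  # or torch.float16 for FP16
        low_cpu_mem_usage=True,
        trust_remote_code=True,
    ).eval().cuda()                  # pass device_map="auto" instead to shard across GPUs
    tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

    # Resize to the 448x448 input of the vision encoder and normalize with
    # ImageNet statistics, matching the reference preprocessing.
    transform = T.Compose([
        T.Lambda(lambda img: img.convert("RGB")),
        T.Resize((448, 448), interpolation=T.InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    ])
    pixel_values = transform(Image.open("example.jpg")).unsqueeze(0)
    pixel_values = pixel_values.to(torch.bfloat16).cuda()

    generation_config = dict(max_new_tokens=512, do_sample=False)
    question = "<image>\nDescribe this image in detail."
    response = model.chat(tokenizer, pixel_values, question, generation_config)
    print(response)

For video input, the reference recipe roughly follows the same pattern: sample frames, preprocess each frame as above, and concatenate the resulting pixel values before calling chat().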

For better performance, consider using cloud GPUs such as those from AWS or Google Cloud.

License

This project is licensed under the MIT License. It incorporates the pre-trained internlm2-chat-20b model, which is released under the Apache License 2.0.
