InternVL2_5-4B-AWQ

OpenGVLab

Introduction

InternVL 2.5 is an advanced series of multimodal large language models (MLLMs) that evolves from the InternVL 2.0 architecture, introducing significant improvements in training and testing strategies as well as in data quality. This repository provides the AWQ-quantized weights of InternVL2_5-4B for efficient inference with LMDeploy.

Architecture

InternVL 2.5 retains the "ViT-MLP-LLM" paradigm of InternVL 1.5 and 2.0. The architecture couples an incrementally pre-trained InternViT vision encoder with pre-trained LLMs such as InternLM 2.5 and Qwen 2.5 through a randomly initialized MLP projector. Enhancements include a pixel unshuffle operation that reduces the number of visual tokens, a dynamic resolution strategy, and support for multi-image and video data.
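
The pixel unshuffle operation trades spatial resolution for channel depth, cutting the visual token count by a factor of scale squared (e.g., the 1024 ViT patches of a 448x448 tile become 256 visual tokens). The PyTorch sketch below illustrates the idea only; the channels-last layout and helper name are assumptions, not InternVL's actual implementation:

    import torch

    def pixel_unshuffle(x: torch.Tensor, scale: int = 2) -> torch.Tensor:
        # Fold scale x scale spatial neighborhoods into channels:
        # (B, H, W, C) -> (B, H/scale, W/scale, C*scale*scale)
        b, h, w, c = x.shape
        x = x.view(b, h // scale, scale, w // scale, scale, c)
        x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
        return x.view(b, h // scale, w // scale, c * scale * scale)

    # 448x448 tile with 14x14 patches -> 32x32 = 1024 patch embeddings
    patches = torch.randn(1, 32, 32, 1024)
    print(pixel_unshuffle(patches).shape)  # torch.Size([1, 16, 16, 4096])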

Training

The InternVL 2.5 series pairs vision and language components such as InternViT and Qwen 2.5. These components are incrementally pre-trained and then fine-tuned to improve performance across a wide range of multimodal tasks.

Guide: Running Locally

  1. Install LMDeploy:

    pip install "lmdeploy>=0.6.4"
    
  2. Run a 'Hello, World' Example:

    from lmdeploy import pipeline, TurbomindEngineConfig
    from lmdeploy.vl import load_image

    model = 'OpenGVLab/InternVL2_5-4B-AWQ'
    image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
    # Load the AWQ weights with the TurboMind backend
    pipe = pipeline(model, backend_config=TurbomindEngineConfig(model_format='awq'))
    response = pipe(('describe this image', image))
    print(response.text)
    
  3. Multi-Image Inference: Load multiple images and process them through the pipeline to handle complex tasks.
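
    A minimal multi-image sketch following LMDeploy's documented pattern; the image URLs and session_len value are illustrative assumptions:

    from lmdeploy import pipeline, TurbomindEngineConfig
    from lmdeploy.vl import load_image
    from lmdeploy.vl.constants import IMAGE_TOKEN

    model = 'OpenGVLab/InternVL2_5-4B-AWQ'
    pipe = pipeline(model, backend_config=TurbomindEngineConfig(model_format='awq', session_len=8192))

    image_urls = [
        'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
        'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
    ]
    images = [load_image(url) for url in image_urls]
    # Number the images in the prompt so the model can refer to each one
    response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
    print(response.text)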

  4. Batch Prompts Inference: Pass prompts as a list to process them in a single batch.
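
    A sketch of batch inference, reusing the illustrative URLs above; each prompt is a (text, image) pair and the pipeline returns one response per prompt:

    from lmdeploy import pipeline, TurbomindEngineConfig
    from lmdeploy.vl import load_image

    model = 'OpenGVLab/InternVL2_5-4B-AWQ'
    pipe = pipeline(model, backend_config=TurbomindEngineConfig(model_format='awq', session_len=8192))

    image_urls = [
        'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
        'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
    ]
    # A list of (text, image) tuples is processed as one batch
    prompts = [('describe this image', load_image(url)) for url in image_urls]
    responses = pipe(prompts)
    for response in responses:
        print(response.text)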

  5. Multi-Turn Conversations: Utilize the pipeline.chat interface for interactive sessions.
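
    A multi-turn sketch using the pipeline.chat interface; the sampling parameters and follow-up question are illustrative:

    from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
    from lmdeploy.vl import load_image

    model = 'OpenGVLab/InternVL2_5-4B-AWQ'
    pipe = pipeline(model, backend_config=TurbomindEngineConfig(model_format='awq', session_len=8192))

    image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
    gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
    # First turn: pass the image together with the opening question
    sess = pipe.chat(('describe this image', image), gen_config=gen_config)
    print(sess.response.text)
    # Follow-up turn: pass the returned session to keep the conversation history
    sess = pipe.chat('What is the person doing?', session=sess, gen_config=gen_config)
    print(sess.response.text)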

  6. Service Deployment: Deploy the model using LMDeploy's api_server to create RESTful APIs compatible with OpenAI interfaces.

    lmdeploy serve api_server OpenGVLab/InternVL2_5-4B-AWQ --server-port 23333
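
    Once the server is up, any OpenAI-compatible client can query it. A minimal sketch with the official openai package; the API key is a placeholder, since LMDeploy does not require one unless the server is configured to:

    from openai import OpenAI

    client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
    # Ask the server which model it is serving
    model_name = client.models.list().data[0].id
    response = client.chat.completions.create(
        model=model_name,
        messages=[{
            'role': 'user',
            'content': [
                {'type': 'text', 'text': 'describe this image'},
                {'type': 'image_url', 'image_url': {'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'}},
            ],
        }],
        temperature=0.8,
        top_p=0.8)
    print(response.choices[0].message.content)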
    

    For additional computational power, consider running the server on cloud GPU platforms such as AWS, Google Cloud, or Azure.

License

This project is licensed under the MIT License. It incorporates components such as the pre-trained Qwen2.5-3B-Instruct model, which is licensed under the Apache License 2.0.
