HoVLE-HD

OpenGVLab

Introduction

HoVLE is a monolithic vision-language model (VLM) designed to process images and text in a unified manner. Its core innovation is a holistic embedding module that projects both image and text inputs into a shared embedding space, allowing the Large Language Model (LLM) to interpret images in the same way it interprets text. HoVLE-HD, with 2.6 billion parameters, is an enhanced variant that uses high-definition image inputs and excels on visual question answering benchmarks.

Architecture

The architecture of HoVLE combines a holistic embedding module with an LLM, both built from causal Transformer layers. The model is trained in three stages (a conceptual sketch of the embedding module follows the list below):

  • Stage I (Distillation): Trains the holistic embedding module to reproduce the image features of a pre-trained visual encoder and the text embeddings of a pre-trained LLM, distilling both into a single module.
  • Stage II (Alignment): Uses auto-regressive training to align the image and text representations within the shared embedding space.
  • Stage III (Instruction Tuning): Makes the entire model trainable and tunes it to follow instructions.
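
The following is a minimal, hypothetical sketch of the holistic embedding idea: image patches and text tokens are projected into one sequence and passed through shared causal Transformer layers, so the downstream LLM receives a single unified embedding stream. All class names, dimensions, and layer counts here are illustrative assumptions, not values taken from the released code.

```python
import torch
import torch.nn as nn


class HolisticEmbedding(nn.Module):
    """Illustrative stand-in for the holistic embedding module (not the official code)."""

    def __init__(self, vocab_size=92544, patch_dim=3 * 14 * 14, hidden=2048, num_layers=8):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, hidden)       # flattened image patches -> shared space
        self.token_embed = nn.Embedding(vocab_size, hidden)  # text token ids -> shared space
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=16, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, patches, token_ids):
        # Build one sequence: image embeddings first, then text embeddings.
        x = torch.cat([self.patch_proj(patches), self.token_embed(token_ids)], dim=1)
        # Causal mask so each position attends only to earlier positions.
        seq_len = x.size(1)
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1)
        return self.blocks(x, mask=causal)  # unified embeddings handed to the LLM


# Example: 256 image patches plus 16 text tokens -> one 272-position embedding sequence.
emb = HolisticEmbedding()
out = emb(torch.randn(1, 256, 3 * 14 * 14), torch.randint(0, 92544, (1, 16)))
print(out.shape)  # torch.Size([1, 272, 2048])
```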

Training

Training HoVLE involves three stages focused on distillation, alignment, and instruction tuning. The holistic embedding module is the primary trainable component in the first two stages, while the final stage tunes the entire model.
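
As a rough illustration of Stage I, the embedding module's outputs can be regressed toward frozen teacher features: the vision encoder's features for image positions and the LLM's token embeddings for text positions. The loss below is only a hedged sketch of such feature distillation, not the exact objective used to train HoVLE.

```python
import torch.nn.functional as F


def stage1_distillation_loss(student_img, student_txt, teacher_img, teacher_txt):
    """Sketch of Stage I: match the holistic embedding module's outputs to frozen
    teacher targets (vision encoder for images, LLM embeddings for text).
    The actual training objective may differ."""
    img_loss = F.mse_loss(student_img, teacher_img.detach())
    txt_loss = F.mse_loss(student_txt, teacher_txt.detach())
    return img_loss + txt_loss
```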

Guide: Running Locally

To run HoVLE locally, follow these steps:

  1. Install Dependencies: Ensure Python is installed, set up a virtual environment, and install the transformers library, version 4.37.2.
  2. Load the Model: Use AutoModel and AutoTokenizer from the transformers library to load the model from its Hugging Face repository.
  3. Prepare Data: Use the preprocessing functions provided with the model to prepare image inputs.
  4. Run Inference: Execute the inference script, adjusting model and image paths to your local environment. A hedged end-to-end sketch covering steps 2-4 follows this list.
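
The snippet below is a minimal end-to-end sketch of loading the model and running a single query. The repository id (OpenGVLab/HoVLE-HD), the trust_remote_code flag, the resize/normalization values, and the model.chat interface are assumptions modeled on other OpenGVLab model cards; prefer the preprocessing helpers and prompt format published in the actual card.

```python
import torch
import torchvision.transforms as T
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Step 1: pip install transformers==4.37.2 (version pinned by the model card).

path = "OpenGVLab/HoVLE-HD"  # assumed repository id; verify against the card

# Step 2: load the tokenizer and model; the model ships custom code, hence trust_remote_code.
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
model = AutoModel.from_pretrained(path, torch_dtype=torch.bfloat16,
                                  trust_remote_code=True).eval().cuda()

# Step 3: stand-in preprocessing; the card's own helpers (e.g. dynamic high-definition
# tiling) should be preferred, this transform is only an approximation.
transform = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
pixel_values = transform(Image.open("example.jpg").convert("RGB"))
pixel_values = pixel_values.unsqueeze(0).to(torch.bfloat16).cuda()

# Step 4: run inference; the chat(...) call follows the interface used by other
# OpenGVLab cards and is an assumption here.
question = "<image>\nDescribe this image in detail."
response = model.chat(tokenizer, pixel_values, question,
                      generation_config=dict(max_new_tokens=256, do_sample=False))
print(response)
```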

For optimal performance, consider using cloud GPUs like those available from AWS, Google Cloud, or Azure.

License

This project is released under the MIT license, while the InternLM2 model used as the LLM component is licensed under the Apache-2.0 license.
