HoVLE-HD
OpenGVLab
Introduction
HoVLE is a novel monolithic vision-language model (VLM) designed to process images and text in a unified manner. The core innovation is a holistic embedding module that projects both image and text inputs into a shared embedding space, allowing the Large Language Model (LLM) to interpret images the same way it interprets text. HoVLE-HD, with 2.6 billion parameters, is an enhanced version that uses high-definition image inputs to achieve strong results on visual question answering benchmarks.
Architecture
The architecture of HoVLE includes a holistic embedding module and an LLM, both utilizing causal Transformer layers. The model operates in three stages:
- Stage I (Distillation): Teaches the holistic embedding module to reproduce image features from a pre-trained visual encoder and text embeddings from an LLM.
- Stage II (Alignment): Involves auto-regressive training to align different modalities into a shared embedding space.
- Stage III (Instruction Tuning): Enhances the model's ability to follow instructions, making the entire model trainable.
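The key idea of the holistic embedding module is that both modalities end up as one token sequence in a shared space before reaching the LLM. The sketch below illustrates this with a single linear projection per modality; all dimensions and names are invented for exposition, and the real module uses causal Transformer layers rather than plain linear maps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration.
PATCH_DIM, TEXT_VOCAB, HIDDEN = 768, 32000, 2048

# Stand-ins for the holistic embedding module's learned parameters.
W_image = rng.standard_normal((PATCH_DIM, HIDDEN)) * 0.02   # image patches -> shared space
E_text = rng.standard_normal((TEXT_VOCAB, HIDDEN)) * 0.02   # token ids -> shared space

def embed(image_patches: np.ndarray, token_ids: np.ndarray) -> np.ndarray:
    """Project both modalities into one shared embedding space and
    concatenate them into a single sequence for the causal LLM."""
    img_emb = image_patches @ W_image           # (num_patches, HIDDEN)
    txt_emb = E_text[token_ids]                 # (num_tokens, HIDDEN)
    return np.concatenate([img_emb, txt_emb])   # (num_patches + num_tokens, HIDDEN)

seq = embed(rng.standard_normal((256, PATCH_DIM)), np.array([1, 42, 7]))
print(seq.shape)  # (259, 2048)
```

Because the concatenated sequence looks like ordinary token embeddings to the LLM, no separate vision tower is needed at decode time.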
Training
Training HoVLE involves three stages focused on distillation, alignment, and instruction tuning. The holistic embedding module is the primary trainable component in the first two stages, while the final stage tunes the entire model.
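The staged schedule above can be summarized as a trainability table. The stage and component names follow the description in this section; everything else is illustrative.

```python
# Which components receive gradient updates in each training stage,
# per the three-stage recipe described above.
TRAINABLE = {
    "stage1_distillation":       {"holistic_embedding": True, "llm": False},
    "stage2_alignment":          {"holistic_embedding": True, "llm": False},
    "stage3_instruction_tuning": {"holistic_embedding": True, "llm": True},
}

def trainable_components(stage: str) -> list[str]:
    """Return the components that are unfrozen in the given stage."""
    return [name for name, on in TRAINABLE[stage].items() if on]

print(trainable_components("stage1_distillation"))        # ['holistic_embedding']
print(trainable_components("stage3_instruction_tuning"))  # ['holistic_embedding', 'llm']
```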
Guide: Running Locally
To run HoVLE locally, follow these steps:
- Install Dependencies:
  - Ensure you have Python installed and set up a virtual environment.
  - Install the transformers library, version 4.37.2.
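The install step might look like the following on a POSIX shell. The environment name is arbitrary, and installing torch alongside transformers is an assumption (transformers models need a backend); only the transformers==4.37.2 pin comes from this guide.

```shell
# Create and activate an isolated environment (name is arbitrary).
python -m venv hovle-env
source hovle-env/bin/activate

# Pin the transformers version this model card specifies;
# torch is assumed as the backend.
pip install transformers==4.37.2 torch
```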
- Load the Model:
  - Use the AutoModel and AutoTokenizer classes from the transformers library to load the model from the Hugging Face repository.
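A minimal loading sketch. The repository id OpenGVLab/HoVLE-HD, the bfloat16 dtype, and the trust_remote_code flag are assumptions (the flag is common for OpenGVLab releases); verify them against the model card before running.

```python
def load_hovle(model_id: str = "OpenGVLab/HoVLE-HD"):
    """Load HoVLE-HD with AutoModel/AutoTokenizer.

    The repository id and keyword arguments are assumptions for
    illustration; consult the model card for the exact values.
    """
    # Imported inside the function so the sketch can be read without
    # transformers installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModel.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # assumed dtype; see the model card
        trust_remote_code=True,
    ).eval().cuda()                  # assumes a CUDA GPU is available
    return model, tokenizer
```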
- Prepare Data:
  - Use the provided preprocessing functions to prepare image inputs.
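The repository ships its own preprocessing utilities, which should be used for real inputs. The sketch below only illustrates the kind of step they perform for high-definition images: an InternVL-style dynamic-tiling heuristic that picks a tile grid matching the image's aspect ratio. The function name and the tile limit are invented, not HoVLE's actual code.

```python
from itertools import product

def pick_tile_grid(width: int, height: int, max_tiles: int = 12) -> tuple[int, int]:
    """Choose a (cols, rows) tile grid whose aspect ratio is closest to
    the input image's, so a high-definition image can be split into
    fixed-size tiles. Illustrative only; use the repository's own
    preprocessing functions for real inputs."""
    target = width / height
    candidates = [
        (c, r)
        for c, r in product(range(1, max_tiles + 1), repeat=2)
        if c * r <= max_tiles  # cap the total number of tiles
    ]
    return min(candidates, key=lambda cr: abs(cr[0] / cr[1] - target))

print(pick_tile_grid(1920, 1080))  # (2, 1)
```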
- Run Inference:
  - Execute the inference script, modifying paths to your local environment as necessary.
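Putting the steps together, a single question-answering turn might look like the following. The chat interface and its arguments are assumptions borrowed from related OpenGVLab models and may differ in HoVLE's actual inference script; treat this as a sketch, not the repository's API.

```python
def ask(model, tokenizer, pixel_values, question: str) -> str:
    """Run one visual question answering turn.

    The `chat` method and its keyword arguments are assumptions based on
    related OpenGVLab releases; verify them against the repository's
    inference script before use.
    """
    generation_config = dict(max_new_tokens=512, do_sample=False)
    return model.chat(tokenizer, pixel_values, question, generation_config)
```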
For optimal performance, consider using cloud GPUs like those available from AWS, Google Cloud, or Azure.
License
This project is released under the MIT license, while InternLM2 is licensed under the Apache-2.0 license.