LLaVA-UHD v2
Introduction
LLaVA-UHD v2 is an advanced multimodal large language model (MLLM) that integrates diverse levels of visual granularity through a hierarchical window transformer. It is intended primarily for research on large multimodal models and chatbots, targeting researchers and AI enthusiasts in computer vision and natural language processing.
Architecture
The model utilizes a hierarchical window transformer to construct a high-resolution feature pyramid that captures multiple levels of visual granularity. This architecture lets the language model draw on both coarse semantics and fine-grained visual detail when generating text about an image.
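To make the idea concrete, below is a minimal, hypothetical PyTorch sketch of the general technique: feature maps from several pyramid levels are split into spatial windows, and each window is compressed into a few tokens via cross-attention with learnable queries. The class names, dimensions, and the average-pooling pyramid are illustrative assumptions, not the official LLaVA-UHD v2 implementation.

```python
# Illustrative sketch only: compress a multi-level feature pyramid into visual
# tokens with window-wise cross-attention. Names and sizes are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class WindowCompressor(nn.Module):
    """Compress each spatial window of a feature map into `num_queries` tokens."""

    def __init__(self, dim: int, window: int, num_queries: int, num_heads: int = 8):
        super().__init__()
        self.window = window
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W), with H and W divisible by the window size
        b, c, h, w = feat.shape
        ws = self.window
        # Split into non-overlapping windows -> (B * nWindows, ws*ws, C)
        windows = (
            feat.unfold(2, ws, ws).unfold(3, ws, ws)   # (B, C, nH, nW, ws, ws)
            .permute(0, 2, 3, 4, 5, 1)                 # (B, nH, nW, ws, ws, C)
            .reshape(b * (h // ws) * (w // ws), ws * ws, c)
        )
        # Learnable queries attend to the tokens inside each window.
        q = self.queries.unsqueeze(0).expand(windows.size(0), -1, -1)
        pooled, _ = self.attn(q, windows, windows)     # (B*nWindows, num_queries, C)
        return pooled.reshape(b, -1, c)                # (B, nWindows*num_queries, C)


class ToyHierarchicalEncoder(nn.Module):
    """Build a small feature pyramid and compress every level into visual tokens."""

    def __init__(self, dim: int = 256, window: int = 4, num_queries: int = 4):
        super().__init__()
        self.compressor = WindowCompressor(dim, window, num_queries)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: highest-resolution feature map, (B, C, H, W)
        pyramid = [feat, F.avg_pool2d(feat, 2), F.avg_pool2d(feat, 4)]
        tokens = [self.compressor(level) for level in pyramid]
        return torch.cat(tokens, dim=1)                # (B, total_tokens, C)


if __name__ == "__main__":
    encoder = ToyHierarchicalEncoder()
    vision_feat = torch.randn(1, 256, 32, 32)          # stand-in for ViT features
    print(encoder(vision_feat).shape)                  # (1, 336, 256): 256 + 64 + 16 tokens
```

The design choice sketched here is the same trade-off the paper's architecture targets: high-resolution levels contribute many local tokens while coarser levels contribute fewer global ones, keeping the total visual-token count bounded.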
Training
The model was trained in November 2024 using several datasets:
- JBU Pretrain: MS-COCO stuff 2017.
- Pretrain: LLaVA-Pretrain 558K, which includes filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
- SFT: An 858k-sample mixed dataset, available on Hugging Face.
Guide: Running Locally
- Setup Environment:
  - Ensure Python and the necessary libraries, such as `transformers`, are installed.
- Download Model:
  - Clone the repository or download the model files from Hugging Face.
- Load the Model:
  - Use the `transformers` library to load the model and begin inference (see the sketch after this list).
- Run Inference:
  - Prepare your image-text input and generate text using the model's capabilities.
- Cloud GPUs:
  - For optimal performance, consider using cloud GPU services such as AWS, Google Cloud, or Azure to handle computations efficiently.
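The exact loading interface depends on the release. The snippet below is a hedged sketch of the load-and-infer steps assuming the checkpoint can be loaded through `transformers` with `trust_remote_code`; the repository ID, processor usage, image URL, and prompt format are all assumptions, and the official repository may ship its own inference script instead.

```python
# Hypothetical sketch: the model ID and the AutoProcessor / AutoModelForCausalLM
# loading path are assumptions; the released checkpoint may require the loading
# utilities from its own repository. Install dependencies first, e.g.:
#   pip install transformers torch pillow requests
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "YipengZhang/LLaVA-UHD-v2"  # hypothetical Hugging Face model ID

# Load the processor and model; trust_remote_code pulls in any custom classes.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",          # place weights on the available GPU(s)
    trust_remote_code=True,
)

# Prepare an image-text input (placeholder image URL).
image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)
prompt = "Describe this image in detail."
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

# Generate and decode a response.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```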
License
LLaVA-UHD v2 is distributed under the LLAMA 2 Community License (Copyright (c) Meta Platforms, Inc.; all rights reserved). For questions or comments, refer to the project's GitHub issues page.