LLaVA-UHD v2

YipengZhang

Introduction

LLaVA-UHD v2 is a multimodal large language model (MLLM) that integrates diverse visual granularities through a hierarchical window transformer. It is intended primarily for research on large multimodal models and chatbots, targeting researchers and AI enthusiasts in computer vision and natural language processing.

Architecture

The model centers on a hierarchical window transformer that constructs a high-resolution feature pyramid over the vision encoder's outputs, capturing multiple levels of visual granularity. This lets the language model condition on fine-grained visual detail when generating text, improving performance on tasks that demand high-fidelity image understanding.
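As a rough illustration of the feature-pyramid idea only (not the model's actual hierarchical window transformer or its JBU upsampling module), the sketch below resamples a vision encoder's patch features into several spatial resolutions; all function names, scales, and sizes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def build_feature_pyramid(patch_features: torch.Tensor, grid_size: int, scales=(2, 1, 0.5)):
    """Illustrative only: resample ViT patch features into a small multi-level pyramid.

    patch_features: (batch, grid_size * grid_size, dim) token embeddings from a vision encoder.
    Returns a list of (batch, dim, H_i, W_i) feature maps at different spatial resolutions.
    """
    b, n, d = patch_features.shape
    assert n == grid_size * grid_size, "expected a square patch grid"
    # Reshape the token sequence back into a 2D feature map.
    fmap = patch_features.transpose(1, 2).reshape(b, d, grid_size, grid_size)
    pyramid = []
    for s in scales:
        size = max(1, int(grid_size * s))
        # Bilinear resampling stands in for the model's learned up/down-sampling.
        level = F.interpolate(fmap, size=(size, size), mode="bilinear", align_corners=False)
        pyramid.append(level)
    return pyramid

# Example: a 24x24 patch grid of 1024-d features for a batch of 2 images.
feats = torch.randn(2, 24 * 24, 1024)
levels = build_feature_pyramid(feats, grid_size=24)
print([tuple(l.shape) for l in levels])  # [(2, 1024, 48, 48), (2, 1024, 24, 24), (2, 1024, 12, 12)]
```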

Training

The model was trained in November 2024 using several datasets:

  • JBU pretraining: MS COCO-Stuff 2017.
  • Pretraining: LLaVA-Pretrain 558K, which consists of filtered image-text pairs from LAION/CC/SBU captioned by BLIP.
  • SFT: an 858K mixed dataset available on Hugging Face.

Guide: Running Locally

  1. Set up the environment: ensure Python and the necessary libraries, such as transformers, are installed.
  2. Download the model: clone the repository or download the model files from Hugging Face.
  3. Load the model: use the transformers library to load the model and prepare it for inference.
  4. Run inference: prepare your image-text input and generate text with the model (a hedged example is sketched after this list).
  5. Cloud GPUs: for best performance, consider cloud GPU services such as AWS, Google Cloud, or Azure to handle the computation efficiently.
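The steps above can be condensed into a short script. The sketch below is illustrative rather than official: it assumes the checkpoint is hosted on the Hugging Face Hub under an ID like YipengZhang/LLaVA-UHD-v2 (inferred from the card, so verify the exact repo name), that it loads through the generic transformers auto classes with trust_remote_code=True, and that it accepts a LLaVA-style prompt template; consult the project's GitHub repository for the authoritative loading code.

```python
# Minimal inference sketch (assumptions noted above). Install dependencies first, e.g.:
#   pip install torch transformers accelerate pillow
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = "YipengZhang/LLaVA-UHD-v2"  # assumed repo ID; check the model card for the real one

# trust_remote_code lets transformers pull the model's custom architecture code, if any.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit on a single GPU
    device_map="auto",          # place weights on the available GPU(s)
    trust_remote_code=True,
)

image = Image.open("example.jpg")
# LLaVA-style prompt template; the actual template for this model may differ.
prompt = "USER: <image>\nDescribe this image in detail. ASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```

On a CPU-only machine, drop device_map="auto" and use torch.float32; expect high memory use at high input resolutions, which is why the guide recommends cloud GPUs.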

License

LLaVA-UHD v2 is distributed under the LLAMA 2 Community License, Copyright © Meta Platforms, Inc., All Rights Reserved. For questions or comments about the model, refer to the project's GitHub issues page.
