LLaVA-v1.5-7B

liuhaotian

Introduction

LLaVA-v1.5-7B is an open-source chatbot model fine-tuned on GPT-generated multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture, intended for research on large multimodal models and chatbots.

Architecture

The LLaVA model builds on the LLaMA/Vicuna language-model family and follows a transformer architecture: text is generated auto-regressively, conditioned on combined image and text inputs. A minimal sketch of this layout follows.
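The sketch below illustrates this design in PyTorch under stated assumptions: a vision encoder turns the image into patch features, a small projector maps them into the language model's embedding space, and the decoder generates text auto-regressively over the combined sequence. The class, parameter names, and projector shape are illustrative assumptions, not the reference implementation.

```python
# Minimal, illustrative sketch (not the reference implementation): image
# features from a vision encoder are projected into the language model's
# embedding space and decoded auto-regressively together with the text.
import torch
import torch.nn as nn

class LlavaStyleSketch(nn.Module):
    def __init__(self, vision_encoder, language_model, vision_dim=1024, hidden_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g. a ViT image encoder (assumed interface)
        self.projector = nn.Sequential(           # maps image features to the LLM hidden size
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.language_model = language_model      # e.g. a Vicuna/LLaMA decoder accepting inputs_embeds

    def forward(self, pixel_values, text_embeds):
        image_features = self.vision_encoder(pixel_values)    # (B, num_patches, vision_dim)
        image_tokens = self.projector(image_features)         # (B, num_patches, hidden_dim)
        # Image tokens are placed alongside the text embeddings; the decoder
        # then predicts the response token by token.
        inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs_embeds)
```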

Training

The training dataset for LLaVA-v1.5-7B includes:

  • 558,000 filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
  • 158,000 GPT-generated multimodal instruction-following data points.
  • 450,000 samples of academic-task-oriented VQA data mixture.
  • 40,000 entries from ShareGPT data.

Evaluation covers 12 benchmarks: 5 academic VQA benchmarks and 7 recent benchmarks proposed specifically for instruction-following LMMs.

Guide: Running Locally

To run LLaVA-v1.5-7B locally, follow these steps:

  1. Clone the Repository: Ensure you have Git installed and clone the model repository from Hugging Face.
  2. Install Dependencies: Use pip to install necessary libraries such as PyTorch and Transformers.
  3. Download the Model: Use the Hugging Face model hub to download the LLaVA-v1.5-7B model files.
  4. Run the Model: Load the model in your Python environment and test it with your own data, or use pre-existing datasets for evaluation (a minimal example follows this list).
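As a concrete starting point, the sketch below loads the Transformers-compatible conversion of the checkpoint (llava-hf/llava-1.5-7b-hf, an assumption; the original liuhaotian/llava-v1.5-7b weights are loaded through the LLaVA codebase instead) and runs a single image-question round trip. The prompt template follows the usual LLaVA-1.5 convention but should be checked against the repository you cloned.

```python
# pip install torch transformers pillow requests
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumption: using the Transformers-converted checkpoint rather than the
# original liuhaotian/llava-v1.5-7b weights, which require the LLaVA codebase.
model_id = "llava-hf/llava-1.5-7b-hf"

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # half precision to fit a single ~16 GB GPU
    device_map="auto",
)

# LLaVA-1.5 conversation template: the <image> token marks where the
# projected image features are inserted.
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```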

For optimal performance, it is recommended to use cloud-based GPUs such as those provided by AWS, Google Cloud, or Azure.

License

LLaVA-v1.5-7B is released under the Llama 2 Community License (Copyright © Meta Platforms, Inc. All Rights Reserved). For questions or feedback, contact the developers via the LLaVA GitHub issues page.