LLaVA-v1.5-7B
Introduction
LLaVA-v1.5-7B is an open-source chatbot model fine-tuned on GPT-generated multimodal instruction-following data. It is intended for research on large multimodal models and chatbots, and is built on an auto-regressive language model based on the transformer architecture.
Architecture
The LLaVA model is built upon the LLaMA/Vicuna framework and employs a transformer architecture. It functions as an auto-regressive language model that generates text conditioned on combined image and text inputs, as sketched below.
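The following PyTorch sketch illustrates this design at a conceptual level: visual features are projected into the language model's embedding space and prepended to the text-token embeddings, and the language model then predicts the next token auto-regressively over the joint sequence. The module names (vision_encoder, projector, language_model) are illustrative stand-ins, not the actual LLaVA code.

```python
# Conceptual sketch of a LLaVA-style architecture, not the official
# implementation. vision_encoder, projector, and language_model stand in
# for real modules (e.g. a ViT, an MLP, and a Vicuna-style LM).
import torch
import torch.nn as nn

class LlavaStyleModel(nn.Module):
    def __init__(self, vision_encoder: nn.Module, projector: nn.Module,
                 language_model: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # image -> patch features
        self.projector = projector            # patch features -> LM embedding space
        self.language_model = language_model  # auto-regressive transformer

    def forward(self, pixel_values: torch.Tensor, input_ids: torch.Tensor):
        # Encode the image and project its features into the token-embedding space.
        image_embeds = self.projector(self.vision_encoder(pixel_values))
        # Look up text-token embeddings with the LM's own embedding table.
        text_embeds = self.language_model.get_input_embeddings()(input_ids)
        # Prepend the image embeddings; the LM then predicts the next token
        # auto-regressively over the combined image-text sequence.
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs_embeds)
```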
Training
The training dataset for LLaVA-v1.5-7B includes:
- 558,000 filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
- 158,000 GPT-generated multimodal instruction-following samples.
- 450,000 samples from an academic-task-oriented VQA data mixture.
- 40,000 samples from ShareGPT.
The evaluation dataset comprises 12 benchmarks, featuring 5 academic VQA benchmarks and 7 benchmarks tailored for instruction-following LMMs.
Guide: Running Locally
To run LLaVA-v1.5-7B locally, follow these steps:
- Clone the Repository: Ensure you have Git installed and clone the model repository from Hugging Face.
- Install Dependencies: Use pip to install the necessary libraries, such as PyTorch and Transformers.
- Download the Model: Use the Hugging Face model hub to download the LLaVA-v1.5-7B model files.
- Run the Model: Load the model in your Python environment and test it with your own data, or use pre-existing datasets for evaluation; a minimal example follows below.
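The following sketch shows a minimal end-to-end run with the Hugging Face Transformers library. It assumes the community-converted llava-hf/llava-1.5-7b-hf checkpoint, which loads with plain Transformers (the original liuhaotian/llava-v1.5-7b weights require the LLaVA repository code); the image URL is a placeholder.

```python
# pip install torch transformers accelerate pillow requests
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Community-converted checkpoint that loads with plain Transformers.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit on a single ~16 GB GPU
    device_map="auto",          # place weights on the available GPU(s)
)

# LLaVA-v1.5 expects the <image> placeholder inside a USER/ASSISTANT prompt.
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
url = "https://example.com/cat.jpg"  # placeholder image URL
image = Image.open(requests.get(url, stream=True).raw)

# Cast floating-point inputs to the model's half precision before generating.
inputs = processor(text=prompt, images=image, return_tensors="pt").to(
    model.device, torch.float16
)
output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```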
For optimal performance, it is recommended to use cloud-based GPUs such as those provided by AWS, Google Cloud, or Azure.
License
LLaVA-v1.5-7B is licensed under the LLAMA 2 Community License, with all rights reserved by Meta Platforms, Inc. For inquiries or feedback, users can contact the developers via the GitHub issues page for LLaVA.