LLaVA-v1.5-13B
Maintained by liuhaotian
Introduction
LLaVA-v1.5-13B is an open-source chatbot model for image-text-to-text tasks. It is an auto-regressive language model based on the transformer architecture, obtained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data, and was trained in September 2023. Its primary intended use is research on large multimodal models and chatbots, targeting researchers and hobbyists in computer vision, natural language processing, machine learning, and AI.
Architecture
LLaVA-v1.5-13B is an auto-regressive, transformer-based language model. It handles multimodal input by encoding an image with a vision encoder, projecting the resulting features into the language model's embedding space, and then generating text conditioned on both the image features and the text prompt.
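For concreteness, the short sketch below shows how a single image and a text prompt are assembled into one model input. This is a minimal sketch, not the project's own code: it assumes the community Hugging Face transformers port of the weights ("llava-hf/llava-1.5-13b-hf") and a hypothetical local image path.

from transformers import AutoProcessor
from PIL import Image

# The processor bundles the tokenizer and the image preprocessor for the
# community transformers port of LLaVA-v1.5-13B (assumption, see above).
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-13b-hf")

image = Image.open("example.jpg")  # hypothetical local image
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

# The text is tokenized and the image resized/normalized; during generation
# the <image> placeholder is replaced by projected vision features, so the
# language model sees a single interleaved token sequence.
inputs = processor(images=image, text=prompt, return_tensors="pt")
print(inputs["input_ids"].shape, inputs["pixel_values"].shape)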
Training
The model was trained on the following data mixture:
- 558K image-text pairs from LAION/CC/SBU, captioned by BLIP.
- 158K GPT-generated multimodal instruction-following data.
- 450K academic-task-oriented VQA data mixture.
- 40K ShareGPT data.
Guide: Running Locally
To run LLaVA-v1.5-13B locally, follow these steps:
- Set up Environment: Ensure you have Python and PyTorch installed.
- Clone the Repository: Clone the model repository from Hugging Face.
- Install Dependencies: Navigate to the cloned directory and install the necessary dependencies with pip install -r requirements.txt.
- Download Pre-trained Model: Download the model weights from the Hugging Face model card.
- Run Inference: Use the provided scripts to run inference on your dataset (a minimal Python sketch follows this guide).
For optimal performance, it is recommended to use cloud GPUs such as those provided by AWS, Google Cloud, or Azure.
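For a quick end-to-end check, the sketch below runs one round of inference. It is a hedged example rather than the repository's provided scripts: it assumes the community transformers port "llava-hf/llava-1.5-13b-hf" (the upstream liuhaotian/llava-v1.5-13b weights are driven by the scripts in the LLaVA GitHub repository), a GPU with enough memory for the 13B weights in half precision, and a hypothetical image path.

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-13b-hf"  # community transformers port (assumption)
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to reduce memory use
    device_map="auto",          # place layers on the available GPU(s)
)

image = Image.open("example.jpg")  # hypothetical local image
prompt = "USER: <image>\nDescribe this image. ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))

The prompt follows the v1.5 conversation template (USER/ASSISTANT turns with an <image> placeholder); on the cloud GPUs recommended above, the same script works unchanged.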
License
LLaVA-v1.5-13B is released under the LLAMA 2 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved. For questions or comments about the model, refer to the project's GitHub issues page.