LLaVA-v1.6-34B

liuhaotian

Introduction

LLaVA-v1.6-34B is an open-source chatbot model designed for multimodal instruction-following applications. It is built on the transformer architecture and fine-tuned from the base LLM NousResearch/Nous-Hermes-2-Yi-34B. The model is intended for research on large multimodal models and chatbots; its primary users are researchers and hobbyists in related fields.

Architecture

LLaVA is an auto-regressive language model based on the transformer architecture. It is trained on a combination of image-text data and instruction-following datasets to support multimodal interaction.
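
This card does not detail the multimodal wiring, but the general LLaVA design couples a vision encoder to the language model through a small projection module whose outputs are treated as extra input tokens. The sketch below illustrates only that idea; the class name and dimensions are illustrative assumptions (CLIP-ViT-L/14-style 1024-dim patch features, a Yi-34B-style 7168-dim hidden size), not the model's actual configuration.

```python
import torch
import torch.nn as nn

class LlavaStyleProjector(nn.Module):
    """Illustrative two-layer MLP mapping vision-encoder patch features
    into the LLM's token-embedding space (dimensions are assumptions)."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 7168):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.proj(patch_features)

# The projected patch embeddings are concatenated with the text-token
# embeddings and consumed by the auto-regressive LLM, which then generates
# the response one token at a time.
image_tokens = LlavaStyleProjector()(torch.randn(1, 576, 1024))
print(image_tokens.shape)  # torch.Size([1, 576, 7168])
```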

Training

The model was trained in December 2023 using several datasets:

  • 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
  • 158K GPT-generated multimodal instruction-following data.
  • 500K academic-task-oriented VQA data mixture.
  • 50K GPT-4V data mixture.
  • 40K ShareGPT data.

Guide: Running Locally

To run LLaVA-v1.6-34B locally, follow these basic steps:

  1. Clone the Repository: Begin by cloning https://github.com/haotian-liu/LLaVA to your local machine.
  2. Set Up Environment: Install Python and the dependencies listed in the repository's installation instructions.
  3. Download the Model: Retrieve the weights from the Hugging Face Hub (liuhaotian/llava-v1.6-34b).
  4. Inference Setup: Load the model and generate responses, either with the repository's scripts or through Hugging Face Transformers (see the sketch at the end of this section).

Because a 34B-parameter multimodal model requires substantial GPU memory, it is recommended to run it on cloud GPUs such as those provided by AWS, Google Cloud, or Azure.
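
Assuming a recent version of the Transformers library with LLaVA-NeXT support, a minimal inference sketch could look like the following. The llava-hf/llava-v1.6-34b-hf checkpoint name, the ChatML prompt template, and the image path are assumptions for illustration; the original liuhaotian checkpoint is normally loaded through the LLaVA repository's own scripts.

```python
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

# Assumed community-converted Transformers checkpoint of LLaVA-v1.6-34B.
model_id = "llava-hf/llava-v1.6-34b-hf"

processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision; still needs tens of GB of GPU memory
    device_map="auto",          # shard across available GPUs
)

image = Image.open("example.jpg")  # any local test image

# ChatML-style prompt, assumed to match the Nous-Hermes-2-Yi-34B template.
prompt = (
    "<|im_start|>system\nAnswer the questions.<|im_end|>"
    "<|im_start|>user\n<image>\nWhat is shown in this image?<|im_end|>"
    "<|im_start|>assistant\n"
)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```

On a single GPU, 4-bit or 8-bit quantization (for example via bitsandbytes) can reduce the memory footprint, at some cost in output quality.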

License

LLaVA-v1.6-34B is released under the Apache-2.0 license, following the license of its base model, NousResearch/Nous-Hermes-2-Yi-34B. For questions or comments, use the project's GitHub issues page: https://github.com/haotian-liu/LLaVA/issues.