llava-1.5-7b-hf

llava-hf

Introduction

The LLaVA model is an open-source chatbot designed for multimodal instruction-following tasks, built on a transformer-based architecture. It was obtained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. This version, LLaVA-1.5-7B, was trained in September 2023.

Architecture

LLaVA is an auto-regressive language model based on the transformer architecture. It supports multi-image and multi-prompt generation, so a single prompt can include multiple images.
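
For example, the snippet below sketches how a two-image prompt could be formatted with the processor's chat template. This is a minimal illustration, not code from the model card; the exact prompt string it produces depends on your transformers version, and the question text is just a placeholder.

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

# One {"type": "image"} entry per image you intend to pass to the model.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "What differs between these two images?"},
        ],
    },
]

# Renders the conversation into the model's expected text prompt,
# inserting an <image> placeholder for each image entry.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
print(prompt)
```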

Training

The model was fine-tuned on LLaVA-Instruct-150K, a GPT-generated multimodal instruction-following dataset, which equips it to handle complex, multimodal conversational tasks.

Guide: Running Locally

  1. Install Required Packages: Make sure you have transformers >= 4.35.3; optionally install bitsandbytes for 4-bit quantization and flash-attn for faster inference.
  2. Set Up the Environment: Make sure you have access to a CUDA-compatible GPU.
  3. Load the Model:
    • Use the Hugging Face Transformers library to load the model with options for 4-bit quantization or Flash Attention 2 for performance optimization.
  4. Run a Sample Input:
    • Prepare the input using the AutoProcessor and format it with apply_chat_template.
    • Run the model on the prepared input and decode the output (see the sketch after this list).
  5. Consider Cloud GPUs: For extensive tasks, consider using cloud-based GPUs from providers like AWS, Google Cloud, or Azure to optimize performance and resource management.
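
Putting steps 1 through 4 together, the sketch below loads the model in 4-bit (assuming bitsandbytes is installed), prepares a single-image prompt with AutoProcessor and apply_chat_template, and generates a reply. It follows the standard Transformers API rather than a script from the model card; the example image URL and generation settings are placeholders. To use Flash Attention 2 instead of 4-bit loading, drop quantization_config and pass attn_implementation="flash_attention_2" (requires flash-attn).

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"

# Optional 4-bit quantization (requires bitsandbytes); remove to load in plain fp16.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    quantization_config=quantization_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Any RGB image works; this COCO image is just an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build the prompt via the chat template: one {"type": "image"} entry per image.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))
```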

License

Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
