llava-hf/llava-v1.6-mistral-7b-hf

Introduction

The LLaVA-v1.6-Mistral-7B-HF model is a multimodal model that pairs a large language model with a vision encoder for tasks that combine images and text, such as image captioning and visual question answering. It improves on its predecessor, LLaVA-1.5, with higher input image resolution and training on an improved visual instruction tuning dataset, which strengthens its Optical Character Recognition (OCR) and reasoning capabilities.

Architecture

LLaVA-NeXT uses Mistral-7B as its language backbone, chosen for its commercially friendly license and bilingual support. The backbone is combined with a vision encoder, dynamic high-resolution image inputs, and a diverse, high-quality visual instruction data mixture to improve performance in multimodal chat applications.
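
As a rough illustration of the dynamic high-resolution scheme, the Hugging Face processor splits a large input image into several crops plus a downscaled overview rather than resizing it to one fixed size. A minimal sketch, assuming a transformers version with LLaVA-NeXT support (the placeholder image size is arbitrary):

    from transformers import LlavaNextProcessor
    from PIL import Image

    processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
    image = Image.new("RGB", (1024, 768))  # blank placeholder image, for illustration only

    inputs = processor(images=image, text="<image>", return_tensors="pt")
    # pixel_values has shape (batch, num_patches, 3, H, W): several high-resolution
    # crops of the image plus one downscaled overview.
    print(inputs["pixel_values"].shape)
    # image_sizes records the original (height, width), used when merging patch features.
    print(inputs["image_sizes"])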

Training

Training focused on improving reasoning, OCR, and world knowledge. The model is fine-tuned on a larger, more diverse visual instruction dataset and processes images at dynamic high resolution, both of which strengthen its multimodal capabilities.

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install the Required Libraries:

    pip install transformers torch pillow requests accelerate
    
  2. Set Up GPU Environment:

    • Ensure you have access to a CUDA-compatible GPU. Cloud GPU services like AWS EC2 with GPU instances, Google Cloud's AI Platform, or Azure's GPU VMs are recommended for optimal performance.
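
    • A quick optional sanity check that PyTorch can see a CUDA device (assumes torch from step 1 is installed):

      import torch
      print(torch.cuda.is_available())          # should print True on a working GPU setup
      if torch.cuda.is_available():
          print(torch.cuda.get_device_name(0))  # name of the visible GPU
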
  3. Load and Use the Model:

    from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
    import torch
    from PIL import Image
    import requests
    
    # Load the processor (tokenizer + image preprocessing) and the model in half precision.
    processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
    model = LlavaNextForConditionalGeneration.from_pretrained(
        "llava-hf/llava-v1.6-mistral-7b-hf",
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
    )
    model.to("cuda:0")
    
    # Download an example image (replace the URL with your own image).
    url = "https://example.com/image.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    
    # Build the prompt with the model's chat template; the {"type": "image"} entry
    # marks where the image is inserted in the prompt.
    conversation = [
        {"role": "user", "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image"},
        ]},
    ]
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
    
    # Preprocess the image and prompt, move everything to the GPU, and generate.
    inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda:0")
    output = model.generate(**inputs, max_new_tokens=100)
    
    print(processor.decode(output[0], skip_special_tokens=True))
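
    • Note: the decoded output includes the prompt as well as the model's answer. On recent transformers versions, where the processor expands the image placeholder into its full set of image tokens, the prompt can be sliced off so only the newly generated text is printed (optional; verify the slice on your transformers version):

      # Keep only the newly generated tokens, dropping the echoed prompt.
      answer_ids = output[0][inputs["input_ids"].shape[1]:]
      print(processor.decode(answer_ids, skip_special_tokens=True))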
    
  4. Optimize Model Performance:

    • Install bitsandbytes for 4-bit quantization:
      pip install bitsandbytes
      
    • Modify model loading to enable 4-bit quantization:
      model = LlavaNextForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16, low_cpu_mem_usage=True, load_in_4bit=True)
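
    • On newer transformers releases, passing load_in_4bit directly is deprecated; an equivalent sketch using BitsAndBytesConfig (the exact settings are illustrative):

      from transformers import BitsAndBytesConfig, LlavaNextForConditionalGeneration
      import torch

      quantization_config = BitsAndBytesConfig(
          load_in_4bit=True,
          bnb_4bit_compute_dtype=torch.float16,  # compute dtype chosen for illustration
      )
      model = LlavaNextForConditionalGeneration.from_pretrained(
          "llava-hf/llava-v1.6-mistral-7b-hf",
          quantization_config=quantization_config,
          low_cpu_mem_usage=True,
          device_map="auto",
      )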
      
    • Install flash-attn to accelerate generation:
      pip install flash-attn --no-build-isolation
      # See the Flash Attention GitHub repository for detailed installation instructions.
      
    • Enable Flash Attention:
      model = LlavaNextForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16, low_cpu_mem_usage=True, use_flash_attention_2=True).to(0)
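
    • On recent transformers versions, use_flash_attention_2 is deprecated in favor of the attn_implementation argument; an equivalent form (assuming flash-attn is installed):

      model = LlavaNextForConditionalGeneration.from_pretrained(
          model_id,
          torch_dtype=torch.float16,
          low_cpu_mem_usage=True,
          attn_implementation="flash_attention_2",
      ).to("cuda:0")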
      

License

The model is available under the Apache 2.0 license, which allows for both commercial and non-commercial use as long as the terms of the license are followed.
