LLaVA-v1.6-Mistral-7B-HF
llava-hf/llava-v1.6-mistral-7b-hf
Introduction
The LLaVA-v1.6-Mistral-7B-HF model is a multimodal model that combines a large language model with a vision encoder for tasks involving both images and text, such as image captioning and visual question answering. It improves on its predecessor, LLaVA-1.5, by increasing the input image resolution and training on an improved visual instruction tuning dataset, which strengthens its Optical Character Recognition (OCR) and reasoning capabilities.
Architecture
LLaVA-NeXT uses Mistral-7B as its language backbone, chosen for its commercial-friendly license and bilingual support. The model combines a diverse, high-quality data mixture with dynamic high-resolution image input to improve performance in multimodal chatbot applications.
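The dynamic high-resolution ("anyres") scheme works by choosing, from a fixed set of candidate grid resolutions, the one that best fits the input image, then tiling the resized image into vision-encoder patches alongside a downsampled overview. The helper below is a minimal illustrative sketch of that selection step, not the library's implementation; the candidate list of 336x336 tile grids is an assumption made for the example.

def pick_best_resolution(image_size, candidate_resolutions):
    """Illustrative: pick the candidate (w, h) that keeps the most image detail with the least padding."""
    orig_w, orig_h = image_size
    best, best_effective, best_waste = None, -1, float("inf")
    for cand_w, cand_h in candidate_resolutions:
        # Scale the image to fit inside the candidate resolution.
        scale = min(cand_w / orig_w, cand_h / orig_h)
        scaled_w, scaled_h = int(orig_w * scale), int(orig_h * scale)
        # Effective resolution: how many original pixels survive the resize (capped at the original).
        effective = min(scaled_w * scaled_h, orig_w * orig_h)
        # Wasted area: padding needed to fill the candidate canvas.
        waste = cand_w * cand_h - effective
        if effective > best_effective or (effective == best_effective and waste < best_waste):
            best, best_effective, best_waste = (cand_w, cand_h), effective, waste
    return best

# Hypothetical candidate grids built from 336x336 tiles (1x2, 2x1, 2x2, 3x1, 1x3).
candidates = [(336, 672), (672, 336), (672, 672), (1008, 336), (336, 1008)]
print(pick_best_resolution((800, 600), candidates))  # -> (672, 672)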
Training
The model was trained with an emphasis on improving reasoning, OCR, and world-knowledge capabilities, pairing the dynamic high-resolution input scheme with a diverse visual instruction dataset to strengthen its multimodal abilities.
Guide: Running Locally
To run the model locally, follow these steps:
- Install the Required Libraries (the PIL import is provided by the pillow package; accelerate is needed for low_cpu_mem_usage=True):
pip install transformers torch pillow requests accelerate
- Set Up the GPU Environment:
- Ensure you have access to a CUDA-compatible GPU. Cloud GPU services such as AWS EC2 GPU instances, Google Cloud's AI Platform, or Azure's GPU VMs are recommended for optimal performance. A quick check is sketched below.
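As an illustrative snippet, you can confirm that PyTorch sees a usable GPU before loading the weights:

import torch

# Confirm that PyTorch can see a CUDA device before loading the fp16 weights.
if torch.cuda.is_available():
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA device found; running on CPU will be very slow.")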
- Load and Use the Model:

from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image
import requests

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"

# Load the processor (image preprocessing and chat templating) and the model in fp16.
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
model.to("cuda:0")

# Fetch an example image.
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build the prompt in the chat format expected by the model.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Preprocess, generate, and decode.
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
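The decoded string above includes the prompt as well as the model's answer. If only the reply is wanted, one small variation (shown as a sketch) is to decode only the tokens generated after the prompt:

# Decode only the newly generated tokens, skipping the prompt portion of the output.
answer = processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)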
- Optimize Model Performance:
- Install bitsandbytes for 4-bit quantization:
pip install bitsandbytes
- Modify model loading to enable 4-bit quantization:
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    load_in_4bit=True,
)
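Recent transformers releases deprecate passing load_in_4bit directly in favor of an explicit BitsAndBytesConfig; the snippet below is a roughly equivalent sketch, and the specific quantization settings are illustrative defaults rather than values from the model card:

from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization with fp16 compute.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
# device_map="auto" lets accelerate place the quantized weights on the available GPU.
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    quantization_config=quantization_config,
    device_map="auto",
)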
- Install flash-attn to accelerate generation (refer to the Flash Attention GitHub repository for installation instructions).
- Enable Flash Attention:
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    use_flash_attention_2=True,
).to("cuda:0")
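On newer transformers versions the use_flash_attention_2 flag is deprecated in favor of the attn_implementation argument; an equivalent sketch:

# Same effect via the newer attention-backend selection argument.
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    attn_implementation="flash_attention_2",
).to("cuda:0")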
License
The model is available under the Apache 2.0 license, which allows for both commercial and non-commercial use as long as the terms of the license are followed.