nanoLLaVA-1.5
Introduction
nanoLLaVA-1.5 is a compact vision-language model built to run efficiently on edge devices. It improves on the previous release, nanoLLaVA-1.0, and targets image-text-to-text tasks such as visual question answering and image captioning.
Architecture
- Base LLM: Quyen-SE-v0.1 (Qwen1.5-0.5B)
- Vision Encoder: google/siglip-so400m-patch14-384
Following the LLaVA design, the SigLIP vision encoder turns an image into visual features that are projected into the token-embedding space of the Qwen1.5-0.5B language model, so a single forward pass can attend over both image and text tokens.
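For orientation, the two building blocks can be loaded on their own with standard Transformers classes. This is an illustrative sketch only: nanoLLaVA's custom remote code is what actually wires them together with a learned projector, and Qwen/Qwen1.5-0.5B is used here as a stand-in for Quyen-SE-v0.1, which shares the same architecture.

# Illustrative only: inspect the two components separately. nanoLLaVA's own
# remote code combines them with a projector; this sketch does not reproduce that.
from transformers import AutoModelForCausalLM, SiglipVisionModel

vision = SiglipVisionModel.from_pretrained('google/siglip-so400m-patch14-384')
llm = AutoModelForCausalLM.from_pretrained('Qwen/Qwen1.5-0.5B')  # stand-in for Quyen-SE-v0.1

print(vision.config.hidden_size)  # width of the image features
print(llm.config.hidden_size)     # width of the language model's token embeddings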
Training
Details of the training data are pending and will be documented in an upcoming paper. The model is reported to improve on its predecessor, nanoLLaVA-1.0.
Guide: Running Locally
To use nanoLLaVA-1.5 with the Transformers library, follow these steps:
- Install Required Libraries:
pip install -U transformers accelerate flash_attn
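Note that flash_attn compiles a CUDA extension and assumes a recent NVIDIA GPU; if that build fails on your machine, the slimmer install below may suffice. Treat the fallback as an assumption rather than a guarantee, since the model's remote code may prefer flash-attention when it is available. pillow is added because the example code below opens images with PIL.

# Leaner install without flash_attn (assumption: standard attention is used
# when flash_attn is absent); pillow provides PIL for image loading.
pip install -U transformers accelerate pillow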
- Set Up the Model:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

model_name = 'qnguyen3/nanoLLaVA-1.5'

# trust_remote_code=True is required because the repository ships custom
# LLaVA-style modeling code
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
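If no CUDA GPU is available, a float32 CPU load is a possible fallback. This is a sketch under the assumption that the model's custom code does not strictly require flash-attention; expect generation to be noticeably slower.

# Hypothetical CPU-only fallback (assumption: flash_attn is not strictly
# required by the remote code). Omitting device_map keeps the model on CPU.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    trust_remote_code=True)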
- Prepare Input and Generate Output:
prompt = 'Describe this image in detail'
messages = [{"role": "user", "content": f'<image>\n{prompt}'}]

# Build the chat prompt; the <image> placeholder marks where the visual
# features are spliced into the token sequence
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = torch.tensor([tokenizer(text).input_ids], dtype=torch.long)

# Preprocess the image with the model's own image processor
image = Image.open('/path/to/image.png')
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype)

output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=2048,
    use_cache=True)[0]

# Decode only the newly generated tokens
print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
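The steps above can be bundled into a small helper. The name describe_image is hypothetical and used only for illustration; the function simply reuses the model and tokenizer loaded earlier.

# Hypothetical convenience wrapper around the steps above; it reuses the
# already loaded `model` and `tokenizer`.
def describe_image(image_path, prompt='Describe this image in detail'):
    messages = [{"role": "user", "content": f'<image>\n{prompt}'}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    input_ids = torch.tensor([tokenizer(text).input_ids], dtype=torch.long)
    image_tensor = model.process_images([Image.open(image_path)], model.config).to(dtype=model.dtype)
    output_ids = model.generate(input_ids, images=image_tensor, max_new_tokens=2048, use_cache=True)[0]
    return tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip()

print(describe_image('/path/to/image.png', 'What is in this picture?'))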
For best performance, run the model on a GPU; cloud providers such as AWS, Google Cloud, or Azure offer suitable instances.
License
nanoLLaVA-1.5 is distributed under the Apache 2.0 license, which allows for both commercial and non-commercial use.