nanoLLaVA-1.5

qnguyen3

Introduction

nanoLLaVA-1.5 is a compact vision-language model designed to run efficiently on edge devices. It improves on the previous version, nanoLLaVA-1.0, and targets image-text-to-text tasks: given an image and a text prompt, it generates a text response.

Architecture

  • Base LLM: Quyen-SE-v0.1 (Qwen1.5-0.5B)
  • Vision Encoder: google/siglip-so400m-patch14-384

The vision encoder turns an input image into features, and the base LLM generates text from those features together with the prompt. This pairing lets the model handle a range of multimodal tasks.
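To get a feel for how much visual context the encoder listed above produces, the patch count can be worked out from its name. This is a minimal sketch that assumes the `patch14-384` suffix means a 384x384 input cut into non-overlapping 14x14 patches; whether every patch reaches the LLM as its own token depends on the projector, which this card does not describe. The function name is hypothetical.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    per_side = image_size // patch_size  # whole patches along one edge
    return per_side * per_side


# Values read off the encoder name "siglip-so400m-patch14-384" (assumption).
tokens = num_patch_tokens(image_size=384, patch_size=14)
print(tokens)  # 27 * 27 = 729
```

Under these assumptions, a single image contributes on the order of several hundred feature vectors, which is worth keeping in mind when budgeting context length on small devices.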

Training

Details of the training data are not yet available; they are being documented in an upcoming paper. The model is intended to improve on the performance of its predecessor, nanoLLaVA-1.0.

Guide: Running Locally

To use nanoLLaVA-1.5 with the Transformers library, follow these steps:

  1. Install Required Libraries:

    pip install -U transformers accelerate flash_attn
    
  2. Set Up the Model:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from PIL import Image
    
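    model_name = 'qnguyen3/nanoLLaVA-1.5'
    # trust_remote_code=True runs model code shipped in the repository, so review it first
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,  # half precision to reduce memory use
        device_map='auto',          # let accelerate place the weights on available devices
        trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)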
    
  3. Prepare Input and Generate Output:

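    prompt = 'Describe this image in detail'
    messages = [{"role": "user", "content": f'<image>\n{prompt}'}]
    # Render the chat template as a string; tokenization happens below
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    
    # Tokenize the text on either side of '<image>' and join the halves with the
    # image placeholder id -200 (assumption: the value used by the upstream model code)
    text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
    input_ids = torch.tensor([text_chunks[0] + [-200] + text_chunks[1]], dtype=torch.long).to(model.device)
    image = Image.open('/path/to/image.png')
    # Preprocess the image with the model's own helper, matching its dtype and device
    image_tensor = model.process_images([image], model.config).to(dtype=model.dtype, device=model.device)
    
    output_ids = model.generate(input_ids, images=image_tensor, max_new_tokens=2048, use_cache=True)[0]
    # Decode only the newly generated tokens, skipping the prompt
    print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())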
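
LLaVA-style models commonly mark the image position in the prompt with a sentinel id instead of tokenizing the literal `<image>` string. The sketch below shows that splice in isolation so it can be checked without downloading the model. The helper name `build_input_ids`, the toy tokenizer, and the sentinel value `-200` are assumptions for illustration; confirm the value against the model's remote code before relying on it.

```python
from typing import Callable, List

IMAGE_PLACEHOLDER = "<image>"
IMAGE_TOKEN_INDEX = -200  # assumption: LLaVA-style sentinel id, not confirmed by this card


def build_input_ids(text: str,
                    tokenize: Callable[[str], List[int]],
                    image_token_index: int = IMAGE_TOKEN_INDEX) -> List[int]:
    """Tokenize the text around each '<image>' marker and join the pieces with a sentinel id."""
    ids: List[int] = []
    for i, chunk in enumerate(text.split(IMAGE_PLACEHOLDER)):
        if i > 0:
            ids.append(image_token_index)  # one sentinel per '<image>' marker
        ids.extend(tokenize(chunk))
    return ids


def toy_tokenize(s: str) -> List[int]:
    """Stand-in tokenizer for illustration only: one id per character."""
    return [ord(c) for c in s]


print(build_input_ids("a<image>b", toy_tokenize))  # [97, -200, 98]
```

With the real tokenizer, the callable would be something like `lambda s: tokenizer(s).input_ids`. This only works cleanly if the tokenizer adds no special tokens on each call, which should be verified for the tokenizer in use.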
    

For faster inference, consider running the model on a GPU instance from a cloud provider such as AWS, Google Cloud, or Azure.
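
As a rough guide to how much accelerator memory the weights alone need in float16, the estimate below multiplies a parameter count by two bytes. The parameter counts are assumptions inferred from the component names ("0.5B" and "so400m"), not figures from this card, and the result excludes activations, the KV cache, and framework overhead.

```python
def weights_memory_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in GiB (2 bytes per parameter for float16)."""
    return n_params * bytes_per_param / 2**30


# Rough parameter counts inferred from the component names (assumptions).
llm_params = 0.5e9     # from "Qwen1.5-0.5B"
vision_params = 0.4e9  # from "so400m"

print(round(weights_memory_gib(llm_params + vision_params), 2))  # 1.68
```

Under these assumptions the weights fit comfortably on most current GPUs, so the choice of instance is more likely to be driven by latency and batch size than by capacity.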

License

nanoLLaVA-1.5 is distributed under the Apache 2.0 license, which allows for both commercial and non-commercial use.
