nanoLLaVA

qnguyen3

Introduction

nanoLLaVA is a vision-language model with under 1 billion parameters, designed to run efficiently on edge devices. Despite its compact size, it delivers robust performance on vision-language tasks by pairing the Quyen-SE-v0.1 language model with Google's SigLIP vision encoder.

Architecture

  • Base LLM: Quyen-SE-v0.1, a 0.5-billion-parameter model from the Qwen family.
  • Vision Encoder: Google Siglip-so400m-patch14-384, which encodes input images into visual features that are fed into the language model.
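
To get a feel for the vision side on its own, the SigLIP encoder can be loaded directly with the transformers library. The sketch below is a standalone illustration (the image path is a placeholder, and nanoLLaVA wires the encoder in through its own remote code rather than like this):

    import torch
    from PIL import Image
    from transformers import SiglipImageProcessor, SiglipVisionModel

    # Load the same vision tower nanoLLaVA builds on, outside the VLM
    encoder = SiglipVisionModel.from_pretrained('google/siglip-so400m-patch14-384')
    processor = SiglipImageProcessor.from_pretrained('google/siglip-so400m-patch14-384')

    image = Image.open('example.jpg').convert('RGB')  # placeholder image path
    inputs = processor(images=image, return_tensors='pt')
    with torch.no_grad():
        patch_features = encoder(**inputs).last_hidden_state
    print(patch_features.shape)  # (1, num_patches, hidden_dim)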

Training

The training data for nanoLLaVA has not yet been released, as it is part of an ongoing research paper; the authors state that the final release will bring further performance improvements. Finetuning code will be made available soon.

Guide: Running Locally

  1. Install Dependencies:

    pip install -U transformers accelerate flash_attn
    
  2. Load Model and Tokenizer:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    # trust_remote_code is required because nanoLLaVA ships custom modeling code
    model = AutoModelForCausalLM.from_pretrained(
        'qnguyen3/nanoLLaVA', torch_dtype=torch.float16, device_map='auto', trust_remote_code=True
    )
    tokenizer = AutoTokenizer.from_pretrained('qnguyen3/nanoLLaVA', trust_remote_code=True)
    
  3. Prepare Input:

    • Build the prompt using the ChatML chat template.
    • Load and preprocess the image with PIL (a fuller sketch follows step 4 below).
  4. Generate Output:

    # Generate a response conditioned on the image, then drop the prompt tokens before decoding
    output_ids = model.generate(input_ids, images=image_tensor, max_new_tokens=2048, use_cache=True)[0]
    print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
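
For step 3 in full, the sketch below produces the input_ids and image_tensor used in step 4 above. It follows the upstream model card and assumes the model's remote code exposes a process_images helper and uses -200 as the placeholder id for the <image> token; the prompt and image path are illustrative.

    from PIL import Image

    prompt = 'Describe this image in detail.'
    messages = [{'role': 'user', 'content': f'<image>\n{prompt}'}]

    # Render the conversation with the ChatML chat template
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    # Split around the <image> placeholder and splice in the image token id (-200)
    text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
    input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1], dtype=torch.long).unsqueeze(0)

    # Load the image and preprocess it with the model's own image processor
    image = Image.open('/path/to/image.png').convert('RGB')
    image_tensor = model.process_images([image], model.config).to(dtype=model.dtype)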
    

Consider using cloud GPUs from services like AWS, Google Cloud, or Azure for improved performance during inference.

License

nanoLLaVA is released under the Apache 2.0 License, allowing for both personal and commercial use.
