Introduction

moondream2 is a compact vision-language model designed to run efficiently on edge devices, performing image-text-to-text tasks such as answering questions about an image. For more information, visit the GitHub repository or try the model on its Hugging Face Space.

Architecture

moondream2 is part of the Hugging Face Transformers ecosystem, and its weights are distributed in the safetensors and GGUF formats. It performs image-text-to-text tasks, and its small size makes it suitable for running on resource-constrained devices.
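
To see which weight formats are published in the model repository, the Hugging Face Hub client can list the repo's files. This is a minimal sketch using the standard huggingface_hub API; the actual filenames in the repository are not assumed here:

    from huggingface_hub import list_repo_files

    repo_id = "vikhyatk/moondream2"

    # List every file in the model repository and group by weight format.
    files = list_repo_files(repo_id)
    print("safetensors weights:", [f for f in files if f.endswith(".safetensors")])
    print("GGUF weights:", [f for f in files if f.endswith(".gguf")])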

Benchmarks

The model has been evaluated on several benchmarks, including VQAv2, GQA, TextVQA, DocVQA, and TallyQA, with scores improving across successive releases, reflecting ongoing development and refinement of its capabilities.

Guide: Running Locally

  1. Install Required Libraries:
    pip install transformers torch einops pillow
    
  2. Load the Model:
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from PIL import Image
    
    model_id = "vikhyatk/moondream2"
    revision = "2024-08-26"
    model = AutoModelForCausalLM.from_pretrained(
        model_id, trust_remote_code=True, revision=revision
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
    
  3. Use the Model:
    image = Image.open('<IMAGE_PATH>')
    enc_image = model.encode_image(image)
    print(model.answer_question(enc_image, "Describe this image.", tokenizer))
    

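Because the image is encoded once by encode_image, the resulting enc_image can be reused for several questions without re-running the vision encoder. A short continuation of the code above (the questions are illustrative):

    # Reuse the cached image encoding; only the text prompt changes per call.
    for question in [
        "Describe this image.",
        "What objects are visible?",
        "Is there any text in the image?",
    ]:
        print(model.answer_question(enc_image, question, tokenizer))
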
It is advisable to pin the model to a specific revision, as in the example above, since the repository is updated regularly and newer revisions may change behavior. For faster inference, consider running the model on a GPU, either locally or through cloud providers such as AWS, GCP, or Azure.
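
If a local GPU is available, the model can instead be loaded in half precision and moved to the device, reusing model_id and revision from step 2. This is a sketch using standard Transformers arguments (torch_dtype and .to) rather than anything moondream-specific, and it assumes a CUDA-capable GPU:

    import torch

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=True,
        revision=revision,
        torch_dtype=torch.float16,  # half precision to reduce GPU memory use
    ).to("cuda")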

License

The model is licensed under the Apache-2.0 license, allowing for broad usage and modifications.
