moondream2
vikhyatk

Introduction
moondream2 is a compact vision-language model optimized for edge devices, built for efficient image-text-to-text generation. For more information, visit the GitHub repository or explore the model on its Hugging Face Space.
Architecture
moondream2 is distributed through the Hugging Face Transformers ecosystem, with weights available in the safetensors and GGUF formats. It performs image-text-to-text tasks and is designed to run on resource-constrained devices.
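For memory-constrained deployments, the weights can be loaded in reduced precision through the standard Transformers API. The sketch below is illustrative: the float16 choice is an assumption to validate on your hardware, not a documented recommendation.

import torch
from transformers import AutoModelForCausalLM

# Illustrative sketch: load moondream2 in float16 to roughly halve the memory
# footprint relative to float32. Output quality in float16 on a given device
# is an assumption to verify, not a documented guarantee.
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    trust_remote_code=True,
    revision="2024-08-26",
    torch_dtype=torch.float16,
)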
Benchmarks
The model has been evaluated on several benchmarks, including VQAv2, GQA, TextVQA, DocVQA, and TallyQA, with scores improving across successive releases, reflecting ongoing development and refinement of its capabilities.
Guide: Running Locally
- Install Required Libraries:
pip install transformers einops
- Load the Model:
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

model_id = "vikhyatk/moondream2"
revision = "2024-08-26"
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision
)
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
- Use the Model:
image = Image.open('<IMAGE_PATH>')
enc_image = model.encode_image(image)
print(model.answer_question(enc_image, "Describe this image.", tokenizer))
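Because encode_image returns the image encoding separately from answer generation, the same encoded image can be reused across several questions without re-running the vision encoder. A brief sketch using only the calls shown above; the question strings are illustrative, not prompts from the model card.

from PIL import Image

# Encode the image once, then ask multiple questions against the same encoding.
image = Image.open('<IMAGE_PATH>')
enc_image = model.encode_image(image)

questions = [
    "Describe this image.",
    "What objects are visible?",          # illustrative question
    "Is there any text in the image?",    # illustrative question
]
for question in questions:
    print(question)
    print(model.answer_question(enc_image, question, tokenizer))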
Pin the model to a specific revision, as shown above, to ensure consistent behavior as new versions are released. For faster inference, consider cloud-based GPU services such as AWS, GCP, or Azure.
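If a CUDA GPU is available, the loaded model can be moved to it with standard PyTorch calls; whether the bundled remote code also moves image inputs to the model's device automatically may depend on the revision, so treat this as a sketch.

import torch

# Standard PyTorch device placement: use a CUDA GPU when one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)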
License
The model is licensed under the Apache-2.0 license, allowing for broad usage and modifications.