Hawky AI SmolLM2-1.7B-Instruct
Sri-Vigneshwar-DJ

Introduction
Hawky AI SmolLM2-1.7B-Instruct is a multimodal model that combines image and text inputs to perform tasks such as image captioning, visual question answering, and storytelling based on visual content. It is designed for inference and does not support image generation.
Architecture
The model builds on SmolLM2, a lightweight language model, and introduces several enhancements over the earlier Idefics models:
- Image Compression: Employs a more aggressive image compression technique to improve inference speed and reduce memory usage.
- Visual Token Encoding: Uses 81 visual tokens to encode image patches of size 384×384. Larger images are split into patches and encoded separately, preserving efficiency without degrading performance (see the sketch below).
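To make the patching concrete, here is a minimal sketch of splitting an oversized image into 384×384 tiles with Pillow. It illustrates the idea only; it is not the model's internal preprocessing, which the processor applies automatically.

```python
# Illustrative only: split a large image into 384x384 tiles, mirroring
# in simplified form how oversized inputs are divided into patches.
# The real preprocessing is handled by the model's processor.
from PIL import Image

PATCH_SIZE = 384

def split_into_patches(image: Image.Image) -> list[Image.Image]:
    """Cut an image into PATCH_SIZE x PATCH_SIZE tiles, left to right, top to bottom."""
    patches = []
    width, height = image.size
    for top in range(0, height, PATCH_SIZE):
        for left in range(0, width, PATCH_SIZE):
            box = (left, top, min(left + PATCH_SIZE, width), min(top + PATCH_SIZE, height))
            patches.append(image.crop(box))
    return patches
```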
Training
SmolVLM is optimized for multimodal tasks and uses the visual token encoding scheme described above. The model is trained to handle interleaved text and image inputs effectively, and it can be fine-tuned on specific tasks by following the provided fine-tuning guidelines; a sketch of one common approach follows.
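As one hedged example of what task-specific fine-tuning can look like, the sketch below attaches LoRA adapters with the PEFT library so that only a small set of adapter weights is trained. The target module names and hyperparameters are illustrative assumptions, not values from the official guidelines.

```python
# A minimal LoRA fine-tuning setup, assuming the PEFT library is installed
# (pip install peft). Hyperparameters and target modules are assumptions
# for illustration, not the official fine-tuning recipe.
import torch
from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=8,                                  # adapter rank (assumed)
    lora_alpha=16,                        # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed names)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

From here, the wrapped model can be trained with a standard Transformers Trainer loop on processed multimodal examples.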
Guide: Running Locally
To run the model locally, follow these steps:
- Install Dependencies: Ensure you have PyTorch and the Transformers library installed (`pip install torch transformers`).
- Load Images: Use the `load_image` function from `transformers.image_utils` to load your images.
- Initialize Processor and Model:
```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    torch_dtype=torch.bfloat16,
).to("cuda")
```
- Prepare Inputs: Create input messages and prepare them using the processor.
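For example, continuing from the processor created above (the image path and question are placeholders):

```python
from transformers.image_utils import load_image

# Load an image from a local path or a URL (placeholder below).
image = load_image("path/or/url/to/image.jpg")

# Interleave the image with a text query in a chat-style message.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Can you describe this image?"},
        ],
    },
]

# Render the chat template and batch everything into model-ready tensors.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to("cuda")
```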
- Generate Outputs: Use the model to generate text from the input images and text queries.
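A minimal generation call might look like this (the token budget is an illustrative choice):

```python
# Generate a response and decode it back into text.
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```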
For optimal performance, use cloud GPUs like those provided by AWS, Google Cloud, or Azure.
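If no CUDA device is available, one simple fallback is to pick the device at runtime and move the model and inputs there instead of hard-coding "cuda":

```python
import torch

# Fall back to CPU when no GPU is present; expect slower inference.
device = "cuda" if torch.cuda.is_available() else "cpu"
```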
License
The model is licensed under the Apache-2.0 license, allowing for wide use and distribution under the specified terms.