Hawky AI SmolLM2-1.7B-Instruct
Sri-Vigneshwar-DJ

Introduction
Hawky AI SmolLM2-1.7B-Instruct is a multimodal model that combines image and text inputs to perform tasks such as image captioning, visual question answering, and storytelling based on visual content. It is designed for inference and does not support image generation.
Architecture
The model builds on SmolLM2, a lightweight language model, and introduces several enhancements over the earlier Idefics models:
- Image Compression: Employs a more aggressive image compression technique to improve inference speed and reduce memory usage.
- Visual Token Encoding: Uses 81 visual tokens to encode image patches of size 384×384. Larger images are split into patches and encoded separately, preserving efficiency without degrading performance (see the sketch below).
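To make the patching concrete, here is a minimal sketch of splitting an oversized image into 384×384 tiles with Pillow. It illustrates the idea only; it is not the model's internal preprocessing, which the processor applies automatically.

```python
# Illustrative only: split a large image into 384x384 tiles, mirroring
# in simplified form how oversized inputs are divided into patches.
# The real preprocessing is handled by the model's processor.
from PIL import Image

PATCH_SIZE = 384

def split_into_patches(image: Image.Image) -> list[Image.Image]:
    """Cut an image into PATCH_SIZE x PATCH_SIZE tiles, left to right, top to bottom."""
    patches = []
    width, height = image.size
    for top in range(0, height, PATCH_SIZE):
        for left in range(0, width, PATCH_SIZE):
            box = (left, top, min(left + PATCH_SIZE, width), min(top + PATCH_SIZE, height))
            patches.append(image.crop(box))
    return patches
```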
Training
SmolVLM is optimized for multimodal tasks and uses the visual token encoding scheme described above. The model is trained to handle interleaved text and image inputs effectively, and it can be fine-tuned on specific tasks by following the provided fine-tuning guidelines; a sketch of one common approach follows.
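As one hedged example of what task-specific fine-tuning can look like, the sketch below attaches LoRA adapters with the PEFT library so that only a small set of adapter weights is trained. The target module names and hyperparameters are illustrative assumptions, not values from the official guidelines.

```python
# A minimal LoRA fine-tuning setup, assuming the PEFT library is installed
# (pip install peft). Hyperparameters and target modules are assumptions
# for illustration, not the official fine-tuning recipe.
import torch
from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=8,                                  # adapter rank (assumed)
    lora_alpha=16,                        # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed names)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

From here, the wrapped model can be trained with a standard Transformers Trainer loop on processed multimodal examples.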
Guide: Running Locally
To run the model locally, follow these steps:
- Install Dependencies: Ensure you have PyTorch and the Transformers library installed (`pip install torch transformers`).
- Load Images: Use the `load_image` function from `transformers.image_utils` to load your images.
- Initialize Processor and Model:
```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    torch_dtype=torch.bfloat16,
).to("cuda")
```
- Prepare Inputs: Create input messages and prepare them using the processor.
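For example, continuing from the processor created above (the image path and question are placeholders):

```python
from transformers.image_utils import load_image

# Load an image from a local path or a URL (placeholder below).
image = load_image("path/or/url/to/image.jpg")

# Interleave the image with a text query in a chat-style message.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Can you describe this image?"},
        ],
    },
]

# Render the chat template and batch everything into model-ready tensors.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to("cuda")
```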
- Generate Outputs: Use the model to generate text from the input images and text queries.
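A minimal generation call might look like this (the token budget is an illustrative choice):

```python
# Generate a response and decode it back into text.
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```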
For optimal performance, use cloud GPUs like those provided by AWS, Google Cloud, or Azure.
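If no CUDA device is available, one simple fallback is to pick the device at runtime and move the model and inputs there instead of hard-coding "cuda":

```python
import torch

# Fall back to CPU when no GPU is present; expect slower inference.
device = "cuda" if torch.cuda.is_available() else "cpu"
```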
License
The model is licensed under the Apache-2.0 license, allowing for wide use and distribution under the specified terms.