SmolVLM-Instruct
HuggingFaceTB
Introduction
SmolVLM is a compact multimodal model designed by Hugging Face that processes image and text sequences to generate text outputs. It excels in tasks such as image captioning, visual question answering, and storytelling based on visual content. The model is efficient, suitable for on-device applications, and supports English language processing.
Architecture
SmolVLM utilizes the SmolLM2 language model and incorporates visual processing capabilities. Key architectural features include:
- Image Compression: Enhanced compression techniques for faster inference and reduced RAM usage.
- Visual Token Encoding: Utilizes 81 visual tokens to encode image patches of size 384×384, allowing efficient processing without loss of performance.
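To make the token budget implied by these numbers concrete, the short sketch below estimates how many visual tokens an image contributes once it is tiled into 384×384 patches. This is illustrative arithmetic only, not library code; how an image is actually resized and split is determined by the processor configuration.

```python
import math

TOKENS_PER_PATCH = 81  # visual tokens per 384x384 image patch, per this card
PATCH_SIZE = 384       # patch edge length in pixels

def estimate_visual_tokens(width: int, height: int) -> int:
    """Rough estimate of visual tokens for an image tiled into 384x384 patches."""
    patches = math.ceil(width / PATCH_SIZE) * math.ceil(height / PATCH_SIZE)
    return patches * TOKENS_PER_PATCH

# Example: a 1536x1152 image tiles into 4 x 3 = 12 patches -> 972 visual tokens.
print(estimate_visual_tokens(1536, 1152))
```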
Training
The model is trained using The Cauldron and Docmatix datasets, focusing on document understanding, image captioning, visual reasoning, and more. SmolVLM's training emphasizes balance across multiple capabilities, ensuring versatile performance.
Guide: Running Locally
To run SmolVLM locally, follow these steps:
- Install Dependencies: Ensure you have Python installed along with the required libraries, in particular `torch` and `transformers`.
- Set Up Environment: Use a CUDA-enabled GPU for best performance; if no local GPU is available, cloud GPU services such as AWS or Google Cloud work as well. A quick environment check is sketched below.
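The following minimal sketch verifies the environment before loading the model; it assumes only the `torch` and `transformers` packages named above.

```python
# Typical installation (adjust for your CUDA setup): pip install torch transformers
import torch
import transformers

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```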
- Load the Model:
```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

# Use the GPU if one is available, otherwise fall back to CPU.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load the processor and the model in bfloat16, then move the model to the device.
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    torch_dtype=torch.bfloat16,
).to(DEVICE)
```
- Prepare Inputs: Load an image and a text prompt, process them together with the `AutoProcessor`, and move the resulting tensors to the device.
- Generate Output: Call the model's generate method on the prepared inputs and decode the result; see the sketch after this list.
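The snippet below sketches these last two steps, continuing from the `processor`, `model`, and `DEVICE` defined above. The image path and prompt text are placeholders, and settings such as `max_new_tokens` are illustrative rather than prescribed values.

```python
from PIL import Image

# Placeholder image; replace with your own file.
image = Image.open("path/to/image.jpg")

# Build a chat-style prompt that interleaves the image with a text question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Can you describe the image?"},
        ],
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Process text and image together and move the tensors to the device.
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(DEVICE)

# Generate and decode the model's answer.
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```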
License
SmolVLM is released under the Apache 2.0 license. This license allows for broad use and distribution, provided proper acknowledgment is given and any modifications are documented.