Sa2VA-4B
ByteDance/Sa2VA-4B
Introduction
Sa2VA is a multimodal large language model (MLLM) designed for question answering, visual prompt understanding, and dense object segmentation in both images and videos. It performs comparably to state-of-the-art (SOTA) models like Qwen2-VL and InternVL2.5 on question-answering benchmarks while also excelling at visual prompt understanding and segmentation, areas where those models fall short.
Architecture
The Sa2VA series is built on existing MLLMs such as Qwen2-VL and InternVL2/2.5, using OpenGVLab's InternVL and Qwen's Instruct models as bases. The architecture supports multilingual capabilities and handles a range of tasks, including feature extraction and conversational interfaces.
Training
Sa2VA models are designed to achieve SOTA performance on image and video grounding and segmentation benchmarks. The models have been trained and evaluated on multiple datasets to validate their effectiveness in dense grounded understanding.
Guide: Running Locally
To run Sa2VA locally, follow these steps:
- Install Dependencies: Ensure you have Python and the necessary libraries installed. Use pip to install the transformers library from Hugging Face.
- Load the Model and Tokenizer:
```python
import torch
from transformers import AutoTokenizer, AutoModel

path = "ByteDance/Sa2VA-4B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
```
- Prepare Input Data: For image processing, load an image using PIL and prepare the input dictionary with the tokenizer.
- Run Predictions: Use the model's predict_forward method to obtain answers and, where requested, segmentation masks (see the sketch after this list).
- GPU Recommendation: For optimal performance, consider using cloud GPU services like Google Cloud, AWS, or Azure.
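As a sketch of the "Prepare Input Data" and "Run Predictions" steps, the snippet below loads an image with PIL, builds the input dictionary, and calls predict_forward on the model loaded above. It follows the usage pattern commonly shown for Sa2VA, but the exact input keys (image, text, past_text, mask_prompts, tokenizer) and return keys (prediction, prediction_masks) are assumptions to verify against the remote code that from_pretrained downloads.

```python
from PIL import Image

# Continues from the `model` and `tokenizer` loaded above.

# Load an RGB image; the path is illustrative.
image = Image.open("example.jpg").convert("RGB")

# "<image>" marks where the visual input is inserted into the text prompt.
text_prompt = "<image>Please describe the image."

# Assumed predict_forward input layout; key names may differ between
# releases, so check the downloaded model code if this raises an error.
input_dict = {
    "image": image,
    "text": text_prompt,
    "past_text": "",
    "mask_prompts": None,
    "tokenizer": tokenizer,
}

return_dict = model.predict_forward(**input_dict)

answer = return_dict["prediction"]  # textual answer
print(answer)

# For grounding prompts (e.g. "<image>Please segment the person."), the
# returned dictionary is expected to also carry per-object segmentation masks.
masks = return_dict.get("prediction_masks", None)
```

If the keys differ in your version, printing return_dict.keys() after the call is a quick way to see what the downloaded implementation actually returns.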
License
The Sa2VA model is released under the MIT License, allowing for flexible usage and distribution.