Sa2VA-4B
ByteDance/Sa2VA-4B
Introduction
Sa2VA is a multimodal large language model (MLLM) designed for question answering, visual prompt understanding, and dense object segmentation in both images and videos. It performs comparably to state-of-the-art (SOTA) models like Qwen2-VL and InternVL2.5 on question-answering benchmarks while also excelling at visual prompt understanding and segmentation, areas where those models fall short.
Architecture
The Sa2VA series is built on existing MLLMs such as Qwen2-VL and InternVL2/2.5, using OpenGVLab's InternVL and Qwen's Instruct models as bases. The architecture supports multilingual capabilities and handles a range of tasks, including feature extraction and conversational interfaces.
Training
Sa2VA models are designed to achieve SOTA performance on image and video grounding and segmentation benchmarks. The models have been trained and evaluated on multiple datasets to validate their effectiveness in dense grounded understanding.
Guide: Running Locally
To run Sa2VA locally, follow these steps:
- Install Dependencies: Ensure you have Python and the necessary libraries installed. Use pip to install the transformers library from Hugging Face.
- Load the Model and Tokenizer:
```python
import torch
from transformers import AutoTokenizer, AutoModel

path = "ByteDance/Sa2VA-4B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
```
- Prepare Input Data: For image processing, load an image using PIL and prepare the input dictionary with the tokenizer.
- Run Predictions: Use the model's predict_forward method to obtain answers and, where requested, segmentation masks (see the sketch after this list).
- GPU Recommendation: For optimal performance, consider using cloud GPU services like Google Cloud, AWS, or Azure.
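As a sketch of the "Prepare Input Data" and "Run Predictions" steps, the snippet below loads an image with PIL, builds the input dictionary, and calls predict_forward on the model loaded above. It follows the usage pattern commonly shown for Sa2VA, but the exact input keys (image, text, past_text, mask_prompts, tokenizer) and return keys (prediction, prediction_masks) are assumptions to verify against the remote code that from_pretrained downloads.

```python
from PIL import Image

# Continues from the `model` and `tokenizer` loaded above.

# Load an RGB image; the path is illustrative.
image = Image.open("example.jpg").convert("RGB")

# "<image>" marks where the visual input is inserted into the text prompt.
text_prompt = "<image>Please describe the image."

# Assumed predict_forward input layout; key names may differ between
# releases, so check the downloaded model code if this raises an error.
input_dict = {
    "image": image,
    "text": text_prompt,
    "past_text": "",
    "mask_prompts": None,
    "tokenizer": tokenizer,
}

return_dict = model.predict_forward(**input_dict)

answer = return_dict["prediction"]  # textual answer
print(answer)

# For grounding prompts (e.g. "<image>Please segment the person."), the
# returned dictionary is expected to also carry per-object segmentation masks.
masks = return_dict.get("prediction_masks", None)
```

If the keys differ in your version, printing return_dict.keys() after the call is a quick way to see what the downloaded implementation actually returns.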
License
The Sa2VA model is released under the MIT License, allowing for flexible usage and distribution.