Introduction

Sa2VA is a Multimodal Large Language Model (MLLM) with capabilities in question answering, visual prompt understanding, and dense object segmentation for both images and videos. It performs on par with state-of-the-art models such as Qwen2-VL and InternVL2.5 on question-answering benchmarks while additionally offering visual prompt understanding and dense segmentation.

Architecture

The Sa2VA series builds on existing MLLMs such as Qwen2-VL and InternVL2/2.5. It comprises several variants, including Sa2VA-1B, Sa2VA-4B, and Sa2VA-8B, each built upon a different base model to scale performance and capabilities across benchmarks.

Training

Sa2VA models have been trained to achieve state-of-the-art performance in image and video grounding and segmentation tasks, outperforming existing MLLMs in these areas.

Guide: Running Locally

To run Sa2VA models locally, follow these steps:

  1. Install Dependencies: Ensure you have PyTorch, Transformers, and Pillow (PIL) installed.
  2. Load Model and Tokenizer:
    import torch
    from transformers import AutoTokenizer, AutoModel

    # trust_remote_code is required because Sa2VA ships its own modeling code
    model = AutoModel.from_pretrained("ByteDance/Sa2VA-4B", torch_dtype=torch.bfloat16, trust_remote_code=True).eval().cuda()
    tokenizer = AutoTokenizer.from_pretrained("ByteDance/Sa2VA-4B", trust_remote_code=True)
    
  3. Prepare Input: Load the image (or sampled video frames) with PIL and prepare a text prompt; a full sketch follows this list.
  4. Prediction:
    • For image input:
      # prompts start with the <image> placeholder token
      text_prompts = "<image>Please describe the image."
      input_dict = {'image': image, 'text': text_prompts, 'tokenizer': tokenizer}
      return_dict = model.predict_forward(**input_dict)
      answer = return_dict["prediction"]
      
    • For video input, sample a handful of frames and pass them in place of the image; see the sketch after this list.

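The sketch below ties steps 3 and 4 together for both a single image and a directory of pre-extracted video frames. The file paths, the 5-frame sampling, and the 'video' keyword are assumptions for illustration rather than a guaranteed interface; check the model card of the checkpoint you download and adjust accordingly.

    import os
    import torch
    from PIL import Image
    from transformers import AutoTokenizer, AutoModel

    MODEL_ID = "ByteDance/Sa2VA-4B"
    model = AutoModel.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True).eval().cuda()
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

    # --- Image input: open the image with PIL and keep the <image> token in the prompt ---
    image = Image.open("example.jpg").convert("RGB")  # hypothetical path
    return_dict = model.predict_forward(
        image=image,
        text="<image>Please describe the image.",
        tokenizer=tokenizer,
    )
    print(return_dict["prediction"])

    # --- Video input: uniformly sample a handful of frames and pass them as a list ---
    frame_dir = "video_frames/"  # hypothetical directory of pre-extracted frames
    frame_paths = sorted(os.path.join(frame_dir, f) for f in os.listdir(frame_dir))
    if len(frame_paths) > 5:
        step = max(len(frame_paths) // 5, 1)
        frame_paths = frame_paths[::step][:5]  # keep ~5 evenly spaced frames
    frames = [Image.open(p).convert("RGB") for p in frame_paths]

    # The 'video' keyword mirrors the image call and is an assumption; verify the exact
    # argument name that predict_forward expects for multi-frame input.
    return_dict = model.predict_forward(
        video=frames,
        text="<image>Please describe the video.",
        tokenizer=tokenizer,
    )
    print(return_dict["prediction"])

    # When the prompt asks for segmentation (e.g. "Please segment the person."), the
    # returned dictionary is expected to also carry per-object masks; inspect
    # return_dict.keys() rather than hard-coding a key name.
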
For resource-intensive tasks, consider running on cloud GPU instances from providers such as AWS EC2, Google Cloud, or Azure.

License

Sa2VA is released under the MIT License, which permits open use, modification, and redistribution provided the copyright and license notice are retained.