Sa2VA-8B
ByteDance
Introduction
Sa2VA is a Multimodal Large Language Model (MLLM) with capabilities in question answering, visual prompt understanding, and dense object segmentation for both images and videos. It performs on par with state-of-the-art models such as Qwen2-VL and InternVL2.5 on question-answering tasks while additionally offering visual prompt understanding and segmentation.
Architecture
The Sa2VA series builds on base MLLMs such as Qwen2-VL and InternVL2/2.5. It comprises several models, including Sa2VA-1B, Sa2VA-4B, and Sa2VA-8B, each built on a different base model to scale performance and capabilities across benchmarks.
Training
Sa2VA models are trained for image and video grounding and segmentation, where they achieve state-of-the-art results and outperform existing MLLMs.
Guide: Running Locally
To run Sa2VA models locally, follow these steps:
- Install Dependencies: Ensure you have PyTorch, Transformers, and Pillow (PIL) installed.
- Load Model and Tokenizer:
    from transformers import AutoTokenizer, AutoModel

    # Sa2VA ships custom modeling code on the Hub, so trust_remote_code is needed.
    model = AutoModel.from_pretrained(
        "ByteDance/Sa2VA-4B",
        trust_remote_code=True,
    ).eval().cuda()
    tokenizer = AutoTokenizer.from_pretrained("ByteDance/Sa2VA-4B", trust_remote_code=True)
- Prepare Input: Load an image or video and prepare text prompts (see the first sketch after this list).
- Prediction:
- For image input:
    input_dict = {'image': image, 'text': text_prompts, 'tokenizer': tokenizer}
    return_dict = model.predict_forward(**input_dict)
    answer = return_dict["prediction"]
- For video input, sample frames from the video and follow the same steps as above (a sketch follows this list).
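A minimal sketch of the input-preparation step, assuming a local image file and an InternVL-style prompt format; the file path, question text, and the leading "<image>" placeholder are illustrative assumptions rather than values taken from this card:

    from PIL import Image

    # Placeholder path; any RGB image works.
    image = Image.open("example.jpg").convert("RGB")

    # Hypothetical prompt; the "<image>" token mirrors InternVL-style prompting
    # and should be checked against the model card's exact prompt format.
    text_prompts = "<image>Please describe the image."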
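For video input, a plausible pattern, continuing from the snippets above, is to sample a handful of frames as PIL images and reuse the same predict_forward call; the frame directory, sampling stride, and the 'video' input key are assumptions, not confirmed by this card:

    import os
    from PIL import Image

    # Hypothetical directory of pre-extracted frames; keep every 5th frame, up to 8 frames.
    frame_dir = "video_frames"
    frame_files = sorted(os.listdir(frame_dir))[::5][:8]
    frames = [Image.open(os.path.join(frame_dir, f)).convert("RGB") for f in frame_files]

    # Assumed interface: frames passed under a 'video' key, mirroring the image case.
    input_dict = {'video': frames, 'text': "<image>What is happening in this video?", 'tokenizer': tokenizer}
    return_dict = model.predict_forward(**input_dict)
    answer = return_dict["prediction"]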
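Because Sa2VA also performs dense object segmentation, a segmentation-style prompt can go through the same interface. The sketch below is speculative: the prompt wording and especially the 'prediction_masks' output key are assumptions that should be verified against the dictionary returned by predict_forward:

    # Continuing from the image snippet above; the prompt wording is illustrative.
    seg_input = {'image': image, 'text': "<image>Please segment the person in the image.", 'tokenizer': tokenizer}
    seg_output = model.predict_forward(**seg_input)

    answer = seg_output["prediction"]
    # Assumed key: segmentation masks, if any, may be returned under 'prediction_masks'.
    masks = seg_output.get("prediction_masks")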
For optimal performance on resource-intensive tasks, consider using cloud GPU instances from providers such as AWS EC2, Google Cloud, or Azure.
License
Sa2VA is licensed under the MIT License, allowing for open use and modification with proper attribution.