Introduction

Sa2VA is a Multimodal Large Language Model (MLLM) with capabilities in question answering, visual prompt understanding, and dense object segmentation for both images and videos. It performs on par with state-of-the-art models such as Qwen2-VL and InternVL2.5 on question-answering benchmarks while additionally offering visual prompt understanding and dense segmentation.

Architecture

The Sa2VA series builds on existing MLLMs such as Qwen2-VL and InternVL2/2.5. It comprises several variants, including Sa2VA-1B, Sa2VA-4B, and Sa2VA-8B, each built upon a different base model to scale performance and capabilities across benchmarks.

Training

Sa2VA models have been trained to achieve state-of-the-art performance in image and video grounding and segmentation tasks, outperforming existing MLLMs in these areas.

Guide: Running Locally

To run Sa2VA models locally, follow these steps:

  1. Install Dependencies: Ensure you have PyTorch, Transformers, and Pillow (PIL) installed.
  2. Load Model and Tokenizer:
    import torch
    from transformers import AutoTokenizer, AutoModel

    # trust_remote_code is required because Sa2VA ships its own modeling code
    model = AutoModel.from_pretrained("ByteDance/Sa2VA-4B", torch_dtype=torch.bfloat16, trust_remote_code=True).eval().cuda()
    tokenizer = AutoTokenizer.from_pretrained("ByteDance/Sa2VA-4B", trust_remote_code=True)
    
  3. Prepare Input: Load the image (or sampled video frames) with PIL and prepare a text prompt; a full sketch follows this list.
  4. Prediction:
    • For image input:
      # prompts start with the <image> placeholder token
      text_prompts = "<image>Please describe the image."
      input_dict = {'image': image, 'text': text_prompts, 'tokenizer': tokenizer}
      return_dict = model.predict_forward(**input_dict)
      answer = return_dict["prediction"]
      
    • For video input, sample a handful of frames and pass them in place of the image; see the sketch after this list.

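The sketch below ties steps 3 and 4 together for both a single image and a directory of pre-extracted video frames. The file paths, the 5-frame sampling, and the 'video' keyword are assumptions for illustration rather than a guaranteed interface; check the model card of the checkpoint you download and adjust accordingly.

    import os
    import torch
    from PIL import Image
    from transformers import AutoTokenizer, AutoModel

    MODEL_ID = "ByteDance/Sa2VA-4B"
    model = AutoModel.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True).eval().cuda()
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

    # --- Image input: open the image with PIL and keep the <image> token in the prompt ---
    image = Image.open("example.jpg").convert("RGB")  # hypothetical path
    return_dict = model.predict_forward(
        image=image,
        text="<image>Please describe the image.",
        tokenizer=tokenizer,
    )
    print(return_dict["prediction"])

    # --- Video input: uniformly sample a handful of frames and pass them as a list ---
    frame_dir = "video_frames/"  # hypothetical directory of pre-extracted frames
    frame_paths = sorted(os.path.join(frame_dir, f) for f in os.listdir(frame_dir))
    if len(frame_paths) > 5:
        step = max(len(frame_paths) // 5, 1)
        frame_paths = frame_paths[::step][:5]  # keep ~5 evenly spaced frames
    frames = [Image.open(p).convert("RGB") for p in frame_paths]

    # The 'video' keyword mirrors the image call and is an assumption; verify the exact
    # argument name that predict_forward expects for multi-frame input.
    return_dict = model.predict_forward(
        video=frames,
        text="<image>Please describe the video.",
        tokenizer=tokenizer,
    )
    print(return_dict["prediction"])

    # When the prompt asks for segmentation (e.g. "Please segment the person."), the
    # returned dictionary is expected to also carry per-object masks; inspect
    # return_dict.keys() rather than hard-coding a key name.
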
For resource-intensive tasks, consider running on cloud GPU instances from providers such as AWS EC2, Google Cloud, or Azure.

License

Sa2VA is released under the MIT License, which permits open use, modification, and redistribution provided the copyright and license notice are retained.