Introduction

SAIL-VL is a vision-language model developed by the Bytedance Douyin Content Team, aimed at high-performance deployment on mobile devices. It prioritizes accessibility and affordability, outperforming comparable models such as Qwen2-VL and InternVL2 at similar scales. SAIL-VL relies on data scaling to enhance performance, positioning itself as a foundational model for vision-language applications.

Architecture

SAIL-VL follows a ViT + adapter + LLM structure with token merging and dynamic-resolution image handling. The 2B model pairs the InternViT-300M vision encoder with the Qwen2.5-1.5B language model, and it is designed to handle vision-language tasks efficiently, showing superior benchmark results compared to models of similar size.
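
To make the structure concrete, the sketch below shows how a ViT + adapter + LLM pipeline with token merging fits together. It is an illustrative toy, not the actual SAIL-VL implementation: the module names, dimensions, and the 4x merge factor are assumptions chosen for clarity.

    # Illustrative sketch: patch tokens from a vision encoder are merged and
    # projected into the language model's embedding space by an MLP adapter.
    # Sizes and merge factor are assumptions, not SAIL-VL's actual values.
    import torch
    import torch.nn as nn

    class VisionLanguageAdapter(nn.Module):
        def __init__(self, vit_dim=1024, llm_dim=1536, merge_factor=4):
            super().__init__()
            self.merge_factor = merge_factor
            # The adapter maps merged visual tokens to the LLM hidden size.
            self.proj = nn.Sequential(
                nn.LayerNorm(vit_dim * merge_factor),
                nn.Linear(vit_dim * merge_factor, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def forward(self, vit_tokens):
            # vit_tokens: (batch, num_patches, vit_dim) from the vision encoder.
            b, n, d = vit_tokens.shape
            # Token merge: group neighboring patch tokens to shrink the visual
            # sequence before it is concatenated with text embeddings.
            merged = vit_tokens.reshape(b, n // self.merge_factor, d * self.merge_factor)
            return self.proj(merged)  # (batch, n / merge_factor, llm_dim)

    # Example: 1024 ViT patch tokens become 256 visual tokens for the LLM.
    adapter = VisionLanguageAdapter()
    visual_tokens = adapter(torch.randn(1, 1024, 1024))
    print(visual_tokens.shape)  # torch.Size([1, 256, 1536])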

Training

SAIL-VL is trained on high-quality data with a carefully curated training pipeline. Curriculum design and data scaling are central to its performance: the model's capability scales effectively as the training data expands, leading to improved results. Detailed training methodology is planned for a future release.

Guide: Running Locally

To use SAIL-VL locally, follow these steps:

  1. Install Required Packages:

    pip3 install einops transformers timm
    
  2. Load Image: Use the load_image function provided in the model card to prepare images for input (a minimal sketch appears after this list).

  3. Initialize Model and Tokenizer:

    import torch
    from transformers import AutoModel, AutoTokenizer

    model = AutoModel.from_pretrained(
        "BytedanceDouyinContent/SAIL-VL-2B",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
    ).eval().cuda()
    tokenizer = AutoTokenizer.from_pretrained(
        "BytedanceDouyinContent/SAIL-VL-2B",
        trust_remote_code=True,
        use_fast=False,
    )
    
  4. Run Inference: Use the model to generate responses for text or image-based queries (see the sketch after this list).
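
For step 2, a minimal image-loading helper might look like the sketch below. It assumes the model accepts 448x448 pixel tensors normalized with ImageNet statistics and cast to bfloat16; prefer the load_image function from the model card, which also handles dynamic tiling.

    # Minimal sketch of an image-loading helper (assumptions: 448x448 input,
    # ImageNet normalization, bfloat16 on GPU). Not the model card's version.
    import torch
    import torchvision.transforms as T
    from PIL import Image

    IMAGENET_MEAN = (0.485, 0.456, 0.406)
    IMAGENET_STD = (0.229, 0.224, 0.225)

    def load_image(path, input_size=448):
        transform = T.Compose([
            T.Resize((input_size, input_size), interpolation=T.InterpolationMode.BICUBIC),
            T.ToTensor(),
            T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
        ])
        image = Image.open(path).convert("RGB")
        # Add a batch dimension and match the model's dtype and device.
        return transform(image).unsqueeze(0).to(torch.bfloat16).cuda()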

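For step 4, the sketch below assumes the remote code exposes an InternVL-style model.chat(tokenizer, pixel_values, question, generation_config) interface; check the model card for the exact signature before relying on it.

    # Minimal inference sketch (assumes an InternVL-style chat interface).
    pixel_values = load_image("example.jpg")
    generation_config = dict(max_new_tokens=512, do_sample=False)

    # Text-only query (no image).
    response = model.chat(tokenizer, None, "Hello, who are you?", generation_config)
    print(response)

    # Image-based query: "<image>" marks where the visual tokens are inserted.
    question = "<image>\nPlease describe the image in detail."
    response = model.chat(tokenizer, pixel_values, question, generation_config)
    print(response)
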
For enhanced performance, consider using cloud GPUs such as those provided by AWS, Google Cloud, or Azure.

License

This project is licensed under the Apache License 2.0.
