SAIL-VL-2B
Bytedance Douyin Content
Introduction
SAIL-VL is a vision-language model developed by the Bytedance Douyin Content Team, aimed at high-performance deployment on mobile devices. It prioritizes accessibility and affordability, and outperforms comparably sized models such as Qwen2-VL and InternVL2. SAIL-VL relies on data scaling to improve performance, positioning itself as a foundation model for vision-language applications.
Architecture
SAIL-VL follows a ViT + adapter + LLM structure: an InternViT-300M vision encoder is connected to a Qwen2.5-1.5B language model through an adapter, with token merging to reduce the number of visual tokens and support for multiple input resolutions. This design handles vision-language tasks efficiently and delivers superior performance compared to models of similar size.
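To make this structure concrete, the sketch below shows how a ViT, token merge, and adapter could feed visual tokens into an LLM. The module names, dimensions, and merge factor are illustrative assumptions for exposition, not SAIL-VL's actual implementation.

    # Illustrative sketch of a ViT -> token merge -> adapter -> LLM pipeline.
    # All names and dimensions are assumptions, not SAIL-VL's real modules.
    import torch
    import torch.nn as nn

    class VisionLanguageSketch(nn.Module):
        def __init__(self, vision_tower: nn.Module, language_model: nn.Module,
                     vit_dim: int = 1024, llm_dim: int = 1536, merge: int = 2):
            super().__init__()
            self.vision_tower = vision_tower      # e.g. an InternViT-300M-style patch encoder
            self.language_model = language_model  # e.g. a Qwen2.5-1.5B-style decoder
            self.merge = merge                    # each merge x merge patch group becomes one token
            # Adapter projects merged vision features into the LLM embedding space.
            self.adapter = nn.Sequential(
                nn.Linear(vit_dim * merge * merge, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def merge_tokens(self, feats: torch.Tensor) -> torch.Tensor:
            # feats: (batch, num_patches, vit_dim) from the vision tower, assumed square grid.
            b, n, d = feats.shape
            h = w = int(n ** 0.5)
            m = self.merge
            # Group each m x m neighborhood of patch tokens into one wider token.
            feats = feats.view(b, h // m, m, w // m, m, d).permute(0, 1, 3, 2, 4, 5)
            return feats.reshape(b, (h // m) * (w // m), m * m * d)

        def encode_image(self, pixel_values: torch.Tensor) -> torch.Tensor:
            patch_feats = self.vision_tower(pixel_values)   # (b, num_patches, vit_dim)
            merged = self.merge_tokens(patch_feats)         # fewer, wider tokens
            return self.adapter(merged)                     # (b, num_tokens, llm_dim)

The language model then consumes these projected visual embeddings alongside the text embeddings; merging trades a small amount of spatial detail for a much shorter visual sequence, which is what makes the design attractive for mobile deployment.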
Training
SAIL-VL is trained on high-quality data with a carefully curated training pipeline. Curriculum design and data scaling are crucial to its performance: the model's capability scales effectively as the training data expands, leading to improved results. A more detailed description of the training methodology is to be released.
Guide: Running Locally
To use SAIL-VL locally, follow these steps:
- Install Required Packages:
  pip3 install einops transformers timm
- Load Image: Use the provided load_image function to prepare images for input (a simplified stand-in is shown in the example after this list).
- Initialize Model and Tokenizer:
  import torch
  from transformers import AutoModel, AutoTokenizer

  model = AutoModel.from_pretrained(
      "BytedanceDouyinContent/SAIL-VL-2B",
      torch_dtype=torch.bfloat16,
      trust_remote_code=True,
  ).eval().cuda()
  tokenizer = AutoTokenizer.from_pretrained(
      "BytedanceDouyinContent/SAIL-VL-2B",
      trust_remote_code=True,
      use_fast=False,
  )
- Run Inference: Use the model to generate responses to text or image-based queries (see the end-to-end example below).
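The example below ties these steps together. The load_image helper here is a simplified, single-tile stand-in for the one provided with the model, and the model.chat call and the "<image>" placeholder assume an InternVL-style chat interface; treat this as a sketch to adapt against the official model card rather than the definitive API.

    # Minimal end-to-end sketch for running SAIL-VL-2B locally.
    # Assumptions: simplified single-tile load_image; InternVL-style model.chat interface.
    import torch
    import torchvision.transforms as T
    from PIL import Image
    from torchvision.transforms.functional import InterpolationMode
    from transformers import AutoModel, AutoTokenizer

    IMAGENET_MEAN = (0.485, 0.456, 0.406)
    IMAGENET_STD = (0.229, 0.224, 0.225)

    def load_image(image_file, input_size=448):
        # Resize to a single input_size x input_size tile and normalize with ImageNet statistics.
        transform = T.Compose([
            T.Lambda(lambda img: img.convert("RGB")),
            T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
            T.ToTensor(),
            T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
        ])
        image = Image.open(image_file)
        return transform(image).unsqueeze(0)  # (1, 3, input_size, input_size)

    model = AutoModel.from_pretrained(
        "BytedanceDouyinContent/SAIL-VL-2B",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
    ).eval().cuda()
    tokenizer = AutoTokenizer.from_pretrained(
        "BytedanceDouyinContent/SAIL-VL-2B",
        trust_remote_code=True,
        use_fast=False,
    )

    pixel_values = load_image("example.jpg").to(torch.bfloat16).cuda()
    generation_config = dict(max_new_tokens=512, do_sample=False)

    # Image-based query; "<image>" marks where the visual tokens are inserted.
    question = "<image>\nDescribe this image in detail."
    response = model.chat(tokenizer, pixel_values, question, generation_config)
    print(response)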
For enhanced performance, consider using cloud GPUs such as those provided by AWS, Google Cloud, or Azure.
License
This project is licensed under the Apache License 2.0.