DeepSeek-VL2-Tiny
Introduction
DeepSeek-VL2 is a series of advanced Mixture-of-Experts (MoE) Vision-Language Models designed to outperform its predecessor, DeepSeek-VL. It excels in various tasks such as visual question answering, optical character recognition, and visual grounding. The series includes three models: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small, and DeepSeek-VL2, with 1.0B, 2.8B, and 4.5B activated parameters, respectively. These models achieve competitive or state-of-the-art performance with fewer activated parameters than existing models.
Architecture
DeepSeek-VL2-Tiny is built on the DeepSeekMoE-3B base language model and uses 1.0 billion activated parameters. All models in the series employ a Mixture-of-Experts approach, activating only a subset of parameters per token, which improves efficiency on multimodal understanding tasks.
Model Variants
The main difference between DeepSeek-VL2-Tiny, DeepSeek-VL2-Small, and DeepSeek-VL2 lies in the base language model used, allowing you to choose the scale that best matches your task requirements and compute budget; the corresponding checkpoints are summarized in the sketch below.
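As a minimal sketch of how the variants map to Hugging Face checkpoints (the DEEPSEEK_VL2_CHECKPOINTS mapping and checkpoint_for helper are hypothetical, and the deepseek-ai/deepseek-vl2 repository ID for the largest model is assumed from the naming of the Small checkpoint used in the example further below):
# Hypothetical helper mapping variant names to Hugging Face model paths.
DEEPSEEK_VL2_CHECKPOINTS = {
    "tiny": "deepseek-ai/deepseek-vl2-tiny",    # ~1.0B activated parameters
    "small": "deepseek-ai/deepseek-vl2-small",  # ~2.8B activated parameters
    "base": "deepseek-ai/deepseek-vl2",         # ~4.5B activated parameters (assumed ID)
}

def checkpoint_for(variant: str = "tiny") -> str:
    """Return the Hugging Face model path for the requested DeepSeek-VL2 variant."""
    return DEEPSEEK_VL2_CHECKPOINTS[variant]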
Guide: Running Locally
Installation
Ensure you have Python >= 3.8 and, from inside a clone of the DeepSeek-VL2 code repository, install the necessary dependencies with:
pip install -e .
Simple Inference Example
To run inference, use the following Python example:
import torch
from transformers import AutoModelForCausalLM

# These classes ship with the DeepSeek-VL2 code repository installed above
# (the package name is assumed to be deepseek_vl2 rather than deepseek_vl).
from deepseek_vl2.models import DeepseekVLV2Processor, DeepseekVLV2ForCausalLM
from deepseek_vl2.utils.io import load_pil_images
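# This card describes DeepSeek-VL2-Tiny; substitute "deepseek-ai/deepseek-vl2-tiny"
# here to run the Tiny checkpoint instead of Small.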
model_path = "deepseek-ai/deepseek-vl2-small"
vl_chat_processor = DeepseekVLV2Processor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
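# Load the weights (trust_remote_code=True lets transformers use the model's
# custom architecture code), cast to bfloat16, and move the model to the GPU.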
vl_gpt = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()
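# A single-turn conversation: the <image> placeholder marks where the image is
# inserted, and <|ref|>...<|/ref|> tags the phrase to ground in the image.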
conversation = [
    {
        "role": "<|User|>",
        "content": "<image>\n<|ref|>The giraffe at the back.<|/ref|>.",
        "images": ["./images/visual_grounding.jpeg"],
    },
    {"role": "<|Assistant|>", "content": ""},
]
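# Load the referenced images as PIL images and batch everything into model inputs.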
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True,
    system_prompt=""
).to(vl_gpt.device)
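# Combine image features and text token embeddings into a single input sequence.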
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
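# Run the language model on the fused embeddings to generate the response.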
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
)
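# Decode the generated token IDs back into text, dropping special tokens.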
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)
Suggestions for Cloud GPUs
The example above casts the model to bfloat16 and calls .cuda(), so a CUDA-capable GPU is required; if you lack suitable local hardware, consider cloud-based GPU services such as AWS, Google Cloud, or Azure.
License
The code repository is licensed under the MIT License. The models are subject to the DeepSeek Model License, which permits commercial use; refer to that license for the full terms.