DeepSeek-VL2
Introduction
DeepSeek-VL2 is a series of advanced Mixture-of-Experts (MoE) Vision-Language Models designed for enhanced multimodal understanding. It builds upon DeepSeek-VL, offering improved performance in tasks like visual question answering, optical character recognition, and visual grounding. The series includes three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small, and DeepSeek-VL2, with 1.0B, 2.8B, and 4.5B activated parameters, respectively. It achieves competitive or superior performance with fewer activated parameters than other open-source models.
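For reference, the three variants are published as separate checkpoints on Hugging Face. A minimal lookup sketch follows; the model IDs for Tiny and the base model are assumed to mirror the naming of the Small checkpoint used in the example further below.
# Hugging Face model IDs for each variant (Tiny and base IDs assumed to follow
# the same naming pattern as the Small checkpoint used in the example below).
MODEL_IDS = {
    "tiny": "deepseek-ai/deepseek-vl2-tiny",    # ~1.0B activated parameters
    "small": "deepseek-ai/deepseek-vl2-small",  # ~2.8B activated parameters
    "base": "deepseek-ai/deepseek-vl2",         # ~4.5B activated parameters
}
model_path = MODEL_IDS["small"]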
Architecture
DeepSeek-VL2 builds its language model on the DeepSeekMoE architecture, with the largest variant based on DeepSeekMoE-27B. The Mixture-of-Experts design activates only a fraction of the total parameters for each token, which is how the three variants cover different compute budgets (1.0B to 4.5B activated parameters) while remaining efficient at inference time.
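Because only a few experts are activated per token, the activated-parameter counts above are much smaller than the models' total parameter counts. The following is an illustrative top-k gating sketch in PyTorch, not DeepSeek's actual routing code; the hidden size, expert count, and top-k value are arbitrary placeholders.
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Illustrative top-k expert routing, not the DeepSeekMoE implementation."""
    def __init__(self, hidden_size=64, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(hidden_size, hidden_size) for _ in range(num_experts)])
        self.gate = nn.Linear(hidden_size, num_experts)
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, hidden_size)
        scores = self.gate(x).softmax(dim=-1)                # routing probabilities per expert
        weights, indices = scores.topk(self.top_k, dim=-1)   # keep only the top-k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                 # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out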
Training
Training follows the Mixture-of-Experts setup described above, optimizing the model across a range of multimodal tasks. Specific details about the training datasets, methodology, and performance benchmarks are available in the accompanying academic paper and GitHub repository.
Guide: Running Locally
Installation
To run DeepSeek-VL2 locally, ensure you have Python >= 3.8 installed. Clone the GitHub repository and install the necessary dependencies from the repository root:
git clone https://github.com/deepseek-ai/DeepSeek-VL2
cd DeepSeek-VL2
pip install -e .
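After installation, a quick sanity check can confirm that PyTorch sees a GPU and that the package imports correctly (the deepseek_vl package name is assumed from the import used in the inference example below):
# Sanity check: verify the package import and GPU availability.
import torch
import deepseek_vl  # package name assumed from the inference example below

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())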
Simple Inference Example
import torch
from transformers import AutoModelForCausalLM
from deepseek_vl.models import DeepseekVLV2Processor, DeepseekVLV2ForCausalLM
from deepseek_vl.utils.io import load_pil_images

# Load the processor, tokenizer, and model (bfloat16 on GPU, eval mode).
model_path = "deepseek-ai/deepseek-vl2-small"
vl_chat_processor = DeepseekVLV2Processor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
vl_gpt = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).to(torch.bfloat16).cuda().eval()

# Single-turn visual grounding conversation; the empty assistant turn marks where generation begins.
conversation = [
    {"role": "<|User|>", "content": "<image>\n<|ref|>The giraffe at the back.<|/ref|>.", "images": ["./images/visual_grounding.jpeg"]},
    {"role": "<|Assistant|>", "content": ""},
]

# Load the referenced images and prepare batched model inputs.
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(conversations=conversation, images=pil_images, force_batchify=True, system_prompt="").to(vl_gpt.device)

# Encode images and text into input embeddings, then generate a response.
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
outputs = vl_gpt.language_model.generate(inputs_embeds=inputs_embeds, attention_mask=prepare_inputs.attention_mask, pad_token_id=tokenizer.eos_token_id, max_new_tokens=512)

# Decode and print the generated answer.
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(answer)
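The same pipeline also handles plain visual question answering; only the conversation changes, while the image loading, processing, and generation steps stay the same. A minimal sketch, with a placeholder image path and question:
# Plain VQA prompt: the image path and question below are illustrative placeholders.
conversation = [
    {"role": "<|User|>", "content": "<image>\nDescribe what is happening in this picture.", "images": ["./images/example.jpeg"]},
    {"role": "<|Assistant|>", "content": ""},
]
# Then re-run load_pil_images, the processor call, prepare_inputs_embeds, and generate as above.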
Cloud GPUs
For better performance, especially with larger model variants, consider using cloud GPU services like AWS, Google Cloud, or Azure.
License
The code repository for DeepSeek-VL2 is licensed under the MIT License. Use of the models is governed by the DeepSeek Model License, which permits commercial applications. Full license details can be found in the LICENSE-MODEL file.