DeepSeek-VL2-Tiny

deepseek-ai

Introduction

DeepSeek-VL2 is a series of advanced Mixture-of-Experts (MoE) Vision-Language Models designed to outperform its predecessor, DeepSeek-VL. It excels in various tasks such as visual question answering, optical character recognition, and visual grounding. The series includes three models: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small, and DeepSeek-VL2, with 1.0B, 2.8B, and 4.5B activated parameters, respectively. These models achieve competitive or state-of-the-art performance with fewer activated parameters than existing models.

Architecture

DeepSeek-VL2-Tiny is built on DeepSeekMoE-3B and uses 1.0 billion activated parameters. All models in the series employ a Mixture-of-Experts (MoE) architecture, which routes each token to only a subset of expert parameters and improves efficiency on multimodal understanding tasks.
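
For intuition, here is a minimal, self-contained sketch of top-k expert routing in PyTorch. It is illustrative only and is not DeepSeek's actual DeepSeekMoE implementation; the hidden size, expert count, and top-k value are arbitrary assumptions. It shows why the activated parameter count (only the experts a token is routed to) is much smaller than the total parameter count.

import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Toy top-k Mixture-of-Experts feed-forward layer (illustrative only)."""
    def __init__(self, dim=512, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):                                # x: (num_tokens, dim)
        probs = self.router(x).softmax(dim=-1)           # routing probabilities
        weights, idx = probs.topk(self.top_k, dim=-1)    # top-k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                 # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Example: 16 tokens pass through the toy layer, each using only 2 of 8 experts.
layer = TopKMoE()
print(layer(torch.randn(16, 512)).shape)  # torch.Size([16, 512])

In DeepSeek-VL2-Tiny, only about 1.0B of the model's total parameters are activated per token in this fashion.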

Training

The main difference between DeepSeek-VL2-Tiny, DeepSeek-VL2-Small, and DeepSeek-VL2 lies in the base language model each is built on, so you can pick the variant that matches your task and compute requirements.
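
For reference, the three variants correspond to the following Hugging Face model IDs (this mapping is a convenience for this guide, not part of the official code):

MODEL_VARIANTS = {
    "tiny":  "deepseek-ai/deepseek-vl2-tiny",   # 1.0B activated parameters
    "small": "deepseek-ai/deepseek-vl2-small",  # 2.8B activated parameters
    "base":  "deepseek-ai/deepseek-vl2",        # 4.5B activated parameters
}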

Guide: Running Locally

Installation

Ensure you have Python >= 3.8, clone the DeepSeek-VL2 code repository (https://github.com/deepseek-ai/DeepSeek-VL2), and install the package and its dependencies from the repository root with:

pip install -e .

Simple Inference Example

To run inference, use the following Python example:

import torch
from transformers import AutoModelForCausalLM
from deepseek_vl2.models import DeepseekVLV2Processor, DeepseekVLV2ForCausalLM
from deepseek_vl2.utils.io import load_pil_images

# Load the processor/tokenizer and model weights for the Tiny checkpoint.
model_path = "deepseek-ai/deepseek-vl2-tiny"
vl_chat_processor = DeepseekVLV2Processor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

# Run the model in bfloat16 on the GPU in inference mode.
vl_gpt = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

# A single-turn conversation; <image> marks where the image is inserted, and the
# <|ref|>...<|/ref|> tags request visual grounding of the described object.
conversation = [
    {
        "role": "<|User|>",
        "content": "<image>\n<|ref|>The giraffe at the back.<|/ref|>.",
        "images": ["./images/visual_grounding.jpeg"],
    },
    {"role": "<|Assistant|>", "content": ""},
]

# Load the referenced images and convert the conversation into model inputs.
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True,
    system_prompt=""
).to(vl_gpt.device)

# Embed the image and text tokens into a single input sequence.
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# Generate the response with the underlying language model.
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=512
)

# Decode the generated tokens and print the answer after the formatted prompt.
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)
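
The example above uses the visual grounding prompt format with <|ref|> tags. For plain visual question answering, the same pipeline works with an ordinary question; the conversation below is an assumed variation (the image path is a placeholder), not an official example:

conversation = [
    {
        "role": "<|User|>",
        "content": "<image>\nDescribe this image in detail.",
        "images": ["./images/example.jpeg"],
    },
    {"role": "<|Assistant|>", "content": ""},
]

All remaining steps (load_pil_images, the processor call, prepare_inputs_embeds, and generate) are unchanged.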

Suggestions for Cloud GPUs

The example above requires a CUDA-capable GPU. If you do not have one locally, cloud GPU services such as AWS, Google Cloud, or Azure are suitable options.

License

The code repository is licensed under the MIT License. The models are subject to the DeepSeek Model License, which permits commercial use; refer to that license for details.
