DeepSeek-VL2-Small (deepseek-ai)
Introduction
DeepSeek-VL2 is a series of advanced Mixture-of-Experts (MoE) Vision-Language Models that improves upon its predecessor, DeepSeek-VL. The models are designed for tasks such as visual question answering, optical character recognition, and visual grounding. The series includes three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small, and DeepSeek-VL2, with 1.0B, 2.8B, and 4.5B activated parameters, respectively. Across these tasks, DeepSeek-VL2 achieves competitive performance with similar or fewer activated parameters than existing open-source models.
Architecture
DeepSeek-VL2-Small is built on the DeepSeekMoE-16B framework. Its Mixture-of-Experts layers activate only a subset of expert parameters for each token, which keeps the activated parameter count low while supporting multimodal understanding across a range of tasks.
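The exact DeepSeekMoE architecture is described in the linked paper and repository; as a rough illustration of the general mechanism, the sketch below shows a generic top-k expert-routing layer in PyTorch. All names, sizes, and the routing scheme here are illustrative and are not taken from DeepSeek-VL2.

# Generic top-k MoE routing sketch (illustrative only; not the actual
# DeepSeekMoE implementation, whose details are in the linked paper/repo).
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    def __init__(self, hidden_dim=64, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_dim, num_experts)  # per-token routing logits
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_dim, 4 * hidden_dim),
                          nn.GELU(),
                          nn.Linear(4 * hidden_dim, hidden_dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                        # x: (tokens, hidden_dim)
        scores = self.router(x).softmax(dim=-1)  # routing probabilities per token
        weights, idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # Only the top-k experts run for each token: most parameters stay inactive.
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * self.experts[e](x[mask])
        return out

layer = ToyMoELayer()
print(layer(torch.randn(5, 64)).shape)  # torch.Size([5, 64])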
Training
Training also exploits the Mixture-of-Experts (MoE) architecture's efficient parameter utilization, which underpins the models' strong performance on vision and language tasks. The training details, including the specific datasets and methodologies, are described in the linked research paper and GitHub repository.
Guide: Running Locally
Installation
- Use a Python environment with version 3.8 or higher.
- From the root of the cloned DeepSeek-VL2 code repository, install the dependencies in editable mode:
pip install -e .
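As a quick sanity check, the short snippet below (a minimal sketch; the deepseek_vl package name is taken from the imports in the inference example further down) verifies that the key dependencies are importable:

import importlib.util

# Quick post-install check; assumes "pip install -e ." was run from the root
# of the cloned DeepSeek-VL2 code repository.
for pkg in ("torch", "transformers", "deepseek_vl"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'installed' if found else 'missing'}")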
Inference Example
The following is a simple example of running inference with the model:
import torch
from transformers import AutoModelForCausalLM
from deepseek_vl.models import DeepseekVLV2Processor, DeepseekVLV2ForCausalLM
from deepseek_vl.utils.io import load_pil_images
# Specify the path to the model
model_path = "deepseek-ai/deepseek-vl2-small"
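# Load the chat processor and its tokenizer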
vl_chat_processor = DeepseekVLV2Processor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
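# Load the model (trust_remote_code pulls in the custom DeepSeek-VL2 model code)
# and move it to the GPU in bfloat16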
vl_gpt = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()
# Define a conversation
conversation = [
    {
        "role": "<|User|>",
        "content": "<image>\n<|ref|>The giraffe at the back.<|/ref|>.",
        "images": ["./images/visual_grounding.jpeg"],
    },
    {"role": "<|Assistant|>", "content": ""},
]
# Load images and prepare inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True
).to(vl_gpt.device)
# Build the combined image-text input embeddings, then generate the response
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True
)
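# Decode the generated tokens and print them after the formatted prompt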
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)
Cloud GPUs
For efficient computation, consider running the model on cloud GPU platforms such as AWS EC2, Google Cloud, or Azure.
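Whichever provider is used, a quick check with standard PyTorch calls (nothing DeepSeek-specific) confirms that a CUDA device is visible and that it supports the bfloat16 precision used in the example above:

import torch

# Confirm a CUDA GPU is available and supports bfloat16 before loading the model.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("bfloat16 supported:", torch.cuda.is_bf16_supported())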
License
The code repository is licensed under the MIT License. Use of DeepSeek-VL2 models is governed by the DeepSeek Model License, which permits commercial use. Further details can be found in the LICENSE-MODEL file.