DeepSeek-V2 Documentation

Introduction

DeepSeek-V2 is a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated per token. Compared with its predecessor, DeepSeek 67B, it delivers stronger performance while saving 42.5% of training costs, reducing the KV cache by 93.3%, and boosting maximum generation throughput to 5.76 times. The model is pretrained on 8.1 trillion tokens and further aligned with Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL).

Architecture

DeepSeek-V2 employs an innovative architecture designed for efficiency:

  • MLA (Multi-head Latent Attention): Uses low-rank key-value joint compression to eliminate the inference-time key-value cache bottleneck (a rough sketch follows this list).
  • DeepSeekMoE: A high-performance MoE architecture enabling stronger models at lower costs.
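
Below is a minimal, illustrative sketch of the low-rank joint key-value compression at the core of MLA: the hidden state is projected down to a small latent vector, only that latent is cached, and keys and values are recovered by up-projection at attention time. The dimensions and layer names are assumptions for illustration and omit details of the actual DeepSeek-V2 implementation (multi-head layout, decoupled RoPE, query compression).

import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
d_model, d_latent, n_heads, d_head = 1024, 128, 16, 64

down_kv = nn.Linear(d_model, d_latent, bias=False)        # compress the hidden state
up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # recover keys from the latent
up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # recover values from the latent

h = torch.randn(1, 8, d_model)                 # (batch, seq_len, hidden)
c_kv = down_kv(h)                              # only this latent is kept in the KV cache
k = up_k(c_kv).view(1, 8, n_heads, d_head)     # reconstructed keys
v = up_v(c_kv).view(1, 8, n_heads, d_head)     # reconstructed values
print(c_kv.shape, k.shape, v.shape)

Caching the latent (128 values per token in this sketch) instead of full per-head keys and values (2 × 16 × 64 = 2048 values) is what yields the large KV cache reduction.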

Training

The model was pretrained on a diverse corpus of 8.1 trillion tokens. Following pretraining, it underwent Supervised Fine-Tuning and Reinforcement Learning to further unlock its potential, achieving strong results on standard benchmarks and in open-ended generation evaluations.

Guide: Running Locally

Requirements

  • GPUs: BF16 inference requires 8×80GB GPUs (a back-of-envelope memory estimate follows).
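
As a rough sanity check on this requirement (the figures below are approximations): 236B parameters at 2 bytes each in BF16 occupy about 472 GB, so the weights alone must be sharded across all eight 80GB devices, leaving the remaining capacity for the KV cache and activations.

# Back-of-envelope estimate of why a single 80GB GPU is not enough (weights only).
total_params = 236e9      # all experts stay resident, even though only 21B activate per token
bytes_per_param = 2       # BF16
weights_gb = total_params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB of weights vs. {8 * 80} GB across 8 GPUs")  # ~472 GB vs. 640 GB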

Inference with Hugging Face's Transformers

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/DeepSeek-V2"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# `max_memory` should be set according to your devices; here, eight GPUs with 75GB usable each.
max_memory = {i: "75GB" for i in range(8)}
# `device_map` cannot be set to "auto" for this model; use "sequential" together with `max_memory`.
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map="sequential", torch_dtype=torch.bfloat16, max_memory=max_memory, attn_implementation="eager")
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

text = "An attention function can be described as mapping a query and a set of key-value pairs to an output..."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
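
The snippet above performs plain text completion with the base model. For the chat model (deepseek-ai/DeepSeek-V2-Chat), prompts are built with the tokenizer's chat template instead; a minimal sketch that reuses the `model` and `tokenizer` objects loaded above, assuming the Chat checkpoint is the one loaded (the prompt content is illustrative):

# Build a chat prompt with the tokenizer's chat template, then generate as before.
messages = [{"role": "user", "content": "Write a piece of quicksort code in C++."}]
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=100)
# Decode only the newly generated tokens, skipping the prompt.
result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)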

Inference with vLLM (Recommended)

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "deepseek-ai/DeepSeek-V2-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, tensor_parallel_size=8, max_model_len=8192, trust_remote_code=True, enforce_eager=True)
sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])

messages = [{"role": "user", "content": "Who are you?"}]  # a single conversation
# Tokenize the conversation with the model's chat template and pass the token ids to vLLM.
prompt_token_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
outputs = llm.generate(prompt_token_ids=[prompt_token_ids], sampling_params=sampling_params)
generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)
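
vLLM also handles several conversations in one batched call; a short sketch reusing the `llm`, `tokenizer`, and `sampling_params` objects from above (the extra prompt is illustrative):

# Tokenize each conversation separately, then generate for all of them at once.
messages_list = [
    [{"role": "user", "content": "Who are you?"}],
    [{"role": "user", "content": "Write a piece of quicksort code in C++."}],
]
batched_token_ids = [tokenizer.apply_chat_template(m, add_generation_prompt=True) for m in messages_list]
batched_outputs = llm.generate(prompt_token_ids=batched_token_ids, sampling_params=sampling_params)
print([o.outputs[0].text for o in batched_outputs])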

Cloud GPUs

For optimal performance, consider using cloud GPUs available from providers like AWS, Google Cloud, or Azure.

License

The code in this repository is released under the MIT License. The DeepSeek-V2 models (Base and Chat) are subject to a separate Model License that permits commercial use. The full license texts for the code and the models are included in the repository.
