DeepSeek-V2
deepseek-ai/DeepSeek-V2 Documentation
Introduction
DeepSeek-V2 is a Mixture-of-Experts (MoE) language model designed for economical training and efficient inference. It has 236B total parameters, of which 21B are activated for each token. Compared with its predecessor, DeepSeek 67B, it achieves stronger performance while reducing training costs by 42.5%, shrinking the KV cache by 93.3%, and raising maximum generation throughput to 5.76 times that of DeepSeek 67B. The model is pretrained on 8.1 trillion tokens and then refined with Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL).
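To put the headline figures in proportion, the short calculation below works only from the numbers quoted above. The variable names are illustrative and are not part of any DeepSeek tooling.

```python
# Figures quoted in the introduction above.
total_params_b = 236.0        # total parameters, in billions
active_params_b = 21.0        # parameters activated per token, in billions
kv_cache_reduction = 0.933    # KV cache reduction relative to DeepSeek 67B
training_cost_saving = 0.425  # training cost saving relative to DeepSeek 67B

# Share of the model's parameters that take part in processing one token.
active_fraction = active_params_b / total_params_b

# What remains, relative to DeepSeek 67B, after the quoted reductions.
kv_cache_remaining = 1.0 - kv_cache_reduction
training_cost_remaining = 1.0 - training_cost_saving

print(f"Activated per token: {active_fraction:.1%} of all parameters")
print(f"KV cache relative to DeepSeek 67B: {kv_cache_remaining:.1%}")
print(f"Training cost relative to DeepSeek 67B: {training_cost_remaining:.1%}")
```

In other words, each token touches a little under a tenth of the model's parameters, which is where much of the inference efficiency of an MoE design comes from.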
Architecture
DeepSeek-V2 employs an innovative architecture designed for efficiency:
- MLA (Multi-head Latent Attention): Uses low-rank joint compression of keys and values to remove the inference-time bottleneck caused by the key-value (KV) cache.
- DeepSeekMoE: A high-performance MoE architecture that enables training stronger models at lower cost.
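As a rough illustration of why an MoE model activates only part of its parameters per token, the sketch below shows generic top-k expert routing in plain Python. It is not DeepSeekMoE itself, whose routing details this document does not describe; the expert count, `top_k`, the gate scores, and the toy experts are all hypothetical.

```python
import math


def softmax(scores):
    """Convert raw gate scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_scores, top_k):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts.

    Weights are renormalized over the selected experts so they sum to 1.
    """
    probs = softmax(gate_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(x, experts, gate_scores, top_k):
    """Combine the outputs of the selected experts; the others are never called."""
    return sum(w * experts[i](x) for i, w in route_top_k(gate_scores, top_k))


# Hypothetical toy setup: four scalar "experts", two of them active per token.
experts = [lambda x, k=k: x * (k + 1) for k in range(4)]
gate_scores = [0.1, 2.0, -1.0, 1.5]

routing = route_top_k(gate_scores, top_k=2)
output = moe_forward(3.0, experts, gate_scores, top_k=2)
print(routing)
print(output)
```

Only the two selected experts run for this input, so the compute per token scales with `top_k` rather than with the total number of experts.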
Training
The model was pretrained on a diverse dataset of 8.1 trillion tokens. Following pretraining, the model underwent Supervised Fine-Tuning and Reinforcement Learning to maximize its potential, achieving outstanding performance in standard benchmarks and open-ended generation evaluations.
Guide: Running Locally
Requirements
- GPUs: BF16 inference requires 8 GPUs with 80GB of memory each.
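The eight-GPU requirement can be sanity-checked with simple arithmetic on the weights alone. The estimate below assumes 2 bytes per parameter for BF16 and decimal gigabytes (1 GB = 10^9 bytes). It ignores activations, the KV cache, and framework overhead, so it is a lower bound rather than a sizing guide.

```python
total_params = 236e9   # total parameters, from the introduction
bytes_per_param = 2    # BF16 stores each value in 16 bits
num_gpus = 8
gpu_memory_gb = 80

# Memory needed just to hold the weights.
weights_gb = total_params * bytes_per_param / 1e9

# Aggregate memory across all GPUs, and the per-GPU share of the weights
# under the simplifying assumption of an even split.
cluster_gb = num_gpus * gpu_memory_gb
per_gpu_weights_gb = weights_gb / num_gpus

print(f"Weights: {weights_gb:.0f} GB")
print(f"Total GPU memory: {cluster_gb} GB")
print(f"Weights per GPU if split evenly: {per_gpu_weights_gb:.0f} GB")
```

Under these assumptions the weights take roughly 472 GB of the 640 GB available, which leaves the remainder for activations and the KV cache.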
Inference with Hugging Face's Transformers
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
model_name = "deepseek-ai/DeepSeek-V2"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
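# Cap the memory used on each of the 8 GPUs; set this based on your devices.
max_memory = {i: "75GB" for i in range(8)}
# Load the weights in BF16 and fill the GPUs one after another ("sequential").
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="sequential",
    torch_dtype=torch.bfloat16,
    max_memory=max_memory,
    attn_implementation="eager",
)
model.generation_config = GenerationConfig.from_pretrained(model_name)
# Use the end-of-sequence token as the padding token during generation.
model.generation_config.pad_token_id = model.generation_config.eos_token_id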
text = "An attention function can be described as mapping a query and a set of key-value pairs to an output..."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
Inference with vLLM (Recommended)
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
model_name = "deepseek-ai/DeepSeek-V2-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)
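# Shard the model across 8 GPUs with tensor parallelism.
llm = LLM(model=model_name, tensor_parallel_size=8, max_model_len=8192, trust_remote_code=True, enforce_eager=True)
# Stop generating as soon as the end-of-sequence token is produced.
sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
# A single conversation, given as a list of chat messages.
messages_list = [{"role": "user", "content": "Who are you?"}]
# Apply the chat template; by default this returns the prompt as token IDs.
prompt_token_ids = tokenizer.apply_chat_template(messages_list, add_generation_prompt=True)
# Note: newer vLLM releases may expect token prompts via the `prompts` argument instead.
outputs = llm.generate(prompt_token_ids=[prompt_token_ids], sampling_params=sampling_params)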
generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)
Cloud GPUs
If you do not have local hardware that meets the requirements above, consider renting cloud GPUs from providers such as AWS, Google Cloud, or Azure.
License
The code in this repository is released under the MIT License. Use of the DeepSeek-V2 models (Base and Chat) is subject to a separate Model License, which permits commercial use. See the repository's license files for the full terms covering the code and the models.