DeepSeek-V2-Lite
deepseek-ai
Introduction
DeepSeek-V2-Lite is a compact version of the DeepSeek-V2 model, designed for economical training and efficient inference. It has 16 billion total parameters, of which 2.4 billion are active per token, and was trained from scratch on 5.7 trillion tokens. It is reported to outperform other models of comparable size on many English and Chinese benchmarks. It can be deployed on a single 40GB GPU and fine-tuned on a multi-GPU setup.
Architecture
DeepSeek-V2 utilizes Multi-head Latent Attention (MLA) and DeepSeekMoE architectures. MLA compresses the Key-Value (KV) cache into a latent vector for efficient inference, while DeepSeekMoE supports economical training through sparse computation. DeepSeek-V2-Lite has 27 layers with a hidden dimension of 2048, using 16 attention heads. It employs MoE layers with 2 shared experts and 64 routed experts per layer.
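To make the shared-plus-routed expert layout concrete, here is a minimal, toy-scale sketch of sparse routing in plain Python. It is not the DeepSeekMoE implementation: the experts are scalar functions, the gate logits are random, and `TOP_K = 6` is an assumption, since this section does not say how many routed experts are activated per token. Only the counts of shared and routed experts come from the text.

```python
import math
import random

NUM_SHARED = 2    # from the text: shared experts per MoE layer
NUM_ROUTED = 64   # from the text: routed experts per MoE layer
TOP_K = 6         # assumption: number of routed experts used per token


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(gate_logits, top_k=TOP_K):
    """Return (expert_index, gate_weight) pairs for the top_k routed experts."""
    probs = softmax(gate_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:top_k]]


def moe_output(x, shared_experts, routed_experts, gate_logits):
    """Residual + every shared expert + gate-weighted selected routed experts."""
    out = x  # residual connection
    out += sum(expert(x) for expert in shared_experts)
    for index, weight in route(gate_logits):
        out += weight * routed_experts[index](x)
    return out


random.seed(0)
# Toy experts: each one just scales its input by a fixed factor.
shared = [lambda x, s=s: s * x for s in (0.5, 1.5)]
routed = [lambda x, s=s: 0.01 * s * x for s in range(NUM_ROUTED)]
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_ROUTED)]

selected = route(logits)
y = moe_output(1.0, shared, routed, logits)
print("selected routed experts:", [i for i, _ in selected])
print("layer output for x=1.0:", y)
```

Only the selected routed experts are evaluated for a given token, which is where the sparse-computation saving comes from; the shared experts run for every token.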
Training
The model was trained from scratch on 5.7 trillion tokens using the AdamW optimizer and a warmup-and-step-decay learning rate strategy. The batch size was held constant at 4608 sequences, with a maximum sequence length of 4K tokens. Post-training includes long-context extension and fine-tuning for chat applications.
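The figures above imply a rough step count, and the schedule shape can be sketched in a few lines. The snippet below is illustrative only: it assumes "4K" means 4096 tokens and that every sequence is full length, and the peak learning rate, warmup length, decay factor, and decay milestones in `lr_at` are placeholder values that do not come from this section.

```python
SEQUENCES_PER_BATCH = 4608   # from the text
MAX_SEQ_LEN = 4096           # assumption: "4K" means 4096 tokens
TOTAL_TOKENS = 5.7e12        # from the text

# Upper bound on tokens consumed per optimizer step (all sequences full length).
tokens_per_step = SEQUENCES_PER_BATCH * MAX_SEQ_LEN
total_steps = round(TOTAL_TOKENS / tokens_per_step)


def lr_at(step, total_steps, max_lr=4.2e-4, warmup_steps=2000,
          decay=0.316, milestones=(0.8, 0.9)):
    """Warmup-and-step-decay schedule (all default values are placeholders).

    Linear warmup from 0 to max_lr, then a constant rate that is multiplied
    by `decay` each time training passes one of the `milestones` fractions.
    """
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    lr = max_lr
    for fraction in milestones:
        if step >= fraction * total_steps:
            lr *= decay
    return lr


print(f"tokens per step: {tokens_per_step:,}")
print(f"approximate optimizer steps: {total_steps:,}")
for fraction in (0.0, 0.5, 0.85, 0.95):
    step = int(fraction * total_steps)
    print(f"step {step:>7,}: lr = {lr_at(step, total_steps):.3e}")
```

Under these assumptions a single step covers about 18.9 million tokens, so the full run works out to roughly 300,000 optimizer steps.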
Guide: Running Locally
Running DeepSeek-V2-Lite locally requires a single GPU with 40GB of memory. Inference can be done with Hugging Face's Transformers library, or with vLLM for better throughput.
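A quick back-of-the-envelope calculation shows why a 40GB card is a sensible fit. This sketch assumes the weights are stored in bfloat16 (2 bytes per parameter, as in the Transformers example in this guide), uses the rounded 16 billion parameter figure from the introduction, and treats 40GB as 40e9 bytes. It ignores activations, the KV cache, and framework overhead, so the headroom number is an optimistic upper bound.

```python
TOTAL_PARAMS = 16e9          # from the text (rounded)
ACTIVE_PARAMS = 2.4e9        # from the text
BYTES_PER_PARAM_BF16 = 2     # bfloat16 stores each parameter in 2 bytes
GPU_MEMORY_GB = 40           # from the text, read as decimal gigabytes

weight_bytes = TOTAL_PARAMS * BYTES_PER_PARAM_BF16
weight_gb = weight_bytes / 1e9        # decimal gigabytes
weight_gib = weight_bytes / 2**30     # binary gibibytes
headroom_gb = GPU_MEMORY_GB - weight_gb
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"weights in bf16: {weight_gb:.1f} GB ({weight_gib:.1f} GiB)")
print(f"headroom on a {GPU_MEMORY_GB} GB GPU: {headroom_gb:.1f} GB")
print(f"parameters active per token: {active_fraction:.0%}")
```

All 16 billion parameters must be resident in memory even though only about 15% of them are used for any one token, which is why the memory requirement tracks the total count rather than the active count.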
Inference with Transformers:
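import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
model_name = "deepseek-ai/DeepSeek-V2-Lite"
# trust_remote_code=True lets Transformers run the custom model code shipped with the checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Load the weights in bfloat16 and move the model to the GPU
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
# Use the generation defaults published with the model, and pad with the EOS token
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id
# This is the base model, so the prompt is plain text to be continued
text = "An attention function can be described as mapping a query..."
inputs = tokenizer(text, return_tensors="pt")
# Move the input tensors to the model's device, then generate up to 100 new tokens
outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)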
Inference with vLLM (Recommended):
- Merge the pull request #4650 into your vLLM codebase.
- Use the following setup:
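from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
model_name = "deepseek-ai/DeepSeek-V2-Lite-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, trust_remote_code=True)
# Illustrative sampling settings (assumed values; tune them for your use case)
sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
# Each conversation is itself a list of messages, so a batch is a list of lists
messages_list = [[{"role": "user", "content": "Who are you?"}]]
# Tokenize each conversation with the chat template and append the assistant prompt
prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]
# The prompt_token_ids keyword matches the vLLM version targeted by the pull request above;
# newer vLLM releases may expect a different call form
outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)
generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)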
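When batching several prompts, note that a chat template is applied to one conversation at a time, and a conversation is a list of role/content dictionaries. A batch is therefore a list of such lists. The helper below is a hypothetical convenience function (`build_batch` is not part of Transformers or vLLM) that builds that structure from plain strings; it is a sketch for single-turn user prompts only.

```python
def build_batch(prompts):
    """Wrap each prompt string as a single-turn conversation.

    Returns a list of conversations, where each conversation is a list
    containing one {"role": "user", "content": ...} message.
    """
    conversations = []
    for prompt in prompts:
        if not isinstance(prompt, str) or not prompt.strip():
            raise ValueError("each prompt must be a non-empty string")
        conversations.append([{"role": "user", "content": prompt}])
    return conversations


messages_list = build_batch(["Who are you?", "Summarize what a KV cache is."])
for messages in messages_list:
    print(messages)
```

Each element of `messages_list` can then be passed on its own to `tokenizer.apply_chat_template`, as in the list comprehension in the vLLM example.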
Cloud GPUs: If no suitable local GPU is available, consider renting one from a cloud provider such as AWS or Google Cloud.
License
The code is licensed under the MIT License. Use of the model itself is governed by a separate Model License, which permits commercial use.