DeepSeek-V3 bf16
Introduction
DeepSeek-V3 is a Mixture-of-Experts (MoE) language model developed by DeepSeek AI. It has 671 billion total parameters, of which 37 billion are activated for each token. The model is designed for efficient inference and cost-effective training, adopting Multi-head Latent Attention (MLA) and an auxiliary-loss-free load-balancing strategy.
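As a rough illustration of the auxiliary-loss-free idea, the sketch below (a minimal, hypothetical PyTorch snippet, not DeepSeek's implementation) adds a per-expert bias to the routing scores only when selecting the top-k experts, while the gating weights themselves come from the unbiased scores; the bias is then nudged up for underloaded experts and down for overloaded ones.

```python
import torch

def topk_route(scores: torch.Tensor, bias: torch.Tensor, k: int = 8):
    """Bias-adjusted top-k routing (illustrative sketch only).

    scores: (num_tokens, num_experts) affinity scores, e.g. sigmoid of router logits
    bias:   (num_experts,) per-expert bias used only for expert *selection*
    """
    # Select experts with the biased scores...
    _, expert_ids = (scores + bias).topk(k, dim=-1)
    # ...but compute gating weights from the unbiased scores.
    gate = scores.gather(-1, expert_ids)
    gate = gate / gate.sum(dim=-1, keepdim=True)
    return expert_ids, gate

def update_bias(bias, expert_ids, num_experts, gamma=1e-3):
    """Nudge biases toward balance: overloaded experts down, underloaded up."""
    load = torch.bincount(expert_ids.flatten(), minlength=num_experts).float()
    return bias + gamma * torch.sign(load.mean() - load)

# Toy usage with random scores; num_experts, k and gamma are illustrative values.
num_tokens, num_experts = 16, 64
scores = torch.sigmoid(torch.randn(num_tokens, num_experts))
bias = torch.zeros(num_experts)
expert_ids, gate = topk_route(scores, bias)
bias = update_bias(bias, expert_ids, num_experts)
```

Because balance is steered through the selection bias rather than an extra loss term, the language-modeling gradient is left untouched, which is the motivation for the auxiliary-loss-free approach.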
Architecture
DeepSeek-V3 builds on the architecture of DeepSeek-V2, introducing an auxiliary-loss-free strategy that improves load balancing while minimizing the performance degradation such balancing usually causes. It also employs a Multi-Token Prediction (MTP) objective to improve model performance; the MTP modules can additionally be used for speculative decoding to accelerate inference.
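The snippet below is a simplified, single-depth sketch of the multi-token-prediction idea, written as illustrative PyTorch rather than DeepSeek's actual module: the trunk hidden state at position t is combined with the embedding of the next token, passed through one extra causal transformer layer, and used to predict the token two steps ahead. In DeepSeek-V3 the MTP modules are chained sequentially, share the embedding and output head with the main model, and use RMSNorm; all of that is omitted or approximated here.

```python
import torch
import torch.nn as nn

class MTPHead(nn.Module):
    """Simplified single-depth multi-token-prediction head (illustrative only)."""

    def __init__(self, d_model: int, vocab_size: int, n_heads: int = 8):
        super().__init__()
        self.norm_h = nn.LayerNorm(d_model)   # the paper uses RMSNorm
        self.norm_e = nn.LayerNorm(d_model)
        self.proj = nn.Linear(2 * d_model, d_model)
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)  # shared with the main output head in the paper

    def forward(self, h: torch.Tensor, next_tok_emb: torch.Tensor) -> torch.Tensor:
        # Combine trunk hidden states with embeddings of the ground-truth next tokens,
        # run one extra causal transformer layer, and predict the token two steps ahead.
        x = self.proj(torch.cat([self.norm_h(h), self.norm_e(next_tok_emb)], dim=-1))
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.head(self.block(x, src_mask=mask))

# Toy usage with random tensors (batch of 2, sequence length 16).
d_model, vocab = 512, 1000
mtp = MTPHead(d_model, vocab)
h = torch.randn(2, 16, d_model)          # trunk hidden states at positions t
next_emb = torch.randn(2, 16, d_model)   # embeddings of tokens t+1 (teacher forcing)
logits = mtp(h, next_emb)                # (2, 16, vocab): predictions for tokens t+2
```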
Training
DeepSeek-V3 employs an FP8 mixed-precision training framework, validated at very large scale. Training is highly efficient thanks to the co-design of algorithms, frameworks, and hardware, which enables nearly full computation-communication overlap. Pre-training on 14.8 trillion tokens takes only 2.664 million H800 GPU hours, and is followed by post-training that includes knowledge distillation from the DeepSeek-R1 series of models.
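To make the fine-grained FP8 idea concrete, here is a minimal, hypothetical sketch of block-wise quantization with one scale per 128x128 tile (the granularity the report describes for weights). The real training kernels are fused GPU kernels with many more details; this snippet only round-trips a weight through simulated FP8 and assumes PyTorch 2.1+ for the float8_e4m3fn dtype.

```python
import torch

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def quantize_blockwise_fp8(w: torch.Tensor, block: int = 128):
    """Quantize a 2-D weight into FP8 with one scale per (block x block) tile (sketch)."""
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0, "sketch assumes divisible shapes"
    # View the matrix as a grid of (block x block) tiles.
    tiles = w.reshape(rows // block, block, cols // block, block).permute(0, 2, 1, 3)
    amax = tiles.abs().amax(dim=(-1, -2), keepdim=True).clamp(min=1e-12)
    scales = amax / E4M3_MAX                       # one scale per tile
    q = (tiles / scales).to(torch.float8_e4m3fn)   # requires PyTorch >= 2.1
    return q, scales

def dequantize_blockwise_fp8(q: torch.Tensor, scales: torch.Tensor, block: int = 128):
    """Reconstruct the BF16 weight from FP8 tiles and per-tile scales."""
    tiles = q.to(torch.bfloat16) * scales.to(torch.bfloat16)
    rb, cb = tiles.shape[0], tiles.shape[1]
    return tiles.permute(0, 2, 1, 3).reshape(rb * block, cb * block)

# Toy round trip: the reconstruction error stays small.
w = torch.randn(256, 256, dtype=torch.bfloat16)
q, s = quantize_blockwise_fp8(w)
w_hat = dequantize_blockwise_fp8(q, s)
print((w - w_hat).abs().mean())
```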
Guide: Running Locally
- Preparation:
  - Clone the DeepSeek-V3 repository:
    git clone https://github.com/deepseek-ai/DeepSeek-V3.git
  - Navigate to the inference folder and install dependencies:
    cd DeepSeek-V3/inference
    pip install -r requirements.txt
- Model Weights:
  - Download the model weights from Hugging Face and place them in the /path/to/DeepSeek-V3 folder (a scripted download sketch follows this guide).
  - Convert FP8 weights to BF16 if necessary using:
    python fp8_cast_bf16.py --input-fp8-hf-path /path/to/fp8_weights --output-bf16-hf-path /path/to/bf16_weights
- Run Inference:
  - Use the DeepSeek-Infer Demo for chatting with the model:
    torchrun --nnodes 2 --nproc-per-node 8 generate.py --node-rank $RANK --master-addr $ADDR --ckpt-path /path/to/DeepSeek-V3-Demo --config configs/config_671B.json --interactive --temperature 0.7 --max-new-tokens 200
- Recommended Frameworks:
  - SGLang and LMDeploy are recommended for efficient inference on NVIDIA and AMD GPUs.
  - TensorRT-LLM and vLLM are also supported, in both FP8 and BF16 modes (see the vLLM sketch after this guide).
- Cloud GPUs:
  - Consider using cloud services such as Google Colab with GPUs like the Tesla T4 for running notebooks and performing model inference.
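For the download step in Model Weights above, one way to script it is with huggingface_hub's snapshot_download. The repo id and local path below are placeholders to adjust to the checkpoint you actually use, and note that the full checkpoint is on the order of a terabyte.

```python
from huggingface_hub import snapshot_download

# Placeholder repo id and destination folder; point these at the checkpoint you want.
snapshot_download(
    repo_id="deepseek-ai/DeepSeek-V3",
    local_dir="/path/to/DeepSeek-V3",
)
```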
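As a sketch of the framework route, the snippet below uses vLLM's offline Python API. The model path, tensor-parallel size, and hardware assumptions (a multi-GPU node large enough to hold the model) are placeholders, and DeepSeek-V3 support should be confirmed against the vLLM documentation for your version.

```python
from vllm import LLM, SamplingParams

# Placeholder path and parallelism; DeepSeek-V3 needs a large multi-GPU node.
llm = LLM(
    model="/path/to/DeepSeek-V3",   # or a Hugging Face repo id
    tensor_parallel_size=8,
    trust_remote_code=True,
)
params = SamplingParams(temperature=0.7, max_tokens=200)
outputs = llm.generate(["Explain Mixture-of-Experts in one paragraph."], params)
print(outputs[0].outputs[0].text)
```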
License
The DeepSeek-V3 code is licensed under the MIT License, and the use of the Base/Chat models is subject to the Model License. The models support commercial use.