DeepSeek-V3 BF16 (unsloth)

Introduction

DeepSeek-V3 is a Mixture-of-Experts (MoE) language model developed by DeepSeek AI, with 671 billion total parameters of which 37 billion are activated per token. The model is designed for efficient inference and cost-effective training, combining Multi-head Latent Attention (MLA) with an auxiliary-loss-free load-balancing strategy for its MoE layers.

Architecture

DeepSeek-V3 builds on the architecture of DeepSeek-V2, introducing an auxiliary-loss-free strategy that keeps experts load-balanced while minimizing the performance degradation that auxiliary balancing losses usually cause. It also employs a Multi-Token Prediction (MTP) objective to improve model quality; the MTP modules can additionally be reused for speculative decoding to accelerate inference.
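
To make the routing concrete, the sketch below illustrates one way bias-based, auxiliary-loss-free top-k routing can work: a per-expert bias shifts which experts are selected, the gate weights still come from the raw affinity scores, and the bias is nudged after each step according to observed expert load. This is an illustrative PyTorch sketch under those assumptions, not DeepSeek's actual implementation; the names, shapes, and update rule are placeholders.

  # Illustrative sketch of bias-based (auxiliary-loss-free) top-k expert routing.
  # Names, shapes, and the bias update rule are assumptions, not DeepSeek's code.
  import torch

  def route(hidden, centroids, bias, top_k=8, gamma=1e-3):
      # hidden: [tokens, dim], centroids: [experts, dim], bias: [experts]
      scores = torch.sigmoid(hidden @ centroids.t())        # token-to-expert affinity
      # The bias only influences which experts are chosen, not the gate values.
      _, chosen = torch.topk(scores + bias, top_k, dim=-1)  # [tokens, top_k]
      gates = torch.gather(scores, -1, chosen)
      gates = gates / gates.sum(dim=-1, keepdim=True)       # normalized gate weights
      # Load-balance feedback: raise the bias of under-loaded experts and lower it
      # for over-loaded ones, instead of adding an auxiliary loss term.
      load = torch.bincount(chosen.flatten(), minlength=bias.numel()).float()
      bias = bias - gamma * torch.sign(load - load.mean())
      return chosen, gates, bias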

Training

DeepSeek-V3 employs an FP8 mixed-precision training framework validated at very large scale. Training is highly efficient thanks to the co-design of algorithms, frameworks, and hardware, which enables near-complete computation-communication overlap. Pre-training on 14.8 trillion tokens cost an economical 2.664 million H800 GPU hours, followed by post-training that includes knowledge distillation from the DeepSeek-R1 series of models.
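
As a rough illustration of the fine-grained scaling used in FP8 training (and of what a conversion script such as fp8_cast_bf16.py has to undo when producing BF16 weights), the sketch below quantizes a weight matrix to float8_e4m3fn with one scale per 128x128 block and dequantizes it back to BF16. The block size, function names, and layout are assumptions for illustration; it needs PyTorch 2.1+ and matrix dimensions divisible by the block size.

  # Illustrative block-wise FP8 (e4m3) quantization with per-block scales.
  # Block size and layout are assumptions; requires PyTorch >= 2.1.
  import torch

  FP8_MAX = 448.0  # largest finite value representable in float8_e4m3fn

  def quantize_blockwise(w: torch.Tensor, block: int = 128):
      rows, cols = w.shape                                    # must be divisible by block
      w = w.float().reshape(rows // block, block, cols // block, block)
      amax = w.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
      scale = amax / FP8_MAX                                  # one scale per block
      q = (w / scale).to(torch.float8_e4m3fn)
      return q.reshape(rows, cols), scale.squeeze(3).squeeze(1)

  def dequantize_blockwise(q: torch.Tensor, scale: torch.Tensor, block: int = 128):
      rows, cols = q.shape
      q = q.float().reshape(rows // block, block, cols // block, block)
      w = q * scale.unsqueeze(1).unsqueeze(-1)                # undo the per-block scale
      return w.reshape(rows, cols).to(torch.bfloat16)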

Guide: Running Locally

  1. Preparation:

    • Clone the DeepSeek-V3 repository:
      git clone https://github.com/deepseek-ai/DeepSeek-V3.git
      
    • Navigate to the inference folder and install dependencies:
      cd DeepSeek-V3/inference
      pip install -r requirements.txt
      
  2. Model Weights:

    • Download the model weights from Hugging Face and place them in the /path/to/DeepSeek-V3 folder (a download sketch follows this list).
    • Convert FP8 weights to BF16 if necessary using:
      python fp8_cast_bf16.py --input-fp8-hf-path /path/to/fp8_weights --output-bf16-hf-path /path/to/bf16_weights
      
  3. Run Inference:

    • Use the DeepSeek-Infer demo to chat with the model:
      torchrun --nnodes 2 --nproc-per-node 8 --node-rank $RANK --master-addr $ADDR generate.py --ckpt-path /path/to/DeepSeek-V3-Demo --config configs/config_671B.json --interactive --temperature 0.7 --max-new-tokens 200
      
  4. Recommended Frameworks:

    • SGLang and LMDeploy are recommended for efficient inference on NVIDIA and AMD GPUs.
    • TensorRT-LLM and vLLM also support inference in both FP8 and BF16 modes; see the client sketch after this list for querying a served model.

  5. Cloud GPUs:

    • The full 671B-parameter model is far too large for a single small GPU such as a Colab Tesla T4; for cloud inference, provision a multi-GPU node (or multi-node cluster) with enough aggregate memory to hold the FP8 or BF16 weights.
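
For step 2, the snippet below is a minimal sketch of fetching the checkpoint with huggingface_hub before running the FP8-to-BF16 conversion. The repository id and local path are placeholders; point them at the weights you actually use (for example a pre-converted BF16 upload).

  # Minimal sketch: download the checkpoint referenced in step 2.
  # repo_id and local_dir are placeholders for the weights you actually use.
  from huggingface_hub import snapshot_download

  snapshot_download(
      repo_id="deepseek-ai/DeepSeek-V3",   # or a pre-converted BF16 upload
      local_dir="/path/to/DeepSeek-V3",    # folder expected by the conversion step
  )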
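
For step 4, SGLang and vLLM both expose an OpenAI-compatible HTTP API once the model is served, so a client can stay framework-agnostic. The endpoint URL, port, and model name below are placeholders to adjust for your deployment.

  # Minimal sketch: query a served DeepSeek-V3 endpoint (SGLang, vLLM, ...) through
  # its OpenAI-compatible API. Base URL, port, and model name are placeholders.
  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
  reply = client.chat.completions.create(
      model="deepseek-ai/DeepSeek-V3",
      messages=[{"role": "user", "content": "Who are you?"}],
      temperature=0.7,
      max_tokens=200,
  )
  print(reply.choices[0].message.content)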

License

The DeepSeek-V3 code is licensed under the MIT License, and the use of the Base/Chat models is subject to the Model License. The models support commercial use.
