DeepSeek-V3 GGUF
Introduction
DeepSeek-V3 is a Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. It adopts Multi-head Latent Attention (MLA) and the DeepSeekMoE architecture for efficient inference and cost-effective training. The model is pre-trained on 14.8 trillion tokens, followed by supervised fine-tuning and reinforcement learning, and achieves performance comparable to leading models while requiring only 2.788M H800 GPU hours for its full training.
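To make the sparse-activation idea concrete, the minimal sketch below routes each token to the top-k of a small pool of expert MLPs, so only a fraction of the layer's parameters runs for any given token. The class name, layer sizes, and top-k value are illustrative assumptions, not DeepSeek-V3's actual configuration (which uses far more experts per layer).

```python
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    """Toy top-k mixture-of-experts layer: only k of n_experts run per token."""

    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.SiLU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (n_tokens, d_model)
        scores = self.router(x).softmax(dim=-1)  # per-token expert affinities
        weights, idx = scores.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):               # evaluate only the selected experts
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[int(e)](x[mask])
        return out

tokens = torch.randn(4, 64)
print(TinyMoELayer()(tokens).shape)              # torch.Size([4, 64])
```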
Architecture
DeepSeek-V3 pioneers an auxiliary-loss-free load-balancing strategy, which minimizes the performance degradation that load balancing typically induces. It also uses a Multi-Token Prediction (MTP) training objective, which improves overall performance and can be reused for speculative decoding to accelerate inference.
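A rough sketch of the auxiliary-loss-free idea: each expert carries a bias that is added to its affinity score only when selecting the top-k experts, while the gating weights still come from the raw scores; the bias is then nudged down for overloaded experts and up for underloaded ones. The function name, update rule, and step size below are assumptions for illustration, not the model's exact formulation.

```python
import torch

def biased_topk_routing(scores, bias, k=2, step=0.001):
    """Sketch: a per-expert bias steers top-k selection only; gating weights
    still come from the raw affinity scores."""
    _, idx = (scores + bias).topk(k, dim=-1)           # biased expert selection
    weights = torch.gather(scores, -1, idx)            # unbiased gating weights
    weights = weights / weights.sum(dim=-1, keepdim=True)

    # Nudge the bias after each batch: down for overloaded experts, up for idle ones.
    load = torch.zeros_like(bias)
    load.scatter_add_(0, idx.reshape(-1), torch.ones(idx.numel()))
    target = idx.numel() / bias.numel()                 # ideal tokens per expert
    new_bias = bias - step * torch.sign(load - target)
    return idx, weights, new_bias

scores = torch.rand(16, 8)            # 16 tokens, 8 experts, affinities in [0, 1)
bias = torch.zeros(8)
idx, weights, bias = biased_topk_routing(scores, bias)
print(idx.shape, weights.shape)       # torch.Size([16, 2]) torch.Size([16, 2])
```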
Training
DeepSeek-V3 employs an FP8 mixed-precision training framework, validated at extremely large scale. The framework also overcomes the communication bottleneck in cross-node MoE training, yielding significant efficiency and cost reductions. Pre-training on 14.8T tokens requires 2.664M H800 GPU hours, and the subsequent training stages require only about 0.1M GPU hours more. Training is stable throughout, with no irrecoverable loss spikes.
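One ingredient of FP8 training that can be sketched compactly is block-wise scaling: quantizing values in small blocks, each with its own scale, limits the damage a single outlier can do to the rest of the tensor. The snippet below is a simplified illustration assuming a block size of 128 and PyTorch's torch.float8_e4m3fn dtype; it is not the training framework itself.

```python
import torch

def quantize_fp8_blockwise(x, block=128):
    """Quantize a flat tensor to FP8 in blocks of `block` values, each block
    carrying its own scale so one outlier cannot crush the whole tensor."""
    x = x.reshape(-1, block)
    scale = x.abs().amax(dim=-1, keepdim=True) / 448.0   # 448 = max finite e4m3 value
    scale = scale.clamp(min=1e-12)                        # guard against all-zero blocks
    q = (x / scale).to(torch.float8_e4m3fn)
    return q, scale

def dequantize_fp8_blockwise(q, scale):
    return q.to(torch.float32) * scale

w = torch.randn(1024, 1024)
q, s = quantize_fp8_blockwise(w.flatten())
w_hat = dequantize_fp8_blockwise(q, s).reshape_as(w)
print((w - w_hat).abs().max())        # reconstruction error stays small
```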
Guide: Running Locally
- Clone Repository: clone the DeepSeek-V3 GitHub repository:
  git clone https://github.com/deepseek-ai/DeepSeek-V3.git
- Install Dependencies: navigate to the inference folder and install the required packages:
  cd DeepSeek-V3/inference
  pip install -r requirements.txt
- Download Model Weights: obtain the model weights from Hugging Face and place them in the specified directory (see the download sketch after this list).
- Convert Model Weights: use the conversion script to convert the Hugging Face checkpoint into the expected format:
  python convert.py --hf-ckpt-path /path/to/DeepSeek-V3 --save-path /path/to/DeepSeek-V3-Demo --n-experts 256 --model-parallel 16
- Run Inference: execute the inference script for interactive or batch processing:
  torchrun --nnodes 2 --nproc-per-node 8 generate.py --node-rank $RANK --master-addr $ADDR --ckpt-path /path/to/DeepSeek-V3-Demo --config configs/config_671B.json --interactive --temperature 0.7 --max-new-tokens 200
- Cloud GPUs: consider cloud GPU services such as AWS or Google Cloud if suitable local hardware is not available.
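For the weight-download step above, one option is the huggingface_hub Python client; the repository id and target directory below are placeholders for whichever DeepSeek-V3 checkpoint you intend to use.

```python
from huggingface_hub import snapshot_download

# Placeholder repo id and target directory; substitute the checkpoint you actually need.
snapshot_download(
    repo_id="deepseek-ai/DeepSeek-V3",
    local_dir="/path/to/DeepSeek-V3",
)
```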
License
The code repository is licensed under the MIT License. Use of the DeepSeek-V3 Base and Chat models is governed by the Model License, which supports commercial use.