DeepSeek-V3


Introduction

DeepSeek-V3 is a Mixture-of-Experts (MoE) language model with 671 billion total parameters, of which 37 billion are activated per token, designed for efficient inference and cost-effective training. It adopts the Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, both validated in its predecessor, DeepSeek-V2. The model outperforms many open-source models and is competitive with leading closed-source models. It was pre-trained on 14.8 trillion tokens, and its full training required 2.788 million H800 GPU hours.
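The core idea of an MoE layer is that each token is processed by only a few of the model's experts, so compute per token stays far below the total parameter count. Below is a minimal numpy sketch of generic top-k routing; it is illustrative only and not DeepSeek-V3's actual router (which uses sigmoid affinities, shared experts, and the load-balancing scheme described in the Architecture section).

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route one token through its top-k experts and mix their outputs.

    Generic top-k MoE routing sketch; DeepSeek-V3's real router differs
    (sigmoid affinities, shared experts, bias-based load balancing).
    """
    scores = gate_w @ x                        # token-to-expert affinity
    topk = np.argsort(scores)[-k:]             # k highest-affinity experts
    w = np.exp(scores[topk])
    w /= w.sum()                               # normalized gating weights
    return sum(wi * experts[i](x) for wi, i in zip(w, topk))

# Toy usage: 4 linear "experts" on a 3-dim hidden state.
rng = np.random.default_rng(0)
experts = [lambda x, W=rng.normal(size=(3, 3)): W @ x for _ in range(4)]
gate_w = rng.normal(size=(4, 3))
y = moe_forward(rng.normal(size=3), gate_w, experts)
```

Only `k` of the 4 experts run for this token; scaling the expert count grows total parameters without growing per-token compute.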

Architecture

DeepSeek-V3 introduces an auxiliary-loss-free strategy for load balancing, avoiding the performance degradation that auxiliary balancing losses can cause. It also employs a Multi-Token Prediction (MTP) training objective, which improves benchmark performance and can be repurposed for speculative decoding to accelerate inference. The model uses FP8 mixed precision training, validated at large scale, to improve training efficiency and reduce cost.
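The auxiliary-loss-free strategy balances expert load with a per-expert bias rather than an extra loss term: the bias influences which experts are selected, but the gating weights applied to expert outputs come from the unbiased scores. A minimal numpy sketch of that idea, with the bias update simplified (the step size `gamma` and the exact bookkeeping of expert load are assumptions of this sketch):

```python
import numpy as np

def select_experts(scores, bias, k=2):
    """Top-k selection with a per-expert bias (auxiliary-loss-free sketch).

    The bias steers WHICH experts are chosen, but the gating weights that
    scale the expert outputs are computed from the unbiased scores.
    """
    chosen = np.argsort(scores + bias)[-k:]
    w = np.exp(scores[chosen])
    return chosen, w / w.sum()

def update_bias(bias, load, gamma=0.001):
    # After each step, nudge the bias up for underloaded experts and down
    # for overloaded ones, steering future routing without an extra loss.
    return bias - gamma * np.sign(load - load.mean())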

Training

DeepSeek-V3's pre-training uses FP8 mixed precision and achieves near-full computation-communication overlap in cross-node MoE training. Pre-training on 14.8 trillion tokens took 2.664 million H800 GPU hours. Post-training distills reasoning capabilities from DeepSeek-R1 into DeepSeek-V3, enhancing its reasoning while maintaining control over output style and length.
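FP8 training stores activations and weights in an 8-bit floating format (E4M3: 4 exponent bits, 3 mantissa bits), using a scaling factor to fit each tensor into the format's narrow dynamic range. The sketch below simulates a per-tensor scaled E4M3 cast in numpy; it is a conceptual illustration only, not DeepSeek-V3's kernels, which use fine-grained tile/block-wise scaling and high-precision accumulation.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in E4M3

def to_fp8_e4m3(x):
    """Simulate an FP8 E4M3 cast: per-tensor scaling plus 3-bit mantissa
    rounding (subnormals and NaN handling omitted for clarity)."""
    scale = np.abs(x).max() / FP8_E4M3_MAX        # map tensor into FP8 range
    v = x / scale
    a = np.abs(v)
    e = np.floor(np.log2(np.maximum(a, 1e-30)))   # exponent of each value
    step = 2.0 ** e / 8                           # value spacing with 3 mantissa bits
    q = np.sign(v) * np.round(a / step) * step    # round to nearest representable
    return q, scale

x = np.random.default_rng(0).normal(size=4096)
q, scale = to_fp8_e4m3(x)
rel_err = np.max(np.abs(q * scale - x) / np.abs(x))
```

With 3 mantissa bits the worst-case relative rounding error is about 1/16 (~6%), which is why careful scaling and higher-precision accumulation matter for keeping FP8 training stable.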

Guide: Running Locally

DeepSeek-V3 can be run locally using various hardware and software:

  1. DeepSeek-Infer Demo: Lightweight setup for FP8 and BF16 inference.
  2. SGLang: Supports FP8 and BF16 inference on NVIDIA and AMD GPUs.
  3. LMDeploy: Provides offline and online deployment options.
  4. TensorRT-LLM: Supports BF16 and INT4/8 quantizations; FP8 support is coming.
  5. vLLM: Offers pipeline parallelism for distributed runs.
  6. AMD GPU and Huawei Ascend NPU: Provide support for FP8 and BF16 modes.

To run locally, clone the DeepSeek-V3 repository, install dependencies, and follow the conversion and inference steps provided in the respective framework documentation.

Suggested Cloud GPUs

  • Multi-GPU NVIDIA clusters (e.g., H100/H800) for TensorRT-LLM and vLLM
  • AMD Instinct GPUs (e.g., MI300X) for SGLang
  • Huawei Ascend NPUs for MindIE

License

DeepSeek-V3's code is released under the MIT License. The Base and Chat model weights support commercial use under the accompanying Model License.
