DeepSeek-V3 BF16 (unsloth)

Introduction

DeepSeek-V3 is a Mixture-of-Experts (MoE) language model developed by DeepSeek AI, with 671 billion total parameters of which 37 billion are activated per token. The model is designed for efficient inference and cost-effective training, combining Multi-head Latent Attention (MLA) with an auxiliary-loss-free load-balancing strategy for its MoE layers.

Architecture

DeepSeek-V3 builds on the architecture of DeepSeek-V2, introducing an auxiliary-loss-free strategy that keeps experts load-balanced while minimizing the performance degradation that auxiliary balancing losses usually cause. It also employs a Multi-Token Prediction (MTP) objective to improve model quality; the MTP modules can additionally be reused for speculative decoding to accelerate inference.
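
To make the routing concrete, the sketch below illustrates one way bias-based, auxiliary-loss-free top-k routing can work: a per-expert bias shifts which experts are selected, the gate weights still come from the raw affinity scores, and the bias is nudged after each step according to observed expert load. This is an illustrative PyTorch sketch under those assumptions, not DeepSeek's actual implementation; the names, shapes, and update rule are placeholders.

  # Illustrative sketch of bias-based (auxiliary-loss-free) top-k expert routing.
  # Names, shapes, and the bias update rule are assumptions, not DeepSeek's code.
  import torch

  def route(hidden, centroids, bias, top_k=8, gamma=1e-3):
      # hidden: [tokens, dim], centroids: [experts, dim], bias: [experts]
      scores = torch.sigmoid(hidden @ centroids.t())        # token-to-expert affinity
      # The bias only influences which experts are chosen, not the gate values.
      _, chosen = torch.topk(scores + bias, top_k, dim=-1)  # [tokens, top_k]
      gates = torch.gather(scores, -1, chosen)
      gates = gates / gates.sum(dim=-1, keepdim=True)       # normalized gate weights
      # Load-balance feedback: raise the bias of under-loaded experts and lower it
      # for over-loaded ones, instead of adding an auxiliary loss term.
      load = torch.bincount(chosen.flatten(), minlength=bias.numel()).float()
      bias = bias - gamma * torch.sign(load - load.mean())
      return chosen, gates, bias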

Training

DeepSeek-V3 employs an FP8 mixed-precision training framework validated at very large scale. Training is highly efficient thanks to the co-design of algorithms, frameworks, and hardware, which enables near-complete computation-communication overlap. Pre-training on 14.8 trillion tokens cost an economical 2.664 million H800 GPU hours, followed by post-training that includes knowledge distillation from the DeepSeek-R1 series of models.
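
As a rough illustration of the fine-grained scaling used in FP8 training (and of what a conversion script such as fp8_cast_bf16.py has to undo when producing BF16 weights), the sketch below quantizes a weight matrix to float8_e4m3fn with one scale per 128x128 block and dequantizes it back to BF16. The block size, function names, and layout are assumptions for illustration; it needs PyTorch 2.1+ and matrix dimensions divisible by the block size.

  # Illustrative block-wise FP8 (e4m3) quantization with per-block scales.
  # Block size and layout are assumptions; requires PyTorch >= 2.1.
  import torch

  FP8_MAX = 448.0  # largest finite value representable in float8_e4m3fn

  def quantize_blockwise(w: torch.Tensor, block: int = 128):
      rows, cols = w.shape                                    # must be divisible by block
      w = w.float().reshape(rows // block, block, cols // block, block)
      amax = w.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
      scale = amax / FP8_MAX                                  # one scale per block
      q = (w / scale).to(torch.float8_e4m3fn)
      return q.reshape(rows, cols), scale.squeeze(3).squeeze(1)

  def dequantize_blockwise(q: torch.Tensor, scale: torch.Tensor, block: int = 128):
      rows, cols = q.shape
      q = q.float().reshape(rows // block, block, cols // block, block)
      w = q * scale.unsqueeze(1).unsqueeze(-1)                # undo the per-block scale
      return w.reshape(rows, cols).to(torch.bfloat16)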

Guide: Running Locally

  1. Preparation:

    • Clone the DeepSeek-V3 repository:
      git clone https://github.com/deepseek-ai/DeepSeek-V3.git
      
    • Navigate to the inference folder and install dependencies:
      cd DeepSeek-V3/inference
      pip install -r requirements.txt
      
  2. Model Weights:

    • Download the model weights from Hugging Face and place them in the /path/to/DeepSeek-V3 folder (a download sketch follows this list).
    • Convert FP8 weights to BF16 if necessary using:
      python fp8_cast_bf16.py --input-fp8-hf-path /path/to/fp8_weights --output-bf16-hf-path /path/to/bf16_weights
      
  3. Run Inference:

    • Use the DeepSeek-Infer demo to chat with the model:
      torchrun --nnodes 2 --nproc-per-node 8 --node-rank $RANK --master-addr $ADDR generate.py --ckpt-path /path/to/DeepSeek-V3-Demo --config configs/config_671B.json --interactive --temperature 0.7 --max-new-tokens 200
      
  4. Recommended Frameworks:

    • SGLang and LMDeploy are recommended for efficient inference on NVIDIA and AMD GPUs.
    • TensorRT-LLM and vLLM also support inference in both FP8 and BF16 modes; see the client sketch after this list for querying a served model.

  5. Cloud GPUs:

    • The full 671B-parameter model is far too large for a single small GPU such as a Colab Tesla T4; for cloud inference, provision a multi-GPU node (or multi-node cluster) with enough aggregate memory to hold the FP8 or BF16 weights.
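
For step 2, the snippet below is a minimal sketch of fetching the checkpoint with huggingface_hub before running the FP8-to-BF16 conversion. The repository id and local path are placeholders; point them at the weights you actually use (for example a pre-converted BF16 upload).

  # Minimal sketch: download the checkpoint referenced in step 2.
  # repo_id and local_dir are placeholders for the weights you actually use.
  from huggingface_hub import snapshot_download

  snapshot_download(
      repo_id="deepseek-ai/DeepSeek-V3",   # or a pre-converted BF16 upload
      local_dir="/path/to/DeepSeek-V3",    # folder expected by the conversion step
  )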
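
For step 4, SGLang and vLLM both expose an OpenAI-compatible HTTP API once the model is served, so a client can stay framework-agnostic. The endpoint URL, port, and model name below are placeholders to adjust for your deployment.

  # Minimal sketch: query a served DeepSeek-V3 endpoint (SGLang, vLLM, ...) through
  # its OpenAI-compatible API. Base URL, port, and model name are placeholders.
  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
  reply = client.chat.completions.create(
      model="deepseek-ai/DeepSeek-V3",
      messages=[{"role": "user", "content": "Who are you?"}],
      temperature=0.7,
      max_tokens=200,
  )
  print(reply.choices[0].message.content)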

License

The DeepSeek-V3 code is licensed under the MIT License, and the use of the Base/Chat models is subject to the Model License. The models support commercial use.
