DeepSeek-V3 GGUF

unsloth

Introduction

DeepSeek-V3 is a Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated per token. It employs Multi-head Latent Attention (MLA) and the DeepSeekMoE architecture for efficient inference and cost-effective training. The model is pre-trained on 14.8 trillion tokens, followed by supervised fine-tuning and reinforcement learning stages, and achieves performance comparable to leading models while requiring only 2.788M H800 GPU hours for its full training.
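
To make the MoE idea concrete, here is a toy routing sketch in PyTorch (not DeepSeek-V3's actual implementation; the expert modules, expert count, and top-k value are all illustrative). Each token is sent only to its highest-scoring experts, which is why just 37B of the 671B parameters are active per token:

    import torch

    def moe_forward(x, experts, router, top_k=2):
        # x: (tokens, d_model); experts: list of small MLPs;
        # router: torch.nn.Linear(d_model, num_experts).
        scores = router(x).softmax(dim=-1)          # (tokens, num_experts)
        weights, idx = scores.topk(top_k, dim=-1)   # top_k experts per token
        out = torch.zeros_like(x)
        for k in range(top_k):
            for e, expert in enumerate(experts):
                mask = idx[:, k] == e               # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

    # Example: 4 toy experts, each token routed to 2 of them.
    d, n_exp = 16, 4
    experts = [torch.nn.Sequential(torch.nn.Linear(d, d), torch.nn.ReLU())
               for _ in range(n_exp)]
    router = torch.nn.Linear(d, n_exp)
    y = moe_forward(torch.randn(10, d), experts, router)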

Architecture

DeepSeek-V3 pioneers an auxiliary-loss-free load-balancing strategy, avoiding the performance degradation that balancing losses typically cause. The model also adopts a Multi-Token Prediction (MTP) training objective, which improves benchmark performance and can be repurposed for speculative decoding to accelerate inference.
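
A minimal sketch of the auxiliary-loss-free balancing idea, assuming the bias-adjusted top-k selection described in the DeepSeek-V3 technical report (function names, shapes, and the update step size here are illustrative): a per-expert bias steers which experts are selected, while the gating weights still come from the raw scores, so no auxiliary loss term is needed.

    import torch

    def biased_topk_routing(scores, bias, top_k=8):
        # scores: (tokens, num_experts) affinities; bias: (num_experts,).
        # The bias affects only *which* experts are chosen, not their weights.
        _, idx = (scores + bias).topk(top_k, dim=-1)   # biased selection
        weights = scores.gather(-1, idx)               # unbiased gating values
        return weights, idx

    def update_bias(bias, idx, num_experts, gamma=1e-3):
        # After each step, lower the bias of overloaded experts and raise
        # the bias of underloaded ones so future tokens spread out.
        load = torch.bincount(idx.flatten(), minlength=num_experts).float()
        return bias - gamma * torch.sign(load - load.mean())

The sign-based update rule above is only one simple way to realize the bias adjustment; the point is that balancing happens through selection, not through an extra loss.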

Training

DeepSeek-V3 employs an FP8 mixed-precision training framework, validating the feasibility of FP8 training at an extremely large scale. Through co-design of algorithms, frameworks, and hardware, the training system overcomes the communication bottleneck of cross-node MoE training, achieving significant efficiency gains and cost reductions. Pre-training on 14.8T tokens takes 2.664M H800 GPU hours; context extension and post-training add roughly 0.12M more, bringing the total to 2.788M. The training process is stable throughout, with no irrecoverable loss spikes and no rollbacks.
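
To give a feel for FP8 quantization (a simulation only, not DeepSeek-V3's training framework; real FP8 training keeps scales alongside the tensors and uses finer-grained, tile-wise scaling), the round trip below shows the precision involved. It assumes PyTorch 2.1+ for the float8_e4m3fn dtype:

    import torch

    E4M3_MAX = 448.0  # largest magnitude representable in float8_e4m3fn

    def fp8_round_trip(t: torch.Tensor) -> torch.Tensor:
        # Scale the tensor into the FP8 range, cast down, then cast back.
        scale = E4M3_MAX / t.abs().max().clamp(min=1e-12)  # per-tensor scale
        q = (t * scale).to(torch.float8_e4m3fn)            # quantize
        return q.to(t.dtype) / scale                       # dequantize

    x = torch.randn(4, 4)
    print((x - fp8_round_trip(x)).abs().max())  # worst-case quantization error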

Guide: Running Locally

  1. Clone Repository:
    Clone the DeepSeek-V3 GitHub repository:

    git clone https://github.com/deepseek-ai/DeepSeek-V3.git
    
  2. Install Dependencies:
    Navigate to the inference folder and install the necessary packages:

    cd DeepSeek-V3/inference
    pip install -r requirements.txt
    
  3. Download Model Weights:
    Download the model weights from Hugging Face and place them in the directory the conversion step expects (e.g. /path/to/DeepSeek-V3); a download sketch follows.
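
    A minimal sketch using the huggingface_hub Python API (the repo ID deepseek-ai/DeepSeek-V3 and the local directory are assumptions; substitute the repository you actually want, e.g. a GGUF variant):

    # pip install huggingface_hub
    from huggingface_hub import snapshot_download

    # Fetch every weight shard into the directory convert.py will read.
    snapshot_download(
        repo_id="deepseek-ai/DeepSeek-V3",
        local_dir="/path/to/DeepSeek-V3",
    )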

  4. Convert Model Weights:
    Use the conversion script to format model weights:

    python convert.py --hf-ckpt-path /path/to/DeepSeek-V3 --save-path /path/to/DeepSeek-V3-Demo --n-experts 256 --model-parallel 16
    
  5. Run Inference:
    Launch the inference script with torchrun. The command below starts an interactive chat session across two nodes with eight GPUs each; batch processing is also supported by supplying an input file of prompts in place of --interactive:

    torchrun --nnodes 2 --nproc-per-node 8 generate.py --node-rank $RANK --master-addr $ADDR --ckpt-path /path/to/DeepSeek-V3-Demo --config configs/config_671B.json --interactive --temperature 0.7 --max-new-tokens 200
    
  6. Cloud GPUs:
    Running the full 671B model requires many high-memory GPUs (the example above uses 16 across two nodes), so consider renting cloud GPUs from providers such as AWS or Google Cloud if local hardware is insufficient.

License

The DeepSeek-V3 code repository is licensed under the MIT License. Use of the Base and Chat models is governed by the accompanying Model License, which permits commercial use.
