DeepSeek V3 INT4 (TensorRT-LLM)

inarikami

Introduction

DeepSeek V3 INT4 (TensorRT-LLM) is an optimized version of the DeepSeek V3 model, designed for efficient, high-speed inference with TensorRT-LLM. INT4 weight-only quantization substantially reduces the memory footprint of the checkpoint while aiming to preserve the quality of the original model.

Architecture

  • Base Model: DeepSeek V3, converted to BF16 from the original FP8 weights.
  • Quantization: Utilizes a weight-only INT4 (W4A16) approach for reduced memory footprint and accelerated computation.
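
For intuition, W4A16 keeps activations in 16-bit and stores only the weights as 4-bit integers with a higher-precision scale per group, dequantizing them on the fly inside the matrix-multiply kernels. The NumPy sketch below illustrates group-wise symmetric INT4 weight quantization only; the function names and the group size of 128 are assumptions for illustration, and it does not reproduce TensorRT-LLM's actual weight packing or kernels.

import numpy as np

GROUP_SIZE = 128  # assumed group size, for illustration only

def quantize_w4a16(weights, group_size=GROUP_SIZE):
    # Split each row into groups and pick one scale per group so the largest
    # magnitude maps to 7 (the INT4 range is [-8, 7]).
    out_features, in_features = weights.shape
    grouped = weights.reshape(out_features, in_features // group_size, group_size)
    scales = np.abs(grouped).max(axis=-1, keepdims=True) / 7.0
    q = np.clip(np.round(grouped / scales), -8, 7).astype(np.int8)
    # Real INT4 storage packs two values per byte; int8 is used here only
    # because NumPy has no 4-bit dtype.
    return q.reshape(out_features, in_features), scales.astype(np.float16)

def dequantize_w4a16(q, scales, group_size=GROUP_SIZE):
    # Recover 16-bit weights by multiplying each group by its scale.
    out_features, in_features = q.shape
    grouped = q.reshape(out_features, in_features // group_size, group_size).astype(np.float16)
    return (grouped * scales).reshape(out_features, in_features)

w = np.random.randn(512, 512).astype(np.float16)
q, s = quantize_w4a16(w)
print("mean abs reconstruction error:", float(np.abs(dequantize_w4a16(q, s) - w).mean()))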

Quantization

The model is quantized to INT4 using a conversion script that transforms the original DeepSeek V3 (BF16) model. The script requires specifying the model directory, output directory, data type, tensor parallel size, and weight precision.

python convert_checkpoint.py \
  --model_dir /home/user/hf/deepseek-v3-bf16 \
  --output_dir /home/user/hf/deepseek-v3-int4 \
  --dtype bfloat16 \
  --tp_size 4 \
  --use_weight_only \
  --weight_only_precision int4 \
  --workers 4

Guide: Running Locally

  1. Hardware Requirements:

    • Optimal performance is achieved with four 80 GB H100 GPUs or four H200 GPUs, matching the tensor-parallel size of 4 used in the conversion and build commands.
  2. Example Usage:

    • Use the trtllm-build command to generate inference engines from the quantized checkpoint (an example of running the built engine follows after this list):
    trtllm-build --checkpoint_dir /DeepSeek-V3-int4-TensorRT \
      --output_dir ./trtllm_engines/deepseek_v3/int4/tp4-sel4096-isl2048-bs4
    
  3. Cloud GPU Recommendation:

    • Consider using cloud services like AWS or Google Cloud Platform to access suitable GPUs for model inference.
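
Once the engine is built, inference can be run with TensorRT-LLM's example runner. The command below is a sketch: it assumes the examples/run.py script shipped with the TensorRT-LLM repository, an mpirun launch across 4 ranks to match the tensor-parallel size of 4, and a tokenizer taken from the original BF16 model directory; exact flags and paths may differ between TensorRT-LLM releases.

mpirun -n 4 --allow-run-as-root python3 examples/run.py \
  --engine_dir ./trtllm_engines/deepseek_v3/int4/tp4-sel4096-isl2048-bs4 \
  --tokenizer_dir /home/user/hf/deepseek-v3-bf16 \
  --input_text "Explain INT4 weight-only quantization in one sentence." \
  --max_output_len 128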

License

This model is provided as a quantized checkpoint for research and experimentation in high-performance inference contexts. Users are advised to validate outputs for any production use cases and to acknowledge the inherent risks.