DeepSeek V3 INT4 (TensorRT-LLM)
Introduction
DeepSeek V3 INT4 (TensorRT-LLM) is an optimized build of the DeepSeek V3 model for high-speed inference using INT4 quantization with TensorRT-LLM. The quantized weights reduce the memory footprint while maintaining performance.
Architecture
- Base Model: DeepSeek V3 (BF16), adapted from the NVIDIA FP8 checkpoint.
- Quantization: Utilizes a weight-only INT4 (W4A16) approach for reduced memory footprint and accelerated computation.
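To make the W4A16 idea concrete, the sketch below quantizes a weight matrix to 4-bit codes with per-group scales and dequantizes it back to 16-bit precision for the matmul; only the stored weights shrink, while activations and arithmetic stay in 16-bit. This is a conceptual NumPy illustration with an assumed group size of 128 and symmetric rounding, not TensorRT-LLM's actual packing format or kernels.
import numpy as np

def quantize_w4a16(w, group_size=128):
    # Per-group symmetric INT4 quantization of a weight matrix (out_features, in_features).
    # Codes are kept as int8 values in [-8, 7]; real kernels pack two codes per byte.
    out_f, in_f = w.shape
    groups = w.reshape(out_f, in_f // group_size, group_size)
    scales = np.abs(groups).max(axis=-1, keepdims=True) / 7.0  # one scale per group
    codes = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return codes, scales

def dequantize(codes, scales, shape):
    # Expand the INT4 codes back to floating point for the 16-bit matmul.
    return (codes.astype(np.float32) * scales).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 256)).astype(np.float32)  # stand-in for a BF16 weight matrix
x = rng.standard_normal((4, 256)).astype(np.float32)   # activations stay 16-bit ("A16")
codes, scales = quantize_w4a16(w)
w_hat = dequantize(codes, scales, w.shape)
print("max abs matmul error after INT4 round-trip:", np.abs(x @ w.T - x @ w_hat.T).max())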
Training
The model is quantized to INT4 using a conversion script that transforms the original DeepSeek V3 (BF16) model. The script requires specifying the model directory, output directory, data type, tensor parallel size, and weight precision.
python convert_checkpoint.py \
--model_dir /home/user/hf/deepseek-v3-bf16 \
--output_dir /home/user/hf/deepseek-v3-int4 \
--dtype bfloat16 \
--tp_size 4 \
--use_weight_only \
--weight_only_precision int4 \
--workers 4
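The conversion assumes the BF16 base weights are already available at the --model_dir path. If they still need to be downloaded, a minimal sketch using huggingface_hub is shown below; the repository ID is a placeholder and should be replaced with the actual BF16 source repo.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<org>/DeepSeek-V3-bf16",            # placeholder: the BF16 source repository
    local_dir="/home/user/hf/deepseek-v3-bf16",  # matches --model_dir above
)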
Guide: Running Locally
- Hardware Requirements: Optimal performance is achieved with 4×80 GB H100 or H200 GPUs.
- Example Usage: Use the trtllm-build command to generate the inference engines (a minimal inference sketch follows this list):
trtllm-build --checkpoint_dir /DeepSeek-V3-int4-TensorRT \
  --output_dir ./trtllm_engines/deepseek_v3/int4/tp4-sel4096-isl2048-bs4
- Cloud GPU Recommendation: Consider using cloud services like AWS or Google Cloud Platform to access suitable GPUs for model inference.
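Once the engines are built, they can be driven from Python with the TensorRT-LLM runtime. The sketch below follows the pattern of TensorRT-LLM's example runner (ModelRunner); argument names can differ between TensorRT-LLM releases, the tokenizer path and prompt are placeholders, and with tp_size 4 the script must be launched with one MPI rank per GPU, so treat it as a starting point rather than a drop-in script.
# Minimal sketch: drive the built engine with TensorRT-LLM's Python runtime.
# Launch with one MPI rank per GPU for tp_size 4, e.g. mpirun -n 4 python infer.py
import torch
from transformers import AutoTokenizer
import tensorrt_llm
from tensorrt_llm.runtime import ModelRunner

engine_dir = "./trtllm_engines/deepseek_v3/int4/tp4-sel4096-isl2048-bs4"
tokenizer_dir = "/home/user/hf/deepseek-v3-bf16"  # placeholder: any directory holding the DeepSeek V3 tokenizer

tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir, trust_remote_code=True)
runner = ModelRunner.from_dir(engine_dir=engine_dir, rank=tensorrt_llm.mpi_rank())

input_ids = torch.tensor(tokenizer.encode("Hello, my name is"), dtype=torch.int32)
output_ids = runner.generate(
    batch_input_ids=[input_ids],
    max_new_tokens=64,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
)
if tensorrt_llm.mpi_rank() == 0:  # only rank 0 prints the final output
    print(tokenizer.decode(output_ids[0][0], skip_special_tokens=True))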
License
This model is provided as a quantized checkpoint for research and experimentation in high-performance inference contexts. Users are advised to validate outputs before any production use and to acknowledge the inherent risks.