DeepSeek V3 INT4 (TensorRT-LLM)
Introduction
DeepSeek V3 INT4 (TensorRT-LLM) is an optimized build of the DeepSeek V3 model for high-speed inference using INT4 quantization with TensorRT-LLM. The quantized weights reduce the memory footprint while maintaining performance.
Architecture
- Base Model: DeepSeek V3 (BF16), adapted from the NVIDIA FP8 checkpoint.
- Quantization: Utilizes a weight-only INT4 (W4A16) approach for reduced memory footprint and accelerated computation.
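To make the W4A16 idea concrete, the sketch below quantizes a weight matrix to 4-bit codes with per-group scales and dequantizes it back to 16-bit precision for the matmul; only the stored weights shrink, while activations and arithmetic stay in 16-bit. This is a conceptual NumPy illustration with an assumed group size of 128 and symmetric rounding, not TensorRT-LLM's actual packing format or kernels.
import numpy as np

def quantize_w4a16(w, group_size=128):
    # Per-group symmetric INT4 quantization of a weight matrix (out_features, in_features).
    # Codes are kept as int8 values in [-8, 7]; real kernels pack two codes per byte.
    out_f, in_f = w.shape
    groups = w.reshape(out_f, in_f // group_size, group_size)
    scales = np.abs(groups).max(axis=-1, keepdims=True) / 7.0  # one scale per group
    codes = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return codes, scales

def dequantize(codes, scales, shape):
    # Expand the INT4 codes back to floating point for the 16-bit matmul.
    return (codes.astype(np.float32) * scales).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 256)).astype(np.float32)  # stand-in for a BF16 weight matrix
x = rng.standard_normal((4, 256)).astype(np.float32)   # activations stay 16-bit ("A16")
codes, scales = quantize_w4a16(w)
w_hat = dequantize(codes, scales, w.shape)
print("max abs matmul error after INT4 round-trip:", np.abs(x @ w.T - x @ w_hat.T).max())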
Training
The model is quantized to INT4 using a conversion script that transforms the original DeepSeek V3 (BF16) model. The script requires specifying the model directory, output directory, data type, tensor parallel size, and weight precision.
python convert_checkpoint.py \
--model_dir /home/user/hf/deepseek-v3-bf16 \
--output_dir /home/user/hf/deepseek-v3-int4 \
--dtype bfloat16 \
--tp_size 4 \
--use_weight_only \
--weight_only_precision int4 \
--workers 4
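The conversion assumes the BF16 base weights are already available at the --model_dir path. If they still need to be downloaded, a minimal sketch using huggingface_hub is shown below; the repository ID is a placeholder and should be replaced with the actual BF16 source repo.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<org>/DeepSeek-V3-bf16",            # placeholder: the BF16 source repository
    local_dir="/home/user/hf/deepseek-v3-bf16",  # matches --model_dir above
)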
Guide: Running Locally
- Hardware Requirements: Optimal performance is achieved with 4×80 GB H100 or H200 GPUs.
- Example Usage: Use the trtllm-build command to generate the inference engines (a minimal inference sketch follows this list):
trtllm-build --checkpoint_dir /DeepSeek-V3-int4-TensorRT \
  --output_dir ./trtllm_engines/deepseek_v3/int4/tp4-sel4096-isl2048-bs4
- Cloud GPU Recommendation: Consider using cloud services like AWS or Google Cloud Platform to access suitable GPUs for model inference.
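Once the engines are built, they can be driven from Python with the TensorRT-LLM runtime. The sketch below follows the pattern of TensorRT-LLM's example runner (ModelRunner); argument names can differ between TensorRT-LLM releases, the tokenizer path and prompt are placeholders, and with tp_size 4 the script must be launched with one MPI rank per GPU, so treat it as a starting point rather than a drop-in script.
# Minimal sketch: drive the built engine with TensorRT-LLM's Python runtime.
# Launch with one MPI rank per GPU for tp_size 4, e.g. mpirun -n 4 python infer.py
import torch
from transformers import AutoTokenizer
import tensorrt_llm
from tensorrt_llm.runtime import ModelRunner

engine_dir = "./trtllm_engines/deepseek_v3/int4/tp4-sel4096-isl2048-bs4"
tokenizer_dir = "/home/user/hf/deepseek-v3-bf16"  # placeholder: any directory holding the DeepSeek V3 tokenizer

tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir, trust_remote_code=True)
runner = ModelRunner.from_dir(engine_dir=engine_dir, rank=tensorrt_llm.mpi_rank())

input_ids = torch.tensor(tokenizer.encode("Hello, my name is"), dtype=torch.int32)
output_ids = runner.generate(
    batch_input_ids=[input_ids],
    max_new_tokens=64,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
)
if tensorrt_llm.mpi_rank() == 0:  # only rank 0 prints the final output
    print(tokenizer.decode(output_ids[0][0], skip_special_tokens=True))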
License
This model is provided as a quantized checkpoint for research and experimentation in high-performance inference contexts. Users are advised to validate outputs before any production use and to acknowledge the inherent risks.