DeepSeek-V3-int4-sym-gptq-inc
Introduction
DeepSeek-V3-int4-sym-gptq-inc is an INT4 quantization of the DeepSeek-V3 model, using symmetric quantization with a group size of 128. The weights were quantized with Intel's AutoRound algorithm, and the model is designed for inference on both CUDA and CPU devices.
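To make the terms concrete, here is a minimal sketch of symmetric, group-wise INT4 quantization in plain PyTorch. This is illustrative only: `quantize_sym_int4` is a hypothetical helper, and the real AutoRound algorithm additionally learns the rounding rather than using naive round-to-nearest.

```python
import torch

def quantize_sym_int4(weight: torch.Tensor, group_size: int = 128):
    """Toy symmetric INT4 quantizer: one scale per group of `group_size` weights.
    The zero-point is fixed at 0 (symmetric); the int4 range is [-8, 7]."""
    out_features, in_features = weight.shape
    w = weight.reshape(out_features, in_features // group_size, group_size)
    scale = (w.abs().amax(dim=-1, keepdim=True) / 7).clamp_min(1e-8)
    q = torch.clamp(torch.round(w / scale), min=-8, max=7)
    return q.reshape(out_features, in_features), scale

w = torch.randn(256, 512)
q, scale = quantize_sym_int4(w)
# Dequantize and measure the worst-case error introduced by quantization.
w_hat = (q.reshape(256, -1, 128) * scale).reshape(256, 512)
print((w - w_hat).abs().max())
```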
Architecture
The model is derived from the DeepSeek-V3 base model and uses symmetric quantization. Some layers can produce activations that exceed the typical FP16 range, so the configuration handles those layers specially (for example, by loading in BF16 on CPU, as shown in the guide below). The architecture supports both CUDA and CPU deployment, with device-mapping configurations for optimal performance.
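Placing a checkpoint of this size across devices is typically done through transformers' `device_map` mechanism. A minimal sketch, assuming a multi-GPU host; the per-device memory budgets below are illustrative values, not settings from this model card:

```python
import torch
from transformers import AutoModelForCausalLM

quantized_model_dir = "OPEA/DeepSeek-V3-int4-sym-gptq-inc"

# Illustrative per-device budgets; tune these for your hardware.
max_memory = {0: "75GiB", 1: "75GiB", "cpu": "200GiB"}

model = AutoModelForCausalLM.from_pretrained(
    quantized_model_dir,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto",      # let accelerate place layers across devices
    max_memory=max_memory,  # cap memory usage per device
)
```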
Training
The model was quantized using Intel's AutoRound algorithm, which tunes weight rounding to reduce quantization error. The quantization process applies different configurations per layer, especially for layers prone to large tensor values. Complete training details for the base model, including datasets and preprocessing, are not provided.
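For reference, a quantization run of this kind is typically launched through the auto-round library roughly as follows. This is a sketch under assumptions: the exact arguments used to produce this model are not documented here, and quantizing a model of DeepSeek-V3's size requires substantial memory.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

# Assumed recipe matching this card's description (4-bit, group size 128, symmetric).
model_name = "deepseek-ai/DeepSeek-V3"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()
autoround.save_quantized("./DeepSeek-V3-int4-sym", format="auto_gptq")
```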
Guide: Running Locally
Basic Steps
- Install dependencies:

  ```bash
  pip3 install auto-round
  ```

- Prepare the model and tokenizer:

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
  import torch

  quantized_model_dir = "OPEA/DeepSeek-V3-int4-sym-gptq-inc"
  model = AutoModelForCausalLM.from_pretrained(
      quantized_model_dir,
      torch_dtype=torch.float16,
      trust_remote_code=True,
      device_map="auto",
  )
  tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, trust_remote_code=True)
  ```

- Inference example (a fuller generation example follows this list):

  ```python
  prompts = ["Your prompt here"]
  inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True)
  outputs = model.generate(
      input_ids=inputs["input_ids"].to(model.device),
      attention_mask=inputs["attention_mask"].to(model.device),
  )
  decoded_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
  ```

- Run on CPU with ITREX:

  ```python
  from auto_round import AutoRoundConfig

  quantization_config = AutoRoundConfig(backend="cpu")
  model = AutoModelForCausalLM.from_pretrained(
      quantized_model_dir,
      torch_dtype=torch.bfloat16,
      trust_remote_code=True,
      device_map="cpu",
      quantization_config=quantization_config,
  )
  ```
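The bare `generate` call in the inference example uses the model's default generation length, which is often too short. A minimal usage sketch with explicit controls; the values are illustrative, not settings from this model card:

```python
outputs = model.generate(
    input_ids=inputs["input_ids"].to(model.device),
    attention_mask=inputs["attention_mask"].to(model.device),
    max_new_tokens=512,  # illustrative cap; tune for your use case
    do_sample=False,     # greedy decoding for reproducible output
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```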
Suggested Cloud GPUs
For optimal performance, consider cloud GPUs with at least 80 GB of memory each for CUDA inference. Given the size of DeepSeek-V3, plan for multiple such GPUs and allow for long model-loading times; a rough estimate follows.
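As a rough back-of-the-envelope check, assuming DeepSeek-V3's published total of about 671B parameters (a figure about the base model, not taken from this card):

```python
# INT4 weights take ~0.5 bytes per parameter; KV cache, activations,
# and quantization scales are ignored in this rough estimate.
params = 671e9
weight_bytes = params * 0.5
print(f"~{weight_bytes / 1e9:.0f} GB of weights")         # ~336 GB
gpus = int(-(-weight_bytes // 80e9))                      # ceil over 80 GB GPUs
print(f"at least {gpus} x 80 GB GPUs for weights alone")  # 5
```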
License
Please follow the license of the original DeepSeek-V3 model. This model card does not constitute legal advice, and Hugging Face is not responsible for third-party use of the model. Please consult an attorney before using it for commercial purposes.