DeepSeek-V3-int4-sym-gptq-inc

OPEA

Introduction

DeepSeek-V3-int4-sym-gptq-inc is an INT4 quantization of the DeepSeek-V3 model with a group size of 128 and symmetric quantization, produced with Intel’s auto-round algorithm. The model is designed for inference on both CUDA and CPU devices.
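
As a quick sanity check, the quantization settings can be read from the checkpoint configuration; the sketch below assumes the config reports the bit width, group size, and symmetry described above.

    from transformers import AutoConfig

    # Load only the configuration and inspect the quantization settings
    # (expected: bits=4, group_size=128, sym=True for a GPTQ-style export).
    config = AutoConfig.from_pretrained("OPEA/DeepSeek-V3-int4-sym-gptq-inc", trust_remote_code=True)
    print(config.quantization_config)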

Architecture

The model retains the DeepSeek-V3 architecture and applies symmetric weight quantization. Its configuration accounts for layers whose inputs and outputs can exceed the typical FP16 range, and the checkpoint supports both CUDA and CPU deployment with device-mapping settings chosen for the target hardware.
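
When a single GPU cannot hold the checkpoint, device placement can be constrained at load time; the per-device memory limits below are illustrative assumptions, not values taken from the model card.

    from transformers import AutoModelForCausalLM
    import torch

    # Sketch of constraining device placement when the checkpoint does not fit
    # on a single GPU; the per-device limits are illustrative only.
    model = AutoModelForCausalLM.from_pretrained(
        "OPEA/DeepSeek-V3-int4-sym-gptq-inc",
        torch_dtype=torch.float16,
        trust_remote_code=True,
        device_map="auto",
        max_memory={0: "75GiB", "cpu": "200GiB"},
    )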

Training

The model was quantized with the Intel auto-round algorithm, which optimizes weight rounding to limit accuracy loss. The quantization process uses different configurations where needed, especially for layers prone to large tensor values. Complete training details, including calibration datasets and preprocessing, are not provided.
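
For reference, a quantization run with auto-round along these lines might look like the sketch below; the arguments mirror the settings named above (4 bits, group size 128, symmetric) but are not the exact recipe used for this release.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from auto_round import AutoRound

    # Illustrative auto-round run: 4-bit symmetric weights, group size 128,
    # exported in a GPTQ-compatible format.
    model_name = "deepseek-ai/DeepSeek-V3"
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

    autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
    autoround.quantize()
    autoround.save_quantized("./DeepSeek-V3-int4-sym-gptq-inc", format="auto_gptq")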

Guide: Running Locally

Basic Steps

  1. Install Dependencies:

    pip3 install auto-round
    
  2. Prepare Model and Tokenizer:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch
    
    quantized_model_dir = "OPEA/DeepSeek-V3-int4-sym-gptq-inc"
    model = AutoModelForCausalLM.from_pretrained(quantized_model_dir, torch_dtype=torch.float16, trust_remote_code=True, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, trust_remote_code=True)
    
  3. Inference Example (a combined end-to-end sketch with generation settings follows these steps):

    prompts = ["Your prompt here"]
    inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(input_ids=inputs["input_ids"].to(model.device), attention_mask=inputs["attention_mask"].to(model.device))
    decoded_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    
  4. Run on CPU with ITREX:

    from auto_round import AutoRoundConfig

    # Route inference through the auto-round CPU backend (ITREX) and keep all weights on the CPU.
    quantization_config = AutoRoundConfig(backend="cpu")
    model = AutoModelForCausalLM.from_pretrained(quantized_model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="cpu", quantization_config=quantization_config)
    
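Putting the steps above together, a minimal end-to-end sketch is shown below; the prompt, max_new_tokens, and decoding settings are illustrative choices, not values from the model card.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    quantized_model_dir = "OPEA/DeepSeek-V3-int4-sym-gptq-inc"
    model = AutoModelForCausalLM.from_pretrained(quantized_model_dir, torch_dtype=torch.float16, trust_remote_code=True, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, trust_remote_code=True)

    # Tokenize a single prompt and generate with explicit, illustrative settings.
    inputs = tokenizer(["Explain symmetric INT4 quantization in one sentence."], return_tensors="pt")
    outputs = model.generate(
        input_ids=inputs["input_ids"].to(model.device),
        attention_mask=inputs["attention_mask"].to(model.device),
        max_new_tokens=128,  # cap on newly generated tokens
        do_sample=False,     # greedy decoding for reproducibility
    )
    print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])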

Suggested Cloud GPUs

For CUDA inference, consider cloud GPUs with at least 80 GB of memory, and make sure you have sufficient resources to handle model loading times and inference workloads.
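
Before loading, it can help to confirm how much GPU memory is actually free; this snippet assumes a CUDA-capable environment.

    import torch

    # Report free vs. total memory on each visible GPU before loading the checkpoint.
    for i in range(torch.cuda.device_count()):
        free, total = torch.cuda.mem_get_info(i)
        print(f"GPU {i}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")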

License

Please follow the license of the original DeepSeek-V3 model. The license does not constitute legal advice, and Hugging Face is not responsible for third-party use of the model. Consult an attorney for commercial use.
