QwQ-32B-Preview-gptqmodel-4bit-vortex-v2

ModelCloud

Introduction

QwQ-32B-Preview-gptqmodel-4bit-vortex-v2 is a quantized language model for text generation, designed for high efficiency at low (4-bit) precision. It is part of the ModelCloud suite and is optimized for chat and instruction-following tasks.

Architecture

This model uses the GPTQ quantization method to reduce model size and increase computational efficiency. Key quantization settings, illustrated in the configuration sketch after this list, include:

  • 4-bit precision
  • Group size of 32
  • True sequential processing
  • Symmetric quantization
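The settings above map onto GPTQModel's quantization configuration. A minimal sketch, assuming the QuantizeConfig field names (bits, group_size, sym, true_sequential) in the installed gptqmodel version:

    from gptqmodel import QuantizeConfig
    
    # Sketch of a config matching the settings listed above; field names
    # are assumptions based on the gptqmodel QuantizeConfig API.
    quant_config = QuantizeConfig(
        bits=4,                # 4-bit precision
        group_size=32,         # group size of 32
        sym=True,              # symmetric quantization
        true_sequential=True,  # true sequential layer-by-layer processing
    )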

Training

The model was quantized with the GPTQModel library, version 1.4.4, using a dampening percentage of 0.1 and a damp auto-increment of 0.0015. The quantization process applies symmetric quantization and static group adjustment to limit accuracy loss at 4-bit precision; a sketch of such a run follows.
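For illustration only, a hedged sketch of what such a quantization run might look like with gptqmodel. The base model ID, calibration data, save path, and the damp_percent/damp_auto_increment/static_groups keyword names are assumptions, not ModelCloud's actual script:

    from gptqmodel import GPTQModel, QuantizeConfig
    
    # Hypothetical quantization run; parameter names are assumptions
    # based on the gptqmodel API and the settings reported above.
    quant_config = QuantizeConfig(
        bits=4,
        group_size=32,
        sym=True,
        damp_percent=0.1,            # dampening percentage
        damp_auto_increment=0.0015,  # damp auto-increment
        static_groups=True,          # static group adjustment
    )
    
    # "calibration_data" is a placeholder for a real calibration corpus.
    calibration_data = ["Example calibration text for quantization."]
    
    model = GPTQModel.load("Qwen/QwQ-32B-Preview", quant_config)
    model.quantize(calibration_data)
    model.save("QwQ-32B-Preview-gptqmodel-4bit")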

Guide: Running Locally

  1. Install Dependencies: Ensure you have Python installed. Use pip to install transformers and gptqmodel.

    pip install transformers
    pip install gptqmodel
    
  2. Load the Model and Tokenizer:

    from transformers import AutoTokenizer
    from gptqmodel import GPTQModel
    
    # Load the tokenizer and the pre-quantized 4-bit weights from the Hub
    tokenizer = AutoTokenizer.from_pretrained("ModelCloud/QwQ-32B-Preview-gptqmodel-4bit-vortex-v2")
    model = GPTQModel.load("ModelCloud/QwQ-32B-Preview-gptqmodel-4bit-vortex-v2")
    
  3. Create Input Messages:

    messages = [
        {"role": "system", "content": "You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step."},
        {"role": "user", "content": "How can I design a data structure in C++ to store the top 5 largest integer numbers?"},
    ]
    
  4. Generate Responses:

    # Apply the chat template and tokenize the conversation
    input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
    # Generate up to 512 new tokens on the model's device
    outputs = model.generate(input_ids=input_tensor.to(model.device), max_new_tokens=512)
    # Decode only the newly generated tokens, skipping prompt and special tokens
    result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
    
    print(result)
    
  5. Consider Using Cloud GPUs: For better performance with a model of this size, consider cloud-based GPUs from providers such as AWS, Google Cloud, or Azure. A quick local availability check is sketched below.
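Before loading a 32B model locally, it can help to confirm that a CUDA GPU is visible. This is a generic PyTorch snippet, independent of this model:

    import torch
    
    # Verify a CUDA-capable GPU is available before attempting to load the model
    if torch.cuda.is_available():
        print(f"GPU detected: {torch.cuda.get_device_name(0)}")
    else:
        print("No CUDA GPU detected; running a 32B model on CPU is impractical.")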

License

This model is released under the Apache 2.0 license. For more details, refer to the license file in the model repository.
