QwQ-32B-Preview-GGUF
by bartowski
Introduction
QwQ-32B-Preview-GGUF is a set of quantized versions of the Qwen/QwQ-32B-Preview model, designed for efficient text generation. The quants were produced with llama.cpp and are available at several quantization levels to suit different performance and resource requirements.
Architecture
The model files are llama.cpp imatrix quantizations, prepared with the llama.cpp framework. Various quantization types such as BF16, Q8_0, Q6_K, and others are available, offering a range of quality and performance options.
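To check which quantization type a downloaded file actually contains, the metadata embedded in the GGUF file can be inspected. A minimal sketch, assuming the gguf Python package and its gguf-dump utility are available (this tooling is an assumption, not something specified by the model card):

  # Install the gguf Python package, which ships a small metadata dumper.
  pip install gguf
  # Print the GGUF header/metadata, including the file's quantization type.
  gguf-dump QwQ-32B-Preview-Q4_K_M.gguf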
Training
The quantization process used llama.cpp's imatrix option together with a calibration dataset from a public repository. Quant files were produced at several compression levels, offering trade-offs between size and quality. The quantization effort was assisted by contributors who provided dataset support and inspiration for advanced techniques.
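For context, an imatrix quantization workflow with llama.cpp's tools looks roughly like the following sketch; the file names and the calibration file are placeholders for illustration, not the exact artifacts used for this release:

  # 1. Compute an importance matrix from a calibration text file (placeholder names).
  ./llama-imatrix -m QwQ-32B-Preview-f16.gguf -f calibration.txt -o imatrix.dat
  # 2. Quantize the full-precision GGUF using that importance matrix.
  ./llama-quantize --imatrix imatrix.dat QwQ-32B-Preview-f16.gguf QwQ-32B-Preview-Q4_K_M.gguf Q4_K_M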
Guide: Running Locally
To run the QwQ-32B-Preview-GGUF model locally:
- Install Dependencies:
  - Ensure you have the huggingface_hub CLI installed: pip install -U "huggingface_hub[cli]"
- Download the Model:
  - Use the following command to download the desired file:
    huggingface-cli download bartowski/QwQ-32B-Preview-GGUF --include "QwQ-32B-Preview-Q4_K_M.gguf" --local-dir ./
- Choose the Right Quant:
  - Consider your system's RAM/VRAM capacity and choose a quant file that fits within available resources. A quant 1-2GB smaller than your GPU's VRAM is recommended for optimal performance (see the sizing sketch after these steps).
- Running on Cloud GPUs:
  - For enhanced performance, consider using cloud platforms offering NVIDIA GPUs. Services like AWS, Google Cloud, and Azure provide robust GPU options suitable for running large models.
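As referenced in the quant-selection step, a quick sanity check is to compare the file's size against your GPU's free memory. A minimal sketch, assuming an NVIDIA GPU with nvidia-smi available and the illustrative file name from the download step:

  # Size of the downloaded quant.
  ls -lh QwQ-32B-Preview-Q4_K_M.gguf
  # Total and free VRAM per GPU; leave roughly 1-2GB of headroom for the context/KV cache.
  nvidia-smi --query-gpu=memory.total,memory.free --format=csv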
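Once a quant has been downloaded, it can be loaded with any GGUF-compatible runtime. A minimal sketch using llama.cpp's llama-cli; the prompt, context size, and GPU-offload values below are illustrative, not recommended settings:

  # -m: model file, -p: prompt, -n: max tokens to generate,
  # -c: context size, -ngl: number of layers to offload to the GPU.
  ./llama-cli -m ./QwQ-32B-Preview-Q4_K_M.gguf \
    -p "Explain the Pythagorean theorem step by step." \
    -n 512 -c 4096 -ngl 99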
License
The QwQ-32B-Preview-GGUF model is released under the Apache-2.0 license, which permits personal and commercial use, modification, and distribution. More details can be found on the license page.