QVQ-72B-Preview-GGUF
by bartowski
Introduction
The QVQ-72B-Preview-GGUF is an advanced quantized model designed to handle image-text-to-text tasks. It uses the GGUF format and provides functionalities for chat applications. The model is based on Qwen/QVQ-72B-Preview and is available in multiple quantized versions to suit different performance and quality needs.
Architecture
The model consists of llama.cpp imatrix quantizations of the original Qwen/QVQ-72B-Preview. Quantization is performed with the llama.cpp library, using the imatrix option with a calibration dataset so that the most important weights retain more precision at low bit widths.
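A general sketch of how llama.cpp imatrix quants are produced (not necessarily this repository's exact pipeline; the F16 source file and calibration file names are illustrative):

```
# 1. Compute an importance matrix from a calibration text file:
./llama-imatrix -m QVQ-72B-Preview-F16.gguf -f calibration.txt -o imatrix.dat
# 2. Quantize, letting the importance matrix guide which weights keep precision:
./llama-quantize --imatrix imatrix.dat QVQ-72B-Preview-F16.gguf QVQ-72B-Preview-Q4_K_M.gguf Q4_K_M
```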
Training
Quantization trades model size against output quality. Various quantized versions are available, such as Q8_0, Q6_K, and Q5_K_M, each differing in size and quality: lower-bit quants are smaller but lose more fidelity. This range of options lets the model run efficiently on hardware with different capabilities.
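To compare the available variants and their sizes before downloading, one option is to query the public Hugging Face model tree API (this assumes jq is installed; reported sizes are in bytes):

```
curl -s "https://huggingface.co/api/models/bartowski/QVQ-72B-Preview-GGUF/tree/main" \
  | jq -r '.[] | select(.path | endswith(".gguf")) | "\(.path)\t\(.size)"'
```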
Guide: Running Locally
To run the model locally, the following steps are recommended:
- Download the Model: Use the huggingface-cli to download a specific quantized file from the Hugging Face repository:

  ```
  pip install -U "huggingface_hub[cli]"
  huggingface-cli download bartowski/QVQ-72B-Preview-GGUF --include "QVQ-72B-Preview-Q4_K_M.gguf" --local-dir ./
  ```

  If the model is larger than 50GB, it is split into multiple files; see the split-download sketch after this list.
- Run the Model: Use the llama-qwen2vl-cli tool to run inference on an image. For example:

  ```
  ./llama-qwen2vl-cli -m /models/QVQ-72B-Preview-Q4_K_M.gguf \
      --mmproj /models/mmproj-QVQ-72B-Preview-f16.gguf \
      -p 'How many fingers does this hand have?' \
      --image '/models/hand.jpg'
  ```
- Select Appropriate Quantization: Choose a quantization type that fits your memory and performance requirements; as a rule of thumb, pick a file 1-2GB smaller than the memory available to you (VRAM alone for speed, or combined VRAM and system RAM for maximum quality). K-quants (e.g., Q4_K_M) are recommended for general use, while I-quants offer better quality for their size on cuBLAS (NVIDIA) or rocBLAS (AMD) builds but are slower on CPU. A quick fit check is sketched after this list.
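As noted in the download step, quants larger than 50GB are published as a directory of split files rather than a single .gguf. A hedged example of fetching every part (the Q8_0 folder name follows the repository's naming pattern but is illustrative here):

```
huggingface-cli download bartowski/QVQ-72B-Preview-GGUF --include "QVQ-72B-Preview-Q8_0/*" --local-dir ./
```

llama.cpp can then be pointed at the first split part and should locate the remaining parts automatically.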
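For the quantization choice, a minimal fit check, assuming an NVIDIA GPU with nvidia-smi available: compare the quant's size on disk to the reported VRAM, leaving a gigabyte or two of headroom for the context window:

```
# Size of the chosen quant on disk:
ls -lh QVQ-72B-Preview-Q4_K_M.gguf
# Total VRAM reported by the GPU (NVIDIA only):
nvidia-smi --query-gpu=memory.total --format=csv
```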
Cloud GPUs
For optimal performance, especially with the larger quants, cloud GPUs such as those offered by AWS, Google Cloud, or Azure are recommended. Offloading the model's layers to a GPU yields much faster inference than CPU-only execution.
License
The model is distributed under the Qwen license. For details, refer to the license link in the original Qwen/QVQ-72B-Preview repository.