QwQ-32B-Preview

Qwen

Introduction

QwQ-32B-Preview is an experimental research model created by the Qwen Team to advance AI reasoning capabilities. This preview release showcases strong analytical ability, particularly in math and coding, but carries several known limitations, including language mixing, recursive reasoning loops, and safety concerns, and it still needs improvement in common-sense reasoning and nuanced language understanding.

Architecture

  • Type: Causal Language Models
  • Training Stage: Pretraining & Post-training
  • Architecture Components: Transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
  • Number of Layers: 64
  • Attention Heads (GQA): 40 for Q and 8 for KV
  • Context Length: Full 32,768 tokens (these figures can be verified from the model config, as sketched below)
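
These figures can be read back from the published configuration without downloading any weights. A minimal sketch, assuming access to the Hugging Face Hub (attribute names follow the Transformers Qwen2 config):

    from transformers import AutoConfig
    
    # Fetches only the small config file, not the 32.5B-parameter checkpoint.
    config = AutoConfig.from_pretrained("Qwen/QwQ-32B-Preview")
    print(config.num_hidden_layers)        # expected: 64
    print(config.num_attention_heads)      # expected: 40 query heads
    print(config.num_key_value_heads)      # expected: 8 KV heads (GQA)
    print(config.max_position_embeddings)  # expected: 32768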

Training

The model is trained in two stages, pretraining and post-training, building on recent advances in transformer architectures. Running it requires a recent version of the Hugging Face Transformers library (see step 1 of the guide below).

Model Stats

  • Total Parameters: 32.5 billion
  • Parameters (Non-Embedding): 31.0 billion
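
For rough hardware sizing, a back-of-envelope estimate (an assumption, not a figure from this card: bfloat16 weights at 2 bytes per parameter, excluding activations and the KV cache):

    # 32.5e9 params x 2 bytes is roughly 65 GB for the weights alone,
    # which is why the guide below suggests cloud GPUs.
    total_params = 32.5e9
    print(f"~{total_params * 2 / 1e9:.0f} GB")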

Guide: Running Locally

  1. Install Dependencies: Ensure you have a recent version of the Hugging Face Transformers library; older versions may fail to load the model (the Qwen model cards report a KeyError: 'qwen2' with transformers earlier than 4.37.0).
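    A quick version check, as a minimal sketch:
    # Upgrade first if needed: pip install --upgrade transformers
    import transformers
    print(transformers.__version__)
    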
  2. Load Model and Tokenizer:
    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    model_name = "Qwen/QwQ-32B-Preview"
    # torch_dtype="auto" keeps the checkpoint's dtype; device_map="auto"
    # shards the weights across the available GPUs.
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
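    To check where the shards landed (hf_device_map is populated when device_map="auto" is used):
    print(model.hf_device_map)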
    
  3. Prepare Input and Generate Text:
    prompt = "Your prompt here."
    messages = [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": prompt}]
    # Render the conversation with the model's chat template.
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    generated_ids = model.generate(**model_inputs, max_new_tokens=512)
    # Drop the prompt tokens so only the newly generated text is decoded.
    generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(response)
    
  4. Suggested Environment: A 32.5B-parameter model needs substantial GPU memory (see the estimate under Model Stats), so cloud GPUs from providers such as AWS, Google Cloud, or Azure are a practical choice. A streaming variant of step 3 is sketched after this list.
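
For interactive use, tokens can be printed as they are generated instead of decoded only at the end. A minimal sketch using the Transformers TextStreamer helper, reusing the model, tokenizer, and model_inputs from the steps above:

    from transformers import TextStreamer
    
    # skip_prompt=True avoids echoing the rendered chat prompt to stdout;
    # extra keyword arguments are forwarded to tokenizer.decode().
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    model.generate(**model_inputs, max_new_tokens=512, streamer=streamer)

generate() still returns the full token sequence, so the streamed text can also be decoded afterwards as in step 3.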

License

The QwQ-32B-Preview model is licensed under the Apache-2.0 License; the full license text is available in the model repository on Hugging Face.
