QwQ-1.5B-Persona
by Geralt-Targaryen

Introduction
QwQ-1.5B-Persona is a fine-tuned model based on Qwen2.5-1.5B-Instruct, trained on a dataset of one million math persona examples. It is designed to act as a draft model that accelerates inference of the larger QwQ-32B model via speculative decoding, and it can also be used as a standalone model.
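The draft-model idea can be illustrated with a minimal, self-contained sketch of the draft-and-verify loop behind speculative decoding. The "models" below are stand-in functions over integer tokens, not the real QwQ checkpoints; in practice the whole loop runs inside transformers' `generate(..., assistant_model=...)`.

```python
def speculative_step(draft_next, target_next, prefix, draft_len=5):
    """One round of draft-and-verify: the cheap draft model proposes
    `draft_len` tokens, the target model keeps the longest agreeing
    prefix, then appends its own correction token."""
    # 1) Draft proposes tokens autoregressively (cheap model).
    proposal = []
    ctx = list(prefix)
    for _ in range(draft_len):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2) Target verifies each proposed token (greedy acceptance).
    accepted = []
    ctx = list(prefix)
    for tok in proposal:
        if target_next(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            break

    # 3) Target supplies the next token itself, so every round makes
    #    progress even when the draft is rejected immediately.
    correction = target_next(ctx)
    return accepted + [correction]
```

When the draft agrees with the target, one expensive verification pass yields several tokens at once; when it disagrees, the round still produces one correct token, which is why the method never degrades output quality.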
Architecture
QwQ-1.5B-Persona utilizes three draft length policies to enhance performance:
- Constant: Maintains a fixed draft length of 5 tokens.
- Heuristics: Adjusts the draft length dynamically based on previous round performance.
- SVIP: Adapts draft length using model entropy, offering advanced dynamic control.
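The three policies can be sketched as simple functions that choose the next round's draft length. This is an illustrative sketch only: the update rules for the heuristic policy and the entropy-to-length mapping for SVIP are assumptions, not the exact formulas used by the model.

```python
import math

def constant_policy(_last_accepted, _entropy, length=5):
    # Constant: always propose a fixed number of draft tokens.
    return length

def heuristic_policy(last_accepted, _entropy, prev_length=5,
                     min_len=1, max_len=10):
    # Heuristic: grow the draft when the whole previous draft was
    # accepted, shrink it after rejections (hypothetical update rule).
    if last_accepted >= prev_length:
        return min(prev_length + 2, max_len)
    return max(last_accepted + 1, min_len)

def svip_policy(_last_accepted, entropy, max_len=10):
    # SVIP-style: low draft-model entropy (confident predictions)
    # permits longer drafts; high entropy cuts the draft short.
    # The exponential mapping here is an illustrative assumption.
    return max(1, min(max_len, int(max_len * math.exp(-entropy))))
```

Dynamic policies matter because a rejected draft token wastes draft-model compute; adapting the length to recent acceptance (or to the draft model's confidence) keeps the speedup high across easy and hard inputs.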
Evaluation
The model's performance is evaluated on datasets such as MATH, GPQA, and AIME using 200 samples across different levels. These evaluations are conducted on hardware equipped with two A100 GPUs, each with 40GB of memory. The model demonstrates varying speedups across different draft length policies, with SVIP showing the highest average improvements.
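To reproduce a speedup figure like those reported above, one can time generation with and without the draft model. The helper below is a hypothetical sketch (not the authors' evaluation harness) that takes two generation callables and returns their wall-clock ratio.

```python
import time

def measure_speedup(generate_baseline, generate_speculative, n_runs=3):
    """Return baseline_time / speculative_time, averaged over n_runs.
    Both arguments are zero-argument callables that run one generation;
    e.g. lambdas wrapping model.generate with and without
    assistant_model. Purely illustrative timing code."""
    def avg_time(fn):
        start = time.perf_counter()
        for _ in range(n_runs):
            fn()
        return (time.perf_counter() - start) / n_runs

    return avg_time(generate_baseline) / avg_time(generate_speculative)
```

A ratio above 1.0 means the speculative setup is faster; for a fair comparison both callables should use identical prompts, `max_new_tokens`, and sampling settings.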
Guide: Running Locally
To run QwQ-1.5B-Persona locally:
- Install the Transformers library:

```bash
pip install transformers
```
- Load the models:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/QwQ-32B-Preview",
    torch_dtype="auto",
    device_map={'': 0}
)
draft_model = AutoModelForCausalLM.from_pretrained(
    "Geralt-Targaryen/QwQ-1.5B-Persona",
    torch_dtype="auto",
    device_map={'': 0}
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B-Preview")
```
- Prepare the input and generate text:

```python
prompt = "How many r in strawberry."
messages = [
    {"role": "system", "content": "You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Passing the draft model as assistant_model enables speculative decoding.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    assistant_model=draft_model
)
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
- Cloud GPUs: for optimal performance, consider using cloud-based GPUs such as the NVIDIA A100.
License
The QwQ-1.5B-Persona model is released under the Apache 2.0 License, which permits use, distribution, and modification subject to its terms.