Qwen2.5-72B-Instruct-AWQ


Introduction

Qwen2.5 is the latest iteration of the Qwen series of large language models. The release spans models from 0.5 to 72 billion parameters and brings significant improvements in knowledge, instruction following, text generation, and multilingual support, along with stronger coding and mathematics capabilities, long-context handling, and structured-data understanding. The AWQ-quantized, 4-bit, instruction-tuned 72B model described here is a causal language model built with components such as RoPE and SwiGLU.

Architecture

Qwen2.5 employs a transformer architecture with several advanced features; these can be verified against the published configuration, as shown in the sketch after the list:

  • Type: Causal Language Models
  • Training Stage: Pretraining & Post-training
  • Components: RoPE, SwiGLU, RMSNorm, and Attention QKV bias
  • Number of Parameters: 72.7 billion in total; 70 billion non-embedding
  • Layer Count: 80
  • Attention Heads (GQA): 64 for Q and 8 for KV
  • Context Length: Full 131,072 tokens with a generation capacity of 8,192 tokens
  • Quantization: AWQ 4-bit
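
The figures above can be cross-checked against the published model configuration without downloading any weights. A minimal sketch, assuming network access to the Hugging Face Hub and the standard transformers Qwen2 config attribute names:

    from transformers import AutoConfig

    # Fetches only config.json, not the model weights.
    config = AutoConfig.from_pretrained("Qwen/Qwen2.5-72B-Instruct-AWQ")

    print(config.num_hidden_layers)        # layer count (expected: 80)
    print(config.num_attention_heads)      # query heads (expected: 64)
    print(config.num_key_value_heads)      # key/value heads for GQA (expected: 8)
    print(config.max_position_embeddings)  # maximum positions in the shipped config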

Training

The model is pre-trained and post-trained to enhance its performance across various tasks, including instruction following and multilingual text generation. AWQ (Activation-aware Weight Quantization) then compresses the weights to 4-bit precision, sharply reducing the memory footprint while largely preserving accuracy.
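
As a rough illustration of what 4-bit quantization buys at this scale, the back-of-the-envelope estimate below covers the weights alone; real deployments also need memory for the KV cache and runtime overhead, and the bytes-per-parameter figures are approximations:

    # Approximate weight memory for 72.7B parameters (estimates, weights only).
    params = 72.7e9

    fp16_gb = params * 2 / 1024**3    # 16-bit: 2 bytes per parameter -> ~135 GB
    awq4_gb = params * 0.5 / 1024**3  # AWQ 4-bit: ~0.5 bytes per parameter -> ~34 GB

    print(f"FP16 weights:      ~{fp16_gb:.0f} GB")
    print(f"AWQ 4-bit weights: ~{awq4_gb:.0f} GB")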

Guide: Running Locally

To run Qwen2.5 locally, follow these steps:

  1. Install Dependencies: Ensure you have a recent version of the transformers library; versions below 4.37.0 do not recognize the qwen2 architecture and raise KeyError: 'qwen2'.
  2. Load Model and Tokenizer:
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "Qwen/Qwen2.5-72B-Instruct-AWQ"
    # torch_dtype="auto" reads the dtype from the config; device_map="auto" shards layers across available GPUs.
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
  3. Generate Text:
    prompt = "Give me a short introduction to large language models."
    messages = [
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    generated_ids = model.generate(**model_inputs, max_new_tokens=512)
    # Strip the prompt tokens so only the newly generated reply is decoded.
    generated_ids = [out[len(inp):] for inp, out in zip(model_inputs.input_ids, generated_ids)]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    
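To continue a conversation, append the model's reply to the message history and repeat the same template-and-generate cycle. The following is a minimal sketch that reuses model, tokenizer, messages, and response from the steps above; the follow-up question is only an illustration:

    # Multi-turn continuation: feed the previous reply back as context.
    messages.append({"role": "assistant", "content": response})
    messages.append({"role": "user", "content": "Summarize that in one sentence."})
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    generated_ids = model.generate(**model_inputs, max_new_tokens=128)
    generated_ids = [out[len(inp):] for inp, out in zip(model_inputs.input_ids, generated_ids)]
    print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])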

Suggested Cloud GPUs: Consider cloud services such as AWS, Google Cloud, or Azure with high-memory GPUs (for example, an 80 GB A100 or H100): even at 4-bit precision the weights alone occupy roughly 35 GB, before accounting for the KV cache.

License

The Qwen2.5 model is released under the Qwen license. For more details, refer to the LICENSE file in the model repository.
