Qwen2.5-72B-Instruct-AWQ
Introduction
Qwen2.5 is the latest iteration of the Qwen series of large language models. The series includes models ranging from 0.5 to 72 billion parameters and brings significant improvements in knowledge, instruction following, text generation, and multilingual support, along with stronger coding and mathematics capabilities, long-context handling, and structured data understanding. This model is the AWQ-quantized 4-bit instruction-tuned 72B variant, a causal language model built with components such as RoPE and SwiGLU.
Architecture
Qwen2.5 employs a transformer architecture with the following characteristics:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Components: RoPE, SwiGLU, RMSNorm, and Attention QKV bias
- Number of Parameters: 72.7 billion in total; 70 billion non-embedding
- Layer Count: 80
- Attention Heads (GQA): 64 for Q and 8 for KV
- Context Length: up to 131,072 tokens in full, with up to 8,192 generated tokens
- Quantization: AWQ 4-bit
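To make the effect of grouped-query attention (GQA) concrete, the following is a rough sketch of how much memory the key/value cache would need at the full context length. The layer count, KV head count, and context length come from the list above; the head dimension of 128 and the 2-byte (16-bit) cache precision are assumptions not stated here, and the function name is only illustrative.

```python
def kv_cache_bytes(num_tokens, num_layers=80, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Approximate KV-cache size in bytes for one sequence.

    head_dim=128 and bytes_per_value=2 (16-bit cache) are assumptions;
    the layer and KV head counts are taken from the architecture list.
    """
    # The factor 2 covers storing both a key and a value per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


GIB = 1024 ** 3
FULL_CONTEXT = 131_072

gqa = kv_cache_bytes(FULL_CONTEXT)                    # 8 KV heads, as listed
mha = kv_cache_bytes(FULL_CONTEXT, num_kv_heads=64)   # hypothetical: one KV head per query head

print(f"per token (GQA):           {kv_cache_bytes(1) / 1024:.0f} KiB")  # 320 KiB
print(f"full context, 8 KV heads:  {gqa / GIB:.0f} GiB")                 # 40 GiB
print(f"full context, 64 KV heads: {mha / GIB:.0f} GiB")                 # 320 GiB
```

Under these assumptions, sharing 8 KV heads across 64 query heads cuts the cache to one eighth of what a one-to-one layout would need.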
Training
The model goes through pretraining followed by post-training, which improves its performance on tasks such as instruction following and multilingual text generation. AWQ quantization is then used to reduce the model size while preserving accuracy.
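As a back-of-the-envelope illustration of the size reduction, the sketch below compares the weight storage for 72.7 billion parameters at 16 bits and at 4 bits per parameter. It is only a lower bound for the quantized case: AWQ checkpoints also store per-group scales and zero points and typically keep some tensors (such as embeddings) unquantized, so the real files are somewhat larger. The helper name is hypothetical.

```python
TOTAL_PARAMS = 72.7e9  # total parameter count from the architecture list


def weight_gb(num_params, bits_per_param):
    """Idealized weight storage in gigabytes (10^9 bytes), ignoring metadata."""
    return num_params * bits_per_param / 8 / 1e9


fp16_gb = weight_gb(TOTAL_PARAMS, 16)
awq_gb = weight_gb(TOTAL_PARAMS, 4)

print(f"16-bit weights: {fp16_gb:.1f} GB")  # 145.4 GB
print(f"4-bit weights:  {awq_gb:.2f} GB")   # 36.35 GB (idealized lower bound)
```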
Guide: Running Locally
To run Qwen2.5 locally, follow these steps:
- Install Dependencies: Ensure you have the latest version of the transformers library, as versions below 4.37.0 may cause errors.
- Load Model and Tokenizer:
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "Qwen/Qwen2.5-72B-Instruct-AWQ"

    # Let transformers pick the dtype and spread the weights across available devices.
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype="auto",
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
- Generate Text:
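    prompt = "Give me a short introduction to large language model."
    messages = [
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ]

    # Render the conversation into the model's chat format as a plain string.
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

    generated_ids = model.generate(**model_inputs, max_new_tokens=512)

    # generate() returns prompt + completion; drop the prompt tokens so that
    # only the newly generated text is decoded.
    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]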
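For intuition about what apply_chat_template does in the step above, here is a minimal sketch that renders a message list into a ChatML-style string, the layout Qwen chat models use. The function render_chatml is hypothetical and written only for illustration; the authoritative template ships with the tokenizer and covers more cases (for example tool calls), so use the tokenizer method in real code.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Illustrative ChatML-style rendering; not the tokenizer's actual template."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>role ... <|im_end|> markers.
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example_messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "Give me a short introduction to large language model."},
]
rendered = render_chatml(example_messages)
print(rendered)
```

The trailing open assistant turn is what add_generation_prompt=True contributes: it signals that the next tokens should be the assistant's reply.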
Suggested Cloud GPUs: Consider using cloud services like AWS, Google Cloud, or Azure with high-memory GPUs to efficiently handle the model's resource requirements.
License
The Qwen2.5 model is released under the Qwen license. For more details, refer to the license link.