Qwen2.5-72B-Instruct
Introduction
Qwen2.5 is the latest series of Qwen large language models, offering base and instruction-tuned models ranging from 0.5 to 72 billion parameters. It brings improved capabilities in coding, mathematics, and instruction following, and supports more than 29 languages. The instruction-tuned 72B model is designed for improved long text generation and structured data handling.
Architecture
The Qwen2.5-72B model is a causal language model built on a transformer architecture with RoPE, SwiGLU, RMSNorm, and attention QKV bias. It comprises 72.7 billion parameters (70.0 billion non-embedding) organized into 80 layers, with 64 attention heads for queries and 8 for keys/values (grouped-query attention). It supports a full context length of 131,072 tokens and can generate up to 8,192 tokens.
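These figures can be checked against the published configuration without downloading the 72B weights. Below is a minimal sketch using the standard `AutoConfig` API from transformers; the attribute names are the usual ones for Qwen2-style configs.

```python
from transformers import AutoConfig

# Fetches only the small config.json, not the model weights.
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-72B-Instruct")

print(config.num_hidden_layers)        # expected: 80 transformer layers
print(config.num_attention_heads)      # expected: 64 query heads
print(config.num_key_value_heads)      # expected: 8 key/value heads (GQA)
print(config.max_position_embeddings)  # native context window in the shipped config
```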
Training
Qwen2.5 undergoes both pretraining and post-training. Its development incorporated specialized expert models for coding and mathematics, significantly boosting its knowledge and performance in those areas. The model excels at generating long texts and understanding structured data, and it is resilient to diverse system prompts, which facilitates effective chatbot implementations.
Guide: Running Locally
- Install Requirements: Ensure the latest version of Hugging Face Transformers is installed (e.g. `pip install -U transformers`) to avoid compatibility issues; `device_map="auto"` in the snippet below also requires the `accelerate` package.
- Load Model and Tokenizer:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-72B-Instruct"

# torch_dtype="auto" uses the precision stored in the checkpoint;
# device_map="auto" spreads the weights across available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
- Generate Text:

```python
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt},
]

# Render the conversation with the model's chat template.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=512)

# Strip the prompt tokens so only the newly generated text is decoded.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
- Handling Long Texts: The shipped configuration accepts inputs up to 32,768 tokens. To process longer inputs, up to the full 131,072-token context, enable YaRN rope scaling in the model configuration, as sketched after this list.
- Cloud GPUs: For optimal performance with a model of this size, cloud GPUs such as those from AWS, Google Cloud, or Azure are recommended.
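For the long-text step above, the Qwen2.5 model card enables YaRN by adding a `rope_scaling` block to the model's `config.json`, with a factor of 4.0 over the native 32,768-token window (4 × 32,768 = 131,072 tokens). The values below are the documented ones; applying them by overriding the loaded config in Python, rather than editing `config.json` directly, is an assumption for illustration.

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "Qwen/Qwen2.5-72B-Instruct"
config = AutoConfig.from_pretrained(model_name)

# YaRN settings from the Qwen2.5 model card; extends the context
# from 32,768 to 131,072 tokens.
config.rope_scaling = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    torch_dtype="auto",
    device_map="auto",
)
```

Note that transformers applies YaRN statically, i.e. with a constant scaling factor regardless of input length, which can slightly affect performance on shorter texts, so it is best enabled only when long inputs are actually needed.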
License
The model is released under the Qwen license, denoted as "qwen". For detailed terms, refer to the license document in the model repository.