Qwen2.5-7B-Instruct
Introduction
Qwen2.5 is the latest generation of the Qwen series of large language models, offering base and instruction-tuned variants from 0.5B to 72B parameters. Compared with its predecessor, it brings substantial gains in knowledge, coding, and mathematics, along with improved instruction following, long-text generation, structured-data understanding, and multilingual support covering more than 29 languages. This repository contains the instruction-tuned 7B model, a causal language model.
Architecture
- Type: Causal Language Models
- Training Stages: Pretraining & Post-training
- Architecture: Transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
- Parameters: Total 7.61B (Non-Embedding 6.53B)
- Layers: 28
- Attention Heads: 28 for Q and 4 for KV
- Context Length: Up to 131,072 tokens; can generate up to 8,192 tokens
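These hyperparameters can be verified without downloading the weights; a minimal sketch using transformers' AutoConfig (attribute names follow the standard Qwen2 configuration schema):

```python
from transformers import AutoConfig

# Fetches only the configuration file, not the model weights.
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

print(config.num_hidden_layers)        # 28 transformer layers
print(config.num_attention_heads)      # 28 query heads
print(config.num_key_value_heads)      # 4 key/value heads (grouped-query attention)
print(config.max_position_embeddings)  # maximum supported context length
```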
Training
The model was pre-trained on a large-scale corpus and then post-trained to strengthen its capabilities in domains such as coding and mathematics, and to improve instruction following. Contexts longer than 32,768 tokens are supported at inference time through RoPE-scaling techniques such as YaRN (see Processing Long Texts below).
Guide: Running Locally
- Requirements: Use a recent Hugging Face Transformers library; versions below 4.37.0 do not include Qwen2 support and will fail to load the model with KeyError: 'qwen2'. A quick version check is sketched below.
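A minimal sanity check before running the quickstart (the packaging module ships as a dependency of transformers):

```python
import transformers
from packaging import version

# Qwen2 architecture support was added in transformers 4.37.0.
assert version.parse(transformers.__version__) >= version.parse("4.37.0"), (
    "transformers is too old; upgrade with: pip install -U 'transformers>=4.37.0'"
)
```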
- Quickstart:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"

# Load the model in its native precision and spread it across available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt},
]

# Render the chat into the model's prompt format, appending the assistant header.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)
# Strip the prompt tokens so only the newly generated reply is decoded.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
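For interactive use, the same setup can print tokens as they are generated; a minimal sketch reusing model, tokenizer, and model_inputs from the quickstart above together with transformers' built-in TextStreamer:

```python
from transformers import TextStreamer

# Stream decoded tokens to stdout as they are produced,
# omitting the prompt and special tokens from the printed output.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**model_inputs, max_new_tokens=512, streamer=streamer)
```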
- Processing Long Texts: For contexts longer than 32,768 tokens, enable YaRN by adding the following to config.json:

```json
{
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
```
- Deployment: Use vLLM for deployment with long-text support. Adjust rope_scaling only when long contexts are actually needed: the scaling is static, applied regardless of input length, and can affect performance on short texts (see the example below).
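As an illustration, vLLM exposes an OpenAI-compatible HTTP server (started, for example, with vllm serve Qwen/Qwen2.5-7B-Instruct); a minimal sketch of querying it with the openai Python client, assuming the default localhost:8000 endpoint:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server; the API key is unused by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": "Give me a short introduction to large language model."},
    ],
)
print(completion.choices[0].message.content)
```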
- Cloud GPUs: Consider cloud providers such as AWS, Azure, or Google Cloud for access to powerful GPUs for model training and inference.
License
The Qwen2.5-7B-Instruct model is licensed under the Apache-2.0 License. More details can be found in the LICENSE file.