Qwen2.5-7B-Instruct
Introduction
Qwen2.5 is the latest generation of the Qwen series of large language models, offering base and instruction-tuned variants from 0.5B to 72B parameters. Compared with its predecessor, it brings substantial gains in knowledge, coding, and mathematics, along with improved instruction following, long-text generation, structured-data understanding, and multilingual support covering more than 29 languages. This repository contains the instruction-tuned 7B model, a causal language model.
Architecture
- Type: Causal Language Models
- Training Stages: Pretraining & Post-training
- Architecture: Transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
- Parameters: Total 7.61B (Non-Embedding 6.53B)
- Layers: 28
- Attention Heads: 28 for Q and 4 for KV
- Context Length: Up to 131,072 tokens; can generate up to 8,192 tokens
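These hyperparameters can be verified without downloading the weights; a minimal sketch using transformers' AutoConfig (attribute names follow the standard Qwen2 configuration schema):

```python
from transformers import AutoConfig

# Fetches only the configuration file, not the model weights.
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

print(config.num_hidden_layers)        # 28 transformer layers
print(config.num_attention_heads)      # 28 query heads
print(config.num_key_value_heads)      # 4 key/value heads (grouped-query attention)
print(config.max_position_embeddings)  # maximum supported context length
```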
Training
The model was pre-trained on a large-scale corpus and then post-trained to strengthen its capabilities in domains such as coding and mathematics, and to improve instruction following. Contexts longer than 32,768 tokens are supported at inference time through RoPE-scaling techniques such as YaRN (see Processing Long Texts below).
Guide: Running Locally
- Requirements: Use a recent Hugging Face Transformers library; versions below 4.37.0 do not include Qwen2 support and will fail to load the model with KeyError: 'qwen2'. A quick version check is sketched below.
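A minimal sanity check before running the quickstart (the packaging module ships as a dependency of transformers):

```python
import transformers
from packaging import version

# Qwen2 architecture support was added in transformers 4.37.0.
assert version.parse(transformers.__version__) >= version.parse("4.37.0"), (
    "transformers is too old; upgrade with: pip install -U 'transformers>=4.37.0'"
)
```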
- Quickstart:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"

# Load the model in its native precision and spread it across available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt},
]

# Render the chat into the model's prompt format, appending the assistant header.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)
# Strip the prompt tokens so only the newly generated reply is decoded.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
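For interactive use, the same setup can print tokens as they are generated; a minimal sketch reusing model, tokenizer, and model_inputs from the quickstart above together with transformers' built-in TextStreamer:

```python
from transformers import TextStreamer

# Stream decoded tokens to stdout as they are produced,
# omitting the prompt and special tokens from the printed output.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**model_inputs, max_new_tokens=512, streamer=streamer)
```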
- Processing Long Texts: For contexts longer than 32,768 tokens, enable YaRN by adding the following to config.json:

```json
{
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
```
- Deployment: Use vLLM for deployment with long-text support. Adjust rope_scaling only when long contexts are actually needed: the scaling is static, applied regardless of input length, and can affect performance on short texts (see the example below).
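As an illustration, vLLM exposes an OpenAI-compatible HTTP server (started, for example, with vllm serve Qwen/Qwen2.5-7B-Instruct); a minimal sketch of querying it with the openai Python client, assuming the default localhost:8000 endpoint:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server; the API key is unused by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": "Give me a short introduction to large language model."},
    ],
)
print(completion.choices[0].message.content)
```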
- Cloud GPUs: Consider cloud providers such as AWS, Azure, or Google Cloud for access to powerful GPUs for model training and inference.
License
The Qwen2.5-7B-Instruct model is licensed under the Apache-2.0 License. More details can be found in the LICENSE file.