Qwen2.5-7B-Instruct

Introduction

Qwen2.5 is the latest series of Qwen large language models, comprising base and instruction-tuned variants ranging from 0.5 to 72 billion parameters. Compared with its predecessor, it brings substantially more knowledge and stronger coding and mathematical capabilities, along with improved instruction following, long-text generation, structured-data understanding, and multilingual support covering over 29 languages. This page describes the instruction-tuned 7B model, Qwen2.5-7B-Instruct.

Architecture

  • Type: Causal Language Models
  • Training Stages: Pretraining & Post-training
  • Architecture: Transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
  • Parameters: Total 7.61B (Non-Embedding 6.53B)
  • Layers: 28
  • Attention Heads (GQA): 28 for Q and 4 for KV
  • Context Length: Up to 131,072 tokens; can generate up to 8,192 tokens
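
A quick way to sanity-check these numbers is to read the published config from the Hub; this is a minimal sketch assuming the transformers library is installed. Note that the shipped config advertises a 32,768-token window, and reaching the full 131,072 tokens requires the YaRN setting described in "Processing Long Texts" below.

    from transformers import AutoConfig

    # Fetch the model's config.json from the Hugging Face Hub.
    config = AutoConfig.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
    print(config.num_hidden_layers)        # 28 layers
    print(config.num_attention_heads)      # 28 query heads
    print(config.num_key_value_heads)      # 4 key/value heads (GQA)
    print(config.max_position_embeddings)  # 32768 as shipped; extended via YaRN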

Training

The model was both pre-trained and post-trained to strengthen its capabilities across domains such as coding and mathematics, including targeted tuning for instruction following. Contexts longer than the 32,768-token pretraining length are handled at inference time via YaRN rope scaling rather than additional fine-tuning (see the long-text section below).

Guide: Running Locally

  1. Requirements: Use an up-to-date Hugging Face Transformers library; versions below 4.37.0 do not recognize the architecture and fail with KeyError: 'qwen2'. A quick version check is sketched below.
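    # Optional version check (a sketch; the packaging module ships as a
    # dependency of transformers):
    import transformers
    from packaging.version import Version
    assert Version(transformers.__version__) >= Version("4.37.0"), (
        "transformers >= 4.37.0 is required; run: pip install -U transformers"
    )
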
  2. Quickstart:
    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    model_name = "Qwen/Qwen2.5-7B-Instruct"
    
    # Load the weights and tokenizer; device_map="auto" spreads layers across
    # available devices, and torch_dtype="auto" picks the checkpoint dtype.
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype="auto",
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    prompt = "Give me a short introduction to large language models."
    messages = [
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    # Render the chat template into a single prompt string (with the
    # generation prompt appended), then tokenize it.
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=512
    )
    # model.generate returns prompt + completion; strip the prompt tokens so
    # only the newly generated response is decoded.
    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(response)
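    The default sampling settings (temperature, top_p, and so on) come from the model's generation_config.json and can be overridden per call, for example model.generate(**model_inputs, max_new_tokens=512, temperature=0.7, top_p=0.8).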
    
  3. Processing Long Texts: For contexts longer than 32,768 tokens, enable YaRN in config.json:
    {
      "rope_scaling": {
        "factor": 4.0,
        "original_max_position_embeddings": 32768,
        "type": "yarn"
      }
    }
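
    Alternatively, the same values can be passed at load time instead of editing config.json. This is a sketch that relies on from_pretrained forwarding configuration keyword arguments such as rope_scaling to the model config:

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype="auto",
        device_map="auto",
        rope_scaling={
            "factor": 4.0,
            "original_max_position_embeddings": 32768,
            "type": "yarn"
        }
    )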
    
  4. Deployment: For serving, vLLM is recommended and supports long-context inference. Because current implementations use static YaRN (a constant scaling factor regardless of input length), it can degrade performance on short texts, so add the rope_scaling configuration only when long contexts are actually needed. A minimal vLLM sketch follows this list.
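
A minimal offline-inference sketch with vLLM (assuming vLLM is installed; the prompt is rendered with the same chat template as in the quickstart):

    from transformers import AutoTokenizer
    from vllm import LLM, SamplingParams

    model_name = "Qwen/Qwen2.5-7B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    llm = LLM(model=model_name)

    # Build the chat-formatted prompt, then sample one completion.
    messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    outputs = llm.generate([prompt], SamplingParams(max_tokens=512))
    print(outputs[0].outputs[0].text)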

Cloud GPUs: Consider using cloud providers like AWS, Azure, or Google Cloud for access to powerful GPUs for model training and inference.

License

The Qwen2.5-7B-Instruct model is licensed under the Apache-2.0 License. More details can be found in the LICENSE file.
