Qwen 2.5 0.5 B Instruct 4bit

CaioXapelaum

Introduction

Qwen2.5 is the latest series in the Qwen family of large language models. Compared with its predecessors, it improves knowledge, coding, mathematics, instruction following, long-text generation, and structured-data understanding, and it supports more than 29 languages. Qwen2.5-0.5B-Instruct is the instruction-tuned model in the series with 0.5 billion parameters, intended for applications such as chatbots and content generation.

Architecture

The Qwen2.5-0.5B model is a causal language model with:

  • 0.49 billion total parameters (0.36 billion non-embedding).
  • 24 layers with grouped-query attention: 14 query (Q) heads and 2 key/value (KV) heads.
  • Transformer architecture incorporating RoPE, SwiGLU, RMSNorm, attention QKV bias, and tied word embeddings.
  • Context length support of 32,768 tokens, with generation up to 8,192 tokens.
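
As a rough illustration of what these figures imply, the sketch below estimates the key/value cache size at the full 32,768-token context, once with the 2 KV heads listed above and once as if each of the 14 query heads had its own KV head. The head dimension of 64 and the 2-byte (16-bit) cache element size are assumptions that are not stated in this document, and `kv_cache_bytes` is a hypothetical helper name.

```python
# Back-of-the-envelope KV-cache estimate from the architecture figures above.
# ASSUMPTIONS (not stated in this document): head_dim = 64 and a 16-bit
# (2-byte) cache dtype. kv_cache_bytes is a hypothetical helper name.

NUM_LAYERS = 24
NUM_Q_HEADS = 14
NUM_KV_HEADS = 2
CONTEXT_LEN = 32_768
HEAD_DIM = 64          # assumption
BYTES_PER_VALUE = 2    # assumption: fp16/bf16 cache


def kv_cache_bytes(num_kv_heads: int, seq_len: int) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # The leading 2 accounts for storing both keys and values in every layer.
    per_token = 2 * NUM_LAYERS * num_kv_heads * HEAD_DIM * BYTES_PER_VALUE
    return per_token * seq_len


gqa = kv_cache_bytes(NUM_KV_HEADS, CONTEXT_LEN)
mha = kv_cache_bytes(NUM_Q_HEADS, CONTEXT_LEN)  # if every Q head had its own KV head

print(f"query heads per KV head: {NUM_Q_HEADS // NUM_KV_HEADS}")
print(f"KV cache with 2 KV heads:  {gqa / 2**20:.0f} MiB")
print(f"KV cache with 14 KV heads: {mha / 2**20:.0f} MiB")
```

Under these assumptions, sharing 2 KV heads among 14 query heads cuts the cache to one seventh of what a 14-head KV layout would need.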

Training

The model undergoes both pretraining and post-training phases to enhance its performance in various tasks, leveraging expert models for domains like coding and mathematics. It has been instruction-tuned to improve its ability to follow prompts and generate structured outputs.

Guide: Running Locally

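  1. Requirements: Use a recent version of Hugging Face Transformers to avoid compatibility issues. The loading code below also needs PyTorch, and device_map="auto" relies on the Accelerate library.
  2. Installation: Install the libraries using pip:
    pip install --upgrade transformers torch accelerate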
    
  3. Model and Tokenizer Loading:
    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    model_name = "Qwen/Qwen2.5-0.5B-Instruct"
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
  4. Prompt and Generation:
    prompt = "Give me a short introduction to large language model."
    messages = [
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    generated_ids = model.generate(**model_inputs, max_new_tokens=512)
    # Drop the prompt tokens so only the newly generated reply is decoded.
    generated_ids = [out[len(inp):] for inp, out in zip(model_inputs.input_ids, generated_ids)]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    
  5. Hardware Requirements: With 4-bit quantization, the model's memory footprint is small enough to run on entry-level GPUs such as the NVIDIA MX150. For faster inference, consider cloud GPUs such as the NVIDIA V100 or A100.
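
To make the hardware note more concrete, here is a minimal sketch of the weight-memory arithmetic, using the 0.49 billion parameter count from the Architecture section. It counts weights only: quantization metadata (scales, zero points), activations, and the KV cache add overhead that is not modeled here. The function name `weight_memory_gib` is hypothetical.

```python
# Rough weight-only memory estimate at different precisions.
# ASSUMPTION: every parameter is stored at the given bit width; real 4-bit
# formats keep some tensors in higher precision and add per-group scales.

TOTAL_PARAMS = 0.49e9  # from the Architecture section


def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the weights in GiB."""
    return num_params * bits_per_param / 8 / 2**30


for label, bits in [("fp32", 32), ("bf16/fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>9}: {weight_memory_gib(TOTAL_PARAMS, bits):.2f} GiB")
```

By this estimate, 4-bit weights take about a quarter of the space of a 16-bit checkpoint, before any quantization overhead is counted.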

License

The Qwen2.5-0.5B-Instruct model is distributed under the Apache 2.0 license. For details, refer to the license file in the model repository.
