Qwen2.5-Coder-32B-Instruct-GPTQ-Int4
Introduction
Qwen2.5-Coder is an advanced series of code-specific large language models, formerly known as CodeQwen. The series features six model sizes ranging from 0.5 to 32 billion parameters to meet different developer needs. Trained on 5.5 trillion tokens, it offers significant improvements in code generation, code reasoning, and code fixing. The 32B model is state-of-the-art among open-source code language models, with coding abilities comparable to GPT-4o. It supports long-context processing up to 128K tokens and is designed for real-world applications, including Code Agents.
Architecture
Qwen2.5-Coder-32B uses a transformer architecture with several advanced features (a quick way to verify these figures is sketched after the list):
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Parameters: 32.5 billion (31.0 billion excluding embeddings)
- Layers: 64
- Attention Heads (GQA): 40 for Q and 8 for KV
- Context Length: Supports up to 131,072 tokens
- Quantization: GPTQ 4-bit
- Key Technologies: RoPE, SwiGLU, RMSNorm, and Attention QKV bias
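As a sanity check, most of these figures can be read back from the checkpoint's configuration without downloading the weights. A minimal sketch, assuming the field names of the Qwen2 configuration class in transformers:

```python
from transformers import AutoConfig

# Fetch only the config, not the model weights
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int4")

print(config.num_hidden_layers)    # expected: 64
print(config.num_attention_heads)  # expected: 40 (query heads)
print(config.num_key_value_heads)  # expected: 8 (KV heads under GQA)
```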
Training
The model was trained on a large dataset comprising source code, text-code grounding data, and synthetic data. Training improved its handling of real-world applications while preserving its strengths in mathematics and general capabilities.
Guide: Running Locally
To run Qwen2.5-Coder-32B locally, follow these steps:
- Install Requirements: Ensure you have a recent version of Hugging Face's transformers library; versions below 4.37.0 may cause compatibility issues. Loading a GPTQ checkpoint typically also requires a GPTQ backend such as auto-gptq together with optimum; a quick environment check is sketched below.
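A minimal environment check before loading the model; the 4.37.0 threshold comes from the note above, and the package names in the loop are an assumption about which GPTQ backend is installed:

```python
from importlib.metadata import version, PackageNotFoundError
from packaging.version import Version

# transformers below 4.37.0 may fail to load Qwen2.5 checkpoints
assert Version(version("transformers")) >= Version("4.37.0"), \
    "Upgrade with: pip install -U transformers"

# Assumption: a GPTQ backend plus optimum is present for GPTQ inference
for pkg in ("optimum", "auto-gptq"):
    try:
        print(f"{pkg} {version(pkg)} found")
    except PackageNotFoundError:
        print(f"{pkg} missing; install it before loading the GPTQ checkpoint")
```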
- Load the Model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int4"

# device_map="auto" places layers across available GPUs;
# torch_dtype="auto" picks the dtype recorded in the checkpoint config
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
- Process Input and Generate Text:

```python
prompt = "write a quick sort algorithm."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

# Render the chat turns into the model's expected prompt format
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)

# Decode only the newly generated tokens, not the echoed prompt
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
- Cloud GPUs: For optimal performance with a model of this size, consider cloud GPU services such as AWS, Google Cloud, or Azure; a rough memory estimate is sketched below.
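When sizing a GPU, a back-of-the-envelope estimate from the architecture figures above (the 128 head dimension is an assumption derived from typical Qwen2.5-32B configurations, and real usage adds framework overhead):

```python
# Weights: 32.5e9 parameters at 4 bits (0.5 bytes) each
weight_gb = 32.5e9 * 0.5 / 1e9            # ~16.3 GB

# fp16 KV cache per token: layers * KV heads * head_dim * 2 (K and V) * 2 bytes
kv_per_token = 64 * 8 * 128 * 2 * 2       # 262,144 bytes
kv_gb_32k = kv_per_token * 32_768 / 1e9   # ~8.6 GB at a 32K-token context

print(f"weights ~= {weight_gb:.1f} GB, KV cache @ 32K tokens ~= {kv_gb_32k:.1f} GB")
```

By this estimate, a single 24 GB card gets tight as the context grows, so a 40 GB-class GPU is a more comfortable target for long-context work.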
License
Qwen2.5-Coder-32B-Instruct-GPTQ-Int4 is licensed under the Apache-2.0 License. More details can be found in the license file.