Qwen2.5-72B-Instruct-GPTQ-Int4
Introduction
Qwen2.5 is part of the latest series of Qwen large language models. It includes several base and instruction-tuned models ranging from 0.5 to 72 billion parameters. Enhancements in Qwen2.5 include improved knowledge in coding and mathematics, better instruction following, support for generating long texts, understanding structured data, and multilingual support for over 29 languages. This repository provides a GPTQ-quantized 4-bit instruction-tuned version of the 72B Qwen2.5 model.
Architecture
The 72B Qwen2.5 model features the following architectural specifications:
- Type: causal language model
- Architecture: transformer with RoPE, SwiGLU, RMSNorm, and attention QKV bias
- 72.7 billion parameters in total, with 70.0 billion non-embedding parameters
- 80 layers
- 64 attention heads for queries and 8 for keys/values (grouped-query attention)
- Supports a full context length of 131,072 tokens and can generate up to 8,192 tokens
- Quantized using GPTQ 4-bit
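
A back-of-the-envelope sketch of why the 64/8 head split (grouped-query attention) matters for serving: it cuts the KV cache to one eighth of the plain multi-head equivalent. The head dimension of 128 below is an assumption (it is not stated in this card), so treat the numbers as illustrative.

```python
# Rough KV-cache estimate per generated token under grouped-query attention.
# head_dim = 128 is an assumption, not stated in this card.
layers = 80
kv_heads = 8          # vs. 64 query heads
head_dim = 128        # assumed
bytes_per_elem = 2    # fp16/bf16 cache entries

# K and V are each cached per layer, per KV head, per token.
kv_cache_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
print(f"{kv_cache_per_token / 1024:.0f} KiB per token")        # ~320 KiB

# With 64 KV heads (plain multi-head attention) it would be 8x larger.
print(f"{kv_cache_per_token * 8 / 1024:.0f} KiB per token")    # ~2560 KiB
```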
Training
The model undergoes both pretraining and post-training stages. It has been enhanced with specialized expert models for improved performance in coding and mathematics. The training process includes instruction tuning to improve its ability to follow instructions and generate structured outputs.
Guide: Running Locally
- Requirements: Ensure you have the latest version of Hugging Face Transformers installed, as versions below 4.37.0 do not recognize the Qwen2 architecture and will fail with `KeyError: 'qwen2'`.
- Quickstart Code (a complete generation example follows this list):

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model_name = "Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4"

  # device_map="auto" spreads the weights across the available GPUs;
  # torch_dtype="auto" uses the dtype recorded in the checkpoint config.
  model = AutoModelForCausalLM.from_pretrained(
      model_name,
      torch_dtype="auto",
      device_map="auto"
  )
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  ```
- Processing Long Texts: To handle inputs longer than 32,768 tokens, enable the YaRN rope-scaling technique by adding the following to the model's `config.json` (a load-time alternative is sketched after this list):

  ```json
  {
    "rope_scaling": {
      "factor": 4.0,
      "original_max_position_embeddings": 32768,
      "type": "yarn"
    }
  }
  ```
- Deployment: For serving, and especially for processing long inputs, consider vLLM, which supports GPTQ-quantized checkpoints and tensor parallelism across multiple GPUs (a minimal sketch follows this list).
- Cloud GPUs: Even at 4-bit, the weights alone occupy roughly 36 GB (72.7B parameters × 4 bits), so running on a local machine is resource-intensive. Consider using cloud GPUs from platforms like AWS, Google Cloud, or Azure.
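
Continuing the Quickstart above, the following is a minimal sketch of a single chat turn using the standard Transformers chat-template flow; `model` and `tokenizer` are the objects loaded in the Quickstart, and the prompt and `max_new_tokens` value are illustrative.

```python
# Continues from the Quickstart: `model` and `tokenizer` are already loaded.
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]

# Render the conversation with the model's built-in chat template.
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=512)
# Drop the prompt tokens so only the new completion is decoded.
output_ids = generated_ids[0][model_inputs.input_ids.shape[1]:]
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```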
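
If editing `config.json` on disk is inconvenient, the same YaRN settings can be attached to the config at load time. This is a sketch assuming standard Transformers config handling; note that this enables static scaling for all inputs, so it is advisable to turn it on only when long-context processing is actually needed.

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4"

# Attach the YaRN rope-scaling settings without editing config.json.
config = AutoConfig.from_pretrained(model_name)
config.rope_scaling = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    torch_dtype="auto",
    device_map="auto",
)
```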
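
For the deployment step, here is a minimal offline-inference sketch using vLLM's Python API; the `tensor_parallel_size` value is an assumption that should match your GPU count, and the sampling settings are illustrative.

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size=4 is an assumption; set it to the number of GPUs available.
llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4",
    tensor_parallel_size=4,
)
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=512)

# llm.generate() takes raw prompt strings; apply the chat template yourself
# (e.g., with the Hugging Face tokenizer) for multi-turn conversations.
outputs = llm.generate(
    ["Give me a short introduction to large language models."],
    sampling_params,
)
print(outputs[0].outputs[0].text)
```

Recent vLLM versions also ship an OpenAI-compatible server (`vllm serve <model>`), which is typically the more convenient route for production deployments.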
License
The model is released under the Qwen license. For detailed terms and conditions, refer to the LICENSE file in the model repository.