Qwen2-72B-Instruct
Qwen/Qwen2-72B-Instruct
Introduction
Qwen2 is a series of large language models with sizes ranging from 0.5 to 72 billion parameters, including a Mixture-of-Experts model. Qwen2 models are designed to outperform open-source models and compete with proprietary models across various benchmarks in language understanding, generation, multilingual capabilities, coding, mathematics, and reasoning. The Qwen2-72B-Instruct model supports a context length of up to 131,072 tokens, allowing for extensive input processing.
Architecture
Qwen2 models are based on the Transformer architecture, incorporating features such as SwiGLU activation, attention QKV bias, and grouped-query attention (GQA). The models include an improved tokenizer that is adaptive to multiple natural languages and code. Each model size is released as a base language model and an aligned instruction-tuned chat model.
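These architectural choices can be verified directly from the released configuration; a minimal sketch (the exact printed values depend on the checkpoint):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2-72B-Instruct")

# Grouped-query attention: fewer key/value heads than query heads
print(config.num_attention_heads, config.num_key_value_heads)

# SwiGLU-family activation used in the MLP blocks
print(config.hidden_act)

# Maximum context length the checkpoint is configured for
print(config.max_position_embeddings)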
Training
The Qwen2 models underwent pretraining with a large dataset, followed by post-training with supervised fine-tuning and direct preference optimization.
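The model card does not tie post-training to a specific implementation; purely as an illustrative sketch, the direct preference optimization objective used in such pipelines can be written as follows (all tensor names here are hypothetical):

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards: log-prob margin of the policy over a frozen reference
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the probability that the chosen response beats the rejected one
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()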
Guide: Running Locally
Requirements
Ensure you have transformers>=4.37.0 installed; older versions do not recognize the qwen2 model type and fail with KeyError: 'qwen2'.
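A quick runtime check you can add before loading the model (a sketch; packaging is a standard PyPI dependency of transformers):

from packaging import version
import transformers

if version.parse(transformers.__version__) < version.parse("4.37.0"):
    raise RuntimeError(
        f"transformers {transformers.__version__} is too old for Qwen2; "
        "upgrade with: pip install -U 'transformers>=4.37.0'"
    )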
Quickstart
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-72B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-72B-Instruct")

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)
# Strip the prompt tokens so only the newly generated text is decoded
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
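For interactive use you can also stream tokens to stdout as they are generated; a minimal sketch reusing the model, tokenizer, and inputs from the quickstart above:

from transformers import TextStreamer

# Prints decoded tokens as soon as they are generated
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(model_inputs.input_ids, max_new_tokens=512, streamer=streamer)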
Processing Long Texts
For inputs exceeding 32,768 tokens, use YaRN, a RoPE-scaling technique that enhances model length extrapolation; with a scaling factor of 4.0, the original 32,768-token window extends to the full 131,072-token context.
- Install vLLM:
pip install "vllm>=0.4.3"
- Configure Model Settings: Modify config.json to include the following snippet:
{
    "architectures": ["Qwen2ForCausalLM"],
    "vocab_size": 152064,
    "rope_scaling": {
        "factor": 4.0,
        "original_max_position_embeddings": 32768,
        "type": "yarn"
    }
}
- Deploy the Model: Use vLLM to serve an OpenAI-compatible API:
python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-72B-Instruct --model path/to/weights
Access the Chat API using:
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen2-72B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Your Long Input Here."}
        ]
    }'
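Equivalently, once the server is running you can call it from Python with the openai client (a sketch, assuming openai>=1.0 is installed; the api_key value is a placeholder, since vLLM does not check it unless configured to):

from openai import OpenAI

# Point the client at the local vLLM server instead of api.openai.com
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen2-72B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Your Long Input Here."},
    ],
)
print(completion.choices[0].message.content)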
Cloud GPUs
Running a 72-billion-parameter model requires substantial GPU memory, typically multiple high-memory accelerators for bf16 inference; consider cloud GPUs such as those offered by AWS, Google Cloud, or Azure.
License
This model is licensed under the Tongyi-Qianwen license. For more information, view the license document.