Qwen2-72B-Instruct


Introduction

Qwen2 is a series of large language models with sizes ranging from 0.5 to 72 billion parameters, including a Mixture-of-Experts model. The Qwen2 models are designed to outperform most open-source models and to be competitive with proprietary models across benchmarks for language understanding, generation, multilingual capability, coding, mathematics, and reasoning. Qwen2-72B-Instruct supports a context length of up to 131,072 tokens, allowing it to process very long inputs.

Architecture

Qwen2 models are based on the Transformer architecture, incorporating features such as SwiGLU activation, attention QKV bias, and grouped query attention. They also use an improved tokenizer that adapts to multiple natural languages and code. For each model size, both a base language model and an aligned instruction-tuned (chat) model are released.
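
These architectural settings can be inspected directly from the published checkpoint's configuration. The sketch below is illustrative and assumes network access to the Hugging Face Hub; the field names follow the transformers Qwen2 config.

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2-72B-Instruct")
print(config.hidden_act)              # activation used in the gated (SwiGLU-style) MLP
print(config.num_attention_heads)     # number of query heads
print(config.num_key_value_heads)     # fewer KV heads => grouped query attention
print(config.max_position_embeddings) # maximum supported context length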

Training

The Qwen2 models underwent pretraining with a large dataset, followed by post-training with supervised fine-tuning and direct preference optimization.

Guide: Running Locally

Requirements

Ensure you have transformers>=4.37.0 installed; older versions do not recognize the qwen2 model type and raise KeyError: 'qwen2' when loading the checkpoint.
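
A quick way to confirm the installed version before loading the model (a minimal check; any package manager works):

from packaging import version
import transformers

# Versions older than 4.37.0 do not know the "qwen2" model type and fail
# with KeyError: 'qwen2' when loading the checkpoint.
assert version.parse(transformers.__version__) >= version.parse("4.37.0")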

Quickstart

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-72B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-72B-Instruct")

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)
# Strip the prompt tokens so only the newly generated response is decoded
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
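
For interactive use, the same objects can also stream tokens as they are generated. This is a minimal sketch reusing the Quickstart variables; the streamer and generation arguments are illustrative and not prescribed by the model card.

from transformers import TextStreamer

# Prints decoded tokens to stdout as they are produced, skipping the prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(model_inputs.input_ids, max_new_tokens=512, streamer=streamer)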

Processing Long Texts

For inputs exceeding 32,768 tokens, use YaRN, a technique for enhancing model length extrapolation on long texts.

  1. Install vLLM:

    pip install "vllm>=0.4.3"
    
  2. Configure Model Settings: Add the rope_scaling block to the model's config.json; the architectures and vocab_size fields shown below are already present and are included only for context:

    {
        "architectures": ["Qwen2ForCausalLM"],
        "vocab_size": 152064,
        "rope_scaling": {
            "factor": 4.0,
            "original_max_position_embeddings": 32768,
            "type": "yarn"
        }
    }
    
  3. Model Deployment: Use vLLM to deploy the model.

    python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-72B-Instruct --model path/to/weights
    

    Access the Chat API using:

    curl http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
        "model": "Qwen2-72B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Your Long Input Here."}
        ]
        }'
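
    Alternatively, the same endpoint can be called from Python with the openai client. This is a minimal sketch assuming the server above is running on the default port 8000 and openai>=1.0 is installed:

    from openai import OpenAI

    # vLLM's OpenAI-compatible server does not check the API key by default,
    # so any placeholder value works here.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    completion = client.chat.completions.create(
        model="Qwen2-72B-Instruct",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Your Long Input Here."},
        ],
    )
    print(completion.choices[0].message.content)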
    

Cloud GPUs

A 72B-parameter model requires substantial GPU memory; consider using cloud GPUs such as those offered by AWS, Google Cloud, or Azure.

License

This model is licensed under the Tongyi-Qianwen license. For more information, view the license document.
