GLM-4-9B-Chat
Maintained by THUDM

Introduction
GLM-4-9B-Chat is an open-source model from the GLM-4 series developed by Zhipu AI. It demonstrates high performance on various datasets involving semantics, mathematics, reasoning, code, and knowledge tasks. The model supports multi-round dialogue, web browsing, code execution, custom tool calls, and long-text reasoning with up to 128K context. Additionally, it supports 26 languages, including Japanese, Korean, and German.
Architecture
GLM-4-9B-Chat is built on the GLM-4 series architecture, focusing on comprehensive language understanding and generation. It incorporates long-text processing, multilingual support, and tool calling (Function Call). The base chat model handles contexts of up to 128K tokens, and a separate long-context variant, GLM-4-9B-Chat-1M, extends this to 1 million tokens.
Training
GLM-4-9B-Chat was evaluated on multiple benchmarks, showing strong results on tasks such as MMLU, GSM8K, and HumanEval. On multilingual datasets it outperformed models such as Llama-3-8B-Instruct in most cases, and its tool-calling capability was validated on the Berkeley Function Calling Leaderboard, where it achieved high accuracy and relevance scores.
Guide: Running Locally
To run GLM-4-9B-Chat locally, follow these steps:
- Install Dependencies:
  - Ensure you have Python installed.
  - Install the required packages specified in the requirements.txt file.
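  Before moving on, you can optionally confirm that the core packages import cleanly. The following is a minimal sketch, not part of the official instructions; the package names checked here (torch, transformers, tiktoken) are assumptions based on typical GLM-4 requirements, so treat requirements.txt as the authoritative list.

  import importlib

  # Optional sanity check; the package list below is an assumption, the
  # authoritative list lives in the repository's requirements.txt.
  for pkg in ("torch", "transformers", "tiktoken"):
      try:
          mod = importlib.import_module(pkg)
          print(f"{pkg}: {getattr(mod, '__version__', 'unknown version')}")
      except ImportError:
          print(f"{pkg} is not installed")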
- Setup Environment:
  - Use a CUDA-enabled GPU for optimal performance.
  - Consider cloud GPUs such as AWS EC2 GPU instances, Google Cloud GPUs, or Azure for additional processing power.
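  As a quick environment check, the sketch below verifies that PyTorch can see a CUDA device before you load the 9B-parameter weights. It assumes torch is already installed and is not part of the official setup steps.

  import torch

  # Confirm a CUDA-capable GPU is visible; the 9B model in bfloat16 needs
  # on the order of 20 GB of GPU memory (rough estimate).
  if torch.cuda.is_available():
      props = torch.cuda.get_device_properties(0)
      print(f"GPU: {props.name}, {props.total_memory / 1e9:.1f} GB memory")
  else:
      print("No CUDA GPU detected; inference on CPU will be very slow.")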
- Run Model with Transformers:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

# Load the tokenizer; trust_remote_code is required for GLM-4's custom code
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat", trust_remote_code=True)

query = "你好"  # "Hello" in Chinese
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": query}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
)
inputs = inputs.to(device)

# Load the model in bfloat16 and move it to the GPU
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b-chat",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to(device).eval()

gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    # Drop the prompt tokens so only the newly generated reply is decoded
    outputs = outputs[:, inputs["input_ids"].shape[1]:]
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
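Since GLM-4-9B-Chat supports multi-round dialogue, the sketch below continues the conversation by feeding the first reply back through the chat template. It reuses tokenizer, model, device, gen_kwargs, and outputs from the snippet above; the follow-up question is purely illustrative.

# Multi-round dialogue: append the assistant's reply and a new user turn,
# then run the whole history through the chat template again.
history = [
    {"role": "user", "content": query},
    {"role": "assistant", "content": tokenizer.decode(outputs[0], skip_special_tokens=True)},
    {"role": "user", "content": "Please answer in English this time."},
]
inputs = tokenizer.apply_chat_template(
    history,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(device)
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs["input_ids"].shape[1]:]
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))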
- Run Model with vLLM:
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# 131072 tokens matches the model's 128K context window; tp_size is the
# number of GPUs used for tensor parallelism
max_model_len, tp_size = 131072, 1
model_name = "THUDM/glm-4-9b-chat"
prompt = [{"role": "user", "content": "你好"}]  # "Hello" in Chinese

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
)
# IDs of GLM-4's end-of-turn special tokens, used to stop generation
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)

inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)
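Multi-round dialogue works the same way with vLLM: extend the message history and re-apply the chat template. This is a minimal sketch reusing llm, tokenizer, sampling_params, and outputs from the snippet above; the follow-up prompt is illustrative.

# Second round: feed the first reply back as assistant context, then ask a follow-up.
history = [
    {"role": "user", "content": "你好"},
    {"role": "assistant", "content": outputs[0].outputs[0].text},
    {"role": "user", "content": "What can you help me with?"},
]
follow_up = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
results = llm.generate(prompts=follow_up, sampling_params=sampling_params)
print(results[0].outputs[0].text)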
License
Use of the GLM-4 model weights is governed by the GLM-4 License; using the model requires adherence to that agreement.