GLM-4-9B-Chat

THUDM

Introduction

GLM-4-9B-Chat is an open-source chat model from the GLM-4 series developed by Zhipu AI. It performs strongly on benchmarks covering semantics, mathematics, reasoning, code, and knowledge tasks. The model supports multi-round dialogue, web browsing, code execution, custom tool calling (Function Call), and long-text reasoning with a context length of up to 128K tokens. It also supports 26 languages, including Japanese, Korean, and German.

Architecture

GLM-4-9B-Chat is built on the GLM-4 series architecture, with an emphasis on broad language understanding and generation. It incorporates long-text processing and multilingual support, and it integrates with modern tool-calling frameworks. The standard chat model handles contexts of up to 128K tokens, while the separate GLM-4-9B-Chat-1M variant extends the context window to 1 million tokens.

Evaluation

GLM-4-9B-Chat was evaluated on multiple benchmarks, including MMLU, GSM8K, and HumanEval, and it outperforms comparable open models such as Llama-3-8B-Instruct in most cases. It was also tested on multilingual datasets, and its tool-calling ability was validated on the Berkeley Function Calling Leaderboard, where it achieved high accuracy and relevance scores.

Guide: Running Locally

To run GLM-4-9B-Chat locally, follow these steps:

  1. Install Dependencies:

    • Ensure you have a recent Python 3 environment.
    • Install the packages listed in the requirements.txt file of the model's official repository (at minimum, recent versions of torch and transformers).
  2. Setup Environment:

    • Use a CUDA-enabled GPU for reasonable inference speed.
    • If no local GPU is available, cloud GPU instances (e.g. AWS EC2, Google Cloud, or Azure) are an option.
  3. Run Model with Transformers:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    device = "cuda"
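    # Note: "cuda" assumes an NVIDIA GPU is available; a fallback such as
    # device = "cuda" if torch.cuda.is_available() else "cpu"
    # also works, though CPU inference for a 9B-parameter model is very slow.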
    
    tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat", trust_remote_code=True)
    query = "你好"
    
    inputs = tokenizer.apply_chat_template([{"role": "user", "content": query}],
                                           add_generation_prompt=True,
                                           tokenize=True,
                                           return_tensors="pt",
                                           return_dict=True
                                           )
    
    inputs = inputs.to(device)
    model = AutoModelForCausalLM.from_pretrained(
        "THUDM/glm-4-9b-chat",
        torch_dtype=torch.bfloat16,
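        # bfloat16 halves memory vs. float32; as a rough estimate, ~9B parameters in
        # bf16 need on the order of 18-19 GB of GPU memory. bf16 is best supported on
        # Ampere-or-newer GPUs (torch.float16 is an option on older hardware).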
        low_cpu_mem_usage=True,
        trust_remote_code=True
    ).to(device).eval()
    
    gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}
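    # With top_k=1, sampling is effectively greedy decoding; raise top_k or add
    # top_p/temperature for more varied replies. max_length counts prompt + new tokens.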
    with torch.no_grad():
        outputs = model.generate(**inputs, **gen_kwargs)
        outputs = outputs[:, inputs['input_ids'].shape[1]:]
        print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    
  4. Run Model with vLLM:

    from transformers import AutoTokenizer
    from vllm import LLM, SamplingParams
    
    max_model_len, tp_size = 131072, 1
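    # 131072 tokens matches the model's 128K context window; tp_size is the number of
    # GPUs used for tensor parallelism. (The separate GLM-4-9B-Chat-1M variant would
    # need a much larger max_model_len and typically more GPUs.)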
    model_name = "THUDM/glm-4-9b-chat"
    prompt = [{"role": "user", "content": "你好"}]
    
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    llm = LLM(
        model=model_name,
        tensor_parallel_size=tp_size,
        max_model_len=max_model_len,
        trust_remote_code=True,
        enforce_eager=True
    )
    stop_token_ids = [151329, 151336, 151338]
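    # These ids come from the official GLM-4 example and correspond to the model's
    # special end-of-turn tokens (e.g. <|endoftext|>, <|user|>, <|observation|>).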
    sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)
    
    inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
    outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)
    
    print(outputs[0].outputs[0].text)
    
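Both snippets above produce a single reply. For multi-round dialogue, the conversation history is passed back through the chat template on every turn. The sketch below is a minimal illustration of that loop; it assumes the model, tokenizer, device, and gen_kwargs already defined in the Transformers example above.

    # Minimal multi-turn sketch: reuses model, tokenizer, device, gen_kwargs from above.
    history = [{"role": "user", "content": "你好"}]  # "Hello"

    inputs = tokenizer.apply_chat_template(history, add_generation_prompt=True,
                                           tokenize=True, return_tensors="pt",
                                           return_dict=True).to(device)
    with torch.no_grad():
        outputs = model.generate(**inputs, **gen_kwargs)
    reply = tokenizer.decode(outputs[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

    # Feed the assistant reply back as history, then ask a follow-up question.
    history.append({"role": "assistant", "content": reply})
    history.append({"role": "user", "content": "请再详细一点。"})  # "Please elaborate."

    inputs = tokenizer.apply_chat_template(history, add_generation_prompt=True,
                                           tokenize=True, return_tensors="pt",
                                           return_dict=True).to(device)
    with torch.no_grad():
        outputs = model.generate(**inputs, **gen_kwargs)
    print(tokenizer.decode(outputs[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))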

License

Use of the GLM-4 model weights is governed by the GLM-4 License; make sure your usage complies with its terms.
