Introduction

GLM-4V-9B is the latest open-source multimodal model in the GLM-4 series, developed by Zhipu AI. It supports bilingual (Chinese and English) dialogue at a high image resolution of 1120×1120 and, across a range of multimodal benchmarks, outperforms models such as GPT-4-turbo-2024-04-09 and Claude 3 Opus.

Architecture

GLM-4V-9B is a multimodal language model with visual understanding capabilities. It handles tasks such as perception and reasoning, text recognition, and chart understanding, and supports a context length of up to 8K tokens.
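
As a quick sanity check, the advertised 8K context length can usually be read off the checkpoint's configuration. This is a minimal sketch; the seq_length attribute name is an assumption based on ChatGLM-style config classes and may differ:

    from transformers import AutoConfig

    # The checkpoint ships custom configuration code, so trust_remote_code is required.
    config = AutoConfig.from_pretrained("THUDM/glm-4v-9b", trust_remote_code=True)
    # ChatGLM-style configs typically expose the maximum sequence length as seq_length;
    # getattr guards against a differently named attribute.
    print(getattr(config, "seq_length", "attribute not found"))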

Training

The model has been evaluated on a range of benchmarks covering comprehensive tasks in both English and Chinese, perception and reasoning, and text recognition, where it shows competitive results against other leading models. It also supports dialogue over high-resolution images.

Guide: Running Locally

To run the GLM-4V-9B model locally:

  1. Set Up Environment: Ensure your environment satisfies the dependencies listed in the repository's requirements.txt.
  2. Install Transformers: Install transformers version 4.44 or higher (a quick version check is sketched after this list).
  3. Load Model and Tokenizer:
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    device = "cuda"
    
    # The GLM-4V tokenizer and model ship custom code, so trust_remote_code is required.
    tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4v-9b", trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        "THUDM/glm-4v-9b",
        torch_dtype=torch.bfloat16,   # bf16 halves memory relative to fp32
        low_cpu_mem_usage=True,       # stream weights in rather than building a full CPU copy
        trust_remote_code=True
    ).to(device).eval()
    
  4. Run Inference (a streaming variant is sketched after this list):
    query = '描述这张图片'  # Chinese for "Describe this image"
    image = Image.open("your image").convert('RGB')
    # Build chat-formatted, tokenized inputs; the image is attached to the user turn.
    inputs = tokenizer.apply_chat_template([{"role": "user", "image": image, "content": query}],
                                           add_generation_prompt=True, tokenize=True, return_tensors="pt",
                                           return_dict=True).to(device)
    
    # top_k=1 makes sampling effectively greedy decoding.
    gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}
    with torch.no_grad():
        outputs = model.generate(**inputs, **gen_kwargs)
        # Strip the prompt tokens so only the newly generated reply is decoded.
        outputs = outputs[:, inputs['input_ids'].shape[1]:]
        print(tokenizer.decode(outputs[0]))
    
  5. Suggested Cloud GPUs: For good performance, use a cloud GPU with ample memory, such as an NVIDIA A100. Note that bfloat16 requires Ampere-or-newer hardware; on a V100, load the model in float16 instead (see the sketch after this list).
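
For step 2, a quick way to confirm that the installed transformers version meets the requirement is to compare it programmatically (a minimal sketch using the packaging helper that transformers itself depends on):

    from packaging import version
    import transformers

    # Fail fast if the installed transformers is older than the required 4.44.
    assert version.parse(transformers.__version__) >= version.parse("4.44"), \
        f"transformers {transformers.__version__} is too old; upgrade to >= 4.44"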
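
If the target GPU predates the Ampere generation (for example a V100) and therefore lacks native bfloat16 support, the model can be loaded in float16 instead. This is a sketch of the one-line change to step 3, reusing its imports and device, not an officially documented configuration:

    # Same call as step 3, but float16 for GPUs without bfloat16 support.
    model = AutoModelForCausalLM.from_pretrained(
        "THUDM/glm-4v-9b",
        torch_dtype=torch.float16,   # fp16 instead of bf16 on pre-Ampere hardware
        low_cpu_mem_usage=True,
        trust_remote_code=True
    ).to(device).eval()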
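
To watch tokens appear incrementally instead of waiting for the full completion in step 4, the TextStreamer utility from transformers can be passed to generate. A minimal sketch, reusing the tokenizer, model, inputs, and gen_kwargs defined above:

    from transformers import TextStreamer

    # Print tokens as they are generated; skip_prompt suppresses echoing the input.
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    with torch.no_grad():
        model.generate(**inputs, streamer=streamer, **gen_kwargs)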

License

Use of the GLM-4V-9B model weights is governed by the model's license agreement. Ensure compliance with its terms before use.
