THUDM

ChatGLM2-6B-INT4 Documentation

Introduction

ChatGLM2-6B is the second-generation version of the open-source bilingual (Chinese-English) chat model ChatGLM-6B. It retains the smooth conversation flow and low deployment threshold of its predecessor while adding stronger benchmark performance, a longer context length, and more efficient inference. The model is designed to be competitive with open-source models of similar size.

Architecture

ChatGLM2-6B is built on an upgraded base model trained with the hybrid objective function from GLM. The model was pre-trained on 1.4 trillion bilingual tokens and then trained for human preference alignment. It features:

  • Stronger Performance: Significant improvements over ChatGLM-6B on benchmarks such as MMLU, CEval, GSM8K, and BBH.
  • Longer Context: Context length extended from 2K to 32K tokens using FlashAttention, with training at an 8K context length.
  • More Efficient Inference: Multi-Query Attention improves inference speed by 42% and reduces GPU memory usage. With INT4 quantization, a GPU with 6 GB of memory can support longer dialogues.
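
The memory claims above can be made concrete with a little arithmetic. The sketch below estimates the key/value cache size with and without Multi-Query Attention, plus a rough INT4 weight footprint. The layer count, head counts, head dimension, parameter count, and sequence length are assumptions chosen for illustration, not values stated in this document; check them against the model's configuration before relying on the numbers.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The factor 2 accounts for storing both a key and a value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


def weight_bytes(num_params, bits_per_weight):
    """Rough weight footprint if every parameter used the same bit width."""
    return num_params * bits_per_weight // 8


# Assumed, illustrative configuration (not taken from this document).
NUM_LAYERS = 28
NUM_QUERY_HEADS = 32
NUM_KV_GROUPS = 2          # shared key/value heads under Multi-Query Attention
HEAD_DIM = 128
NUM_PARAMS = 6_000_000_000
SEQ_LEN = 8192

GIB = 1024 ** 3
mha = kv_cache_bytes(SEQ_LEN, NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM)
mqa = kv_cache_bytes(SEQ_LEN, NUM_LAYERS, NUM_KV_GROUPS, HEAD_DIM)

print(f"KV cache, one K/V head per query head: {mha / GIB:.2f} GiB")
print(f"KV cache, shared K/V heads:            {mqa / GIB:.2f} GiB")
print(f"FP16 weights: {weight_bytes(NUM_PARAMS, 16) / GIB:.1f} GiB")
print(f"INT4 weights: {weight_bytes(NUM_PARAMS, 4) / GIB:.1f} GiB")
```

Under these assumptions the shared-head cache is 16 times smaller (32 query heads versus 2 key/value groups), which is the kind of saving that lets a longer dialogue fit in the same GPU memory. The weight estimate ignores any layers kept at higher precision, so treat it as a lower bound.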

Training

The model was pre-trained on 1.4 trillion bilingual tokens and then aligned with human preferences. The resulting performance gains show up across the benchmarks listed above. FlashAttention supports the longer context length, and Multi-Query Attention improves inference efficiency.

Guide: Running Locally

To run the ChatGLM2-6B model locally, follow these steps:

  1. Install Dependencies:

    pip install protobuf transformers==4.30.2 cpm_kernels "torch>=2.0" gradio mdtex2html sentencepiece accelerate
    
  2. Load and Run the Model:

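    from transformers import AutoTokenizer, AutoModel
    
    # trust_remote_code=True lets transformers run the model code shipped with the checkpoint.
    tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b-int4", trust_remote_code=True)
    model = AutoModel.from_pretrained("THUDM/chatglm2-6b-int4", trust_remote_code=True).half().cuda()
    model = model.eval()  # switch to inference mode
    
    # "你好" means "Hello"; history=[] starts a new conversation.
    response, history = model.chat(tokenizer, "你好", history=[])
    print(response)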
    
  3. Cloud GPUs: For longer contexts or faster processing than local hardware allows, consider cloud GPU services such as AWS, Google Cloud, or Azure.

  4. Further Instructions: For more detailed usage instructions, including the CLI and web demos, see the ChatGLM2-6B GitHub repository.
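
The chat call in step 2 returns an updated history, which can be passed back in to continue a conversation. The sketch below wraps that pattern in a small helper. `run_dialogue` and `EchoModel` are hypothetical names introduced here for illustration; `EchoModel` only mimics the `chat(tokenizer, query, history=...)` call shape from step 2 so that the example runs without a GPU or a model download. The history format of (query, response) pairs is an assumption.

```python
def run_dialogue(model, tokenizer, prompts):
    """Send prompts in order, carrying the history between turns."""
    history = []
    responses = []
    for prompt in prompts:
        # chat returns the reply and the updated history, as in step 2.
        response, history = model.chat(tokenizer, prompt, history=history)
        responses.append(response)
    return responses, history


class EchoModel:
    """Hypothetical stand-in that mimics the chat call shape."""

    def chat(self, tokenizer, query, history=None):
        history = list(history or [])
        response = f"echo: {query}"
        history.append((query, response))
        return response, history


responses, history = run_dialogue(
    EchoModel(), tokenizer=None, prompts=["Hello", "What can you do?"]
)
print(responses)
print(len(history))
```

To use it with the real model, pass the `model` and `tokenizer` objects loaded in step 2 in place of `EchoModel()` and `None`.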

License

The code in this repository is open-sourced under the Apache-2.0 License. Usage of the ChatGLM2-6B model weights must adhere to the Model License.
