CogAgent-9B-20241220

THUDM

Introduction

The CogAgent-9B-20241220 model is a bilingual vision-language model (VLM) built on the GLM-4V-9B architecture. It is designed to improve GUI perception, inference prediction accuracy, action space completeness, and task generalizability. The model supports interaction in both Chinese and English, taking screenshots and natural-language instructions as input. This release aims to help researchers and developers advance GUI agent applications.

Architecture

CogAgent-9B-20241220 is built on GLM-4V-9B and improves upon it through targeted data collection, optimization, and multi-stage training strategies. It operates as an image-text-to-text model and is optimized for GUI agent capabilities. The model is also integrated into ZhipuAI's GLM-PC product.

Training

The model underwent extensive multi-stage training to improve GUI perception and task-execution accuracy. Training focused on strengthening its bilingual interaction capabilities and general task applicability. Note that the model does not support continuous (multi-turn) conversations, but it does accept a continuous execution history of prior actions in its prompt.

Guide: Running Locally

To run the model locally, refer to the GitHub repository for detailed examples and guidance on prompt concatenation, which is crucial for optimal performance.

Basic steps:

  1. Clone the repository from GitHub: https://github.com/THUDM/CogAgent.
  2. Set up the environment as per the instructions in the repository.
  3. Pay close attention to the prompt concatenation process to ensure correct model execution.

Because the model has 9 billion parameters, running it requires a GPU with substantial memory; cloud GPUs are a practical option when local hardware is insufficient. A minimal loading sketch follows below.
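
The snippet below is a minimal inference sketch, not the repository's official example. It assumes the Hugging Face transformers loading pattern used by the underlying GLM-4V-9B model (AutoTokenizer / AutoModelForCausalLM with trust_remote_code=True); the screenshot path and task string are hypothetical placeholders, and the real prompt must be concatenated exactly as described in the GitHub repository.

```python
# Minimal inference sketch (assumed GLM-4V-9B-style loading; consult the
# CogAgent repository for the exact prompt-concatenation template).
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "THUDM/cogagent-9b-20241220"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda").eval()

# Hypothetical inputs: a GUI screenshot plus a task description. The actual
# prompt must follow the repository's concatenation format (task, history,
# platform, output format), otherwise the model may not behave as intended.
image = Image.open("screenshot.png").convert("RGB")
query = "Task: open the settings menu"  # replace with the full concatenated prompt

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "image": image, "content": query}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)
    # Strip the prompt tokens and decode only the newly generated action text.
    outputs = outputs[:, inputs["input_ids"].shape[1]:]
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

If the repository's own inference scripts differ from this pattern, prefer those, since they handle the prompt template and any platform-specific action formats for you.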

License

Use of the model weights is governed by the Model License provided in the repository; please consult it before using or redistributing the weights.
