VisionReward-Image
THUDM/VisionReward-Image
Introduction
VisionReward is a comprehensive strategy for aligning visual generation models, covering both image and video generation, with human preferences. It takes a fine-grained, multi-dimensional approach: human preference is decomposed into multiple dimensions, each represented by a set of judgment questions, and the answers are linearly weighted and summed to produce an interpretable score. On video quality assessment, VisionReward analyzes dynamic video features, outperforms VideoScore by 17.2%, and excels at video preference prediction.
Architecture
VisionReward-Image decomposes human visual preferences into multiple dimensions, each probed by a series of judgment questions. The answers to these questions are then combined into a single coherent, interpretable score that guides the model toward alignment with human preferences.
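The decompose-and-combine scoring described above can be sketched as a weighted sum over binary judgment answers. The questions and weights below are illustrative placeholders, not the model's actual checklist:

```python
# Sketch of VisionReward's linear scoring scheme (illustrative only):
# each judgment question yields a yes/no answer, and the answers are
# linearly weighted and summed into one interpretable score.
# Question texts and weights here are made-up placeholders.

def vision_reward_score(answers: dict[str, bool], weights: dict[str, float]) -> float:
    """Linearly combine yes/no judgment answers into a preference score."""
    return sum(weights[q] * (1.0 if yes else 0.0) for q, yes in answers.items())

answers = {
    "Is the image free of artifacts?": True,
    "Does the image match the prompt?": True,
    "Is the lighting natural?": False,
}
weights = {
    "Is the image free of artifacts?": 0.5,
    "Does the image match the prompt?": 1.0,
    "Is the lighting natural?": 0.3,
}

print(vision_reward_score(answers, weights))  # 1.5
```

Because the score is a linear combination of human-interpretable answers, each dimension's contribution to the final preference score can be read off directly from its weight.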
Training
Training follows a structured approach that analyzes dynamic video features to improve video quality assessment and preference prediction. This systematic analysis allows VisionReward to outperform existing models.
Guide: Running Locally
Basic Steps
- Clone the Repository:
  Clone the VisionReward repository from GitHub.

      git clone https://github.com/THUDM/VisionReward
      cd VisionReward

- Install Dependencies:
  Ensure all Python package dependencies are installed.

      pip install -r requirements.txt

- Merge and Extract Checkpoint Files:
  Combine the split files into a .tar archive and extract it.

      cat ckpts/split_part_* > ckpts/visionreward_image.tar
      tar -xvf ckpts/visionreward_image.tar

- Run Inference:
  Execute the model inference script as described in the repository documentation.
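As a rough sketch of what an inference loop does with the extracted checkpoint, each judgment question is posed to the model and its free-text answer is mapped to a binary value before the weighted sum. The helper and the commented loading code below are assumptions for illustration; the repository's own inference script defines the real interface:

```python
# Hypothetical post-processing step (not the repository's actual script):
# map a generated yes/no answer to the 1/0 value used by the weighted sum.

def answer_to_binary(text: str) -> int:
    """Map a generated yes/no answer to 1/0 (defaults to 0 if unclear)."""
    return 1 if text.strip().lower().startswith("yes") else 0

# A hypothetical model invocation might look like the following
# (assumed Hugging Face transformers pattern, untested here):
#   from transformers import AutoModelForCausalLM, AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("THUDM/VisionReward-Image", trust_remote_code=True)
#   model = AutoModelForCausalLM.from_pretrained("THUDM/VisionReward-Image", trust_remote_code=True)
#   ... build a multimodal prompt from an image plus a judgment question,
#   generate an answer, then pass it through answer_to_binary.

print(answer_to_binary("Yes, the image matches the prompt."))  # 1
print(answer_to_binary("No, there are visible artifacts."))    # 0
```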
Cloud GPUs
For efficient processing and faster inference, consider using cloud-based GPUs from providers such as AWS, Google Cloud, or Azure.
License
The VisionReward-Image model is released under a custom license, cogvlm2. More details can be found in the license file in the repository.