VisionReward-Image
THUDM/VisionReward-Image
Introduction
VisionReward is a comprehensive strategy for aligning visual generation models, covering both image and video generation, with human preferences. It takes a fine-grained, multi-dimensional approach: human preference is decomposed into multiple dimensions, each represented by a set of judgment questions, and the answers are linearly weighted and summed to produce an interpretable score. On video quality assessment, VisionReward analyzes dynamic video features, outperforms VideoScore by 17.2%, and excels at video preference prediction.
Architecture
VisionReward-Image decomposes human visual preferences into multiple dimensions, each probed by a series of judgment questions. The answers to these questions are then combined into a single coherent, interpretable score that guides the model toward alignment with human preferences.
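The decompose-and-combine scoring described above can be sketched as a weighted sum over binary judgment answers. The questions and weights below are illustrative placeholders, not the model's actual checklist:

```python
# Sketch of VisionReward's linear scoring scheme (illustrative only):
# each judgment question yields a yes/no answer, and the answers are
# linearly weighted and summed into one interpretable score.
# Question texts and weights here are made-up placeholders.

def vision_reward_score(answers: dict[str, bool], weights: dict[str, float]) -> float:
    """Linearly combine yes/no judgment answers into a preference score."""
    return sum(weights[q] * (1.0 if yes else 0.0) for q, yes in answers.items())

answers = {
    "Is the image free of artifacts?": True,
    "Does the image match the prompt?": True,
    "Is the lighting natural?": False,
}
weights = {
    "Is the image free of artifacts?": 0.5,
    "Does the image match the prompt?": 1.0,
    "Is the lighting natural?": 0.3,
}

print(vision_reward_score(answers, weights))  # 1.5
```

Because the score is a linear combination of human-interpretable answers, each dimension's contribution to the final preference score can be read off directly from its weight.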
Training
Training follows a structured approach that analyzes dynamic video features to improve video quality assessment and preference prediction. This systematic analysis allows VisionReward to outperform existing models.
Guide: Running Locally
Basic Steps
- Clone the Repository:
  Clone the VisionReward repository from GitHub.

      git clone https://github.com/THUDM/VisionReward
      cd VisionReward

- Install Dependencies:
  Ensure all Python package dependencies are installed.

      pip install -r requirements.txt

- Merge and Extract Checkpoint Files:
  Combine the split files into a .tar archive and extract it.

      cat ckpts/split_part_* > ckpts/visionreward_image.tar
      tar -xvf ckpts/visionreward_image.tar

- Run Inference:
  Execute the model inference script as described in the repository documentation.
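As a rough sketch of what an inference loop does with the extracted checkpoint, each judgment question is posed to the model and its free-text answer is mapped to a binary value before the weighted sum. The helper and the commented loading code below are assumptions for illustration; the repository's own inference script defines the real interface:

```python
# Hypothetical post-processing step (not the repository's actual script):
# map a generated yes/no answer to the 1/0 value used by the weighted sum.

def answer_to_binary(text: str) -> int:
    """Map a generated yes/no answer to 1/0 (defaults to 0 if unclear)."""
    return 1 if text.strip().lower().startswith("yes") else 0

# A hypothetical model invocation might look like the following
# (assumed Hugging Face transformers pattern, untested here):
#   from transformers import AutoModelForCausalLM, AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("THUDM/VisionReward-Image", trust_remote_code=True)
#   model = AutoModelForCausalLM.from_pretrained("THUDM/VisionReward-Image", trust_remote_code=True)
#   ... build a multimodal prompt from an image plus a judgment question,
#   generate an answer, then pass it through answer_to_binary.

print(answer_to_binary("Yes, the image matches the prompt."))  # 1
print(answer_to_binary("No, there are visible artifacts."))    # 0
```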
Cloud GPUs
For efficient processing and faster inference, consider using cloud-based GPUs from providers such as AWS, Google Cloud, or Azure.
License
The VisionReward-Image model is released under a custom license, cogvlm2. More details can be found in the license file in the repository.