EurusPRM-Stage1

PRIME-RL

Introduction

EurusPRM-Stage1 is trained with Implicit Process Reward Modeling (Implicit PRM), which obtains process rewards without any additional step-level labeling cost. An Outcome Reward Model (ORM) is trained on cheaper response-level labels, and implicit process rewards are then extracted through forward passes and log-likelihood ratio calculations.

Architecture

The foundation of Implicit PRM is a reward representation parameterized as the log-likelihood ratio between two causal language models: the trained model and a fixed reference model. Training an ORM under this parameterization implicitly learns a Q function over tokens, so process rewards can be read off afterwards without any step-level annotations.
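Concretely, the standard implicit-reward formulation described above can be sketched as follows, where \(\pi_\theta\) is the trained model, \(\pi_{\mathrm{ref}}\) the reference model, and \(\beta\) a scaling coefficient (exact notation is illustrative):

```latex
% Outcome reward as a log-likelihood ratio of two causal LMs
r_\theta(\mathbf{y}) = \beta \log \frac{\pi_\theta(\mathbf{y} \mid \mathbf{x})}{\pi_{\mathrm{ref}}(\mathbf{y} \mid \mathbf{x})}

% Cumulative (Q-like) value up to step t
q_\theta^{t} = \beta \sum_{i \le t} \log \frac{\pi_\theta(y_i \mid \mathbf{x}, \mathbf{y}_{<i})}{\pi_{\mathrm{ref}}(y_i \mid \mathbf{x}, \mathbf{y}_{<i})}

% Process reward for step t as the difference of consecutive values
r_\theta^{t} = q_\theta^{t} - q_\theta^{t-1}
```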

Training

EurusPRM-Stage1 instantiates the implicit PRM with a cross-entropy loss for memory efficiency, trained with a learning rate of 5e-7 and a batch size of 64. Training simply substitutes the log-likelihood-ratio reward into the standard ORM objective in place of an explicit reward head, so no step-level labels are needed.
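A sketch of such a cross-entropy ORM objective under the implicit reward, assuming binary response-level labels \(l(\mathbf{y}) \in \{0, 1\}\) (the exact Stage 1 loss may differ in details):

```latex
\mathcal{L}_{\mathrm{CE}}
  = - \, l(\mathbf{y}) \log \sigma\!\big(r_\theta(\mathbf{y})\big)
    - \big(1 - l(\mathbf{y})\big) \log \Big(1 - \sigma\!\big(r_\theta(\mathbf{y})\big)\Big),
\qquad
r_\theta(\mathbf{y}) = \beta \log \frac{\pi_\theta(\mathbf{y} \mid \mathbf{x})}{\pi_{\mathrm{ref}}(\mathbf{y} \mid \mathbf{x})}
```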

Guide: Running Locally

To run EurusPRM-Stage1 locally, follow these steps:

  1. Installation: Make sure PyTorch and the Transformers library are installed.
  2. Model Loading: Use AutoModelForCausalLM and AutoTokenizer from Transformers to load the model and tokenizer.
  3. Inference: Prepare input queries and answers, tokenize them, and pass them through the model to obtain per-token log probabilities.
  4. Reward Calculation: Compute the raw and beta-scaled rewards from the difference in log probabilities between the model and a reference model (see the sketch after this guide).

For efficient computation, consider using cloud GPUs such as those offered by AWS, Google Cloud, or Azure.
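A minimal sketch of this workflow is shown below. It assumes the model is published as PRIME-RL/EurusPRM-Stage1 on the Hugging Face Hub, that Qwen/Qwen2.5-Math-7B-Instruct serves as the reference model, and a reward coefficient beta of 0.001; these names and values are assumptions to adjust for your setup, and the handling of the query/answer boundary is simplified.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model/reference names and beta value; adjust to your setup.
MODEL_NAME = "PRIME-RL/EurusPRM-Stage1"
REF_NAME = "Qwen/Qwen2.5-Math-7B-Instruct"
BETA = 0.001

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16).to(device)
ref_model = AutoModelForCausalLM.from_pretrained(REF_NAME, torch_dtype=torch.bfloat16).to(device)


@torch.no_grad()
def token_logprobs(lm, input_ids, attention_mask):
    """Per-token log-probabilities of the given sequence under a causal LM."""
    logits = lm(input_ids=input_ids, attention_mask=attention_mask).logits
    logps = torch.log_softmax(logits[:, :-1], dim=-1)
    return logps.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)


query = "What is 2 + 2?"
answer = "2 + 2 = 4, so the answer is 4."

# Tokenize query and answer together; only answer tokens are scored.
query_ids = tokenizer(query, return_tensors="pt").input_ids
full = tokenizer(query + answer, return_tensors="pt").to(device)
answer_start = query_ids.shape[1] - 1  # shift by one for next-token prediction

logp_model = token_logprobs(model, full.input_ids, full.attention_mask)
logp_ref = token_logprobs(ref_model, full.input_ids, full.attention_mask)

# Implicit reward: log-likelihood ratio summed over the answer tokens.
raw_reward = (logp_model - logp_ref)[:, answer_start:].sum(dim=-1)
scaled_reward = BETA * raw_reward
print(f"raw reward: {raw_reward.item():.4f}, beta-scaled reward: {scaled_reward.item():.4f}")
```

The difference of per-token log probabilities can also be kept unsummed to score individual steps of a solution rather than the whole response.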

License

This project is licensed under the Apache 2.0 License, allowing for wide usage and modification while maintaining attribution and notice requirements.
