EurusPRM-Stage2

PRIME-RL

Introduction

EurusPRM-Stage2 is a process reward model (PRM) trained via implicit process reward modeling, which obtains process rewards without the need for step-level labels. Rewards are parameterized as the log-likelihood ratio between two causal language models, so process rewards can be recovered by training only on response-level data.

Architecture

The model leverages a reward representation defined by the log-likelihood ratio between the model being trained and a frozen reference model. This parameterization allows a process reward model (PRM) to be trained without step-level annotations, using an objective instantiated with cross-entropy loss for memory efficiency.
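The reward parameterization described above can be sketched as follows. This is a minimal illustration, not the released implementation: the shapes, the `beta` scaling coefficient, and the function name are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def implicit_reward(policy_logits, ref_logits, token_ids, beta=1.0):
    # Per-token implicit reward: beta * (log pi(y_t | y_<t) - log pi_ref(y_t | y_<t)).
    # policy_logits / ref_logits: next-token logits of shape [seq_len, vocab_size],
    # already shifted so that position t predicts token_ids[t].
    logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    ids = token_ids.unsqueeze(-1)
    return beta * (logp.gather(-1, ids) - ref_logp.gather(-1, ids)).squeeze(-1)
```

Note that when the trained model equals the reference model, every per-token reward is exactly zero, which is a quick sanity check for an implementation.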

Training

The training of EurusPRM-Stage2 builds on EurusPRM-Stage1 and incorporates both step-level and response-level data. Step-level labels are generated by models such as Llama-3.1-70B-Inst and Qwen2.5-72B-Inst. Training uses a learning rate of 5e-7, a batch size of 64, and cross-entropy loss to optimize the model.
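A response-level cross-entropy objective of the kind described above can be sketched as a binary cross-entropy between the sigmoid of the implicit reward and a correctness label. The `beta` value and label convention below are illustrative assumptions, not the released training configuration.

```python
import torch
import torch.nn.functional as F

def response_level_ce_loss(policy_logp_sum, ref_logp_sum, label, beta=0.05):
    # Implicit reward of the whole response:
    # beta * (log pi(y | x) - log pi_ref(y | x)), with summed token log-probs.
    # `label` is 1.0 for a correct response, 0.0 for an incorrect one.
    # beta=0.05 is an illustrative choice, not the actual hyperparameter.
    reward = beta * (policy_logp_sum - ref_logp_sum)
    # Cross-entropy between sigmoid(reward) and the correctness label.
    return F.binary_cross_entropy_with_logits(reward, label)
```

Under this objective, correct responses are pushed toward positive implicit rewards and incorrect ones toward negative rewards, using only response-level labels.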

Guide: Running Locally

Basic Steps

  1. Install Dependencies: Ensure you have PyTorch and Transformers installed.
  2. Load Model: Use the AutoModelForCausalLM and AutoTokenizer from the transformers library to load the EurusPRM-Stage2 model.
  3. Prepare Inputs: Tokenize the input data and prepare it for inference.
  4. Inference: Use the model to generate predictions and calculate rewards based on the log-likelihood ratio.
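The steps above can be sketched end-to-end as follows. This is a hedged sketch under several assumptions: the Hugging Face repo id `PRIME-RL/EurusPRM-Stage2`, blank-line step splitting, the `beta` value, and the `"REFERENCE-MODEL"` placeholder (replace it with the reference model used in training) are all illustrative, not confirmed by this card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def split_steps(solution):
    # Split a chain-of-thought solution into steps on blank lines (assumed format).
    return [s for s in solution.split("\n\n") if s.strip()]

@torch.no_grad()
def token_logprobs(model, input_ids):
    # Log-probability of each token given its prefix; returns [batch, seq_len - 1].
    logits = model(input_ids).logits[:, :-1]
    logp = torch.log_softmax(logits, dim=-1)
    return logp.gather(-1, input_ids[:, 1:, None]).squeeze(-1)

def main():
    # Loading the full model is heavy; run this on a GPU machine.
    prm = AutoModelForCausalLM.from_pretrained("PRIME-RL/EurusPRM-Stage2")
    ref = AutoModelForCausalLM.from_pretrained("REFERENCE-MODEL")  # placeholder
    tok = AutoTokenizer.from_pretrained("PRIME-RL/EurusPRM-Stage2")

    question = "What is 2 + 2?"
    solution = "Step 1: Add the numbers.\n\nStep 2: 2 + 2 = 4."
    ids = tok(question + "\n" + solution, return_tensors="pt").input_ids

    beta = 0.001  # illustrative scaling coefficient
    rewards = beta * (token_logprobs(prm, ids) - token_logprobs(ref, ids))
    print("mean reward:", rewards.mean().item())
    print("min reward:", rewards.min().item())

if __name__ == "__main__":
    main()
```

A full scorer would additionally sum rewards over the token span of each step (e.g. via tokenizer offset mappings); this sketch reports token-level statistics over the whole response for brevity.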

Suggested Cloud GPUs

For efficient computation, consider using cloud GPU services like AWS EC2, Google Cloud Compute Engine, or Azure Virtual Machines to run the model.

License

EurusPRM-Stage2 is released under the Apache-2.0 License.
