EurusPRM-Stage2 (PRIME-RL/EurusPRM-Stage2)
Introduction
EurusPRM-Stage2 is a process reward model trained with Implicit Process Reward Modeling (Implicit PRM), which obtains process rewards without additional step-level labels. Rewards are parameterized as the log-likelihood ratio of two causal language models (LMs), so process rewards can be learned by training on response-level data alone.
Architecture
The model represents the reward as the log-likelihood ratio between the trained model and a frozen reference model. This parameterization yields a process reward model (PRM) without step-level annotations; the training objective is instantiated with a cross-entropy loss for memory efficiency.
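Concretely, the implicit reward can be written as a log-likelihood ratio. The formulation below follows the standard implicit-PRM parameterization; the scaling coefficient β and the step indexing are assumptions for illustration, not copied from the model card:

```latex
% Per-token implicit reward: log-likelihood ratio between the
% trained model \pi_\theta and the reference model \pi_{\mathrm{ref}}
r_\theta(y_t) \;=\; \beta \log \frac{\pi_\theta(y_t \mid x, y_{<t})}
                                    {\pi_{\mathrm{ref}}(y_t \mid x, y_{<t})}

% The reward for a reasoning step is the sum of its token rewards
r_\theta(\text{step}) \;=\; \sum_{t \,\in\, \text{step}} r_\theta(y_t)
```

Because both quantities are ordinary token log-probabilities, no step-level reward head or annotation is needed at training time.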
Training
Training of EurusPRM-Stage2 continues from EurusPRM-Stage1 and mixes step-level and response-level data. Step-level labels are generated with models such as Llama-3.1-70B-Inst and Qwen2.5-72B-Inst. Training uses a learning rate of 5e-7 and a batch size of 64, with the cross-entropy objective described above.
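The hyperparameters above can be summarized as a training-config fragment (the field names here are illustrative, not the authors' actual config schema):

```yaml
# Hypothetical config sketch for the Stage2 run described above
base_model: EurusPRM-Stage1      # training continues from Stage1
learning_rate: 5.0e-7
batch_size: 64
objective: cross_entropy         # memory-efficient instantiation
data:
  - step_level                   # labels from Llama-3.1-70B-Inst / Qwen2.5-72B-Inst
  - response_level
```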
Guide: Running Locally
Basic Steps
- Install Dependencies: Ensure you have PyTorch and Transformers installed.
- Load Model: Use `AutoModelForCausalLM` and `AutoTokenizer` from the `transformers` library to load the EurusPRM-Stage2 model.
- Prepare Inputs: Tokenize the input data and prepare it for inference.
- Inference: Run the model to obtain per-token log-probabilities and calculate rewards from the log-likelihood ratio.
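The reward computation in the last step can be sketched as follows. This is a minimal, self-contained illustration of the log-likelihood-ratio reward: the token log-probabilities, step boundaries, and `beta` value are made up for demonstration, whereas in practice they come from the loaded EurusPRM-Stage2 model and its reference model.

```python
def implicit_process_rewards(policy_logprobs, ref_logprobs, step_ends, beta=1.0):
    """Per-step process rewards as the accumulated log-likelihood ratio
    between a policy LM and a reference LM.

    policy_logprobs / ref_logprobs: per-token log-probabilities of the
        response under each model (same length).
    step_ends: inclusive indices where each reasoning step ends.
    beta: reward scaling coefficient (illustrative default).
    """
    # Per-token implicit reward: beta * (log pi_theta - log pi_ref)
    token_rewards = [beta * (p - r) for p, r in zip(policy_logprobs, ref_logprobs)]
    rewards, start = [], 0
    for end in step_ends:
        # Step reward = sum of token rewards within the step
        rewards.append(sum(token_rewards[start:end + 1]))
        start = end + 1
    return rewards

# Toy example: 6 response tokens split into two steps (tokens 0-2 and 3-5)
policy = [-1.0, -0.5, -2.0, -0.8, -1.2, -0.3]
ref = [-1.2, -0.7, -1.8, -1.0, -1.2, -0.9]
print(implicit_process_rewards(policy, ref, step_ends=[2, 5]))
```

In a real pipeline, the per-token log-probabilities would be gathered from the logits of both models over the same tokenized response before being passed to a function like this.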
Suggested Cloud GPUs
For efficient computation, consider using cloud GPU services like AWS EC2, Google Cloud Compute Engine, or Azure Virtual Machines to run the model.
License
EurusPRM-Stage2 is released under the Apache-2.0 License.