Eurus 2 7 B P R I M E LLM Model

Introduction

Eurus-2-7B-PRIME is a language model trained using the PRIME method, which stands for Process Reinforcement through IMplicit rEward. This approach aims to enhance the reasoning abilities of language models beyond simple imitation or distillation. The model starts with Eurus-2-7B-SFT and is further trained on the Eurus-2-RL-Data dataset.

Architecture

In the PRIME method, both the policy model and the Process Reward Model (PRM) are initialized with the SFT model. During each reinforcement learning iteration, the policy model generates rollouts. These rollouts are scored by the implicit PRM and an outcome verifier. The implicit PRM is updated with the outcome reward, and this information is used to update the policy model by combining outcome rewards and process rewards.

Training

The PRIME algorithm involves several steps:

Prompt Filtering: Only prompts where the policy model achieves an accuracy between 0.2 and 0.8 are preserved.
Implicit Process Reward Calculation: Calculate the implicit process reward.
Update Implicit PRM: Update based on predicted rewards and ground truth labels.
Advantage Estimation: Perform return calculations for outcome and process rewards to establish an advantage.
Policy Update: Use PPO loss for importance sampling to update the policy model.

The model shows significant improvements on reasoning benchmarks, achieving a 16.7% improvement on average compared to its SFT version.

Guide: Running Locally

To run Eurus-2-7B-PRIME locally, follow these steps:

Install Dependencies: Ensure you have Python and the necessary libraries installed, such as Transformers.
Download the Model: Access the model from the Hugging Face repository.
Set Up Environment: Configure your environment to utilize GPUs for better performance. Cloud GPUs such as those from AWS, Google Cloud, or Azure are recommended for handling large models.
Run the Model: Use the model for text generation tasks by providing prompts and processing outputs using suggested actions like [ASSESS], [ADVANCE], and [VERIFY].