- What this repo is / isn’t
- Repo structure
- Quickstart
- Context: RL for LLMs
- PPO (actor–critic with clipping)
- DPO (direct preference optimization)
- GRPO (group-relative policy optimization)
- Policy gradient & REINFORCE
- Summary table (from PDF)
- Symbol reference (LLM context)
- References
- This is a learning repo with compact PyTorch implementations of key RL-for-LLMs objectives (PPO, DPO, GRPO, REINFORCE).
- It is not a full RLHF training stack, production system, or large-scale data pipeline.
- It aims to be a didactic reference you can read end-to-end in one sitting.
- `ppo/`: PPO loss utilities and equations.
- `dpo/`: DPO loss utilities and equations.
- `grpo/`: GRPO loss utilities and equations.
- `reinforce/`: REINFORCE utilities and equations.
- `summary.JPG`: one-page “RL for LLMs at a glance” diagram (Lambert, Reinforcement Learning from Human Feedback).
- `notes.pdf`: slide-style notes backing this README.
- Install dependencies:

      pip install -r requirements.txt

- Run toy demos:

      # REINFORCE on a tiny bandit
      python examples/demo_reinforce_bandit.py
      # PPO-style clipped update on a bandit
      python examples/demo_ppo_bandit.py
      # GRPO-style group-relative update on a bandit
      python examples/demo_grpo_bandit.py
      # Synthetic DPO preference optimization
      python examples/demo_dpo_synthetic.py

Each script prints simple scalar metrics (e.g., moving-average reward or preference margin) so you can see the objective behaving as expected.
- Pipeline: pre-training → supervised fine-tuning → preference/reasoning fine-tuning.
- Goal: keep improving the model using human preferences or verifiable rewards once gains from web-scale data saturate.
- Challenge: propagate a sparse final reward through long token trajectories; variance is high.
- Advantage: $A_t = Q(s_t, a_t) - V(s_t)$.
- Ratio: $r_t(\theta) = \dfrac{\pi_{\text{new}}(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)}$.
- Clipped objective: $L_{\text{CLIP}} = \mathbb{E}\big[\min(r_t A_t,\ \text{clip}(r_t, 1-\epsilon, 1+\epsilon)\, A_t)\big]$.
- Pros: stable; mitigates collapse. Cons: heavier compute/memory (policy + ref + reward + critic).
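
To make the clipping behavior concrete, here is a minimal sketch of the clipped objective in plain Python. It is not the code in `ppo/`; the function name `ppo_clip_objective`, its argument names, and the default `eps=0.2` are assumptions chosen for illustration, and the expectation is approximated by a batch mean.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Batch mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    Hypothetical helper: inputs are per-sample log-probabilities under the
    new and old policies plus advantage estimates (all plain floats).
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                # r_t(theta)
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))  # clip(r_t, 1-eps, 1+eps)
        total += min(ratio * adv, clipped * adv)         # pessimistic bound
    return total / len(advantages)

# The ratio is 0.6 / 0.4 = 1.5, outside the clip range [0.8, 1.2].
gain_pos = ppo_clip_objective([math.log(0.6)], [math.log(0.4)], [1.0])
gain_neg = ppo_clip_objective([math.log(0.6)], [math.log(0.4)], [-1.0])
```

With a positive advantage the clipped term wins, so pushing the ratio past $1+\epsilon$ earns no extra objective value. With a negative advantage the unclipped term is the smaller one, so the penalty still applies in full.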
- Bradley–Terry preference: $P(y_w \succ y_l) = \sigma\big(r(x, y_w) - r(x, y_l)\big)$.
- Loss using log-prob ratios to a frozen reference: $L_{\text{DPO}} = -\mathbb{E}\big[\log \sigma(\beta \log \tfrac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta \log \tfrac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)})\big]$.
- Pros: simple (policy + reference only). Cons: best aligned to pairwise preference data.
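
The DPO loss for a single preference pair can be sketched in a few lines of plain Python. This is an illustrative sketch, not the code in `dpo/`: the names `dpo_loss` and `softplus`, the argument names, and the default `beta=0.1` are assumptions. Each input is the summed log-probability of a full response given the prompt.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(pol_logp_w, pol_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * (log-ratio of chosen - log-ratio of rejected)).

    Hypothetical helper for one (x, y_w, y_l) triple.
    """
    margin = beta * ((pol_logp_w - ref_logp_w) - (pol_logp_l - ref_logp_l))
    # -log(sigmoid(m)) == log(1 + exp(-m)) == softplus(-m)
    return softplus(-margin)

# Policy identical to the reference: margin is 0, loss is log(2).
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy raises the chosen response relative to the reference: loss drops.
loss_improved = dpo_loss(-10.0, -15.0, -12.0, -15.0)
```

Starting from a policy equal to the reference, the loss is $\log 2$ regardless of the raw log-probabilities; only the change relative to the reference matters.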
- Group sampling: draw $G$ outputs per prompt; compute group mean/std of rewards.
- Normalized advantage: $A_i = \dfrac{r_i - \text{mean}(\text{Rewards}_{\text{group}})}{\text{std}(\text{Rewards}_{\text{group}})}$.
- Loss (to minimize): $L_{\text{GRPO}} = -L_{\text{CLIP}} + \beta\, D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})$, where $L_{\text{CLIP}}$ is the PPO clipped objective evaluated with the group-normalized advantages $A_i$.
- Pros: removes critic → memory/compute savings; good for verifiable rewards. Cons: needs multiple samples per prompt; sensitive to group size.
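
The group-normalized advantage is easy to check by hand. The sketch below is plain Python rather than the code in `grpo/`; the name `group_advantages`, the use of the population standard deviation, and the small `eps` added to the denominator (to avoid dividing by zero when every sample in a group gets the same reward) are assumptions, not details stated above.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group: (r_i - mean) / (std + eps).

    Hypothetical helper. Assumes the population standard deviation; an
    implementation could equally use the sample standard deviation.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one prompt, scored by a verifiable 0/1 reward.
advs = group_advantages([1.0, 0.0, 0.0, 1.0])
# A group where every answer is correct carries no learning signal.
flat = group_advantages([1.0, 1.0, 1.0])
```

Correct answers receive advantages near $+1$ and incorrect ones near $-1$. When all rewards in a group are equal, every advantage is zero, which is one reason the method is sensitive to group size and task difficulty.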
- Return: $G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$
- Update: $\theta_{t+1} = \theta_t + \alpha\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$
- Loss form: $L_{\text{REINFORCE}} = -\sum_t G_t \log \pi_\theta(a_t \mid s_t)$
- With baseline $b$: use $G_t - b$ to reduce variance (critic in PPO; group stats in GRPO).
- Pros: simple, unbiased; supports stochastic policies. Cons: high variance; Monte Carlo delay.
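
As a small worked example of the return and loss formulas, the following plain-Python sketch computes discounted returns with a backward pass and then the REINFORCE loss value. It is not the code in `reinforce/`; the function names, argument names, and the constant `baseline` parameter are assumptions for illustration, and no gradients are computed here.

```python
def discounted_returns(rewards, gamma=0.99):
    """G_t = r_t + gamma * G_{t+1}, computed from the last step backward."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

def reinforce_loss(logps, returns, baseline=0.0):
    """-sum_t (G_t - b) * log pi(a_t | s_t) for one trajectory.

    Hypothetical helper: `logps` are log-probabilities of the taken actions.
    """
    return -sum((g - baseline) * lp for g, lp in zip(returns, logps))

# Sparse reward: only the final step pays out.
returns = discounted_returns([0.0, 0.0, 1.0], gamma=0.5)
loss = reinforce_loss([-1.0, -2.0], [1.0, 2.0])
```

With `gamma=0.5` the single final reward is propagated back as `[0.25, 0.5, 1.0]`, which shows how a sparse terminal reward reaches earlier tokens only through discounting.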
| Feature | PPO | DPO | GRPO |
|---|---|---|---|
| Type | RL (actor–critic) | Direct preference optimization | RL (policy gradient) |
| Models | Policy, reference, reward, value (critic) | Policy, reference | Policy, reference (no critic) |
| Baseline | Learned value | N/A | Group mean/std |
| Mechanism | Clipping | Log-sigmoid margin loss | Group-normalized advantage + clip + KL |
| Complexity | High | Low | Medium |
| Use case | General RLHF | Preference fine-tuning | Reasoning (math/code/logic) |
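| Symbol | Meaning |
|---|---|
| $x$ | Input prompt |
| $y_w,\ y_l$ | Preferred and rejected responses |
| $\pi_\theta$ | Current policy (LLM) |
| $\pi_{\text{old}}$ | Policy at data collection |
| $\pi_{\text{ref}}$ | Frozen reference policy |
| $r_t$ | Reward at step $t$ |
| $G_t$ | Discounted return |
| $A_t$ | Advantage (with critic or group baseline) |
| $r_t(\theta)$ | Ratio $\pi_{\text{new}}(a_t \mid s_t) / \pi_{\text{old}}(a_t \mid s_t)$ |
| $\epsilon$ | Clip range |
| $\beta$ | KL or preference strength |
| $\gamma$ | Discount factor |
| $\alpha$ | Learning-rate or entropy coefficient (contextual) |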
- RL for LLMs / RLHF
  - Christiano et al., Deep Reinforcement Learning from Human Preferences (arXiv:1706.03741)
  - Ouyang et al., Training language models to follow instructions with human feedback (arXiv:2203.02155)
- PPO (Proximal Policy Optimization)
  - Schulman et al., Proximal Policy Optimization Algorithms (arXiv:1707.06347)
- DPO (Direct Preference Optimization)
  - Rafailov et al., Direct Preference Optimization: Your Language Model is Secretly a Reward Model (arXiv:2305.18290)
- GRPO (Group-Relative Policy Optimization)
  - DeepSeek AI, DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (OpenReview)
- Policy gradient / REINFORCE
  - Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning (DOI:10.1007/BF00992696)
- Reference sheet source
  - Lambert, Reinforcement Learning from Human Feedback (rlhfbook.com)