RLHF & Alignment — Turning Reward into Behavior
Language models don't know right from wrong. They predict the next token. RLHF steers them toward human preferences: collect comparisons between responses, fit a reward model to those comparisons, and optimize the policy against that reward while keeping it close to the original model. Learn how assistants such as ChatGPT were trained to be helpful, harmless, and honest, and why naive reward maximization breaks down.
Why raw language models aren't enough
A language model trained only on next-token prediction is agnostic to your values. It's learned the statistical patterns of human language, but has no notion of helpfulness, harmlessness, or honesty. Ask it something and it might generate a harmful, misleading, or unhelpful response—not because it's malicious, but because the training objective doesn't care.
This is where RLHF (Reinforcement Learning from Human Feedback) comes in. The model still generates text one token at a time, but the training objective changes: instead of matching the training corpus, we optimize the model to produce responses that humans prefer. We collect preference examples, fit a reward model to them, and optimize the policy with an algorithm such as PPO that balances reward against stability.
The alignment problem
Language models are capable but not aligned by default. RLHF is one of the most effective techniques we have for steering their behavior toward human values while largely preserving their capabilities.
Pre-RLHF behavior
- ✗ Toxic or biased outputs
- ✗ Factually incorrect claims presented confidently
- ✗ Refuses safe requests; complies with harmful ones
- ✗ Verbose and repetitive
Post-RLHF behavior
- ✓ Thoughtful, respectful responses
- ✓ Expresses uncertainty when appropriate
- ✓ Refuses harmful requests; helps with legitimate ones
- ✓ Clear and concise
Quick check
Why can't we just train a language model on next-token prediction and call it done?
RLHF is one tool, not the only one. Other alignment approaches include:
- Supervised Fine-Tuning (SFT): Train on human-written examples of good responses. Simple but doesn't leverage preference data.
- Constitutional AI (CAI): Define explicit principles the model should follow. More interpretable than black-box rewards.
- Mechanistic Interpretability: Understand what the model is computing internally. Complementary to RLHF.
- DPO (Direct Preference Optimization): Skip the reward model, optimize directly on preferences. Emerging as a competitive alternative.
RLHF is powerful because it can leverage large amounts of preference data and is agnostic to the specific alignment goal (defined by the reward model).
Learning from human preferences
The first step of RLHF is to gather preference data. For each prompt, generate several candidate responses and ask humans which one is better. These preferences are the signal that will train the reward model.
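As a rough illustration, preference data is often stored as (prompt, chosen, rejected) records. When an annotator ranks several candidates at once, the ranking can be expanded into pairs. The sketch below assumes that layout; `PreferencePair` and `ranking_to_pairs` are hypothetical names for this example, not part of any library.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # response the annotator preferred
    rejected: str  # response the annotator ranked lower


def ranking_to_pairs(prompt, ranked_responses):
    """Expand a best-to-worst ranking into all pairwise comparisons."""
    # combinations() preserves input order, so `better` always precedes `worse`.
    return [
        PreferencePair(prompt, better, worse)
        for better, worse in combinations(ranked_responses, 2)
    ]


pairs = ranking_to_pairs(
    "Explain photosynthesis in one sentence.",
    ["clear answer", "vague answer", "off-topic answer"],
)
for pair in pairs:
    print(f"{pair.chosen!r} > {pair.rejected!r}")
```

A ranking of K responses yields K * (K - 1) / 2 pairs, so three ranked candidates produce three training comparisons.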
How does a reward model learn from preferences?
The reward model observes which responses humans prefer and learns to assign higher scores to the preferred ones. This is the idea behind Bradley-Terry ranking.
When learning from preferences, we use the Bradley-Terry model to convert preference pairs into probabilities:
P(A > B) = exp(r_A) / (exp(r_A) + exp(r_B))
This means the probability that A is preferred depends only on the difference between the two reward scores: the formula is equivalent to P(A > B) = sigmoid(r_A - r_B). If r_A = 2 and r_B = 0, then P(A > B) ≈ 0.88, so the model is fairly confident that A is better.
The reward model is trained to maximize the likelihood of the observed preference pairs, which is the same as minimizing -log(sigmoid(r_A - r_B)) for each pair where humans chose A over B. Each gradient step pushes r_A up and r_B down, making that preference more likely under the model.
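The Bradley-Terry probability and its per-pair loss fit in a few lines. This is a minimal sketch; the function names `bt_prob` and `bt_loss` are made up for the example.

```python
import math


def bt_prob(r_a, r_b):
    """P(A > B) under Bradley-Terry; algebraically equal to sigmoid(r_a - r_b)."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))


def bt_loss(r_chosen, r_rejected):
    """Negative log-likelihood of one observed preference (chosen beat rejected)."""
    return -math.log(bt_prob(r_chosen, r_rejected))


print(round(bt_prob(2.0, 0.0), 3))  # 0.881, the example from the text
print(round(bt_loss(2.0, 0.0), 3))  # small loss: scores agree with the human label
print(round(bt_loss(0.0, 2.0), 3))  # large loss: scores contradict the human label
```

Only the difference r_A - r_B enters the formula, so adding the same constant to every reward leaves all preference probabilities unchanged.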
Quick check
What is the Bradley-Terry model used for?
The reward model: learning from feedback
A reward model predicts human preferences. Given a prompt and a response, it outputs a single scalar reward score; comparing the scores of two responses tells us which one the model expects humans to prefer. The model learns from preference pairs, improving its ability to predict which response humans would choose.
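To make the training loop concrete, here is a deliberately tiny sketch: a linear reward over three hand-made features, fitted by gradient descent on the pairwise loss -log(sigmoid(r_chosen - r_rejected)). The features and data are invented for illustration. Real reward models are typically language models with a scalar output head, not linear models over hand-picked features.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def reward(w, features):
    """Linear reward: a scalar score for one response's feature vector."""
    return sum(wi * xi for wi, xi in zip(w, features))


def train_reward_model(pairs, dim, lr=0.5, epochs=200):
    """Gradient descent on the pairwise loss -log(sigmoid(r_chosen - r_rejected))."""
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(w, chosen) - reward(w, rejected)
            # d(loss)/d(margin) = -(1 - sigmoid(margin)), so descending the
            # gradient moves w along (chosen - rejected).
            scale = 1.0 - sigmoid(margin)
            for i in range(dim):
                w[i] += lr * scale * (chosen[i] - rejected[i])
    return w


# Hypothetical features: [is_polite, answers_question, contains_insult]
pairs = [
    ([1.0, 1.0, 0.0], [0.0, 1.0, 1.0]),
    ([1.0, 1.0, 0.0], [1.0, 0.0, 0.0]),
    ([0.0, 1.0, 0.0], [0.0, 0.0, 1.0]),
]
w = train_reward_model(pairs, dim=3)
accuracy = sum(reward(w, c) > reward(w, r) for c, r in pairs) / len(pairs)
print("weights:", [round(x, 2) for x in w])
print("pairwise accuracy:", accuracy)
```

The learned weights end up positive for the features that preferred responses share and negative for the insult feature, which is all the pairwise data can tell the model.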
How the reward model learns
During RLHF, a separate model learns to score responses based on human preferences. As it sees more examples, it gets better at predicting which responses humans would prefer.
Key insight
The reward model doesn't score responses in absolute terms. Only differences between scores matter, because it learns a preference ordering from human feedback. As training progresses, it gets better at predicting which responses humans would prefer, so it can score any new response without asking a human.
The reward model is the heart of RLHF. It distills human preferences into a single scalar signal that can guide policy optimization. Instead of asking humans to label every generated response, we train a reward model once and use it to score unlimited responses.
Trade-offs:
- Reward model quality: The final policy is only as good as the reward model. Garbage in, garbage out.
- Generalization: A reward model trained on preference pairs for math problems might not generalize to coding.
- Reward hacking: The policy might find ways to game the reward signal (e.g., high scores without actual quality).
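Approaches like Constitutional AI and DPO change where the preference signal comes from: CAI supplements human labels with AI feedback guided by written principles, and DPO removes the separate reward model altogether.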
Quick check
What is the primary function of the reward model in RLHF?
Optimizing with PPO
Now we have a reward signal. We could naively maximize it, but that leads to disaster: the policy diverges from the base model, generating nonsensical text that somehow fools the reward model. Instead, we use Proximal Policy Optimization (PPO), which maximizes reward while staying close to the original model.
PPO: Optimizing reward vs staying close to the base model
PPO seeks a balance: maximize reward while keeping the policy close to the base model (the KL constraint). A higher KL weight enforces a stricter constraint but may leave reward on the table.
PPO tradeoff
Lower KL weights (β = 0.01) allow bigger reward gains but drift further from the base model. Higher KL weights (β = 0.2) stay closer to the base model but achieve lower rewards. The KL weight sets this balance; PPO's clipping separately limits how far any single update can move the policy.
Notice: As KL weight increases, final reward decreases but stability improves.
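The tradeoff can be checked exactly in a toy setting. For a small, fixed set of candidate responses, the policy that maximizes E[reward] - β * KL(π || π_ref) has a known closed form: π(y) is proportional to π_ref(y) * exp(r(y) / β). The sketch below evaluates that optimum for several β values. The three candidates, their probabilities, and their rewards are invented for illustration, and PPO only approximates this optimum through gradient updates.

```python
import math


def kl_regularized_optimum(ref_probs, rewards, beta):
    """Maximizer of E[r] - beta * KL(pi || ref): pi(y) proportional to ref(y) * exp(r(y) / beta)."""
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]
    top = max(logits)  # subtract the max before exponentiating, for stability
    weights = [math.exp(x - top) for x in logits]
    total = sum(weights)
    return [wt / total for wt in weights]


def expected_reward(probs, rewards):
    return sum(p * r for p, r in zip(probs, rewards))


def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical toy setup: three candidate responses.
ref_probs = [0.7, 0.2, 0.1]  # base model's probabilities
rewards = [0.2, 0.5, 0.9]    # reward model scores

for beta in (0.01, 0.2, 1.0):
    pi = kl_regularized_optimum(ref_probs, rewards, beta)
    print(
        f"beta={beta}: reward={expected_reward(pi, rewards):.3f}, "
        f"KL={kl_divergence(pi, ref_probs):.3f}"
    )
```

With a small β the optimum puts almost all probability on the highest-reward response, far from the base model. As β grows, both the expected reward and the KL divergence shrink.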
PPO uses a clever trick: it clips the probability ratio to prevent large updates:
Loss = -min(r * A, clip(r, 1-ε, 1+ε) * A)
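Where r = exp(log p_new - log p_old) is the importance ratio and A is the advantage: how much better the sampled response scored than a baseline such as a value estimate. When A is positive and the new policy raises the action's probability past 1+ε, the clipped term takes over and the loss stops decreasing, so there is no incentive to push further. The same happens below 1-ε when A is negative. This limits how far a single update can move the policy.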
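The clipped loss for a single sample can be written directly from the formula. This is a minimal per-sample sketch with ε = 0.2 chosen for illustration; practical implementations average the loss over batches of tokens and compute advantages with a learned value function.

```python
import math


def ppo_clip_loss(logp_new, logp_old, advantage, eps=0.2):
    """Per-sample clipped surrogate loss: -min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return -min(ratio * advantage, clipped_ratio * advantage)


logp_old = math.log(0.10)
for p_new in (0.10, 0.12, 0.20, 0.40):
    loss = ppo_clip_loss(math.log(p_new), logp_old, advantage=1.0)
    print(f"p_new={p_new:.2f}  ratio={p_new / 0.10:.1f}  loss={loss:.2f}")
```

With a positive advantage, the loss falls until the ratio reaches 1.2 and then stays flat: doubling or quadrupling the action's probability earns nothing extra.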
We can also add an explicit KL penalty against the frozen base (reference) model: Loss += β * KL(π_new || π_ref). This directly measures how far the policy has drifted from the base model. A higher β enforces a stricter constraint, keeping the policy closer to the base model but potentially leaving reward on the table.
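One common way to apply the penalty in RLHF is to fold it into the reward: subtract β times a sample-based KL estimate, computed from the log-probabilities of the sampled tokens under the current policy and under a frozen copy of the base model. The sketch below assumes that formulation; the function name and all numbers are illustrative.

```python
def kl_penalized_reward(rm_score, logps_policy, logps_ref, beta=0.1):
    """Reward-model score minus beta times a sample-based KL estimate.

    logps_policy / logps_ref: per-token log-probabilities of the sampled
    response under the current policy and the frozen reference model.
    """
    kl_estimate = sum(lp - lr for lp, lr in zip(logps_policy, logps_ref))
    return rm_score - beta * kl_estimate


# Hypothetical numbers for a 3-token response.
logps_ref = [-1.2, -0.8, -2.0]
close_policy = [-1.1, -0.8, -1.9]    # nearly the same as the reference
drifted_policy = [-0.2, -0.1, -0.3]  # far more confident than the reference

print(kl_penalized_reward(0.9, close_policy, logps_ref))
print(kl_penalized_reward(0.9, drifted_policy, logps_ref))
```

Both responses get the same reward-model score, but the drifted policy pays a larger penalty, so its effective reward is lower.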
Quick check
Why do we need a KL constraint during PPO training?
The complete RLHF pipeline
Let's trace through the entire pipeline from raw model to aligned assistant. Click through the steps to understand how each component feeds into the next.
The RLHF Pipeline: A 6-step walkthrough
Follow the journey from a raw language model to an aligned assistant.
Step 1: Start with a base model
A language model trained on large amounts of text (GPT-style pretraining).
- Trained to predict the next token given context
- Good at language, but not aligned with human preferences
- Example: might be verbose, toxic, or unhelpful
- This is our starting point for RLHF
Direct Preference Optimization (DPO) is a newer approach that skips the reward model entirely. Instead of training a separate reward model, DPO optimizes the policy directly on preferences:
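Loss = -log(sigmoid(β * [(log p_π(y_w) - log p_ref(y_w)) - (log p_π(y_l) - log p_ref(y_l))]))
Here y_w is the preferred response, y_l is the rejected one, and p_ref is the frozen reference model (usually the SFT model) that the policy starts from. The loss rewards the policy for raising the preferred response's log-probability, relative to the reference, by more than it raises the rejected one's.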
This is much simpler and avoids the reward modeling bottleneck. DPO is often reported to be competitive with PPO-based RLHF while being cheaper to run and easier to tune.
The trade-off: DPO still requires preference data, and because it never produces a standalone reward model, there is nothing to reuse for scoring freshly generated responses or for optimizing against a new objective.
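As published, the DPO objective measures each response's log-probability relative to a frozen reference model, usually the model the policy was initialized from. The sketch below computes the loss for one preference pair from four sequence-level log-probabilities. The argument names and the example numbers are illustrative.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair (y_w preferred over y_l)."""
    # Implicit rewards: how much the policy has raised each response's
    # log-probability relative to the frozen reference model.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # log(1 + exp(-z)) is the same quantity as -log(sigmoid(z)).
    return math.log1p(math.exp(-beta * margin))


# Policy identical to the reference: margin is 0, loss is log(2).
print(round(dpo_loss(-5.0, -5.0, -5.0, -5.0), 3))
# Policy favors the preferred response more than the reference does: lower loss.
print(round(dpo_loss(-4.0, -6.0, -5.0, -5.0), 3))
# Policy favors the rejected response: higher loss.
print(round(dpo_loss(-6.0, -4.0, -5.0, -5.0), 3))
```

Here β plays a role similar to the KL weight in PPO-based RLHF: it controls how strongly the policy is tied to the reference model.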
Quick check
What is the main advantage of DPO over traditional RLHF?
Alignment playground
Experiment with reward signals, detect reward hacking, and understand how constitutional principles constrain model behavior. This playground demonstrates the core trade-offs in alignment.
Alignment Playground
Experiment with different reward signals and see how model behavior changes. Explore reward hacking and constitutional AI constraints.
[Interactive playground output: a reward score, an alignment score, and the gap between them (reward minus alignment). A larger gap indicates potential reward hacking.]
How it works
In this playground, the toy reward model scores responses based on length, coherence, and semantic diversity. Higher reward doesn't always mean better alignment: a model can game the reward signal by being verbose without being helpful.
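A small sketch makes the gap concrete. Both scoring functions below are invented for illustration and are not the playground's actual scoring code: `toy_reward` is a deliberately flawed proxy that only measures length, and `toy_alignment` stands in for true quality by checking whether the required keywords appear and penalizing repetition.

```python
def toy_reward(response):
    """Flawed proxy reward: longer responses score higher (capped at 1.0)."""
    return min(len(response.split()) / 20.0, 1.0)


def toy_alignment(response, required_keywords):
    """Stand-in for true quality: keyword coverage times the share of distinct words."""
    words = response.lower().split()
    if not words:
        return 0.0
    coverage = sum(k in words for k in required_keywords) / len(required_keywords)
    uniqueness = len(set(words)) / len(words)
    return coverage * uniqueness


keywords = ["100", "celsius"]
concise = "Water boils at 100 degrees Celsius at sea level."
padded = " ".join(["Great question! Water is really really very very interesting."] * 4)

for name, text in (("concise", concise), ("padded", padded)):
    r, a = toy_reward(text), toy_alignment(text, keywords)
    print(f"{name}: reward={r:.2f} alignment={a:.2f} gap={r - a:.2f}")
```

The padded response maximizes the proxy reward while answering nothing, so its gap is large and positive. A policy optimized hard against `toy_reward` would drift toward exactly this kind of output.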
Anthropic's Constitutional AI (CAI) approach pairs the RLHF recipe with an explicit written constitution. The model is given a set of principles and trained to:
- Critique its own responses against the constitution
- Revise responses to satisfy the principles, then fine-tune on the revised responses
- Optimize with reinforcement learning against a preference model whose harmlessness comparisons are labeled by an AI model following the constitution, rather than by human raters
This combines the interpretability of explicit rules with the power of preference-based RL, and reduces the amount of human labeling needed.
Quick check
What does reward hacking refer to?
Key takeaways
- Language models trained on next-token prediction are amoral—they need alignment to behave helpfully and safely.
- RLHF learns from human preferences, trains a reward model, and uses PPO to optimize the policy.
- The reward model is the bottleneck—its quality determines final alignment. Garbage in, garbage out.
- PPO uses clipping and KL constraints to prevent the policy from diverging too far from the base model and gaming the reward.
- Reward hacking (policy finding loopholes in the reward signal) is a major challenge. Constitutional AI and better evaluation help.
- DPO is an emerging alternative that skips the reward model and optimizes directly on preferences.
- Alignment is hard. These techniques work in practice, but the problem is fundamentally challenging and ongoing.