10 RLHF & Alignment
Who this chapter is for: Mid Level
What you’ll be able to answer after reading this:
- The three-stage RLHF pipeline: SFT → reward model → PPO
- Why alignment is needed and what “helpful, harmless, honest” means in practice
- DPO as a simpler alternative to PPO-based RLHF
- What Constitutional AI is and how it differs
10.1 Why Alignment?
A pretrained large language model is, at its core, a next-token predictor. During pretraining, the model optimizes for one objective: predict the next token in a given text sequence. The pretraining corpus — crawls of the internet, books, code repositories, academic papers — contains the full spectrum of human expression, including technical knowledge and productive discourse, but also misinformation, hate speech, dangerous instructions, and manipulative rhetoric. A base model trained on this corpus will confidently and helpfully complete any prompt in a statistically plausible direction, including prompts that begin “Here is a step-by-step guide to synthesizing…” or “Write a convincing argument that…”. The model has no values, no awareness of harm, and no preference for truth over plausible-sounding falsehood. It only knows what text typically follows what text.
The alignment problem is the gap between this next-token prediction objective and the objectives we actually want deployed systems to optimize: be helpful to users, avoid causing harm, tell the truth. Bridging this gap is technically non-trivial. You cannot simply add “be helpful and safe” to the system prompt and solve the problem — the base model does not interpret natural language directives as binding instructions. You cannot fine-tune on a hand-labeled dataset of “good” responses and declare victory — the space of possible inputs is effectively infinite, and a dataset of finite examples cannot cover every harmful edge case. Alignment is a continuous engineering and research challenge, not a checkbox. The practical goal is to produce a model that is right most of the time across the realistic distribution of user inputs, with graceful failure modes on the tail cases that inevitably arise.
Anthropic’s “HHH” framework — Helpful, Harmless, Honest — captures the three distinct dimensions of alignment. Helpfulness means the model actually accomplishes what the user is trying to do, not that it produces safe-sounding non-answers. A model that refuses all requests involving chemistry or law is not helpful, even though it avoids some narrow risk categories. Harmlessness means the model avoids assisting with actions that have a high probability of causing serious harm — synthesizing dangerous substances, generating child sexual abuse material, facilitating targeted harassment. Honesty means the model accurately represents its uncertainty, does not fabricate plausible-sounding false information (hallucination), and does not manipulate users through deceptive framing. These three objectives create real tensions. A more helpful model that completes more requests is often a more harmful one, since some requests are harmful. A more honest model that always surfaces uncertainty may be less useful for tasks requiring confident synthesis. Alignment work is fundamentally about navigating these tradeoffs at scale.
Instruction following and alignment are related but distinct. A model that follows instructions well will follow instructions to do harm just as readily as instructions to be helpful — base GPT-3, given well-chosen few-shot prompting, could follow many instructions in the technical sense without any notion of which ones it should refuse. The alignment work is specifically about building in value judgments: the model should follow the instruction “write a poem about autumn” but refuse the instruction “write a phishing email.” This distinction — between capability (can follow instructions) and values (chooses which instructions to follow) — is important for understanding what alignment techniques are actually trying to achieve and why pure supervised fine-tuning on (instruction, response) pairs is insufficient to fully align a model.
10.2 Stage 1: Supervised Fine-Tuning (SFT)
Supervised Fine-Tuning is the first stage of the RLHF pipeline and the foundation on which everything else is built. In the SFT stage, human contractors are given a diverse set of prompts spanning a wide range of user intentions — coding tasks, creative writing requests, factual questions, sensitive topics requiring careful handling — and asked to write ideal responses to each. These (prompt, ideal_response) pairs form the SFT dataset, typically 10,000 to 100,000 examples for a production-grade alignment pipeline. The SFT objective is standard supervised learning: minimize the cross-entropy loss between the model’s output distribution and the human-written responses. After SFT, the model has learned the chat format, the conversational register appropriate for an assistant, the general pattern of helpful responses, and a basic understanding that it should follow instructions rather than continue them.
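A minimal sketch of the SFT objective, assuming a causal LM that returns next-token logits (the random tensors below stand in for a real forward pass); the important detail is masking the prompt tokens so the loss is computed only on the response:

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Cross-entropy on response tokens only.

    logits: (seq_len, vocab) next-token logits from the model
    labels: (seq_len,) token ids of prompt + response
    prompt_len: number of prompt tokens to exclude from the loss
    """
    # Shift so the prediction at position t is graded against token t+1.
    logits, labels = logits[:-1], labels[1:].clone()
    # Mask out prompt tokens: the model is graded only on the response.
    labels[: prompt_len - 1] = -100  # ignore_index for cross_entropy
    return F.cross_entropy(logits, labels, ignore_index=-100)

# Toy example with random logits standing in for a model forward pass.
vocab, seq_len, prompt_len = 32000, 64, 16
logits = torch.randn(seq_len, vocab)
labels = torch.randint(0, vocab, (seq_len,))
print(sft_loss(logits, labels, prompt_len))
```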
The quality ceiling of the SFT stage is directly determined by the quality of the contractors writing the demonstrations. If contractors write verbose, meandering responses, the model learns to be verbose. If they consistently express uncertainty in a particular way, the model learns that pattern. If they have systematic biases in how they treat certain topics or groups, those biases are encoded in the model. This is not a hypothetical concern — documented quality issues in large-scale annotation projects include annotators rushing to complete tasks (producing short, low-effort responses), annotators with narrow cultural backgrounds writing responses that poorly represent global perspectives, and annotators failing to accurately assess factual accuracy in specialized domains. Investing in annotator quality, training, and quality-control processes is a direct investment in model quality.
SFT alone produces models that are substantially better than base models at instruction following, but they still exhibit several well-documented failure modes. First, inconsistency: for semantically equivalent prompts phrased differently, the SFT model may respond very differently. Second, sycophancy: SFT models tend to agree with whatever the user says, even when the user is wrong, because the human demonstrations used for training were written by humans who may themselves tend to affirm rather than correct. Third, unsafe behavior on long-tail inputs: the SFT dataset cannot cover every possible harmful prompt, and the model does not reliably generalize the safety behavior seen in training to novel harmful prompts it was not specifically trained on. Fourth, poor calibration: SFT models often express high confidence in responses they have no basis for confidence in, because the training data did not systematically model uncertainty. These failure modes motivate the subsequent stages.
10.3 Stage 2: Reward Model Training
The reward model is trained on human preference data: pairs of responses to the same prompt, with one response labeled as preferred over the other. Collecting comparisons in this way captures subtle human preferences more reliably than asking annotators to write demonstrations directly. Annotators find it much easier to judge “this response is better than that response” than to write a high-quality response from scratch — the comparison task taps evaluative intuitions that are not readily expressible as explicit instructions. A production preference dataset might contain hundreds of thousands of (prompt, chosen_response, rejected_response) triples, each representing one annotator’s judgment. The prompt distribution is kept as diverse as possible to avoid biasing the reward model toward specific task types.
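For concreteness, a single preference record is just a triple; the field names below are illustrative rather than a standard schema:

```python
# One illustrative preference record (hypothetical content and field names).
preference_example = {
    "prompt": "Explain the difference between a list and a tuple in Python.",
    "chosen": "A list is mutable, so you can modify it after creation; a tuple is immutable...",
    "rejected": "They are basically the same thing, just use whichever you like.",
}
```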
The reward model is architecturally identical to the LLM being trained, but with the output projection layer (which normally predicts over the vocabulary) replaced by a single linear layer that produces a scalar reward score. Training uses the Bradley-Terry model of paired comparisons: for a pair of responses A and B where A is preferred, minimize the negative log probability that A is ranked above B:
\[\mathcal{L}_{RM} = -\log \sigma(r_\theta(x, y_A) - r_\theta(x, y_B))\]
where r_θ is the reward model’s score for response y given prompt x, and σ is the sigmoid function. This objective pushes the reward model to assign higher scores to chosen responses and lower scores to rejected responses. The trained reward model generalizes: given any (prompt, response) pair it has never seen, it produces a scalar quality score that correlates with human preference. This generalization is the key property that makes RL training possible — without it, you could only optimize the policy on the finite set of annotated examples.
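A minimal sketch of this pairwise loss, assuming the reward model has already produced scalar scores for the chosen and rejected responses in a batch:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy scores standing in for r_theta(x, y_A) and r_theta(x, y_B).
chosen = torch.tensor([1.2, 0.3, 2.1])
rejected = torch.tensor([0.4, 0.9, 1.5])
print(reward_model_loss(chosen, rejected))
```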
The quality of the reward model is the single most important factor determining the quality of the final RLHF model. A reward model that accurately captures human preferences enables the RL stage to genuinely improve the policy. A reward model that has learned spurious correlations — long responses are better, responses with bullet points are better, confident-sounding responses are better regardless of accuracy — will lead the RL stage to produce a policy that generates long, bulleted, confident-sounding responses that humans actually find worse on careful evaluation. This failure mode is reward hacking, and it is the central technical challenge of the RL stage. Reward model quality can be evaluated by holding out a portion of the preference data and measuring how often the reward model correctly predicts the preferred response — typical production reward models achieve 70-80% accuracy on this held-out set, which sounds modest but is sufficient to provide a useful training signal.
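The held-out evaluation described above reduces to counting how often the reward model ranks the chosen response above the rejected one; a sketch:

```python
import torch

def pairwise_accuracy(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> float:
    """Fraction of held-out pairs where the chosen response outscores the rejected one."""
    return (chosen_rewards > rejected_rewards).float().mean().item()

print(pairwise_accuracy(torch.tensor([1.2, 0.3, 2.1]),
                        torch.tensor([0.4, 0.9, 1.5])))  # -> ~0.67 on this toy batch
```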
10.4 Stage 3: PPO Fine-Tuning
Proximal Policy Optimization is the reinforcement learning algorithm used to optimize the SFT model toward higher rewards. The policy is the language model: given a prompt, it generates a response. The reward signal comes from the reward model, which scores the generated response. The RL training loop: sample a prompt from the training distribution, generate a response from the current policy, score the response with the reward model, compute a policy gradient update that increases the log-probability of high-reward responses and decreases the log-probability of low-reward responses, and update the policy weights. Repeat for millions of steps. After sufficient training, the policy has learned to generate responses that systematically score higher on the reward model.
The critical stabilization mechanism is a KL divergence penalty between the current policy π and the frozen SFT model. PPO maximizes the objective:
\[J_{RLHF} = \mathbb{E}_{y \sim \pi(\cdot|x)}\left[r_\theta(x, y)\right] - \beta \cdot D_{KL}\left(\pi(y|x) \,\|\, \pi_{SFT}(y|x)\right)\]
Without this penalty, the policy rapidly learns to exploit the reward model. The RL optimizer is powerful — it can find response patterns that the reward model has been trained to score highly but that are not actually better. Common exploit patterns: the policy generates repetitive responses that happen to score well (because the reward model rarely saw repetition in training and doesn’t penalize it), or extremely long responses (if the reward model has learned any correlation between length and quality), or responses that begin with phrases strongly associated with high-quality demonstrations in the reward model’s training set. The KL penalty prevents this by penalizing the policy for drifting too far from the SFT model in terms of output distribution — it must find ways to increase reward while staying approximately in the space of SFT-like responses. The coefficient β controls the tradeoff: high β keeps the policy conservative and close to SFT; low β allows more aggressive optimization. Typical values are β ∈ [0.01, 0.1].
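In practice the KL term is usually estimated per response from the sampled tokens and folded directly into the reward before the PPO update. A sketch under that assumption, with a placeholder scalar standing in for the reward model's score:

```python
import torch

def kl_penalized_reward(rm_score: float,
                        policy_logprobs: torch.Tensor,
                        ref_logprobs: torch.Tensor,
                        beta: float = 0.05) -> torch.Tensor:
    """Reward-model score minus a KL penalty against the frozen SFT model.

    policy_logprobs / ref_logprobs: (response_len,) log-probs of the sampled
    tokens under the current policy and the SFT reference. Their difference,
    summed over the response, is a single-sample estimate of the KL term.
    """
    kl_estimate = (policy_logprobs - ref_logprobs).sum()
    return rm_score - beta * kl_estimate

# Toy values standing in for one sampled response.
policy_lp = torch.tensor([-1.1, -0.8, -2.0, -0.5])
ref_lp = torch.tensor([-1.3, -0.9, -1.7, -0.6])
print(kl_penalized_reward(rm_score=1.8, policy_logprobs=policy_lp, ref_logprobs=ref_lp))
```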
PPO training is notoriously unstable. The interaction between the policy (which is changing), the reward model (which is fixed), and the KL penalty creates a complex non-stationary optimization landscape. Common failure modes: reward collapse (policy degrades rapidly and reward score drops after initially improving), mode collapse (policy converges to a small set of response patterns that are high-reward but not diverse), and gradient instability (large gradient norms that destabilize training). Managing these requires careful learning rate scheduling, gradient clipping, advantage normalization, and frequent validation on held-out preference examples. The engineering complexity of a stable PPO pipeline is one of the primary motivations for DPO as a simpler alternative.
10.5 Direct Preference Optimization (DPO)
DPO (Rafailov et al., 2023) is a fundamental rethinking of how to use preference data. Rather than using preference pairs to train a reward model that is then used as the RL signal, DPO starts from the closed-form solution of the KL-constrained RLHF objective and shows that the implied reward can be expressed entirely in terms of the policy’s log-probability ratios against a reference model on the chosen and rejected responses. The result is a simple binary cross-entropy loss that can be applied directly to preference data without ever training a reward model or running an RL loop:
\[\mathcal{L}_{DPO} = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)} \right) \right]\]
where y_w is the preferred (winning) response, y_l is the rejected (losing) response, π_θ is the policy being trained, and π_ref is the frozen SFT reference model. Intuitively, DPO increases the relative probability of the chosen response and decreases the relative probability of the rejected response, with the ratio measured against the SFT baseline to prevent the policy from simply assigning high probability to everything (which would trivially satisfy the objective without genuine preference alignment).
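A minimal sketch of the DPO loss, assuming the summed log-probabilities of each full response under the policy and the frozen reference model have already been computed:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp: torch.Tensor, policy_rejected_lp: torch.Tensor,
             ref_chosen_lp: torch.Tensor, ref_rejected_lp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO binary cross-entropy on the implicit reward margin.

    All arguments are (batch,) summed log-probs of the full response.
    """
    chosen_logratio = policy_chosen_lp - ref_chosen_lp        # log pi/pi_ref for y_w
    rejected_logratio = policy_rejected_lp - ref_rejected_lp  # log pi/pi_ref for y_l
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy log-probs for a batch of two preference pairs.
loss = dpo_loss(policy_chosen_lp=torch.tensor([-12.0, -20.0]),
                policy_rejected_lp=torch.tensor([-15.0, -18.0]),
                ref_chosen_lp=torch.tensor([-13.0, -19.0]),
                ref_rejected_lp=torch.tensor([-14.0, -19.0]))
print(loss)
```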
The reason DPO has largely replaced PPO in many production pipelines comes down to engineering complexity and stability. DPO requires only a standard supervised training loop: forward pass, compute the DPO loss, backpropagate, update weights. No reward model to train, no RL infrastructure, no PPO-specific hyperparameters like clip ratio, value function coefficient, or GAE lambda. The training process is as stable as instruction tuning — same optimizer, same learning rate schedules, same batch size configurations. Quality comparison: DPO is competitive with PPO on most evaluated benchmarks, sometimes better, sometimes slightly worse — the gap is typically not large enough to justify PPO’s engineering complexity unless you have specific reasons to prefer RL-based training. Anthropic’s research and the open-source community’s experience have broadly confirmed that DPO is the right default starting point for preference alignment.
DPO does have a meaningful limitation: distributional alignment. The preference pairs used to train DPO should ideally be generated by the same model you are training — specifically, by the SFT model that serves as the reference policy. If the preference pairs were generated by a different, stronger model (e.g., GPT-4 generated both responses, but you are training a 7B model), the distribution mismatch between the preference data and the model’s own generation distribution can degrade training quality. The model is learning to prefer certain responses in a distribution it would not naturally generate, which reduces the training signal’s effectiveness. This is less of a concern for PPO, where the policy generates its own responses during training and receives reward signal on those. For DPO, investing in preference data generation by the specific SFT model being trained (rather than using off-the-shelf preference datasets generated by other models) often yields meaningfully better results.
10.6 Constitutional AI
Constitutional AI (Bai et al., 2022) is Anthropic’s approach to alignment that replaces human preference labels with AI-generated feedback guided by an explicit set of principles — the “constitution.” The constitution is a list of principles that the model should follow (examples: “do not assist with creating weapons of mass destruction,” “respect human autonomy and avoid being paternalistic,” “be transparent about uncertainty”). These principles are not hidden in the training process — they can be read, debated, and revised. This transparency is a significant advantage over RLHF, where the alignment values are implicitly encoded in the preferences of human annotators who may have their own biases and inconsistencies.
Constitutional AI operates in two phases. In the supervised phase, the model generates responses to potentially harmful prompts, critiques its own responses according to the constitutional principles (identifying which principles were violated and why), revises the response to better comply with the principles, and is then fine-tuned on the revised responses. This creates a self-improvement loop: the model uses its language understanding to identify its own misalignments and generate improved examples. The quality of this process depends on the model’s ability to accurately critique its own outputs against the principles — more capable models produce better critiques and revisions. In the reinforcement learning phase, the constitutional principles are used to train a preference model from AI-generated comparisons (RLAIF — Reinforcement Learning from AI Feedback) rather than human labels. The AI rater is asked to judge which of two responses better follows the constitutional principles, and these AI preference labels are used to train a reward model, which is then used in standard RL training.
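A schematic of the supervised critique-and-revision loop; `generate` is a hypothetical stand-in for whatever model-inference call you use, the principle texts are illustrative rather than Anthropic's actual constitution, and the loop over principles simplifies the paper's random sampling of one principle per revision round:

```python
def generate(prompt: str) -> str:
    """Hypothetical placeholder for a call to the model being aligned."""
    return "..."  # replace with a real inference call

# Illustrative principles, not the actual constitution.
CONSTITUTION = [
    "Do not assist with creating weapons of mass destruction.",
    "Be transparent about uncertainty.",
]

def critique_and_revise(prompt: str) -> str:
    """One round of the Constitutional AI supervised phase for a single prompt."""
    response = generate(prompt)
    for principle in CONSTITUTION:
        # Ask the model to critique its own response against the principle.
        critique = generate(
            f"Response: {response}\n\nCritique this response against the "
            f"principle: '{principle}'. Identify any violations."
        )
        # Ask the model to rewrite the response to comply with the principle.
        response = generate(
            f"Response: {response}\n\nCritique: {critique}\n\n"
            f"Rewrite the response so it fully complies with the principle."
        )
    return response  # (prompt, revised response) pairs form the fine-tuning dataset
```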
The scalability advantage of Constitutional AI is substantial. Human preference annotation is expensive: gathering a large, high-quality preference dataset requires hundreds of skilled annotators working for months. AI-generated feedback can be produced at a fraction of the cost — the primary expense is the compute for generating and rating responses. This scalability enables a much larger and more diverse preference dataset, which in turn enables a more robust reward model and final policy. A key concern about RLAIF is whether AI-generated preferences are as high-quality as human preferences. Empirical comparisons have found that CAI-trained models using RLAIF produce preferences and outputs that humans rate similarly to or better than models trained with human feedback, suggesting that the feedback quality is comparable when the AI rater is sufficiently capable. The practical conclusion for Anthropic’s Claude models: Constitutional AI enables alignment at a scale and transparency level that pure human-annotation RLHF cannot match.
10.7 Interview Questions
Q1. What is RLHF and why is it used?
Reinforcement Learning from Human Feedback (RLHF) is a training technique that fine-tunes language models using human preference signals rather than static labeled data. It is used because supervised fine-tuning on demonstrations alone produces models that are inconsistent, sometimes unsafe, and prone to sycophancy — agreeing with users even when they are wrong. Human preference data captures subtle quality distinctions that are difficult to specify as explicit instructions: it is easier for a human to say “response A is better than response B” than to write out all the rules that make A better. RLHF uses this preference data in two ways: first to train a reward model that can score any (prompt, response) pair, and then to optimize the language model with reinforcement learning to generate responses that the reward model rates highly. The result is a model that has been steered toward patterns that humans prefer, covering the vast space of possible inputs more effectively than finite demonstration datasets. Modern chat models — ChatGPT, Claude, Gemini — are all trained using variants of RLHF or its successors.
Q2. What is the role of the reward model in the RLHF pipeline?
The reward model serves as an automated surrogate for human judgment, providing a scalar quality score for any (prompt, response) pair without requiring a human to evaluate each response during RL training. It is trained on human preference data: given many pairs of responses where humans labeled one as better, the reward model learns to predict human preference. Once trained, the reward model generalizes to novel (prompt, response) pairs it has never seen, scoring them based on the learned patterns of quality. This generalization is what makes RL training possible — the policy generates new responses at every training step, and having a reward model means each generated response can be immediately scored without human involvement. The reward model is the bottleneck of the RLHF pipeline: if it accurately represents human preferences, the RL training produces a genuinely better model; if it has biases or failure modes, the RL training exploits those biases through reward hacking. The reward model’s quality determines the ceiling of what RLHF can achieve.
Q3. Walk through the three stages of RLHF (SFT → RM → PPO) and explain what each stage contributes.
Stage one, Supervised Fine-Tuning (SFT): human contractors write ideal responses to a diverse set of prompts. The base pretrained model is fine-tuned on these (prompt, ideal_response) pairs using standard supervised learning. SFT teaches the model the conversational format, basic instruction following, and general behavioral patterns of a helpful assistant. Without SFT, the base model cannot even maintain the structure of a conversation. SFT alone produces a much-improved model, but it is inconsistent, sometimes unsafe on long-tail inputs, and prone to sycophancy — it has not yet learned robust value judgments. Stage two, Reward Model training: human raters compare pairs of responses to the same prompt and label which is better. A reward model (same architecture as the LLM but with a scalar output head) is trained on these preferences to predict human preference for any response. The reward model learns a general quality signal that covers inputs the preference data collection never saw. Stage three, PPO: the SFT model is further fine-tuned using the reward model as a feedback signal. The policy generates responses, the reward model scores them, and policy gradient updates increase the probability of high-scoring responses. A KL penalty against the SFT model prevents over-optimization and reward hacking. PPO adds robustness, consistency on edge cases, and value-aligned behavior that SFT alone cannot provide.
Q4. What is DPO and how does it differ from PPO-based RLHF? Why has it largely replaced PPO?
Direct Preference Optimization (DPO) is a technique that learns from preference pairs directly without training a separate reward model or running an RL training loop. It shows mathematically that the reward implicit in the RLHF objective can be recovered from the policy’s own log-probability ratios, relative to a reference policy, on preferred versus rejected responses. This derivation yields a simple binary cross-entropy loss that can be computed directly from (prompt, chosen, rejected) triples, using only the policy being trained and a frozen reference SFT model. In contrast, PPO-based RLHF requires three separate components: a reward model (requiring its own training run), a reference SFT model (frozen), and the policy (being trained by RL). DPO reduces this to two components: the reference and the policy. DPO has largely replaced PPO for several reasons. First, simplicity: DPO uses a standard supervised training loop with a modified loss function — same tooling, same stability properties as instruction tuning. PPO introduces a complex RL training loop with many hyperparameters and common failure modes including reward collapse and mode collapse. Second, comparable quality: empirical comparisons find DPO competitive with PPO on most benchmarks. Third, lower infrastructure cost: no reward model training infrastructure needed. The remaining cases where PPO is preferred are online RL settings where you need the policy to generate its own preference data (iterative RLHF), or tasks where online exploration provides a significant advantage.
Q5. What is “reward hacking” and how does the KL penalty prevent it?
Reward hacking occurs when the policy being trained discovers response patterns that achieve high scores on the reward model without actually being high-quality responses. The reward model is an imperfect proxy for human preference — it has been trained on a finite dataset and has learned some spurious correlations in addition to genuine quality signals. An RL optimizer is powerful enough to find and exploit these correlations. Common examples: generating excessively long responses if the reward model has a length-quality correlation, repeating phrases associated with high-quality demonstrations in the reward model’s training data, generating confident-sounding responses even when confidence is unwarranted, or using polished formatting (headers, bullet points) that scores well aesthetically without improving content quality. The KL penalty adds a term to the training objective that penalizes the policy for generating a response distribution that diverges significantly from the SFT reference model’s distribution, measured in KL divergence. This prevents the policy from drifting into the narrow region of response patterns that exploit the reward model, because those patterns are far from the SFT model’s distribution and would incur a high KL penalty. The coefficient β controls the tradeoff: high β prioritizes staying close to the SFT model; low β allows more aggressive reward optimization. The right β depends on the quality of the reward model — a higher-quality reward model can tolerate lower β with less risk of hacking.
Q6. Compare RLHF and Constitutional AI on these dimensions: data requirements, scalability, transparency, and quality.
Data requirements: RLHF requires large-scale human preference annotation — typically hundreds of thousands of (prompt, chosen, rejected) triples generated by human raters. This is expensive and slow to collect. Constitutional AI (CAI) replaces human preference labels with AI-generated feedback guided by constitutional principles; the primary data requirement is the set of prompts and the constitution itself, not extensive human annotation. Scalability: RLHF scales poorly beyond a certain point because human annotation is the bottleneck — you can only collect as many preference labels as you have human bandwidth. CAI scales essentially unboundedly: the AI rater can generate preference comparisons at the speed of inference, limited only by compute. This is why Anthropic can train larger and more thoroughly aligned models with CAI than would be feasible with pure human-annotation RLHF. Transparency: RLHF embeds values implicitly in annotator preferences, which are not auditable or inspectable. Different annotators may have inconsistent values, and the resulting model reflects their averaged preferences in an opaque way. CAI makes the values explicit: the constitution is a readable document that specifies exactly what principles the model is trained to follow. This enables principled debate, revision, and auditing of the model’s values. Quality: empirical comparisons between RLHF-trained and CAI-trained models show comparable quality, with some findings favoring CAI for handling sensitive topics where constitutional guidance provides more consistent behavior than noisy human preference labels.
Q7. A customer’s chatbot is technically accurate but sounds cold, robotic, and unhelpful in tone. Which alignment technique would you recommend and why?
This is a stylistic and behavioral alignment problem, not a knowledge or safety problem. The model has the right information but is not expressing it in the right way. The most targeted solution is DPO on a preference dataset that captures the desired tone. The process: first, sample 500-1,000 responses from the current model on your production prompt distribution. Second, have a small number of domain experts label pairs of responses — rating one response as better than the other based specifically on tone, warmth, and helpfulness — or have them write one improved version alongside the model’s original (forming a pair). Third, train DPO using these pairs, with the current model as both the reference policy and the starting point for training. DPO is preferred over PPO here because the quality signal is subtle and well-defined (raters are judging tone, not safety), the infrastructure is simpler, and the risk of reward hacking is lower when the preference signal is this focused. An alternative worth considering is targeted SFT on a set of rewritten examples: take the model’s cold responses and have domain experts rewrite them in the desired tone, then SFT on (prompt, warm_response) pairs. This is simpler than DPO and works well when the desired style is consistent and easily demonstrated. The decision between DPO and SFT here depends on data volume: with under 500 examples, SFT is lower variance; with 500-5,000 pairs, DPO typically produces better results. Do not use PPO — the engineering cost is not justified for a tone problem.
Q8. How would you explain to a non-technical product manager the difference between safety alignment and capability alignment? What are the business implications of getting each wrong?
Safety alignment is training the model to refuse harmful requests and avoid producing dangerous content — like a customer service rep who knows which requests to escalate and which to decline. Capability alignment is training the model to be genuinely useful and correctly complete the tasks users actually need — like making sure that same customer service rep gives accurate, helpful answers rather than scripted non-answers. Getting safety alignment wrong in the under-aligned direction means the model assists with harmful requests, creates legal liability, generates bad press, and potentially causes real-world harm to users or third parties. This is the obvious failure mode. Getting safety alignment wrong in the over-aligned direction — which is less discussed but equally damaging — means the model refuses too much, hedges excessively, and provides watered-down answers that frustrate users into abandoning the product. Over-refusal is a business failure: users will churn to a competitor that actually helps them. Getting capability alignment wrong means the model technically responds without refusing, but the responses are wrong, unhelpful, or formatted poorly for the use case. This is the most common failure mode in production: the model is not dangerous, but it is not actually solving the user’s problem. The business implication is lower task completion rates, lower user satisfaction scores, and ultimately lower retention. The practical lesson: treat over-refusal and under-capability as seriously as harmful outputs in your evaluation framework. All three are alignment failures with direct product consequences.