26  Modern Alignment: DPO Variants, GRPO & Beyond

Note

Who this chapter is for: Mid Level / Forward Deployed Engineer

What you’ll be able to answer after reading this:

  • Why DPO has a length bias and a distribution constraint that variants like IPO, KTO, SimPO, and ORPO are designed to fix
  • How GRPO replaces both the critic model and the reference model with intra-group advantage estimation
  • The practical difference between offline and online preference learning, and when distribution gap matters
  • Which alignment technique to reach for given a specific production constraint (paired data, unpaired data, compute budget, conciseness requirement)
  • Why PPO is rarely used in production fine-tuning pipelines despite being the canonical RL algorithm

26.1 DPO Recap and Its Limitations

Direct Preference Optimization, covered in the prior alignment chapter, eliminates the separate reward model and RL training loop of PPO-based RLHF by deriving a closed-form supervised objective directly from the Bradley-Terry preference model. The key insight is that a reward function and a policy are related by the KL-regularized objective, allowing the reward to be expressed entirely in terms of the policy itself. The resulting DPO loss trains the policy to increase the log-likelihood ratio between the chosen and rejected responses, weighted implicitly by the KL penalty against a frozen reference model.
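
To make the objective concrete, here is a minimal sketch of the DPO loss in PyTorch. It assumes the caller has already computed summed per-sequence log-probabilities for the chosen and rejected responses under both the policy and the frozen reference model; the tensor names and default β are illustrative, not a reference implementation.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: beta-scaled log-ratios against the frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry cross-entropy: push the chosen reward above the rejected reward.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```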

This is elegant and efficient, but the derivation carries forward several constraints that show up as practical failure modes at production scale. The first is the distribution constraint: DPO is an offline algorithm. It trains on a static dataset of (prompt, chosen, rejected) triples collected before training begins. The distribution of prompts in the dataset — and the responses generated by whatever model produced the rejected samples — is fixed. As the policy drifts during training, the responses it would now generate for those prompts become increasingly different from the responses that were actually labeled. The model is optimizing a loss defined over a distribution it is no longer sampling from. This distribution gap is mild early in training but compounds as alignment improves, causing the later stages of training to be less effective.

The second limitation is the length bias. The DPO objective increases the probability of chosen responses and decreases the probability of rejected responses. Human preference annotations, collected without controlling for length, strongly favor longer responses — annotators perceive length as thoroughness even when the added length contains no new information. Because longer responses are systematically chosen, the DPO loss pushes the model toward generating longer outputs. This is measurable: models aligned with vanilla DPO produce responses 20-40% longer on average than the SFT baseline for the same prompts. In production settings where token cost matters — customer service bots, API-facing deployments — this is a direct operational cost, not just an aesthetic concern.

The third limitation is that DPO’s KL term provides only implicit regularization. The KL divergence between the policy and the reference model is controlled indirectly through the β hyperparameter in the objective, but in practice the policy can move arbitrarily far from the reference model on responses not well covered by the training data. Overfit models develop degenerate behavior on inputs that are poorly represented in the preference dataset. IPO and subsequent variants address each of these constraints through different mechanisms, representing a research trajectory from DPO’s elegant-but-constrained formulation toward more robust offline and online alignment objectives.

26.2 DPO Variants

26.2.1 IPO: Identity Preference Optimization

IPO addresses the overfitting problem in DPO by replacing the implicit KL regularization with an explicit regularization term. In DPO, the reference model enters through the log-ratio terms: the loss encourages the policy to assign higher probability to chosen responses relative to the reference and lower probability to rejected responses. When the policy has seen a preference pair many times, it can drive the log-ratio to large values — saturating the sigmoid in the loss — without incurring any additional penalty. This is DPO’s overfitting failure mode: the model perfectly memorizes the preference dataset while generalizing poorly.

IPO modifies the objective to directly penalize deviations from a target log-ratio. Formally, rather than minimizing cross-entropy over a Bradley-Terry model, IPO minimizes the squared difference between the chosen-versus-rejected log-ratio gap and a target value of 1/(2τ), where τ is the regularization strength. This keeps the log-ratios bounded throughout training and prevents the saturation problem. The practical effect: IPO is significantly more stable on small preference datasets where DPO would overfit aggressively. When you have fewer than 10,000 preference pairs, IPO will generally produce a more robust model. The tradeoff is a hyperparameter τ that requires tuning; DPO’s β is also a hyperparameter, but training outcomes are less sensitive to its precise value.
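
Under the same conventions as the DPO sketch above (summed per-sequence log-probs precomputed by the caller), the IPO objective differs only in the final line. This is an illustrative sketch:

```python
def ipo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, tau=0.1):
    # Chosen-minus-rejected log-ratio gap, measured against the frozen reference.
    gap = (policy_chosen_logps - ref_chosen_logps) - (policy_rejected_logps - ref_rejected_logps)
    # Squared error around the fixed target 1/(2*tau): the gradient never
    # vanishes, so the gap cannot saturate the way DPO's sigmoid does.
    return ((gap - 1.0 / (2.0 * tau)) ** 2).mean()
```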

26.2.2 KTO: Kahneman-Tversky Optimization

KTO takes a fundamentally different angle: it abandons the requirement for paired (chosen, rejected) comparisons entirely. Instead of training on response pairs, KTO trains on individual examples each labeled with a binary signal — desirable or undesirable. The theoretical grounding is Kahneman and Tversky’s prospect theory, which models human decision-making as an asymmetric value function: losses are perceived as more impactful than equivalent gains. KTO translates this into an alignment objective where the utility of a model output is measured against a reference point (the KL-penalized expected value of the policy’s outputs), and where undesirable outputs are penalized more sharply than desirable outputs are rewarded.

The practical advantage is data efficiency in a specific sense: you can use any labeled data you have, whether or not it comes in matched pairs. A dataset of 10,000 individual annotations (“this response was good,” “this response was bad”) is unusable by DPO, which requires the good and bad responses to come from the same prompt. KTO can train directly on these unpaired annotations, effectively doubling the usable data from a given annotation budget. In production settings where preference annotation is expensive and response-pair collection requires coordinated labeling workflows, KTO lowers the operational bar substantially. The empirical tradeoff: on datasets where paired comparisons are available, DPO still holds a slight edge because the paired structure provides a stronger learning signal per example. KTO’s advantage is specifically on unpaired or partially-paired annotation corpora.
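
A heavily simplified sketch of a KTO-style loss on unpaired examples, included only to show the asymmetric structure; the reference-point estimate and the loss-aversion weights below are illustrative placeholders, not the published KTO recipe.

```python
import torch

def kto_loss(policy_logps, ref_logps, is_desirable, beta=0.1,
             lambda_d=1.0, lambda_u=1.33):
    # Per-example log-ratio against the frozen reference; examples are unpaired.
    log_ratio = policy_logps - ref_logps
    # Reference point: a crude batch-level stand-in for the KL estimate used in
    # the published recipe (illustrative simplification).
    z0 = log_ratio.detach().mean().clamp(min=0.0)
    value = torch.where(
        is_desirable,
        lambda_d * torch.sigmoid(beta * (log_ratio - z0)),  # desirable: gains
        lambda_u * torch.sigmoid(beta * (z0 - log_ratio)),  # undesirable: losses
    )
    # Maximize utility; weighting lambda_u > lambda_d penalizes undesirable
    # outputs more sharply than desirable outputs are rewarded (loss aversion).
    return (1.0 - value).mean()
```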

26.2.3 ORPO: Odds Ratio Preference Optimization

ORPO removes the reference model entirely and fuses SFT and alignment into a single training pass. The key observation is that the reference model in DPO serves as an anchor: it prevents the policy from drifting too far from the SFT baseline. But this anchor is already implicit in a combined SFT+alignment loss — the SFT term pulls the model toward the chosen responses, serving the same regularization function. ORPO’s loss adds an odds ratio penalty term to the standard SFT cross-entropy loss. The odds ratio measures how much more likely the model is to generate the chosen response than the rejected response, and minimizing the penalty pushes this ratio up. Because SFT and alignment happen simultaneously, there is no reference model checkpoint to store or load during training.

The compute savings are meaningful: eliminating the reference model forward pass reduces peak GPU memory by roughly 50% for a given batch. This makes ORPO practical for alignment runs on hardware without high memory bandwidth — a 70B model that would require 8 A100s for DPO can be aligned with 4 A100s using ORPO with the same batch size. The limitation is that ORPO assumes the SFT data distribution is close to the preferred response distribution, which is typically true when fine-tuning on a curated instruction dataset but may not hold when adapting a general base model to a narrow domain with preferences that differ substantially from the pretraining distribution.
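
A minimal sketch of the combined loss, assuming the caller supplies token-averaged log-probabilities of the chosen and rejected responses under the current policy; the weighting λ is illustrative.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps_avg, rejected_logps_avg, lam=0.1):
    # log-odds of a response from its token-averaged log-probability:
    # log(p / (1 - p)) with p = exp(avg log-prob).
    def log_odds(logp):
        return logp - torch.log1p(-torch.exp(logp))
    # SFT term: token-averaged negative log-likelihood of the chosen response.
    sft_loss = -chosen_logps_avg.mean()
    # Odds-ratio penalty: push the chosen odds above the rejected odds.
    or_loss = -F.logsigmoid(log_odds(chosen_logps_avg) - log_odds(rejected_logps_avg)).mean()
    # No reference model anywhere: the SFT term itself is the anchor.
    return sft_loss + lam * or_loss
```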

26.2.4 SimPO: Simple Preference Optimization

SimPO removes the reference model by a different mechanism: it replaces the log-ratio against the reference with the average log-probability of the response under the current policy, normalized by sequence length. This normalization directly addresses DPO’s length bias. In DPO, the implicit reward is a sum of per-token log-probability terms over the whole response, so longer responses contribute more terms to the sum and length becomes correlated with reward. SimPO normalizes by length, making the reward signal a per-token average log-probability. A concise, high-quality response and a verbose, padded response of equal per-token quality receive equal reward.

SimPO also introduces a target reward margin γ: the objective requires the chosen response’s average log-probability to exceed the rejected response’s average log-probability by at least γ, rather than merely being higher. This margin creates a safety buffer that improves calibration — the model does not just prefer chosen over rejected, it prefers them by a specified amount. In practice, γ is a hyperparameter typically set between 0.5 and 2.0. Models trained with SimPO consistently produce shorter, more concise responses than DPO-trained models on the same preference data, making it the preferred choice when deployment latency and token cost are primary concerns.
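
In code, the whole objective reduces to a few lines. A sketch assuming the caller supplies per-token average log-probabilities; the β and γ defaults are illustrative:

```python
import torch.nn.functional as F

def simpo_loss(chosen_logps_avg, rejected_logps_avg, beta=2.0, gamma=1.0):
    # Length-normalized implicit rewards: per-token average log-probs under the
    # current policy only -- no reference model enters the loss.
    chosen_reward = beta * chosen_logps_avg
    rejected_reward = beta * rejected_logps_avg
    # The chosen response must beat the rejected one by at least the margin gamma.
    return -F.logsigmoid(chosen_reward - rejected_reward - gamma).mean()
```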

26.3 GRPO: Group Relative Policy Optimization

GRPO is the alignment algorithm used in DeepSeek-R1 and represents a departure from both the DPO family (offline, no RL loop) and classic PPO-based RLHF (online, requires critic model). The core idea: instead of estimating the advantage of a response by comparing it to a learned value function (PPO) or to a reference model (DPO), compare responses within a group sampled from the current policy for the same prompt.

For each training prompt, GRPO samples a group of G responses from the current policy — typically G between 4 and 16. It then computes a reward for each response using a reward function, and normalizes the rewards within the group. The normalized reward becomes the advantage estimate: a response that is better than the group average has a positive advantage; one that is worse has a negative advantage. The policy gradient update uses these within-group advantages to reinforce better-than-average responses and suppress worse-than-average ones.

\[\mathcal{L}_{GRPO} = -\mathbb{E}\left[\sum_{i=1}^{G} \frac{\hat{A}_i}{\sigma_A} \log \pi_\theta(y_i | x)\right] + \beta \cdot D_{KL}(\pi_\theta \| \pi_{ref})\]

where \(\hat{A}_i = r_i - \bar{r}\) is the centered advantage for response \(i\) within the group, \(\bar{r}\) is the group mean reward, and \(\sigma_A\) is the within-group standard deviation of the rewards.
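
A sketch of the group-relative advantage computation and a simplified policy-gradient form of the loss above; `response_logps` are sequence log-probs of the G sampled responses under the current policy, `kl_to_ref` is an externally computed KL estimate, and the names and β default are illustrative.

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    # Within-group normalization: center and scale by the group's own statistics.
    # rewards has shape (G,), one scalar per sampled response for the same prompt.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(response_logps, rewards, kl_to_ref, beta=0.04):
    # Simplified policy-gradient form of the objective above; the full GRPO
    # update also uses PPO-style importance ratios and clipping.
    adv = grpo_advantages(rewards).detach()
    return -(adv * response_logps).mean() + beta * kl_to_ref
```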

The advantage of this design: no critic model (value function) is required. In PPO, the critic model — typically another copy of the LLM — estimates the value of being in a state and is used to compute advantage via generalized advantage estimation. Training the critic adds memory and compute overhead, introduces a separate optimization problem that must be solved stably alongside the actor, and requires careful hyperparameter tuning for the value loss coefficient. GRPO sidesteps all of this by using empirical within-group statistics. The advantage estimate is noisier than a well-trained critic would produce, but it is unbiased and does not require a separate model.

The reward signal in GRPO, as used in DeepSeek-R1, is rule-based rather than a learned reward model: mathematical correctness (verified against ground-truth answers), format compliance (did the model produce a correctly structured chain-of-thought inside <think> tags), and length penalties. Rule-based rewards are deterministic and do not suffer from reward model hacking — the failure mode where a policy exploits quirks in a learned reward model to achieve high reward without actually improving in the desired way. For tasks where correctness is verifiable (math, code, structured outputs), rule-based GRPO is both simpler and more reliable than learned reward model approaches.
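
As an illustration of what a rule-based reward can look like for a verifiable task; the tag convention and the specific weights below are assumptions, not DeepSeek-R1’s exact recipe.

```python
import re

def rule_based_reward(response: str, ground_truth: str) -> float:
    # Illustrative rule-based reward: format compliance plus verifiable correctness.
    reward = 0.0
    match = re.search(r"<think>.*?</think>\s*(.*)", response, re.DOTALL)
    if match:
        reward += 0.5                      # reasoning is wrapped in <think> tags
        answer = match.group(1).strip()
        if answer == ground_truth.strip():
            reward += 1.0                  # final answer matches the verifier
    return reward
```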

26.4 Online vs. Offline Preference Learning

All DPO variants discussed above are offline algorithms: they train on a fixed dataset of preference pairs that was collected before training began. The model never generates its own responses during training; it only adjusts its probabilities over responses that already exist in the dataset. This creates a structural limitation — as the model improves, the rejected responses in the training set become increasingly easy to distinguish from the model’s current outputs. The learning signal weakens. Later training steps are spent optimizing over pairs that no longer represent the model’s typical failure modes.

Online DPO closes this gap by generating new responses during training. The training loop runs as follows: sample a batch of prompts; generate two responses from the current policy; score both responses using a reward model or rule-based judge; label the higher-scoring response as chosen and the lower-scoring as rejected; add the fresh pair to the training batch; update the policy. Because the rejected responses come from the current policy rather than from a static dataset, they are always on the current model’s distribution. The gradient signal is always diagnostic — it shows the model examples of mistakes it is currently making rather than mistakes it made in a previous iteration.
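
The loop described above, written out as a sketch. All three callables are hypothetical stand-ins for whatever sampling, scoring, and DPO-update machinery the training stack provides.

```python
def online_dpo_step(generate, score, dpo_update, prompts):
    # `generate`, `score`, and `dpo_update` are hypothetical callables standing
    # in for the sampling, reward-model, and DPO-training parts of the stack.
    batch = []
    for prompt in prompts:
        # Two samples from the *current* policy keep the pair on-distribution.
        a, b = generate(prompt), generate(prompt)
        chosen, rejected = (a, b) if score(prompt, a) >= score(prompt, b) else (b, a)
        batch.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    dpo_update(batch)  # one gradient step on the freshly generated pairs
    return batch
```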

Iterative DPO is a lightweight approximation to fully online DPO: alternate between (1) running the current policy on all training prompts to generate fresh responses, (2) scoring responses with a reward model, (3) building a new preference dataset, (4) running one epoch of offline DPO on the new dataset. Each outer iteration of this loop costs one full generation pass plus one training pass, so the compute overhead is roughly 2x offline DPO. The distribution gap is substantially reduced because the preference dataset is regenerated periodically. Fully online DPO, where responses are generated continuously during training, eliminates the distribution gap entirely but requires careful engineering to maintain training stability when the policy and the data distribution are co-evolving.
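
The same idea as an outer loop; again the callables are hypothetical placeholders, and the sample count per prompt and the number of rounds are illustrative.

```python
def iterative_dpo(generate, score, dpo_train_epoch, prompts, rounds=3, k=4):
    # Hypothetical outer loop: regenerate the preference dataset each round,
    # then run one epoch of ordinary offline DPO on it.
    for _ in range(rounds):
        dataset = []
        for prompt in prompts:
            responses = [generate(prompt) for _ in range(k)]            # (1) fresh samples
            ranked = sorted(responses, key=lambda r: score(prompt, r))  # (2) score them
            dataset.append({"prompt": prompt,
                            "chosen": ranked[-1],                       # (3) best vs. worst
                            "rejected": ranked[0]})
        dpo_train_epoch(dataset)                                        # (4) offline DPO epoch
```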

26.5 The Practical Alignment Stack in 2025

Production alignment pipelines in 2025 have largely converged on SFT followed by an offline or iterative DPO variant, with PPO reserved for narrow use cases. The reasons PPO fell out of favor are practical rather than theoretical. PPO requires running the actor, critic, reference model, and reward model simultaneously during training — for a 70B model, this can require 16+ A100 GPUs just for a single training step. The training loop is sensitive to hyperparameter choices (clip ratio, KL coefficient, GAE parameters), and instability manifests as reward hacking or policy collapse rather than obvious training failures. Debugging PPO runs at scale requires specialized expertise. By contrast, DPO variants are just supervised learning with a modified loss — they are stable, debuggable with standard tooling, and do not require a critic or reward model in training.

Constitutional AI and RLAIF (Reinforcement Learning from AI Feedback) address the scalability bottleneck of human annotation. Instead of hiring annotators to rank responses, you use a strong model (Claude, GPT-4) to apply a set of principles (the constitution) to critique and revise responses, and to generate preference labels on (chosen, rejected) pairs. The resulting preference dataset can be generated at a scale and cost that human annotation cannot match. The quality of the preference labels depends on the quality of the judge model and the clarity of the constitutional principles, but for most production use cases RLAIF-generated labels are indistinguishable from human labels on standard preference benchmarks.
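
A sketch of the labeling step only (the critique-and-revise stage is omitted); the judge wrapper and prompt wording are assumptions about a generic stack, not any vendor’s API.

```python
def rlaif_preference(judge, constitution, prompt, resp_a, resp_b):
    # `judge` is a hypothetical callable wrapping a strong judge model; it takes
    # a text prompt and returns the model's text verdict.
    verdict = judge(
        f"Principles:\n{constitution}\n\n"
        f"User prompt: {prompt}\n\nResponse A: {resp_a}\n\nResponse B: {resp_b}\n\n"
        "Which response better follows the principles? Answer with 'A' or 'B'."
    )
    a_wins = verdict.strip().upper().startswith("A")
    return (resp_a, resp_b) if a_wins else (resp_b, resp_a)  # (chosen, rejected)
```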

Self-play fine-tuning (SPIN) represents another direction: eliminate the preference dataset entirely by reusing the SFT data. The model is trained to distinguish the human-written demonstrations from responses generated by a previous iteration of itself. The “winner” in each pair is always the human-written reference response; the “loser” is what the previous iteration of the policy generated for the same prompt. This formulation does not require any new annotation — the human-written SFT data is reused as the chosen response, and the model’s own generations serve as rejected responses. SPIN converges when the policy matches the SFT data distribution well enough that the discriminator cannot distinguish them, at which point the policy is effectively aligned with the human demonstrations. It is less capable than reward-model-based alignment at instilling values not represented in the SFT data, but it requires zero additional annotation.

26.6 Interview Questions

Entry Level

Q1. What is the difference between DPO and RLHF/PPO in plain terms?

RLHF with PPO is a three-stage process: first, fine-tune the model on demonstrations (SFT); second, train a separate reward model on human preference pairs; third, use PPO — a reinforcement learning algorithm — to update the policy so it generates responses the reward model scores highly, while staying close to the original SFT model. The policy, the reward model, and a reference copy of the policy all need to be in memory simultaneously during PPO training.

DPO achieves the same goal with a single supervised training step. The mathematical insight is that the PPO objective with a KL penalty has a closed-form optimal policy: the reference model reweighted by the exponentiated reward. Rearranging this relationship lets the reward be written as a function of the log-probability ratio between the policy and the reference model, which means you can train the policy directly to prefer chosen responses over rejected responses without ever computing a reward model explicitly. The loss is a binary cross-entropy over preference pairs, which is just supervised learning.
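
For reference, the relationship described here can be written out as the standard derivation sketch; the partition term \(\beta \log Z(x)\) cancels when two responses to the same prompt are compared, which is what makes the pairwise loss tractable:

\[\pi^*(y \mid x) = \frac{1}{Z(x)}\,\pi_{ref}(y \mid x)\,\exp\!\left(\frac{r(x,y)}{\beta}\right) \quad\Longleftrightarrow\quad r(x,y) = \beta \log\frac{\pi^*(y \mid x)}{\pi_{ref}(y \mid x)} + \beta \log Z(x)\]

\[\mathcal{L}_{DPO} = -\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_c \mid x)}{\pi_{ref}(y_c \mid x)} - \beta\log\frac{\pi_\theta(y_r \mid x)}{\pi_{ref}(y_r \mid x)}\right)\]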

The practical difference: DPO needs the frozen reference model (one forward pass per example) but no reward model, no value function, and no RL loop. Training is far more stable, requires less GPU memory, and converges predictably. The tradeoff is that DPO is offline — it only trains on examples in the dataset and cannot explore new responses during training the way PPO can.

Entry Level

Q2. What does KTO offer that DPO doesn’t?

DPO requires preference pairs: for a given prompt, you need one “good” response and one “bad” response collected together. This means annotation workflows must produce matched pairs — an annotator sees two responses to the same prompt and marks one as better. If you have collected individual binary labels (“this response was acceptable,” “this response was unacceptable”) without pairing them, you cannot use DPO.

KTO trains directly on individual binary labels without pairing. Each training example is a single (prompt, response, label) triple where the label is simply desirable or undesirable. This matches a much more common annotation pattern: log a large number of real user interactions, have annotators rate each response independently, and train on the resulting dataset. No coordinated pairing workflow is needed.

The theoretical basis for KTO is prospect theory’s asymmetric value function — the model is trained with a utility function where undesirable responses are penalized more sharply than desirable responses are rewarded, mirroring how humans experience losses more strongly than gains. In practice, on a fixed annotation budget, KTO can use roughly 2x as many labeled examples as DPO because unpaired annotations are cheaper to collect. The learning signal per example is weaker than a paired comparison, but the larger dataset often compensates.

Entry Level

Q3. What is online DPO and why is it better than standard offline DPO?

Standard offline DPO trains on a preference dataset collected before training starts. The rejected responses in the dataset were generated by some earlier version of the model (or a different model entirely). As training proceeds and the model improves, the distribution gap grows: the model no longer makes the same mistakes that are represented in the training data. The learning signal weakens because the training examples are no longer representative of the model’s current failure modes.

Online DPO generates new responses during training using the current policy. For each training batch, the model generates responses to the prompts, a reward model scores them, the higher-scoring response is labeled as chosen and the lower as rejected, and the model is immediately trained on this fresh preference pair. Because the rejected responses come from the current policy, the training data always reflects what the model is currently doing wrong. The gradient is always informative.

The practical result: online DPO achieves higher final alignment quality for the same number of training steps, especially in the later stages of training where offline DPO stagnates. The cost is the overhead of running generation and scoring during training, which roughly doubles the compute per step. Iterative DPO is a middle ground: regenerate the preference dataset every few hundred steps rather than continuously, balancing the distribution gap against the extra computation.

Mid Level

Q1. Explain GRPO and why it doesn’t need a critic model, unlike PPO.

In PPO, the advantage of taking action a in state s is estimated as the actual reward received minus the value function V(s) — the expected reward from state s under the current policy. Computing V(s) requires a separate critic model, typically another copy of the LLM, that is trained to predict expected rewards. This doubles the model count in memory and introduces a second optimization loop that must remain stable and synchronized with the actor’s updates.

GRPO eliminates the critic by estimating advantage empirically within a group. For each training prompt, GRPO samples G responses from the current policy, computes a reward for each (using a rule-based verifier or a learned reward model), and then computes the advantage of each response as its reward minus the group mean, divided by the group standard deviation. This is a within-group z-score. Responses that score better than the group average get positive advantage; those that score worse get negative advantage.

The key property: this within-group normalization is an unbiased estimator of the advantage without requiring any learned value function. The intuition is that the group of G responses to the same prompt is a sample from the policy’s output distribution for that prompt, so the sample mean is an empirical estimate of the expected reward — exactly what a critic would compute. For G=8 or G=16, the estimate is noisy but sufficient for a stable policy gradient. No critic training, no value loss hyperparameter, no worry about the critic lagging the actor. The tradeoff is higher variance in the advantage estimates compared to a well-trained critic, which means GRPO may need more training samples to achieve the same update quality.

Mid Level

Q2. Compare IPO vs. DPO — what specific overfitting problem does IPO address?

DPO’s loss is a binary cross-entropy of the form \(-\log\sigma(\beta \cdot (r_\theta(y_c) - r_\theta(y_r)))\), where \(r_\theta\) is the log-ratio of the policy to the reference model for each response. As training progresses, the model can drive this log-ratio difference to very large values — the sigmoid approaches 1, the gradient approaches 0, and the loss flattens out near zero. The model has “memorized” the preference pair in the sense that it assigns arbitrarily higher probability to the chosen response over the rejected one. On a small dataset, this happens quickly, and the model begins to overfit: it correctly handles prompts that appear in the preference dataset but generalizes poorly to similar prompts not in the dataset.

IPO replaces the saturating sigmoid objective with a squared loss that penalizes the chosen-minus-rejected log-ratio gap for deviating from a constant target value \(1/(2\tau)\). Because the loss is a squared error around a fixed target, it does not saturate — there is always a non-zero gradient pulling the log-ratio gap toward the target, preventing the extreme values that cause DPO’s overfitting. The regularization is explicit and direct rather than depending on the KL term to indirectly keep probabilities bounded.
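
Written side by side, with \(h_\theta\) denoting the chosen-minus-rejected log-ratio gap against the reference:

\[\mathcal{L}_{DPO} = -\log\sigma\big(\beta\, h_\theta\big), \qquad \mathcal{L}_{IPO} = \left(h_\theta - \frac{1}{2\tau}\right)^2, \qquad h_\theta = \log\frac{\pi_\theta(y_c \mid x)}{\pi_{ref}(y_c \mid x)} - \log\frac{\pi_\theta(y_r \mid x)}{\pi_{ref}(y_r \mid x)}\]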

Practically: IPO is preferred over DPO when the preference dataset is small (under ~10,000 pairs) or when training for many epochs on the same data. On large datasets where overfitting is less of a concern, DPO and IPO perform similarly. The hyperparameter τ in IPO controls the strength of regularization and needs tuning: too large a τ shrinks the target gap toward zero and the model underfits; too small a τ pushes the target gap so high that the policy can drift nearly as far as under unregularized DPO.

Mid Level

Q3. Why does DPO have a length bias and how do SimPO and other variants address it?

DPO’s loss operates on the total log-probability of a response sequence — the sum of log-probabilities over all tokens. Longer sequences have more tokens contributing to this sum, so a longer response contributes more probability mass to the log-ratio term in the DPO objective. When human annotators consistently prefer longer responses (which they do — length is a strong proxy for perceived quality in blind annotation), the DPO loss systematically reinforces length as a reward-correlated feature. The model learns that generating more tokens is reliably rewarded, independent of whether those tokens add content.

SimPO addresses this directly by normalizing the log-probability by sequence length: the reward signal for a response is its average log-probability per token, not the total log-probability. Under this normalization, a concise high-quality response and a verbose padded response of the same per-token quality receive identical reward signals. The length bias is removed at the reward level rather than through post-hoc regularization. SimPO also adds a target margin γ between chosen and rejected rewards, which further improves calibration.

ORPO addresses length bias differently: by combining the SFT objective with an odds ratio penalty in a single loss, it keeps the SFT cross-entropy on the chosen response as the primary learning signal. The SFT loss is token-averaged, not summed, so length does not compound the gradient. The odds ratio penalty provides relative preference between chosen and rejected without length-weighting. Length-normalization at the loss level (SimPO) is more explicit and controllable; the combined SFT+alignment signal (ORPO) achieves a similar effect indirectly.

Forward Deployed Engineer

Q1. A customer is fine-tuning a model for a customer support use case and wants the model to be more concise. Which alignment technique would you recommend and why?

SimPO is the most targeted solution for a conciseness requirement. The core problem is that vanilla DPO produces longer responses because human preference annotations favor length, and DPO’s loss amplifies this bias through un-normalized total log-probabilities. SimPO normalizes the reward by sequence length, making the alignment signal per-token rather than per-sequence, which directly removes the incentive to generate more tokens. The target margin γ (typically 0.5-1.5) provides an additional calibration buffer.

The implementation strategy: collect preference data for the customer support domain, keeping it realistic by including actual support queries and model responses. Label shorter, accurate responses as preferred over longer ones that contain the same information plus filler language. Use SimPO with γ tuned so the model consistently produces responses in the 50-150 token range appropriate for support interactions. Monitor average response length as a primary training metric alongside preference accuracy.

If the customer has budget constraints and wants to avoid running a separate reference model, ORPO is a strong alternative — no reference model checkpoint is needed, compute is lower, and the combined SFT+alignment objective naturally controls length through the token-averaged SFT loss. However, SimPO is the more direct intervention for the conciseness problem specifically.

Two things to validate before deployment: run an evaluation comparing response quality (correct resolution, customer satisfaction proxy) across length buckets to confirm conciseness is not being achieved at the cost of resolution quality; and ensure the preference dataset contains examples where the correct answer genuinely requires a longer response (multi-step troubleshooting), so the model does not become pathologically brief on legitimately complex queries.

Forward Deployed Engineer

Q2. A team has a dataset of 10,000 individual “good response” / “bad response” labels but no paired comparisons — which alignment approach is most practical?

KTO is the direct fit here. It is explicitly designed for unpaired binary preference labels — individual (prompt, response, label) triples where the label is desirable or undesirable, with no requirement that good and bad responses come from the same prompt. DPO, IPO, and SimPO all require matched pairs and cannot be applied to this dataset directly.

The practical path: structure the dataset as KTO expects — each row is (prompt, response, boolean label). Verify that the class balance is reasonable; KTO is theoretically grounded in the asymmetric value function of prospect theory, which expects undesirable examples to carry more weight, but extreme imbalance (90%+ good or 90%+ bad) will bias the objective. Aim for roughly 50-70% desirable labels if the data can be filtered.
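
For concreteness, unpaired KTO data can be as simple as rows like the following; the field names are illustrative, not a fixed schema.

```python
# Illustrative unpaired rows. Each row stands alone -- no paired comparison needed.
kto_rows = [
    {"prompt": "How do I reset my password?",
     "completion": "Go to Settings > Security > Reset password.",
     "label": True},   # desirable
    {"prompt": "Why was I charged twice?",
     "completion": "Billing questions are complicated.",
     "label": False},  # undesirable, and no matching 'chosen' is required
]
```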

If the team later wants to run DPO-style training, KTO can serve as a warmup: run KTO first to move the model in the right direction using the unpaired data, then collect a smaller set of paired comparisons (perhaps 1,000-2,000 pairs) targeting the remaining failure modes. This hybrid strategy is more data-efficient than trying to collect 10,000 paired comparisons from scratch.

One practical caution: audit the labeling criteria before training. Individual binary labels are often collected with weaker annotator guidance than paired comparisons, because annotators making pairwise judgments are forced to compare and contrast, which surfaces inconsistencies. If the labeling criteria were loose (e.g., “was this response acceptable?”), the resulting KTO model may have a calibration ceiling lower than what a clean paired dataset would produce. A calibration audit on a held-out sample before committing to a full training run is worth the investment.