8  Fine-Tuning LLMs

Note

Who this chapter is for: Mid Level

What you’ll be able to answer after reading this:

  • The difference between full fine-tuning and instruction tuning
  • How to build and curate a fine-tuning dataset
  • What catastrophic forgetting is and how to mitigate it
  • When fine-tuning beats prompting, and when it doesn’t

8.1 Full Fine-Tuning

Full fine-tuning means updating every parameter in the model on your domain-specific dataset. To understand why this is computationally demanding, you need to account for every component that must reside in GPU memory during training. The model weights at FP16 consume 2 bytes per parameter — so a 7B parameter model immediately demands 14GB just to hold the model. But that is only the beginning. The Adam optimizer, which is the standard choice for transformer training, stores two additional momentum vectors per parameter — the first moment (gradient mean) and second moment (gradient variance) — both typically kept in FP32 to maintain numerical stability. That adds 8 bytes per parameter, or 56GB for a 7B model. Gradients add another 2 bytes per parameter at FP16, contributing another 14GB. Before you have processed a single training example, you need approximately 84GB, plus the activation memory that accumulates during the forward pass. This means a 7B model requires at minimum two A100 80GB GPUs with tensor parallelism, and a 70B model, whose weights, gradients, and optimizer states alone total roughly 840GB, needs more than a single eight-GPU node unless optimizer states are sharded and offloaded. These numbers are why most teams cannot afford full fine-tuning on large models.
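
To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python. It mirrors the byte counts above (FP16 weights and gradients, FP32 Adam moments) and deliberately ignores activation memory, which depends on batch size, sequence length, and checkpointing.

```python
def full_finetune_memory_gb(n_params: float) -> dict:
    """Rough training-memory estimate for full fine-tuning with Adam.

    Assumes FP16 weights and gradients (2 bytes each) and FP32 Adam
    first and second moments (4 + 4 bytes). Activation memory is
    excluded because it depends on batch size and sequence length.
    """
    GB = 1e9
    weights   = 2 * n_params / GB   # FP16 model weights
    gradients = 2 * n_params / GB   # FP16 gradients
    optimizer = 8 * n_params / GB   # FP32 first + second Adam moments
    return {
        "weights_gb": weights,
        "gradients_gb": gradients,
        "optimizer_gb": optimizer,
        "total_gb": weights + gradients + optimizer,
    }

print(full_finetune_memory_gb(7e9))    # ~14 + 14 + 56 = 84 GB
print(full_finetune_memory_gb(70e9))   # ~840 GB before activations
```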

Despite the cost, full fine-tuning is the right choice in specific circumstances. When you have tens of thousands of labeled examples — enough that the signal-to-noise ratio justifies the compute — and when the behavioral change you need is fundamental rather than stylistic, full fine-tuning is appropriate. PEFT methods like LoRA constrain updates to a low-rank subspace of each weight matrix, which means some types of deep behavioral change are simply out of reach for adapters but accessible through full fine-tuning. Latency is another consideration: adapter layers add inference overhead unless merged, and systems with tight latency budgets may prefer a single clean model. Typical justified use cases include continued pretraining of GPT-2-scale or similar smaller models on domain corpora like legal or medical documents, fully fine-tuning LLaMA-7B on a large customer support dataset where you want the model to internalize product-specific language and reasoning patterns, or adapting a code model to a proprietary internal programming language with syntax not present in pretraining.

The mechanics of full fine-tuning follow the same process as pretraining: compute the forward pass, calculate the loss against your labels (cross-entropy for language modeling or classification), backpropagate gradients through every layer, and update all weights using the optimizer. The difference from pretraining is the data distribution: pretraining used a massive heterogeneous corpus, while fine-tuning uses your narrow, labeled dataset. This narrowing of distribution is both the strength and the weakness of full fine-tuning — the model becomes highly specialized for your distribution, which is exactly what you want, but it also means the model may lose generality on tasks not represented in your fine-tuning set. The learning rate for fine-tuning is typically much lower than pretraining — on the order of 1e-5 to 5e-5 rather than 1e-3 or higher — to avoid catastrophically overwriting the pretraining knowledge with noisy updates from a smaller dataset.
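
A minimal training-loop sketch makes these mechanics concrete. It assumes a Hugging Face causal LM and a `tokenized_dataset` that yields `input_ids` and `attention_mask` tensors; the checkpoint name, batch size, and epoch count are illustrative placeholders rather than a tuned recipe.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"   # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Fine-tuning learning rates sit around 1e-5 to 5e-5, far below pretraining rates.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
train_loader = DataLoader(tokenized_dataset, batch_size=4, shuffle=True)  # assumed to exist

model.train()
for epoch in range(3):
    for batch in train_loader:
        batch = {k: v.cuda() for k, v in batch.items()}
        # For causal LM fine-tuning the labels are the input ids; the model
        # shifts them internally and computes the cross-entropy loss.
        outputs = model(**batch, labels=batch["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```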

A practical approach to full fine-tuning at limited compute is gradient checkpointing combined with mixed precision training. Gradient checkpointing trades compute for memory: instead of storing all activations during the forward pass, it recomputes them during backpropagation. This reduces stored activation memory from growing linearly with the number of layers to roughly the square root of the number of layers, at the cost of roughly 30% more compute time. Mixed precision training (BF16 or FP16 for forward/backward, FP32 for optimizer states) is now standard practice and provides the memory savings without sacrificing optimizer stability. Together, these techniques cut peak memory substantially, but a full fine-tuning run of a 7B model still does not fit on a single A100 80GB GPU by itself: the 84GB of weights, gradients, and optimizer states computed above already exceed the card before activations, so in practice you also need an 8-bit optimizer or CPU offloading of optimizer states. Tools like DeepSpeed ZeRO-3 and Fully Sharded Data Parallel (FSDP) distribute optimizer states, gradients, and parameters across multiple GPUs, enabling full fine-tuning of much larger models on standard multi-GPU hardware.
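
With the Hugging Face Trainer, these memory-saving techniques are configuration flags. A sketch with illustrative paths and batch sizes; the FSDP setting shown is one common choice, not the only one.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out/full-ft-7b",          # placeholder path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,       # recover an effective batch size of 16
    gradient_checkpointing=True,          # recompute activations in the backward pass
    bf16=True,                            # BF16 forward/backward, FP32 optimizer states
    learning_rate=2e-5,
    num_train_epochs=3,
    fsdp="full_shard auto_wrap",          # shard params, grads, optimizer states across GPUs
)
```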

8.2 Instruction Tuning

Instruction tuning is supervised fine-tuning on a dataset of (instruction, response) pairs, with the goal of teaching the model to follow natural language directions. This is the technique that transforms a base language model — which only knows how to predict the next token in internet-style text — into a model that responds helpfully to user requests. The foundational insight from the FLAN paper (Wei et al., 2021) was striking: a model fine-tuned on thousands of diverse task descriptions phrased as instructions dramatically outperformed the same-sized base model on zero-shot tasks it had never seen. The mechanism is generalization: the model learns the meta-skill of reading and following instructions, which transfers across novel instructions at inference time. This was a conceptual breakthrough — it showed that instruction following is itself a learnable capability, not something that only emerges at extreme scale.

The format of the data matters as much as the content. Two dominant formats have emerged in the open-source ecosystem. The Alpaca format uses a three-field JSON structure — {"instruction": "...", "input": "...", "output": "..."} — where instruction is the task description, input is optional context (empty string if not needed), and output is the desired response. This format is simple and widely supported. The ShareGPT format captures multi-turn conversations as a list of {from, value} objects alternating between “human” and “gpt” speakers. Multi-turn format is important because real users ask follow-up questions, and a model fine-tuned only on single-turn examples will not know how to maintain context across turns. Modern chat models like LLaMA-3 Instruct and Mistral Instruct are trained on multi-turn ShareGPT-style data and expose this through their chat templates with system/user/assistant role tokens.
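
The two formats look like this in practice. The records below are illustrative, and the conversion to a model's chat template uses `apply_chat_template`, available in recent transformers releases; the checkpoint name is just an example.

```python
from transformers import AutoTokenizer

# Illustrative records following the Alpaca and ShareGPT conventions.
alpaca_example = {
    "instruction": "Summarize the customer's issue in one sentence.",
    "input": "My invoice from March was charged twice and support hasn't replied.",
    "output": "The customer was double-charged on the March invoice and has received no support response.",
}

sharegpt_example = {
    "conversations": [
        {"from": "human", "value": "What does error E1042 mean?"},
        {"from": "gpt",   "value": "E1042 means the sync job timed out. Retry with a smaller batch."},
        {"from": "human", "value": "And if it still fails?"},
        {"from": "gpt",   "value": "Escalate to the integrations team with the job ID attached."},
    ]
}

# Render a ShareGPT-style conversation with a model's own chat template.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
messages = [
    {"role": "user" if turn["from"] == "human" else "assistant", "content": turn["value"]}
    for turn in sharegpt_example["conversations"]
]
print(tokenizer.apply_chat_template(messages, tokenize=False))
```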

OpenAI’s InstructGPT paper (Ouyang et al., 2022), which described the training procedure behind the shift from GPT-3 to ChatGPT, used SFT as the first of three stages. In the SFT stage, human contractors wrote high-quality responses to a diverse set of prompts. This produced an initial model far better at following instructions than the base GPT-3, but still inconsistent and sometimes unsafe. The subsequent stages — reward modeling and PPO — addressed those remaining gaps (covered in depth in Chapter 10). The critical lesson from InstructGPT is that SFT alone is insufficient to produce a truly aligned model, but it is a necessary precondition: the RL-based training in stage 3 requires a base policy that already knows how to produce coherent conversational responses. You cannot skip SFT and go directly to RL from the base pretrained model.

A subtle but important point about instruction tuning: it does not teach the model new knowledge. The factual knowledge in an instruction-tuned model comes from pretraining. What instruction tuning contributes is behavioral: the model learns how to express its knowledge helpfully, how to structure responses, how to signal uncertainty, and how to format output appropriately for conversational contexts. This has an important practical implication: if your use case requires the model to know facts that were not in the pretraining corpus (proprietary product information, recent events, internal documentation), instruction tuning alone will not provide that knowledge. You need either retrieval-augmented generation (RAG) to inject context at inference time, or you need to include that factual content in the instruction tuning dataset as grounded examples — ideally both. The model learns patterns from instruction tuning; it learns facts from pretraining.

8.3 Dataset Curation — Quality Over Quantity

The LIMA paper (Zhou et al., 2023) produced one of the most important empirical findings in the instruction tuning literature: 1,000 carefully selected, high-quality examples were sufficient to produce a model that matched or exceeded models trained on 50,000 examples on a range of evaluation tasks. This result challenged the prevailing assumption that more data always wins, and it pointed to a deeper truth about what instruction tuning is actually doing. The model already contains the knowledge and linguistic capability from pretraining. Instruction tuning is alignment of output format and behavioral style — and that alignment requires only enough examples to teach the pattern, not an exhaustive enumeration of every possible scenario. Redundant examples that repeat the same pattern add noise but little signal, while a carefully selected diverse set forces the model to learn the underlying meta-skill of instruction following rather than memorizing specific input-output pairs.

Quality in a fine-tuning dataset has four distinct dimensions. Diversity is the first and most important: your dataset must cover the full distribution of tasks and user intents that your application will encounter. A customer support dataset that over-represents billing questions and under-represents technical troubleshooting will produce a model that is systematically worse at troubleshooting, even if it has thousands of billing examples. Audit your dataset distribution carefully before training. Accuracy is the second dimension: every response must be factually correct and represent the desired behavior. A single incorrect example does not ruin the model, but systematic errors in a category of examples will teach the model to reproduce those errors. Format consistency is the third: if your desired output format is JSON, every example’s output should be valid JSON with the same schema. If it is markdown with headers, every response should use that structure. The model learns the format from the examples and is confused by inconsistency. Difficulty distribution is the fourth: include easy examples (which the model already handles well) to anchor training, medium examples (the target capability), and hard examples (edge cases and complex reasoning) to push the model’s ceiling.
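
A lightweight audit script can check the diversity and format-consistency dimensions before you spend any compute. A sketch assuming each example carries hypothetical `category` and `output` fields and that the desired output format is JSON; adapt the field names to your own schema.

```python
import json
from collections import Counter

def audit(dataset):
    """Report category balance and output-format consistency for a dataset
    of dicts with hypothetical 'category' and 'output' fields."""
    category_counts = Counter(ex["category"] for ex in dataset)

    invalid_json = []
    for i, ex in enumerate(dataset):
        try:
            json.loads(ex["output"])          # desired format here is JSON
        except (json.JSONDecodeError, TypeError):
            invalid_json.append(i)

    print("category distribution:", category_counts.most_common())
    print(f"{len(invalid_json)} / {len(dataset)} outputs are not valid JSON")
    return category_counts, invalid_json
```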

Practical dataset construction at scale requires systematic deduplication and filtering. Near-duplicate examples are common when you collect data from multiple sources or generate synthetic examples with similar templates — they waste training compute and can cause the model to over-fit to specific phrasings. MinHash Locality-Sensitive Hashing (LSH) is the standard approach for near-duplicate detection at scale: represent each document by a MinHash signature, then use banding to find pairs with high Jaccard similarity efficiently. For quality filtering, a lightweight classifier trained to distinguish high-quality from low-quality responses can process millions of examples at low cost. Common filtering targets: very short responses that don’t answer the question, responses containing hate speech or personally identifiable information, responses with excessive repetition (a known failure mode in autoregressive generation), and responses in the wrong language. Perplexity-based filtering using a reference model is another effective approach: responses with anomalously high perplexity are often incoherent or malformed.
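
A minimal near-duplicate filter using the datasketch library might look like the following; the shingling scheme (lowercased word sets) and the 0.8 Jaccard threshold are illustrative choices to tune for your data.

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):   # shingle however suits your data
        m.update(token.encode("utf-8"))
    return m

def deduplicate(examples, threshold: float = 0.8):
    """Keep only examples whose MinHash signature has no near-duplicate already kept."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for i, ex in enumerate(examples):
        sig = minhash_of(ex["output"])        # assumes an 'output' text field
        if lsh.query(sig):                    # an existing example exceeds the Jaccard threshold
            continue
        lsh.insert(str(i), sig)
        kept.append(ex)
    return kept
```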

The most powerful dataset strategy for production systems is a data flywheel: collect real user interactions with your deployed system, filter them for quality, label the highest-quality examples, and add them to your training set. This creates a self-reinforcing loop where each model version generates interactions that improve the next version. The key operational challenge is the quality filter: not all real user interactions are good training examples. You want examples where the model’s response was effective (user continued productively, rated highly, did not immediately rephrase the question) and diverse (the example covers a case not already well-represented in the training set). Implementing this flywheel requires investment in logging infrastructure, annotation tooling, and dataset management — but it compounds over time and is a significant competitive moat for teams that build it systematically. Teams relying solely on static datasets fall behind teams with active data flywheels.
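
The quality filter at the heart of the flywheel is usually a set of simple heuristics over logged signals. A sketch; the field names (`user_rating`, `user_rephrased`, `follow_up_count`) are hypothetical stand-ins for whatever your logging pipeline actually records.

```python
def is_flywheel_candidate(interaction: dict) -> bool:
    """Heuristic quality filter for logged production interactions."""
    if interaction.get("user_rating") is not None and interaction["user_rating"] < 4:
        return False                              # explicitly rated poorly
    if interaction.get("user_rephrased"):
        return False                              # user immediately re-asked the same question
    if interaction.get("follow_up_count", 0) == 0 and len(interaction.get("response", "")) < 50:
        return False                              # short dead-end response
    return True

# logged_interactions is an assumed log source from your serving stack.
candidates = [x for x in logged_interactions if is_flywheel_candidate(x)]
```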

8.4 Catastrophic Forgetting

Catastrophic forgetting is the phenomenon where a neural network, when trained on a new distribution of data, loses previously acquired capabilities. In the context of LLM fine-tuning, this means that a model fine-tuned on a narrow domain — say, medical question answering — may degrade on coding tasks, mathematical reasoning, or general world knowledge that it handled well before fine-tuning. The mechanism is straightforward: gradient updates that improve performance on the fine-tuning distribution necessarily move the model’s weights away from the configurations that encoded general capabilities. The optimizer does not know or care that the weights it is modifying also encode Python syntax or historical facts — it only knows that this gradient step reduces the loss on the medical QA examples in front of it. The severity of forgetting depends on the distance between the fine-tuning distribution and the original pretraining distribution: the more specialized and narrow the fine-tuning data, the more forgetting you typically observe.

Detecting catastrophic forgetting requires evaluating on held-out general benchmarks before, during, and after fine-tuning. A standard practice is to checkpoint the model at regular intervals (every 100-500 gradient steps depending on the dataset size) and evaluate each checkpoint on a suite of benchmarks — MMLU for general knowledge breadth, HellaSwag for commonsense reasoning, HumanEval for coding ability, and any other capabilities relevant to your deployment context. Plotting these benchmark scores against training steps gives you a forgetting curve: you can see exactly when and how much general capability is being sacrificed for domain-specific improvement. The goal is to find the checkpoint that maximizes domain-specific performance while keeping benchmark scores within acceptable bounds of the pre-fine-tuning baseline. Without this monitoring, you may train to convergence on your domain task while unknowingly producing a model that fails on the general queries that make up 40% of your real production traffic.
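
A forgetting-curve harness can be as simple as the sketch below. The `evaluate_domain` and `evaluate_benchmark` functions are hypothetical placeholders for your own task metric and a benchmark runner such as lm-evaluation-harness.

```python
import matplotlib.pyplot as plt

checkpoints = ["ckpt-500", "ckpt-1000", "ckpt-1500", "ckpt-2000"]
domain_scores, mmlu_scores = [], []

for ckpt in checkpoints:
    domain_scores.append(evaluate_domain(ckpt))          # hypothetical: your task metric
    mmlu_scores.append(evaluate_benchmark(ckpt, "mmlu"))  # hypothetical benchmark wrapper

steps = [int(c.split("-")[1]) for c in checkpoints]
plt.plot(steps, domain_scores, label="domain task")
plt.plot(steps, mmlu_scores, label="MMLU")
plt.xlabel("training step")
plt.ylabel("score")
plt.legend()
plt.savefig("forgetting_curve.png")   # diverging lines signal forgetting
```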

Four mitigation strategies address catastrophic forgetting with different tradeoffs. Replay buffers are the simplest: mix a fraction of diverse pretraining-style data (typically 5-10% of each batch) into the fine-tuning dataset. This forces the model to maintain a gradient signal for general capabilities throughout training. The tradeoff is that replay data dilutes domain-specific learning — you need to tune the replay fraction to balance the two objectives. Elastic Weight Consolidation (EWC) is a more principled approach: after measuring each weight’s importance to general capabilities (via the Fisher information matrix), EWC adds a penalty term to the loss function that resists changes to important weights. This is computationally expensive to implement at scale but provides precise protection against forgetting specific capabilities. Multi-task fine-tuning trains the model simultaneously on your domain task and a diverse set of general tasks — the model is explicitly supervised to maintain performance on both. This requires curating the general task data and managing the multi-task training loop, but it produces models with the best balance of domain specialization and general capability retention. Finally, PEFT methods like LoRA inherently limit forgetting by restricting gradient updates to a small set of adapter parameters — the frozen base model weights retain all their original capabilities, and only the adapters change.
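
The replay-buffer strategy reduces to mixing a small fraction of general data into every batch. A minimal sketch, assuming both datasets are lists of already-tokenized examples and a 10% replay fraction.

```python
import random

def mixed_batches(domain_data, replay_data, batch_size=16, replay_fraction=0.1):
    """Yield batches where ~replay_fraction of examples come from general replay data."""
    n_replay = max(1, int(batch_size * replay_fraction))
    n_domain = batch_size - n_replay
    random.shuffle(domain_data)
    for start in range(0, len(domain_data) - n_domain + 1, n_domain):
        batch = domain_data[start:start + n_domain]
        batch += random.sample(replay_data, n_replay)   # general-capability gradient signal
        random.shuffle(batch)
        yield batch
```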

8.5 Prompting vs. Fine-Tuning Decision Framework

The decision between prompting and fine-tuning is not about what’s technically possible — a sufficiently large prompt with many examples can often match a fine-tuned model’s output quality on many tasks. It is about what is practical, economical, and reliable at production scale. Prompting should be your first choice when you are in early-stage development, when you have fewer than 100 labeled examples, when your requirements are evolving rapidly, or when the task responds cleanly to natural language instructions. Few-shot prompting with 5-20 examples in the context window can achieve high quality for many tasks and requires zero training infrastructure. The ability to iterate on a prompt in minutes rather than days is a powerful advantage when you are still discovering what your application needs to do.

Fine-tuning becomes the right choice when prompting has hit a wall. The clearest signals: the model produces the wrong output format despite explicit instructions in every few-shot example; the few-shot examples in your prompt consume more than 30% of the available context window (implying prohibitive token cost at scale); you need behavior that is consistent across thousands of edge cases and prompt-based solutions keep failing on a long tail of inputs; or the model lacks domain-specific factual knowledge that you cannot always inject through context. A common path is to start with prompting, collect failure cases, build a labeled dataset from those failures, and use that dataset to drive the fine-tuning decision. If you find yourself writing increasingly elaborate prompts with dozens of examples and still hitting consistent failure modes, that is a strong signal that fine-tuning is justified.

The economics of fine-tuning vs. prompting are increasingly favorable for fine-tuning at production scale. Every few-shot example added to a prompt increases the input token cost for every single API call. A prompt with 2,000 tokens of examples, served to 10 million requests per day, represents massive cumulative token spend. A fine-tuned model encodes those examples implicitly in its weights and may require only a short system prompt, dramatically reducing per-request cost. A rough heuristic: if your prompt exceeds ~1,000 tokens and your traffic exceeds 1 million requests per month, the compute cost of a fine-tuning run often pays for itself within weeks of deployment. This calculation is what has driven many production teams to invest in fine-tuning pipelines even when the quality improvement over prompting is modest — the cost reduction alone justifies it.
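
The break-even arithmetic is straightforward to sanity-check. The prices and traffic below are illustrative only, not any provider's actual rates.

```python
def monthly_prompt_cost(prompt_tokens: int, requests_per_month: int,
                        price_per_1k_tokens: float) -> float:
    return prompt_tokens / 1000 * price_per_1k_tokens * requests_per_month

# Illustrative numbers: 2,000 tokens of few-shot examples, 10M requests/day,
# and a hypothetical $0.0005 per 1K input tokens.
requests = 10_000_000 * 30
few_shot  = monthly_prompt_cost(2_000, requests, 0.0005)  # ~$300,000 / month
short_sys = monthly_prompt_cost(200,   requests, 0.0005)  # ~$30,000  / month
print(f"token spend saved per month: ${few_shot - short_sys:,.0f}")  # ~$270,000
```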

Red flags that indicate fine-tuning is needed cluster around three categories: format, knowledge, and consistency. Format failures occur when the model produces incorrect structure (wrong JSON schema, missing required fields, wrong markdown structure) despite clear instructions — the model’s pretraining distribution doesn’t include enough examples of your exact format, and prompting alone cannot overcome that. Knowledge failures occur when the model fabricates information or answers incorrectly on your domain despite being given relevant context — it may need more exposure to your domain’s vocabulary, conventions, and reasoning patterns than a few-shot prompt can provide. Consistency failures occur at the edge cases: the model handles 90% of inputs correctly with prompting but fails on a specific category of inputs in a predictable pattern. Fine-tuning on that failure category, combined with regression testing, can close the gap systematically in a way that prompt engineering cannot.


8.6 Interview Questions

Entry Level

Q1. What is the difference between pretraining and fine-tuning?

Pretraining is the initial large-scale training phase where a model is trained on a massive, diverse corpus of text — typically hundreds of billions to trillions of tokens scraped from the internet, books, and code repositories — with the objective of predicting the next token. This phase is where the model acquires general language understanding, factual knowledge, reasoning patterns, and coding ability. Fine-tuning is a subsequent, much shorter training phase that updates the pretrained model’s weights on a smaller, curated, task-specific dataset. Fine-tuning teaches the model new behaviors (how to follow instructions, how to answer in a specific format, how to respond appropriately to your use case), but the underlying knowledge comes from pretraining. Analogy: pretraining is a generalist education spanning years; fine-tuning is a targeted professional training course of a few weeks. You cannot fine-tune a model into knowing facts it never saw during pretraining, but you can fine-tune it to express its knowledge in precisely the way your application requires.

Q2. What is instruction tuning?

Instruction tuning is supervised fine-tuning on a dataset of (instruction, response) pairs, where each instruction describes a task in natural language and the response is the desired model output. The goal is to teach the model the meta-skill of following instructions — so that at inference time, it can generalize to novel instructions it has never seen. Before instruction tuning, base language models could only continue text; they had no concept of “answer this question” or “summarize this document.” After instruction tuning, the model has internalized a behavioral pattern: read an instruction, produce the appropriate response. The seminal demonstration was Google’s FLAN (2021), which showed that fine-tuning on thousands of diverse task descriptions phrased as instructions dramatically improved zero-shot performance on held-out tasks. Instruction tuning is stage one of the RLHF pipeline used to build ChatGPT and similar chat models — it provides the behavioral foundation that the subsequent reward modeling and RL stages refine.

Q3. What is catastrophic forgetting?

Catastrophic forgetting is the degradation of previously learned capabilities when a neural network is trained on a new, narrower data distribution. In LLM fine-tuning, this means that a model fine-tuned on a specialized domain — medical text, legal documents, a specific product’s support data — may become measurably worse at general tasks like coding, reasoning, or answering questions outside that domain. The cause is gradient-based learning: the optimizer updates weights to minimize loss on the fine-tuning examples, and these updates overwrite some of the weight configurations that encoded general capabilities. The severity scales with the narrowness of the fine-tuning distribution and the number of training steps. It can be detected by evaluating on general benchmarks (MMLU, HumanEval, HellaSwag) before and after fine-tuning. Mitigations include replay buffers (mixing general data into fine-tuning batches), elastic weight consolidation (penalizing changes to important weights), multi-task fine-tuning, and PEFT methods like LoRA (which freeze base weights entirely).

Mid Level

Q4. You have 500 labeled examples for a customer support classification task. Would you full fine-tune, use LoRA, or use few-shot prompting? Justify your choice.

With 500 labeled examples, LoRA fine-tuning is typically the right choice, though the specific decision depends on a few factors. Few-shot prompting is viable if your 500 examples are diverse enough to select good in-context examples from, but a classification task with fixed output categories is a strong candidate for fine-tuning because the model needs to learn a specific output schema consistently. Full fine-tuning with 500 examples risks overfitting — you have enough signal to train a LoRA adapter but likely not enough to justify updating all 7 billion parameters without aggressive regularization, and the compute cost is not warranted. LoRA gives you the benefits of fine-tuning (consistent format, lower inference-time token cost, better performance on your specific distribution) with a fraction of the training compute and a low risk of catastrophic forgetting since base weights are frozen. The practical workflow: split your 500 examples 400/100 train/validation, train a LoRA adapter on the training set, evaluate on the validation set, and compare to a few-shot baseline. If LoRA is within 2% of few-shot accuracy, consider prompting for simplicity; if it’s clearly better, deploy the adapter. For a classification task specifically, LoRA almost always wins once you have more than ~200 examples.
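
For reference, the corresponding LoRA setup with the PEFT library is only a few lines; the base checkpoint and target modules below are illustrative defaults rather than tuned choices.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder
config = LoraConfig(
    r=8,                                  # low rank keeps trainable parameters small
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections, a common default
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()        # typically well under 1% of base parameters
```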

Q5. What makes a fine-tuning dataset high quality vs. low quality? What specific properties do you look for?

A high-quality fine-tuning dataset has four core properties. First, diversity: the dataset must cover the full distribution of inputs your model will encounter in production. A dataset that represents only common cases trains a model that fails systematically on edge cases. Measure diversity by clustering your examples and checking that every cluster important to your use case is well-represented. Second, accuracy: every response must be correct and represent the exact behavior you want the model to exhibit. Errors in the dataset become errors in the model; the tolerance for systematic noise is very low. Third, format consistency: if your desired output is structured JSON, every response in the dataset should be valid JSON with the same schema. Inconsistency in format teaches the model that any format is acceptable. Fourth, appropriate difficulty distribution: include easy examples (to anchor learning), medium examples (the main target behavior), and hard/edge-case examples (to push the model’s ceiling). Specific red flags for low quality: duplicate or near-duplicate examples (they add volume but little new signal); very short responses that don’t fully demonstrate the desired behavior; responses that are technically correct but stylistically inconsistent; and examples where the response requires knowledge not available in the prompt (forcing the model to hallucinate during training).

Q6. The LIMA paper claimed 1,000 examples could match models trained on 50,000 examples. What does this tell us about dataset quality and data collection strategy?

The LIMA finding (Zhou et al., 2023) encodes a fundamental insight about what instruction tuning actually does. Instruction tuning is not teaching the model new knowledge — the model already has vast knowledge from pretraining. Instruction tuning is surface alignment: teaching the model to express that knowledge in the format and style appropriate for a helpful assistant. The amount of data needed to learn a behavioral format is much smaller than the amount needed to learn a factual corpus. A model trained on 50,000 redundant examples has mostly seen the same patterns repeated, providing limited additional signal beyond the first thousand. Carefully curated diverse examples, each demonstrating a distinct behavioral pattern, are far more efficient teachers. The practical implication for data collection strategy is: do not scale your dataset by adding more examples of things the model already handles well. Diagnose failure cases — tasks where the model’s output is wrong or misformatted — and add examples specifically for those failure modes. Invest in annotation quality over annotation quantity: one high-quality, carefully reviewed example is worth more than twenty mediocre ones. This also makes human labeling tractable — 1,000 high-quality examples is achievable for a small team, whereas 50,000 requires industrial-scale annotation infrastructure.

Q7. How would you detect catastrophic forgetting during a fine-tuning run before it’s too late?

The key is systematic checkpoint evaluation throughout training, not just at the end. Set up an evaluation harness that runs automatically at fixed intervals — typically every 100-500 gradient steps, depending on the total dataset size. The harness should evaluate two categories of tasks: your target domain task (the training objective) and a set of general capability benchmarks representative of what you need the model to retain. For most applications, a minimal general benchmark suite includes MMLU or a subset for factual knowledge, one coding benchmark like HumanEval, and HellaSwag for commonsense reasoning. Plot both domain performance and general benchmark scores against training steps. The forgetting signal is a diverging plot: domain performance rising while benchmark scores fall. You want the checkpoint where domain performance is high and benchmark degradation is still acceptable — this is often well before final convergence. Early warning signs: if benchmark scores start dropping before your domain task has substantially converged, your learning rate is too high, your replay fraction is too low, or your fine-tuning dataset is too narrow. Catching this at step 500 out of 5,000 lets you adjust hyperparameters and restart; catching it at the end is expensive. Also monitor validation loss on a held-out sample of general pretraining data — a rising validation loss on that held-out set is a direct signal of catastrophic forgetting.

Forward Deployed Engineer

Q8. A customer wants to fine-tune a model on their internal docs so it “knows their product.” Walk through the full data, compute, and deployment considerations you’d raise.

This is a common request that requires careful scoping before any training begins. On the data side, the first question is what “knows their product” actually means. If the goal is factual knowledge retrieval — answering questions about product features, pricing, policies — fine-tuning is probably the wrong solution. RAG is better suited because it can be updated as the product evolves without retraining, and it provides attribution (you can show the user which document was cited). Fine-tuning is appropriate when the goal is behavioral: adopt our terminology, respond in our brand voice, follow our specific response structure. For behavioral fine-tuning, you need (instruction, response) pairs — not just raw documents. Raw internal docs must be converted into training examples, typically by having domain experts write ideal Q&A pairs, or by using a strong model to generate synthetic examples and then having experts review and filter them. The dataset should cover the full distribution of user queries you expect, not just the easy cases documented in the FAQ. On the compute side, for most teams a LoRA fine-tune of a 7B or 13B model is the right scope — full fine-tuning is rarely justified for behavioral customization. Compute for a LoRA run on 1,000-10,000 examples takes hours on a single A100, not days. On deployment, discuss: model versioning (when the product changes, you need to retrain — what is the retraining cadence and who owns it?), evaluation (what is the success metric, and how will you know if a new model version is better?), fallback (if the fine-tuned model fails, what happens?), and data security (the training examples may contain sensitive internal information — where is training happening and who has access?).

Q9. The customer’s fine-tuned model performs well in offline evaluation but poorly in production. What are the five most likely root causes you’d investigate?

Distribution shift is the first and most common cause: the offline evaluation test set does not represent the real distribution of production queries. If the test set was sampled from the training data distribution or was manually curated by the team, it will be biased toward cases the model handles well. Collect a random sample of actual production queries and evaluate on those — the gap often becomes immediately apparent. Second, prompt drift: the production prompt template differs from the training template, even subtly. Fine-tuned models are sensitive to their prompt format because they learned the format from training. A missing system prompt, a changed field order, or different token spacing can degrade performance. Third, input preprocessing differences: the offline evaluation pipeline may handle tokenization, encoding, special characters, or input length differently than the production serving stack. Fourth, output postprocessing: the offline evaluation may be lenient about format compliance (accepting close matches) while production relies on exact structured output. Fifth, temperature and sampling parameters: if offline eval used greedy decoding but production uses a nonzero temperature with nucleus sampling, the output distribution shifts meaningfully. Investigation protocol: replay a sample of production requests through the offline eval pipeline with exact production parameters; compare token-by-token output between production and offline eval for the same inputs; log and inspect all production failure cases; and shadow-test the model against the old baseline on live traffic before fully deploying.

8.7 Further Reading