1 What Is Generative AI?
Who this chapter is for: All levels — Entry, Mid, FDE
What you’ll be able to answer after reading this:
- What distinguishes generative models from discriminative models
- Why the transformer architecture catalyzed the current GenAI wave
- How the modern AI stack is layered (foundation models → APIs → applications)
- What interviewers mean when they ask “explain GenAI to a non-technical stakeholder”
1.1 Generative vs. Discriminative Models
To understand generative AI, you first need to understand what it is not. Traditional machine learning systems — the kind that dominated the field from the 1990s through the early 2010s — are largely discriminative. A discriminative model learns to draw a boundary between categories in the data. Given a feature vector representing an email, a spam classifier learns where the boundary lies between “spam” and “not spam.” It outputs a probability that a given input belongs to a particular class: \(P(y \mid x)\). The model never needs to understand what a legitimate email looks like in full — it only needs to know which side of the boundary any given email falls on. This is a powerful and efficient approach for classification and regression tasks, and it underpins everything from logistic regression to modern convolutional neural networks like ResNet.
A generative model operates with a fundamentally different objective. Instead of learning the boundary between classes, it learns the underlying distribution of the data itself — the joint probability \(P(x, y)\), or in the case of unconditional generation, just \(P(x)\). A model that has learned \(P(x)\) understands, in a statistical sense, what “data in this domain” looks like. From that understanding, it can produce new examples that are consistent with the learned distribution. This is the core meaning of “generative”: the model can generate new instances that look like they came from the same distribution as its training data. An email-generating model doesn’t just classify — it could write you a new email from scratch.
The spam filter versus email writer analogy captures this distinction precisely. A discriminative spam filter (say, a fine-tuned BERT model) reads your email and outputs a label — spam or not-spam. It has no concept of what a well-written email looks like; it only knows how to classify. GPT-4, by contrast, can write you a persuasive sales email, continue an unfinished draft, or translate your brusque notes into professional prose. It learned to model the distribution of all text it was trained on, so it can produce coherent, fluent text that fits that distribution. ResNet (discriminative) classifies an image as “cat” or “dog.” DALL-E (generative) produces an image of a cat riding a dog in the style of Van Gogh. The inputs and outputs are inverted: discriminative models consume data and produce labels; generative models consume a specification or partial context and produce new data.
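To make the distinction concrete, here is a minimal numpy sketch on a toy one-dimensional spam feature. The generative side fits a Gaussian per class, a distribution it can sample brand-new examples from; the discriminative quantity \(P(\text{spam} \mid x)\) is only a score for a given input (here derived via Bayes’ rule, though a logistic regression would learn the same boundary directly). The feature and all numbers are illustrative, not from any real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data: feature x (say, "number of exclamation marks"), label y (spam=1).
x_spam = rng.normal(loc=8.0, scale=2.0, size=500)
x_ham  = rng.normal(loc=2.0, scale=1.5, size=500)

# Generative view: model P(x | y) per class (a Gaussian each) plus the prior P(y).
# From this learned distribution we can SAMPLE new, never-seen examples.
mu_spam, sd_spam = x_spam.mean(), x_spam.std()
mu_ham,  sd_ham  = x_ham.mean(),  x_ham.std()
new_spammy_feature = rng.normal(mu_spam, sd_spam)   # generate a new data point

# Discriminative view: only P(y | x) matters -- a score, not new data.
def gaussian_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def p_spam_given_x(x):
    p_x_spam = gaussian_pdf(x, mu_spam, sd_spam) * 0.5   # P(x|spam) P(spam)
    p_x_ham  = gaussian_pdf(x, mu_ham,  sd_ham)  * 0.5   # P(x|ham)  P(ham)
    return p_x_spam / (p_x_spam + p_x_ham)               # P(spam|x)

print(new_spammy_feature, p_spam_given_x(6.0))
```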
A crucial nuance: “generate” does not mean “hallucinate,” though hallucination is a real failure mode of generative models. Generating text means sampling from a learned probability distribution over sequences. When that distribution is well-calibrated and the model is prompted appropriately, the outputs are accurate, coherent, and useful. Hallucination occurs when the model assigns high probability to plausible-sounding but factually incorrect content — a failure of calibration, not an inherent property of generation. BERT versus GPT illustrates both sides of the paradigm within the transformer family: BERT uses a bidirectional encoder fine-tuned for discriminative tasks like question answering and classification; GPT uses an autoregressive decoder trained to generate the next token — a generative objective that, at sufficient scale, produces remarkably capable systems.
1.2 Why Now? The Three Ingredients
The techniques underlying modern generative AI are not entirely new. Artificial neural networks date to the 1950s. Recurrent networks capable of processing sequences were well-established by the 1990s. Language modeling — predicting the probability of the next word — is a decades-old problem in NLP. So why did the current wave of capable generative AI emerge specifically in 2022 and not 2002 or 2012? The answer lies in the convergence of three distinct enabling factors: a new architecture, an explosion in compute, and the availability of web-scale data. Each was necessary but not sufficient — only when all three arrived simultaneously did the current wave become possible.
The first ingredient is the transformer architecture, introduced by Vaswani et al. in the 2017 paper “Attention Is All You Need.” Before transformers, the dominant architectures for sequential data were recurrent neural networks (RNNs) and their gated variants (LSTMs, GRUs). RNNs process sequences one token at a time, maintaining a hidden state that is updated at each step. This sequential dependency means you cannot parallelize training across time steps — you must process token 1 before token 2, token 2 before token 3, and so on. This made training slow and gradient signals weak over long sequences. The transformer replaced recurrence entirely with self-attention: every token in the sequence attends to every other token simultaneously, with learned attention weights determining how much each position influences every other. Because there is no sequential dependency in the core computation, the entire sequence can be processed in parallel across GPU cores. This single change unlocked a 10-100× improvement in training efficiency and enabled the enormous model scales that followed.
The second ingredient is compute. The 2012 AlexNet moment — where a deep convolutional network trained on GPUs dramatically outperformed all competitors on the ImageNet benchmark — signaled that GPU-accelerated training could make previously infeasible model scales tractable. In the decade that followed, the AI hardware ecosystem exploded: NVIDIA’s A100 and H100 GPUs deliver hundreds of teraflops of half-precision throughput; tensor cores are designed specifically for the matrix multiplications at the heart of neural network training; hyperscale cloud providers assembled clusters of thousands of these chips connected by high-bandwidth interconnects. A training run for GPT-3 in 2020 cost an estimated $4-12 million in compute alone and required thousands of A100-equivalent GPUs running for months. That kind of infrastructure simply did not exist before roughly 2018-2020, and it is what made 100-billion-parameter models possible.
The third ingredient is data. Transformers are data-hungry in a way that RNNs never could be: their performance keeps improving as both parameter count and training data grow. The internet, by 2020, had produced on the order of a trillion tokens of publicly crawlable English text in Common Crawl alone — plus billions of lines of code on GitHub, millions of books, and the entirety of Wikipedia in dozens of languages. The combination of web-scale data with efficient transformer training on powerful GPU clusters produced the first truly striking result at scale: GPT-3 in 2020, which demonstrated that a single model trained with no task-specific fine-tuning could perform competitively on dozens of NLP benchmarks through few-shot prompting alone. The 2022 ChatGPT moment added one more ingredient — RLHF (Reinforcement Learning from Human Feedback) to align model outputs to human preferences — but the fundamental capability had already emerged from the three-way convergence of architecture, compute, and data.
1.3 The Modern AI Stack
Understanding where different roles operate in the AI ecosystem requires a clear mental model of the layered stack that has emerged around foundation models. At the base of the stack are foundation models: large-scale pretrained models produced by a small number of organizations with the compute and data resources to train them from scratch. This tier includes GPT-4 and o1 from OpenAI, Claude from Anthropic, Gemini from Google DeepMind, and open-weight models like Meta’s Llama family and Mistral. Training a state-of-the-art foundation model in 2024-2025 requires tens to hundreds of millions of dollars of compute, teams of dozens of researchers and engineers, and proprietary data curation pipelines. The number of organizations that can operate at this layer is small and is unlikely to grow rapidly — the barrier to entry is simply too high.
Sitting directly on top of foundation models are APIs and SDKs — the interfaces through which most developers interact with model capabilities. The OpenAI API, Anthropic API, and Google Gemini API expose foundation model capabilities through simple HTTP endpoints. You send text in and get text out. These APIs also provide abstractions like system prompts, function/tool calling, and structured output modes that make it practical to build production applications. The model providers also expose fine-tuning endpoints that let customers adapt the base model to domain-specific behavior without retraining from scratch. This layer is where the commercial value from foundation models is currently captured: API pricing ranges from fractions of a cent to several dollars per million tokens depending on model capability.
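A minimal sketch of what “text in, text out” looks like in practice, using the OpenAI Python SDK; the model name, prompts, and response handling are illustrative, and the Anthropic and Gemini SDKs expose closely analogous request/response patterns.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice; pick whatever tier fits the task
    messages=[
        {"role": "system", "content": "You are a concise assistant for support agents."},
        {"role": "user", "content": "Summarize this ticket in two sentences: ..."},
    ],
)
print(response.choices[0].message.content)  # text in, text out
```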
The orchestration and tooling layer sits above raw APIs. Libraries like LangChain and LlamaIndex provide abstractions for common GenAI application patterns: retrieval-augmented generation (RAG), multi-step chains, agent loops, structured output parsing, and memory management. These frameworks abstract away boilerplate and provide tested implementations of complex patterns. Frameworks like DSPy take a different approach, treating prompt engineering as a compilation and optimization problem. Vector databases (Pinecone, Weaviate, Chroma, pgvector) occupy an adjacent position in this layer, providing the semantic search infrastructure that RAG pipelines require. Engineers who work at this layer are building the infrastructure that makes GenAI applications composable and maintainable.
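The most common pattern at this layer, RAG, has a shape worth sketching. Everything below is a stand-in: `embed`, `vector_store.search`, and `llm` are hypothetical placeholders for whichever embedding model, vector database client, and chat API a given stack actually uses — the frameworks above mostly wrap loops like this.

```python
def answer_with_rag(question, vector_store, embed, llm, k=4):
    """Hypothetical RAG loop: embed the question, retrieve the k most similar
    chunks from a vector store, then ask the LLM to answer using only those chunks."""
    query_vector = embed(question)                       # text -> embedding vector
    chunks = vector_store.search(query_vector, top_k=k)  # semantic nearest neighbours
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)                                   # grounded generation
```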
At the top of the stack are applications — the products that end users actually interact with. This tier includes developer tools (Cursor, GitHub Copilot), search and knowledge products (Perplexity, Glean), legal and professional services platforms (Harvey, Casetext), enterprise productivity tools (Microsoft Copilot, Google Workspace), and thousands of vertical SaaS products integrating AI into domain-specific workflows. The application layer is where domain expertise and user experience design matter most, and where the commercial opportunity is arguably largest. A Forward Deployed Engineer (FDE) role sits primarily at the orchestration and application layers: you are not training models, but you are deploying them into enterprise environments, customizing them through prompting and fine-tuning, integrating them with enterprise data sources, and ensuring the resulting system meets production reliability, latency, and accuracy requirements. Understanding the full stack — even the layers you do not directly build — is essential for diagnosing problems and communicating tradeoffs to customers and internal teams.
1.4 Explaining GenAI to Non-Technical Stakeholders
One of the most important skills tested in FDE interviews — and often asked explicitly — is the ability to explain complex AI concepts clearly to non-technical audiences. This is not merely a communication exercise; it reveals whether you understand the technology deeply enough to translate it. Shallow understanding produces jargon-filled explanations that confuse rather than clarify. Deep understanding allows you to choose the right analogy and convey genuine insight in plain language. Interviewers ask this question because FDEs spend a large fraction of their time communicating with customer executives, legal and compliance teams, and business stakeholders who are skeptical, curious, or both — and your ability to build their confidence and set accurate expectations is as commercially important as your technical ability.
Two framings work particularly well for explaining LLMs to non-technical audiences. The first is “auto-complete for everything.” Your phone’s keyboard predicts the next word as you type — ChatGPT does the same thing, but at a scale of trillions of words of training data, with a model containing billions of learned parameters, and completing not just the next word but entire documents, analyses, and conversations. This framing is immediately intuitive because everyone has experienced phone keyboard prediction. It also naturally conveys the statistical nature of the output: just as your phone might sometimes suggest a wrong or awkward word, language models can produce incorrect or strange text. The limitation of this analogy is that it undersells the emergent capabilities — people’s phone keyboards can’t write a Python script — so it needs a second framing to complement it.
The “extremely well-read intern” framing addresses this gap. Imagine an intern who has read virtually everything ever published — every textbook, every Wikipedia article, every news story, every codebase on GitHub — and has extraordinarily good pattern recognition for how language and ideas connect. They can draft a contract, explain a scientific concept, write code, or summarize a document faster than any human. But this intern has a critical limitation: they are not connected to live information, they cannot look things up, and they can sometimes confuse things they half-remember from their reading. They will answer confidently even when they are wrong, because confident fluency is exactly what they learned from their training data. This framing conveys the genuine power (broad coverage, fast drafting, pattern synthesis) and the genuine risk (confident errors, no real-time grounding) in terms any executive can understand.
The key technical facts to convey accurately — even in a non-technical explanation — are: LLMs predict text, they do not retrieve it from a database; they can be wrong confidently, which is different from being obviously uncertain; they do not have live access to information unless explicitly connected to external tools; and they are not search engines, calculators, or reasoning systems in the traditional sense. What they are genuinely excellent at is language — generation, transformation, summarization, classification — at a quality and speed that was previously impossible to automate. Setting these expectations accurately in the first conversation with a customer is the difference between a successful deployment and a failed one.
1.5 Interview Questions
Q1. What is generative AI and how does it differ from traditional machine learning?
Traditional machine learning models are predominantly discriminative — they learn to map inputs to outputs, typically by learning a decision boundary between classes. A spam classifier, for example, learns \(P(\text{spam} \mid \text{email features})\) and outputs a label. It never needs to understand what a complete email looks like; it only needs to classify.
Generative AI models learn the underlying distribution of the data itself — \(P(x)\) or \(P(x, y)\) — and can sample new examples from that distribution. A generative email model could write you a new email from scratch because it has learned what emails look like at a statistical level.
The practical distinction is one of inputs and outputs: discriminative models consume data and produce labels or predictions; generative models consume a specification, partial input, or prompt and produce new data — text, images, audio, or code. Examples: ResNet is discriminative (classifies images); DALL-E is generative (creates images). BERT is discriminative (classifies or extracts from text); GPT-4 is generative (produces text). The key insight for interviews is that generative models are not “magic” — they are sophisticated statistical models that assign probabilities to sequences of tokens and can be sampled from to produce new sequences.
Q2. Name three real-world applications of generative AI and the type of model behind each.
First, code completion in tools like GitHub Copilot and Cursor is powered by large causal language models (decoder-only transformers like GPT-4 or specialized code models like Codex). These models are trained on billions of lines of code and predict the next token in a code context, which manifests as full function completions and docstring generation.
Second, text-to-image generation in tools like DALL-E 3, Midjourney, and Stable Diffusion uses diffusion models — a class of generative model that learns to reverse a gradual noising process. The model starts with random noise and iteratively denoises it, conditioned on a text prompt, until a coherent image emerges.
Third, enterprise document summarization and chat products (like Glean, Notion AI, or custom RAG systems built on Claude or GPT-4) use large decoder-only language models augmented with retrieval — the model reads retrieved document chunks and synthesizes a coherent response. Understanding that different applications often combine multiple model types (a language model for generation, an embedding model for retrieval, a vision model for images) is important for mid-level and FDE candidates.
Q3. What is a foundation model?
A foundation model is a large-scale machine learning model trained on broad, diverse data at significant compute expense, which serves as a starting point (a “foundation”) that can be adapted to a wide range of downstream tasks. The term was coined by researchers at Stanford’s Center for Research on Foundation Models in their 2021 “On the Opportunities and Risks of Foundation Models” paper.
The defining characteristics are scale, generality, and adaptability. Scale means the model has billions of parameters and was trained on trillions of tokens or equivalent data. Generality means it was not designed for a single task — a foundation model trained on text can answer questions, write code, translate languages, and summarize documents without task-specific retraining. Adaptability means it can be efficiently specialized to specific tasks or domains through fine-tuning or prompting, at a fraction of the cost of training from scratch.
Current examples include GPT-4 and o1 (OpenAI), Claude 3.5 Sonnet (Anthropic), Gemini 1.5 Pro (Google), and the open-weight Llama 3 family (Meta). The economic and technical significance of foundation models is that they shift the cost structure of AI: instead of every organization training task-specific models from scratch, most organizations build on top of shared foundation models through APIs or fine-tuning.
Q4. What was the significance of the Transformer paper (2017)?
Vaswani et al.’s “Attention Is All You Need” (2017) introduced the transformer architecture, which replaced recurrent neural networks as the dominant architecture for sequence modeling. The core innovation was self-attention: instead of processing sequences one token at a time (as RNNs do), every token in the sequence attends to every other token simultaneously using learned attention weights. This eliminated the sequential dependency in RNNs that prevented parallelization.
The practical consequence was enormous: transformer training could be parallelized across the entire sequence length, making it feasible to train on orders of magnitude more data using GPU clusters. RNNs also struggled with long-range dependencies — gradients vanished over long sequences — while transformers handle long-range context through direct attention connections between any two positions.
The significance for GenAI is that transformers made scale tractable. GPT-2 (2019), GPT-3 (2020), and every subsequent large language model are transformer-based. Without the parallelizability of the transformer, training a 175-billion-parameter model would have taken years instead of months even with the same hardware. The 2017 paper is effectively the inflection point of the modern AI era: every major LLM in production today is a descendant of that architecture, making it perhaps the most consequential ML paper of the decade.
Q5. Explain the difference between generative and discriminative models using probability notation.
A discriminative model directly models the conditional probability \(P(y \mid x)\) — the probability of a label \(y\) given observed features \(x\). Given an input (say, an email’s feature vector), it outputs a distribution over possible labels (spam / not-spam). It learns the boundary between classes without needing to model what typical spam or legitimate emails look like in full feature space. Logistic regression, SVMs, BERT fine-tuned for classification, and ResNet all fall in this category.
A generative model models the joint probability \(P(x, y)\) or, in the unsupervised case, the marginal \(P(x)\). From the joint, you can derive \(P(y \mid x) = P(x, y) / P(x)\) via Bayes’ theorem — so generative models can technically be used for classification too, but their primary capability is sampling: given the learned distribution \(P(x)\), you can draw new samples \(x^* \sim P(x)\) that look like training data.
GPT-style models are trained to maximize \(\sum_t \log P(x_t \mid x_{<t})\), which is equivalent to modeling \(P(x)\) over sequences via the chain rule of probability: \(P(x_1, \ldots, x_n) = \prod_t P(x_t \mid x_{<t})\). This is precisely why they can generate novel text — they have learned a distribution over sequences from which new sequences can be sampled. The interviewer distinction to nail: discriminative models are often more accurate for their specific task, but they cannot produce new data; generative models can produce new data at the cost of modeling a harder distribution.
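A tiny worked example of that factorization, with made-up per-token conditional probabilities standing in for what a trained model would assign to the actual next tokens of a short sequence:

```python
import numpy as np

# Made-up values of P(x_t | x_{<t}) for each actual next token of a 4-token sequence.
cond_probs = np.array([0.20, 0.05, 0.60, 0.30])

log_prob_sequence = np.sum(np.log(cond_probs))           # log P(x_1..x_4) by the chain rule
per_token_loss = -log_prob_sequence / len(cond_probs)    # cross-entropy the training loop minimizes
print(log_prob_sequence, per_token_loss)
```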
Q6. What are the three technical developments that made the current LLM wave possible, and why did they need to converge?
The three developments are: (1) the transformer architecture (2017), enabling parallelizable training on long sequences; (2) GPU-scale compute, specifically the availability of large GPU clusters with high-bandwidth interconnects allowing training runs measured in millions of GPU-hours; and (3) web-scale text data, including Common Crawl (hundreds of terabytes of web text), GitHub (billions of lines of code), and curated corpora like The Pile and RedPajama.
Each was necessary but insufficient alone. Transformers without compute produce small models with modest capabilities — GPT-1 (2018) had 117M parameters and was impressive but not transformative. Compute without transformers would have just scaled up slow, hard-to-parallelize RNNs, which plateau in quality because of gradient issues over long sequences. Data without the architecture and compute to exploit it is just an inert archive. The 2020 GPT-3 paper was the first demonstration of what happens when all three converge at scale: 175B parameters trained on ~300B tokens produced emergent capabilities including few-shot learning that were qualitatively not present in smaller models.
The convergence also had a timing component: Common Crawl had been accumulating web text since 2008; GPU hardware hit the necessary performance/cost ratio around 2018-2020; and the transformer architecture arrived in 2017. Before 2017, the data was there but the architecture and hardware weren’t. By 2020, all three were available simultaneously. This is why the “why now?” question has a precise answer rather than a vague “things improved gradually.”
Q7. What does it mean when we say an LLM “predicts the next token”? How does that mechanism lead to seemingly intelligent outputs?
At each inference step, an LLM takes the sequence of tokens it has processed so far and produces a probability distribution over its entire vocabulary (typically 50,000-100,000 tokens). The “prediction” is this distribution: each token gets a probability, and together they sum to 1. The token selected from this distribution becomes the next output, is appended to the sequence, and the process repeats. The training objective is to maximize the probability assigned to the actual next token in the training corpus, which is equivalent to minimizing the cross-entropy loss.
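A sketch of that loop, assuming a hypothetical `model` callable that maps the current token sequence to unnormalized logits over the vocabulary (a real implementation would add stopping conditions, batching, and smarter decoding strategies):

```python
import numpy as np

def generate(model, tokens, max_new_tokens, temperature=1.0, seed=0):
    """Autoregressive decoding sketch. `model` is a hypothetical callable
    returning one logit per vocabulary entry for the next-token position."""
    rng = np.random.default_rng(seed)
    tokens = list(tokens)
    for _ in range(max_new_tokens):
        logits = np.asarray(model(tokens), dtype=float)    # shape: (vocab_size,)
        logits = logits / temperature
        logits -= logits.max()                             # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum()                               # softmax: P(next token | context)
        next_token = int(rng.choice(len(probs), p=probs))  # sample one token
        tokens.append(next_token)                          # append and repeat
    return tokens
```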
The emergence of intelligence from this seemingly simple objective is one of the most surprising empirical discoveries in machine learning. To predict the next token accurately across a diverse corpus — news articles, scientific papers, code, fiction, legal documents — a model must implicitly learn the grammar, vocabulary, facts, reasoning patterns, and discourse structure of every domain represented in its training data. A model that can accurately predict the next token in a physics textbook has necessarily learned something about physics. One that predicts the next line of Python code has learned programming syntax and idioms.
The key insight is that next-token prediction is a deceptively general proxy task. Unlike supervised tasks that only teach a specific mapping, predicting text requires learning virtually everything that makes text coherent. The “intelligence” is not programmed — it emerges from the pressure to minimize prediction loss across a distribution so broad that the only way to succeed is to internalize deep regularities about language, knowledge, and reasoning. This is also why the model can be wrong: the objective is probabilistic accuracy, not factual correctness, and the model will confidently predict a plausible-sounding but incorrect token if that is what the training distribution supports.
Q8. A customer’s CEO asks: “How does ChatGPT actually work?” Give a 2-minute explanation for a non-technical executive.
I would say something like: “Think about the autocomplete on your phone — it suggests the next word as you type, based on what words typically follow what you’ve already written. ChatGPT does essentially the same thing, but at a scale that’s hard to imagine. It was trained on roughly a trillion words of text from the internet, books, code, and academic papers — more than any human could read in thousands of lifetimes. In doing so, it learned the patterns of how language works across every topic covered in that data.
When you ask it a question, it’s not looking up your answer in a database the way Google does. It’s generating an answer word-by-word, picking what word is most likely to come next given everything you’ve said and everything it learned. Because it learned from so much text on so many topics, those generated words turn into surprisingly accurate, coherent answers.
The limitation is exactly what you’d expect: it learned from text, not from verified facts. So it can write confidently even when it’s wrong — like a very eloquent person who sometimes misremembers things. It also has no live information unless you give it access to external tools. For your business, this means it’s extraordinarily powerful for tasks where you need to generate, summarize, or transform text — but for tasks where precision matters, we build guardrails and verification steps into the workflow.”
The key elements: intuitive analogy, honest about limitations, tied to business implications.
Q9. A skeptical customer says “AI just makes stuff up — why should we trust it for our business?” How do you respond?
I would validate the concern before addressing it — dismissing it makes you seem naive. “You’re right that hallucination is a real problem, and I wouldn’t recommend deploying AI in any business-critical context without accounting for it. The question isn’t whether AI can be wrong — it can — but whether we can engineer a system where the failure rate and failure mode are acceptable for your specific use case.”
Then I would distinguish use cases: for tasks where the AI is summarizing documents the customer already owns, extracting structured data from forms, or drafting text a human will review before it goes out, the hallucination risk is manageable — humans stay in the loop for the high-stakes decisions. For tasks like generating patient diagnoses or executing financial trades autonomously, the bar is much higher and requires different architecture choices.
I would then explain the architectural mitigations available: retrieval-augmented generation (RAG) grounds the model’s answers in retrieved source documents, dramatically reducing hallucination on factual queries; fine-tuning on domain-specific data shifts the model’s behavior toward your domain; confidence scoring and abstention mechanisms can flag uncertain responses for human review; and evaluation pipelines can measure hallucination rates on representative samples before deployment. I close with: “The companies using AI successfully in production aren’t trusting it blindly — they’re engineering systems that use AI for what it’s genuinely good at while keeping humans responsible for what it isn’t. Let me show you how that looks for your use case specifically.”
Q10. You’re scoping a GenAI project for a customer. What questions do you ask in the first 30 minutes to determine if GenAI is even the right tool?
I start by asking what the customer is actually trying to achieve and what their current process looks like — not what AI solution they want, because customers often arrive with a solution in mind before the problem is properly defined. Specifically, I want to understand: What is the input and what is the desired output? Is the output language-based, or does it involve structured data, images, or decisions? Is the task generative (creating new content), extractive (finding or summarizing existing content), or classificatory (routing or labeling)?
Then I probe the quality and cost-of-error requirements: What happens if the system produces a wrong answer? Is there a human in the loop for review, or does this output go directly to a customer or a downstream automated system? High-stakes, low-review workflows demand very different architecture than internal productivity tools. I also ask about data: What data do you have that’s relevant to this task? Is it structured or unstructured? Is there labeled ground truth, which would enable evaluation?
I ask about the existing baseline: Is there a current manual process? What does it cost in time and headcount? This both sizes the value of automation and establishes the quality bar (the AI must be better or cheaper than the current approach to be worth deploying). Finally, I ask whether this is actually a language problem: if the customer wants to predict numerical outcomes from structured tabular data, a classical ML model (XGBoost, logistic regression) will almost certainly outperform an LLM and cost a fraction of the price. GenAI is the right tool when the input or output is language, or when the task requires broad world knowledge that would be expensive to encode in a rules-based system.