16  Common GenAI Customer Use Cases

Note

Who this chapter is for: FDE
What you’ll be able to answer after reading this:

  • The canonical GenAI use cases and the architecture patterns behind each
  • How to match a customer’s problem to the right GenAI pattern
  • What “good” vs. “bad” looks like for each use case type
  • How to quickly prototype and de-risk a solution

16.1 Use Case Taxonomy

Use Case                          | Core Pattern                  | Key Technical Challenge
Document Q&A                      | RAG                           | Retrieval quality, context length
Internal search                   | Embeddings + semantic search  | Index freshness, hybrid retrieval
Code generation / copilot         | Fine-tuned decoder            | Security, hallucination in code
Customer support chatbot          | RAG + guardrails              | Hallucination, escalation logic
Data extraction / classification  | Structured output             | Consistency, edge cases
Summarization                     | Direct generation             | Length control, faithfulness
Report generation                 | Agents + RAG                  | Multi-step reasoning, citations

16.2 Document Q&A

The most common enterprise use case. A user asks a natural-language question; the system retrieves relevant document sections and generates a grounded answer with citations.

Architecture:

  1. Ingest documents → chunk → embed → store in vector DB
  2. At query time: embed query → retrieve top-k chunks → inject into LLM prompt
  3. LLM generates answer citing specific retrieved passages
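A minimal sketch of this pipeline, assuming Chroma as the vector store and an OpenAI-compatible chat API; the library choices, model name, and fixed-size chunking here are illustrative, not prescriptive:

# Minimal document Q&A sketch: ingest, retrieve, generate.
import chromadb
from openai import OpenAI

chroma = chromadb.Client()                       # in-memory store; use a persistent client in practice
collection = chroma.create_collection("docs")    # uses Chroma's default embedding function
llm = OpenAI()

def ingest(doc_id: str, text: str, chunk_size: int = 1000) -> None:
    # Chunk a document and store chunks with IDs that point back to the source.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    collection.add(
        ids=[f"{doc_id}-{n}" for n in range(len(chunks))],
        documents=chunks,
        metadatas=[{"doc_id": doc_id, "chunk": n} for n in range(len(chunks))],
    )

def answer(question: str, k: int = 5) -> str:
    # Retrieve top-k chunks and ask the model to answer only from them, with citations.
    hits = collection.query(query_texts=[question], n_results=k)
    context = "\n\n".join(
        f"[{meta['doc_id']}#{meta['chunk']}] {doc}"
        for doc, meta in zip(hits["documents"][0], hits["metadatas"][0])
    )
    prompt = (
        "Answer using ONLY the passages below and cite their IDs. "
        "If the answer is not in the passages, say so.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    resp = llm.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content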

The questions that determine architecture:

  • How large is the document corpus? (determines vector DB choice)
  • How often do documents change? (determines ingestion pipeline complexity)
  • What’s the acceptable latency? (determines whether re-ranking is affordable)
  • What access control is needed? (document-level auth in the retrieval layer)

Key failure modes: retrieving irrelevant chunks (fix: re-ranking + hybrid search), hallucinating beyond the retrieved context (fix: explicit grounding instruction + faithfulness eval), stale answers from outdated index (fix: event-driven re-ingestion).

16.4 Code Copilots

In-editor autocomplete (the GitHub Copilot model) and a chat interface (the Cursor model) are architecturally different products.

Autocomplete requires sub-100ms latency — speculative decoding, small fine-tuned models, prefix caching. Chat can afford 2–5 second responses and benefits from full repository context.

Security considerations: the model sees proprietary source code. For enterprise customers: on-prem deployment or VPC-deployed API, audit logging of all completions, output scanning for secrets and PII.

Evaluation: run the model on your actual codebase, measure pass@k on unit tests the model doesn’t see, track whether suggestions require edits before acceptance.
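For the pass@k measurement, the standard approach (the unbiased estimator from the original Codex evaluation) is to generate n ≥ k samples per problem, count the c that pass the hidden unit tests, and compute 1 − C(n−c, k)/C(n, k). A small sketch:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimate: probability that at least one of k samples,
    # drawn from n total samples of which c passed, is correct.
    if n - c < k:
        return 1.0  # too few failures for any k-subset to be all failing
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 37 pass the hidden unit tests.
print(round(pass_at_k(n=200, c=37, k=1), 3))    # ~0.185
print(round(pass_at_k(n=200, c=37, k=10), 3))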

16.5 Customer Support Automation

Architecture:

User message
  → Intent classifier (is this in-scope?)
  → RAG retrieval (relevant policy/product docs)
  → LLM response generation
  → Output safety check
  → If confidence < threshold: human escalation

The escalation path is not optional — it’s the core reliability mechanism. A system that handles 80% of tickets accurately but fails silently on 20% is worse than no automation.
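A sketch of this flow as orchestration code. Every helper here (classify_intent, retrieve_docs, generate_reply, passes_safety_check) is a hypothetical stand-in; the point is that escalation is a first-class return value rather than a silent failure path:

from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.7  # tune against labeled escalation data

@dataclass
class BotResult:
    reply: str | None
    escalated: bool
    reason: str | None = None

def handle_ticket(message: str) -> BotResult:
    # 1. Intent classification: is this something the bot is allowed to handle?
    intent = classify_intent(message)           # hypothetical classifier
    if not intent.in_scope:
        return BotResult(None, True, "out_of_scope")

    # 2. Retrieve relevant policy / product documentation.
    context = retrieve_docs(message, k=5)       # hypothetical RAG retrieval

    # 3. Generate a grounded draft reply with a self-reported confidence score.
    draft = generate_reply(message, context)    # hypothetical LLM call

    # 4. Output safety check: policy violations, commitments the business can't keep.
    if not passes_safety_check(draft.text):
        return BotResult(None, True, "safety_check_failed")

    # 5. Escalate to a human instead of failing silently when confidence is low.
    if draft.confidence < CONFIDENCE_THRESHOLD:
        return BotResult(None, True, "low_confidence")

    return BotResult(draft.text, False)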

Tone and safety guardrails: input classifiers for harmful content, output classifiers for policy violations, explicit prohibition on the model making commitments the business can’t keep (refund amounts, timelines, guarantees).

16.6 Data Extraction and Structured Output

The pattern: give the LLM an unstructured document and a JSON schema, ask it to fill the schema. LLMs are surprisingly good at this for complex documents (contracts, medical records, financial filings) that rule-based extractors fail on.

Validation loop:

  1. LLM generates candidate JSON
  2. Schema validation (Pydantic, jsonschema) — catch structural errors
  3. Business logic validation — catch semantic errors (“end date before start date”)
  4. If validation fails: retry with the error message in context (typically 1–2 retries resolve 95%+ of failures)
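A sketch of that loop, assuming Pydantic v2 for schema validation; the schema, field names, and the call_llm helper are illustrative:

from datetime import date
from pydantic import BaseModel, ValidationError, model_validator

class ContractExtraction(BaseModel):
    counterparty: str
    start_date: date
    end_date: date

    @model_validator(mode="after")
    def check_dates(self):
        # Business-logic validation: catch semantically impossible values.
        if self.end_date < self.start_date:
            raise ValueError("end_date is before start_date")
        return self

def extract(document: str, max_retries: int = 2) -> ContractExtraction:
    error_feedback = ""
    for _ in range(max_retries + 1):
        raw = call_llm(document, error_feedback)   # hypothetical LLM call returning JSON text
        try:
            return ContractExtraction.model_validate_json(raw)   # schema validation
        except ValidationError as exc:
            # Feed the validation error back into the next attempt's prompt.
            error_feedback = f"Your last output was invalid: {exc}. Return corrected JSON only."
    raise RuntimeError("extraction failed after retries")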

Confidence signals: ask the model to include a confidence field per extracted item. Items below threshold route to human review rather than auto-acceptance.


16.7 Case Studies: Real Projects

16.7.1 Case Study 1: Marketing Automation Agent (CMO Agent)

Problem: Small engineering teams build technically superior products that fail commercially because they have no marketing bandwidth. Hiring a full CMO is out of budget.

Solution: A multi-agent marketing system that automates four functions:

Function                  | Agent            | Tools
Competitive intelligence  | Research Agent   | Web scraping (Firecrawl), price monitoring
Positioning & messaging   | Messaging Agent  | Copy generation, A/B variant drafting
Launch planning           | Launch Agent     | Timeline coordination, channel strategy
Analytics synthesis       | Analytics Agent  | Mixpanel/Amplitude integration

Critical design decision: Human approval gates on brand voice, pricing decisions, and positioning direction. The agents propose; humans approve. The value is in the pattern-matching work (competitor monitoring, copy generation, launch checklists) — not in strategic judgment calls that require human context.

Frameworks used: CrewAI for role-based agent personas; LangGraph for deterministic sequential workflows.

What it can’t do: strategic pivots, crisis judgment (“do we apologize or stay silent?”), category-redefining creative leaps.

Lesson for FDEs: when scoping a marketing automation engagement, diagnose which of the four functions the customer actually needs. Most customers need competitive intelligence and messaging consistency — not a full CMO replacement.

16.7.2 Case Study 2: Personalized Music Library Organization (Full-Stack AI Tool)

Problem: Spotify’s genre tags are unreliable, especially for non-Western music. A 2,000-song library is effectively unsearchable.

Solution: Full-stack TypeScript app that batches tracks to an LLM for classification across six custom dimensions (genre, mood, language, occasion, era, energy level), presents results in a human-review dashboard, then pushes approved playlists to Spotify.

Architecture decisions worth noting:

Batch classification, not per-track: Sending 25 tracks per LLM request with strict JSON output (“the array must have exactly as many elements as tracks given, in the same order”) makes parsing trivial and reduces cost ~25x vs. per-track.

Custom taxonomy enforcement: Allowed values per dimension get appended to the system prompt. The LLM classifies into your categories, not its own — ensuring every track lands in a user-defined bucket.
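The same batching-plus-taxonomy pattern as a short Python sketch (the case study itself is TypeScript; the taxonomy values and the call_llm helper are illustrative):

import json

TAXONOMY = {  # user-defined allowed values, appended to the system prompt
    "genre": ["afrobeats", "bollywood", "indie rock", "jazz"],
    "mood": ["chill", "hype", "melancholic"],
    "energy": ["low", "medium", "high"],
}

SYSTEM_PROMPT = (
    "Classify each track using ONLY these values:\n"
    + json.dumps(TAXONOMY, indent=2)
    + "\nReturn a JSON array with exactly as many elements as tracks given, in the same order."
)

def classify_batch(tracks: list[dict], batch_size: int = 25) -> list[dict]:
    results = []
    for i in range(0, len(tracks), batch_size):
        batch = tracks[i:i + batch_size]
        raw = call_llm(SYSTEM_PROMPT, json.dumps(batch))   # hypothetical LLM call
        parsed = json.loads(raw)
        # Strict contract: same length, same order, values only from the taxonomy.
        assert len(parsed) == len(batch), "model returned wrong number of classifications"
        for item in parsed:
            for dim, allowed in TAXONOMY.items():
                assert item[dim] in allowed, f"unexpected {dim} value: {item[dim]}"
        results.extend(parsed)
    return results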

Human-in-the-loop by default: Nothing pushes to Spotify until the user reviews and approves. The rename-before-push step converts machine taxonomy (“Study”) into human-friendly playlist names.

Background processing without a job queue: Classifying 2,000 tracks takes minutes — too long to hold an HTTP request open. Rather than adding BullMQ infrastructure, the implementation intentionally discards the classification promise and uses database polling for progress. Practical over perfect.
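A rough Python analogue of that trade-off, assuming a SQLite progress table and a worker thread in place of a job queue; the table layout and the classify helper are illustrative:

import sqlite3
import threading

DB = "app.db"

def _exec(sql: str, params: tuple = ()):
    # One short-lived connection per statement keeps things thread-safe without a pool.
    db = sqlite3.connect(DB)
    try:
        rows = db.execute(sql, params).fetchall()
        db.commit()
        return rows
    finally:
        db.close()

def start_classification(job_id: str, tracks: list[dict]) -> None:
    # Kick off classification and return immediately; the HTTP handler does not wait.
    _exec("CREATE TABLE IF NOT EXISTS jobs (id TEXT PRIMARY KEY, done INTEGER, total INTEGER)")
    _exec("INSERT OR REPLACE INTO jobs VALUES (?, 0, ?)", (job_id, len(tracks)))

    def worker():
        for n, track in enumerate(tracks, start=1):
            classify(track)  # hypothetical classification call (per track or per batch)
            _exec("UPDATE jobs SET done = ? WHERE id = ?", (n, job_id))

    threading.Thread(target=worker, daemon=True).start()  # fire and forget, no job queue

def progress(job_id: str) -> tuple[int, int]:
    # Polled by the UI to render a progress bar.
    return _exec("SELECT done, total FROM jobs WHERE id = ?", (job_id,))[0]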

Cost: $0.30–$0.50 in LLM API tokens to classify 2,000 songs. Trivial.

Lesson for FDEs: the most elegant GenAI solutions often have a simple core pattern (batch → classify → review → act) with the complexity in the integration layer (OAuth, background jobs, UI). Don’t over-architect the AI part.


16.8 Interview Questions

Forward Deployed Engineer

Q1. A legal firm wants to use GenAI to answer questions about their contract database. Design the solution.

This is a RAG use case with elevated requirements around accuracy, access control, and citation — legal users need to trust the answer and verify the source.

Architecture:

Ingestion pipeline: extract text from contract PDFs (legal documents often have complex layouts — evaluate Azure Document Intelligence or AWS Textract against the actual document set, not generic benchmarks). Preserve document metadata: contract ID, counterparty, execution date, contract type, and access-control group. Chunk at paragraph boundaries with 10–15% overlap (legal text is dense; sentence-level chunks lose too much context). Embed with a legal-domain model (LegalBERT or a general model fine-tuned on contracts).

Retrieval: hybrid search — BM25 handles exact legal citations (“Section 8.2(a),” specific clause references) while dense search handles conceptual queries (“what’s our indemnification exposure with Acme?”). Apply document-level ACL filtering before returning results — users should only retrieve contracts they’re authorized to see.
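A sketch of hybrid retrieval with ACL pre-filtering, assuming rank_bm25 for the sparse side and a hypothetical dense_search for the embedding side, merged with reciprocal rank fusion:

from rank_bm25 import BM25Okapi

def allowed(chunk: dict, user_groups: set[str]) -> bool:
    # Document-level ACL: filter BEFORE results reach the LLM, not after.
    return chunk["acl_group"] in user_groups

def hybrid_search(query: str, chunks: list[dict], user_groups: set[str], k: int = 10) -> list[dict]:
    visible = [c for c in chunks if allowed(c, user_groups)]

    # Sparse side: BM25 handles exact clause references like "Section 8.2(a)".
    bm25 = BM25Okapi([c["text"].lower().split() for c in visible])
    sparse_scores = bm25.get_scores(query.lower().split())
    sparse_rank = sorted(range(len(visible)), key=lambda i: -sparse_scores[i])

    # Dense side: embedding search handles conceptual queries ("indemnification exposure").
    dense_rank = dense_search(query, visible)        # hypothetical: returns indices into `visible`

    # Reciprocal rank fusion to merge the two rankings.
    fused: dict[int, float] = {}
    for rank_list in (sparse_rank, dense_rank):
        for pos, idx in enumerate(rank_list):
            fused[idx] = fused.get(idx, 0.0) + 1.0 / (60 + pos)

    top = sorted(fused, key=fused.get, reverse=True)[:k]
    return [visible[i] for i in top]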

Generation: instruct the LLM to answer only from retrieved context and cite specific clause locations: “Based on Section 4.1 of the Master Services Agreement with Acme Corp (executed 2023-01-15)…” Explicit grounding instruction: “Do not infer or extrapolate beyond the provided contract text. If the answer is not in the retrieved context, say so.”

Evaluation: build a test set of 50–100 known query-answer pairs from existing contracts before deploying. Measure faithfulness (is every claim in the answer sourced from retrieved text?) and retrieval recall (are the right clauses retrieved?). Legal hallucination is a liability risk — quality gates are non-negotiable.

Security: on-premise or VPC-deployed to prevent contract text from leaving the firm’s infrastructure.

Q2. A customer wants to replace their internal search tool with an AI-powered alternative. What questions do you ask before writing any code?

Before writing code, I need to understand the existing system’s failure modes, the data landscape, and what success looks like. My questions:

About the current search:

  • What does the current tool do well and poorly? (If keyword search is failing for conceptual queries, semantic search helps. If it’s failing because the index is stale, the problem is a data pipeline problem, not a semantic search problem.)
  • What does “a good result” look like to users? Can they articulate it? This determines how you’ll evaluate the new system.

About the data:

  • What document types and formats are in the corpus? (PDFs, Word docs, Confluence pages, Salesforce records — each requires different parsing and has different update patterns.)
  • How many documents and how often do they change? (10,000 static PDFs vs. 1 million records updated hourly have completely different infrastructure requirements.)
  • Is there any access control — documents that only certain users/roles should see? (Mandatory question — access control must be built into retrieval, not bolted on after.)

About the query distribution:

  • What are the top 20 queries users actually run? (Exact keyword lookups need BM25; conceptual queries need dense search; most real-world distributions need both.)
  • Do users search for exact strings (product codes, document IDs) or concepts?

About success criteria:

  • How will you measure if the new system is better? Do you have historical click-through data from the current system as a baseline?
  • What’s the acceptable latency for a search result?

About organizational constraints:

  • Where does the data live and can it leave those systems? (Data residency requirements often constrain vector DB choice.)
  • Who will maintain the ingestion pipeline when it breaks?

These questions take 30 minutes in a discovery call and prevent weeks of rework.

Q3. A software company wants to build a code review copilot. What are the top 3 risks you’d want to mitigate before shipping?

Risk 1: False confidence in generated suggestions. Developers will trust a code review copilot — it looks authoritative and is always grammatically confident. If the model suggests a fix that introduces a security vulnerability, breaks an API contract, or silently changes behavior, developers may accept it without deep scrutiny. The mitigation: make uncertainty explicit in the UI (“High confidence,” “Verify before applying”), require the model to explain its reasoning rather than just stating a fix, and never have the copilot auto-apply changes without human confirmation. Build an eval that tests suggestions against unit tests the model doesn’t see — if suggestions cause test failures at >5%, don’t ship.

Risk 2: Proprietary code exposure. Every code snippet the model sees is sent to an API — potentially the entire codebase over time. For a software company with proprietary algorithms, this is an IP and competitive risk. Questions to answer before shipping: which API are you using, where is data processed and stored, does the provider’s data usage policy allow training on customer inputs, and does the customer’s legal team require a DPA? Mitigation options: on-premise model deployment, API providers with contractual no-training commitments (Anthropic and OpenAI both offer this), or restricting the copilot to non-sensitive file types.

Risk 3: Degraded code quality from automation bias. When developers use the copilot, review depth may decrease — if a suggestion passes the copilot, reviewers may assume it’s fine. This can mask bugs the model confidently missed. The mitigation is not to disable the copilot but to track: are post-copilot PRs passing unit tests at the same rate? Is production incident rate stable? Instrument these metrics from launch so you can detect degradation before it compounds.

Q4. Walk me through how you’d de-risk a document Q&A POC in a two-week sprint.

A two-week POC has one goal: determine whether this use case is technically viable for this customer’s documents and queries — before committing to full build. Risk-reduction is the output, not a polished product.

Days 1–2: Scope and data access. Get 50–100 representative documents from the actual corpus — not a cleaned demo set, the messy real ones. Get 20–30 representative questions from real users (email the people who will use this system, not the project sponsor). Identify access control requirements. Everything else depends on these inputs.

Days 3–5: Baseline pipeline. Build the minimal viable RAG pipeline: parse documents → fixed-size chunking (256 tokens, 20% overlap) → embed with all-MiniLM-L6-v2 → store in Chroma (local, no infrastructure setup) → retrieval → generate with GPT-4o or Claude Sonnet. The specific choices don’t matter yet — you want a working baseline to measure from.

Days 6–8: Measure retrieval and generation quality. For each of the 20–30 questions, manually evaluate: (a) are the retrieved chunks relevant? (b) is the generated answer correct and sourced from the retrieved chunks? Build a simple scorecard. This tells you exactly where the system fails — retrieval quality, parsing quality, or generation quality — and whether the failures are fixable.
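The scorecard can be as simple as a CSV with one row per question (manual labels for retrieval and answer quality, plus a failure-mode tag) and a tally that shows where failures concentrate; the column names below are illustrative:

import csv
from collections import Counter

def summarize(scorecard_path: str) -> None:
    # Scorecard CSV columns: question, retrieval_ok (y/n), answer_ok (y/n), failure_mode.
    rows = list(csv.DictReader(open(scorecard_path)))
    total = len(rows)
    retrieval_ok = sum(r["retrieval_ok"] == "y" for r in rows)
    answer_ok = sum(r["answer_ok"] == "y" for r in rows)
    print(f"retrieval ok: {retrieval_ok}/{total}   answer ok: {answer_ok}/{total}")
    # Group failures by root cause so days 9-11 target the biggest buckets first.
    failures = Counter(r["failure_mode"] for r in rows if r["answer_ok"] != "y")
    for mode, count in failures.most_common():
        print(f"{count:3d}  {mode}")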

Days 9–11: Address top failure modes. If retrieval is poor: add BM25 hybrid search, try a better embedding model, fix chunk size. If parsing is dropping content: try a better PDF parser on the failing document types. Don’t fix everything — fix the failure modes that affect the most questions.

Days 12–14: Demo and recommendation. Demo on real questions. Present: accuracy on the 30-question eval set, the 3–5 remaining failure modes and their root causes, a rough infrastructure estimate for production, and a recommendation: go/no-go with specific conditions. The POC’s value is the recommendation, not the demo.