17  Enterprise Integration Patterns

Note

Who this chapter is for: FDEs (mid-level+ background assumed).

What you’ll be able to answer after reading this:

  • How to design a GenAI API layer that fits into an existing enterprise architecture
  • Data pipeline patterns for keeping a RAG index fresh
  • Authentication, rate limiting, and cost governance for enterprise deployments
  • Common integration anti-patterns and how to avoid them

17.1 API Design for GenAI Features

GenAI features have fundamentally different API characteristics than traditional REST endpoints, and trying to design them as though they were conventional CRUD services leads to poor user experience, brittle systems, and frustrated customers. The three major differences are latency, probabilistic output, and cost structure. A traditional database query takes milliseconds; an LLM generation takes one to thirty seconds depending on output length and model size. A traditional API call is deterministic; the same LLM prompt can return different outputs on successive calls. A traditional API has negligible per-call cost; LLM calls cost money proportional to input and output token counts, which makes cost governance a first-class engineering concern rather than an afterthought.

The most important API design decision for user-facing GenAI features is streaming versus synchronous response. When a user submits a question and waits ten seconds staring at a blank screen before text appears, the experience feels broken — even if the final answer is excellent. Streaming via Server-Sent Events (SSE) allows tokens to appear progressively as the model generates them, creating the impression of a fast response even when total generation time is long. For web applications, SSE is simpler to implement than WebSockets and sufficient for unidirectional model-to-client streaming. For bidirectional or real-time applications (voice, interactive agents), WebSockets may be preferable. The implementation pattern on the API layer is straightforward: call the LLM provider’s streaming endpoint, pipe the token stream through your middleware, and emit SSE events to the client. Error handling becomes more complex with streaming because you cannot return an HTTP error code after you have already started sending a 200 response; errors mid-stream must be communicated as a special event type in the stream protocol.
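
A minimal sketch of this pattern, assuming FastAPI on the API layer; `stream_llm_tokens` is a hypothetical stand-in for your provider's streaming client:

```python
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def stream_llm_tokens(prompt: str):
    # Stand-in: yield tokens from your provider's streaming endpoint.
    for token in ("Because ", "of ", "..."):
        yield token

@app.post("/v1/answer")
async def answer(payload: dict):
    async def event_stream():
        try:
            async for token in stream_llm_tokens(payload["question"]):
                yield f"event: token\ndata: {json.dumps({'text': token})}\n\n"
            yield "event: done\ndata: {}\n\n"
        except Exception as exc:
            # A 200 status is already committed; report failure in-band
            # as a dedicated event type instead of an HTTP error code.
            yield f"event: error\ndata: {json.dumps({'message': str(exc)})}\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```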

Idempotency is a common design principle in traditional APIs — if the same request is submitted twice, the second should be a no-op or return the same result. This principle does not transfer cleanly to LLM APIs. LLM calls are inherently non-idempotent: the same prompt submitted twice will generally return different outputs. This has implications for retry logic. If your API layer blindly retries a failed LLM call, you may generate duplicate responses that are both sent to the user, or you may waste tokens generating a response you never use. The right pattern is to implement idempotency at the application layer where it makes semantic sense: for example, if the user submits the same question twice because the first response never reached them, you might return a cached response rather than re-generating, but only if your use case allows for cached responses. Blind infrastructure-level retries with exponential backoff are appropriate for transient provider errors (503s), but not as a general pattern.
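
A sketch that keeps the two mechanisms separate, application-level idempotency and infrastructure-level retries, with hypothetical stand-ins for the provider client and its retryable error class:

```python
import random
import time

class TransientProviderError(Exception):
    """Stand-in for the provider's retryable (HTTP 503-style) error."""

def call_llm(prompt: str) -> str:
    ...  # stand-in for the real provider call

_response_cache: dict[str, str] = {}  # idempotency_key -> response (Redis in production)

def generate(idempotency_key: str, prompt: str, max_retries: int = 3) -> str:
    # Application-level idempotency: a duplicate submission returns the
    # original response instead of re-generating (only if the use case allows).
    if idempotency_key in _response_cache:
        return _response_cache[idempotency_key]
    for attempt in range(max_retries):
        try:
            response = call_llm(prompt)
            _response_cache[idempotency_key] = response
            return response
        except TransientProviderError:
            # Retry only transient provider errors, with backoff plus jitter.
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("LLM call failed after retries")
```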

For long-running generations — document analysis, batch summarization, complex multi-step agent runs — synchronous HTTP is the wrong pattern entirely. The request will time out before the generation completes, and the client has no visibility into progress. The correct pattern is async: accept the request, return a job ID immediately with HTTP 202, process the generation asynchronously, and notify the client via webhook or allow polling against a status endpoint. This is more complex to implement but is the only reliable pattern for tasks that take more than thirty seconds. Production systems should implement both mechanisms: a webhook URL the customer can register for push notification, and a GET /jobs/{id} endpoint for polling when webhooks are not feasible. Include enough information in the status response that clients can display meaningful progress to users: “Analyzing document 3 of 12” is better than a spinner.
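
A minimal sketch of the job pattern; in production the in-memory dict would be a database and the background task a real queue worker:

```python
import uuid

from fastapi import BackgroundTasks, FastAPI

app = FastAPI()
jobs: dict[str, dict] = {}   # production: a database table, not process memory

def run_generation(job_id: str, document_ids: list[str]) -> None:
    for i, doc_id in enumerate(document_ids, start=1):
        jobs[job_id]["progress"] = f"Analyzing document {i} of {len(document_ids)}"
        ...  # long-running LLM work per document goes here
    jobs[job_id].update(status="done", progress="complete")

@app.post("/jobs", status_code=202)
def submit(payload: dict, background: BackgroundTasks):
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "running", "progress": "queued"}
    background.add_task(run_generation, job_id, payload["document_ids"])
    return {"job_id": job_id}   # client polls GET /jobs/{id} or registers a webhook

@app.get("/jobs/{job_id}")
def status(job_id: str):
    return jobs[job_id]
```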

Prompt versioning is a practice that most teams adopt only after their first production debugging nightmare. When an LLM feature starts behaving differently in production, the first question is: did the model change, or did the prompt change? Without versioning, that question is unanswerable. Treat your prompt templates as versioned artifacts with the same discipline as code: store them in version control, include a prompt_version field in every API response (both stored in your logs and optionally surfaced to the caller), and enforce that prompt changes go through the same review and deployment process as code changes. When a customer reports a quality regression, you can immediately identify whether it correlates with a prompt version change. The investment in prompt versioning is small; the debugging leverage it provides is large.
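
One lightweight way to implement this; the template name and version identifiers are illustrative:

```python
# Prompt templates as versioned artifacts, stored in version control.
# ACTIVE_VERSIONS changes only through reviewed deploys.
PROMPTS = {
    ("support_answer", "v3"): (
        "You are a support assistant. Context:\n{context}\n\nQuestion: {question}"
    ),
}
ACTIVE_VERSIONS = {"support_answer": "v3"}

def render_prompt(name: str, **kwargs) -> tuple[str, str]:
    version = ACTIVE_VERSIONS[name]
    return PROMPTS[(name, version)].format(**kwargs), version

prompt, prompt_version = render_prompt("support_answer", context="...", question="...")
# Log prompt_version with every response so regressions can be correlated.
```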

17.2 Data Pipeline Patterns for Index Freshness

A RAG system is only as useful as the freshness of its index. Serving answers grounded in documents that were last indexed six months ago is often worse than serving no answer at all — the system confidently references outdated information, and users learn not to trust it. Index freshness is one of the most commonly underspecified requirements in RAG deployments, and FDEs who raise it explicitly during scoping demonstrate experience that general-purpose ML engineers often lack.

Three pipeline architectures cover most enterprise use cases, differentiated primarily by required freshness latency. Batch ingestion, the simplest pattern, runs on a schedule — typically nightly or weekly. A cron job fetches all documents from the source system, chunks them, generates embeddings, and either replaces or incrementally updates the vector index. Batch ingestion is appropriate for content that changes infrequently: product documentation, policy handbooks, HR procedures, archived research reports. The primary advantages are simplicity (a single script that can be debugged in isolation) and cost-effectiveness (embeddings generated once per batch, not per change). The limitation is that content changed at 9am on Monday may not appear in search results until 9am on Tuesday. For many enterprise knowledge base applications, a 24-hour lag is entirely acceptable and not worth the engineering complexity of something faster.
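
A sketch of the batch shape, with hypothetical stand-ins for the source connector, chunker, embedding client, and vector index:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    id: str
    text: str

# Hypothetical stand-ins; replace with your actual stack.
def fetch_all_documents() -> list[Doc]: return []
def chunk(text: str, target_tokens: int) -> list[str]: return []
def embed_batch(texts: list[str]) -> list[list[float]]: return []

def nightly_ingest(index) -> None:
    """Run on a schedule, e.g. a nightly cron entry such as `0 2 * * *`."""
    for doc in fetch_all_documents():
        chunks = chunk(doc.text, target_tokens=400)
        vectors = embed_batch(chunks)       # embeddings generated once per batch run
        index.delete_by_doc_id(doc.id)      # drop stale chunks before re-inserting
        index.upsert(doc_id=doc.id, chunks=chunks, vectors=vectors)
```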

Event-driven ingestion reduces the freshness lag to minutes or hours. The trigger is a document change event: a webhook from SharePoint or Confluence when a page is edited, a database Change Data Capture (CDC) stream when a row is updated, or a message on a Kafka topic when a file is uploaded to blob storage. The pipeline consumes the event, identifies the changed document, re-processes only that document (fetch, chunk, embed), and updates the corresponding entries in the vector index. This pattern is significantly more complex to implement reliably than batch ingestion — you need to handle event ordering, duplicate events, failed processing retries, and partial index updates that leave the index in an inconsistent state. Use an idempotent document key (typically the document URL or database primary key) so that re-processing the same document multiple times converges to the correct final state. Event-driven ingestion is the right choice when freshness requirements are in the minutes-to-hours range and the source system can emit change events.
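
A sketch of the event consumer, reusing the hypothetical `chunk` and `embed_batch` helpers from the batch sketch; `fetch_document` is likewise a stand-in. The idempotent key is what makes duplicate or replayed events safe:

```python
def fetch_document(doc_id: str): ...   # stand-in: fetch current document state

def handle_change_event(index, event: dict) -> None:
    doc_id = event["document_url"]      # idempotent key per document
    if event["type"] == "deleted":
        index.delete_by_doc_id(doc_id)
        return
    doc = fetch_document(doc_id)        # re-fetch; don't trust the event payload
    chunks = chunk(doc.text, target_tokens=400)
    vectors = embed_batch(chunks)
    index.delete_by_doc_id(doc_id)      # remove stale chunks first
    index.upsert(doc_id=doc_id, chunks=chunks, vectors=vectors)
```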

Real-time streaming ingestion targets freshness of seconds and is appropriate for high-velocity content: Slack messages, customer support tickets, live news feeds, monitoring alerts. The architecture typically involves a message streaming system (Kafka, Kinesis, Pub/Sub) as the backbone, with consumers that process documents in near-real-time, mini-batching to control embedding API call costs, and index writes optimized for high throughput. The engineering cost is substantial: real-time pipelines require careful attention to backpressure (what happens when the index writer is slower than the document stream), exactly-once semantics (ensuring each document is indexed exactly once even under failure), and index write latency (many vector databases have write latency that makes sub-second indexing infeasible). A good practice is to define the “index freshness SLA” explicitly with the customer — the maximum acceptable time between a document being created or changed and that content being searchable. Define it, measure it, and alert when it is violated.
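
A sketch of a mini-batching consumer; the queue client is hypothetical. Committing offsets only after the index write gives at-least-once delivery, which the idempotent document keys make safe:

```python
import time

def consume(queue, index, batch_size: int = 32, max_wait_s: float = 2.0) -> None:
    batch: list[dict] = []
    deadline = time.monotonic() + max_wait_s
    while True:
        msg = queue.poll(timeout=0.2)                 # None if no message yet
        if msg is not None:
            batch.append(msg)
        if batch and (len(batch) >= batch_size or time.monotonic() >= deadline):
            # One embedding call per mini-batch, not per document.
            vectors = embed_batch([m["text"] for m in batch])
            index.bulk_upsert(batch, vectors)
            queue.commit(batch)                       # ack only after a durable write
            batch, deadline = [], time.monotonic() + max_wait_s
```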

Regardless of which pipeline pattern you choose, document deletion is a frequently overlooked edge case. When a document is deleted from the source system, the corresponding chunks must also be removed from the vector index. Failure to handle deletions means your RAG system will continue retrieving and citing documents that no longer exist or have been retracted — a correctness and potentially a compliance problem. The implementation requires maintaining a mapping from source document identifiers to vector index chunk IDs, so that deletion of a document triggers deletion of all its corresponding chunks. Soft deletion (marking chunks as inactive without removing them) is an option that simplifies implementation and allows recovery from accidental deletions, but requires filtering inactive chunks at query time.
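
A soft-deletion sketch, assuming the vector store exposes a metadata-update operation (the call name is hypothetical):

```python
doc_chunks: dict[str, list[str]] = {}   # doc_id -> chunk_ids, maintained at ingestion time

def soft_delete_document(index, doc_id: str) -> None:
    for chunk_id in doc_chunks.get(doc_id, []):
        index.update_metadata(chunk_id, {"active": False})
    # Queries must then filter on active == True; hard-delete later if desired.
```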

Whichever pipeline you run, the gap between document creation or modification and search availability is your index freshness SLA: surface it explicitly during scoping and put it in writing. Customers who build workflows on top of your RAG system will develop assumptions about freshness — either from what you tell them or from observing the system. Unmanaged freshness expectations are a frequent source of enterprise support tickets and customer dissatisfaction.

17.3 Authentication and Authorization in GenAI Systems

Standard web application authentication — OAuth tokens, session cookies, API keys, JWTs — handles the question of user identity in GenAI systems the same way it handles it elsewhere. What is new and subtle about GenAI systems is document-level authorization during retrieval. If a company has a mixed knowledge base where some documents are accessible to all employees and some are accessible only to specific departments, the RAG system must enforce those access controls at query time. Failing to do so results in the system leaking confidential information — financial projections to people who should not see them, HR investigations to employees not involved, legal advice to unauthorized parties. This is not a theoretical concern; it is one of the first compliance questions a CISO will ask about a RAG deployment.

The pre-filter pattern is the recommended approach for most enterprise deployments. At query time, before the vector similarity search, filter the search to only the documents the querying user has permission to access. Most production vector databases (Pinecone, Weaviate, Qdrant, pgvector) support metadata filtering that can be applied alongside vector similarity search. You store each document chunk with a metadata field indicating which permission groups can access it, and at query time you inject a filter specifying the current user’s permission groups. The limitation is performance: filtering significantly reduces the effective index size before similarity search, which can degrade recall if the user’s permission set is small. Optimizing this may require separate sub-indexes for major permission tiers, with the query layer routing to the appropriate sub-index.
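
A sketch of the pre-filter query; `index.search` and the filter syntax are generic stand-ins for what Pinecone, Weaviate, Qdrant, and pgvector each expose in their own form, and `embed_batch` is the same hypothetical embedding client as before:

```python
def embed_batch(texts: list[str]) -> list[list[float]]: return [[0.0]] * len(texts)

def retrieve(index, query: str, user_groups: list[str], k: int = 8):
    query_vector = embed_batch([query])[0]
    return index.search(
        vector=query_vector,
        top_k=k,
        # Applied alongside the ANN search, so only authorized chunks are scored.
        filter={"allowed_groups": {"any_of": user_groups}},
    )
```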

The post-filter pattern retrieves broadly without access control filtering and then removes unauthorized results from the retrieved set before passing context to the LLM. This is simpler to implement — no changes to the vector search query — but has meaningful downsides. You waste compute on retrieving documents you will discard. More critically, if many of the top-k results are unauthorized, the filtered set may not have enough relevant documents to answer the question well. A user with narrow permissions who submits a query that would be well-answered by an unauthorized document will receive a poor answer or an “I don’t know” — not because the knowledge doesn’t exist in the system, but because the retrieval strategy can’t surface it without also surfacing unauthorized content. For fine-grained permissions where individual users have unique access sets, post-filtering becomes impractical.
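
For contrast, a post-filter sketch using the same hypothetical search call; over-fetching (4 * k here) only partially compensates for results lost to filtering:

```python
def retrieve_post_filtered(index, query: str, user_groups: set[str], k: int = 8):
    candidates = index.search(vector=embed_batch([query])[0], top_k=4 * k)
    allowed = [c for c in candidates
               if user_groups & set(c.metadata["allowed_groups"])]
    return allowed[:k]   # may be sparse if many top results were unauthorized
```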

A third pattern — separate indexes per permission group — trades storage cost for simplicity. If the company has five major data classification levels (public, internal, confidential, restricted, executive), maintain five separate vector indexes. At query time, query only the indexes the user is authorized to access. This is the simplest access control model to implement and reason about, and it performs well when permission groups are coarse-grained and stable. It becomes expensive and complex when permission groups are fine-grained (per-document permissions) or change frequently (employees joining, leaving, changing roles). For large enterprises with hundreds of permission groups, this approach requires an indexing infrastructure management layer that adds significant operational overhead.

For agentic systems that execute tool calls — querying databases, writing to systems of record, sending notifications — authorization becomes even more critical. Each tool call executed by an agent should be made with the permissions of the user who initiated the agent, not with a service account. An agent that queries a database as a service account with broad permissions is effectively a privilege escalation vector: the user asks the agent something they could not query directly, the agent queries the database with its elevated permissions, and the answer is returned to the user. This is a subtle but serious security issue. The implementation requires threading the user’s credentials or session token through the agent’s tool call layer, and ensuring each tool validates permissions before execution. It is more complex than using a single service account for all agent operations, but it is the only architecturally correct approach for production enterprise systems.
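
A sketch of user-scoped tool execution; the context object, policy check, and execution helper are all illustrative:

```python
from dataclasses import dataclass

@dataclass
class ToolContext:
    user_id: str
    user_token: str   # the initiating user's credential, never a broad service account

def user_can_run(user_id: str, tool: str, args: dict) -> bool: ...  # policy check
def execute_as_user(token: str, tool: str, args: dict): ...         # user-scoped session

def run_tool(ctx: ToolContext, tool: str, args: dict):
    # Validate against the user's permissions before every execution.
    if not user_can_run(ctx.user_id, tool, args):
        raise PermissionError(f"{ctx.user_id} may not call {tool}")
    return execute_as_user(ctx.user_token, tool, args)
```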

17.4 Cost Governance

Enterprise LLM costs can scale in ways that surprise engineers who are accustomed to compute costs that scale predictably with infrastructure size. A model that costs $0.003 per 1,000 input tokens sounds negligible until you work out the enterprise arithmetic: ten thousand employees each making 100 queries per day, each query carrying a 3,000-token context (system prompt plus retrieved documents plus question), comes to $9,000 per day or $270,000 per month in input token costs alone — before counting output tokens. A poorly governed deployment where one team launches a high-volume application without budget awareness can consume the entire company’s LLM budget in a week. Cost governance is not optional for enterprise deployments; it is a first-class system requirement with the same priority as security and reliability.

Token budgets are the foundational governance mechanism. Each team or application is allocated a monthly token budget, tracked in a persistent store (a database table with team_id, budget_allocated, budget_used, budget_period). Every LLM call records its token consumption against the appropriate team’s budget. When consumption approaches the budget threshold (commonly 80% for warning, 100% for enforcement), the system either sends an alert or begins rate limiting. Implementation lives in an API gateway or middleware layer that wraps all LLM calls: the gateway looks up the team’s budget, records the call, and enforces limits. This architecture means budget policy is enforced consistently regardless of which application or code path makes the LLM call. Budget resets, overrides, and emergency allocations should be manageable through an admin interface without code deploys.
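
A gateway-side enforcement sketch following the schema above; `db`, `alerts`, and the `llm` client are hypothetical components:

```python
WARN_FRACTION = 0.80

class BudgetExhausted(Exception):
    """Surfaced upstream as HTTP 429 with a budget_exhausted error code."""

def guarded_llm_call(db, alerts, llm, team_id: str, prompt: str) -> str:
    budget = db.fetch_budget(team_id)      # allocated/used for the current period
    if budget.used >= budget.allocated:
        raise BudgetExhausted(team_id)
    response, usage = llm.complete_with_usage(prompt)
    total = usage.input_tokens + usage.output_tokens   # provider-reported counts
    db.record_usage(team_id, total)        # atomic increment on the budget row
    if budget.used + total >= WARN_FRACTION * budget.allocated:
        alerts.warn(team_id, "token budget at 80%+ utilization")
    return response
```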

Cost attribution tagging enables the budget system and also provides the diagnostic signal needed to understand cost drivers. Every LLM call should be tagged with: the application identifier (what feature or product generated this call), the team identifier (who is accountable for this cost), the user identifier (for abuse detection and per-user limits), the use case type (chat, summarization, extraction, classification — different use cases have different expected token profiles), and a session or request ID for correlation with application-level logs. These tags are attached as metadata to the LLM request and recorded in your observability system alongside the token count and cost. When a cost anomaly occurs — a spike in usage that does not correspond to a known traffic increase — the tags allow you to identify which application, which team, and which request pattern is responsible within minutes rather than hours of investigation.

Model routing is a cost optimization strategy that can reduce LLM spend by 50-80% without material quality degradation for most enterprise applications. The insight is that not all queries require the most powerful and expensive model. Simple classification tasks (“is this message a complaint or a question?”), short extraction tasks (“what is the invoice number in this document?”), and straightforward question-answering over well-structured content can be handled effectively by smaller, cheaper models. Complex reasoning, nuanced writing, multi-step analysis, and tasks requiring broad world knowledge benefit from larger, more expensive models. Implementing a routing layer — a lightweight classifier that assigns each incoming query to a cost tier, then directs it to the appropriate model — requires an upfront investment in building and validating the classifier but pays dividends at scale. The routing classifier itself should use a cheap model; having the routing decision made by an expensive model defeats the purpose.
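
A sketch of the routing layer; the tier names, model identifiers, and `call_llm` helper are illustrative:

```python
def call_llm(prompt: str, model: str) -> str:
    return "simple"   # stand-in for the provider client

TIERS = {
    "simple": "small-cheap-model",       # classification, extraction, direct lookup
    "complex": "large-expensive-model",  # reasoning, writing, multi-step analysis
}

ROUTER_PROMPT = (
    "Classify this query as 'simple' or 'complex'. "
    "Reply with exactly one word.\n\nQuery: {query}"
)

def route_and_answer(query: str) -> str:
    # The routing decision itself runs on the cheap model.
    tier = call_llm(ROUTER_PROMPT.format(query=query), model=TIERS["simple"]).strip().lower()
    model = TIERS.get(tier, TIERS["complex"])   # unknown label: default to the capable tier
    return call_llm(query, model=model)
```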

Semantic caching can dramatically reduce costs for applications where users ask similar questions repeatedly. Rather than sending every query to the LLM, cache recent query-response pairs and, at query time, compute the semantic similarity between the new query and cached queries. If a cached query is above a similarity threshold (typically 0.92-0.95 cosine similarity), return the cached response instead of generating a new one. For FAQ-style applications, knowledge base chatbots, and documentation assistants, where a large fraction of queries are near-duplicates of previous queries, cache hit rates of 30-80% are achievable. Cache hit rates should be monitored and the similarity threshold tuned: too high a threshold and the cache is underutilized; too low and you return responses to queries that were semantically similar but meaningfully different. Semantic caches should have TTLs tied to index freshness — a cached response grounded in documents that have since changed should expire when those documents are re-indexed.
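
A sketch of the cache mechanics with hypothetical `embed_batch` and `call_llm` stand-ins; the threshold and TTL follow the ranges discussed above:

```python
import time

import numpy as np

def embed_batch(texts: list[str]) -> list[list[float]]:
    return [[1.0, 0.0]] * len(texts)   # stand-in embedding client

def call_llm(prompt: str) -> str:
    return "..."                        # stand-in provider call

CACHE: list[dict] = []            # production: Redis or a small vector index
THRESHOLD, TTL_S = 0.93, 3600.0   # tune per application; tie the TTL to index freshness

def cached_answer(query: str) -> str:
    vec = np.asarray(embed_batch([query])[0], dtype=float)
    vec /= np.linalg.norm(vec)                        # normalize for cosine similarity
    now = time.time()
    for entry in CACHE:
        if entry["expires"] > now and float(vec @ entry["vec"]) >= THRESHOLD:
            return entry["response"]                  # hit: no LLM call, no token cost
    response = call_llm(query)
    CACHE.append({"vec": vec, "response": response, "expires": now + TTL_S})
    return response
```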

17.5 Observability and Anti-Patterns

Traditional service observability — p99 latency, error rate, throughput — is necessary but not sufficient for LLM systems. A GenAI service can have perfect uptime and sub-second latency while providing confidently wrong answers that erode user trust over weeks. LLM-specific observability adds a quality dimension to the conventional reliability dimension, and production deployments that neglect it will be flying blind. The goal is to know, before your customers tell you, when your system’s output quality has degraded.

LLM-specific tracing requires capturing more signal per request than a conventional trace. At minimum, log: the complete prompt sent to the model (including the system prompt, not just the user message), the complete response, the retrieved documents and their relevance scores (for RAG systems), the latency of each component (retrieval, reranking, generation), and the token counts for input and output. These traces should be correlated with a request ID that flows from the user’s browser through every component of the system, so that a specific user complaint (“the answer I got at 2:17pm was wrong”) can be reproduced exactly. Storage costs for full prompt/response logging are non-trivial at scale — a common pattern is to log 100% of requests to a hot store for 7 days (for debugging), then sample 1-5% to a cold store for longer-term analysis.
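
A sketch of the trace record; a real system would write to an observability backend rather than stdout:

```python
import json
import time

def log_trace(request_id: str, prompt: str, response: str,
              retrieved: list[dict], timings_ms: dict, tokens: dict) -> None:
    print(json.dumps({
        "request_id": request_id,           # correlates with application logs
        "ts": time.time(),
        "prompt": prompt,                   # full prompt, system message included
        "response": response,
        "retrieved": retrieved,             # [{"doc_id": ..., "score": ...}, ...]
        "timings_ms": timings_ms,           # {"retrieval": ..., "rerank": ..., "generation": ...}
        "tokens": tokens,                   # {"input": ..., "output": ...}
    }))
```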

Output quality monitoring closes the gap between “did the system respond” and “did the system respond well.” The standard approach is periodic sampling: take a random 1-5% of production responses and evaluate them with a quality model (a separate LLM call that judges the response against a rubric) or a human review queue. Track quality metrics over time and alert on meaningful degradations. Common quality metrics for RAG systems: faithfulness (does the answer contradict the retrieved documents?), answer relevance (does the answer address the question?), context relevance (did we retrieve the right documents?). These three metrics — sometimes called the RAG triad — give you structured signal on where the pipeline is failing when quality degrades.
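
A sketch of the sampling hook; the judge prompt, `call_llm` helper, and metrics sink are illustrative:

```python
import random

def call_llm(prompt: str) -> str: ...    # stand-in for the LLM-as-judge call

JUDGE_PROMPT = """Rate each criterion from 1-5 and reply as JSON:
- faithfulness: is the answer consistent with the documents?
- answer_relevance: does the answer address the question?
- context_relevance: are the documents relevant to the question?

Question: {q}
Documents: {docs}
Answer: {a}"""

def maybe_judge(metrics, q: str, docs: str, a: str, sample_rate: float = 0.02) -> None:
    if random.random() < sample_rate:
        scores = call_llm(JUDGE_PROMPT.format(q=q, docs=docs, a=a))
        metrics.record("rag_triad", scores)   # track over time; alert on degradation
```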

The most consequential anti-patterns in enterprise GenAI deployments are consistent enough to be worth naming explicitly. Chunking too small (under 150 tokens) loses context that makes individual chunks interpretable — a sentence like “This exception applies in the following cases:” is meaningless without the surrounding paragraph. Chunking too large (over 800 tokens) dilutes relevance because a large chunk contains many topics, making it a poor match for specific queries even when the relevant information is within it. The right chunk size depends on the document type and query type and should be determined empirically on the customer’s actual data rather than assumed from defaults. Skipping re-ranking is a common shortcut that meaningfully degrades retrieval quality: vector similarity scores are noisy, and the top result by cosine similarity is often not the most relevant result after accounting for document quality, freshness, and semantic specificity. A cross-encoder re-ranker (a model that jointly encodes the query and each candidate document) adds latency but significantly improves precision at rank 1, which is what matters most for answer quality.
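
A re-ranking sketch using the sentence-transformers CrossEncoder API; the checkpoint name is a commonly used public example, not a recommendation:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], k: int = 5) -> list[str]:
    # Jointly encode (query, document) pairs; more precise than raw cosine
    # similarity, at the cost of one forward pass per candidate.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:k]]
```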

Hardcoding model version identifiers in production code is a maintenance timebomb. When a model is deprecated by its provider, systems that specify the deprecated version by exact name break — often silently, with confusing error messages, at a time outside business hours. The right practice is to pin to a model alias (like “claude-sonnet-latest” or a versioned alias your gateway controls) that you can update in one place, with testing, rather than discovering all the places a model version was hardcoded when a deprecation notice arrives. Similarly, the absence of a human escalation path — when the LLM cannot answer, what happens to the user? — is a design gap that manifests as user frustration and support escalations. Every user-facing GenAI interface should have a clear, visible path to a human, and the system should actively offer that path when its confidence is low rather than returning a confusing “I don’t know.”
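
A sketch of alias pinning; all names are illustrative:

```python
MODEL_ALIASES = {                     # load from a config service or env in practice
    "default-chat": "provider-model-2025-01-15",
    "cheap-classify": "provider-small-2024-11-01",
}

def resolve_model(alias: str) -> str:
    return MODEL_ALIASES[alias]       # one place to update, test, and roll back
```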


17.6 Interview Questions

Mid Level

Q1. What is semantic caching and when does it work well vs. poorly?

Semantic caching stores recent LLM query-response pairs and, instead of calling the LLM for every new query, computes the embedding of the incoming query and searches the cache for semantically similar past queries. When similarity exceeds a threshold, the cached response is returned without an LLM call. This reduces cost and latency for applications with repetitive query patterns.

It works well for FAQ-style applications, documentation chatbots, customer support assistants, and any system where a significant fraction of users ask the same or very similar questions. Hit rates of 30-80% are achievable in these contexts. The key tuning parameter is the similarity threshold: typically 0.92-0.95 cosine similarity on normalized embeddings. Higher threshold means fewer false cache hits but lower cache utilization.

It works poorly when queries are highly varied and personalized (each user’s context is unique, so cache hits are rare), when documents in the index change frequently (cached responses may cite outdated content and need TTLs tied to document freshness), and when the application requires responses tailored to the specific user’s history or account state. It also degrades when users game or probe the system in unusual ways that the cache cannot anticipate.

Interviewers want to hear: the mechanism (embedding similarity search over cached queries), a concrete hit rate estimate for a suitable use case, the failure modes (stale responses, low hit rates for personalized queries), and the TTL/freshness coupling consideration.

Q2. Explain the tradeoffs between synchronous, streaming, and async API patterns for LLM features.

The three patterns serve different use cases and have distinct tradeoffs that should be matched to the requirements of each LLM feature.

Synchronous (request-response) is the simplest pattern: the client sends a request and waits for the complete response. It is appropriate only for very short generations (under 2-3 seconds) or when the calling system is not user-facing and can tolerate waiting. The problem for most LLM use cases is user-perceived latency: waiting 8-15 seconds for a response before seeing any output feels broken, even if the final answer is good.

Streaming via SSE resolves the perceived latency problem for user-facing features by sending tokens to the client as they are generated. The user sees output begin within 0.5-1 second, and the progressively appearing text creates a satisfying experience even if total generation time is 10+ seconds. Streaming adds complexity: error handling mid-stream requires a special event type in the stream protocol, and clients must handle partial responses gracefully. Error rates must be monitored per stream, not just per request.

Async (job queue) is the right pattern for long-running tasks: document analysis, batch summarization, complex agent runs. The API accepts the request immediately (HTTP 202), returns a job ID, and the client polls a status endpoint or registers a webhook. Async patterns decouple generation latency from client timeout constraints but require more infrastructure (job queue, worker pool, status storage, webhook delivery). For tasks over 30 seconds, async is not optional — it is the only reliable pattern.

What interviewers want to hear: clear definitions of all three, specific latency thresholds that guide the choice, error handling differences, and infrastructure requirements.

Forward Deployed Engineer

Q1. A customer wants to integrate your company’s LLM API into their existing enterprise software. What are the first five questions you ask about their architecture?

This question tests whether you approach integration technically and methodically, not just at the product level. The interviewer wants to see that you understand what makes enterprise integration hard and that you gather the right information before proposing a design.

First question: “What is the request latency budget for the integration?” User-facing features need streaming; batch processing can tolerate 30+ seconds. This single question shapes the entire API integration pattern.

Second question: “What authentication and identity system do you use, and does the LLM integration need to enforce per-user permissions?” This surfaces whether you need document-level access control and what identity provider you’re integrating with.

Third question: “Do you have existing API gateway or middleware infrastructure that LLM calls should flow through?” If they do, cost governance, logging, and rate limiting may already have a home. If they don’t, you need to design for it.

Fourth question: “What are your compliance and data residency requirements?” HIPAA, GDPR, SOC 2, or industry-specific regulations constrain which providers you can use, where data can be processed, and what logging is required or prohibited (PII).

Fifth question: “How do you handle service dependencies today — do you have circuit breakers and fallback patterns?” LLM providers have higher latency and lower SLA guarantees than internal databases. Understanding how the customer’s systems handle downstream failures tells you how much resilience work the integration requires.

What interviewers want to hear: questions that reveal real constraints rather than surface-level preferences, coverage of latency, auth, compliance, infrastructure, and reliability concerns.

Q2. The customer’s RAG index goes stale within hours of document updates. Design a near-real-time ingestion pipeline that gets freshness down to under 5 minutes.

This is a systems design question. Walk through the architecture clearly, justifying each component.

The core insight is that batch ingestion cannot achieve 5-minute freshness; the customer needs an event-driven architecture. The pipeline:

  1. Change Detection Layer — subscribe to change events from the document source. For SharePoint or Confluence, this means registering webhooks that fire on document create/update/delete. For a database, set up Change Data Capture (CDC) using a tool like Debezium. For a file system, use inotify or cloud storage event notifications.
  2. Event Queue — route change events to a durable message queue (SQS, Pub/Sub, Kafka). The queue decouples ingestion rate from processing speed and provides a retry mechanism for failed processing.
  3. Document Processor — consumers pull events from the queue, fetch the full document from the source, run the ingestion pipeline (text extraction, chunking, embedding generation), and write the updated chunks to the vector index. Use the document’s unique identifier as an idempotent key so reprocessing a document produces the same result.
  4. Index Write — update the vector index: delete stale chunks (by document ID), then insert the new chunks.
  5. Deletion Handling — when a delete event arrives, remove all chunks with the matching document ID from the index.

Latency breakdown for this pipeline: webhook delivery ~1-10 seconds, queue processing ~1-5 seconds, embedding generation ~10-30 seconds, index write ~1-5 seconds. Total: approximately 13-50 seconds end-to-end — comfortably within the 5-minute target.

What interviewers want to hear: event-driven rather than polling, idempotent processing, explicit deletion handling, and a realistic latency estimate.

Q3. How would you implement document-level access control in a RAG system where different employees have different document permissions? Walk through two approaches and compare them.

Document-level access control in RAG is a common enterprise requirement and a frequent interview topic for FDE roles because it sits at the intersection of security, systems design, and retrieval quality.

Approach 1 — Pre-filter retrieval: At indexing time, store each chunk with a metadata field listing the permission groups that can access it (e.g., allowed_groups: ["finance", "executives"]). At query time, inject a metadata filter into the vector search that restricts results to documents matching the querying user’s group memberships. This is executed inside the vector database, so only authorized documents are scored and returned. Advantages: no unauthorized content ever enters the LLM context (defense in depth), correct recall is maintained for the user’s authorized set, computationally efficient for coarse-grained permissions. Disadvantages: requires the vector database to support metadata filtering alongside ANN search (most production databases do, but it’s a capability to verify), and permission updates (a document’s access list changes) require reindexing affected chunks.

Approach 2 — Post-filter retrieval: Retrieve the top-k most relevant chunks regardless of permissions, then filter out chunks the user cannot access before constructing the LLM context. Simpler to implement — no changes to the retrieval query. Disadvantages: unauthorized content is fetched (wasted compute, potential log exposure), and if many top-k results are unauthorized, the LLM context may be impoverished, degrading answer quality. For users with narrow permissions in a mixed knowledge base, this degrades noticeably.

Recommendation: Pre-filter for production enterprise deployments with genuine access control requirements. Post-filter as a quick prototype only. For extremely fine-grained permissions (per-user document access), consider hybrid: coarse pre-filter by department, post-filter for exact document-level permissions.

What interviewers want to hear: both approaches clearly defined, honest tradeoffs, and a defensible recommendation.

Q4. Design a cost governance system for a company deploying LLM features across 10 internal teams with a shared $50k/month budget.

Cost governance is a systems design problem with three layers: allocation, enforcement, and visibility. Walk through each.

Allocation: Divide the $50k/month budget across 10 teams. The allocation process should be governance-driven (finance + leadership approve team allocations), not purely technical. For the design, assume each team gets a monthly token budget (translating dollars to tokens at your provider’s rates). Teams with high-volume use cases (customer support, internal assistants with many users) get larger allocations. Track budgets in a relational database: table with columns team_id, budget_tokens_allocated, budget_tokens_used, budget_period, overage_policy.

Enforcement: All LLM calls flow through a central API gateway. Before forwarding to the LLM provider, the gateway: looks up the requesting application’s team_id (from the API key registry), checks the team’s remaining budget for the current period, either proceeds (budget available) or returns HTTP 429 with a budget_exhausted error code (budget depleted). The gateway records token consumption atomically after each call, using the provider’s actual token count from the response. Warning alerts fire at 70% utilization; hard limits fire at 100%. Emergency budget override process: team lead requests via ticketing system, approved by finance within 4 business hours.

Visibility: A cost dashboard shows each team’s real-time spend, projected month-end spend, and per-application breakdown. This is the single most impactful governance tool — most overspend is accidental, and teams that can see their consumption self-correct. Weekly automated email to each team lead with their spend summary.

Model routing as a multiplier: implement a query classifier that routes simple queries to cheaper models, multiplying effective budget. For 10 teams with mixed query complexity, this commonly achieves 40-60% cost reduction.

What interviewers want to hear: all three layers (allocation, enforcement, visibility), specific implementation components, and at least one optimization mechanism.

Q5. A customer’s LLM feature has 99% uptime but customers keep complaining about “it not working.” What does your observability gap look like and how do you close it?

This question distinguishes candidates who understand LLM-specific quality from those who think about reliability only in traditional terms. 99% uptime means the service is responding — it says nothing about whether the responses are useful, accurate, or coherent.

The observability gap has three layers. First, no output quality monitoring. The system is logging request count and latency (uptime metrics) but not evaluating whether responses are correct or helpful. Users who get wrong answers don’t generate 500 errors; they just report “it’s not working.” To close this: implement response quality sampling — take 2% of production responses and run them through a quality evaluator (either an LLM-as-judge or a domain-specific quality rubric). Track quality scores over time; alert on degradation.

Second, no faithfulness monitoring for RAG. The system may be retrieving the wrong documents (low context relevance) or generating answers that contradict the retrieved documents (low faithfulness). Either failure looks like “wrong answers” to users. Instrument the retrieval pipeline to log retrieval scores and track faithfulness as a separate metric. A faithfulness drop often precedes a wave of user complaints.

Third, no user feedback loop. Users experiencing failures have no mechanism to report them at the point of failure (no thumbs up/down, no escalation button). Without structured user feedback, you only hear about quality issues through support tickets — too slow and too aggregated to act on. Add inline feedback at the response level and route low-rated responses to a review queue.

Closing the gap: quality sampling + faithfulness monitoring + user feedback collection, with alerting and weekly review cadence.

What interviewers want to hear: recognition that uptime is not quality, specific quality metrics (faithfulness, relevance), concrete monitoring mechanisms, and a user feedback loop.

17.7 Further Reading