Intelligent Knowledge Systems — RAG Architecture

What Separates a RAG Proof of Concept from a Production System

Most enterprise RAG pilots fail in production because the architecture was designed for demos, not for the access controls, scale, and observability that real deployments require.

The Production Gap

78% of enterprise RAG pilots never reach production deployment.
6 architectural layers are required for a production-grade RAG system.
3x more likely to succeed when the architecture is designed before the first line of code.
40% of RAG accuracy failures trace back to poor chunking strategy, not model quality.

Where RAG Breaks Down

The Three Architecture Mistakes That Kill Enterprise RAG

Teams that build RAG systems without a production architecture blueprint hit the same three walls. The mistakes are predictable -- and entirely avoidable.

Retrieval designed for demos, not for enterprise data

Prototype RAG systems use simple cosine similarity over a small, clean document set. Production environments have hundreds of thousands of documents, inconsistent formatting, multiple languages, and access restrictions that a demo index cannot represent. The retrieval layer breaks the moment real data is loaded.

No access control layer in the retrieval pipeline

The most common enterprise RAG failure is retrieval without permission filters. A user submits a query, the system retrieves the most relevant chunks, and some of those chunks come from documents the user is not authorized to see. The model then synthesizes an answer from confidential content.

No observability means no ability to improve

A RAG system with no logging, no accuracy metrics, and no feedback loop cannot be debugged or improved. When answers degrade -- because the knowledge base changes, the query distribution shifts, or the retrieval model drifts -- there is no signal to detect it.

Reference Architecture

The Six Layers of a Production Enterprise RAG System

Every production RAG deployment ClarityArc architects includes these six layers. Omit any one of them and you have a system that works in the lab and fails in production.

Layer 1: Query Understanding & Intent Classification

Incoming queries are parsed for intent, rewritten for retrieval precision, and routed to the appropriate retrieval path. Ambiguous queries trigger clarification before retrieval begins. This layer prevents irrelevant chunks from entering context.
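
A minimal sketch of what this layer does, with simple heuristics standing in for an LLM-based intent classifier. The intent labels, the acronym expansion, and the ambiguity rule are illustrative assumptions, not a prescribed implementation.

```python
from dataclasses import dataclass

@dataclass
class QueryPlan:
    intent: str                # e.g. "lookup", "comparison", "procedural"
    needs_clarification: bool  # True routes to a clarifying question instead of retrieval
    rewritten_query: str       # expanded / normalized form sent to the retriever

def understand_query(raw_query: str) -> QueryPlan:
    q = raw_query.strip()
    # Very short queries are treated as ambiguous and trigger clarification.
    needs_clarification = len(q.split()) < 3
    lowered = q.lower()
    if any(kw in lowered for kw in ("how do i", "steps to", "procedure")):
        intent = "procedural"
    elif " vs " in lowered or "compare" in lowered:
        intent = "comparison"
    else:
        intent = "lookup"
    # Lightweight rewrite: expand an internal acronym before embedding (illustrative).
    rewritten = q.replace("PTO", "paid time off")
    return QueryPlan(intent, needs_clarification, rewritten)

print(understand_query("PTO carryover policy rules"))
```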

Layer 2: Hybrid Retrieval -- Dense Vector + Sparse Keyword

Azure AI Search runs dense vector search (semantic similarity) and sparse BM25 keyword search in parallel. Results are fused using Reciprocal Rank Fusion. This hybrid approach outperforms either method alone by 15 to 30% on enterprise document corpora.
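
Azure AI Search performs this fusion server-side when a hybrid query is issued, but the underlying math is simple. A minimal sketch of Reciprocal Rank Fusion, with illustrative document IDs:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked result lists by summing 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["doc-17", "doc-04", "doc-22"]   # from vector similarity search
sparse_hits = ["doc-04", "doc-31", "doc-17"]  # from BM25 keyword search
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))  # doc-04 ranks first
```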

Layer 3: Permission Filtering at Query Time

Retrieved chunks are filtered against the requesting user's Entra ID permissions before any content reaches the generation layer. No chunk that the user could not access manually is ever injected into the model context. This is enforced at the retrieval API level, not in the prompt.
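
A minimal sketch of query-time security trimming with the azure-search-documents Python SDK, assuming the index carries a filterable group_ids field and that the caller has already resolved the user's Entra ID group memberships (for example via Microsoft Graph). The endpoint, index name, and group IDs are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint="https://<your-service>.search.windows.net",  # placeholder
    index_name="knowledge-chunks",                          # placeholder
    credential=DefaultAzureCredential(),
)

# Group IDs resolved from Entra ID for the requesting user (e.g. via Microsoft Graph).
user_groups = ["grp-finance", "grp-all-staff"]
groups_csv = ",".join(user_groups)

# OData filter: only chunks whose group_ids intersect the user's groups are returned,
# so unauthorized content never reaches the generation layer.
results = search_client.search(
    search_text="data retention policy",
    filter=f"group_ids/any(g: search.in(g, '{groups_csv}'))",
    top=20,
)
```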

Layer 4: Cross-Encoder Reranking & Context Assembly

A cross-encoder reranking model scores retrieved chunks against the original query for relevance -- a more expensive but more accurate signal than the initial retrieval score. Top-k chunks are assembled into a context window with source attribution metadata preserved.
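
A minimal reranking sketch using a public cross-encoder from the sentence-transformers library. The model name, query, and chunks are illustrative; in production the candidates would come from the hybrid retrieval layer.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example public model

query = "What is the retention period for customer records?"
retrieved_chunks = [  # in production: output of the hybrid retrieval layer
    "Customer records are retained for seven years after contract termination.",
    "The cafeteria is open from 8:00 to 15:00 on weekdays.",
    "Backups of customer databases are encrypted at rest.",
]

# Score every (query, chunk) pair, then keep only the top-k for context assembly.
scores = reranker.predict([(query, chunk) for chunk in retrieved_chunks])
top_k = sorted(zip(retrieved_chunks, scores), key=lambda pair: pair[1], reverse=True)[:2]
for chunk, score in top_k:
    print(f"{score:.2f}  {chunk}")
```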

Layer 5: Grounded Generation with Abstention Logic

The language model generates a response strictly from the assembled context. System prompts enforce citation requirements and trigger abstention when the retrieved evidence falls below a confidence threshold. Output includes source references traceable to exact document chunks.
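
A minimal sketch of the abstention pattern: the model is only called when reranked evidence clears a score threshold, and the grounding instructions travel in the system prompt. The threshold value and the generate_answer placeholder are assumptions, not a specific API.

```python
ABSTAIN_THRESHOLD = 0.35  # illustrative; tuned against your evaluation set

def generate_answer(system_prompt: str, context: str, query: str) -> str:
    # Placeholder for the actual LLM call (e.g. Azure OpenAI chat completions).
    return f"(model call) context_chars={len(context)} query={query!r}"

def answer_or_abstain(query: str, ranked_chunks: list[tuple[str, float]]) -> str:
    # Keep only chunks whose rerank score clears the evidence threshold.
    evidence = [(text, score) for text, score in ranked_chunks if score >= ABSTAIN_THRESHOLD]
    if not evidence:
        return "Insufficient evidence in the knowledge base to answer this question."
    context = "\n\n".join(f"[{i + 1}] {text}" for i, (text, _) in enumerate(evidence))
    system_prompt = (
        "Answer using only the numbered sources below. Cite sources as [n]. "
        "If the sources do not contain the answer, say that the evidence is insufficient."
    )
    return generate_answer(system_prompt, context, query)
```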

Layer 6: Observability, Evaluation & Feedback Loop

Every query, retrieval result, and generated response is logged. Faithfulness, context recall, and answer relevance are measured continuously via Azure Monitor and a RAG evaluation framework. Accuracy regression triggers automated alerts and weekly review reports.
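
A minimal sketch of the logging side of this loop: one structured record per request, written as JSON lines. Field names are illustrative; in the Azure stack these records would flow to Azure Monitor rather than a local file, and the faithfulness score would be filled in by a separate evaluation job.

```python
import json
import time
import uuid

def log_rag_interaction(query, retrieved_chunk_ids, answer,
                        faithfulness=None, path="rag_log.jsonl"):
    """Append one structured record per request for later evaluation and audit."""
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "query": query,
        "retrieved_chunk_ids": retrieved_chunk_ids,
        "answer": answer,
        "faithfulness": faithfulness,  # filled in later by the evaluation job
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

log_rag_interaction("data retention policy",
                    ["doc-17#3", "doc-04#1"],
                    "Customer records are retained for seven years [1].")
```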

Component Design

Five Architecture Decisions That Determine RAG Performance

The six-layer model defines what must be present. These five decisions define how well each layer performs. Getting them wrong is the difference between a 60% accurate system and a 90% accurate one.

Decision 01: Chunking Strategy

How documents are split determines what the retrieval layer can find. Fixed-size chunking is easy to implement but destroys sentence and paragraph context. Semantic chunking splits at natural boundaries and preserves the reasoning units that make answers accurate.

Great practice: Semantic chunking with 20% overlap and parent-chunk retrieval for context expansion.
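
A minimal sketch of paragraph-aware chunking with roughly 20% overlap, measured in characters for simplicity. A production splitter would count tokens and handle tables, headings, and oversized paragraphs explicitly; the sizes here are illustrative.

```python
def semantic_chunks(text: str, max_chars: int = 1200, overlap_ratio: float = 0.2) -> list[str]:
    """Pack paragraphs into chunks, carrying a ~20% tail forward as overlap."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, size = [], [], 0
    for para in paragraphs:
        if current and size + len(para) > max_chars:
            chunks.append("\n\n".join(current))
            # Carry trailing paragraphs forward as overlap, up to the overlap budget.
            overlap, overlap_size = [], 0
            for prev in reversed(current):
                if overlap_size + len(prev) > max_chars * overlap_ratio:
                    break
                overlap.insert(0, prev)
                overlap_size += len(prev)
            current, size = overlap, overlap_size
        current.append(para)  # an oversized single paragraph becomes its own chunk
        size += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```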

Decision 02: Embedding Model Selection

The embedding model converts text into vectors for similarity search. General-purpose models underperform on domain-specific terminology -- a critical issue in energy, legal, and financial environments where precise language matters.

Great practice: Domain-adapted embeddings fine-tuned on a sample of your actual document corpus.
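
One lightweight way to pressure-test a candidate embedding model before committing: check whether it ranks the relevant passage first for a handful of domain-specific queries drawn from your own corpus. A sketch using sentence-transformers; the model names, test pairs, and distractors are illustrative.

```python
from sentence_transformers import SentenceTransformer, util

# Query / relevant-passage pairs drawn from your own corpus (illustrative examples).
test_set = [
    ("curtailment compensation rules",
     "Producers receive compensation when the grid operator curtails wind output."),
    ("PPA indexation clause",
     "The power purchase agreement price is indexed to the consumer price index annually."),
]
distractors = [
    "The office parking garage closes at 22:00 on weekdays.",
    "Quarterly all-hands meetings are held in the main auditorium.",
]

for model_name in ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]:
    model = SentenceTransformer(model_name)
    hits = 0
    for query, relevant in test_set:
        passages = [relevant] + distractors
        scores = util.cos_sim(model.encode(query, convert_to_tensor=True),
                              model.encode(passages, convert_to_tensor=True))[0]
        hits += int(scores.argmax() == 0)  # the relevant passage should rank first
    print(f"{model_name}: {hits}/{len(test_set)} domain queries matched")
```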

Decision 03: Index Architecture

A single flat index works for prototypes. Production systems require separate indexes per security classification, with metadata fields enabling filtered retrieval by document type, date, business unit, and classification level.

Great practice: Hierarchical index structure with security-tier separation and metadata schema defined before ingestion.
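
A minimal sketch of such a metadata schema using the azure-search-documents index client. Vector field and analyzer configuration are omitted for brevity, and the endpoint, index, and field names are placeholders rather than a prescribed layout.

```python
from azure.identity import DefaultAzureCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchField, SearchFieldDataType, SearchIndex, SearchableField, SimpleField,
)

fields = [
    SimpleField(name="chunk_id", type=SearchFieldDataType.String, key=True),
    SearchableField(name="content"),
    SimpleField(name="document_type", type=SearchFieldDataType.String,
                filterable=True, facetable=True),
    SimpleField(name="business_unit", type=SearchFieldDataType.String, filterable=True),
    SimpleField(name="classification", type=SearchFieldDataType.String, filterable=True),
    SimpleField(name="last_modified", type=SearchFieldDataType.DateTimeOffset,
                filterable=True, sortable=True),
    SearchField(name="group_ids",
                type=SearchFieldDataType.Collection(SearchFieldDataType.String),
                filterable=True),  # used by the query-time permission filter
]

client = SearchIndexClient(endpoint="https://<your-service>.search.windows.net",  # placeholder
                           credential=DefaultAzureCredential())
# One index per security tier; the name encodes the classification level.
client.create_or_update_index(SearchIndex(name="knowledge-chunks-internal", fields=fields))
```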

Decision 04: Context Window Management

Injecting too many chunks degrades generation quality -- the model loses focus across a long context. Too few chunks increases the risk of incomplete answers. The optimal top-k value varies by query type and must be tuned against your actual query distribution.

Great practice: Dynamic top-k selection based on query confidence score, with a hard ceiling to prevent context dilution.
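
A minimal sketch of dynamic top-k selection: include reranked chunks until relevance falls off sharply relative to the best hit, with a hard ceiling on context size. The ratio and ceiling are illustrative and assume scores normalized to the 0-1 range.

```python
def select_context(ranked: list[tuple[str, float]],
                   max_k: int = 8, min_ratio: float = 0.5) -> list[str]:
    """Take chunks until relevance drops below min_ratio of the best score,
    never exceeding max_k. Scores are assumed normalized to the 0-1 range."""
    if not ranked:
        return []
    best_score = ranked[0][1]
    selected = []
    for text, score in ranked[:max_k]:        # hard ceiling against context dilution
        if score < best_score * min_ratio:    # stop at the relevance cliff
            break
        selected.append(text)
    return selected
```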

Decision 05: Freshness and Re-indexing Pipeline

A RAG system is only as accurate as its index. Documents change, policies are updated, and old content becomes misleading. Without a scheduled re-indexing pipeline and freshness metadata, the system degrades silently over time.

Great practice: Incremental indexing on document change events plus weekly full-index validation with staleness alerts.
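
A minimal sketch of the freshness check behind such a pipeline: a document is re-indexed when its source changed after the last indexing run, or when it has not been revalidated within the staleness window. The timestamps and the seven-day window are illustrative.

```python
from datetime import datetime, timedelta, timezone

STALENESS_WINDOW = timedelta(days=7)  # illustrative; matches a weekly validation cadence

def needs_reindex(last_indexed: datetime, source_modified: datetime,
                  now: datetime | None = None) -> bool:
    """True when the source changed after indexing, or validation is overdue."""
    now = now or datetime.now(timezone.utc)
    changed_since_index = source_modified > last_indexed
    validation_overdue = now - last_indexed > STALENESS_WINDOW
    return changed_since_index or validation_overdue

last_indexed = datetime(2024, 5, 1, tzinfo=timezone.utc)
source_modified = datetime(2024, 5, 10, tzinfo=timezone.utc)
print(needs_reindex(last_indexed, source_modified))  # True: source changed after indexing
```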

What Separates Good from Great

Prototype Architecture vs. Production Architecture

Most vendors deliver a working prototype. ClarityArc designs for production from the first architectural session. The difference shows up in production -- not in the demo.

Dimension | Typical Prototype | ClarityArc Production Architecture
Chunking | Fixed 512-token splits, no overlap strategy | Semantic chunking with overlap, parent-chunk retrieval, per-document-type tuning
Retrieval | Vector similarity only, single index | Hybrid dense + sparse retrieval, RRF fusion, security-tiered indexes
Access Control | Not implemented -- all users see all content | Entra ID permission filters enforced at the retrieval API, not in the prompt
Reranking | Initial retrieval score used as-is | Cross-encoder reranking on top-k candidates before context assembly
Abstention | Model always generates an answer regardless of evidence | Confidence threshold triggers abstention and a user-facing "insufficient evidence" response
Observability | Application logs only, no accuracy measurement | Faithfulness, recall, and relevance scored on every response; weekly accuracy reports
Re-indexing | Manual, ad hoc, no freshness tracking | Incremental event-driven re-indexing with staleness alerts and weekly full-index validation

Self-Assessment

Is Your RAG Architecture Production-Ready?

Run through this checklist before declaring your RAG system ready for enterprise deployment. If any item is missing, you have a production risk.

1. Chunking strategy is document-type-aware -- not a single fixed-size split applied uniformly across all content.

2. Retrieval uses hybrid dense + sparse search with result fusion -- not vector similarity alone.

3. Permission filters are enforced at the retrieval API layer and tied to your identity provider.

4. A cross-encoder reranking step scores retrieved chunks before they enter the context window.

5. The generation layer has abstention logic and will decline to answer when evidence is insufficient.

6. Every response is logged with the query, retrieved chunks, and generated output for auditability.

7. Faithfulness and context recall are measured continuously -- not just checked at launch.

8. A re-indexing pipeline runs on a defined schedule with staleness alerts for outdated documents.

Common Questions

Enterprise RAG Architecture: What Teams Ask Us

How long does it take to design a production RAG architecture?

A full architecture design engagement runs two to three weeks, covering knowledge source inventory, access control mapping, chunking strategy, index design, and observability requirements. The output is an architecture document and an Azure deployment blueprint -- not a slide deck. See our RAG implementation consulting page for engagement structure.

Can we retrofit a production architecture onto an existing RAG prototype?

In most cases, yes -- but it requires a structured audit first. Chunking strategy and index architecture are difficult to retrofit without re-ingesting the knowledge base. Access controls and reranking can typically be layered on top of an existing retrieval pipeline with less disruption. ClarityArc's grounding and hallucination prevention engagement starts with exactly this kind of retrofit assessment.

Does this architecture require Azure, or does it work with other clouds?

The six-layer model is cloud-agnostic in principle, but ClarityArc's implementation practice is built on Azure -- Azure OpenAI for generation, Azure AI Search for hybrid retrieval, and Azure Monitor for observability. This stack is the best-integrated enterprise RAG platform available for Microsoft 365 organizations. See our Azure OpenAI enterprise consulting page for stack detail.

What is the most common architecture mistake you see in existing deployments?

Missing access controls at the retrieval layer. Organizations implement retrieval, generation, and even observability -- but skip permission filtering because it requires integrating with Entra ID and feels like an IT problem rather than an AI problem. It is both, and skipping it creates regulatory and reputational exposure the moment a user receives content they should not see.

How does RAG architecture relate to our existing SharePoint and knowledge management setup?

SharePoint is typically the primary knowledge source, not a barrier. The RAG architecture sits on top of your existing SharePoint environment -- indexing content via Microsoft Graph, preserving your existing permission structure, and delivering AI-powered retrieval without requiring you to migrate or restructure your content. See our SharePoint AI knowledge retrieval page for integration detail.

Ready to Architect a RAG System Built for Production?

ClarityArc delivers a full architecture design engagement in two to three weeks -- knowledge source inventory, access control mapping, index design, and Azure deployment blueprint included.