rag · enterprise-search · knowledge

RAG for Enterprise Search: A Complete Guide

Retrieval-Augmented Generation transforms how organizations search internal data. This guide covers hybrid search, knowledge graphs, and how RAG powers enterprise AI.

Wardian Team · March 25, 2026 · 8 min read

Retrieval-Augmented Generation (RAG) is a technique that connects a large language model to your organization's documents, so it can answer questions using your actual data instead of its training knowledge alone. For enterprises drowning in scattered information, RAG is the difference between an AI that guesses and one that knows.

Why Traditional Enterprise Search Fails

Every company has the same problem. Knowledge lives in email threads, Slack messages, Google Docs, Confluence pages, Jira tickets, PDF reports, and spreadsheets. Traditional search tools — whether it is Google Workspace search, Confluence search, or a SharePoint index — all share the same limitations:

  • Keyword matching is brittle. Searching for "budget approval" misses the document titled "Q2 Financial Sign-Off." The concepts are the same; the words are not.
  • Each tool is a silo. Confluence search does not know about your emails. Slack search does not surface Jira tickets. You need to search five tools and mentally combine the results.
  • No synthesis. Search returns a list of links. You still have to open each document, read it, and extract the answer yourself. For a question like "What was the final decision on the pricing model?", the answer might be spread across three emails and a meeting note.

RAG addresses all three problems. It understands meaning (not just keywords), works across data sources, and synthesizes a direct answer from the retrieved content.

How RAG Works: The Basics

A RAG system has two phases: indexing and retrieval.

Indexing Phase

During indexing, your documents are processed and stored in a way that enables semantic search:

  1. Chunking — Documents are split into smaller pieces (chunks), typically 200-500 tokens each. This is not random splitting; good chunking respects paragraph boundaries, section headers, and logical units of meaning.
  2. Embedding — Each chunk is converted into a dense vector (a list of numbers) using an embedding model. Semantically similar text produces similar vectors, regardless of exact wording.
  3. Storage — The vectors are stored in a vector database (such as PostgreSQL with pgvector, Pinecone, or Weaviate) alongside the original text and metadata (source, date, author, permissions).
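The three indexing steps can be sketched in a few lines. This is a minimal illustration, not a production indexer: the paragraph-packing chunker is simplistic, and the hash-based `embed` is a deterministic stand-in for a real embedding model.

```python
import hashlib

def chunk(text: str, max_tokens: int = 500) -> list[str]:
    """Split on blank lines (paragraph boundaries), packing paragraphs
    into chunks of at most ~max_tokens whitespace-delimited tokens."""
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        n = len(para.split())
        if current and count + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

def embed(text: str, dim: int = 8) -> list[float]:
    """Stand-in for a real embedding model: a deterministic pseudo-vector.
    A real model maps similar meanings to nearby vectors; this does not."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:dim]]

# "Storage": a list of (vector, text, metadata) records standing in
# for rows in a vector database.
index = []
doc = "First paragraph about budgets.\n\nSecond paragraph about sign-off."
for c in chunk(doc, max_tokens=5):
    index.append({"vector": embed(c), "text": c,
                  "meta": {"source": "finance/q2.pdf"}})
```

The metadata attached to each record (source, permissions, dates) is what makes the later filtering and freshness steps possible.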

Retrieval Phase

When a user asks a question:

  1. The question is embedded using the same model.
  2. The vector database finds the most similar chunks (nearest neighbor search).
  3. The top chunks are passed to the LLM as context.
  4. The LLM generates an answer grounded in the retrieved content.
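The retrieval steps above can be sketched with a toy in-memory index, using cosine similarity as the nearest-neighbor search a vector database would perform. The two-dimensional vectors and the prompt format are illustrative only.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, index, k=3):
    """Nearest-neighbor search: rank stored chunks by similarity to the query."""
    ranked = sorted(index, key=lambda r: cosine(query_vec, r["vector"]),
                    reverse=True)
    return ranked[:k]

# Toy index; in production these vectors come from the same embedding
# model used to embed the query.
index = [
    {"vector": [1.0, 0.0], "text": "Q2 budget was approved on May 4."},
    {"vector": [0.0, 1.0], "text": "The office plants need watering."},
]
top = retrieve([0.9, 0.1], index, k=1)
prompt = ("Answer using only this context:\n"
          + "\n".join(r["text"] for r in top)
          + "\n\nQuestion: When was the budget approved?")
```

The assembled `prompt` is what gets sent to the LLM in step 4: the model answers from the retrieved chunks, not from its training data.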

This is basic RAG. It works reasonably well for simple questions and clean document collections. But enterprises need more.

Hybrid Search: Combining Dense and Sparse Retrieval

Pure vector search has a weakness: it can miss exact matches. If someone asks for "ticket PROJ-4521," a semantic embedding might not prioritize the document containing that exact identifier. Conversely, keyword search (BM25) excels at exact matches but fails at semantic similarity.

Hybrid search combines both:

  • Dense retrieval (vector search) catches semantic matches — finding documents about "budget approval" when you search "financial sign-off."
  • Sparse retrieval (BM25) catches exact matches — finding the specific ticket number, email address, or product code.
  • Cross-encoder reranking takes the combined results and reorders them using a more powerful model that reads the query and each candidate together. This is computationally expensive (you cannot run it over millions of documents), but applied to the top 50-100 candidates from the first stage, it dramatically improves precision.

The pipeline looks like this:

Query → [Dense Retrieval] → Top 50 candidates
      → [BM25 Retrieval]  → Top 50 candidates
      → [Merge + Dedupe]  → ~80 unique candidates
      → [Cross-Encoder Rerank] → Top 10 results
      → [LLM generates answer from Top 10]

In practice, hybrid search with reranking consistently outperforms either method alone by 15-25% on retrieval benchmarks.
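One common way to merge the dense and sparse result lists (one option among several, not necessarily what any given product uses) is reciprocal rank fusion, which combines rankings without having to reconcile incompatible score scales:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists: each document earns 1/(k + rank)
    from every list it appears in, so items ranked well by both
    retrievers rise to the top."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

dense = ["doc_7", "doc_2", "doc_9"]   # from vector search
sparse = ["doc_2", "doc_4", "doc_7"]  # from BM25
merged = reciprocal_rank_fusion([dense, sparse])
# doc_2 and doc_7 appear in both lists, so they lead the merged ranking;
# a cross-encoder would then rescore this shortlist for the final top 10.
```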

Knowledge Graphs: When RAG Is Not Enough

RAG retrieves text chunks. But some questions require understanding relationships that span documents.

Consider: "Who are the engineers working on features that depend on the billing module?" No single document contains this answer. You need to know which features depend on billing (from architecture docs), and who is assigned to those features (from Jira). This is a multi-hop query that requires traversing relationships.

Knowledge graphs solve this by extracting entities (people, projects, concepts, tools) and their relationships from your documents, then storing them as a graph structure. When the agent receives a question that requires relational reasoning, it can traverse the graph:

billing_module --depends_on--> payment_service
payment_service --assigned_to--> Alice
payment_service --assigned_to--> Bob
billing_module --depends_on--> invoice_generator
invoice_generator --assigned_to--> Carol

The query "Who works on billing dependencies?" becomes a graph traversal that returns Alice, Bob, and Carol — something vector search alone could never produce reliably.
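The traversal itself is straightforward once the edges exist. A minimal sketch over the example triples above (a real system would use a graph database rather than a Python list):

```python
# Edges from the example above, stored as (subject, relation, object) triples.
edges = [
    ("billing_module", "depends_on", "payment_service"),
    ("payment_service", "assigned_to", "Alice"),
    ("payment_service", "assigned_to", "Bob"),
    ("billing_module", "depends_on", "invoice_generator"),
    ("invoice_generator", "assigned_to", "Carol"),
]

def neighbors(node: str, relation: str) -> list[str]:
    """Follow one relation type outward from a node."""
    return [obj for subj, rel, obj in edges if subj == node and rel == relation]

# Multi-hop query: billing_module -depends_on-> X -assigned_to-> who?
people = [person
          for dep in neighbors("billing_module", "depends_on")
          for person in neighbors(dep, "assigned_to")]
# people == ["Alice", "Bob", "Carol"]
```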

Combining RAG and Knowledge Graphs

The most effective enterprise knowledge systems use both:

  • RAG for document-level retrieval: "Find me the Q2 financial report" or "What did the CEO say about expansion plans?"
  • Knowledge graph for relational queries: "Which teams are affected if we change the authentication API?" or "What projects has the marketing team collaborated on with engineering?"

The LLM decides which to use based on the question type, or queries both and merges results.
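In production the LLM itself typically picks a tool, but the decision boundary can be illustrated with a crude keyword heuristic (the cue phrases below are invented for the sketch):

```python
def route(question: str) -> str:
    """Toy router: relational phrasing goes to the knowledge graph,
    document-lookup phrasing goes to RAG."""
    relational_cues = ("which teams", "who works on", "depends on",
                       "collaborated", "affected if")
    if any(cue in question.lower() for cue in relational_cues):
        return "knowledge_graph"
    return "rag"

a = route("Which teams are affected if we change the authentication API?")
b = route("Find me the Q2 financial report")
```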

Memory: Making the Agent Personal

Enterprise search is not one-size-fits-all. When you ask "What is the status of my project?", the agent needs to know which project is yours. When you say "Use the same format as last time," it needs to remember what format you used.

A memory system stores per-user facts extracted from conversations:

  • Preferences — "I prefer bullet-point summaries" or "Always include ticket numbers"
  • Context — "I'm the lead on Project Phoenix" or "I report to Sarah"
  • Decisions — "We decided to use Stripe for payments on March 3rd"

These facts are injected into every interaction as "hot context," making the agent feel like it genuinely knows you rather than starting fresh each time.

Memory facts have different lifespans. A preference persists indefinitely. A project status might refresh weekly. A meeting decision stays relevant for months. Good memory systems attach TTL (time-to-live) metadata to each fact and refresh them from source data.

Implementation Considerations

Building RAG for enterprise use involves challenges that demos and tutorials skip over.

Chunking Strategy Matters

Naive chunking (splitting every 500 tokens) destroys context. A paragraph about "the decision" that references "the proposal discussed above" becomes meaningless when separated from the preceding text. Effective approaches include:

  • Semantic chunking — splitting at natural boundaries (paragraphs, sections) and including contextual headers with each chunk
  • Contextual enrichment — prepending a brief summary of the document and section to each chunk, so it makes sense in isolation
  • Overlapping windows — including some text from adjacent chunks to preserve continuity
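The last two ideas can be sketched together. The header format in `enrich` is a hypothetical convention, and real systems count model tokens rather than words:

```python
def sliding_chunks(tokens: list[str], size: int = 200, overlap: int = 50):
    """Overlapping windows: each chunk repeats the last `overlap`
    tokens of the previous chunk to preserve continuity."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

def enrich(chunk_tokens: list[str], doc_title: str, section: str) -> str:
    """Contextual enrichment: prepend a header so the chunk still
    makes sense when retrieved in isolation."""
    return f"[{doc_title} > {section}]\n" + " ".join(chunk_tokens)

tokens = [f"t{i}" for i in range(10)]
chunks = sliding_chunks(tokens, size=4, overlap=2)
# Each chunk's first two tokens repeat the previous chunk's last two.
```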

Access Control Cannot Be an Afterthought

In an enterprise, not everyone should see every document. The sales team's compensation data should not appear in an engineer's search results. Access controls must be embedded at the chunk level:

  • Every chunk carries permission metadata (who can access it)
  • Every query is filtered by the requesting user's permissions
  • There is no "retrieve first, filter later" — filtering happens during retrieval

This is harder than it sounds when documents come from different sources with different permission models (Google Drive sharing, Jira project roles, Confluence space permissions). A robust ACL system must normalize all of these into a unified permission model.
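The "filter during retrieval" rule looks like this in miniature, assuming source permissions have already been normalized into group sets on each chunk (the group names and one-dimensional vectors are invented for the sketch):

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def search(query_vec, index, user_groups: set[str], k: int = 5):
    """Permission filtering happens inside retrieval: chunks the user
    cannot read are never scored, so they can never leak into results."""
    visible = [r for r in index
               if r["meta"]["allowed_groups"] & user_groups]
    visible.sort(key=lambda r: dot(query_vec, r["vector"]), reverse=True)
    return visible[:k]

index = [
    {"vector": [1.0], "text": "Sales compensation plan",
     "meta": {"allowed_groups": {"sales-leadership"}}},
    {"vector": [0.9], "text": "Engineering roadmap",
     "meta": {"allowed_groups": {"engineering", "product"}}},
]
results = search([1.0], index, user_groups={"engineering"})
# Only the roadmap is returned, even though the comp plan scores higher.
```

In a real vector database this becomes a metadata filter pushed into the index query, not a Python list comprehension, but the ordering guarantee is the same: filter first, rank second.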

Freshness and Sync

Enterprise data changes constantly. New emails arrive every minute. Documents get updated. Tickets change status. A RAG system that indexes once and never updates becomes stale quickly.

A sync orchestrator must continuously pull changes from connected sources, detect what has changed (diff), re-chunk and re-embed updated content, and remove deleted content. This is a background process that must run reliably without consuming excessive resources.
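The diff step at the heart of such an orchestrator can be sketched by comparing content fingerprints between sync runs (the document ids and hashing scheme here are illustrative):

```python
import hashlib

def fingerprint(text: str) -> str:
    """Content hash used to detect whether a document changed."""
    return hashlib.sha256(text.encode()).hexdigest()

def diff(previous: dict[str, str], current: dict[str, str]):
    """Compare fingerprints from the last sync with the current pull:
    returns ids to (re)chunk and re-embed, and ids to delete."""
    changed = [doc_id for doc_id, h in current.items()
               if previous.get(doc_id) != h]
    deleted = [doc_id for doc_id in previous if doc_id not in current]
    return changed, deleted

previous = {"doc1": fingerprint("v1"), "doc2": fingerprint("old body")}
current = {"doc1": fingerprint("v1"),            # unchanged: skip
           "doc2": fingerprint("edited body"),   # updated: re-index
           "doc3": fingerprint("brand new")}     # new: index
changed, deleted = diff(previous, current)
# changed == ["doc2", "doc3"]; deleted == []
```

Incremental diffing like this is what keeps re-embedding costs proportional to what actually changed rather than to the size of the corpus.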

How Wardian Implements RAG

Wardian's knowledge engine uses a three-layer approach:

  1. Hybrid RAG with dense retrieval (pgvector), BM25 sparse retrieval, and cross-encoder reranking. Documents are semantically chunked with contextual enrichment.
  2. Knowledge graph built with entity extraction and relationship mapping, enabling multi-hop queries across the entire organizational knowledge base.
  3. Memory system that stores per-user facts with TTL, injected as hot context into every agent interaction.

All three layers are exposed as MCP tools (search_documents, search_knowledge, remember, recall), so the agent seamlessly queries whichever layer is most appropriate for the question. Data stays on the customer's infrastructure, with per-chunk ACL enforcement and continuous sync from connected sources.

The result is an AI that does not just search your documents — it understands your organization.