RAG Isn’t a Silver Bullet—But This Setup Works Often
Cross-Industry • ~7–8 min read • Updated Mar 10, 2025
Context
Retrieval-Augmented Generation (RAG) is great at grounding answers in your content—but it’s not the answer to every problem. Many teams ship RAG too early, then fight latency, cost, and inconsistent answers. This essay offers a pragmatic setup that works in most cases, plus decision rules for when to skip RAG entirely.
Core Framework
- Start with Answerability: Before vectors, measure whether queries can be answered from your corpus. Build a 50–100 item “golden set” and tag each item answerable, unanswerable, or needs synthesis (a scoring sketch follows this list).
- Chunk with Structure: Prefer semantic sections (policy clause, SOP step, ticket resolution) over fixed token windows. Attach source_id, doc_type, effective_date, and access_scope to every chunk (see the chunk schema sketch after this list).
- Hybrid Retrieval First: Combine BM25 (keyword) and embedding retrieval, then rerank the merged candidates. Keyword recall saves you when acronyms or exact phrases matter; embeddings catch paraphrase and context. A fusion sketch follows this list.
- Grounding in the Prompt: Force the model to cite source_ids and to “refuse gracefully” when confidence is low. Keep few-shot examples of correct refusal.
- Latency Budgeting: Set a 1.5–2.0s P95 budget. Profile hops (network → vector store → rerank → LLM). Cache aggressively at the retrieval and final answer layers.
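A minimal sketch of the golden-set idea, assuming a simple label scheme in which “answerable” and “needs synthesis” both count toward answerability; the item fields and example queries are illustrative, not a prescribed schema.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class GoldenItem:
    query: str
    label: str                      # "answerable" | "unanswerable" | "needs_synthesis"
    expected_source_ids: list[str]  # empty when the query is unanswerable

def answerability(golden: list[GoldenItem]) -> float:
    """Share of golden queries the corpus should be able to serve."""
    counts = Counter(item.label for item in golden)
    total = sum(counts.values())
    return (counts["answerable"] + counts["needs_synthesis"]) / total if total else 0.0

golden_set = [
    GoldenItem("What is the refund window for annual plans?", "answerable", ["policy_204"]),
    GoldenItem("Compare Q3 and Q4 ticket backlog", "needs_synthesis", ["rep_q3", "rep_q4"]),
    GoldenItem("What is the 2027 product roadmap?", "unanswerable", []),
]
print(f"answerability: {answerability(golden_set):.0%}")  # -> answerability: 67%
```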
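For chunk metadata, one possible shape with the four fields named above plus two assumed extras (chunk_id, section_title):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Chunk:
    chunk_id: str
    text: str               # one semantic section: policy clause, SOP step, ticket resolution
    source_id: str          # stable pointer to the source document; also used for citations
    doc_type: str           # e.g. "policy", "sop", "ticket"
    effective_date: date    # lets retrieval prefer the currently effective version
    access_scope: str       # row-level ACL tag, enforced at retrieval time, never in the prompt
    section_title: str = "" # kept so the title can be prepended to small chunks for context
```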
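The hybrid bullet does not prescribe how to merge the keyword and embedding result lists; reciprocal rank fusion (RRF) is one common, tuning-free choice. A toy sketch with made-up doc IDs (the fused list would then go to the cross-encoder reranker):

```python
from collections import defaultdict

def rrf_merge(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Toy ranked lists standing in for BM25 and vector-search output.
bm25_hits = ["doc_7", "doc_2", "doc_9", "doc_4"]    # exact terms, acronyms
vector_hits = ["doc_2", "doc_5", "doc_7", "doc_1"]  # paraphrase, context
fused = rrf_merge([bm25_hits, vector_hits])
print(fused[:3])  # doc_2 and doc_7 lead because both retrievers agree on them
```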
Recommended Actions
- Define a Retrieval Schema: Standard fields for chunk metadata; normalize across sources.
- Canary Queries: Track a small set of stable queries daily; alert when results drift. A drift-check sketch follows this list.
- Rerank, Don’t Over-Retrieve: Fetch 20–40 candidates, rerank to top 5–8. More context can hurt.
- Refusal Patterns: Add a “not in corpus” pathway with helpful fallback (search link or escalation).
- Access Controls: Enforce row-level access at retrieval time; never “filter in the prompt.”
- A/B Safety Gate: Route low-risk queries to a fast model; escalate to a stronger model only when answerability or sensitivity demands it.
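One way to implement the canary check, assuming each canary stores a baseline list of top-k doc IDs and that a drop in set overlap is the drift signal; the threshold and the stubbed retriever are placeholders:

```python
def topk_overlap(baseline: list[str], current: list[str]) -> float:
    """Jaccard overlap between two top-k result sets."""
    b, c = set(baseline), set(current)
    return len(b & c) / len(b | c) if b | c else 1.0

# Canary query -> baseline top-k doc IDs captured when results were known-good.
CANARIES = {
    "refund window for annual plans": ["policy_204", "policy_101", "faq_33"],
}

def drifted_canaries(run_query, threshold: float = 0.6) -> list[str]:
    """run_query(q) returns today's top-k doc IDs; returns the queries that drifted."""
    return [q for q, baseline in CANARIES.items()
            if topk_overlap(baseline, run_query(q)) < threshold]

# Stubbed retriever whose results have shifted -> the canary fires.
print(drifted_canaries(lambda q: ["policy_204", "blog_9", "blog_12"]))
```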
When to Skip RAG
- Stable Knowledge, Narrow Domain: Fine-tune or instruction-tune a small model; you’ll beat RAG on latency and cost.
- Highly Structured Data: Use tools (SQL, APIs) with function calling and generate prose from the results; see the sketch after this list.
- No Trust in Sources: If content is outdated or contradictory, fix the content pipeline before adding retrieval.
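For the structured-data case, the pattern is a parameterized tool call rather than retrieval: the database answers, and the model only phrases the result. A self-contained sketch with an invented orders table (SQLite purely for illustration):

```python
import sqlite3

def order_status(conn: sqlite3.Connection, order_id: str) -> str:
    row = conn.execute(
        "SELECT status, eta FROM orders WHERE order_id = ?", (order_id,)
    ).fetchone()
    if row is None:
        return f"Order {order_id} was not found."  # explicit fallback, no guessing
    status, eta = row
    return f"Order {order_id} is {status}; estimated delivery {eta}."

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, status TEXT, eta TEXT)")
conn.execute("INSERT INTO orders VALUES ('A-1001', 'shipped', '2025-03-14')")
print(order_status(conn, "A-1001"))
```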
“Works Often” Setup (Reference)
- Index: Hybrid BM25 + vector store (768–1024d), cosine similarity, HNSW; nightly rebuild + streaming upserts.
- Chunking: Section-aware splitting with titles; overlap only on boundary uncertainty (≤15%).
- Rerank: Lightweight cross-encoder over the top 40 candidates, keeping the best 8.
- Prompt: System message requires citations; few-shot “refuse when weak” examples (a prompt sketch follows this list).
- Caching: Query-to-doc cache (5–30 min TTL) + answer cache keyed by (query_hash, top_k_ids, policy_version); a cache-key sketch also follows this list.
- Telemetry: Answerability %, citation click-through, refusal rate, P95 latency, token spend, drift alerts.
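One possible wording for the citation-plus-refusal prompt; the exact phrasing, the bracketed citation format, and the escalation hint are illustrative:

```python
SYSTEM_PROMPT = """\
Answer ONLY from the provided passages.
- Cite the source_id of every passage you rely on, e.g. [policy_204].
- If the passages do not contain the answer, reply:
  "I can't find this in the indexed documents." and suggest escalation.

Example of a correct refusal:
Q: What is the 2027 product roadmap?
Passages: [faq_33] Refund policy... [policy_204] Annual plan terms...
A: I can't find this in the indexed documents. Please escalate to the product team.
"""

def build_prompt(question: str, passages: list[tuple[str, str]]) -> str:
    """passages is a list of (source_id, text) pairs from the reranker."""
    context = "\n".join(f"[{source_id}] {text}" for source_id, text in passages)
    return f"{SYSTEM_PROMPT}\nPassages:\n{context}\n\nQ: {question}\nA:"
```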
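And a sketch of the answer-cache key: any change to the query, the retrieved document set, or the grounding policy produces a new key, so stale answers age out naturally. The TTL and in-memory dict are stand-ins for whatever cache backend is in use:

```python
import hashlib
import time

def answer_cache_key(query: str, top_k_ids: list[str], policy_version: str) -> str:
    query_hash = hashlib.sha256(query.strip().lower().encode()).hexdigest()[:16]
    return f"{query_hash}|{','.join(sorted(top_k_ids))}|{policy_version}"

_cache: dict[str, tuple[float, str]] = {}  # key -> (stored_at, answer)

def get_or_compute(key: str, compute, ttl_s: int = 1800) -> str:
    now = time.time()
    hit = _cache.get(key)
    if hit and now - hit[0] < ttl_s:
        return hit[1]       # fresh cache hit
    answer = compute()      # call the full retrieve-rerank-generate path
    _cache[key] = (now, answer)
    return answer
```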
Common Pitfalls
- Over-chunking: Tiny chunks lose context and increase hallucination risk.
- Embedding Drift Ignored: New model versions shift similarity geometry; monitor with canaries.
- Prompt-only Guardrails: Access controls must be enforced pre-LLM.
- Everything is RAG: For calculation or status lookups, use tools—don’t stuff tables into context.
Quick Win Checklist
- Ship a 100-item golden set and track answerability weekly.
- Implement BM25+embeddings with rerank; cap final context to 5–8 passages.
- Add explicit refusal and citation requirements to the prompt.
- Set a 2s P95 budget; cache retrieval and answers.
Closing
RAG should serve your product, not the other way around. Start with answerability, keep retrieval hybrid and lean, and enforce citations and refusals. When data is stable or highly structured, skip RAG and call the right tool—or fine-tune. That discipline yields reliable answers at acceptable latency and cost.