RAG Chatbots Explained: When They Work and When They Don't
An honest primer on Retrieval-Augmented Generation: the four patterns where RAG genuinely earns its keep, the four where it fails, and a realistic look at engineering effort and cost.
RAG — Retrieval-Augmented Generation — is the default answer every vendor gives when you say the words "AI on our own data". The pitch sounds magical: take your documents, plug in a chatbot, and watch employees ask questions in natural language. The reality is more nuanced. RAG is a useful pattern for a specific class of problems, and a wasteful detour for everything else. This article is for technical decision-makers who want to know which is which before signing off on a budget.
We write this as a software vendor that builds RAG systems in production for mid-market clients. We've shipped RAG chatbots that genuinely changed how teams work. We've also walked away from RAG projects where a 200-line script or a better search box would have done the job at a tenth of the cost.
What RAG actually is (and isn't)
RAG is a two-step pattern: retrieve relevant content from a corpus, then ask an LLM to generate an answer grounded in that content. The retrieval part is usually a vector search — your documents are split into chunks, each chunk gets an embedding, and at query time the system pulls the closest matches. The generation part is a Claude or GPT API call with the chunks injected as context.
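To make the two steps concrete, here is a minimal sketch of retrieve-then-generate. It makes several assumptions: the "embedding" is a toy word-count vector standing in for a real embedding model, the function names (`embed`, `retrieve`, `build_prompt`) are ours for illustration, and the final LLM call is left out, so the sketch stops at the prompt that would be sent.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real system calls an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Step 1: return the k chunks closest to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Step 2: inject the retrieved chunks as numbered context for the LLM call."""
    joined = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context))
    return f"Answer using only the context below.\n\nContext:\n{joined}\n\nQuestion: {query}"


chunks = [
    "Employees accrue 25 vacation days per year.",
    "Expense reports must be filed within 30 days.",
    "The office is closed on public holidays.",
]
question = "How many vacation days do employees get?"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)  # this string is what the Claude or GPT call would receive
```

Note that nothing is stored in the model: if `retrieve` surfaces the wrong chunk, the prompt contains the wrong context and the answer follows it.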
What RAG is not: a new kind of AI, a substitute for a search engine, or a magic trick that lets an LLM "learn" your data. The model doesn't memorise your documents. Every question is answered fresh, with whatever the retriever happens to surface in that moment. If retrieval is bad, the answer is bad, and no amount of clever prompting fixes a corpus that wasn't ready.
The four patterns where RAG works
In our work we see RAG genuinely earn its keep in four scenarios. They share three traits: the corpus is large enough that humans can't scan it, the questions are open-ended, and answers must cite sources.
1. Internal knowledge base
Confluence, SharePoint, internal wikis, policy PDFs — places where the answer exists but nobody can find it. A RAG chatbot turns hours of searching into a single question. The win is rarely "new knowledge"; it's faster access to knowledge that's already there.
2. Customer support over product docs
Tier-1 customer questions where the answer lives in your help centre. RAG handles the long tail of "how do I do X" questions and escalates the rest to humans. It works particularly well when docs are kept fresh, and fails when they aren't.
3. Sales enablement
Reps need fast answers about pricing, competitors, integrations and edge cases, often during a call. A RAG bot over case studies, battle cards and product specs is genuinely useful here. The corpus is bounded, the questions are predictable, and the cost of a wrong answer is limited.
4. Compliance & policy Q&A
Regulated industries with thick policy documents. Employees need to know the rule that applies to a specific situation. RAG with strict citation requirements works: the model retrieves the relevant clause and quotes it. Critical here: the system must say "I don't know" when retrieval misses, and never invent an answer.
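One way to sketch the refusal and citation behaviour is a two-layer guardrail: refuse before calling the model when retrieval scores are too low, and instruct the model to quote and cite otherwise. Everything named here is an assumption for illustration: the `Retrieved` record, the `MIN_SCORE` cut-off (which would need tuning on your own evaluation set), and the `generate` callable standing in for the actual LLM call.

```python
from dataclasses import dataclass
from typing import Callable

REFUSAL = "I don't know: the indexed documents do not cover this."
MIN_SCORE = 0.35  # assumed cut-off, not a recommended value; tune on real questions


@dataclass
class Retrieved:
    text: str
    source: str   # e.g. document title plus clause or page number
    score: float  # retriever similarity, higher is better


def answer_or_refuse(question: str, hits: list[Retrieved],
                     generate: Callable[[str], str]) -> str:
    # Layer 1: if nothing relevant was retrieved, refuse without calling the model.
    relevant = [h for h in hits if h.score >= MIN_SCORE]
    if not relevant:
        return REFUSAL
    context = "\n".join(f"[{h.source}] {h.text}" for h in relevant)
    # Layer 2: tell the model to quote, cite, and refuse rather than guess.
    prompt = (
        "Answer only from the clauses below and quote the clause you rely on, "
        "with its source tag in square brackets. "
        f"If the clauses do not answer the question, reply exactly: {REFUSAL}\n\n"
        f"Clauses:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```

The first layer is cheap and deterministic; the second still depends on the model following instructions, which is why it needs its own test set of questions the corpus cannot answer.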
The four patterns where RAG fails
Equally important: when RAG is the wrong answer. Four cases we see repeatedly.
1. The corpus is small and clean enough that you don't need RAG
If your knowledge fits in 50–100 pages of well-structured text, you don't need a vector store. Just put the whole thing in the LLM context window: many current models accept 200K tokens or more, which comfortably covers a corpus of that size. Below a certain corpus size, RAG adds infrastructure complexity without adding value.
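A quick way to check which side of that line a corpus falls on is a rough token estimate. The sketch below assumes about 4 characters per token, a common rule of thumb for English text that a real tokenizer would replace; the 200K window and the 8K reserve for the question and the answer are also assumptions to adjust per model.

```python
CHARS_PER_TOKEN = 4  # rough rule of thumb for English; use a real tokenizer for accuracy


def estimate_tokens(text: str) -> int:
    """Crude token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN + 1


def fits_in_context(documents: list[str], context_window: int = 200_000,
                    reserve: int = 8_000) -> bool:
    """True if all documents, plus a reserve for question and answer, fit in the window."""
    total = sum(estimate_tokens(d) for d in documents)
    return total + reserve <= context_window


# 100 pages of roughly 3,000 characters each (an assumed page size)
small_corpus = ["x" * 3_000] * 100
large_corpus = ["x" * 3_000] * 1_000
```

If `fits_in_context` returns True, sending the whole corpus with each question is worth trying before building a retrieval pipeline.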
2. The question needs an action, not an answer
"Cancel my subscription", "raise a ticket with priority high", "update the contract end date in the CRM" — these aren't retrieval problems. They're agent problems. A RAG bot will helpfully explain how to cancel a subscription instead of cancelling it. If the desired output is an action in a system, you need an agent with tools, not a chatbot.
3. Latency matters more than depth
RAG adds at least one network round-trip and one LLM call to every query — typically 1.5–4 seconds end to end. For autocomplete, real-time UI hints, or anything inside a typing flow, that's too slow. Use a smaller search index or classical retrieval and skip the generation step.
4. The sources can't be trusted
If your knowledge base is a graveyard of stale, contradictory, or poorly written documents, RAG will faithfully turn that garbage into authoritative-sounding answers. Garbage in, confident garbage out. Fix the corpus first; only then add a chatbot on top.
The engineering reality
The demo of RAG takes an afternoon. The production version takes weeks. The non-obvious work:
- Chunking: how you split documents matters more than which embedding model you pick. Bad chunks (mid-sentence cuts, lost headings, no overlap) destroy retrieval quality. Good chunking respects document structure.
- Embedding choice: the default OpenAI or Cohere embedding is fine for English. For Dutch, multilingual models (e.g. multilingual-e5) typically perform meaningfully better — test before committing.
- Retrieval evaluation: build a set of 50–200 real questions with expected source documents. Measure recall@k. Without this, you're tuning blind.
- Reranking: a cross-encoder reranker on top of vector search typically improves answer quality. It adds roughly 200–500 ms per query, which is usually worth it.
- The "I don't know" guardrail: instruct the model to refuse when retrieved chunks don't contain the answer. Test this aggressively — it's the difference between trustworthy and dangerous.
- Source citation: surface which chunks were used, ideally with page numbers and a link back to the original document. Builds trust and lets users verify.
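Two of the items above, chunking and retrieval evaluation, can be sketched in a few lines. This is an illustration under stated assumptions, not a production pipeline: `chunk_by_heading` assumes documents mark sections with lines starting with `#` and uses a naive sentence splitter, and `recall_at_k` uses the standard definition (share of expected sources found in the top k results, averaged over questions). All function names are hypothetical.

```python
import re


def window_sentences(heading: str, text: str, max_sentences: int, overlap: int) -> list[str]:
    """Group sentences into overlapping windows, each prefixed with its section heading."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    step = max_sentences - overlap
    prefix = f"{heading}: " if heading else ""
    out = []
    for start in range(0, len(sentences), step):
        out.append(prefix + " ".join(sentences[start:start + max_sentences]))
        if start + max_sentences >= len(sentences):
            break  # the last window already reached the end of the section
    return out


def chunk_by_heading(doc: str, max_sentences: int = 3, overlap: int = 1) -> list[str]:
    """Split on '#' heading lines, never cut mid-sentence, keep the heading in each chunk."""
    if overlap >= max_sentences:
        raise ValueError("overlap must be smaller than max_sentences")
    chunks, heading, body = [], "", []
    for line in doc.splitlines():
        if line.startswith("#"):
            chunks += window_sentences(heading, " ".join(body), max_sentences, overlap)
            heading, body = line.lstrip("# ").strip(), []
        elif line.strip():
            body.append(line.strip())
    chunks += window_sentences(heading, " ".join(body), max_sentences, overlap)
    return chunks


def recall_at_k(retrieved: dict[str, list[str]], expected: dict[str, set[str]], k: int) -> float:
    """Mean share of each question's expected sources that appear in its top-k results."""
    scores = []
    for question, relevant in expected.items():
        top_k = set(retrieved.get(question, [])[:k])
        scores.append(len(relevant & top_k) / len(relevant))
    return sum(scores) / len(scores)


expected = {"q1": {"a"}, "q2": {"b", "c"}}           # question -> expected source ids
retrieved = {"q1": ["x", "a"], "q2": ["b", "y", "c"]}  # question -> ranked source ids
score = recall_at_k(retrieved, expected, k=2)  # 0.75
```

Tracking this one number across chunking and embedding changes is what turns tuning from guesswork into measurement.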
RAG vs fine-tuning vs general LLM vs agent
These four options get conflated in vendor pitches. They solve different problems.
| | General LLM | RAG | Fine-tuning | Agent |
|---|---|---|---|---|
| Use case | Generic Q&A, writing, code | Q&A grounded in your documents | Style/format adaptation | Multi-step tasks with actions |
| Updates with new info? | No (until next model release) | Yes — just add documents | No — requires retraining | Yes — uses RAG + tools |
| Source citations? | No | Yes | No | Yes (when it uses RAG) |
| Takes actions? | No | No | No | Yes |
| Build cost | €0 — just API calls | €20K–€80K | €50K–€300K+ | €40K–€250K |
| When to choose | Default. Try this first. | You have a corpus and need grounded answers | Rare — voice/format is the actual problem | You need actions, not answers |
Our default advice: start with a general LLM call. If that's not enough, add RAG. Only add fine-tuning if a specific style or format requirement makes prompting impractical — which is rare. If the user wants something done rather than answered, you're building an agent, not a chatbot.
Cost and timeline reality check
Realistic ranges for a production RAG chatbot at a mid-market company:
- Light internal RAG bot, single source, basic UI: €20K–€35K, 4–5 weeks.
- Multi-source RAG with reranking, evaluation set, monitoring: €40K–€60K, 6–8 weeks.
- Customer-facing RAG with strict guardrails, SSO, audit logging: €60K–€80K, 8–10 weeks.
- Plus 15–25% per year for maintenance — corpus changes, models change, prompts drift.
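To see what these ranges mean over the life of a system, the arithmetic can be sketched as below. It assumes maintenance is charged as a flat percentage of the original build cost each year, with no compounding or inflation; the function name is ours.

```python
def total_cost_range(build_low: float, build_high: float, years: int,
                     maint_low: float = 0.15, maint_high: float = 0.25) -> tuple[int, int]:
    """Build cost plus flat yearly maintenance, as a (low, high) range in euros."""
    low = build_low * (1 + maint_low * years)
    high = build_high * (1 + maint_high * years)
    return round(low), round(high)


# Multi-source tier (EUR 40K-60K build) over three years of operation
three_year = total_cost_range(40_000, 60_000, years=3)  # (58000, 105000)
```

Under these assumptions, three years of maintenance adds between 45% and 75% to the build price, which is worth including in the budget conversation from the start.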
Anything quoted dramatically below that range is either skipping evaluation work, reusing a closed-source platform you'll be locked into, or under-scoping. Anything dramatically above is usually scope creep — you're paying for a data platform you didn't ask for.
How we build RAG systems
We build RAG chatbots and AI agents for mid-market companies. We start with a 1–2 week discovery: assess the corpus, build an evaluation set, prototype the retrieval pipeline, and show real numbers before quoting a build. More on our approach is on our service page for RAG chatbots.
Have a corpus and a question in mind? Describe what you'd want users to ask it via our contact form — we'll respond within one working day with an honest read on whether RAG is the right shape, and roughly what it would cost.