RAG: What It Is and How to Actually Use It
A Practical Guide for Agentic Programmers
Your LLM knows a lot. But it doesn't know your data. It doesn't know what's in your company's internal wiki, what your latest policy documents say, or what a customer reported in a support ticket yesterday. It can't, because that information wasn't in its training data — and even if it was, the training data has a cutoff date.
Retrieval-Augmented Generation (RAG) solves this by giving the model an open book before it writes its answer. Instead of relying solely on what the model "remembers" from training, RAG retrieves relevant documents from your own data sources and injects them into the prompt. The model then generates a response grounded in that evidence.
The idea is simple. The execution is where teams stumble.
How RAG Works: The Basic Pipeline
A standard RAG system has two phases: an offline phase (prepare the knowledge base) and an online phase (answer queries).
Offline: Build the Knowledge Base
- Ingest your source documents — PDFs, internal docs, database records, web pages, whatever you've got.
- Chunk them into smaller pieces. A 50-page document is too large to fit in a prompt alongside the user's question. You need to split it into passages — typically 200 to 1,000 tokens each.
- Embed each chunk. An embedding model (like OpenAI's `text-embedding-3-small` or an open-source alternative like `bge-large`) converts each chunk into a dense vector — a numerical representation of its meaning.
- Store the vectors in a vector database — Pinecone, Weaviate, Qdrant, pgvector, Chroma, or even a simple FAISS index for prototyping.
Online: Answer a Query
- The user asks a question.
- Embed the question using the same embedding model.
- Retrieve the top-k most similar chunks from the vector database using similarity search (typically cosine similarity or dot product).
- Augment the prompt: concatenate the retrieved chunks with the user's question and any system instructions.
- Generate: send the augmented prompt to the LLM, which produces an answer grounded in the retrieved context.
That's the canonical pipeline. In a diagram:
User query → Embed → Vector search → Top-k chunks → [System prompt + Chunks + Query] → LLM → Answer
If you've built a prototype RAG system, you've probably implemented something close to this in an afternoon. The problem is that this basic pipeline has several failure modes that only surface when you hit real data at real scale.
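Stripped of frameworks, the online phase is just embed, score, and stuff the prompt. Here is a toy sketch of that loop; the `embed` function is a crude word-bucket stand-in for a real embedding model, used only so the example is self-contained:

```python
from math import sqrt

def embed(text):
    # Crude stand-in for a real embedding model: bucket words into 8 dims.
    vec = [0.0] * 8
    for word in text.lower().split():
        vec[sum(map(ord, word)) % 8] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

chunks = [
    "Refunds for enterprise customers are processed within 30 days.",
    "Our office is closed on public holidays.",
    "Enterprise customers can request a refund via the billing portal.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]  # offline phase

def retrieve(query, k=2):
    # Online phase: embed the query, rank chunks by cosine similarity.
    qv = embed(query)
    ranked = sorted(index, key=lambda item: cosine(qv, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

top = retrieve("enterprise refund policy")
prompt = "Answer using only this context:\n" + "\n".join(top) + "\n\nQuestion: enterprise refund policy"
```

Swap in a real embedding model and a vector database and the structure stays the same.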
Where the Basic Pipeline Breaks
1. Chunking Is Harder Than It Looks
Naive chunking — splitting every N tokens with some overlap — fragments context. A paragraph explaining a concept gets cut in half, and neither half makes sense on its own. A table's header ends up in one chunk and its rows in another.
Better approaches exist. Split at semantic boundaries using document structure (headings, paragraphs, sections). Use Tree-sitter for code (as Cursor does). For mixed-content documents, consider specialized parsers that preserve tables, lists, and hierarchies. The chunk size itself is a tradeoff: smaller chunks are more precise in retrieval but lose context; larger chunks preserve context but may dilute relevance. Most production systems settle on 500–1,000 tokens with 10–20% overlap, but the right answer depends on your data.
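A minimal sketch of structure-aware chunking: split at paragraph boundaries and pack whole paragraphs into chunks up to a budget, rather than cutting mid-paragraph (character count stands in for a real token count here):

```python
def chunk_by_paragraphs(text, max_chars=400):
    """Split at paragraph boundaries, packing whole paragraphs into
    chunks of up to max_chars instead of cutting mid-paragraph."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)   # budget exceeded: close this chunk
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

The same packing logic generalizes to headings and sections; only the boundary detector changes.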
2. Semantic Search Alone Isn't Enough
Vector similarity finds passages that are semantically related to the query. But sometimes you need an exact keyword match — a product ID, a policy number, a person's name. Embeddings aren't great at exact matches.
The fix: hybrid search. Combine dense vector search with lexical search (BM25 or similar). Many vector databases now support this natively. In practice, teams that combine keyword and vector search consistently see significant relevance gains over either method alone, especially for short or ambiguous queries.
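One common way to merge the two result lists is Reciprocal Rank Fusion (RRF), which combines rankings without needing to normalize their raw scores. A sketch:

```python
def rrf_fuse(vector_ranking, keyword_ranking, k=60):
    # Reciprocal Rank Fusion: each list contributes 1/(k + rank) per doc,
    # so agreement between the two retrievers pushes a doc to the top.
    scores = {}
    for ranking in (vector_ranking, keyword_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d7"]   # from dense similarity search
keyword_hits = ["d1", "d9", "d3"]  # from BM25-style lexical search
fused = rrf_fuse(vector_hits, keyword_hits)
```

Documents that appear in both lists (d1, d3) rise above documents that appear in only one.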
3. Retrieved Chunks May Not Be Relevant
Retrieving the top-5 most similar chunks doesn't mean you got the top-5 most useful chunks. Similarity and relevance are not the same thing. A passage that uses the same vocabulary as the query might be about a completely different topic.
The fix: reranking. After retrieval, pass the candidate chunks through a cross-encoder reranker (like Cohere Rerank or a fine-tuned BERT model) that scores each chunk against the query more carefully. This second pass is slower but dramatically more accurate than raw vector similarity. Think of it as: the vector search casts a wide net, the reranker selects the fish.
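The two-stage pattern looks like this; `cross_encoder_score` is a placeholder (crude word overlap) for a real reranker such as Cohere Rerank or a fine-tuned cross-encoder:

```python
def cross_encoder_score(query, chunk):
    # Placeholder scoring: a real cross-encoder reads query and chunk
    # together and outputs relevance. Word overlap stands in here.
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

def rerank(query, candidates, top_n=3):
    # Second stage: rescore the wide-net candidates, keep the best few.
    scored = sorted(candidates, key=lambda c: cross_encoder_score(query, c), reverse=True)
    return scored[:top_n]

candidates = [  # imagine these are the top-20 from vector search
    "refund policy for enterprise plans",
    "office holiday schedule",
    "enterprise refund timelines",
]
best = rerank("enterprise refund policy", candidates, top_n=2)
```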
4. The Query Itself Might Be Bad
Users don't always ask well-formed questions. "What's the thing about the policy change?" doesn't embed well. The retriever can only be as good as the query it receives.
The fix: query processing. Before retrieval, rewrite the query to be more specific, expand it with related terms, or generate hypothetical answers (HyDE — Hypothetical Document Embeddings) and use those to retrieve. Treat every user query as raw material that needs normalization before it touches the knowledge base.
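A sketch of the rewrite-then-HyDE flow, where `llm` is any callable from prompt to text (the canned stand-in below is for illustration only):

```python
def process_query(raw_query, llm):
    # `llm` is any callable prompt -> text (a real client in production).
    rewritten = llm(f"Rewrite as a specific, self-contained search query: {raw_query}")
    # HyDE: retrieve with a *hypothetical answer*, which often embeds
    # closer to real answer passages than the question itself does.
    hypothetical = llm(f"Write a short passage answering: {rewritten}")
    return rewritten, hypothetical

def fake_llm(prompt):  # canned stand-in, for illustration only
    if prompt.startswith("Rewrite"):
        return "What changed in the Q3 expense policy?"
    return "The Q3 expense policy raised the per-diem limit to $75."

rewritten, hyde_text = process_query("what's the thing about the policy change?", fake_llm)
```

You would then embed `hyde_text` (not the raw query) for the retrieval step.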
Advanced Patterns Worth Knowing
Self-RAG
Standard RAG retrieves on every query, whether retrieval is needed or not. Self-RAG trains the model to decide when to retrieve and to critique its own outputs for groundedness. The model generates "reflection tokens" — markers that indicate whether retrieval is needed, whether the retrieved data supports the claim, and whether the response is actually useful. This reduces unnecessary retrieval (saves latency and cost) and catches hallucinations that standard RAG misses.
GraphRAG
When your data has structured relationships — organizational hierarchies, regulatory dependencies, product catalogs — flat vector search loses relational context. GraphRAG combines vector search with knowledge graphs to capture and query relationships between entities. Microsoft's GraphRAG implementation, for example, uses community detection algorithms to create hierarchical summaries of the graph, enabling both local (specific entity) and global (theme-level) queries.
Agentic RAG
In standard RAG, retrieval happens once. In agentic RAG, the LLM itself decides when and how to retrieve — potentially making multiple retrieval calls, refining its query between calls, or combining results from different sources. This is where RAG meets agent architecture: the retriever becomes a tool the agent can call, not a fixed pipeline step. WHOOP's inline tools and Cursor's context engine are both variations of this pattern — retrieval integrated into the agent's decision loop rather than bolted on before generation.
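The core loop can be sketched as follows, with `decide` standing in for the LLM's tool-use decision and `retrieve` for any retriever the agent can call:

```python
def agentic_answer(question, decide, retrieve, max_steps=3):
    # `decide(question, context)` returns ("retrieve", refined_query)
    # or ("answer", final_text); `retrieve(query)` returns chunks.
    context = []
    for _ in range(max_steps):
        action, payload = decide(question, context)
        if action == "answer":
            return payload
        context.extend(retrieve(payload))
    return "Could not answer within the retrieval budget."

# Stub policy for illustration: retrieve once, then answer from what came back.
decide = lambda q, ctx: ("answer", f"Grounded in {len(ctx)} chunks") if ctx else ("retrieve", q)
retrieve = lambda query: ["chunk-a", "chunk-b"]
result = agentic_answer("What changed in Q3?", decide, retrieve)
```

The `max_steps` cap matters: without a retrieval budget, an agent can loop indefinitely on an unanswerable question.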
Building RAG That Works in Production
Start With Evaluation
The single most common mistake in RAG development: building the pipeline before building the evaluation framework. You need to measure three things separately:
- Retrieval quality: Are you getting the right chunks? (Precision and recall against a labeled set)
- Groundedness: Is the model actually using the retrieved data, or hallucinating alongside it?
- Answer quality: Is the final answer useful and accurate?
If you only measure answer quality, you can't diagnose whether a bad answer came from bad retrieval, bad generation, or both. Separate your metrics. Many teams now use LLM-as-a-judge for evaluation — having a model score responses against criteria — which scales better than human evaluation for iterative development.
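Retrieval quality, at least, is cheap to measure once you have a labeled set. A sketch of precision@k and recall@k over chunk IDs:

```python
def retrieval_metrics(retrieved_ids, relevant_ids, k=5):
    # Precision@k: how much of what we returned is relevant.
    # Recall@k: how much of what is relevant we returned.
    top = retrieved_ids[:k]
    hits = len(set(top) & set(relevant_ids))
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

p, r = retrieval_metrics(["c1", "c4", "c2", "c9", "c7"], ["c1", "c2", "c3"])
```

Run this over every query in your labeled set and track the averages as you change chunking, embeddings, or search strategy.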
Chunk Size Is Your First Hyperparameter
Don't default to a fixed chunk size. Experiment. Run your eval suite across 256, 512, and 1,024 token chunks. Measure retrieval recall and answer quality at each. The optimal size varies dramatically by domain — legal documents chunk differently than code documentation, which chunks differently than customer support transcripts.
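The sweep itself is trivial once the eval harness exists; `build_index` and `run_eval` below are stand-ins for your own indexing pipeline and eval suite:

```python
def sweep_chunk_sizes(build_index, run_eval, sizes=(256, 512, 1024)):
    # Re-index at each candidate size, score with the eval suite,
    # return the winner plus the full results for inspection.
    results = {size: run_eval(build_index(size)) for size in sizes}
    return max(results, key=results.get), results

# Stand-in harness: pretend recall peaks at 512-token chunks.
fake_scores = {256: 0.61, 512: 0.74, 1024: 0.69}
best, results = sweep_chunk_sizes(build_index=lambda s: s, run_eval=fake_scores.get)
```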
Metadata Is Retrieval Context
Don't throw away document metadata during ingestion. Source, date, author, document type, section title — all of this can be used to filter retrieval results before or after vector search. A question about "current pricing" should preferentially retrieve recent documents. Metadata filters are often more effective than embedding improvements.
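A sketch of metadata filtering ahead of similarity ranking (the field names and values here are illustrative):

```python
from datetime import date

chunks = [
    {"text": "Pricing: $40/seat", "doc_type": "pricing", "date": date(2025, 1, 10)},
    {"text": "Pricing: $30/seat", "doc_type": "pricing", "date": date(2023, 6, 1)},
    {"text": "Onboarding guide", "doc_type": "guide", "date": date(2025, 2, 1)},
]

def metadata_filter(chunks, doc_type=None, after=None):
    # Narrow the candidate pool before (or after) vector ranking.
    hits = chunks
    if doc_type is not None:
        hits = [c for c in hits if c["doc_type"] == doc_type]
    if after is not None:
        hits = [c for c in hits if c["date"] >= after]
    return hits  # a real system would now rank these by similarity

current_pricing = metadata_filter(chunks, doc_type="pricing", after=date(2024, 1, 1))
```

Most vector databases expose this as a filter parameter on the search call itself, which is more efficient than filtering client-side.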
Respect Access Controls
If your data has different visibility levels — and enterprise data almost always does — your retrieval layer must enforce access controls. A user asking a question should only retrieve documents they're authorized to see. This isn't a nice-to-have; it's a security requirement that needs to be built into the retrieval pipeline from day one.
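A minimal sketch of the enforcement point, assuming each chunk carries an `allowed_groups` field attached at ingestion time:

```python
def authorized_retrieve(query, user_groups, search):
    # Post-filter shown for clarity; most vector databases also support
    # metadata pre-filters so unauthorized chunks are never even ranked.
    return [c for c in search(query)
            if set(c["allowed_groups"]) & set(user_groups)]

def search(query):  # stand-in retriever with ACL metadata attached
    return [
        {"text": "HR salary bands", "allowed_groups": ["hr"]},
        {"text": "Public handbook", "allowed_groups": ["hr", "everyone"]},
    ]

visible = authorized_retrieve("salary bands", ["everyone"], search)
```

Prefer the pre-filter form in production: post-filtering can silently return fewer than k results and leaks timing information about restricted content.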
Watch for Staleness
RAG is only as current as your index. If your source documents change and your embeddings don't update, you're serving stale context. Build a sync pipeline that re-indexes changed documents on a schedule that matches your domain's rate of change — hourly for support tickets, daily for documentation, immediately for regulatory updates.
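A sketch of hash-based incremental re-indexing, where `reindex` stands in for your own re-chunk-and-re-embed step:

```python
import hashlib

def sync_index(documents, indexed_hashes, reindex):
    # Re-embed only documents whose content hash changed since last sync.
    # `reindex(doc_id, text)` re-chunks and re-embeds one document.
    changed = []
    for doc_id, text in documents.items():
        digest = hashlib.sha256(text.encode()).hexdigest()
        if indexed_hashes.get(doc_id) != digest:
            reindex(doc_id, text)
            indexed_hashes[doc_id] = digest
            changed.append(doc_id)
    return changed

documents = {"faq": "v2 of the FAQ", "guide": "unchanged guide"}
hashes = {"guide": hashlib.sha256("unchanged guide".encode()).hexdigest()}
changed = sync_index(documents, hashes, reindex=lambda doc_id, text: None)
```

Hashing content rather than comparing timestamps avoids re-embedding documents that were touched but not actually changed.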
When to Use RAG and When Not To
RAG is the right choice when:
- Your data changes frequently and retraining is impractical
- You need answers grounded in specific, citable sources
- You want to keep the base model general-purpose and plug in domain knowledge externally
- Privacy or compliance requires that proprietary data never enters model training
RAG is the wrong choice when:
- The knowledge is already well-represented in the model's training data (don't RAG for "what is photosynthesis")
- You need the model to deeply internalize domain reasoning patterns, not just reference facts (consider fine-tuning)
- Your data is small enough to fit entirely in the context window (just put it in the prompt)
- Latency requirements are so tight that the retrieval round-trip is unacceptable
Most real-world systems end up using RAG alongside fine-tuning and prompt engineering — not instead of them. RAG handles dynamic knowledge. Fine-tuning handles reasoning patterns and style. Prompt engineering handles task framing and output formatting. They're complementary, not competing.
A Minimal RAG Implementation
For teams that want to start today, here's a skeleton in Python using LangChain (though the pattern is framework-agnostic):
# 1. Load and chunk documents
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader
loader = DirectoryLoader("./docs", glob="**/*.md")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
chunks = splitter.split_documents(docs)
# 2. Embed and store
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")
# 3. Retrieve and generate
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
llm = ChatOpenAI(model="gpt-4o")
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
answer = qa_chain.invoke({"query": "What is our refund policy for enterprise customers?"})["result"]
This gets you a working prototype in under 50 lines. From there, the work is in evaluation, chunking strategy, hybrid search, reranking, and access controls — the production concerns described above.
The Takeaway
RAG is not "add retrieval and hallucinations disappear." It's a systems architecture that requires thoughtful chunking, hybrid search, reranking, query processing, evaluation, and access controls to work reliably. The basic pipeline is a starting point, not a destination.
But when built well, RAG gives your agents something they can't get any other way: access to your data, in real time, with citations, without retraining the model. That's the foundation on which most useful agentic systems are built.
Further Reading
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — The original RAG paper by Lewis et al. (2020)
- Enhancing Retrieval-Augmented Generation: A Study of Best Practices — Li et al., COLING 2025, systematic study of RAG configuration factors
- Microsoft RAG Evaluation Framework — Separating retrieval, groundedness, and answer quality metrics
- Priompt — Cursor's open-source priority-based context budgeting library (relevant to managing retrieved chunks in token-limited prompts)
- The Crux of Every AI System: Evaluations — WHOOP's evaluation framework, applicable to RAG quality measurement