RAG: What It Is and How to Actually Use It
A Practical Guide for Agentic Programmers
Your LLM knows a lot. But it doesn't know your data. It doesn't know what's in your company's internal wiki, what your latest policy documents say, or what a customer reported in a support ticket yesterday. It can't, because that information wasn't in its training data — and even if it was, the training data has a cutoff date.
Retrieval-Augmented Generation (RAG) solves this by giving the model an open book before it writes its answer. Instead of relying solely on what the model "remembers" from training, RAG retrieves relevant documents from your own data sources and injects them into the prompt. The model then generates a response grounded in that evidence.
The idea is simple. The execution is where teams stumble.
How RAG Works: The Basic Pipeline
A standard RAG system has two phases: an offline phase (prepare the knowledge base) and an online phase (answer queries).
Offline: Build the Knowledge Base
- Ingest your source documents — PDFs, internal docs, database records, web pages, whatever you've got.
- Chunk them into smaller pieces. A 50-page document is too large to fit in a prompt alongside the user's question. You need to split it into passages — typically 200 to 1,000 tokens each.
- Embed each chunk. An embedding model (like OpenAI's `text-embedding-3-small` or an open-source alternative like `bge-large`) converts each chunk into a dense vector — a numerical representation of its meaning.
- Store the vectors in a vector database — Pinecone, Weaviate, Qdrant, pgvector, Chroma, or even a simple FAISS index for prototyping.
Online: Answer a Query
- The user asks a question.
- Embed the question using the same embedding model.
- Retrieve the top-k most similar chunks from the vector database using similarity search (typically cosine similarity or dot product).
- Augment the prompt: concatenate the retrieved chunks with the user's question and any system instructions.
- Generate: send the augmented prompt to the LLM, which produces an answer grounded in the retrieved context.
That's the canonical pipeline. In a diagram:
User query → Embed → Vector search → Top-k chunks → [System prompt + Chunks + Query] → LLM → Answer
If you've built a prototype RAG system, you've probably implemented something close to this in an afternoon. The problem is that this basic pipeline has several failure modes that only surface when you hit real data at real scale.
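Stripped of frameworks, the online phase is just embed, score, and stuff the prompt. Here is a toy sketch of that loop; the `embed` function is a crude word-bucket stand-in for a real embedding model, used only so the example is self-contained:

```python
from math import sqrt

def embed(text):
    # Crude stand-in for a real embedding model: bucket words into 8 dims.
    vec = [0.0] * 8
    for word in text.lower().split():
        vec[sum(map(ord, word)) % 8] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

chunks = [
    "Refunds for enterprise customers are processed within 30 days.",
    "Our office is closed on public holidays.",
    "Enterprise customers can request a refund via the billing portal.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]  # offline phase

def retrieve(query, k=2):
    # Online phase: embed the query, rank chunks by cosine similarity.
    qv = embed(query)
    ranked = sorted(index, key=lambda item: cosine(qv, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

top = retrieve("enterprise refund policy")
prompt = "Answer using only this context:\n" + "\n".join(top) + "\n\nQuestion: enterprise refund policy"
```

Swap in a real embedding model and a vector database and the structure stays the same.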
Where the Basic Pipeline Breaks
1. Chunking Is Harder Than It Looks
Naive chunking — splitting every N tokens with some overlap — fragments context. A paragraph explaining a concept gets cut in half, and neither half makes sense on its own. A table's header ends up in one chunk and its rows in another.
Better approaches exist. Split at semantic boundaries using document structure (headings, paragraphs, sections). Use Tree-sitter for code (as Cursor does). For mixed-content documents, consider specialized parsers that preserve tables, lists, and hierarchies. The chunk size itself is a tradeoff: smaller chunks are more precise in retrieval but lose context; larger chunks preserve context but may dilute relevance. Most production systems settle on 500–1,000 tokens with 10–20% overlap, but the right answer depends on your data.
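A minimal sketch of structure-aware chunking: split at paragraph boundaries and pack whole paragraphs into chunks up to a budget, rather than cutting mid-paragraph (character count stands in for a real token count here):

```python
def chunk_by_paragraphs(text, max_chars=400):
    """Split at paragraph boundaries, packing whole paragraphs into
    chunks of up to max_chars instead of cutting mid-paragraph."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)   # budget exceeded: close this chunk
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

The same packing logic generalizes to headings and sections; only the boundary detector changes.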
2. Semantic Search Alone Isn't Enough
Vector similarity finds passages that are semantically related to the query. But sometimes you need an exact keyword match — a product ID, a policy number, a person's name. Embeddings aren't great at exact matches.
The fix: hybrid search. Combine dense vector search with lexical search (BM25 or similar). Many vector databases now support this natively. In practice, teams that combine keyword and vector search consistently see significant relevance gains over either method alone, especially for short or ambiguous queries.
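One common way to merge the two result lists is Reciprocal Rank Fusion (RRF), which combines rankings without needing to normalize their raw scores. A sketch:

```python
def rrf_fuse(vector_ranking, keyword_ranking, k=60):
    # Reciprocal Rank Fusion: each list contributes 1/(k + rank) per doc,
    # so agreement between the two retrievers pushes a doc to the top.
    scores = {}
    for ranking in (vector_ranking, keyword_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d7"]   # from dense similarity search
keyword_hits = ["d1", "d9", "d3"]  # from BM25-style lexical search
fused = rrf_fuse(vector_hits, keyword_hits)
```

Documents that appear in both lists (d1, d3) rise above documents that appear in only one.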
3. Retrieved Chunks May Not Be Relevant
Retrieving the top-5 most similar chunks doesn't mean you got the top-5 most useful chunks. Similarity and relevance are not the same thing. A passage that uses the same vocabulary as the query might be about a completely different topic.
The fix: reranking. After retrieval, pass the candidate chunks through a cross-encoder reranker (like Cohere Rerank or a fine-tuned BERT model) that scores each chunk against the query more carefully. This second pass is slower but dramatically more accurate than raw vector similarity. Think of it as: the vector search casts a wide net, the reranker selects the fish.
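The two-stage pattern looks like this; `cross_encoder_score` is a placeholder (crude word overlap) for a real reranker such as Cohere Rerank or a fine-tuned cross-encoder:

```python
def cross_encoder_score(query, chunk):
    # Placeholder scoring: a real cross-encoder reads query and chunk
    # together and outputs relevance. Word overlap stands in here.
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

def rerank(query, candidates, top_n=3):
    # Second stage: rescore the wide-net candidates, keep the best few.
    scored = sorted(candidates, key=lambda c: cross_encoder_score(query, c), reverse=True)
    return scored[:top_n]

candidates = [  # imagine these are the top-20 from vector search
    "refund policy for enterprise plans",
    "office holiday schedule",
    "enterprise refund timelines",
]
best = rerank("enterprise refund policy", candidates, top_n=2)
```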
4. The Query Itself Might Be Bad
Users don't always ask well-formed questions. "What's the thing about the policy change?" doesn't embed well. The retriever can only be as good as the query it receives.
The fix: query processing. Before retrieval, rewrite the query to be more specific, expand it with related terms, or generate hypothetical answers (HyDE — Hypothetical Document Embeddings) and use those to retrieve. Treat every user query as raw material that needs normalization before it touches the knowledge base.
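A sketch of the rewrite-then-HyDE flow, where `llm` is any callable from prompt to text (the canned stand-in below is for illustration only):

```python
def process_query(raw_query, llm):
    # `llm` is any callable prompt -> text (a real client in production).
    rewritten = llm(f"Rewrite as a specific, self-contained search query: {raw_query}")
    # HyDE: retrieve with a *hypothetical answer*, which often embeds
    # closer to real answer passages than the question itself does.
    hypothetical = llm(f"Write a short passage answering: {rewritten}")
    return rewritten, hypothetical

def fake_llm(prompt):  # canned stand-in, for illustration only
    if prompt.startswith("Rewrite"):
        return "What changed in the Q3 expense policy?"
    return "The Q3 expense policy raised the per-diem limit to $75."

rewritten, hyde_text = process_query("what's the thing about the policy change?", fake_llm)
```

You would then embed `hyde_text` (not the raw query) for the retrieval step.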
Advanced Patterns Worth Knowing
Self-RAG
Standard RAG retrieves on every query, whether retrieval is needed or not. Self-RAG trains the model to decide when to retrieve and to critique its own outputs for groundedness. The model generates "reflection tokens" — markers that indicate whether retrieval is needed, whether the retrieved data supports the claim, and whether the response is actually useful. This reduces unnecessary retrieval (saves latency and cost) and catches hallucinations that standard RAG misses.
GraphRAG
When your data has structured relationships — organizational hierarchies, regulatory dependencies, product catalogs — flat vector search loses relational context. GraphRAG combines vector search with knowledge graphs to capture and query relationships between entities. Microsoft's GraphRAG implementation, for example, uses community detection algorithms to create hierarchical summaries of the graph, enabling both local (specific entity) and global (theme-level) queries.
Agentic RAG
In standard RAG, retrieval happens once. In agentic RAG, the LLM itself decides when and how to retrieve — potentially making multiple retrieval calls, refining its query between calls, or combining results from different sources. This is where RAG meets agent architecture: the retriever becomes a tool the agent can call, not a fixed pipeline step. WHOOP's inline tools and Cursor's context engine are both variations of this pattern — retrieval integrated into the agent's decision loop rather than bolted on before generation.
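The core loop can be sketched as follows, with `decide` standing in for the LLM's tool-use decision and `retrieve` for any retriever the agent can call:

```python
def agentic_answer(question, decide, retrieve, max_steps=3):
    # `decide(question, context)` returns ("retrieve", refined_query)
    # or ("answer", final_text); `retrieve(query)` returns chunks.
    context = []
    for _ in range(max_steps):
        action, payload = decide(question, context)
        if action == "answer":
            return payload
        context.extend(retrieve(payload))
    return "Could not answer within the retrieval budget."

# Stub policy for illustration: retrieve once, then answer from what came back.
decide = lambda q, ctx: ("answer", f"Grounded in {len(ctx)} chunks") if ctx else ("retrieve", q)
retrieve = lambda query: ["chunk-a", "chunk-b"]
result = agentic_answer("What changed in Q3?", decide, retrieve)
```

The `max_steps` cap matters: without a retrieval budget, an agent can loop indefinitely on an unanswerable question.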
Building RAG That Works in Production
Start With Evaluation
The single most common mistake in RAG development: building the pipeline before building the evaluation framework. You need to measure three things separately:
- Retrieval quality: Are you getting the right chunks? (Precision and recall against a labeled set)
- Groundedness: Is the model actually using the retrieved data, or hallucinating alongside it?
- Answer quality: Is the final answer useful and accurate?
If you only measure answer quality, you can't diagnose whether a bad answer came from bad retrieval, bad generation, or both. Separate your metrics. Many teams now use LLM-as-a-judge for evaluation — having a model score responses against criteria — which scales better than human evaluation for iterative development.
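Retrieval quality, at least, is cheap to measure once you have a labeled set. A sketch of precision@k and recall@k over chunk IDs:

```python
def retrieval_metrics(retrieved_ids, relevant_ids, k=5):
    # Precision@k: how much of what we returned is relevant.
    # Recall@k: how much of what is relevant we returned.
    top = retrieved_ids[:k]
    hits = len(set(top) & set(relevant_ids))
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

p, r = retrieval_metrics(["c1", "c4", "c2", "c9", "c7"], ["c1", "c2", "c3"])
```

Run this over every query in your labeled set and track the averages as you change chunking, embeddings, or search strategy.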
Chunk Size Is Your First Hyperparameter
Don't default to a fixed chunk size. Experiment. Run your eval suite across 256, 512, and 1,024 token chunks. Measure retrieval recall and answer quality at each. The optimal size varies dramatically by domain — legal documents chunk differently than code documentation, which chunks differently than customer support transcripts.
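The sweep itself is trivial once the eval harness exists; `build_index` and `run_eval` below are stand-ins for your own indexing pipeline and eval suite:

```python
def sweep_chunk_sizes(build_index, run_eval, sizes=(256, 512, 1024)):
    # Re-index at each candidate size, score with the eval suite,
    # return the winner plus the full results for inspection.
    results = {size: run_eval(build_index(size)) for size in sizes}
    return max(results, key=results.get), results

# Stand-in harness: pretend recall peaks at 512-token chunks.
fake_scores = {256: 0.61, 512: 0.74, 1024: 0.69}
best, results = sweep_chunk_sizes(build_index=lambda s: s, run_eval=fake_scores.get)
```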
Metadata Is Retrieval Context
Don't throw away document metadata during ingestion. Source, date, author, document type, section title — all of this can be used to filter retrieval results before or after vector search. A question about "current pricing" should preferentially retrieve recent documents. Metadata filters are often more effective than embedding improvements.
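A sketch of metadata filtering ahead of similarity ranking (the field names and values here are illustrative):

```python
from datetime import date

chunks = [
    {"text": "Pricing: $40/seat", "doc_type": "pricing", "date": date(2025, 1, 10)},
    {"text": "Pricing: $30/seat", "doc_type": "pricing", "date": date(2023, 6, 1)},
    {"text": "Onboarding guide", "doc_type": "guide", "date": date(2025, 2, 1)},
]

def metadata_filter(chunks, doc_type=None, after=None):
    # Narrow the candidate pool before (or after) vector ranking.
    hits = chunks
    if doc_type is not None:
        hits = [c for c in hits if c["doc_type"] == doc_type]
    if after is not None:
        hits = [c for c in hits if c["date"] >= after]
    return hits  # a real system would now rank these by similarity

current_pricing = metadata_filter(chunks, doc_type="pricing", after=date(2024, 1, 1))
```

Most vector databases expose this as a filter parameter on the search call itself, which is more efficient than filtering client-side.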
Respect Access Controls
If your data has different visibility levels — and enterprise data almost always does — your retrieval layer must enforce access controls. A user asking a question should only retrieve documents they're authorized to see. This isn't a nice-to-have; it's a security requirement that needs to be built into the retrieval pipeline from day one.
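A minimal sketch of the enforcement point, assuming each chunk carries an `allowed_groups` field attached at ingestion time:

```python
def authorized_retrieve(query, user_groups, search):
    # Post-filter shown for clarity; most vector databases also support
    # metadata pre-filters so unauthorized chunks are never even ranked.
    return [c for c in search(query)
            if set(c["allowed_groups"]) & set(user_groups)]

def search(query):  # stand-in retriever with ACL metadata attached
    return [
        {"text": "HR salary bands", "allowed_groups": ["hr"]},
        {"text": "Public handbook", "allowed_groups": ["hr", "everyone"]},
    ]

visible = authorized_retrieve("salary bands", ["everyone"], search)
```

Prefer the pre-filter form in production: post-filtering can silently return fewer than k results and leaks timing information about restricted content.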
Watch for Staleness
RAG is only as current as your index. If your source documents change and your embeddings don't update, you're serving stale context. Build a sync pipeline that re-indexes changed documents on a schedule that matches your domain's rate of change — hourly for support tickets, daily for documentation, immediately for regulatory updates.
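A sketch of hash-based incremental re-indexing, where `reindex` stands in for your own re-chunk-and-re-embed step:

```python
import hashlib

def sync_index(documents, indexed_hashes, reindex):
    # Re-embed only documents whose content hash changed since last sync.
    # `reindex(doc_id, text)` re-chunks and re-embeds one document.
    changed = []
    for doc_id, text in documents.items():
        digest = hashlib.sha256(text.encode()).hexdigest()
        if indexed_hashes.get(doc_id) != digest:
            reindex(doc_id, text)
            indexed_hashes[doc_id] = digest
            changed.append(doc_id)
    return changed

documents = {"faq": "v2 of the FAQ", "guide": "unchanged guide"}
hashes = {"guide": hashlib.sha256("unchanged guide".encode()).hexdigest()}
changed = sync_index(documents, hashes, reindex=lambda doc_id, text: None)
```

Hashing content rather than comparing timestamps avoids re-embedding documents that were touched but not actually changed.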
When to Use RAG and When Not To
RAG is the right choice when:
- Your data changes frequently and retraining is impractical
- You need answers grounded in specific, citable sources
- You want to keep the base model general-purpose and plug in domain knowledge externally
- Privacy or compliance requires that proprietary data never enters model training
RAG is the wrong choice when:
- The knowledge is already well-represented in the model's training data (don't RAG for "what is photosynthesis")
- You need the model to deeply internalize domain reasoning patterns, not just reference facts (consider fine-tuning)
- Your data is small enough to fit entirely in the context window (just put it in the prompt)
- Latency requirements are so tight that the retrieval round-trip is unacceptable
Most real-world systems end up using RAG alongside fine-tuning and prompt engineering — not instead of them. RAG handles dynamic knowledge. Fine-tuning handles reasoning patterns and style. Prompt engineering handles task framing and output formatting. They're complementary, not competing.
A Minimal RAG Implementation
For teams that want to start today, here's a skeleton in Python using LangChain (though the pattern is framework-agnostic):
# 1. Load and chunk documents
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader
loader = DirectoryLoader("./docs", glob="**/*.md")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
chunks = splitter.split_documents(docs)
# 2. Embed and store
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")
# 3. Retrieve and generate
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
llm = ChatOpenAI(model="gpt-4o")
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
answer = qa_chain.invoke({"query": "What is our refund policy for enterprise customers?"})["result"]
This gets you a working prototype in under 50 lines. From there, the work is in evaluation, chunking strategy, hybrid search, reranking, and access controls — the production concerns described above.
The Takeaway
RAG is not "add retrieval and hallucinations disappear." It's a systems architecture that requires thoughtful chunking, hybrid search, reranking, query processing, evaluation, and access controls to work reliably. The basic pipeline is a starting point, not a destination.
But when built well, RAG gives your agents something they can't get any other way: access to your data, in real time, with citations, without retraining the model. That's the foundation on which most useful agentic systems are built.
Further Reading
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — The original RAG paper by Lewis et al. (2020)
- Enhancing Retrieval-Augmented Generation: A Study of Best Practices — Li et al., COLING 2025, systematic study of RAG configuration factors
- Microsoft RAG Evaluation Framework — Separating retrieval, groundedness, and answer quality metrics
- Priompt — Cursor's open-source priority-based context budgeting library (relevant to managing retrieved chunks in token-limited prompts)
- The Crux of Every AI System: Evaluations — WHOOP's evaluation framework, applicable to RAG quality measurement