Large language models are impressive until they confidently tell your users something wrong. The model does not know what is in your internal knowledge base, your product documentation, or last quarter’s policy update. Retrieval-Augmented Generation, commonly called RAG, is the practical fix: before the model answers, you retrieve the relevant documents and put them in the context window. The model reasons over real text instead of guessing from training data.

This guide walks you through every layer of a production-ready RAG pipeline: document preparation, chunking, embedding, vector storage, retrieval, and prompt assembly. Code examples use the official anthropic Python SDK and Claude models. The architecture itself applies to any embedding provider and any LLM.

How RAG Works: The Core Loop

At runtime, a RAG system does three things before calling the LLM:

  1. Embed the user query into a vector using the same embedding model you used to index your documents.
  2. Search the vector store for the chunks most similar to that query vector.
  3. Inject the retrieved chunks into the prompt alongside the question, then call the LLM.

The LLM never sees your entire corpus. It sees a carefully selected slice of it, sized to fit the context window of the model you are using. This is why chunking and retrieval quality matter more than most people expect.
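The loop above can be sketched end to end with a toy embedding. This is a minimal illustration under stated assumptions, not a production design: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the LLM call is replaced by returning the assembled prompt.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def core_loop(question: str, corpus: list[str], top_k: int = 2) -> str:
    # 1. Embed the query with the same function used for the corpus.
    q_vec = toy_embed(question)
    # 2. Rank chunks by similarity to the query and keep the top K.
    ranked = sorted(corpus, key=lambda c: cosine(q_vec, toy_embed(c)), reverse=True)
    selected = ranked[:top_k]
    # 3. Inject the selected chunks into the prompt (the LLM call would follow).
    context = "\n\n".join(selected)
    return f"CONTEXT:\n{context}\n\nQUESTION: {question}"

corpus = [
    "refunds are processed within five business days",
    "the office is closed on public holidays",
    "refunds require the original receipt",
]
prompt = core_loop("how are refunds processed", corpus)
```

Only the two refund-related chunks reach the prompt; the unrelated chunk about office hours is left out, which is the whole point of the retrieval step.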

Step 1: Prepare and Chunk Your Documents

Raw documents are too large and too noisy for direct retrieval. You need to split them into chunks that are small enough to be precise but large enough to carry complete thoughts.

Common strategies:

  • Fixed-size with overlap. Split every N tokens, with a sliding-window overlap of M tokens so that a sentence cut at one chunk boundary usually still appears whole in the next chunk. A starting point: 512-token chunks with 64-token overlap.
  • Semantic splitting. Split on natural boundaries like paragraphs, headings, or code blocks before falling back to size limits. This preserves logical units.
  • Hierarchical chunking. Store both a full document summary and fine-grained passage chunks. Use the summary for broad questions and passage chunks for precise ones.
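Semantic splitting can be sketched as packing whole paragraphs into chunks and falling back to fixed-size windows only for oversized paragraphs. The function name `split_on_paragraphs`, the blank-line paragraph delimiter, and the word-based size limit are all assumptions made for illustration.

```python
def split_on_paragraphs(text: str, max_words: int = 200) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_words words.

    A paragraph longer than max_words falls back to fixed-size word windows.
    """
    chunks: list[str] = []
    current: list[str] = []  # paragraphs collected for the chunk in progress
    current_len = 0          # word count of the chunk in progress
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        words = para.split()
        if len(words) > max_words:
            # Flush what we have, then hard-split the oversized paragraph.
            if current:
                chunks.append("\n\n".join(current))
                current, current_len = [], 0
            for i in range(0, len(words), max_words):
                chunks.append(" ".join(words[i:i + max_words]))
            continue
        if current and current_len + len(words) > max_words:
            # Adding this paragraph would exceed the limit: close the chunk first.
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += len(words)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

The same structure extends to headings or code blocks: split on the strongest boundary first, and only then enforce the size limit.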

Metadata matters. Attach the source filename, document title, section heading, and chunk index to every chunk. You will need this for citations and for filtering at query time.

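def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping chunks, measured in whitespace-separated words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a chunk reaches the end, so the tail is not emitted twice.
        if start + chunk_size >= len(words):
            break
    return chunks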

This is a word-level approximation. For production, use a tokenizer that matches your embedding model so your chunk sizes are accurate.
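To carry the metadata described above alongside each chunk, one option is a small record type. The sketch below is illustrative: the `Chunk` class, its field names, and both helper functions are hypothetical, not part of any library.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str       # source filename
    title: str        # document title
    section: str      # nearest section heading
    chunk_index: int  # position of the chunk within its document

def make_chunks(pieces: list[str], source: str, title: str, section: str) -> list[Chunk]:
    """Wrap raw chunk strings in records that remember where they came from."""
    return [
        Chunk(text=p, source=source, title=title, section=section, chunk_index=i)
        for i, p in enumerate(pieces)
    ]

def filter_by_source(chunks: list[Chunk], source: str) -> list[Chunk]:
    """Query-time filter: keep only the chunks from one source file."""
    return [c for c in chunks if c.source == source]

records = make_chunks(["First part.", "Second part."], "policy.md", "Refund Policy", "Overview")
records += make_chunks(["Unrelated."], "faq.md", "FAQ", "General")
policy_only = filter_by_source(records, "policy.md")
```

Keeping the chunk index makes it possible to fetch neighboring chunks later, and the source and section fields are what a citation would quote.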

Step 2: Embed and Index

Embeddings turn text into fixed-length vectors that encode semantic meaning. Chunks that discuss similar topics land near each other in vector space, which is what makes semantic search possible.

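Choose an embedding model that matches your domain. General-purpose models work well for mixed content. Specialized models exist for code, legal text, and biomedical literature. Whatever you choose, use the same model at index time and query time. Each model defines its own vector space, so a query vector from one model cannot be meaningfully compared with document vectors from another. Mixing models breaks retrieval.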

Store the resulting vectors in a vector database. Options range from local libraries like FAISS and ChromaDB to managed services. The right choice depends on your scale, latency requirements, and infrastructure constraints. For most teams starting out, a lightweight local store is fine until you hit real traffic.
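To make the storage layer concrete, here is a minimal in-memory sketch with brute-force cosine search. The class name `InMemoryVectorStore` and its `upsert`/`search` methods are assumptions chosen for illustration; real libraries such as FAISS and ChromaDB expose their own, different APIs.

```python
import math

def _cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class InMemoryVectorStore:
    """Toy vector store: brute-force cosine search over a dict of vectors."""

    def __init__(self) -> None:
        self._items: dict[str, tuple[list[float], dict]] = {}

    def upsert(self, id: str, vector: list[float], metadata: dict) -> None:
        # Insert a new entry, or overwrite the existing entry with the same id.
        self._items[id] = (vector, metadata)

    def search(self, query: list[float], top_k: int = 5) -> list[dict]:
        # Score every stored vector against the query, best match first.
        scored = [
            {"id": item_id, "score": _cosine(query, vec), "metadata": meta}
            for item_id, (vec, meta) in self._items.items()
        ]
        scored.sort(key=lambda r: r["score"], reverse=True)
        return scored[:top_k]

store = InMemoryVectorStore()
store.upsert(id="a_chunk_0", vector=[1.0, 0.0], metadata={"text": "alpha", "source": "a.md"})
store.upsert(id="b_chunk_0", vector=[0.0, 1.0], metadata={"text": "beta", "source": "b.md"})
hits = store.search([0.9, 0.1], top_k=1)
```

Brute-force search is linear in the number of chunks, which is acceptable for small corpora; dedicated vector databases use approximate indexes to stay fast at scale.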

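# Indexing loop. `embedding_model` and `vector_store` stand in for whichever
# embedding client and vector database you chose; their method names are illustrative.
def index_documents(documents, embedding_model, vector_store) -> None:
    for doc in documents:
        chunks = chunk_text(doc.text)
        for i, chunk in enumerate(chunks):
            vector = embedding_model.embed(chunk)
            vector_store.upsert(
                id=f"{doc.id}_chunk_{i}",  # stable ID, so re-indexing overwrites in place
                vector=vector,
                metadata={"text": chunk, "source": doc.filename, "chunk_index": i},
            )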

Step 3: Retrieve Relevant Chunks at Query Time

When a user submits a question, embed that question and run a nearest-neighbor search against your index. Retrieve the top-K most similar chunks. A value of K between 5 and 10 is a reasonable starting point. More chunks give the model more context but also add noise and cost.
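One way to balance context against noise and cost is to cap the retrieved set by both count and size. The sketch below assumes each search result is a dict with a `score` and a `metadata["text"]` field; the function name `select_context` and the word budget are hypothetical choices, not a standard.

```python
def select_context(results: list[dict], top_k: int = 6, max_words: int = 1500) -> list[dict]:
    """Keep the best-scoring results, stopping at top_k items or max_words words."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    selected: list[dict] = []
    used = 0  # words consumed so far
    for r in ranked[:top_k]:
        n = len(r["metadata"]["text"].split())
        if used + n > max_words:
            break  # the next chunk would exceed the budget
        selected.append(r)
        used += n
    return selected

results = [
    {"id": "a", "score": 0.9, "metadata": {"text": "one two three"}},
    {"id": "b", "score": 0.5, "metadata": {"text": "four five six"}},
    {"id": "c", "score": 0.7, "metadata": {"text": "seven eight nine"}},
]
within_budget = select_context(results, top_k=3, max_words=7)
```

With a seven-word budget, only the two highest-scoring three-word chunks fit, so the lowest-scoring one is dropped.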

Two retrieval techniques worth knowing beyond basic vector search:

  • Hybrid search. Combine vector similarity with keyword search (BM25). Keyword search handles exact product names and identifiers that embeddings sometimes blur. Most production RAG systems use both.
  • Re-ranking. After retrieving the top-K candidates, run a cross-encoder re-ranker to reorder them by relevance before sending them to the LLM. Re-rankers are slower than vector search because they score each query-chunk pair jointly, but they often improve precision noticeably.
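For hybrid search, the two ranked lists have to be merged somehow. One common option, swapped in here as an assumption since the text above does not prescribe a method, is reciprocal rank fusion (RRF): each list contributes 1 / (k + rank) to a chunk's score. The chunk IDs and the constant `k = 60` are illustrative.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk IDs; each list contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)

vector_hits = ["c3", "c1", "c7"]   # hypothetical IDs ranked by vector similarity
keyword_hits = ["c1", "c9", "c3"]  # hypothetical IDs ranked by BM25
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

RRF uses only rank positions, so it sidesteps the problem that cosine scores and BM25 scores live on different scales. Chunks that appear in both lists rise to the top.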

Step 4: Assemble the Prompt and Call the Model

With your retrieved chunks in hand, you build a prompt that gives the model clear instructions and the retrieved context. Be explicit: tell the model to answer only from the provided context, and to say so when the context does not contain an answer. This is the primary guard against hallucination in a RAG system.

import anthropic
import os

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def build_prompt(question: str, chunks: list[dict]) -> str:
    context_text = "\n\n---\n\n".join(
        f"Source: {c['metadata']['source']}\n{c['metadata']['text']}"
        for c in chunks
    )
    return f"""You are a precise assistant. Answer the question using ONLY the context below.
If the context does not contain enough information, say so clearly.

CONTEXT:
{context_text}

QUESTION: {question}

ANSWER:"""

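def rag_query(question: str, vector_store, embedding_model) -> str:
    # Embed the question with the same model that was used to build the index.
    query_vector = embedding_model.embed(question)
    chunks = vector_store.search(query_vector, top_k=6)

    prompt = build_prompt(question, chunks)

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    # Without thinking enabled, the first content block is expected to hold the answer text.
    return response.content[0].text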

Model selection depends on your accuracy and latency requirements. claude-haiku-4-5 is fast and economical for high-volume lookups with well-structured retrieval. claude-sonnet-4-6 handles nuanced synthesis across multiple retrieved chunks. For the most demanding multi-document reasoning tasks, claude-opus-4-8 or claude-fable-5 bring the highest capability, and their 1M-token context windows can accommodate large retrieved sets without truncation. Haiku offers a 200K-token context window, which is ample for most RAG use cases.

Step 5: Add Adaptive Thinking for Complex Retrieval Tasks

On Claude 4.6 and newer models, you can enable adaptive thinking to improve reasoning quality on hard questions. Instead of a fixed token budget, adaptive thinking lets the model decide how much internal reasoning the question requires.

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    thinking={"type": "adaptive"},
    messages=[{"role": "user", "content": prompt}]
)

Use adaptive thinking when the question requires the model to reconcile conflicting information across multiple retrieved chunks, or when accuracy on a single high-stakes query matters more than latency. Do not enable it for every routine lookup because it adds processing time.
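With thinking enabled, the response may begin with a thinking block, so reading `response.content[0].text` is no longer safe. A small helper that collects only the text blocks is one way to handle this. The name `extract_text` is hypothetical, and the `SimpleNamespace` objects below are stand-ins shaped like an SDK response, used only to exercise the helper without a network call.

```python
from types import SimpleNamespace

def extract_text(response) -> str:
    """Join the text blocks of a response, skipping thinking and other block types."""
    return "".join(
        block.text for block in response.content if getattr(block, "type", None) == "text"
    )

# Stand-in objects shaped like an SDK response; not real API output.
fake_response = SimpleNamespace(content=[
    SimpleNamespace(type="thinking", thinking="Compare the two policy versions..."),
    SimpleNamespace(type="text", text="The 2024 policy supersedes the 2023 one."),
])
answer = extract_text(fake_response)
```

The same helper works for responses without thinking, so it can replace the index-based access everywhere.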

Evaluation: How to Know It Is Working

RAG systems fail in two distinct places. Retrieval failures mean the right chunk was never returned. Generation failures mean the model had the right context but still produced a wrong answer. You need to evaluate both separately.

  • Retrieval evaluation. Build a test set of questions where you know which chunk (or, at coarser granularity, which document) contains the answer. Measure recall at K: for what fraction of questions does the correct chunk appear in the top K results?
  • Generation evaluation. Given perfect retrieved context, does the model produce a correct answer? Use LLM-as-judge scoring or human review against a golden answer set.
  • End-to-end evaluation. Run the full pipeline on your test set and measure answer accuracy. This catches compounding failures that neither sub-evaluation surfaces alone.
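Recall at K, as described in the retrieval evaluation item above, is simple to compute once you have a labeled test set. This sketch assumes one known-relevant chunk per question; the function name and the question and chunk IDs are made up for illustration.

```python
def recall_at_k(retrieved: dict[str, list[str]], relevant: dict[str, str], k: int) -> float:
    """Fraction of questions whose known-relevant chunk ID appears in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for q, gold in relevant.items() if gold in retrieved.get(q, [])[:k])
    return hits / len(relevant)

# Hypothetical test set: ranked chunk IDs returned per question, and the correct chunk.
retrieved = {
    "q1": ["c1", "c2", "c3"],
    "q2": ["c9", "c4", "c5"],
    "q3": ["c7", "c8", "c6"],
}
relevant = {"q1": "c2", "q2": "c5", "q3": "c7"}

recall_2 = recall_at_k(retrieved, relevant, k=2)
recall_3 = recall_at_k(retrieved, relevant, k=3)
```

Comparing the metric at several values of K shows whether the right chunk is missing entirely or merely ranked too low, which points toward an embedding fix or a re-ranking fix respectively.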

Tracing each query through its retrieved chunks is essential for debugging. Log the question, the retrieved chunk IDs and scores, and the final answer. When something goes wrong, you can immediately see whether the problem started at retrieval or generation.
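A trace can be as simple as one JSON line per query. The record layout and the function name `trace_record` below are assumptions; adapt the fields to whatever your logging stack expects.

```python
import json
import time

def trace_record(question: str, results: list[dict], answer: str) -> str:
    """Serialize one query trace as a single JSON line."""
    return json.dumps({
        "timestamp": time.time(),
        "question": question,
        # Keep only IDs and scores; chunk text can be looked up from the index.
        "retrieved": [{"id": r["id"], "score": round(r["score"], 4)} for r in results],
        "answer": answer,
    })

line = trace_record(
    "What is the refund window?",
    [{"id": "doc1_chunk_0", "score": 0.87654321}, {"id": "doc4_chunk_2", "score": 0.5}],
    "Refunds are accepted within 30 days.",
)
```

Appending these lines to a file or log stream gives you a searchable history: a wrong answer with the right chunk IDs points at generation, while a wrong answer with irrelevant chunk IDs points at retrieval.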

Common Pitfalls and How to Avoid Them

  • Chunks that are too small. Sub-sentence fragments lose context. If your retrieved chunks lack enough surrounding text for the model to reason, increase chunk size or add overlap.
  • Stale indexes. When source documents change, re-embed and re-index the affected chunks. A RAG system pointed at outdated content can be worse than no RAG at all, because stale answers arrive looking grounded in a source.
  • No metadata filtering. If a user asks about a specific product version or date range, filtering by metadata before vector search removes chunks that are semantically similar but off-target, which can improve precision considerably. Build metadata filters into your retrieval layer early.
  • Forgetting to cite sources. Ask the model to include source references in its answer. This gives users a way to verify claims and builds trust in the system.
  • Skipping the system prompt. Always include clear instructions, in a system prompt or at the top of the user message as in Step 4, for how the model should behave when context is insufficient. Without this, models drift toward fabrication rather than honest uncertainty.
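For the stale-index pitfall above, one inexpensive way to find documents that need re-indexing is to store a content hash at index time and compare it later. Hashing is an assumption introduced here, not something the pipeline requires, and the names `content_hash` and `find_stale` are hypothetical.

```python
import hashlib

def content_hash(text: str) -> str:
    """SHA-256 digest of the document text, used as a change fingerprint."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def find_stale(documents: dict[str, str], indexed_hashes: dict[str, str]) -> list[str]:
    """Return IDs of documents that are new or whose text changed since indexing."""
    return [
        doc_id
        for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]

# Hashes recorded when the index was last built.
indexed_hashes = {"a": content_hash("policy v1"), "b": content_hash("faq v1")}
# Current state of the source documents: "b" was edited and "c" is new.
documents = {"a": "policy v1", "b": "faq v2", "c": "new handbook"}
stale = find_stale(documents, indexed_hashes)
```

Only the documents in `stale` need to be re-chunked and re-embedded, which keeps refresh jobs cheap. Deleted documents need a separate pass that removes IDs present in the index but absent from the source.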

Connecting to External Data with MCP

If your documents live in external systems like wikis, databases, or file servers, the Model Context Protocol (MCP) offers a standardized way to connect Claude to those sources at runtime. Rather than pre-indexing everything offline, an MCP server can expose retrieval tools that Claude invokes dynamically. This is a useful complement to a static vector index when your data is large, frequently updated, or spread across many systems.

Takeaway

RAG is not a single feature you toggle on. It is a pipeline with multiple moving parts, each of which can fail quietly. The teams that get the most out of it are the ones who treat chunking, embedding, and retrieval as first-class engineering concerns rather than plumbing to skip over. Get your chunk quality right, measure retrieval recall separately from generation quality, and be explicit in your prompts about what the model should do when context runs out. Done well, RAG turns a hallucination-prone assistant into a reliable, citable knowledge tool.