Why RAG beats pure LLM for understanding codebases

"Why don't you just paste the whole codebase into Claude's context?" We get this question a lot. Here's why that doesn't work — and what we do instead.

Context is not free

Claude's context window is impressive: hundreds of thousands of tokens. But real codebases can dwarf that. A medium Django monolith can run to a million tokens; a Kubernetes fork, ten million. And even when everything fits, paying for the full context on every query is wasteful: you don't need to load the entire auth subsystem to answer "what does our README say about deployment?"

And even when it fits and cost doesn't matter, quality can degrade. Models attend unevenly across long contexts (the "lost in the middle" problem). Relevant code buried in file 237 can get ignored while irrelevant boilerplate in file 1 gets weighted heavily. You end up with confidently wrong answers about code the model technically "saw" but didn't really process.

RAG: retrieve the relevant slice

Retrieval-augmented generation (RAG) solves this by doing a search step first. When you ask a question, we don't send the whole repo to Claude — we send the 8-12 chunks most semantically similar to your question. Claude gets a focused, high-signal context window.

The pipeline:

Chunk every file into overlapping 60-line windows. Overlap matters: a function or block that straddles a chunk boundary would otherwise be split across two chunks, with neither holding the whole thing.
Embed each chunk into a 384-dim vector using sentence-transformers (we use all-MiniLM-L6-v2, which runs locally with no API cost).
Store vectors in ChromaDB, keyed by repo-id so queries scope cleanly.
Query: embed the user's question, rank the stored chunks by cosine similarity to it, and take the top-k.
Generate: send those chunks to Claude with an instruction to cite the file path and line range for every claim.
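
To make the query step concrete, here is a minimal sketch of cosine-similarity ranking in plain Python. The toy 3-dimensional vectors stand in for real 384-dim embeddings, and the function names and chunk ids are illustrative assumptions, not RepoInsight's actual code. In the real pipeline a vector store handles this ranking.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=8):
    """Return the ids of the k chunks most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]

# Hypothetical toy data: chunk id -> embedding.
chunk_vecs = {
    "auth/oauth.py:1-60": [0.9, 0.1, 0.0],
    "README.md:1-60": [0.0, 0.2, 0.9],
    "auth/session.py:1-60": [0.7, 0.3, 0.1],
}
query_vec = [1.0, 0.0, 0.0]

print(top_k(query_vec, chunk_vecs, k=2))
# ['auth/oauth.py:1-60', 'auth/session.py:1-60']
```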

What makes code RAG different from document RAG

Most RAG systems you've read about are tuned for prose — technical manuals, knowledge bases, wikis. Code is different in three ways that forced us to rethink the default patterns.

1. Structure matters more than prose

In a knowledge base, "chunk by paragraph" works fine. Code has no paragraph boundaries to lean on, and chunking mid-function destroys the thing you're trying to retrieve. We chunk by line ranges with generous overlap, so a function that straddles one chunk's boundary is usually captured whole by the next chunk.
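
Line-range chunking with overlap can be sketched as below. The 60-line window comes from the pipeline described earlier; the 15-line overlap and the function name chunk_lines are assumptions picked for illustration.

```python
def chunk_lines(text, window=60, overlap=15):
    """Split text into overlapping line windows.

    Each chunk records its 1-based, inclusive line range so an answer
    can later cite exact lines. The overlap default is an assumption.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    lines = text.splitlines()
    step = window - overlap
    chunks = []
    for start in range(0, len(lines), step):
        end = min(start + window, len(lines))
        chunks.append({
            "start_line": start + 1,
            "end_line": end,
            "text": "\n".join(lines[start:end]),
        })
        if end == len(lines):
            break  # the last window already reaches the end of the file
    return chunks

# A 100-line file yields two windows that share lines 46-60.
sample = "\n".join(f"line {i}" for i in range(1, 101))
for chunk in chunk_lines(sample):
    print(chunk["start_line"], chunk["end_line"])
# 1 60
# 46 100
```

A function that begins on line 50 is cut off at line 60 in the first chunk but sits entirely inside the second, provided it ends by line 100.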

2. File paths carry meaning

A chunk from auth/oauth.py is semantically different from a chunk from tests/auth/test_oauth.py, even if the code looks identical. We encode file paths as metadata alongside the vector and surface them in answers — users can click the citation to jump to the exact file and line.
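
One way to carry path metadata through to the answer is to label each retrieved chunk with its path and line range before it goes into the prompt. The Chunk dataclass, the path:start-end label format, and the prompt wording below are hypothetical, shown only to illustrate the idea.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    path: str        # repo-relative file path, stored as metadata
    start_line: int  # 1-based, inclusive
    end_line: int
    text: str

def citation(chunk):
    """Render a citation label such as 'auth/oauth.py:10-69'."""
    return f"{chunk.path}:{chunk.start_line}-{chunk.end_line}"

def build_prompt(question, chunks):
    """Label each chunk with its citation so the model can refer back to it."""
    sections = [f"[{citation(c)}]\n{c.text}" for c in chunks]
    context = "\n\n".join(sections)
    return (
        "Answer the question using only the code excerpts below. "
        "Cite the file path and line range for every claim.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

chunks = [
    Chunk("auth/oauth.py", 10, 69, "def authenticate(request): ..."),
    Chunk("tests/auth/test_oauth.py", 1, 60, "def test_authenticate(): ..."),
]
print(build_prompt("Where do we handle login?", chunks))
```

Because the label travels with the chunk text, production code and its near-identical test file stay distinguishable in the context window.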

3. Semantic similarity ≠ lexical similarity

A user asks "where do we handle login?". The code doesn't say "login" — it says authenticate, verify_credentials, jwt_issue. Pure keyword search misses this entirely. Embeddings bridge the gap: they map "login" and "authenticate" to nearby points in vector space, so the right chunks surface even without lexical matches.

The tradeoff: retrieval quality caps answer quality

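RAG has a ceiling: Claude can only reason over the chunks that retrieval hands it. If the search step misses the relevant code, the answer will be incomplete or wrong no matter how capable the model is.

We mitigate this by (a) generous top-k (8-12 chunks, not 3), (b) chunk overlap so boundary cases still get covered, and (c) including filename-level hints in the retrieval step so questions like "what does the Dockerfile do?" route correctly even when the file itself is small and wouldn't otherwise score high.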
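
A filename-level hint can take many forms. The sketch below shows one simple possibility: adding a fixed bonus to a chunk's similarity score when the question names its file. The bonus value, the tokenization, and both function names are assumptions for illustration, not a description of the production retriever.

```python
import os
import re

def filename_boost(question, path, boost=0.2):
    """Return `boost` if the question mentions the file's name, else 0.0."""
    # Tokenize loosely, then trim trailing punctuation such as "Dockerfile."
    words = {w.strip(".-") for w in re.findall(r"[\w.\-]+", question.lower())}
    base = os.path.basename(path).lower()   # e.g. "oauth.py"
    stem = os.path.splitext(base)[0]        # e.g. "oauth"
    return boost if base in words or stem in words else 0.0

def rerank(question, scored_chunks, boost=0.2):
    """Re-sort (path, similarity) pairs after applying the filename bonus."""
    boosted = [
        (path, score + filename_boost(question, path, boost))
        for path, score in scored_chunks
    ]
    return sorted(boosted, key=lambda item: item[1], reverse=True)

# Hypothetical similarity scores: the short Dockerfile scores lower on its own.
scored = [("auth/oauth.py", 0.40), ("Dockerfile", 0.30)]
for path, score in rerank("What does the Dockerfile do?", scored):
    print(path, round(score, 2))
```

With the bonus applied, the Dockerfile chunk moves ahead of the unrelated but higher-scoring auth chunk. A matcher this naive would also fire on very short file names, so a real system would need something more careful.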

Why not just full-context with a bigger model?

Context windows keep growing. Won't RAG be obsolete soon?

Maybe eventually. But today, even with a 1M-token model, you'd still want retrieval for cost and latency reasons. Sending 800K tokens of code on every query to answer a simple question costs more and takes longer than retrieving 5K relevant tokens. The right architecture for most production use cases is a hybrid: retrieval as the default, with a full-context escape hatch for the rare cases that need it.
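
The arithmetic behind the cost argument is easy to check. The per-token price below is a hypothetical placeholder, not a quoted rate. Only the 800K and 5K token counts come from the text, and their ratio holds at any price as long as input is billed linearly per token.

```python
def input_cost(tokens, price_per_million_tokens):
    """Input-token cost of one query, assuming linear per-token pricing."""
    return tokens / 1_000_000 * price_per_million_tokens

PRICE_PER_MILLION = 3.00  # hypothetical USD per million input tokens

full_context_tokens = 800_000
retrieved_tokens = 5_000

full_cost = input_cost(full_context_tokens, PRICE_PER_MILLION)
rag_cost = input_cost(retrieved_tokens, PRICE_PER_MILLION)
ratio = full_context_tokens // retrieved_tokens

print(f"full context: ${full_cost:.2f} per query")
print(f"retrieval:    ${rag_cost:.3f} per query")
print(f"full context uses {ratio}x the input tokens")
```

The 160x gap in input tokens applies to every query, which is why retrieval stays the sensible default even when the whole repo would fit.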

RepoInsight is free for public repositories. Try it on a codebase you already know and see if the retrieval surfaces what you expect.