Chunking, Retrieval, and Reranking (A Practical End‑to‑End Guide)
Retrieval‑Augmented Generation (RAG) is often discussed as a modeling problem. In practice, most RAG failures have little to do with the language model.
RAG systems fail because:
- the wrong information is retrieved
- the right information is split incorrectly
- or relevant context is retrieved but ranked poorly
This guide walks through the three layers that actually determine RAG quality:
- Chunking — how information is segmented
- Retrieval — how candidates are found
- Reranking — how the best context is selected
Each layer builds on the previous one. Optimizing them out of order leads to fragile systems.
The RAG pipeline (conceptual overview)
Documents
↓
Chunking
↓
Indexes (Vector + Lexical)
↓
Retrieval
↓
Rank Fusion
↓
Reranking
↓
LLM
Most systems over‑optimize the bottom and under‑engineer the top.
Part 1 — Chunking: Making Information Retrievable
What chunking actually is
Chunking is the process of dividing documents into retrievable units (“chunks”) that can be indexed and searched.
Chunking is not:
- a way to satisfy context windows
- a preprocessing detail
- something embeddings will fix later
Chunking determines what information can be retrieved at all.
If information is split incorrectly, it effectively does not exist.
The core rule of chunking
A chunk should answer one coherent question well.
If a chunk cannot stand on its own for a human reader, it is unlikely to work for retrieval.
Token count is a constraint — not the objective.
Why naive chunking fails
Common mistakes:
- splitting by fixed token counts
- splitting mid‑sentence or mid‑rule
- overlapping aggressively “just in case”
- flattening structure into plain text
These mistakes cause:
- partial answers
- missing qualifiers
- hallucinations blamed on models
Chunking by structure, not text
Before chunking, documents should be treated as structured blocks:
- titles
- sections
- paragraphs
- lists
- tables
- code blocks
Chunking should assemble blocks into decision units, not slice raw text.
Conceptual flow
Raw Document
↓
Structured Blocks
↓
Chunk Assembly
A sane default chunking strategy
This works for most real‑world systems:
- Preserve document order and hierarchy
- Merge adjacent blocks until a full idea is captured
- Target ~200–600 tokens (flexible)
- Avoid splitting rules from their exceptions
- Prepend minimal context:
- document title
- section path
This produces chunks that are:
- meaningful
- retrievable
- debuggable
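Below is a minimal sketch of this strategy in Python. It assumes documents have already been parsed into structured blocks; the `Block` shape, the whitespace token count, and the function names are illustrative, not a specific library's API.

```python
from dataclasses import dataclass

@dataclass
class Block:
    text: str          # one paragraph, list, table, or code block as plain text
    section_path: str  # e.g. "Returns > Exceptions"

def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer; good enough for a sketch.
    return len(text.split())

def assemble_chunks(title, blocks, target_tokens=400, max_tokens=600):
    """Merge adjacent blocks in document order until a full idea is captured."""
    chunks, current, current_tokens, current_path = [], [], 0, None

    def flush():
        nonlocal current, current_tokens
        if current:
            # Prepend minimal context: document title and section path.
            header = f"{title} > {current_path}\n"
            chunks.append(header + "\n".join(b.text for b in current))
            current, current_tokens = [], 0

    for block in blocks:
        block_tokens = count_tokens(block.text)
        new_section = block.section_path != current_path
        # Start a new chunk at section boundaries, once the soft target is
        # reached, or when adding this block would blow past the hard cap.
        if current and (new_section or current_tokens >= target_tokens
                        or current_tokens + block_tokens > max_tokens):
            flush()
        current_path = block.section_path
        current.append(block)
        current_tokens += block_tokens
    flush()
    return chunks
```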
Chunk expansion (critical idea)
You are not locked into chunk size.
A powerful pattern is retrieval‑time expansion:
- Retrieve small, precise chunks
- Expand to adjacent chunks or parent sections
- Merge before generation
Retrieved chunk
↑ ↓
Neighbors / Parent context
This improves context without bloating the index.
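A minimal sketch of retrieval‑time expansion, assuming each chunk is stored with its document id and position so neighbors can be looked up; the index layout is an assumption for illustration, not a specific vector store's schema.

```python
def expand_hits(hits, chunks_by_pos, window=1):
    """Merge each retrieved chunk with its neighbors from the same document.

    `hits` are retrieved chunk dicts with "doc_id" and "position";
    `chunks_by_pos` maps (doc_id, position) -> chunk dict. Both layouts are
    assumptions for this sketch.
    """
    seen, contexts = set(), []
    for hit in hits:
        for pos in range(hit["position"] - window, hit["position"] + window + 1):
            key = (hit["doc_id"], pos)
            neighbor = chunks_by_pos.get(key)
            if neighbor is not None and key not in seen:
                seen.add(key)
                contexts.append(neighbor["text"])
    return contexts
```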
Part 2 — Retrieval: Finding the Right Candidates
Chunking defines what can be retrieved. Retrieval defines which chunks are considered.
Retrieval is about recall, not final correctness.
Retrieval methods (what they actually do)
Lexical retrieval (BM25 / full‑text search)
- Matches exact terms
- Excellent for:
- identifiers
- names
- keywords
- Weak at paraphrases
Lexical retrieval answers:
“Does this text contain these words?”
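For illustration, a small example using the rank_bm25 package (one common BM25 implementation); the corpus and query are made up.

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

chunks = [
    "Refunds are processed within 14 days of the return request.",
    "Error 0x80070057 usually indicates an invalid parameter.",
    "Support is available on weekdays between 9:00 and 17:00.",
]
bm25 = BM25Okapi([c.lower().split() for c in chunks])

query = "what does 0x80070057 mean"
top = bm25.get_top_n(query.lower().split(), chunks, n=1)
print(top[0])  # the chunk containing the exact identifier
```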
Vector retrieval (embeddings)
- Matches semantic similarity
- Excellent for:
- paraphrases
- vague queries
- Weak at:
- rare tokens
- numbers
- precise constraints
Vector retrieval answers:
“Does this text mean something similar?”
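The same toy corpus with embeddings, assuming the sentence-transformers library; the model name is just one commonly used small model, and the result comment describes the typical outcome rather than a guarantee.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")  # one common small embedding model

chunks = [
    "Refunds are processed within 14 days of the return request.",
    "Error 0x80070057 usually indicates an invalid parameter.",
    "Support is available on weekdays between 9:00 and 17:00.",
]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

# A paraphrase that shares almost no words with the refund chunk.
query_vec = model.encode(["how long until I get my money back"], normalize_embeddings=True)[0]
scores = chunk_vecs @ query_vec            # cosine similarity (vectors are normalized)
print(chunks[int(np.argmax(scores))])      # typically the refund chunk
```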
Why neither is sufficient alone
- Lexical search misses meaning
- Vector search overgeneralizes meaning
Using either alone creates systematic blind spots.
Hybrid retrieval (the default)
Most reliable systems use both:
Query
├─ Lexical retrieval (BM25)
├─ Vector retrieval (embeddings)
└─ Candidate union
This maximizes recall.
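A sketch of the candidate-union step, assuming `bm25_search` and `vector_search` are hypothetical callables that return ranked lists of chunk ids for a query.

```python
def hybrid_candidates(query, bm25_search, vector_search, k=50):
    """Union of lexical and semantic candidates; ordering is settled later by fusion."""
    lexical = bm25_search(query, k)
    semantic = vector_search(query, k)
    # Keep both rankings for fusion; the candidate pool is their de-duplicated union.
    pool = list(dict.fromkeys(lexical + semantic))
    return {"lexical": lexical, "semantic": semantic, "candidates": pool}
```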
Rank fusion: merging retrieval signals
Lexical and vector scores are not comparable.
Instead of score blending, use rank‑based fusion.
Reciprocal Rank Fusion (RRF)
Intuition:
- Documents that appear near the top in multiple lists are more reliable
Simplified formula (summing over the ranked lists a document appears in):
score(doc) = Σ_r 1 / (k + rank_r(doc))
RRF is:
- simple
- robust
- parameter‑light
It is an excellent default.
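A straightforward implementation; k = 60 comes from the original RRF paper and is a common default. The chunk ids are made up.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """RRF: each list contributes 1 / (k + rank) for every doc it contains.

    `ranked_lists` holds ranked sequences of chunk ids (best first).
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion([
    ["chunk_a", "chunk_c", "chunk_b"],   # lexical ranking
    ["chunk_b", "chunk_a", "chunk_d"],   # vector ranking
])
print(fused)  # chunk_a first: near the top of both lists
```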
Retrieval goal (important)
Retrieval is not about picking the best chunk.
Retrieval is about:
not missing the right chunk
Precision comes later.
Part 3 — Reranking: Selecting the Best Context
After retrieval, you typically have:
- 20–100 candidate chunks
This is too many for an LLM — and many are only weakly relevant.
Reranking is the step that introduces understanding.
What rerankers do differently
Unlike retrieval:
- rerankers see the query and chunk together
- they model cross‑attention between them
This allows understanding of:
- constraints
- negation
- specificity
- intent
Rerankers answer:
“Does this chunk actually answer the query?”
Why reranking matters
Without reranking:
- semantically “close” but wrong chunks rise
- confident hallucinations occur
- irrelevant context pollutes prompts
Reranking dramatically improves:
- answer accuracy
- faithfulness
- citation quality
Typical reranking flow
Top‑K retrieved chunks
↓
Cross‑encoder reranker
↓
Top‑N high‑precision chunks
N is usually small (5–10).
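A sketch of this flow, assuming the sentence-transformers CrossEncoder class and one commonly used open reranker checkpoint; the candidates here are toy data standing in for the retrieved set.

```python
from sentence_transformers import CrossEncoder  # pip install sentence-transformers

# One commonly used open reranker; any cross-encoder with a (query, passage) API works.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "Can I get a refund after 30 days?"
candidates = [  # in practice: the 20–100 chunks surviving retrieval + rank fusion
    "Refunds are processed within 14 days of the return request.",
    "After 30 days, purchases are eligible for store credit only.",
    "Support is available on weekdays between 9:00 and 17:00.",
]

scores = reranker.predict([(query, chunk) for chunk in candidates])
ranked = [c for _, c in sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)]
print(ranked[0])  # typically the store-credit chunk: it actually answers the question
```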
Cost vs quality tradeoff
Rerankers are:
- slower than retrieval
- more expensive per query
That’s why they are used after retrieval, not instead of it.
This layered approach keeps systems scalable.
Putting it all together
End‑to‑end RAG pipeline
Documents
↓
Chunking (decision units)
↓
Indexing
├─ Lexical index
└─ Vector index
↓
Retrieval
├─ BM25
├─ Vector search
└─ Rank fusion (RRF)
↓
Reranking
↓
Chunk expansion (optional)
↓
LLM
Each layer has a single responsibility.
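A compact sketch of how the layers compose. `index`, `reranker`, and `llm` are assumed thin interfaces around the pieces sketched earlier, not any specific framework.

```python
def answer(query, index, reranker, llm, k=50, n=5):
    """End-to-end sketch: hybrid retrieval -> fusion -> reranking -> generation."""
    lexical = index.bm25_search(query, k)        # lexical candidates (BM25)
    semantic = index.vector_search(query, k)     # semantic candidates (embeddings)
    fused = reciprocal_rank_fusion([lexical, semantic])[:k]

    texts = [index.get_text(chunk_id) for chunk_id in fused]
    scores = reranker.predict([(query, t) for t in texts])
    ranked = [t for _, t in sorted(zip(scores, texts), key=lambda p: p[0], reverse=True)]
    context = "\n\n".join(ranked[:n])            # optionally expand chunks to neighbors first

    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    return llm(prompt)
```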
How to evaluate the system (often skipped)
Do not tune models first. Evaluate retrieval first.
Key questions:
- Does the correct chunk appear in top‑K?
- Is the correct section retrieved?
- Does reranking move the right chunk up?
- Can a human answer the question using retrieved context alone?
Metrics to track:
- recall@K
- section hit rate
- answer faithfulness
- citation correctness
If retrieval is wrong, generation cannot be right.
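A minimal recall@K harness. The eval set of question/gold-chunk pairs and the `retrieve(question, k)` function are both illustrative assumptions.

```python
def recall_at_k(eval_set, retrieve, k=10):
    """Fraction of questions whose gold chunk shows up in the top-k results.

    `eval_set` is assumed to be a list of {"question", "gold_chunk_id"} dicts
    and `retrieve(question, k)` to return ranked chunk ids.
    """
    hits = sum(
        1 for example in eval_set
        if example["gold_chunk_id"] in retrieve(example["question"], k)
    )
    return hits / len(eval_set)
```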
Common anti‑patterns
- Vector‑only retrieval
- Sentence‑level chunking everywhere
- Excessive overlap
- LLM‑only chunking by default
- Blaming hallucinations on the model
These usually mask upstream issues.
The boring but reliable truth
- Chunking determines what can be found
- Retrieval determines what is considered
- Reranking determines what is trusted
Models sit downstream of all three.
Good RAG systems are built from the top down, not the bottom up.
Final takeaway
If you remember only one thing:
RAG quality is a retrieval problem long before it is a generation problem.
Get chunking, retrieval, and reranking right — and the model suddenly looks much smarter.
