8. Corrective RAG (CRAG) - Self-Correcting Retrieval
Implement CRAG: a retrieval pipeline that grades its own results and automatically corrects poor retrievals before generating. This implementation pairs a zero-shot LLM grader with query rephrasing as the fallback, running locally on Ollama and ChromaDB.
Self-Correcting Retrieval with CRAG
Standard RAG blindly passes every retrieved chunk to the LLM - even irrelevant ones. CRAG adds a grading step: score each chunk, decide whether to keep it, discard it, or rephrase the query entirely before generating.
🩺The Problem with Blind Retrieval
Every RAG pipeline makes a silent assumption: the top-k chunks returned by the vector index are relevant to the question. This assumption breaks more often than you'd expect. Embeddings measure semantic similarity, not factual relevance - a chunk can be topically close to your question while being completely useless for answering it.
The consequences compound: irrelevant chunks dilute the context window, push useful information further from the start of the context (where LLMs tend to pay the most attention), and introduce noise that increases hallucination risk. Worse, nothing signals the failure - the LLM receives the garbage context and confidently generates a garbage answer.
⚙️How CRAG Works
CRAG inserts a correction step between retrieval and generation. In addition to asking "what chunks are similar to this question?", it asks "are these chunks actually useful for answering it?"
The original 2024 paper uses a trained classifier for grading. Our implementation uses a zero-shot LLM prompt - no training data, no fine-tuning, fully local via Ollama. The trade-off is likely lower grading precision in exchange for having no training pipeline to build or maintain.
What changes vs standard RAG
- Each chunk is graded before reaching the LLM
- Low-quality chunks are discarded from the context
- If nothing passes, the query is automatically rephrased
- The generator only sees evidence that passed the filter
What stays the same
- Vector search for initial retrieval (ChromaDB)
- Same FastAPI structure from Article 3
- Same Ollama generation model
- Same tenant isolation pattern
🎯The Three Grades
Every retrieved chunk is assigned one of three grades by the LLM scorer:
CORRECT
The chunk directly contains information that helps answer the question. Confidence ≥ 0.7. These chunks are passed to the generator unchanged.
AMBIGUOUS
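The chunk is topically related but does not directly answer the question. Confidence from 0.3 up to (but not including) 0.7. In the implementation below, AMBIGUOUS chunks are kept as supporting context, including alongside CORRECT chunks; only INCORRECT chunks are filtered out.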
INCORRECT
The chunk is off-topic, irrelevant, or about a different entity entirely. Confidence < 0.3. Always discarded - they do more harm than good in context.
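The strategy applied depends on the distribution of grades across all chunks: ALL_CORRECT when every chunk is graded CORRECT (all chunks are used), ALL_INCORRECT when every chunk is graded INCORRECT (the query is rephrased and retrieval is retried), and PARTIAL for any other mix (INCORRECT chunks are dropped, the rest are used).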
🔬Relevance Scorer
The scorer sends each chunk to Ollama with a structured prompt that demands a JSON response. Setting temperature=0 makes decoding greedy, so the same chunk should produce the same grade from run to run:
    _GRADE_PROMPT = """\
    You are a relevance grader. Your job is to decide whether a document chunk
    is useful for answering a question.

    Question: {question}
    Chunk: {chunk}

    Grade strictly:
    - CORRECT → chunk directly contains information that helps answer
    - AMBIGUOUS → loosely related but does not directly answer
    - INCORRECT → off-topic or about something else entirely

    Reply ONLY with valid JSON on one line:
    {{"grade": "CORRECT"|"AMBIGUOUS"|"INCORRECT", "confidence": 0.0-1.0, "reason": "one sentence"}}"""
The ScoredChunk dataclass carries the original text alongside the grading result so downstream code never needs to re-fetch anything:
    from dataclasses import dataclass
    from enum import Enum


    class Grade(str, Enum):
        CORRECT = "CORRECT"
        AMBIGUOUS = "AMBIGUOUS"
        INCORRECT = "INCORRECT"


    @dataclass
    class ScoredChunk:
        chunk_id: str
        text: str
        grade: Grade
        confidence: float  # 0.0-1.0
        reason: str        # one-line explanation from the LLM
    import ollama


    def score(
        question: str,
        chunks: list[tuple[str, str]],  # [(chunk_id, text), ...]
        model: str = "llama3.2:3b",
    ) -> list[ScoredChunk]:
        results = []
        for chunk_id, text in chunks:
            # One grading call per chunk; only the first 800 characters are sent
            resp = ollama.chat(
                model=model,
                messages=[{
                    "role": "user",
                    "content": _GRADE_PROMPT.format(question=question, chunk=text[:800]),
                }],
                options={"temperature": 0, "num_predict": 128},
            )
            grade, confidence, reason = _parse_grade(resp["message"]["content"])
            # Store the full text, not the truncated version sent to the grader
            results.append(ScoredChunk(chunk_id, text, grade, confidence, reason))
        return results
We truncate each chunk to 800 characters before sending it to the grader. The grader only needs to decide relevance - it does not need the full text. This cuts grading latency roughly in half for long chunks and usually leaves the grade unchanged, although a chunk whose relevant passage starts after the 800th character can be under-graded.
🔧Corrective Retriever
The retriever orchestrates the full CRAG pipeline: search → grade → select strategy → (optionally rephrase + retry) → generate:
    def retrieve_and_correct(
        question: str,
        collection: chromadb.Collection,
        model: str = "llama3.2:3b",
        k: int = 5,
        max_retry: int = 1,
    ) -> CRAGResult:
        # 1. Initial retrieval
        raw_chunks = _vector_search(collection, question, k)
        scored = rs.score(question, raw_chunks, model)
        strategy = _select_strategy(scored)

        # 2. Corrective step - rephrase if everything is incorrect.
        #    max_retry works as an on/off switch here: at most one retry runs.
        if strategy == Strategy.ALL_INCORRECT and max_retry > 0:
            rephrased = _rephrase(question, model)
            raw_chunks = _vector_search(collection, rephrased, k)
            # Grade against the original question, not the rephrased query
            scored = rs.score(question, raw_chunks, model)
            strategy = _select_strategy(scored)

        # 3. Select chunks to pass to the generator
        if strategy == Strategy.ALL_CORRECT:
            used = scored
        elif strategy == Strategy.PARTIAL:
            used = [c for c in scored if c.grade != Grade.INCORRECT]
        else:
            used = scored  # all incorrect even after rephrase - try anyway

        # 4. Generate
        answer = _generate(question, used, model)
        return CRAGResult(answer=answer, strategy=strategy, scored=scored, used=used, ...)
The strategy selection function is deliberately simple - three cases, no weights:
    def _select_strategy(scored: list[ScoredChunk]) -> Strategy:
        incorrect = sum(1 for c in scored if c.grade == Grade.INCORRECT)
        correct = sum(1 for c in scored if c.grade == Grade.CORRECT)
        if incorrect == len(scored):
            return Strategy.ALL_INCORRECT
        if correct == len(scored):
            return Strategy.ALL_CORRECT
        return Strategy.PARTIAL
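To see how the three branches behave on edge cases, here is a small standalone replay of the same rule over bare grade lists. The `Strategy` enum isn't shown in this article, so its definition here is an assumption (the values follow the "PARTIAL" string in the API response further down), and `select_strategy` is a renamed copy for illustration only:

```python
from enum import Enum


class Grade(str, Enum):
    CORRECT = "CORRECT"
    AMBIGUOUS = "AMBIGUOUS"
    INCORRECT = "INCORRECT"


class Strategy(str, Enum):  # assumed definition; not shown in the article
    ALL_CORRECT = "ALL_CORRECT"
    PARTIAL = "PARTIAL"
    ALL_INCORRECT = "ALL_INCORRECT"


def select_strategy(grades: list[Grade]) -> Strategy:
    """Same three-way rule as _select_strategy, applied to bare grades."""
    incorrect = sum(1 for g in grades if g == Grade.INCORRECT)
    correct = sum(1 for g in grades if g == Grade.CORRECT)
    if incorrect == len(grades):
        return Strategy.ALL_INCORRECT
    if correct == len(grades):
        return Strategy.ALL_CORRECT
    return Strategy.PARTIAL


C, A, I = Grade.CORRECT, Grade.AMBIGUOUS, Grade.INCORRECT
print(select_strategy([C, C, C]).value)  # ALL_CORRECT
print(select_strategy([C, A, I]).value)  # PARTIAL
print(select_strategy([A, A]).value)     # PARTIAL
print(select_strategy([I, I]).value)     # ALL_INCORRECT
print(select_strategy([]).value)         # ALL_INCORRECT
```

Two cases are worth noticing: a batch of only AMBIGUOUS chunks counts as PARTIAL, so nothing is filtered and no rephrase is attempted, and an empty retrieval counts as ALL_INCORRECT, so it triggers the rephrase path.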
✏️Query Rephrasing
When all chunks are graded INCORRECT, the original query failed to retrieve anything useful. Simply re-running the same query would return the same results. Instead, we ask the LLM to reformulate the query from a different angle:
    _REPHRASE_PROMPT = """\
    The following search query did not return relevant results from the knowledge base.
    Rephrase it to approach the topic from a different angle.
    Try synonyms, related concepts, or a more specific/general formulation.

    Original query: {query}

    Reply with ONLY the rephrased query - no explanation, no quotes."""
Note that we use temperature=0.3 for rephrasing - slightly above zero to encourage genuine variation rather than trivially restating the original query with different word order.
The rephrase step is capped at max_retry=1 by default. Adding more retries rarely helps - if two different phrasings both return irrelevant chunks, the knowledge base most likely does not contain the answer, and further retries add latency without improving recall.
🚀FastAPI Endpoint
    from typing import Annotated

    from fastapi import Query


    @router.post("/crag/query", response_model=QueryResponse)
    async def crag_query(
        q: Annotated[str, Query(description="Question")],
        tenant_id: Annotated[str, Query()] = "default",
        model: Annotated[str, Query()] = "llama3.2:3b",
        k: Annotated[int, Query(ge=2, le=10)] = 5,
        max_retry: Annotated[int, Query(ge=0, le=2)] = 1,
    ):
        collection = _get_collection(tenant_id)
        result = cr.retrieve_and_correct(
            question=q,
            collection=collection,
            model=model,
            k=k,
            max_retry=max_retry,
        )
        # IDs of the chunks that actually reached the generator
        used_ids = {c.chunk_id for c in result.used}
        return QueryResponse(
            answer=result.answer,
            strategy=result.strategy.value,
            rephrased=result.rephrased,
            chunks=[
                GradedChunkOut(
                    chunk_id=c.chunk_id,
                    text=c.text[:300],  # preview only
                    grade=c.grade.value,
                    confidence=round(c.confidence, 3),
                    reason=c.reason,
                    used=c.chunk_id in used_ids,
                )
                for c in result.scored
            ],
        )
The response exposes the full grading trace so you can inspect exactly why each chunk was kept or discarded. This is invaluable for debugging retrieval quality:
    {
      "answer": "The system uses HNSW indexing with ef_construction=200...",
      "strategy": "PARTIAL",
      "rephrased": "",
      "chunks": [
        {"chunk_id": "doc1-chunk3", "grade": "CORRECT", "confidence": 0.92, "used": true},
        {"chunk_id": "doc2-chunk1", "grade": "AMBIGUOUS", "confidence": 0.51, "used": true},
        {"chunk_id": "doc3-chunk7", "grade": "INCORRECT", "confidence": 0.18, "used": false}
      ]
    }
A second endpoint lets you test the scorer in isolation - useful when tuning the grading prompt for your specific domain:
    curl "http://localhost:8000/crag/grade?q=How+does+HNSW+indexing+work&chunk=HNSW+builds+a+hierarchical..."

    # response
    {"grade": "CORRECT", "confidence": 0.94, "reason": "Chunk directly explains HNSW construction"}
📊Grading Quality & Limits
LLM-based grading is powerful but not perfect. Know the failure modes:
| Failure mode | Symptom | Mitigation |
|---|---|---|
| Over-zealous grader | Grades INCORRECT too aggressively - discards good chunks, triggers unnecessary rephrase | Loosen the discard cutoff (e.g. lower the INCORRECT threshold below 0.3); use a larger model for grading |
| Under-zealous grader | Grades everything CORRECT - no filtering, same behaviour as standard RAG | Add domain-specific examples to the prompt |
| Grade inconsistency | Same chunk grades differently across calls | Always use temperature=0 for the grader |
| JSON parse failure | Model returns prose instead of JSON | Default to AMBIGUOUS on parse error - safe middle ground |
| Latency overhead | k=5 → 5 extra LLM calls before generation | Use a small fast model (llama3.2:3b) for grading, larger for generation |
Split-model strategy: run the grader with llama3.2:3b (fast, low-cost, good enough for coarse relevance grading) and the generator with llama3.1:8b or mistral:7b (slower, higher quality synthesis). The endpoint exposes a single model parameter for simplicity - split it into grade_model and gen_model for production use.
Grade Before You Generate
Two files and a grading prompt - that's all it takes to add a self-correcting layer to your RAG pipeline. The system no longer trusts that retrieved chunks are good; it verifies. Bad evidence is discarded, borderline evidence is kept, and when everything fails the query is automatically reformulated.
Next up: RAG with Structured Outputs - using JSON mode and Pydantic models to constrain the generator's response into machine-readable structures your application can directly consume.
→ Continue to Article 9: Structured Outputs with Pydantic