8. Corrective RAG (CRAG) - Self-Correcting Retrieval

Implement CRAG: a retrieval pipeline that grades its own results and automatically corrects poor retrievals before generating. Combines a local Ollama grader with an automatic query-rephrase retry, served through FastAPI.

Series · Article 8 of 10

Corrective RAG
Self-Correcting Retrieval with CRAG

Standard RAG blindly passes every retrieved chunk to the LLM - even irrelevant ones. CRAG adds a grading step: score each chunk, decide whether to keep it, discard it, or rephrase the query entirely before generating.

⏱ ~35 min build 🔧 ollama grader · query rephrase · fastapi 📦 Builds on Article 3

🩺 The Problem with Blind Retrieval

Every RAG pipeline makes a silent assumption: the top-k chunks returned by the vector index are relevant to the question. This assumption breaks more often than you'd expect. Embeddings measure semantic similarity, not factual relevance - a chunk can be topically close to your question while being completely useless for answering it.

The consequences compound: irrelevant chunks dilute the context window, push useful information further from the beginning (where LLMs pay the most attention), and introduce noise that increases hallucination risk. Worse, there is no signal - the LLM receives the garbage context and confidently generates a garbage answer.

⚠ Anti-pattern - context dilution
Passing all retrieved chunks to the LLM regardless of quality. A chunk about "connection pooling general best practices" in response to a question about "PostgreSQL deadlock detection" is semantically close but factually useless - and it consumes 25% of a k=4 context budget.
Instead: grade each chunk before generation. Discard INCORRECT chunks, keep CORRECT ones, and rephrase the query if everything scores poorly.
[Figure: CRAG pipeline. QUESTION ("How does X work?") → VECTOR SEARCH (k=5 chunks) → LLM GRADER (score each chunk: CORRECT / PARTIAL / INCORRECT) → GENERATE (curated chunks) → ANSWER (grounded ✓), with a REPHRASE branch (new query).]

⚙️ How CRAG Works

CRAG inserts a correction step between retrieval and generation. Rather than asking "what chunks are similar to this question?", it also asks "are these chunks actually useful for answering it?"

The original 2024 paper uses a trained classifier for grading. Our implementation uses a zero-shot LLM prompt instead - no training data, no fine-tuning, fully local via Ollama. The trade-off: grading is likely to be somewhat less precise than a purpose-trained evaluator, but there is nothing extra to train or host.

What changes vs standard RAG

  • Each chunk is graded before reaching the LLM
  • Low-quality chunks are discarded from the context
  • If nothing passes, the query is automatically rephrased
  • The generator only sees evidence that passed the filter

What stays the same

  • Vector search for initial retrieval (ChromaDB)
  • Same FastAPI structure from Article 3
  • Same Ollama generation model
  • Same tenant isolation pattern

🎯 The Three Grades

Every retrieved chunk is assigned one of three grades by the LLM scorer:

CORRECT

The chunk directly contains information that helps answer the question. Confidence ≥ 0.7. These chunks are passed to the generator unchanged.

AMBIGUOUS

The chunk is topically related but does not directly answer the question. Confidence 0.3-0.7. Kept as supporting context alongside any CORRECT chunks; only INCORRECT chunks are filtered out.

INCORRECT

The chunk is off-topic, irrelevant, or about a different entity entirely. Confidence < 0.3. Discarded whenever at least one chunk grades higher - they do more harm than good in context.
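
The three confidence bands can be read as a simple mapping. The sketch below is illustrative only: `grade_from_confidence` is a hypothetical helper, not part of the article's code, which takes the grade label directly from the LLM's JSON reply. It could be useful if you would rather enforce the bands in code than trust the label.

```python
from enum import Enum


class Grade(str, Enum):  # mirrors the enum defined in relevance_scorer.py
    CORRECT = "CORRECT"
    AMBIGUOUS = "AMBIGUOUS"
    INCORRECT = "INCORRECT"


def grade_from_confidence(confidence: float) -> Grade:
    """Map a 0.0-1.0 score onto the three bands (hypothetical helper)."""
    if confidence >= 0.7:
        return Grade.CORRECT
    if confidence >= 0.3:
        return Grade.AMBIGUOUS
    return Grade.INCORRECT


for value in (0.92, 0.51, 0.18):
    print(value, grade_from_confidence(value).value)
```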

The strategy applied depends on the distribution of grades across all chunks:

ALL_CORRECT
All chunks pass - use them all as-is. No filtering needed.
PARTIAL
Mixed results - keep CORRECT + AMBIGUOUS chunks, discard INCORRECT ones. The context shrinks but improves in quality.
ALL_INCORRECT
Nothing passes - rephrase the query from a different angle, search again. One retry attempt by default.

🔬 Relevance Scorer

The scorer sends each chunk to Ollama with a structured prompt that demands a JSON response. Setting temperature=0 makes grading as repeatable as possible - the same chunk should receive the same grade on every call:

python · relevance_scorer.py - grading prompt
_GRADE_PROMPT = """\
You are a relevance grader. Your job is to decide whether a document chunk
is useful for answering a question.

Question: {question}
Chunk: {chunk}

Grade strictly:
- CORRECT    → chunk directly contains information that helps answer
- AMBIGUOUS  → loosely related but does not directly answer
- INCORRECT  → off-topic or about something else entirely

Reply ONLY with valid JSON on one line:
{{"grade": "CORRECT"|"AMBIGUOUS"|"INCORRECT", "confidence": 0.0-1.0, "reason": "one sentence"}}"""

The ScoredChunk dataclass carries the original text alongside the grading result so downstream code never needs to re-fetch anything:

python · relevance_scorer.py
from dataclasses import dataclass
from enum import Enum


class Grade(str, Enum):
    CORRECT    = "CORRECT"
    AMBIGUOUS  = "AMBIGUOUS"
    INCORRECT  = "INCORRECT"


@dataclass
class ScoredChunk:
    chunk_id:   str
    text:       str
    grade:      Grade
    confidence: float   # 0.0-1.0
    reason:     str     # one-line explanation from the LLM

python · relevance_scorer.py - score()
import ollama


def score(
    question: str,
    chunks:   list[tuple[str, str]],   # [(chunk_id, text), ...]
    model:    str = "llama3.2:3b",
) -> list[ScoredChunk]:
    """Grade each chunk against the question, one LLM call per chunk."""
    results = []
    for chunk_id, text in chunks:
        resp = ollama.chat(
            model=model,
            messages=[{"role": "user",
                        "content": _GRADE_PROMPT.format(
                            question=question, chunk=text[:800])}],
            # temperature=0 for repeatable grades; num_predict caps the reply length
            options={"temperature": 0, "num_predict": 128},
        )
        # _parse_grade (not shown) returns (Grade, confidence, reason)
        grade, confidence, reason = _parse_grade(resp["message"]["content"])
        results.append(ScoredChunk(chunk_id, text, grade, confidence, reason))
    return results
💡 We truncate each chunk to 800 characters before sending it to the grader. The grader only needs to decide relevance, so it rarely needs the full text. This cuts grading latency roughly in half for long chunks, at the cost of occasionally missing evidence that appears only after the cutoff.

🔧 Corrective Retriever

The retriever orchestrates the full CRAG pipeline: search → grade → select strategy → (optionally rephrase + retry) → generate:

python · corrective_retriever.py - retrieve_and_correct()
import chromadb

# Module path is an assumption; adjust to your package layout.
import relevance_scorer as rs
from relevance_scorer import Grade, ScoredChunk


def retrieve_and_correct(
    question:   str,
    collection: chromadb.Collection,
    model:      str = "llama3.2:3b",
    k:          int = 5,
    max_retry:  int = 1,
) -> CRAGResult:
    # 1. Initial retrieval
    raw_chunks = _vector_search(collection, question, k)
    scored     = rs.score(question, raw_chunks, model)
    strategy   = _select_strategy(scored)
    rephrased  = ""   # stays empty when no rephrase is needed

    # 2. Corrective step - rephrase while everything is incorrect (at most max_retry times)
    retries = 0
    while strategy == Strategy.ALL_INCORRECT and retries < max_retry:
        retries   += 1
        rephrased  = _rephrase(question, model)
        raw_chunks = _vector_search(collection, rephrased, k)
        scored     = rs.score(question, raw_chunks, model)   # still graded against the original question
        strategy   = _select_strategy(scored)

    # 3. Select chunks to pass to the generator
    if strategy == Strategy.ALL_CORRECT:
        used = scored
    elif strategy == Strategy.PARTIAL:
        used = [c for c in scored if c.grade != Grade.INCORRECT]
    else:
        used = scored   # all incorrect even after rephrase - try anyway

    # 4. Generate
    answer = _generate(question, used, model)
    return CRAGResult(answer=answer, strategy=strategy, scored=scored, used=used,
                      rephrased=rephrased)   # any further fields are elided in the original

The strategy selection function is deliberately simple - three cases, no weights:

python · corrective_retriever.py - _select_strategy()
def _select_strategy(scored: list[ScoredChunk]) -> Strategy:
    incorrect = sum(1 for c in scored if c.grade == Grade.INCORRECT)
    correct   = sum(1 for c in scored if c.grade == Grade.CORRECT)
    if incorrect == len(scored):   # also true when nothing was retrieved
        return Strategy.ALL_INCORRECT
    if correct == len(scored):
        return Strategy.ALL_CORRECT
    return Strategy.PARTIAL
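
To see how the three cases behave, including the edge cases, here is a self-contained sketch that runs the same selection logic over bare grades. The `Strategy` enum is not shown in the article, so its definition here is an assumption based on the three strategy names; `select` is a simplified stand-in that takes grades rather than `ScoredChunk` objects.

```python
from enum import Enum


class Grade(str, Enum):
    CORRECT = "CORRECT"
    AMBIGUOUS = "AMBIGUOUS"
    INCORRECT = "INCORRECT"


class Strategy(str, Enum):  # assumed definition; the article only names the members
    ALL_CORRECT = "ALL_CORRECT"
    PARTIAL = "PARTIAL"
    ALL_INCORRECT = "ALL_INCORRECT"


def select(grades: list[Grade]) -> Strategy:
    """Same three-case logic as _select_strategy, applied to bare grades."""
    incorrect = sum(1 for g in grades if g == Grade.INCORRECT)
    correct = sum(1 for g in grades if g == Grade.CORRECT)
    if incorrect == len(grades):
        return Strategy.ALL_INCORRECT
    if correct == len(grades):
        return Strategy.ALL_CORRECT
    return Strategy.PARTIAL


COR, AMB, INC = Grade.CORRECT, Grade.AMBIGUOUS, Grade.INCORRECT
for grades in ([COR, COR, COR], [COR, AMB, INC], [AMB, AMB], [INC, INC, INC], []):
    print([g.value for g in grades], "->", select(grades).value)
```

Two edge cases are worth noticing: a result set that is entirely AMBIGUOUS counts as PARTIAL, so no rephrase is triggered, and an empty result set counts as ALL_INCORRECT because both the incorrect count and the length are zero.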

✏️ Query Rephrasing

When all chunks are graded INCORRECT, the original query failed to retrieve anything useful. Simply re-running the same query would return the same results. Instead, we ask the LLM to reformulate the query from a different angle:

python · corrective_retriever.py - rephrase prompt
_REPHRASE_PROMPT = """\
The following search query did not return relevant results from the knowledge base.
Rephrase it to approach the topic from a different angle. Try synonyms, related
concepts, or a more specific/general formulation.

Original query: {query}

Reply with ONLY the rephrased query - no explanation, no quotes."""

Note that we use temperature=0.3 for rephrasing - slightly above zero to encourage genuine variation rather than trivially restating the original query with different word order.

⚠️ The rephrase step is capped at max_retry=1 by default. Adding more retries rarely helps - if two different phrasings both return irrelevant chunks, the knowledge base most likely does not contain the answer, and further retries add latency without improving recall.

A concrete example of what rephrasing looks like in practice:

ORIGINAL: "What is the maximum throughput of the message queue?" → 0/5 chunks pass
REPHRASED: "message broker capacity limits events per second rate" → 3/5 chunks pass

🚀 FastAPI Endpoint

python · routers/crag.py
from typing import Annotated

from fastapi import APIRouter, Query

# cr (corrective_retriever), _get_collection, QueryResponse and GradedChunkOut
# come from the project and are not shown here.
router = APIRouter()


# Plain `def` rather than `async def`: the Ollama and ChromaDB calls block, so
# FastAPI should run this handler in its threadpool, not on the event loop.
@router.post("/crag/query", response_model=QueryResponse)
def crag_query(
    q:         Annotated[str, Query(description="Question")],
    tenant_id: Annotated[str, Query()] = "default",
    model:     Annotated[str, Query()] = "llama3.2:3b",
    k:         Annotated[int, Query(ge=2, le=10)] = 5,
    max_retry: Annotated[int, Query(ge=0, le=2)]  = 1,
):
    collection = _get_collection(tenant_id)
    result     = cr.retrieve_and_correct(
        question=q, collection=collection,
        model=model, k=k, max_retry=max_retry,
    )
    used_ids = {c.chunk_id for c in result.used}
    return QueryResponse(
        answer   = result.answer,
        strategy = result.strategy.value,
        rephrased= result.rephrased,
        chunks   = [
            GradedChunkOut(
                chunk_id=c.chunk_id, text=c.text[:300],
                grade=c.grade.value, confidence=round(c.confidence, 3),
                reason=c.reason, used=c.chunk_id in used_ids,
            )
            for c in result.scored
        ],
    )

The response exposes the full grading trace so you can inspect exactly why each chunk was kept or discarded. This is invaluable for debugging retrieval quality:

json · POST /crag/query - example response
{
  "answer":    "The system uses HNSW indexing with ef_construction=200...",
  "strategy":  "PARTIAL",
  "rephrased": "",
  "chunks": [
    {"chunk_id": "doc1-chunk3", "grade": "CORRECT",   "confidence": 0.92, "used": true},
    {"chunk_id": "doc2-chunk1", "grade": "AMBIGUOUS", "confidence": 0.51, "used": true},
    {"chunk_id": "doc3-chunk7", "grade": "INCORRECT", "confidence": 0.18, "used": false}
  ]
}

A second endpoint lets you test the scorer in isolation - useful when tuning the grading prompt for your specific domain:

bash · test the grader directly
curl "http://localhost:8000/crag/grade?q=How+does+HNSW+indexing+work&chunk=HNSW+builds+a+hierarchical..."

# response
{"grade": "CORRECT", "confidence": 0.94, "reason": "Chunk directly explains HNSW construction"}

📊 Grading Quality & Limits

LLM-based grading is powerful but not perfect. Know the failure modes:

Failure mode | Symptom | Mitigation
Over-zealous grader | Grades INCORRECT too aggressively - discards good chunks, triggers unnecessary rephrase | Raise confidence thresholds; use a larger model for grading
Under-zealous grader | Grades everything CORRECT - no filtering, same behaviour as standard RAG | Add domain-specific examples to the prompt
Grade inconsistency | Same chunk grades differently across calls | Always use temperature=0 for the grader
JSON parse failure | Model returns prose instead of JSON | Default to AMBIGUOUS on parse error - safe middle ground
Latency overhead | k=5 → 5 extra LLM calls before generation | Use a small fast model (llama3.2:3b) for grading, larger for generation

Split-model strategy: run the grader with llama3.2:3b (fast, low-cost, good binary decisions) and the generator with llama3.1:8b or mistral:7b (slower, higher quality synthesis). The endpoint exposes a single model parameter for simplicity - split it into grade_model and gen_model for production use.
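
One way to make that split explicit is a small settings object passed through the pipeline. This is only a sketch: `CragModels` and `plan_calls` are hypothetical names, and the default model tags are simply the ones mentioned above. `plan_calls` lists the LLM calls for a single query without a rephrase, which makes the k grading calls plus one generation call visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CragModels:
    """Which Ollama model serves each stage (hypothetical settings object)."""
    grade_model: str = "llama3.2:3b"   # small and fast: one call per chunk
    gen_model: str = "llama3.1:8b"     # larger: one call per answer


def plan_calls(k: int, models: CragModels) -> list[tuple[str, str]]:
    """List the (stage, model) LLM calls for one query with no rephrase."""
    return [("grade", models.grade_model)] * k + [("generate", models.gen_model)]


for stage, model in plan_calls(k=5, models=CragModels()):
    print(f"{stage:<9} {model}")
```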

Grade Before You Generate

Two files and a grading prompt - that's all it takes to add a self-correcting layer to your RAG pipeline. The system no longer trusts that retrieved chunks are good; it verifies. Bad evidence is discarded, borderline evidence is kept, and when everything fails the query is automatically reformulated.

Next up: RAG with Structured Outputs - using JSON mode and Pydantic models to constrain the generator's response into machine-readable structures your application can directly consume.

→ Continue to Article 9: Structured Outputs with Pydantic