8. Corrective RAG (CRAG) - Self-Correcting Retrieval

🩺 The Problem with Blind Retrieval

Every RAG pipeline makes a silent assumption: the top-k chunks returned by the vector index are relevant to the question. This assumption breaks more often than you'd expect. Embeddings measure semantic similarity, not factual relevance - a chunk can be topically close to your question while being completely useless for answering it.

The consequences compound: irrelevant chunks dilute the context window, push useful information further from the beginning (where LLMs pay the most attention), and introduce noise that increases hallucination risk. Worse, there is no signal - the LLM receives the garbage context and confidently generates a garbage answer.

⚠ Anti-pattern - context dilution

Passing all retrieved chunks to the LLM regardless of quality. A chunk about "connection pooling general best practices" in response to a question about "PostgreSQL deadlock detection" is semantically close but factually useless - and it consumes 25% of a k=4 context budget.

Grade each chunk before generation. Discard INCORRECT chunks, keep CORRECT ones, and rephrase the query if everything scores poorly.

⚙️ How CRAG Works

CRAG inserts a correction step between retrieval and generation. Rather than asking "what chunks are similar to this question?", it also asks "are these chunks actually useful for answering it?"

The original 2024 paper uses a trained classifier for grading. Our implementation uses a zero-shot LLM prompt - no training data, no fine-tuning, fully local via Ollama. The trade-off is slightly lower precision but zero infrastructure cost.

What changes vs standard RAG

Each chunk is graded before reaching the LLM
Low-quality chunks are discarded from the context
If nothing passes, the query is automatically rephrased
The generator only sees evidence that passed the filter

What stays the same

Vector search for initial retrieval (ChromaDB)
Same FastAPI structure from Article 3
Same Ollama generation model
Same tenant isolation pattern

🎯 The Three Grades

Every retrieved chunk is assigned one of three grades by the LLM scorer:

✅ CORRECT

The chunk directly contains information that helps answer the question. Confidence ≥ 0.7. These chunks are passed to the generator unchanged.

⚠️ AMBIGUOUS

The chunk is topically related but does not directly answer the question. Confidence 0.3-0.7. Kept alongside CORRECT chunks under the PARTIAL strategy; only INCORRECT chunks are dropped.

❌ INCORRECT

The chunk is off-topic, irrelevant, or about a different entity entirely. Confidence < 0.3. Discarded whenever any other chunk passes - these chunks do more harm than good in context. They are used only as a last resort, when every chunk is still INCORRECT after the rephrase.
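The confidence bands above can be read as a simple mapping. The sketch below is one possible interpretation, assuming `confidence` behaves like a relevance score in [0, 1]. The helper name `grade_for_confidence` is hypothetical and is not part of the article's scorer, which takes the grade label directly from the LLM reply.

```python
from enum import Enum


class Grade(str, Enum):
    CORRECT = "CORRECT"
    AMBIGUOUS = "AMBIGUOUS"
    INCORRECT = "INCORRECT"


def grade_for_confidence(confidence: float) -> Grade:
    """Map a score in [0, 1] to the three bands (hypothetical helper)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {confidence}")
    if confidence >= 0.7:
        return Grade.CORRECT
    if confidence >= 0.3:
        return Grade.AMBIGUOUS
    return Grade.INCORRECT
```

A helper like this could serve as a sanity check on the grader: a reply whose label and confidence fall in different bands is a candidate for closer inspection.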

The strategy applied depends on the distribution of grades across all chunks:

ALL_CORRECT

All chunks pass - use them all as-is. No filtering needed.

PARTIAL

Mixed results - keep CORRECT + AMBIGUOUS chunks, discard INCORRECT ones. The context shrinks but improves in quality.

ALL_INCORRECT

Nothing passes - rephrase the query from a different angle, search again. One retry attempt by default.

🔬 Relevance Scorer

The scorer sends each chunk to Ollama with a structured prompt that demands a JSON response. temperature=0 makes grading effectively deterministic - the same chunk should produce the same grade on every call:

python · relevance_scorer.py - grading prompt

_GRADE_PROMPT = """\
You are a relevance grader. Your job is to decide whether a document chunk
is useful for answering a question.

Question: {question}
Chunk: {chunk}

Grade strictly:
- CORRECT    → chunk directly contains information that helps answer
- AMBIGUOUS  → loosely related but does not directly answer
- INCORRECT  → off-topic or about something else entirely

Reply ONLY with valid JSON on one line:
{{"grade": "CORRECT"|"AMBIGUOUS"|"INCORRECT", "confidence": 0.0-1.0, "reason": "one sentence"}}"""

The ScoredChunk dataclass carries the original text alongside the grading result so downstream code never needs to re-fetch anything:

python · relevance_scorer.py

from dataclasses import dataclass
from enum import Enum

class Grade(str, Enum):
    CORRECT    = "CORRECT"
    AMBIGUOUS  = "AMBIGUOUS"
    INCORRECT  = "INCORRECT"

@dataclass
class ScoredChunk:
    chunk_id:   str
    text:       str     # full chunk text, not the truncated version sent to the grader
    grade:      Grade
    confidence: float   # 0.0-1.0
    reason:     str     # one-line explanation from the LLM

python · relevance_scorer.py - score()

import ollama

def score(
    question: str,
    chunks:   list[tuple[str, str]],   # [(chunk_id, text), ...]
    model:    str = "llama3.2:3b",
) -> list[ScoredChunk]:
    results: list[ScoredChunk] = []
    for chunk_id, text in chunks:
        # One grading call per chunk; the grader sees at most the first 800 characters
        resp = ollama.chat(
            model=model,
            messages=[{"role": "user",
                        "content": _GRADE_PROMPT.format(
                            question=question, chunk=text[:800])}],
            options={"temperature": 0, "num_predict": 128},
        )
        # _parse_grade turns the raw reply into (Grade, confidence, reason)
        grade, confidence, reason = _parse_grade(resp["message"]["content"])
        # Store the full, untruncated text for the generator
        results.append(ScoredChunk(chunk_id, text, grade, confidence, reason))
    return results

💡 We truncate each chunk to 800 characters before sending it to the grader. The grader only needs to decide relevance - it does not need the full text. This cuts grading latency roughly in half for long chunks and rarely changes the grade, provided the relevant passage falls within the first 800 characters.

🔧 Corrective Retriever

The retriever orchestrates the full CRAG pipeline: search → grade → select strategy → (optionally rephrase + retry) → generate:

python · corrective_retriever.py - retrieve_and_correct()

import chromadb

import relevance_scorer as rs          # adjust to your package layout
from relevance_scorer import Grade

def retrieve_and_correct(
    question:   str,
    collection: chromadb.Collection,
    model:      str = "llama3.2:3b",
    k:          int = 5,
    max_retry:  int = 1,
) -> CRAGResult:
    # 1. Initial retrieval
    raw_chunks = _vector_search(collection, question, k)
    scored     = rs.score(question, raw_chunks, model)
    strategy   = _select_strategy(scored)
    rephrased  = ""   # stays empty unless a rephrase actually happens

    # 2. Corrective step - rephrase and retry (up to max_retry times)
    #    while everything is still incorrect
    for _ in range(max_retry):
        if strategy != Strategy.ALL_INCORRECT:
            break
        rephrased  = _rephrase(question, model)
        raw_chunks = _vector_search(collection, rephrased, k)
        # Search with the rephrased query, but grade against the original question
        scored     = rs.score(question, raw_chunks, model)
        strategy   = _select_strategy(scored)

    # 3. Select chunks to pass to the generator
    if strategy == Strategy.ALL_CORRECT:
        used = scored
    elif strategy == Strategy.PARTIAL:
        used = [c for c in scored if c.grade != Grade.INCORRECT]
    else:
        used = scored   # all incorrect even after rephrase - try anyway

    # 4. Generate
    answer = _generate(question, used, model)
    return CRAGResult(
        answer=answer, strategy=strategy, scored=scored, used=used,
        rephrased=rephrased,   # read by the endpoint; further fields elided in the original
    )

The strategy selection function is deliberately simple - three cases, no weights:

python · corrective_retriever.py - _select_strategy()

def _select_strategy(scored: list[ScoredChunk]) -> Strategy:
    # Count the two extreme grades; anything in between is PARTIAL
    incorrect = sum(1 for c in scored if c.grade == Grade.INCORRECT)
    correct   = sum(1 for c in scored if c.grade == Grade.CORRECT)
    if incorrect == len(scored):
        return Strategy.ALL_INCORRECT
    if correct == len(scored):
        return Strategy.ALL_CORRECT
    return Strategy.PARTIAL
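Two edge cases follow from this counting logic and are easy to miss: a batch of only AMBIGUOUS chunks lands in PARTIAL, and an empty result list counts as ALL_INCORRECT (0 == 0), which triggers a rephrase. The sketch below re-declares minimal stand-ins so it runs on its own. The `Strategy` values are assumed from the strategy names used in the article, and `select_strategy` works on bare grades rather than `ScoredChunk` objects.

```python
from enum import Enum


class Grade(str, Enum):
    CORRECT = "CORRECT"
    AMBIGUOUS = "AMBIGUOUS"
    INCORRECT = "INCORRECT"


class Strategy(str, Enum):  # values assumed from the names used in the article
    ALL_CORRECT = "ALL_CORRECT"
    PARTIAL = "PARTIAL"
    ALL_INCORRECT = "ALL_INCORRECT"


def select_strategy(grades: list[Grade]) -> Strategy:
    """Same three-case logic as _select_strategy, applied to bare grades."""
    incorrect = sum(g == Grade.INCORRECT for g in grades)
    correct = sum(g == Grade.CORRECT for g in grades)
    if incorrect == len(grades):
        return Strategy.ALL_INCORRECT
    if correct == len(grades):
        return Strategy.ALL_CORRECT
    return Strategy.PARTIAL


# Only AMBIGUOUS chunks: neither extreme applies, so the result is PARTIAL
print(select_strategy([Grade.AMBIGUOUS, Grade.AMBIGUOUS]).value)
# Empty retrieval: 0 == 0, so the rephrase path is taken
print(select_strategy([]).value)
```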

✏️ Query Rephrasing

When all chunks are graded INCORRECT, the original query failed to retrieve anything useful. Simply re-running the same query would return the same results. Instead, we ask the LLM to reformulate the query from a different angle:

python · corrective_retriever.py - rephrase prompt

_REPHRASE_PROMPT = """\
The following search query did not return relevant results from the knowledge base.
Rephrase it to approach the topic from a different angle. Try synonyms, related
concepts, or a more specific/general formulation.

Original query: {query}

Reply with ONLY the rephrased query - no explanation, no quotes."""

Note that we use temperature=0.3 for rephrasing - slightly above zero to encourage genuine variation rather than trivially restating the original query with different word order.

⚠️ The rephrase step is capped at max_retry=1 by default. Adding more retries rarely helps - if two different phrasings both return irrelevant chunks, the knowledge base most likely does not contain the answer. Further retries add latency and rarely improve recall.

A concrete example of what rephrasing looks like in practice:

ORIGINAL

"What is the maximum throughput of the message queue?" → 0/5 chunks pass

REPHRASED

"message broker capacity limits events per second rate" → 3/5 chunks pass

🚀 FastAPI Endpoint

python · routers/crag.py

from typing import Annotated

from fastapi import APIRouter, Query

import corrective_retriever as cr   # adjust to your package layout

router = APIRouter()

# Plain def rather than async def: FastAPI runs it in a threadpool, so the
# blocking Ollama and ChromaDB calls do not stall the event loop.
@router.post("/crag/query", response_model=QueryResponse)
def crag_query(
    q:         Annotated[str, Query(description="Question")],
    tenant_id: Annotated[str, Query()] = "default",
    model:     Annotated[str, Query()] = "llama3.2:3b",
    k:         Annotated[int, Query(ge=2, le=10)] = 5,
    max_retry: Annotated[int, Query(ge=0, le=2)]  = 1,
):
    collection = _get_collection(tenant_id)
    result     = cr.retrieve_and_correct(
        question=q, collection=collection,
        model=model, k=k, max_retry=max_retry,
    )
    # Mark which graded chunks actually reached the generator
    used_ids = {c.chunk_id for c in result.used}
    return QueryResponse(
        answer   = result.answer,
        strategy = result.strategy.value,
        rephrased= result.rephrased,
        chunks   = [
            GradedChunkOut(
                chunk_id=c.chunk_id, text=c.text[:300],
                grade=c.grade.value, confidence=round(c.confidence, 3),
                reason=c.reason, used=c.chunk_id in used_ids,
            )
            for c in result.scored
        ],
    )

The response exposes the full grading trace so you can inspect exactly why each chunk was kept or discarded. This is invaluable for debugging retrieval quality:

json · POST /crag/query - example response

{
  "answer":    "The system uses HNSW indexing with ef_construction=200...",
  "strategy":  "PARTIAL",
  "rephrased": "",
  "chunks": [
    {"chunk_id": "doc1-chunk3", "grade": "CORRECT",   "confidence": 0.92, "used": true},
    {"chunk_id": "doc2-chunk1", "grade": "AMBIGUOUS", "confidence": 0.51, "used": true},
    {"chunk_id": "doc3-chunk7", "grade": "INCORRECT", "confidence": 0.18, "used": false}
  ]
}

A second endpoint lets you test the scorer in isolation - useful when tuning the grading prompt for your specific domain:

bash · test the grader directly

curl "http://localhost:8000/crag/grade?q=How+does+HNSW+indexing+work&chunk=HNSW+builds+a+hierarchical..."

# response
{"grade": "CORRECT", "confidence": 0.94, "reason": "Chunk directly explains HNSW construction"}

📊 Grading Quality & Limits

LLM-based grading is powerful but not perfect. Know the failure modes:

| Failure mode | Symptom | Mitigation |
|---|---|---|
| Over-zealous grader | Grades INCORRECT too aggressively - discards good chunks, triggers unnecessary rephrase | Require a higher confidence before accepting an INCORRECT grade; use a larger model for grading |
| Under-zealous grader | Grades everything CORRECT - no filtering, same behaviour as standard RAG | Add domain-specific examples to the prompt |
| Grade inconsistency | Same chunk grades differently across calls | Always use `temperature=0` for the grader |
| JSON parse failure | Model returns prose instead of JSON | Default to AMBIGUOUS on parse error - safe middle ground |
| Latency overhead | k=5 → 5 extra LLM calls before generation | Use a small fast model (`llama3.2:3b`) for grading, larger for generation |

✅ Split-model strategy: run the grader with llama3.2:3b (fast, low-cost, good at simple classification) and the generator with llama3.1:8b or mistral:7b (slower, higher quality synthesis). The endpoint exposes a single model parameter for simplicity - split it into grade_model and gen_model for production use.
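One way the split might look is sketched below. This is an illustration under stated assumptions, not the article's code: `grade_then_generate` is a hypothetical function, `chat` is any callable with the same call shape as `ollama.chat`, and both prompts are simplified stand-ins for the real grading and generation prompts.

```python
from typing import Callable

ChatFn = Callable[..., dict]  # same call shape as ollama.chat


def grade_then_generate(
    question: str,
    chunks: list[str],
    chat: ChatFn,
    grade_model: str = "llama3.2:3b",  # small and fast: one call per chunk
    gen_model: str = "llama3.1:8b",    # larger: a single synthesis call
) -> str:
    kept = []
    for text in chunks:
        reply = chat(
            model=grade_model,
            messages=[{"role": "user", "content": (
                f"Question: {question}\nChunk: {text[:800]}\n"
                "Reply with CORRECT, AMBIGUOUS or INCORRECT.")}],
            options={"temperature": 0},
        )
        # Simplified grading: drop a chunk only when the reply says INCORRECT
        if "INCORRECT" not in reply["message"]["content"].upper():
            kept.append(text)
    # If nothing passes, fall back to all chunks, mirroring the "try anyway" branch
    context = "\n\n".join(kept or chunks)
    reply = chat(
        model=gen_model,
        messages=[{"role": "user", "content": (
            f"Answer using only this context:\n{context}\n\nQuestion: {question}")}],
        options={"temperature": 0},
    )
    return reply["message"]["content"]
```

With real models this would be called as `grade_then_generate(question, chunks, ollama.chat)`. The k grading calls go to the small model and only the final call pays for the larger one.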

Grade Before You Generate

Two files and a grading prompt - that's all it takes to add a self-correcting layer to your RAG pipeline. The system no longer trusts that retrieved chunks are good; it verifies. Bad evidence is discarded, borderline evidence is kept, and when everything fails the query is automatically reformulated.

Next up: RAG with Structured Outputs - using JSON mode and Pydantic models to constrain the generator's response into machine-readable structures your application can directly consume.

→ Continue to Article 9: Structured Outputs with Pydantic

References

01 Yan et al. - Corrective Retrieval Augmented Generation (CRAG), 2024
02 ChromaDB - Official documentation
03 Ollama - Run large language models locally
04 FastAPI - Modern, fast web framework for building APIs with Python

10 RAG Projects That Teach Real-World AI Engineering · Article 8 of 10 · Tags: RAG, CRAG, Corrective RAG, Ollama, Python, FastAPI

Idir Mellaz