9. RAG with Structured Outputs - JSON Mode + Pydantic

🚧Why Free Text Fails in Production

Every RAG implementation in this series so far has returned a string. That works fine when the answer is read by a human. It breaks the moment another piece of code needs to consume it.

Consider what you actually need from a RAG system in a real application:

Which source chunks supported the answer? (For citations in a UI)
How confident is the model? (To decide whether to fall back to a human)
Did the model genuinely find an answer, or is it hallucinating? (The explicit "I don't know" signal)
What's the chain of reasoning? (For audit logs and debugging)

You could parse all of this from free text - but regex over LLM output is fragile, model-specific, and breaks with every prompt change. The alternative is to make the structure a first-class contract: define it once in Pydantic, inject it into the prompt, and validate on the way out.
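To make the fragility concrete, here is a minimal sketch using only the standard library. The two reply strings and the regex are invented for illustration; they are not taken from any real model output.

```python
import json
import re
from typing import Optional

# Two phrasings a model might use for the same fact (illustrative strings).
reply_v1 = "Answer: 60 seconds. Confidence: 0.9"
reply_v2 = "I'm fairly confident (about 90%) that the window is 60 seconds."

# A regex tuned to the first phrasing.
CONFIDENCE_RE = re.compile(r"Confidence:\s*([0-9.]+)")


def confidence_from_text(reply: str) -> Optional[float]:
    """Return the confidence parsed from free text, or None if the pattern misses."""
    match = CONFIDENCE_RE.search(reply)
    return float(match.group(1)) if match else None


def confidence_from_json(reply: str) -> float:
    """Read the same value from a JSON reply: a key lookup, no pattern matching."""
    return float(json.loads(reply)["confidence"])


print(confidence_from_text(reply_v1))  # 0.9
print(confidence_from_text(reply_v2))  # None: same meaning, pattern no longer matches
print(confidence_from_json('{"answer": "60 seconds", "confidence": 0.9}'))  # 0.9
```

The regex is tied to one wording; the JSON lookup only depends on the key name, which is exactly what the schema pins down.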

💡

The core idea: A Pydantic Field(description="...") is not just documentation - it becomes a precise instruction to the LLM when you inject model_json_schema() into the system prompt. Write good field descriptions, get better outputs.
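A quick way to see this is to print what Pydantic v2 generates for a single field. The `Verdict` model below is a hypothetical one-field stand-in, used only for this sketch.

```python
from pydantic import BaseModel, Field


class Verdict(BaseModel):
    # Hypothetical one-field model, used only for this illustration.
    confidence: float = Field(
        ge=0.0, le=1.0,
        description="How fully the provided context supports the answer (0=guessing, 1=fully supported)",
    )


schema = Verdict.model_json_schema()
field_schema = schema["properties"]["confidence"]

# The description and the numeric bounds travel with the schema,
# so whatever text you write here is what the model will read.
print(field_schema["description"])
print(field_schema["minimum"], field_schema["maximum"])
```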

⚙️Two Mechanisms Working Together

Structured outputs in Ollama rely on two independent mechanisms that reinforce each other:

1 · format="json" (grammar decoding)

Activates grammar-constrained token sampling
The model's output is token-level forced to be valid JSON
Prevents runaway text before or after the JSON object
Does not enforce specific field names or types

2 · Schema in system prompt (semantic contract)

Inject the JSON schema verbatim into the system message
Field description values guide the model's choices
The model knows which fields to populate and what values are valid
Pydantic validates on exit - rejects structurally wrong output

Neither mechanism alone is sufficient. format="json" guarantees syntactically valid JSON (provided generation is not cut off by the num_predict limit) - but the model might output {"result": "some text"} with no citations, no confidence score, and no structure you can use. The schema injection guides it to the right fields, and Pydantic validation catches the cases where it still misses. Together they give you a strong contract at low latency cost.
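The gap between "valid JSON" and "valid answer" can be shown in a few lines. This is a sketch: `MiniAnswer` is a trimmed-down stand-in for the real schema, not the model used in the article.

```python
import json

from pydantic import BaseModel, ValidationError


class MiniAnswer(BaseModel):
    # Trimmed-down stand-in for StructuredAnswer (illustration only).
    answer: str
    confidence: float
    cannot_answer: bool


raw = '{"result": "some text"}'

# Mechanism 1 is satisfied: this parses as JSON without complaint.
parsed = json.loads(raw)

# Mechanism 2 is not: none of the required fields are present.
try:
    MiniAnswer.model_validate_json(raw)
    missing = []
except ValidationError as exc:
    missing = sorted(str(err["loc"][0]) for err in exc.errors() if err["type"] == "missing")

print(parsed)   # {'result': 'some text'}
print(missing)  # ['answer', 'cannot_answer', 'confidence']
```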

⚠️

Model capability matters. Smaller models (3B) follow schema instructions less reliably than larger ones (8B+). The fallback parser in structured_rag.py handles validation failures gracefully - but if you need high schema compliance on 3B models, include concrete examples in the prompt.
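One way to include a concrete example is to append a single valid response to the system prompt. The sketch below assumes a hypothetical helper `with_example` and a made-up example payload; neither comes from structured_rag.py.

```python
import json

# Hypothetical worked example; the chunk ID and text are made up for illustration.
EXAMPLE_OUTPUT = {
    "answer": "Requests are limited with a sliding window counter.",
    "citations": [
        {
            "chunk_id": "doc_1_chunk_0",
            "excerpt": "We use a sliding window counter",
            "relevance_score": 0.9,
        }
    ],
    "confidence": 0.9,
    "cannot_answer": False,
    "reasoning": "doc_1_chunk_0 states the algorithm directly.",
}


def with_example(system_prompt: str, example: dict) -> str:
    """Append one concrete, valid output to the system prompt."""
    rendered = json.dumps(example, indent=2)
    return (
        f"{system_prompt}\n\n"
        "Example of a valid response (format only, do not copy its content):\n"
        f"{rendered}"
    )


prompt = with_example("You MUST respond with valid JSON matching the schema.", EXAMPLE_OUTPUT)
print(prompt)
```

If the system prompt is built with str.format(), append the example after formatting: the braces in the JSON would otherwise be read as format placeholders.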

🏗️Designing the Output Schema

The schema is the most important design decision. Every field you add costs tokens in the prompt (the schema is injected verbatim) and increases the risk of compliance failures in small models. Every field you omit is information you can't recover downstream.

python schemas.py

from pydantic import BaseModel, Field


class Citation(BaseModel):
    chunk_id: str = Field(
        description="ID of the source chunk (must match an ID from the provided context)"
    )
    excerpt: str = Field(
        description="Verbatim quote (≤150 chars) from the chunk that directly supports the answer"
    )
    relevance_score: float = Field(
        ge=0.0, le=1.0,
        description="How essential this chunk is to the answer (0=tangential, 1=directly answers it)",
    )


class StructuredAnswer(BaseModel):
    answer: str = Field(
        description="Direct, concise answer to the question in 1-3 sentences"
    )
    citations: list[Citation] = Field(
        description="Chunks that were used to construct the answer - omit irrelevant chunks"
    )
    confidence: float = Field(
        ge=0.0, le=1.0,
        description="How fully the provided context supports the answer (0=guessing, 1=fully supported)",
    )
    cannot_answer: bool = Field(
        description="Set true when the context lacks sufficient information - do NOT hallucinate"
    )
    reasoning: str = Field(
        description="One sentence explaining how the answer was derived from the cited chunks"
    )

The schema has two levels: a top-level StructuredAnswer and a nested Citation. Let's walk through the field choices:

Field	Type	Why it's there
answer	str	The actual answer - same as classic RAG, but now co-located with metadata
citations	list[Citation]	Machine-readable provenance; chunk IDs let the UI link back to sources
confidence	float [0,1]	Self-assessed coverage score; lets downstream code threshold "unsure" answers
cannot_answer	bool	Explicit "I don't know" - prevents hallucination from being reported as an answer
reasoning	str	One-sentence CoT trace - invaluable for debugging wrong answers
Citation.excerpt	str (≤150)	Verbatim quote - verifiable against the source chunk, not a paraphrase
Citation.relevance_score	float [0,1]	Per-citation weight; lets the UI highlight the most important source
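The [0,1] ranges and the required boolean in the table are enforced at validation time, not just described. A minimal sketch, using a hypothetical two-field model `Scored` with the same bounds as the confidence field:

```python
from pydantic import BaseModel, Field, ValidationError


class Scored(BaseModel):
    # Hypothetical stand-in with the same bounds as StructuredAnswer.confidence.
    confidence: float = Field(ge=0.0, le=1.0)
    cannot_answer: bool


def is_valid(payload: dict) -> bool:
    """True when the payload passes validation, False when Pydantic rejects it."""
    try:
        Scored.model_validate(payload)
        return True
    except ValidationError:
        return False


print(is_valid({"confidence": 0.8, "cannot_answer": False}))  # True
print(is_valid({"confidence": 1.7, "cannot_answer": False}))  # False: above le=1.0
print(is_valid({"confidence": 0.8}))                          # False: cannot_answer missing
```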

💡

Description quality is prompt quality. Compare description="confidence" (vague) vs description="How fully the provided context supports the answer (0=guessing, 1=fully supported)" (precise). The second version gives the model an unambiguous grounding scale. Write field descriptions like you're writing a rubric for a junior analyst.

💉Schema Injection in the Prompt

The magic line is StructuredAnswer.model_json_schema(). Pydantic v2 extracts the full JSON Schema for your model - including nested models, field constraints, and all descriptions - as a Python dict. You serialize it to a string and drop it verbatim into the system prompt.

python structured_rag.py

import json
from schemas import StructuredAnswer

# The system prompt template - {schema} is replaced at call time
_SYSTEM_TEMPLATE = """\
You are a precise question-answering assistant. Answer the question using ONLY the provided context.

You MUST respond with valid JSON that exactly matches this schema:
{schema}

Rules:
- Set cannot_answer=true if the context lacks enough information to answer confidently
- citations must reference real chunk_ids from the context - never invent IDs
- excerpt must be a verbatim quote (≤150 chars), not a paraphrase
- confidence reflects context coverage, not your general knowledge
- Output ONLY the JSON object - no explanation, no markdown fences"""

# At query time:
schema_str = json.dumps(StructuredAnswer.model_json_schema(), indent=2)
system     = _SYSTEM_TEMPLATE.format(schema=schema_str)

This approach has a nice property: when you add a field to StructuredAnswer, the prompt updates automatically. There is no separate "prompt schema" to maintain.

What does the injected schema actually look like? Here's a simplified excerpt of what model_json_schema() produces for our model:

json extracted schema (excerpt)

{
  "type": "object",
  "properties": {
    "answer": {
      "type": "string",
      "description": "Direct, concise answer to the question in 1-3 sentences"
    },
    "confidence": {
      "type": "number",
      "minimum": 0.0,
      "maximum": 1.0,
      "description": "How fully the provided context supports the answer (0=guessing, 1=fully supported)"
    },
    "cannot_answer": {
      "type": "boolean",
      "description": "Set true when the context lacks sufficient information - do NOT hallucinate"
    },
    ...
  },
  "required": ["answer", "citations", "confidence", "cannot_answer", "reasoning"]
}

The LLM sees the field names, types, constraints (minimum/maximum), and your descriptions, plus a required array listing the fields it must always populate. That's everything it needs to produce a valid response - derived entirely from your Python type annotations.
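The excerpt above is simplified. In the full output, Pydantic v2 places nested models under a "$defs" key and references them from the parent field. The sketch below uses shortened stand-in models (fewer fields, shorter descriptions) to show that layout.

```python
from pydantic import BaseModel, Field


class Citation(BaseModel):
    chunk_id: str = Field(description="ID of the source chunk")


class Answer(BaseModel):
    # Shortened stand-in for StructuredAnswer (illustration only).
    answer: str = Field(description="Direct, concise answer")
    citations: list[Citation] = Field(description="Chunks used to construct the answer")


schema = Answer.model_json_schema()

# The nested model is defined once under "$defs" ...
print(list(schema["$defs"]))                       # ['Citation']
# ... and the list field points at it by reference.
print(schema["properties"]["citations"]["items"])  # {'$ref': '#/$defs/Citation'}
# Fields without a default are listed as required.
print(schema["required"])                          # ['answer', 'citations']
```

The model therefore has to follow a reference to find the citation fields, which is one more reason small models benefit from a concrete example.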

🔁The Structured RAG Pipeline

The full pipeline is four steps. Step 1 is identical to classic RAG; steps 2 to 4 are what make it structured.

🔍

1 · Vector Search

Retrieve k chunks from ChromaDB. Same as every other article in this series.

📝

2 · Schema Injection

Call model_json_schema(), serialize to JSON string, embed in system prompt.

🤖

3 · Ollama format=json

Grammar-constrained decoding ensures the raw output is always valid JSON, no fences.

✅

4 · Pydantic Validation

model_validate_json() checks types, ranges, and required fields. Fallback on failure.

python structured_rag.py

import chromadb
import ollama


def query(
    question:   str,
    collection: chromadb.Collection,
    model:      str = "llama3.2:3b",
    k:          int = 5,
) -> StructuredAnswer:
    # 1. Retrieve k chunks
    chunks = _vector_search(collection, question, k)
    if not chunks:
        return StructuredAnswer(
            answer="No documents found in this collection.",
            citations=[], confidence=0.0, cannot_answer=True,
            reasoning="vector search returned no results",
        )

    # 2. Build schema-aware system prompt
    schema_str = json.dumps(StructuredAnswer.model_json_schema(), indent=2)
    context    = "\n\n".join(f"[{cid}]\n{text}" for cid, text in chunks)
    system     = _SYSTEM_TEMPLATE.format(schema=schema_str)

    # 3. Call Ollama with format="json" (grammar-constrained decoding)
    resp = ollama.chat(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user",   "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        format="json",
        options={"temperature": 0, "num_predict": 1024},
    )
    raw = resp["message"]["content"].strip()

    # 4. Validate with Pydantic
    return _parse(raw)

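The context string prefixes each chunk with its ID in brackets: [chunk_abc]\ntext…. This is what makes citations checkable: every ID the model is allowed to cite appears in the context, so a cited chunk_id can be compared against the retrieved set. Nothing in the decoding step prevents an invented ID, which is why the system prompt rule says "never invent IDs" and the chunk_id field description repeats the requirement.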
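Because the IDs and texts are known on the application side, citations can be checked after the fact. The helper below is a hypothetical sketch (it is not part of structured_rag.py) that works on plain dicts and the same (chunk_id, text) pairs used to build the context.

```python
def verify_citations(citations: list[dict], chunks: list[tuple[str, str]]) -> list[str]:
    """Return human-readable problems; an empty list means every citation checks out."""
    text_by_id = dict(chunks)
    problems = []
    for cite in citations:
        chunk_id, excerpt = cite["chunk_id"], cite["excerpt"]
        if chunk_id not in text_by_id:
            problems.append(f"unknown chunk_id: {chunk_id}")
        elif excerpt not in text_by_id[chunk_id]:
            problems.append(f"excerpt not found verbatim in {chunk_id}")
    return problems


# Illustrative data: one good citation, one invented ID, one paraphrased excerpt.
chunks = [
    ("doc_42_chunk_7", "We use a sliding window counter with 60-second buckets for rate limiting."),
    ("doc_42_chunk_8", "Limits are configured per tenant."),
]
citations = [
    {"chunk_id": "doc_42_chunk_7", "excerpt": "sliding window counter with 60-second buckets"},
    {"chunk_id": "doc_99_chunk_1", "excerpt": "made-up text"},
    {"chunk_id": "doc_42_chunk_8", "excerpt": "Limits are set per user."},
]

problems = verify_citations(citations, chunks)
print(problems)
```

The substring check is deliberately strict. In practice you may want to normalize whitespace before comparing, since models sometimes collapse line breaks inside a quote.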

Graceful fallback on parse failure

Even with format="json", the model might produce JSON that doesn't match the expected schema - a missing required field, a string where a float was expected, or a confidence outside [0,1]. The _parse() function handles this in layers:

python structured_rag.py

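import logging

from pydantic import ValidationError

log = logging.getLogger(__name__)


def _parse(raw: str) -> StructuredAnswer:
    # Layer 1 - happy path: full Pydantic validation
    try:
        return StructuredAnswer.model_validate_json(raw)
    except (ValidationError, json.JSONDecodeError) as exc:
        # Pydantic v2 reports malformed JSON as a ValidationError as well;
        # JSONDecodeError is listed only as a defensive extra.
        log.warning("Structured parse failed (%s): %.120s", type(exc).__name__, raw)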

        # Layer 2 - partial recovery: salvage the answer field if present
        try:
            data = json.loads(raw)
            return StructuredAnswer(
                answer=str(data.get("answer", raw[:400])),
                citations=[],
                confidence=0.0,
                cannot_answer=True,
                reasoning="schema validation failed - partial response preserved",
            )
        except Exception:
            # Layer 3 - last resort: raw text as answer
            return StructuredAnswer(
                answer=raw[:400],
                citations=[], confidence=0.0,
                cannot_answer=True, reasoning="json parse error",
            )

The function never raises. The caller always gets a valid StructuredAnswer - even if validation failed. The cannot_answer=True + confidence=0.0 combination signals to downstream code that the response should be treated as unreliable.

🚫The cannot_answer Pattern

This field deserves its own section because it solves one of the hardest problems in RAG production systems: hallucination detection.

Without an explicit signal, a RAG system has three possible states - but only two visible outputs:

Without structured outputs

Context is sufficient → LLM answers correctly ✓
Context is insufficient → LLM silently hallucinates ✗
Context is insufficient → LLM says "I don't know" (unreliable)

With cannot_answer field

Context is sufficient → cannot_answer: false, answer populated ✓
Context is insufficient → cannot_answer: true, confidence: 0.0 ✓
Downstream code can branch on this boolean reliably

The cannot_answer flag is effective because it's a required field in the schema: the model is asked for an explicit decision on every call, and a response that omits the field fails Pydantic validation and lands in the fallback path with cannot_answer=True. An omission can't pass silently the way a missing "I don't know" can in free-text mode.

✅

Production pattern: In a real application, check cannot_answer first. If it's true (or confidence < 0.4), route the query to a human agent or return a UI message like "I couldn't find a reliable answer in the knowledge base." Only display the answer field when cannot_answer=false and confidence is high.
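The branching logic described above fits in one small function. This is a sketch under stated assumptions: the `Answer` dataclass is a minimal stand-in for StructuredAnswer, and the route names are placeholders for whatever your application does.

```python
from dataclasses import dataclass

CONFIDENCE_FLOOR = 0.4  # threshold from the text; tune it for your data


@dataclass
class Answer:
    # Minimal stand-in for StructuredAnswer with just the fields the router reads.
    answer: str
    confidence: float
    cannot_answer: bool


def route(result: Answer) -> str:
    """Decide what the application does with a response."""
    # Check the explicit flag first, then the confidence threshold.
    if result.cannot_answer or result.confidence < CONFIDENCE_FLOOR:
        return "escalate_to_human"
    return "show_answer"


print(route(Answer("Sliding window, 60s buckets.", 0.91, False)))  # show_answer
print(route(Answer("Not in the context.", 0.0, True)))             # escalate_to_human
print(route(Answer("Possibly a token bucket?", 0.2, False)))       # escalate_to_human
```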

Here's what a response looks like for a well-answered vs unanswerable question:

Response - question found in context

{
  "answer": "The rate limiter uses a sliding window algorithm with a 60-second bucket.",
  "citations": [
    {
      "chunk_id": "doc_42_chunk_7",
      "excerpt": "We use a sliding window counter with 60-second buckets for rate limiting.",
      "relevance_score": 0.94
    }
  ],
  "confidence": 0.91,
  "cannot_answer": false,
  "reasoning": "chunk_7 directly describes the rate limiting algorithm used."
}

Response - question not in context

{
  "answer": "The context does not contain information about the database schema.",
  "citations": [],
  "confidence": 0.0,
  "cannot_answer": true,
  "reasoning": "No retrieved chunk mentions database tables or schema definitions."
}

🌐FastAPI Endpoints

The router wraps the pipeline in two endpoints: one for querying, one for exposing the schema itself.

python routers/structured.py

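from typing import Annotated

from fastapi import APIRouter, Query

# sr, _get_collection, QueryResponse, CitationOut and StructuredAnswer
# are defined elsewhere in the project (not shown here).
router = APIRouter(prefix="/structured", tags=["structured"])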

@router.post("/query", response_model=QueryResponse)
async def structured_query(
    q:         Annotated[str, Query(description="Question to answer")],
    tenant_id: Annotated[str, Query()] = "default",
    model:     Annotated[str, Query(description="Ollama model")] = "llama3.2:3b",
    k:         Annotated[int, Query(ge=1, le=10)] = 5,
):
    collection = _get_collection(tenant_id)
    result     = sr.query(question=q, collection=collection, model=model, k=k)
    return QueryResponse(
        answer        = result.answer,
        citations     = [CitationOut(...) for c in result.citations],
        confidence    = round(result.confidence, 3),
        cannot_answer = result.cannot_answer,
        reasoning     = result.reasoning,
    )


@router.get("/schema")
async def get_answer_schema() -> dict:
    # Expose the schema - useful for API consumers and debugging prompts
    return StructuredAnswer.model_json_schema()

The GET /structured/schema endpoint is worth calling out. It exposes the exact schema the LLM is being asked to follow. When debugging a citation compliance issue, the first thing to check is whether the schema in the prompt matches what you expect - and this endpoint gives you that without digging into the code.

Testing the endpoint

bash test from terminal

# Run a structured query
curl -s -X POST \
  "http://localhost:8000/structured/query?q=What+is+the+rate+limit+policy&tenant_id=default&k=5" \
  | python3 -m json.tool

# Inspect the live schema the LLM receives
curl -s "http://localhost:8000/structured/schema" | python3 -m json.tool

⚖️What You Gain (and Lose)

Structured outputs are not free. Here's an honest comparison:

Dimension	Standard RAG	Structured RAG
Answer format	Free text - flexible, human-readable	JSON - machine-readable, strongly typed
Provenance	Opaque - no citation trail	Explicit chunk citations with excerpts
Hallucination signal	None - LLM may fabricate silently	`cannot_answer` flag + low confidence score
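System prompt size	Small	Larger - the serialized schema (descriptions plus nested definitions) likely adds several hundred tokens for this model; measure it with your tokenizer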
Latency	Baseline	Similar - grammar decoding is fast; schema adds minor prompt overhead
Small model compliance	N/A	Variable - 3B models follow schema less reliably than 8B+
Downstream integration	Manual parsing / regex	Minimal effort - the validated Pydantic object is already typed
Debugging	Read the free text	Inspect `reasoning`, check citations against chunks
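To put a number on the prompt-size row, one option is to measure the serialized schema directly. The sketch below repeats the models from schemas.py so it runs on its own, and uses a rough 4-characters-per-token rule of thumb, which is an assumption; a real tokenizer for your model will give a more accurate figure.

```python
import json

from pydantic import BaseModel, Field


class Citation(BaseModel):
    chunk_id: str = Field(
        description="ID of the source chunk (must match an ID from the provided context)"
    )
    excerpt: str = Field(
        description="Verbatim quote (≤150 chars) from the chunk that directly supports the answer"
    )
    relevance_score: float = Field(
        ge=0.0, le=1.0,
        description="How essential this chunk is to the answer (0=tangential, 1=directly answers it)",
    )


class StructuredAnswer(BaseModel):
    answer: str = Field(description="Direct, concise answer to the question in 1-3 sentences")
    citations: list[Citation] = Field(
        description="Chunks that were used to construct the answer - omit irrelevant chunks"
    )
    confidence: float = Field(
        ge=0.0, le=1.0,
        description="How fully the provided context supports the answer (0=guessing, 1=fully supported)",
    )
    cannot_answer: bool = Field(
        description="Set true when the context lacks sufficient information - do NOT hallucinate"
    )
    reasoning: str = Field(
        description="One sentence explaining how the answer was derived from the cited chunks"
    )


# Same serialization the pipeline injects into the system prompt.
schema_str = json.dumps(StructuredAnswer.model_json_schema(), indent=2)

# Rule of thumb only: roughly 4 characters per token for English-like text.
approx_tokens = len(schema_str) // 4

print(f"{len(schema_str)} characters, roughly {approx_tokens} tokens")
```

Dropping indent=2 when serializing shrinks the string, at the cost of a schema that is harder for a person to read when debugging prompts.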

💡

When to use structured outputs: Any time the answer goes into a database, triggers a workflow, or is compared programmatically. When the answer is displayed directly to a human in a chat UI, free text is often better - structured JSON displayed verbatim looks terrible. The two approaches are complementary, not competing.

Combining with previous articles

Structured outputs compose cleanly with the patterns from earlier articles in this series:

Article 8 (CRAG) - Run the corrective retrieval step first to filter low-quality chunks, then pass only the curated chunks through the structured RAG pipeline. The confidence field becomes more meaningful when the input is already filtered.
Article 7 (Agentic RAG) - The ReAct agent's summarise_chunks tool can return a StructuredAnswer instead of a string. The agent then has structured access to citations and the cannot_answer flag as part of its reasoning loop.
Article 5 (RAGAS evaluation) - The citations field identifies which chunks the answer relied on, so RAGAS's retrieved_contexts can be built from the cited chunk IDs, and the answer field maps to the response. Structured outputs make RAGAS evaluation almost free - no free-text parsing step required.
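As a rough illustration of the RAGAS mapping, the sketch below turns a structured answer into one evaluation row. The helper `to_eval_record` is hypothetical, and the key names (user_input, response, retrieved_contexts) are an assumption that should be checked against the RAGAS version you have installed.

```python
def to_eval_record(question: str, result: dict, chunks: list[tuple[str, str]]) -> dict:
    """Build one evaluation row from a structured answer and the retrieved chunks."""
    text_by_id = dict(chunks)
    cited_ids = [c["chunk_id"] for c in result["citations"]]
    return {
        "user_input": question,
        "response": result["answer"],
        # Full text of the cited chunks; unknown IDs are skipped rather than guessed.
        "retrieved_contexts": [text_by_id[cid] for cid in cited_ids if cid in text_by_id],
    }


# Illustrative data only.
chunks = [("c1", "Text one."), ("c2", "Text two.")]
result = {
    "answer": "A",
    "citations": [{"chunk_id": "c2"}, {"chunk_id": "c9"}],
}

record = to_eval_record("Q?", result, chunks)
print(record)
```

Whether to pass only the cited chunks or every retrieved chunk is a design choice: the former evaluates what the answer actually used, the latter evaluates the retriever.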

From Answers to Structured Data

Two additions - format="json" and a Pydantic schema in the system prompt - transform your RAG pipeline from a text generator into a typed API. The output is machine-readable, self-documenting, and safe to consume downstream without any parsing logic.

The cannot_answer flag alone is worth the complexity cost: it gives you a reliable signal to route uncertain queries to a human rather than displaying a confident-sounding hallucination.

Next: Article 10 - Production RAG. We bring together everything from the series - multi-tenant storage, streaming, evaluation, CRAG, and structured outputs - into a single deployable FastAPI service with observability, async ingestion, and a Makefile-driven deployment pipeline.



Part of the 10 RAG Projects That Teach Real-World AI Engineering series. All code uses open-source models only - no paid APIs, no subscriptions. Run everything locally with Ollama.

Idir Mellaz
