9. RAG with Structured Outputs - JSON Mode + Pydantic
Force your RAG pipeline to always return valid, typed data. Combine Ollama's JSON mode with Pydantic v2 schemas and a fallback parser to build a pipeline where every response is schema-validated.
Free-text answers from your RAG pipeline are fine for humans - but useless for downstream code. Structured outputs force the LLM to return validated JSON with citations, confidence scores, and an explicit "I don't know" flag. No regex. No parsing heuristics. Just Pydantic.
🚧 Why Free Text Fails in Production
Every RAG implementation in this series so far has returned a string. That works fine when the answer is read by a human. It breaks the moment another piece of code needs to consume it.
Consider what you actually need from a RAG system in a real application:
- Which source chunks supported the answer? (For citations in a UI)
- How confident is the model? (To decide whether to fall back to a human)
- Did the model genuinely find an answer, or is it hallucinating? (The explicit "I don't know" signal)
- What's the chain of reasoning? (For audit logs and debugging)
You could parse all of this from free text - but regex over LLM output is fragile, model-specific, and breaks with every prompt change. The alternative is to make the structure a first-class contract: define it once in Pydantic, inject it into the prompt, and validate on the way out.
The core idea: A Pydantic Field(description="...") is not just documentation - it becomes a precise instruction to the LLM when you inject model_json_schema() into the system prompt. Write good field descriptions, get better outputs.
⚙️ Two Mechanisms Working Together
Structured outputs in Ollama rely on two independent mechanisms that reinforce each other:
1 · format="json" (grammar decoding)
- Activates grammar-constrained token sampling
- Every sampled token is constrained so that the output is valid JSON
- Prevents runaway text before or after the JSON object
- Does not enforce specific field names or types
2 · Schema in system prompt (semantic contract)
- Inject the JSON schema verbatim into the system message
- Field description values guide the model's choices
- The model knows which fields to populate and what values are valid
- Pydantic validates on exit - rejects structurally wrong output
Neither mechanism alone is sufficient. format="json" guarantees valid JSON - but the model might output {"result": "some text"} with no citations, no confidence score, and no structure you can use. The schema injection guides it to the right fields. Together they give you a strong contract at low latency cost.
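To make the gap between the two mechanisms concrete, here is a minimal sketch. It assumes Pydantic v2 and uses a hypothetical, cut-down MiniAnswer model in place of the full schema; the raw string is the {"result": "some text"} case described above.

```python
import json

from pydantic import BaseModel, Field, ValidationError


class MiniAnswer(BaseModel):
    # Hypothetical, cut-down stand-in for the full answer schema
    answer: str
    confidence: float = Field(ge=0.0, le=1.0)
    cannot_answer: bool


raw = '{"result": "some text"}'

# Mechanism 1 alone: the output parses as JSON without complaint
parsed = json.loads(raw)

# The exit check: Pydantic rejects it because the expected fields are absent
missing: list[str] = []
try:
    MiniAnswer.model_validate_json(raw)
    schema_ok = True
except ValidationError as exc:
    schema_ok = False
    missing = sorted(err["loc"][0] for err in exc.errors() if err["type"] == "missing")
```

The string is perfectly valid JSON, yet validation reports every field of the model as missing. That is the failure the schema in the prompt is meant to prevent, and the one Pydantic catches when it happens anyway.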
Model capability matters. Smaller models (3B) follow schema instructions less reliably than larger ones (8B+). The fallback parser in structured_rag.py handles validation failures gracefully - but if you need high schema compliance on 3B models, include concrete examples in the prompt.
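One way to include a concrete example is to append a single worked response to the system prompt. The sketch below is an assumption about how that could look, not code from structured_rag.py: add_example, _EXAMPLE_OUTPUT, and the chunk ID and text inside it are all invented for illustration.

```python
import json

# Hypothetical worked example; the chunk ID and wording are made up
_EXAMPLE_OUTPUT = {
    "answer": "Requests are limited to 100 per minute per API key.",
    "citations": [
        {
            "chunk_id": "chunk_001",
            "excerpt": "Each API key may make up to 100 requests per minute.",
            "relevance_score": 0.9,
        }
    ],
    "confidence": 0.9,
    "cannot_answer": False,
    "reasoning": "The cited chunk states the per-key limit directly.",
}


def add_example(system_prompt: str, example: dict = _EXAMPLE_OUTPUT) -> str:
    """Append one worked example of a valid response to the system prompt."""
    example_json = json.dumps(example, indent=2)
    return (
        f"{system_prompt}\n\n"
        "Example of a valid response (structure only; do not copy its content):\n"
        f"{example_json}"
    )


prompt = add_example("You are a precise question-answering assistant.")
```

If the system prompt is built with str.format, append the example after formatting: the braces in the example JSON would otherwise be read as placeholders. The example also costs prompt tokens on every call, so one short example is usually a better trade than several long ones.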
🏗️ Designing the Output Schema
The schema is the most important design decision. Every field you add costs tokens in the prompt (the schema is injected verbatim) and increases the risk of compliance failures in small models. Every field you omit is information you can't recover downstream.
from pydantic import BaseModel, Field


class Citation(BaseModel):
    chunk_id: str = Field(
        description="ID of the source chunk (must match an ID from the provided context)"
    )
    excerpt: str = Field(
        description="Verbatim quote (≤150 chars) from the chunk that directly supports the answer"
    )
    relevance_score: float = Field(
        ge=0.0,
        le=1.0,
        description="How essential this chunk is to the answer (0=tangential, 1=directly answers it)",
    )


class StructuredAnswer(BaseModel):
    answer: str = Field(
        description="Direct, concise answer to the question in 1-3 sentences"
    )
    citations: list[Citation] = Field(
        description="Chunks that were used to construct the answer - omit irrelevant chunks"
    )
    confidence: float = Field(
        ge=0.0,
        le=1.0,
        description="How fully the provided context supports the answer (0=guessing, 1=fully supported)",
    )
    cannot_answer: bool = Field(
        description="Set true when the context lacks sufficient information - do NOT hallucinate"
    )
    reasoning: str = Field(
        description="One sentence explaining how the answer was derived from the cited chunks"
    )
The schema has two levels: a top-level StructuredAnswer and a nested Citation. Let's walk through the field choices:
| Field | Type | Why it's there |
|---|---|---|
| answer | str | The actual answer - same as classic RAG, but now co-located with metadata |
| citations | list[Citation] | Machine-readable provenance; chunk IDs let the UI link back to sources |
| confidence | float [0,1] | Self-assessed coverage score; lets downstream code threshold "unsure" answers |
| cannot_answer | bool | Explicit "I don't know" - prevents hallucination from being reported as an answer |
| reasoning | str | One-sentence CoT trace - invaluable for debugging wrong answers |
| Citation.excerpt | str (≤150) | Verbatim quote - verifiable against the source chunk, not a paraphrase |
| Citation.relevance_score | float [0,1] | Per-citation weight; lets the UI highlight the most important source |
Description quality is prompt quality. Compare description="confidence" (vague) vs description="How fully the provided context supports the answer (0=guessing, 1=fully supported)" (precise). The second version gives the model an unambiguous grounding scale. Write field descriptions like you're writing a rubric for a junior analyst.
💉 Schema Injection in the Prompt
The magic line is StructuredAnswer.model_json_schema(). Pydantic v2 extracts the full JSON Schema for your model - including nested models, field constraints, and all descriptions - as a Python dict. You serialize it to a string and drop it verbatim into the system prompt.
import json

from schemas import StructuredAnswer

# The system prompt template - {schema} is replaced at call time
_SYSTEM_TEMPLATE = """\
You are a precise question-answering assistant.
Answer the question using ONLY the provided context.

You MUST respond with valid JSON that exactly matches this schema:
{schema}

Rules:
- Set cannot_answer=true if the context lacks enough information to answer confidently
- citations must reference real chunk_ids from the context - never invent IDs
- excerpt must be a verbatim quote (≤150 chars), not a paraphrase
- confidence reflects context coverage, not your general knowledge
- Output ONLY the JSON object - no explanation, no markdown fences"""

# At query time:
schema_str = json.dumps(StructuredAnswer.model_json_schema(), indent=2)
system = _SYSTEM_TEMPLATE.format(schema=schema_str)
This approach has a nice property: when you add a field to StructuredAnswer, the prompt updates automatically. There is no separate "prompt schema" to maintain.
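A small sketch of that property, assuming Pydantic v2. AnswerV1, AnswerV2, and schema_prompt are hypothetical names used only here; the point is that the serialized schema, and therefore the prompt, grows by itself when a field is added.

```python
import json

from pydantic import BaseModel, Field


class AnswerV1(BaseModel):
    answer: str = Field(
        description="Direct, concise answer to the question in 1-3 sentences"
    )


class AnswerV2(AnswerV1):
    # New field: no prompt template edit needed, the schema picks it up
    cannot_answer: bool = Field(
        description="Set true when the context lacks sufficient information"
    )


def schema_prompt(model: type[BaseModel]) -> str:
    """Serialize a model's JSON Schema the same way the prompt code does."""
    return json.dumps(model.model_json_schema(), indent=2)


v1 = schema_prompt(AnswerV1)
v2 = schema_prompt(AnswerV2)

# Every added field also adds prompt characters on every call
added_chars = len(v2) - len(v1)
```

The same comparison is a cheap way to see what a new field costs: printing added_chars (or running the string through your tokenizer) before and after a schema change shows how much prompt each field adds.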
What does the injected schema actually look like? Here's a simplified excerpt of what model_json_schema() produces for our model:
{
"type": "object",
"properties": {
"answer": {
"type": "string",
"description": "Direct, concise answer to the question in 1-3 sentences"
},
"confidence": {
"type": "number",
"minimum": 0.0,
"maximum": 1.0,
"description": "How fully the provided context supports the answer (0=guessing, 1=fully supported)"
},
"cannot_answer": {
"type": "boolean",
"description": "Set true when the context lacks sufficient information - do NOT hallucinate"
},
...
},
"required": ["answer", "citations", "confidence", "cannot_answer", "reasoning"]
}
The LLM sees the field names, types, constraints (minimum/maximum), and your descriptions, along with the required array listing the fields it must always include. That's everything it needs to produce a valid response - derived entirely from your Python type annotations and Field definitions.
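The excerpt is simplified in one respect worth knowing about when you read the real output: Pydantic v2 places nested models under a top-level $defs key and points to them with $ref, and it adds a title to every property. A minimal sketch with shortened, hypothetical descriptions:

```python
from pydantic import BaseModel, Field


class Citation(BaseModel):
    chunk_id: str = Field(description="ID of the source chunk")
    relevance_score: float = Field(
        ge=0.0, le=1.0, description="How essential this chunk is"
    )


class StructuredAnswer(BaseModel):
    answer: str = Field(description="Direct, concise answer")
    citations: list[Citation] = Field(description="Chunks used for the answer")


schema = StructuredAnswer.model_json_schema()

# The list items do not inline the nested model; they reference it
citations_items = schema["properties"]["citations"]["items"]

# The nested model, including its ge/le constraints, lives under $defs
score_schema = schema["$defs"]["Citation"]["properties"]["relevance_score"]

required_fields = schema["required"]
```

So the model has to follow a $ref to find the citation fields. Larger models tend to handle this without trouble; if a small model keeps getting citations wrong, the indirection is one thing to look at.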
🔁 The Structured RAG Pipeline
The full pipeline is four steps. Step 1 is identical to classic RAG; steps 2 to 4 are what make it structured.
1 · Vector Search
Retrieve k chunks from ChromaDB. Same as every other article in this series.
2 · Schema Injection
Call model_json_schema(), serialize to JSON string, embed in system prompt.
3 · Ollama format=json
Grammar-constrained decoding ensures the raw output is always valid JSON, no fences.
4 · Pydantic Validation
model_validate_json() checks types, ranges, and required fields. Fallback on failure.
import json

import chromadb
import ollama

from schemas import StructuredAnswer

# _vector_search, _SYSTEM_TEMPLATE and _parse are defined elsewhere in this module


def query(
    question: str,
    collection: chromadb.Collection,
    model: str = "llama3.2:3b",
    k: int = 5,
) -> StructuredAnswer:
    # 1. Retrieve k chunks
    chunks = _vector_search(collection, question, k)
    if not chunks:
        return StructuredAnswer(
            answer="No documents found in this collection.",
            citations=[],
            confidence=0.0,
            cannot_answer=True,
            reasoning="vector search returned no results",
        )

    # 2. Build schema-aware system prompt
    schema_str = json.dumps(StructuredAnswer.model_json_schema(), indent=2)
    context = "\n\n".join(f"[{cid}]\n{text}" for cid, text in chunks)
    system = _SYSTEM_TEMPLATE.format(schema=schema_str)

    # 3. Call Ollama with format="json" (grammar-constrained decoding)
    resp = ollama.chat(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        format="json",
        options={"temperature": 0, "num_predict": 1024},
    )
    raw = resp["message"]["content"].strip()

    # 4. Validate with Pydantic
    return _parse(raw)
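The context string prefixes each chunk with its ID in brackets: [chunk_abc]\ntext…. This is what makes citations verifiable: every chunk_id in the response can be checked against the IDs that were actually in the context. The model is told to cite only those IDs, both by the "never invent IDs" rule in the system prompt and by the chunk_id field description, but neither is a guarantee, so treat an unknown ID as a failed citation.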
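Because the IDs are in the context, checking citations after the fact takes only a few lines. The sketch below is not part of the pipeline shown above; verify_citations and the sample chunks are hypothetical, and citations are plain dicts to keep the example free of dependencies.

```python
def build_context(chunks: list[tuple[str, str]]) -> str:
    """Same format as the pipeline: each chunk prefixed with its ID in brackets."""
    return "\n\n".join(f"[{cid}]\n{text}" for cid, text in chunks)


def verify_citations(
    citations: list[dict], chunks: list[tuple[str, str]]
) -> list[dict]:
    """Keep citations whose chunk_id was in the context and whose excerpt
    appears verbatim in that chunk's text."""
    by_id = dict(chunks)
    return [
        c
        for c in citations
        if c["chunk_id"] in by_id and c["excerpt"] in by_id[c["chunk_id"]]
    ]


# Invented sample data for illustration
chunks = [
    ("chunk_a", "The API allows 100 requests per minute."),
    ("chunk_b", "Tokens expire after 24 hours."),
]
citations = [
    {"chunk_id": "chunk_a", "excerpt": "100 requests per minute"},    # valid
    {"chunk_id": "chunk_z", "excerpt": "anything"},                   # invented ID
    {"chunk_id": "chunk_b", "excerpt": "Tokens expire after a day"},  # paraphrase
]

verified = verify_citations(citations, chunks)
```

Only the first citation survives: the second references an ID that was never in the context, and the third paraphrases instead of quoting. An exact substring match is strict, so in practice you might normalize whitespace before comparing.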
Graceful fallback on parse failure
Even with format="json", the model might produce JSON that doesn't match the expected schema - a missing required field, a string where a float was expected, or a confidence outside [0,1]. The _parse() function handles this in layers:
import json
import logging

from pydantic import ValidationError

from schemas import StructuredAnswer

log = logging.getLogger(__name__)


def _parse(raw: str) -> StructuredAnswer:
    # Layer 1 - happy path: full Pydantic validation
    try:
        return StructuredAnswer.model_validate_json(raw)
    except (ValidationError, json.JSONDecodeError) as exc:
        log.warning("Structured parse failed (%s): %.120s", type(exc).__name__, raw)

    # Layer 2 - partial recovery: salvage the answer field if present
    try:
        data = json.loads(raw)
        return StructuredAnswer(
            answer=str(data.get("answer", raw[:400])),
            citations=[],
            confidence=0.0,
            cannot_answer=True,
            reasoning="schema validation failed - partial response preserved",
        )
    except Exception:
        # Layer 3 - last resort: raw text as answer
        return StructuredAnswer(
            answer=raw[:400],
            citations=[],
            confidence=0.0,
            cannot_answer=True,
            reasoning="json parse error",
        )
The function never raises. The caller always gets a valid StructuredAnswer - even if validation failed. The cannot_answer=True + confidence=0.0 combination signals to downstream code that the response should be treated as unreliable.
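To see which inputs end up in which layer, here is a small diagnostic sketch. It assumes Pydantic v2; MiniAnswer is a hypothetical cut-down model and fallback_layer is an illustrative helper that mirrors the control flow of _parse but returns only the layer number.

```python
import json

from pydantic import BaseModel, Field, ValidationError


class MiniAnswer(BaseModel):
    # Hypothetical cut-down model; the real schema has more fields
    answer: str
    confidence: float = Field(ge=0.0, le=1.0)
    cannot_answer: bool


def fallback_layer(raw: str) -> int:
    """Report which layer of a _parse-style fallback would handle `raw`."""
    try:
        MiniAnswer.model_validate_json(raw)
        return 1  # happy path: valid JSON that matches the schema
    except ValidationError:
        pass
    try:
        data = json.loads(raw)
        data.get("answer")  # raises AttributeError if the JSON is not an object
        return 2  # parseable JSON object with the wrong shape
    except Exception:
        return 3  # not usable as a JSON object: keep the raw text


samples = {
    "valid": '{"answer": "42", "confidence": 0.8, "cannot_answer": false}',
    "out_of_range": '{"answer": "42", "confidence": 1.7, "cannot_answer": false}',
    "truncated": '{"answer": "4',
    "not_an_object": "[1, 2, 3]",
}
layers = {name: fallback_layer(raw) for name, raw in samples.items()}
```

The out-of-range confidence is valid JSON, so it reaches layer 2 and its answer is salvaged. The truncated string and the bare list both fall through to layer 3. As far as I can tell, Pydantic v2 reports malformed JSON as a ValidationError as well, so catching json.JSONDecodeError in layer 1 is a harmless extra rather than a requirement.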
🚫 The cannot_answer Pattern
This field deserves its own section because it solves one of the hardest problems in RAG production systems: hallucination detection.
Without an explicit signal, a RAG system has three possible states - but only two visible outputs:
Without structured outputs
- Context is sufficient → LLM answers correctly ✓
- Context is insufficient → LLM silently hallucinates ✗
- Context is insufficient → LLM says "I don't know" (unreliable)
With cannot_answer field
- Context is sufficient → cannot_answer: false, answer populated ✓
- Context is insufficient → cannot_answer: true, confidence: 0.0 ✓
- Downstream code can branch on this boolean reliably
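The cannot_answer flag is effective because it's a required field in the schema - the model has to make an explicit decision about it on every call. If the model omits the field anyway, Pydantic validation fails and the fallback parser sets cannot_answer=True, so a missing flag never passes as a confident answer. Free-text mode has no equivalent: an LLM can simply forget to say "I don't know".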
Production pattern: In a real application, check cannot_answer first. If it's true (or confidence < 0.4), route the query to a human agent or return a UI message like "I couldn't find a reliable answer in the knowledge base." Only display the answer field when cannot_answer=false and confidence is high.
Here's what a response looks like for a well-answered vs unanswerable question:
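As an illustrative sketch, the two cases and the routing check might look like this. The values below are invented for demonstration and are not real model output; render and FALLBACK_MESSAGE are hypothetical names, and the 0.4 threshold is the one suggested above.

```python
CONFIDENCE_THRESHOLD = 0.4  # cut-off suggested in the production pattern

# Invented example of a well-answered question
answered = {
    "answer": "Each API key is limited to 100 requests per minute.",
    "citations": [
        {
            "chunk_id": "chunk_012",
            "excerpt": "up to 100 requests per minute per API key",
            "relevance_score": 0.95,
        }
    ],
    "confidence": 0.9,
    "cannot_answer": False,
    "reasoning": "The cited chunk states the limit directly.",
}

# Invented example of an unanswerable question
unanswerable = {
    "answer": "The context does not mention refund policies.",
    "citations": [],
    "confidence": 0.0,
    "cannot_answer": True,
    "reasoning": "None of the retrieved chunks discuss refunds.",
}

FALLBACK_MESSAGE = "I couldn't find a reliable answer in the knowledge base."


def render(response: dict, threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Check cannot_answer first, then confidence; only then show the answer."""
    if response["cannot_answer"] or response["confidence"] < threshold:
        return FALLBACK_MESSAGE
    return response["answer"]
```

Calling render(answered) returns the answer text, while render(unanswerable) returns the fallback message. In a real service the fallback branch is also where you would enqueue the query for a human agent.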
🌐 FastAPI Endpoints
The router wraps the pipeline in two endpoints: one for querying, one for exposing the schema itself.
from typing import Annotated

from fastapi import APIRouter, Query

router = APIRouter(prefix="/structured", tags=["structured"])


@router.post("/query", response_model=QueryResponse)
async def structured_query(
    q: Annotated[str, Query(description="Question to answer")],
    tenant_id: Annotated[str, Query()] = "default",
    model: Annotated[str, Query(description="Ollama model")] = "llama3.2:3b",
    k: Annotated[int, Query(ge=1, le=10)] = 5,
):
    collection = _get_collection(tenant_id)
    result = sr.query(question=q, collection=collection, model=model, k=k)
    return QueryResponse(
        answer=result.answer,
        citations=[CitationOut(...) for c in result.citations],
        confidence=round(result.confidence, 3),
        cannot_answer=result.cannot_answer,
        reasoning=result.reasoning,
    )


@router.get("/schema")
async def get_answer_schema() -> dict:
    # Expose the schema - useful for API consumers and debugging prompts
    return StructuredAnswer.model_json_schema()
The GET /structured/schema endpoint is worth calling out. It exposes the exact schema the LLM is being asked to follow. When debugging a citation compliance issue, the first thing to check is whether the schema in the prompt matches what you expect - and this endpoint gives you that without digging into the code.
Testing the endpoint
# Run a structured query
curl -s -X POST \
  "http://localhost:8000/structured/query?q=What+is+the+rate+limit+policy&tenant_id=default&k=5" \
  | python3 -m json.tool

# Inspect the live schema the LLM receives
curl -s "http://localhost:8000/structured/schema" | python3 -m json.tool
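The same query can be issued from Python with nothing but the standard library. This is a sketch under the assumption that the service is running on localhost:8000 as in the curl example; build_query_url and ask are hypothetical helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8000"  # same host and port as the curl example


def build_query_url(question: str, tenant_id: str = "default", k: int = 5) -> str:
    """Encode the question and options as query parameters."""
    params = urlencode({"q": question, "tenant_id": tenant_id, "k": k})
    return f"{BASE_URL}/structured/query?{params}"


def ask(question: str, tenant_id: str = "default", k: int = 5) -> dict:
    """POST the query and return the parsed JSON body (needs a running server)."""
    req = Request(build_query_url(question, tenant_id, k), method="POST")
    with urlopen(req) as resp:
        return json.load(resp)


url = build_query_url("What is the rate limit policy")
```

The returned dict has the same keys as the response model, so a caller can branch on result["cannot_answer"] directly after ask(...) returns.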
⚖️ What You Gain (and Lose)
Structured outputs are not free. Here's an honest comparison:
| Dimension | Standard RAG | Structured RAG |
|---|---|---|
| Answer format | Free text - flexible, human-readable | JSON - machine-readable, strongly typed |
| Provenance | Opaque - no citation trail | Explicit chunk citations with excerpts |
| Hallucination signal | None - LLM may fabricate silently | cannot_answer flag + low confidence score |
| System prompt size | Small | Larger - the injected schema adds a few hundred tokens for this model (the field descriptions alone run to roughly 100 words) |
| Latency | Baseline | Similar - grammar decoding is fast; schema adds minor prompt overhead |
| Small model compliance | N/A | Variable - 3B models follow schema less reliably than 8B+ |
| Downstream integration | Manual parsing / regex | Zero-effort - Pydantic object is already typed and validated |
| Debugging | Read the free text | Inspect reasoning, check citations against chunks |
When to use structured outputs: Any time the answer goes into a database, triggers a workflow, or is compared programmatically. When the answer is displayed directly to a human in a chat UI, free text is often better - structured JSON displayed verbatim looks terrible. The two approaches are complementary, not competing.
Combining with previous articles
Structured outputs compose cleanly with the patterns from earlier articles in this series:
- Article 8 (CRAG) - Run the corrective retrieval step first to filter low-quality chunks, then pass only the curated chunks through the structured RAG pipeline. The confidence field becomes more meaningful when the input is already filtered.
- Article 7 (Agentic RAG) - The ReAct agent's summarise_chunks tool can return a StructuredAnswer instead of a string. The agent then has structured access to citations and the cannot_answer flag as part of its reasoning loop.
- Article 5 (RAGAS evaluation) - The citations field maps directly to RAGAS's retrieved_contexts and the answer field to the response. Structured outputs make RAGAS evaluation almost free - no parsing step required.
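For the RAGAS case, the mapping could be sketched as follows. This is an assumption-laden example: to_ragas_row is a hypothetical helper, the structured answer is a plain dict with invented values, and the user_input key follows RAGAS single-turn sample naming as I understand it, so check it against the RAGAS version you use.

```python
def to_ragas_row(question: str, result: dict) -> dict:
    """Map one structured answer onto RAGAS-style field names."""
    return {
        "user_input": question,  # assumed RAGAS field name for the question
        "retrieved_contexts": [c["excerpt"] for c in result["citations"]],
        "response": result["answer"],
    }


# Invented sample result for illustration
result = {
    "answer": "Each API key is limited to 100 requests per minute.",
    "citations": [
        {"chunk_id": "chunk_012", "excerpt": "up to 100 requests per minute per API key"}
    ],
    "confidence": 0.9,
    "cannot_answer": False,
}

row = to_ragas_row("What is the rate limit policy?", result)
```

The excerpts are short quotes, not whole chunks. If a metric needs the full retrieved text, look the chunk up by chunk_id and pass that text instead.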
From Answers to Structured Data
Two additions - format="json" and a Pydantic schema in the system prompt - transform your RAG pipeline from a text generator into a typed API. The output is machine-readable, self-documenting, and safe to consume downstream without any parsing logic.
The cannot_answer flag alone is worth the complexity cost: it gives you a reliable signal to route uncertain queries to a human rather than displaying a confident-sounding hallucination.
Next: Article 10 - Production RAG. We bring together everything from the series - multi-tenant storage, streaming, evaluation, CRAG, and structured outputs - into a single deployable FastAPI service with observability, async ingestion, and a Makefile-driven deployment pipeline.