RAG Isn’t Enough: Building the Context Layer That Actually Makes LLM Systems Work
RAG systems don’t fail at retrieval; they fail at context. As conversations grow, what enters the context window becomes the bottleneck. A context engine manages memory, compression, re-ranking, and token limits, making LLM systems reliable at scale.
⚡ The Problem RAG Alone Cannot Solve
Most RAG tutorials teach you the same thing: embed your documents, store them in a vector database, retrieve the top-k chunks at query time, and pass them to an LLM. The pipeline works beautifully in a notebook. Then you put it in front of real users and something breaks in a way no chunking parameter can fix.
The conversation runs for ten turns. The user references something they said three exchanges ago. The retrieved documents are relevant, but they are competing with a growing backlog of conversation history for space in the context window. The model starts truncating. It loses the thread. It hallucinates a fact that contradicts a document it saw four turns ago. Your system has a context management problem, not a retrieval problem.
Research bears this out with a striking statistic: nearly 65% of enterprise AI failures in 2025 were attributed to context drift or memory loss during multi-step reasoning, not to model capability gaps. A 2024 survey by Gao et al. found that over 70% of errors in modern LLM applications stem from incomplete, irrelevant, or poorly structured context - not from insufficient model capability. The bottleneck in 2026 has shifted from the model side to the context side.
Andrej Karpathy put it precisely: "The LLM is the CPU, and the context window is the RAM." Just as an operating system curates what fits into RAM, a production LLM application needs a deliberate layer that controls what enters the context window, in what order, at what size. RAG fills the context window. A context engine manages it.
🔬 Context Engineering: A New Discipline
In June 2025, Andrej Karpathy published a now-famous post on X that crystallised what many production engineers had been building implicitly for months. He wrote: "Context engineering is the delicate art and science of filling the context window with just the right information for the next step."
He was deliberate about calling it both an art and a science. Science because it involves task descriptions, few-shot examples, RAG, state and history, compression, tool definitions - all of which can be measured and optimised. Art because it requires intuition about how LLMs allocate attention, how much context aids versus confuses, and what order information should appear in.
It is worth drawing sharp distinctions between three overlapping concepts: prompt engineering (crafting the instruction for a single call), RAG (retrieving relevant documents at query time), and context engineering (deciding everything that enters the context window, in what form and order):
"Context engineering is the delicate art and science of filling the context window with just the right information for the next step. Too little or of the wrong form and the LLM doesn't have the right context for optimal performance. Too much or too irrelevant and costs go up and performance comes down." - Andrej Karpathy, June 2025
As of March 2026, context engineering is no longer a standalone concept - it sits inside a broader agent stack that also includes agent harnesses, interoperability protocols (MCP), project memory for coding agents, and trace-first observability. The centre of gravity has shifted from "how to pack the best prompt" to "how agent systems manage runtime state, memory, tools, protocols, approvals, and long-horizon execution."
📅 The Three Generations of RAG Architecture
RAG is not a static pattern. It has evolved through three distinct architectural generations, each addressing the failures of the previous. Understanding this evolution explains why a naive RAG pipeline breaks in production - and what to build instead.
Generation 1: Naive RAG

- Fixed-length chunks, converted to vectors, stored in a vector database
- Top-k retrieval by cosine similarity, concatenated directly as context
- Three core failures: semantic fragmentation during chunking, insufficient retrieval precision, no quality validation of retrieved results
- Works well for simple single-hop Q&A over a small, static knowledge base; breaks for anything more complex (a minimal sketch of this pipeline follows the list)
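Here is a minimal sketch of that first-generation pipeline, using the same LangChain components that appear in later examples; the chunk source, model name, and query are placeholders, not a prescribed setup.

```python
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

chunks = load_and_split_documents()  # your chunking step (see Stage 1 below)

# Index: fixed-size chunks embedded into a vector store
vectorstore = Chroma.from_documents(chunks, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})  # top-k cosine similarity

def format_docs(docs):
    return "\n\n".join(d.page_content for d in docs)

prompt = ChatPromptTemplate.from_template(
    "Answer using only this context:\n{context}\n\nQuestion: {question}"
)

# Retrieve -> stuff into prompt -> generate. No re-ranking, no memory, no token budget.
naive_rag = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | ChatAnthropic(model="claude-haiku-4-5")
    | StrOutputParser()
)

answer = naive_rag.invoke("What is the refund policy?")
```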
Generation 2: Advanced RAG

- Query rewriting and step-back prompting to improve recall on ambiguous queries
- Hybrid search: combining semantic vector search with BM25/TF-IDF keyword search
- Cross-encoder re-ranking (BERT-based) to fix "lost in the middle" precision problems
- Parent-child chunking: retrieve small chunks for precision, expand to parent for context (see the retriever sketch after this list)
- HyDE (Hypothetical Document Embeddings): generate a hypothetical answer, embed it, use that embedding to retrieve real documents - dramatically improves recall on sparse queries
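Of these, parent-child chunking is the easiest to adopt incrementally. A minimal sketch using LangChain's ParentDocumentRetriever follows; the splitter sizes and collection name are illustrative.

```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

vectorstore = Chroma(collection_name="child_chunks", embedding_function=OpenAIEmbeddings())

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,                      # small child chunks are embedded here
    docstore=InMemoryStore(),                     # full parent chunks live here
    child_splitter=RecursiveCharacterTextSplitter(chunk_size=400),
    parent_splitter=RecursiveCharacterTextSplitter(chunk_size=2000),
)

retriever.add_documents(docs)  # docs: your loaded Documents

# Retrieval matches against precise child chunks but returns the parent chunk,
# so the generator sees surrounding context instead of an isolated fragment.
parents = retriever.invoke("How do I rotate the API signing key?")
```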
Generation 3: Agentic RAG

- Agents autonomously determine whether retrieval is needed, and from which source
- Dynamic source selection: vector stores, knowledge graphs, web search, APIs - chosen at runtime based on query type
- Self-RAG: model decides when to retrieve, critiques its own outputs, retries when confidence is low
- GraphRAG (Microsoft): builds entity-relationship graphs over the corpus, enabling theme-level answers that naive RAG cannot produce
- Reflexion-style self-correction: agents verify answer correctness post-generation and re-retrieve if needed
🔗 Understanding RAG Architecture in Depth
Before building the context engine layer, you need a firm grip on what RAG actually does and where each component makes decisions that affect context quality. A production RAG pipeline is not a single step - it is a cascade of decisions, each of which compounds into the final context the model sees.
Stage 1: Chunking Strategy
How you split your documents determines the fundamental granularity of information available for retrieval. The canonical approaches have distinct trade-offs (a short splitter example follows the table):
| Strategy | Mechanism | Strengths | Weaknesses | Best For |
|---|---|---|---|---|
| Character Split | Splits strictly by character count with optional overlap | Fast, deterministic, zero dependencies | Can cut words, sentences, or paragraphs mid-thought, destroying semantic coherence | Structured data, logs, code |
| Recursive Split | Tries paragraph → sentence → word boundaries in sequence | Preserves semantic units, the default choice for prose | Inconsistent chunk sizes, poor on highly structured documents | General prose, articles, docs |
| Token Split | Splits on LLM tokenizer vocabulary (tiktoken) | Guarantees chunks fit context windows exactly, no overflow surprises | Computationally expensive, may ignore semantic boundaries | Context-window-constrained pipelines |
| Semantic Split | Embeds sentences, splits on embedding similarity drops | Best semantic coherence, topic-aware boundaries | Slow (requires embedding every sentence), variable chunk sizes | High-precision retrieval over long docs |
| Parent-Child | Index small chunks for retrieval, expand to parent for context | High precision retrieval + rich context for generation | Increased storage, more complex indexing pipeline | Long documents, technical manuals |
| AST-based (Code) | Parses code along semantically meaningful AST boundaries | Preserves function, class, module boundaries exactly | Language-specific, embedding search degrades at codebase scale | Code repositories, APIs, SDKs |
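As a concrete starting point, here is a minimal comparison of a recursive splitter and a token-based splitter using LangChain's text splitters; chunk sizes, overlaps, and the input file are illustrative, not recommendations.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter, TokenTextSplitter

text = open("handbook.md").read()  # any long document

# Recursive split: tries paragraph, then sentence, then word boundaries.
# Good default for prose; chunk sizes vary around the target.
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # target size in characters
    chunk_overlap=150,    # overlap preserves context across boundaries
)
prose_chunks = recursive_splitter.split_text(text)

# Token split: sizes are measured in tokenizer tokens (tiktoken under the hood),
# so chunks are guaranteed to fit a token budget exactly.
token_splitter = TokenTextSplitter(
    chunk_size=512,       # size in tokens, not characters
    chunk_overlap=64,
)
budgeted_chunks = token_splitter.split_text(text)

print(len(prose_chunks), len(budgeted_chunks))
```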
Stage 2: Hybrid Retrieval and Reciprocal Rank Fusion
Pure semantic (vector) retrieval casts a wide net but produces imperfect rankings. Relevant documents frequently end up "lost in the middle" of the result list. Production systems in 2026 combine two complementary signals: semantic embeddings (high recall, concept-level) and BM25/TF-IDF keyword matching (high precision, exact terminology). The scores are fused via Reciprocal Rank Fusion (RRF), which combines rankings without requiring score normalisation.
```python
from langchain_community.vectorstores import Chroma
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain_openai import OpenAIEmbeddings

# Build both retrieval backends from the same documents
docs = load_documents()  # your document loading logic

# Semantic retriever - captures concept-level relevance
vectorstore = Chroma.from_documents(docs, embedding=OpenAIEmbeddings())
semantic_retriever = vectorstore.as_retriever(search_kwargs={"k": 25})

# Keyword retriever - captures exact terminology and proper nouns
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 25

# EnsembleRetriever applies Reciprocal Rank Fusion automatically.
# weights=[0.6, 0.4] means semantic scores weighted slightly higher.
# Tune toward 0.4/0.6 for keyword-heavy technical queries.
hybrid_retriever = EnsembleRetriever(
    retrievers=[semantic_retriever, bm25_retriever],
    weights=[0.6, 0.4]
)

# Returns fused, deduplicated results - typically top 15-20 candidates
candidates = hybrid_retriever.invoke("What is the difference between RAG and fine-tuning?")
```
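RRF itself is only a few lines. The sketch below shows the fusion the EnsembleRetriever applies internally: each document's fused score is the sum of 1/(k + rank) across the ranked lists it appears in, where k is a smoothing constant (conventionally 60). The document IDs are illustrative.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs without score normalisation."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Example: semantic search and BM25 disagree on order; RRF rewards documents
# that rank well in both lists.
semantic = ["doc_a", "doc_b", "doc_c", "doc_d"]
keyword = ["doc_c", "doc_a", "doc_e"]
print(reciprocal_rank_fusion([semantic, keyword])[:3])
```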
⚙️ Context Engine: The Missing Layer
The context engine sits between retrieval output and the final prompt construction. Its job is to make explicit what naive RAG leaves implicit: exactly what information enters the context window, at what size, in what order, within a strict token budget. Without this layer, your system's behaviour under pressure (long conversations, noisy retrieval, latency constraints) is undefined.
Token Budget Allocation
Every context window has a fixed size. A context engine must make explicit allocation decisions before construction, not discover overflow at inference time. A practical allocation for a 128K token window looks like this:
| Context Slot | Default Allocation | Notes |
|---|---|---|
| System prompt | 5% (~6K tokens) | Instructions, persona, output format, tool descriptions. Fixed size, highest priority. |
| Recent conversation history | 20% (~25K tokens) | Last N turns verbatim. Do not compress. Recency matters most for coherence. |
| Summarised history | 10% (~13K tokens) | LLM-generated rolling summary of older turns. Updated asynchronously to avoid latency. |
| Retrieved documents | 45% (~58K tokens) | Re-ranked, compressed, deduplicated retrieval results. Largest slot - where RAG lives. |
| User message + output reserve | 20% (~26K tokens) | Current turn input plus headroom for the model's response. Never sacrifice this. |
```python
import tiktoken
from dataclasses import dataclass
from typing import Optional

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

@dataclass
class ContextBudget:
    total_tokens: int = 128_000
    system_prompt_pct: float = 0.05
    recent_history_pct: float = 0.20
    summary_pct: float = 0.10
    documents_pct: float = 0.45
    output_reserve_pct: float = 0.20

    def docs_budget(self) -> int:
        return int(self.total_tokens * self.documents_pct)

    def history_budget(self) -> int:
        return int(self.total_tokens * (self.recent_history_pct + self.summary_pct))

class ContextEngine:
    def __init__(self, budget: ContextBudget):
        self.budget = budget

    def build_context(
        self,
        system_prompt: str,
        history_turns: list[dict],
        retrieved_docs: list[str],
        user_query: str,
        summary: Optional[str] = None,
    ) -> dict:
        # 1. Allocate doc budget and compress if needed
        doc_budget = self.budget.docs_budget()
        docs_ctx, compression_ratio = self._fit_documents(retrieved_docs, doc_budget)

        # 2. Allocate history budget: recent verbatim + rolling summary
        hist_budget = self.budget.history_budget()
        hist_ctx = self._fit_history(history_turns, summary, hist_budget)

        return {
            "system": system_prompt,
            "context": docs_ctx,
            "history": hist_ctx,
            "query": user_query,
            "compression_ratio": compression_ratio,
            "tokens_used": count_tokens(docs_ctx + hist_ctx),
        }

    def _fit_documents(self, docs: list[str], budget: int) -> tuple[str, float]:
        # Pack documents greedily until budget is exhausted
        selected, total = [], 0
        for doc in docs:
            tokens = count_tokens(doc)
            if total + tokens <= budget:
                selected.append(doc)
                total += tokens
            else:
                break  # never overflow - hard budget ceiling
        original_tokens = sum(count_tokens(d) for d in docs)
        ratio = 1 - (total / original_tokens) if original_tokens > 0 else 0
        return "\n\n---\n\n".join(selected), ratio

    def _fit_history(self, turns: list[dict], summary: Optional[str], budget: int) -> str:
        # Always include summary first (if present), then most-recent turns
        parts, used = [], 0
        summary_included = False
        if summary:
            s_tokens = count_tokens(summary)
            if s_tokens <= budget // 3:  # cap summary at 1/3 of history budget
                parts.append(f"[Summary of prior conversation]\n{summary}")
                used += s_tokens
                summary_included = True
        # Add recent turns newest-first, stop when budget exhausted
        for turn in reversed(turns):
            text = f"{turn['role']}: {turn['content']}"
            t = count_tokens(text)
            if used + t > budget:
                break
            # Insert after the summary (if present) so turns stay in chronological order
            parts.insert(1 if summary_included else 0, text)
            used += t
        return "\n".join(parts)
```
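To see the budget in action, here is a minimal usage sketch of the engine above; the documents, turns, and query are placeholders.

```python
budget = ContextBudget(total_tokens=128_000)
engine = ContextEngine(budget)

prompt_parts = engine.build_context(
    system_prompt="You are a helpful assistant. Answer only from the provided context.",
    history_turns=[
        {"role": "user", "content": "What is hybrid retrieval?"},
        {"role": "assistant", "content": "It combines vector search with BM25 keyword search."},
    ],
    retrieved_docs=[
        "Reciprocal Rank Fusion combines rankings without score normalisation...",
        "BM25 is a sparse retrieval function based on term frequency...",
    ],
    user_query="How does RRF weight the two result lists?",
    summary=None,
)
print(prompt_parts["tokens_used"], prompt_parts["compression_ratio"])
```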
💾 Memory Systems: In-Session vs Cross-Session
Memory in LLM systems is not a single concept - it is three distinct problems requiring three distinct architectures. Conflating them is one of the most common design mistakes in production systems.
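To make at least the first two concrete, here is a minimal, framework-free sketch of the two scopes named in the heading: in-session memory keyed by conversation (thread) ID that feeds the context engine every turn, and cross-session memory keyed by user ID that persists across conversations. The class names and SQLite schema are illustrative assumptions, not a prescribed design.

```python
import json
import sqlite3
from collections import defaultdict
from typing import Optional

class InSessionMemory:
    """Working memory for one conversation: recent turns plus a rolling summary."""

    def __init__(self):
        self._turns = defaultdict(list)   # thread_id -> list of {"role", "content"}
        self._summaries = {}              # thread_id -> rolling summary string

    def add_turn(self, thread_id: str, role: str, content: str) -> None:
        self._turns[thread_id].append({"role": role, "content": content})

    def get(self, thread_id: str) -> tuple[list[dict], Optional[str]]:
        return self._turns[thread_id], self._summaries.get(thread_id)

class CrossSessionMemory:
    """Durable facts about a user that survive across conversations (preferences, profile)."""

    def __init__(self, path: str = "memory.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS user_memory (user_id TEXT, key TEXT, value TEXT, "
            "PRIMARY KEY (user_id, key))"
        )

    def remember(self, user_id: str, key: str, value: dict) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO user_memory VALUES (?, ?, ?)",
            (user_id, key, json.dumps(value)),
        )
        self.conn.commit()

    def recall(self, user_id: str) -> dict:
        rows = self.conn.execute(
            "SELECT key, value FROM user_memory WHERE user_id = ?", (user_id,)
        ).fetchall()
        return {k: json.loads(v) for k, v in rows}

# In-session memory is rebuilt into the context window every turn;
# cross-session memory is retrieved selectively, like any other context source.
```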
📉 Compression Strategies and Token Budgets
When conversation history and retrieved documents together exceed the available token budget, the system must compress rather than truncate blindly. The difference matters: truncation discards arbitrarily; compression preserves meaning at lower token cost. There are several production-grade compression strategies; the two workhorses are rolling summarisation (for conversation history) and graph-based extractive compression (for retrieved documents):
Rolling summarisation (conversation history)

- Maintain a running LLM-generated summary of older turns
- Update asynchronously after each turn to avoid latency on the hot path
- Keep the last N turns verbatim, summarise everything before that
- Cost: one LLM call per update (background job, not blocking); a minimal sketch follows this list
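A minimal sketch of the rolling-summary update, assuming a LangChain chat model is available (the model name mirrors earlier examples and the prompt wording is a placeholder); in production this call runs as a background job after the turn completes.

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_anthropic import ChatAnthropic

summariser = ChatAnthropic(model="claude-haiku-4-5")  # cheap model for summarisation

summary_prompt = ChatPromptTemplate.from_template(
    """Update the running summary of this conversation.

Existing summary (may be empty):
{summary}

New turns to fold in:
{new_turns}

Return an updated summary under 200 words. Preserve names, numbers, and decisions."""
)

update_summary = summary_prompt | summariser | StrOutputParser()

def roll_summary(summary: str, new_turns: list[dict]) -> str:
    # Called asynchronously after each turn; the last N turns stay verbatim elsewhere.
    rendered = "\n".join(f"{t['role']}: {t['content']}" for t in new_turns)
    return update_summary.invoke({"summary": summary or "(none)", "new_turns": rendered})
```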
Graph-based extractive compression (TextRank-style)

- Graph-based sentence ranking, zero LLM calls needed
- Identifies the most informative sentences by co-occurrence weight
- Achieves 40-60% reduction with ~85% semantic retention
- Use for retrieved document compression, not conversation history (see the sketch below)
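Here is a compact sketch of that idea: sentences become graph nodes, edges are weighted by word overlap, and PageRank selects the most central sentences. It assumes networkx is installed and uses naive sentence and word splitting; a production system would use a proper tokeniser or an off-the-shelf TextRank implementation.

```python
import re
import networkx as nx

def extractive_compress(text: str, keep_ratio: float = 0.5) -> str:
    """Keep the most central sentences, preserving their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) <= 2:
        return text

    words = [set(re.findall(r"\w+", s.lower())) for s in sentences]

    # Build a similarity graph: edge weight = word overlap (Jaccard) between sentences
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            union = words[i] | words[j]
            if union:
                weight = len(words[i] & words[j]) / len(union)
                if weight > 0:
                    graph.add_edge(i, j, weight=weight)

    # PageRank over the similarity graph scores sentence centrality
    scores = nx.pagerank(graph, weight="weight")
    n_keep = max(1, int(len(sentences) * keep_ratio))
    keep = sorted(sorted(scores, key=scores.get, reverse=True)[:n_keep])
    return " ".join(sentences[i] for i in keep)
```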
In the reference implementation (800-token budget, 5 documents, multi-turn conversation): Turn 1 (no history) - 48% compression applied to documents, all 5 retrieved docs fit. Turn 5 (4 prior turns in memory) - compression tightens automatically to 55% as history consumes more budget. The model always receives a coherent, non-overflowing context. The system adapts; it never fails silently with a token overflow error.
🏆 Re-ranking: Fixing Lost-in-the-Middle
Vector search maximises recall: it finds everything that might be relevant. The problem is placement: a highly relevant document sitting at position #7 of the context is nearly as invisible to the model as an irrelevant one, because LLMs systematically attend less to information buried in the middle of long contexts - a well-documented phenomenon called "lost in the middle."
Re-ranking solves this by introducing a second-stage model that examines query-document pairs holistically. Cross-encoders (like BERT-based cross-attention models, Cohere Reranker, or ColBERT) evaluate relevance far more accurately than cosine similarity because they see the full text of both the query and the document simultaneously - not just their compressed embeddings.
After re-ranking, how you order chunks in the context window matters as much as which chunks you selected. Anthropic's long context usage guide explicitly advises placing the most critical information at the beginning or end of the context. Never bury the most relevant retrieved document in the middle of a long context block - the model's attention there is systematically weaker.
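A minimal re-ranking sketch using the sentence-transformers CrossEncoder follows; the checkpoint is a common public re-ranker (swap in Cohere Rerank or ColBERT as preferred), and the final ordering places the strongest results at the edges of the context block, per the placement advice above.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_and_order(query: str, candidates: list[str], top_n: int = 8) -> list[str]:
    # Score each (query, document) pair with full cross-attention
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)][:top_n]

    # Place the best documents at the start and end of the context block,
    # pushing weaker ones toward the middle where attention is weakest.
    head, tail = [], []
    for i, doc in enumerate(ranked):
        (head if i % 2 == 0 else tail).append(doc)
    return head + tail[::-1]
```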
🔎 Advanced Retrieval Patterns for Production
HyDE: Hypothetical Document Embeddings
Standard retrieval embeds the user's query and finds documents with similar embeddings. The problem: user queries are short, underspecified, and semantically far from the dense prose of the documents they are looking for. HyDE inverts this: it asks the LLM to generate a hypothetical answer to the query, then embeds that hypothetical answer and uses that richer embedding for retrieval. The hypothetical document is in the same semantic space as real documents, dramatically improving recall on sparse or ambiguous queries.
```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_anthropic import ChatAnthropic
from langchain_core.output_parsers import StrOutputParser
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

llm = ChatAnthropic(model="claude-haiku-4-5")  # cheap model for hypothesis generation
embeddings = OpenAIEmbeddings()
vectorstore = Chroma("my_index", embeddings)

# Step 1: Generate a hypothetical document that would answer the query
hypothesis_prompt = ChatPromptTemplate.from_template(
    """Write a short factual paragraph that would be a perfect answer to the following question.
Be specific and detailed - do not hedge or say 'I don't know'.

Question: {query}

Hypothetical answer:"""
)

generate_hypothesis = hypothesis_prompt | llm | StrOutputParser()

def hyde_retrieve(query: str, k: int = 10) -> list:
    # Generate hypothetical document - used only for embedding, never shown to user
    hypothesis = generate_hypothesis.invoke({"query": query})
    # Retrieve using the hypothesis embedding (semantically richer than query alone)
    results = vectorstore.similarity_search_by_vector(
        embeddings.embed_query(hypothesis), k=k
    )
    return results

# Usage: sparse or ambiguous query benefits most from HyDE
# "What causes context rot in transformer models?"
#   -> hypothesis: detailed technical paragraph about attention dilution
#   -> retrieves docs that mention attention, context window, dilution
#   -> much better than embedding the 8-word query directly
results = hyde_retrieve("What causes context rot in transformer models?")
```
GraphRAG: Entity-Relationship Retrieval
Standard vector retrieval excels at pinpoint facts ("What does X mean?") but struggles with global questions ("What themes emerge across this entire corpus?"). GraphRAG, first demonstrated by Microsoft in 2024, builds an entity-relationship graph over the corpus during indexing. At query time, it traverses the graph rather than searching by embedding similarity, enabling theme-level and relationship-level answers with full traceability.
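As a toy illustration of that indexing/traversal split (not Microsoft's implementation), the sketch below builds a co-occurrence graph of entities at index time and answers a relationship query by walking an entity's neighbourhood; entity extraction is stubbed with a placeholder that a real pipeline would replace with an LLM call.

```python
import networkx as nx

def extract_entities(chunk: str) -> list[str]:
    # Placeholder: a real GraphRAG pipeline uses an LLM to extract entities
    # and typed relationships from each chunk at indexing time.
    return [w for w in chunk.split() if w.istitle()]

def build_entity_graph(chunks: list[str]) -> nx.Graph:
    graph = nx.Graph()
    for idx, chunk in enumerate(chunks):
        entities = set(extract_entities(chunk))
        for e in entities:
            graph.add_node(e)
            graph.nodes[e].setdefault("chunks", set()).add(idx)
        # Entities mentioned in the same chunk get a (weighted) edge
        for a in entities:
            for b in entities:
                if a < b:
                    w = graph.get_edge_data(a, b, {}).get("weight", 0)
                    graph.add_edge(a, b, weight=w + 1)
    return graph

def relationship_query(graph: nx.Graph, entity: str, hops: int = 1) -> set[str]:
    # Traverse the graph instead of searching by embedding similarity
    neighbourhood = nx.ego_graph(graph, entity, radius=hops)
    return set(neighbourhood.nodes) - {entity}
```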
| Pattern | Mechanism | Scope |
|---|---|---|
| Naive RAG | Answers pinpoint facts from the closest matching chunk. Fails on cross-document themes or relationships. | Local queries only |
| GraphRAG | Builds an entity-relationship graph at index time. Traverses the graph for theme-level, global, and relationship queries. | Local + global queries |
| Self-RAG | Model decides when to retrieve, critiques its own outputs, and retries when confidence falls below threshold. | Adaptive retrieval |
| Agentic RAG | LLM agent plans multi-step retrieval, selects sources dynamically, self-verifies answers post-generation. | Fully autonomous |
| HyDE | Generates a hypothetical answer, embeds it, retrieves against that richer embedding. Best for sparse/ambiguous queries. | Recall improvement |
| Parent-Child | Indexes small chunks for high-precision retrieval. Expands to parent chunks at generation time for full context. | Precision + context |

🤖 Agentic RAG and the Karpathy Pattern
In April 2026, Andrej Karpathy published a GitHub Gist describing an architectural pattern that drew immediate attention as a potential successor to both naive RAG and vector database approaches for personal and team knowledge management. He called it an LLM Knowledge Base - a persistent, agent-maintained wiki.
The core thesis: instead of using an LLM to perform just-in-time retrieval from a static pool of raw documents, deploy an LLM agent to proactively and continuously compile those documents into a persistent, interconnected, structured knowledge base - a wiki. The heavy cognitive work of reading, extracting entities, identifying relationships, and synthesising conclusions happens once at ingestion. Subsequent queries operate on curated knowledge, not raw documents.
Traditional RAG:

- ✗ Re-discovers knowledge on every query from scratch
- ✗ Wasted compute: the same relationships are re-derived each time
- ✗ No cross-document synthesis at indexing time
- ✗ "Amnesiac" - forgets what it concluded last time
- ✗ Quality bounded by retrieval precision at query time
LLM Knowledge Base (Karpathy pattern):

- ✓ Agent compiles knowledge once, continuously refines it
- ✓ Relationships and syntheses stored in a persistent graph
- ✓ Cross-document entities linked at ingestion, not query time
- ✓ Queries answer from a curated Markdown wiki, not raw docs
- ✓ Works at ~100 articles / ~400K words without vector infra
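To ground the "compile once, query from structure" idea, here is a heavily simplified ingestion sketch in the spirit of the pattern (not Karpathy's implementation): each new document is read once by an LLM, which updates a per-topic Markdown note in a local wiki directory. The prompt wording, file layout, and model name are assumptions.

```python
from pathlib import Path
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

WIKI_DIR = Path("wiki")  # local Markdown knowledge base
WIKI_DIR.mkdir(exist_ok=True)

llm = ChatAnthropic(model="claude-sonnet-4-5")  # model name mirrors earlier examples

compile_prompt = ChatPromptTemplate.from_template(
    """You maintain a Markdown wiki. Here is the existing note for '{topic}':

{existing_note}

Here is a new source document:

{document}

Rewrite the note, folding in new facts, linking related topics as [[WikiLinks]],
and flagging contradictions explicitly. Return only the updated note."""
)

compile_note = compile_prompt | llm | StrOutputParser()

def ingest(topic: str, document: str) -> None:
    """Heavy cognitive work happens once, at ingestion - not on every query."""
    note_path = WIKI_DIR / f"{topic}.md"
    existing = note_path.read_text() if note_path.exists() else "(empty)"
    note_path.write_text(compile_note.invoke({
        "topic": topic, "existing_note": existing, "document": document
    }))

# Queries then read from the curated wiki (structured, deduplicated, already synthesised)
# instead of re-deriving relationships from raw documents every time.
```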
Karpathy's original implementation is built for personal use (Obsidian, local Markdown files). The community has been quick to identify the gap: scaling to enterprise environments with thousands of employees, millions of records, and tribal knowledge that contradicts itself across teams requires server-side, transactional, and secure knowledge layer infrastructure. Epsilla's Semantic Graph and similar enterprise products are the current direction. The principle - compile once, query from structure - is sound. The implementation must be adapted for production.
📊 Measurement, Evaluation, and LLM-as-Judge
A context engineering pipeline without measurement is guesswork. Production RAG systems require systematic evaluation across four dimensions: retrieval quality, generation accuracy, system latency, and token cost. In 2026, LLM-as-judge has become the dominant approach for automated quality evaluation, supplementing but not replacing human annotation for high-stakes deployments.
```python
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field
from typing import Literal

class RAGEvalResult(BaseModel):
    faithfulness: Literal["yes", "no", "partial"] = Field(
        description="Does the answer contain only facts from the provided context?"
    )
    relevance: int = Field(ge=1, le=5, description="1=irrelevant, 5=perfectly relevant")
    completeness: int = Field(ge=1, le=5, description="1=major gaps, 5=fully addresses question")
    explanation: str = Field(description="One sentence explanation of the scores")

eval_prompt = ChatPromptTemplate.from_template(
    """You are an expert evaluator for RAG systems. Given a question, a retrieved context,
and a generated answer, evaluate the answer.

Question: {question}

Retrieved Context: {context}

Generated Answer: {answer}

Evaluate faithfulness (does the answer stick to context?), relevance, and completeness.
Return valid JSON matching the schema."""
)

# Use claude-sonnet for evaluation (strong reasoning), haiku for generation (cost)
judge_llm = ChatAnthropic(model="claude-sonnet-4-5").with_structured_output(RAGEvalResult)
evaluator = eval_prompt | judge_llm

def evaluate_answer(question: str, context: str, answer: str) -> RAGEvalResult:
    return evaluator.invoke({
        "question": question, "context": context, "answer": answer
    })

# Run evaluation over your test set
results = [evaluate_answer(q, c, a) for q, c, a in test_cases]
avg_faithfulness = sum(1 for r in results if r.faithfulness == "yes") / len(results)
avg_relevance = sum(r.relevance for r in results) / len(results)
print(f"Faithfulness: {avg_faithfulness:.1%}  Relevance: {avg_relevance:.2f}/5")
```
🛡 Security, Governance, and Anti-Patterns
The recurring anti-patterns and their fixes:

- Unbounded conversation buffers. ConversationBufferMemory and ConversationChain encourage this pattern and are now deprecated. LangGraph's SqliteSaver or PostgresSaver checkpointers manage history per thread_id automatically. Trim with trim_messages() from langchain_core to enforce a hard token ceiling before every LLM call (a sketch follows below).
- Relying on the prompt to enforce context limits. Enforce a max_tokens ceiling in your context construction, not in your prompt. Use tiktoken to count tokens before constructing the final payload. Emit a context_utilisation_pct metric to your observability stack; alert at >85%.
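A minimal sketch of the hard-ceiling trim, assuming langchain_core's trim_messages accepts a custom token counter over a message list; the counter, overhead constant, and limit are illustrative.

```python
import tiktoken
from langchain_core.messages import trim_messages

enc = tiktoken.get_encoding("cl100k_base")

def count_message_tokens(messages: list) -> int:
    # Rough count: content tokens plus a small per-message overhead
    return sum(len(enc.encode(m.content)) + 4 for m in messages)

def enforce_history_ceiling(messages, max_tokens: int = 25_000):
    """Hard ceiling on conversation history before every LLM call."""
    return trim_messages(
        messages,
        max_tokens=max_tokens,
        strategy="last",                     # keep the most recent turns
        token_counter=count_message_tokens,  # or pass the chat model itself
        include_system=True,                 # never drop the system prompt
        start_on="human",                    # trimmed history starts on a user turn
    )
```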
⚖️ Choosing the Right Architecture

The right context architecture depends on your workload topology, latency budget, and knowledge base characteristics. The table below maps the key decision axes to the architectural pattern that performs best.
| Workload Type | KB Size | Query Pattern | Architecture | Why |
|---|---|---|---|---|
| Single-turn Q&A | <10K docs | Pinpoint facts | Naive RAG | Pipeline overhead unjustified. Simple LCEL chain with top-k retrieval is sufficient. |
| Multi-turn chatbot | Any | Context-dependent | Context Engine | Memory accumulation and token budget management become critical after turn 3. |
| Enterprise search | >100K docs | Mixed precision/recall | Advanced RAG | Hybrid retrieval + cross-encoder re-ranking required for acceptable precision at scale. |
| Thematic analysis | Corpus-level | Global questions | GraphRAG | Entity relationships needed for cross-document theme queries. Vector retrieval alone fails. |
| Research assistant | ~100 articles | Synthesis + recall | Karpathy Pattern | Agent-maintained wiki eliminates repeated re-derivation. Works without vector infrastructure. |
| Autonomous agent | Dynamic | Multi-step reasoning | Agentic RAG | Agent selects sources, retries on low confidence, self-verifies post-generation. |
| Full codebase reasoning | Entire repo | Cross-file dependency | Long Context | Entire codebase processing provides clear value where AST + embedding search fails at scale. |
The context engine doesn't decide what to retrieve. It decides what the model actually sees. That distinction - between retrieving information and managing the context that shapes reasoning - is the architectural insight that separates production systems from demos.
Build the Missing Layer
RAG systems break when context grows beyond a few turns - not because retrieval fails, but because nothing is managing what enters the context window. A context engine is not a luxury; it is the component that converts a demo into a production system.
The practical path: start with hybrid retrieval and a cross-encoder re-ranker. Add explicit token budget allocation. Implement rolling summarisation for history. Instrument context_utilisation_pct and faithfulness scores from day one. For theme-level queries, evaluate GraphRAG. For agent-maintained knowledge, evaluate the Karpathy Pattern.
The context engineering discipline is still maturing - but the core insight is durable: the critical bottleneck has shifted from model capability to context quality. Engineers who build the layer that controls what the model sees will build the systems that actually work.