1. Document Analysis with LLMs
Build a production-ready document Q&A pipeline from scratch. Load PDFs, Word files and web pages, split them into semantic chunks, embed with open-source models, store in a vector DB, and answer questions with full source attribution — zero paid APIs required.
Document Analysis
with LLMs
From blank terminal to a fully working RAG pipeline — ingest any PDF, chunk it semantically, embed it into a local vector store, and answer natural-language questions with sourced, Pydantic-validated answers.
📄Why This Matters
Every organisation accumulates PDFs. Research papers, compliance documents, technical specifications, financial reports, contracts — the knowledge locked inside these files is either retrieved manually (slow, expensive, inconsistent) or simply ignored.
The naive approach — paste the whole document into ChatGPT — breaks down immediately in any real scenario. A 60-page technical specification is ~45,000 tokens, which exceeds most context windows and costs dollars per query. More importantly, pasting everything gives the LLM no retrieval discipline: it hallucinates freely because it cannot distinguish what was actually in the document from what it knows from training.
This is not a research prototype. Notion AI, Confluence AI Assist, Adobe Acrobat AI, and every legal contract analysis tool you have encountered are all RAG pipelines at their core.
By the end of this article you will have built a complete, production-structured pipeline. You will understand not just how to call the LangChain APIs, but why each architectural decision was made and what breaks if you choose differently.
🏗️Architecture Overview
The pipeline has two distinct phases — Ingestion (runs once per document) and Query (runs per question). These are deliberately separate: embedding is expensive and happens infrequently; querying is cheap and happens constantly.
- ✓PyMuPDF reads the PDF text stream directly — 10× faster than rendering
- ✓pdfplumber uses whitespace analysis — the only reliable way to extract table cells
- ✓Using both: speed for the common case, accuracy for structured data
- ✓Embedding is expensive (~30s for 100 pages) — do it once
- ✓Query is cheap (~2s) — do it hundreds of times
- ✓ChromaDB persists to disk — no re-embedding on restart
⚙️Technology Stack
PyMuPDF 1.24
Fast text + metadata extraction. Reads the PDF content stream directly — 10× faster than OCR-based alternatives. Import as fitz.
pdfplumber 0.11
Table detection via whitespace and line analysis. Chosen over PyMuPDF table mode for reliability on real-world PDFs with merged cells and irregular layouts.
ChromaDB 0.5
Embedded vector database — no server, no Docker. Persists to SQLite on disk. Supports metadata filtering and MMR search. Handles up to ~1M vectors in embedded mode.
all-MiniLM-L6-v2
22M-parameter local embedding model. Runs on CPU, 384 dimensions, ~2 000–5 000 sentences/sec depending on hardware and batch size. Free, private, no API key. Downloaded once and cached by sentence-transformers.
LangChain 0.3 LCEL
Pipe-syntax chains: retriever | prompt | llm | parser. Every LCEL chain is automatically async-capable and LangSmith-traceable. No deprecated LLMChain.
Pydantic v2
Runtime validation at every data boundary. DocumentChunk, QAResponse, ChunkingConfig. PydanticOutputParser forces the LLM to produce schema-compliant JSON.
ChromaDB vs FAISS: FAISS is the industry standard for billion-scale ANN search but is in-memory only — you must serialize the index yourself. ChromaDB persists to disk by default and supports metadata filtering. For a CLI tool querying <50k chunks, ChromaDB is the right choice.
🔧Environment Setup
# Python 3.11+ required — verify first python --version # Must be 3.11.x or 3.12.x # Create and activate virtual environment python -m venv .venv source .venv/bin/activate # Linux/macOS # .venv\Scripts\activate # Windows # Install pinned dependencies pip install --upgrade pip pip install -r requirements.txt # Ollama (optional — local LLM, no API key) # Download from https://ollama.com, then: ollama pull llama3.1:8b # ~4.7 GB download
# LangChain ecosystem langchain==0.3.13 langchain-core==0.3.28 langchain-community==0.3.13 langchain-openai==0.2.11 langchain-ollama==0.2.3 langchain-chroma==0.1.4 langchain-huggingface==0.1.2 langchain-text-splitters==0.3.4 # PDF PyMuPDF==1.24.14 pdfplumber==0.11.4 # Vector store & embeddings chromadb==0.5.18 sentence-transformers==3.3.1 # Validation & config pydantic==2.10.3 python-dotenv==1.0.1 # CLI output rich==13.9.4
The project directory after setup:
rag-document-analysis/ ├── .venv/ # virtual environment — never commit ├── .chroma/ # ChromaDB storage — auto-created, never commit ├── requirements.txt ├── .env # NEVER commit — holds OPENAI_API_KEY ├── models.py # Pydantic v2 data models ├── ingestion.py # PyMuPDF + pdfplumber extraction ├── chunking.py # Fixed-size and recursive splitters + benchmark ├── vectorstore.py # ChromaDB management ├── chain.py # LCEL Q&A chain + map-reduce summarisation └── cli.py # argparse CLI: ingest / ask / summarize / benchmark
Add .env and .chroma/ to your .gitignore before the first commit. .env contains your OpenAI API key. .chroma/ contains your document embeddings — large, reproducible, and unnecessary in git history.
🔨Implementation — 7 Phases
Each phase produces a tested, runnable artefact before you move to the next. The phases are ordered so that every module only depends on modules from earlier phases.
Define data contracts first. Every module boundary in this pipeline is typed with a Pydantic model — this gives us runtime validation, IDE autocompletion, and the schema that PydanticOutputParser injects into the LLM prompt.
from pydantic import BaseModel, ConfigDict, field_validator import hashlib class DocumentChunk(BaseModel): """Immutable chunk of text from a PDF page.""" model_config = ConfigDict(frozen=True) chunk_id: str # 16-char SHA256 — stable across re-runs source_file: str # Absolute path — stable chunk IDs regardless of cwd page_number: int # 1-based — matches what users see in PDF viewers chunk_index: int # 0-based — reconstructs reading order text: str word_count: int char_count: int @field_validator("text") @classmethod def text_not_empty(cls, v: str) -> str: if not v.strip(): raise ValueError("Chunk text cannot be empty") return v.strip() @classmethod def create(cls, source_file, page_number, chunk_index, text): s = text.strip() chunk_id = hashlib.sha256( f"{source_file}:{chunk_index}:{s[:64]}".encode() ).hexdigest()[:16] return cls(chunk_id=chunk_id, source_file=source_file, page_number=page_number, chunk_index=chunk_index, text=s, word_count=len(s.split()), char_count=len(s)) class QAResponse(BaseModel): """Structured answer — enforced on LLM output via PydanticOutputParser.""" model_config = ConfigDict(frozen=True) question: str answer: str source_chunk_ids: list[str] page_references: list[int] confidence_explanation: str @field_validator("page_references") @classmethod def sort_pages(cls, v) -> list[int]: return sorted(set(v)) # [5,3,5] → [3,5] automatically
LangChain 0.3.x uses Pydantic v2 internally. Always use v2 syntax: @field_validator with @classmethod, ConfigDict, model_dump(). Using the deprecated v1 @validator will produce warnings and may silently break PydanticOutputParser.get_format_instructions().
Two libraries, each doing what it does best. PyMuPDF reads the PDF text stream directly for speed. pdfplumber uses whitespace and line analysis for reliable table cell detection.
import fitz # PyMuPDF — package: PyMuPDF, import: fitz import pdfplumber def extract_text_with_pymupdf(pdf_path): pages = [] with fitz.open(str(pdf_path)) as doc: for i in range(len(doc)): text = doc[i].get_text("text") # plain text with paragraph \n if text.strip(): pages.append((i + 1, text)) # 1-based page number return pages def extract_tables_with_pdfplumber(pdf_path): tables = [] with pdfplumber.open(str(pdf_path)) as pdf: for i, page in enumerate(pdf.pages): try: raw_tables = page.extract_tables() except Exception as e: logger.warning("pdfplumber error page %d: %s", i+1, e) continue for j, raw in enumerate(raw_tables or []): try: tables.append(TableData.from_pdfplumber_table(raw, i+1, j)) except ValueError: pass # skip empty/malformed tables return tables
pdfplumber returns None for merged or empty cells. Always replace with an empty string before processing — a simple [cell or "" for cell in row] handles this. Forgetting this is the most common pdfplumber crash.
Chunking is the single most impactful decision in a RAG pipeline. Two strategies are implemented and benchmarked side by side.
- →Splits every N characters, overlap M chars
- →Fast and predictable
- ✗Cuts mid-sentence — degrades retrieval
- ✗High σ (chunk size variance)
- →Use when: token-budget precision required
- →Tries \n\n → \n → ". " → " " in order
- ✓Preserves paragraph and sentence boundaries
- ✓Lower σ = more consistent retrieval
- ✓Default for most prose documents
- →Use when: natural language text
from langchain_text_splitters import RecursiveCharacterTextSplitter def chunk_with_recursive_splitter(pages, source_file, config): splitter = RecursiveCharacterTextSplitter( # Try each separator in order — uses finest split only when needed separators=["\n\n", "\n", ". ", "! ", "? ", " ", ""], chunk_size=config.chunk_size, # default: 512 chars chunk_overlap=config.chunk_overlap, # default: 64 chars length_function=len, ) chunks, idx = [], 0 for page_num, page_text in pages: for text in splitter.split_text(page_text): if len(text.split()) < config.min_chunk_words: continue # drop headers, footers, page numbers chunks.append(DocumentChunk.create(source_file, page_num, idx, text)) idx += 1 return chunks
Running the benchmark on a 15-page research paper shows why standard deviation matters more than average size:
| Metric | Fixed-size | Recursive | Winner |
|---|---|---|---|
| Total chunks | 52 | 47 | — |
| Avg size (chars) | 489 | 502 | — |
| Std deviation (σ) | 143 | 98 | ✓ Recursive |
| Min size | 38 | 61 | ✓ Recursive |
| Max size | 512 | 512 | Tie |
| Sentence boundaries preserved | ~60% | ~95% | ✓ Recursive |
Three operations: create/load, add chunks idempotently, retrieve with MMR diversity.
from langchain_chroma import Chroma from langchain_huggingface import HuggingFaceEmbeddings # Normalise embeddings: required for cosine similarity = dot product embeddings = HuggingFaceEmbeddings( model_name="sentence-transformers/all-MiniLM-L6-v2", model_kwargs={"device": "cpu"}, encode_kwargs={"normalize_embeddings": True}, ) # Load existing collection OR create new — same code path vs = Chroma(collection_name="document_analysis", embedding_function=embeddings, persist_directory=str(persist_dir)) # MMR retriever: diverse AND relevant results retriever = vs.as_retriever( search_type="mmr", search_kwargs={"k": 4, "fetch_k": 12, "lambda_mult": 0.7}, # lambda_mult: 0.7 = 70% relevance, 30% diversity # Without MMR: might return 4 near-identical chunks from same paragraph )
Before adding chunks, fetch existing IDs from the collection and skip any chunk whose chunk_id already exists. Since chunk_id is a deterministic hash of the source path + index + text, running ingest twice on the same PDF adds zero duplicates.
LangChain Expression Language composes components with the pipe operator. The chain below runs the retriever and the question in parallel, merges them into a prompt, calls the LLM, and validates the output against QAResponse.
from langchain_core.output_parsers import PydanticOutputParser from langchain_core.prompts import ChatPromptTemplate from langchain_core.runnables import RunnableLambda, RunnablePassthrough parser = PydanticOutputParser(pydantic_object=QAResponse) prompt = ChatPromptTemplate.from_messages([ ("system", "You are a precise document assistant. Answer only from context.\n" "{format_instructions}"), ("human", "Context:\n{context}\n\nQuestion: {question}\n\nJSON only."), ]).partial(format_instructions=parser.get_format_instructions()) chain = ( { # Two branches run in parallel: "context": retriever | RunnableLambda(format_retrieved_docs), "question": RunnablePassthrough(), # pass question through unchanged } | prompt | llm # ChatOpenAI(temperature=0) or ChatOllama(temperature=0) | parser # parses JSON → validated QAResponse object ) response = chain.invoke("What is the main finding of this paper?") # response.answer, response.page_references, response.source_chunk_ids
Using temperature > 0 with PydanticOutputParser causes frequent OutputParserException errors — the LLM randomly adds prose around the JSON or omits required fields. Set temperature=0 for any chain that requires structured output.
For documents too large to fit in one LLM context window, map-reduce processes each chunk independently then combines the summaries.
# Map: summarise each chunk independently map_chain = ChatPromptTemplate.from_template( "Summarise in 2-4 sentences. Preserve facts, numbers, entities.\n\nPassage:\n{text}\n\nSummary:" ) | llm | StrOutputParser() # Reduce: combine all summaries into one coherent document reduce_chain = ChatPromptTemplate.from_template( "Combine these section summaries into one coherent document summary.\n\n{summaries}\n\nFinal:" ) | llm | StrOutputParser() def map_reduce_summarize(chunks): summaries = [map_chain.invoke({"text": c.text if hasattr(c, "text") else c}) for c in chunks] # map combined = "\n\n".join(f"[Section {i+1}]\n{s}" for i,s in enumerate(summaries)) return reduce_chain.invoke({"summaries": combined}) # reduce
The map step is embarrassingly parallel. Replace the list comprehension with await asyncio.gather(*[map_chain.ainvoke({"text": c}) for c in chunks]) to run all map calls concurrently. For a 100-chunk document, this reduces summarisation time from ~200s to ~3s (limited by API rate limits, not latency).
Four subcommands tie all phases together with rich output formatting.
# Ingest a PDF (extract → chunk → embed → store) python cli.py ingest paper.pdf python cli.py ingest paper.pdf --strategy fixed --chunk-size 256 # Ask a natural-language question python cli.py ask paper.pdf "What is the main conclusion?" python cli.py ask paper.pdf "What accuracy was achieved?" --top-k 6 # Map-reduce summary of the whole document python cli.py summarize paper.pdf # Benchmark chunking strategies side-by-side python cli.py benchmark paper.pdf --chunk-size 512
Always call load_dotenv() as the very first statement in cli.py, before any imports that read environment variables. If you import chain.py before calling load_dotenv(), LLM_PROVIDER and OPENAI_API_KEY will be empty when the LLM factory reads them.
▶️Running the System
$ python cli.py ingest attention_is_all_you_need.pdf ──────────────────── Ingestion Pipeline ───────────────────── PDF : attention_is_all_you_need.pdf Strategy : recursive chunk_size=512 overlap=64 ✓ Extracted 15 pages, 2 tables title: Attention Is All You Need author: Vaswani et al., 2017 ✓ Created 78 chunks (~6,240 words total) ✓ Stored in .chroma/attention_is_all_you_need/
$ python cli.py ask paper.pdf "What replaces recurrence in the Transformer?" ──────────────────── Question Answering ───────────────────── Question : What replaces recurrence in the Transformer? Top-K : 4
🔍Troubleshooting
fitz import name but the package is called PyMuPDF. Some environments have an unrelated fitz package installed that shadows it.python -c "import os; from dotenv import load_dotenv; load_dotenv(); print(repr(os.getenv('OPENAI_API_KEY')))" — if it prints None, your .env is not in the current working directory.🚀Performance & Production
| Bottleneck | Dev Setup | Production Fix | Speedup |
|---|---|---|---|
| Embedding speed | CPU, ~1k chunks/min | GPU (device="cuda") or OpenAI Batch API | 10–20× |
| Summarisation latency | Sequential map (200s for 100 chunks) | asyncio.gather() — concurrent map calls | 40–100× |
| ChromaDB query | Flat index O(N) | HNSW index for >100k vectors | 10–50× |
| Q&A cost | ~$0.0001/query | Redis cache keyed by hash(question) | 40–60% cost reduction |
extract_text_with_pymupdf.🎯Conclusion
You have built a complete, production-structured RAG pipeline for PDF document analysis. The system does everything commercial document analysis tools do at their core: reliable dual-library PDF extraction, semantically-aware chunking, persistent local vector storage with MMR retrieval, and LLM generation with Pydantic-validated structured output.
Chunking strategy matters more than LLM choice
A recursive splitter with good parameters outperforms a fixed-size splitter regardless of which LLM you use. Invest time in chunking before tuning prompts.
MMR prevents the "same paragraph 4 times" problem
Pure similarity search returns near-identical results. MMR ensures diverse, complementary chunks — the LLM sees 4 different aspects of the document, not the same fact repeated.
Pydantic models at every boundary
Validated data contracts catch bugs at the construction site, not three layers deep in the LLM prompt. QAResponse as the output schema makes the LLM accountable for its structure.
Idempotent ingestion from the start
Deterministic chunk IDs + deduplication check = safe to run ingest as many times as needed. This is cheap to build now and essential once you have a corpus of 50+ documents.
What to Build Next
The next article in this series builds the same pipeline without LangChain first — raw Python, NumPy cosine similarity, direct API calls — to demystify what the framework actually abstracts. Once you see what the abstraction hides, LCEL's value becomes obvious.
All source code for this article is available in the article-01/ directory of the series repository.