2. Build Your First RAG System from Scratch

Strip away every framework and build RAG with raw Python — numpy cosine similarity, direct OpenAI calls, no LangChain. Understand exactly what the abstractions hide before you use them.

Series · Article 2 of 10

Demystify retrieval-augmented generation by building it in raw Python first — no framework, just NumPy and FAISS — then refactor to LangChain LCEL to see exactly what the abstraction buys you.

⏱ ~50 min build 🔧 faiss · numpy · langchain-lcel · pydantic-v2 📦 Builds on Article 1

🔍Why Build Without the Framework?

In Article 1, we built a complete RAG pipeline using LangChain and ChromaDB. It worked well — but if you have ever stared at a LangChain chain and wondered what is actually happening inside that pipe operator, this article is for you.

The goal is not to avoid frameworks. The goal is to reach the point where, when you use a framework, every line of its code makes sense to you. There is a specific moment when this clicks: you have built the thing yourself, and then you see how the framework collapses five files into three lines. From that moment on, you know exactly what to reach for when the framework breaks.

A developer who has built RAG from scratch reads a LangChain stack trace in seconds. A developer who only used LangChain reads the same trace and opens a GitHub issue.

This article builds the same pipeline twice. The first time, every piece is explicit Python code: a custom chunker, direct sentence-transformers calls, NumPy matrix operations for cosine similarity, FAISS index management, and raw OpenAI SDK calls. The second time, we replace those ~200 lines with a ~35-line LCEL pipeline whose core chain is nine lines long. Same inputs, same return type, same logic.

By the end you will understand: what cosine similarity is (not just how to call it), why FAISS IndexFlatIP is appropriate for normalised embeddings, what each LangChain component actually does at the Python level, and when to use FAISS versus ChromaDB.

  • ~200 lines, raw pipeline: explicit chunker + embedder + FAISS + LLM call, every step visible
  • ~35 lines, LCEL pipeline: same logic via LangChain abstractions, same result, 6× less code
  • 6 Python files: models, embedder, chunker, faiss_store, rag_raw, rag_lcel, one concern each
  • 0 hidden steps: every transformation in the raw pipeline is an explicit function call you wrote

🏗️Two Pipelines, One Goal

Both pipelines perform the same two operations: ingestion (turning a text file into a searchable index) and querying (retrieving relevant passages and generating a grounded answer). The difference is entirely in how much of the plumbing is explicit Python versus framework convention.

Ingestion pipeline (shared by both Raw and LCEL variants):
  • 📄 Read: plain text file → UTF-8 string
  • ✂️ Chunk: sentence-boundary split with overlap window
  • 🔢 Embed: all-MiniLM-L6-v2 → 384-dim unit vectors
  • 💾 Index: FAISS IndexFlatIP, persisted to disk

Query pipeline (same flow, different implementation):
  • Question: natural language user query string
  • 🔢 Embed: same model → 384-dim query vector
  • 🔎 Search: FAISS inner-product top-k retrieval
  • 📝 Prompt: format context + question into LLM prompt
  • 🤖 Generate: LLM produces grounded answer

The raw pipeline implements every arrow in those diagrams as an explicit Python function. The LCEL pipeline implements the same arrows using LangChain's pipe operator — each | character is one of those arrows.

📐 Design note

Both pipelines share the same RAGResponse Pydantic model as their return type. This means they are drop-in replacements for each other — the compare CLI command can run both and display results side-by-side because they speak the same interface.

🧰Technology Stack

Every library is pinned. Every version was tested together. The stack is deliberately minimal — no database server, no API server, no Docker container.

VECTOR SEARCH

FAISS 1.8.0

Facebook AI Similarity Search. Exact nearest-neighbour on CPU. IndexFlatIP performs exhaustive inner-product search — perfect for normalised embeddings up to ~1M vectors.

EMBEDDINGS

sentence-transformers 3.3.1

Used directly — not via LangChain — in the raw pipeline. all-MiniLM-L6-v2 produces 384-dimensional embeddings in under 5ms per chunk on CPU.

NUMERICS

NumPy 1.26.4

Matrix operations for the manual cosine similarity implementation. The key insight: for unit vectors, matrix @ query_vec produces cosine scores directly.

LLM FRAMEWORK

LangChain 0.3.13

Used only in the LCEL pipeline. Provides RecursiveCharacterTextSplitter, HuggingFaceEmbeddings, FAISS wrapper, ChatOpenAI, and the pipe-syntax chain builder.

VALIDATION

Pydantic v2 (2.10.3)

DocumentChunk, EmbeddedChunk, RetrievedChunk, RAGResponse, IndexMetadata — all frozen models with field validators. Shared by both pipelines.

LLM PROVIDER

OpenAI SDK 1.58.1

Used directly (no LangChain wrapper) in the raw pipeline. The Ollama OpenAI-compatible endpoint is also supported — set LLM_PROVIDER=ollama to run fully local.

Install everything at once:

bash
# Create and activate a virtual environment
python3.11 -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate

# Install pinned dependencies
pip install -r requirements.txt

# Copy env template and fill in your API key
cp env.example .env
nano .env                   # or: echo "OPENAI_API_KEY=sk-..." >> .env
⚠️ PyTorch CPU vs CUDA

requirements.txt pins torch==2.5.1 (CPU build). If you have a CUDA GPU, replace it with torch==2.5.1+cu121 from the PyTorch index. Embedding time drops from ~5ms to ~0.8ms per batch — significant for large corpora, irrelevant for this tutorial.

📐Embeddings from First Principles

Before writing a line of code, it is worth understanding exactly what an embedding is and why cosine similarity works. Most tutorials skip this, which is why most developers can use embeddings but cannot reason about them when something goes wrong.

What is an embedding?

An embedding model is a function that maps any text string to a fixed-length vector of floating-point numbers. The all-MiniLM-L6-v2 model produces vectors of length 384. The key property — the property that makes embeddings useful — is that semantically similar texts produce geometrically similar vectors.

Two sentences about the same topic will produce vectors that point in roughly the same direction in 384-dimensional space. Two sentences about completely different topics will produce vectors that are close to orthogonal, and only occasionally point in opposing directions.

Cosine similarity as dot product

Cosine similarity measures the angle between two vectors, ignoring their magnitude:

cos(θ) = (a · b) / (||a|| × ||b||)
Where a · b is the dot product and ||a|| is the L2 norm (magnitude) of vector a

The result ranges from −1 (opposite directions) to +1 (identical direction). For text embeddings, scores typically fall in roughly [0.2, 0.9]. Negative values are rare because text embeddings do not span the full vector space symmetrically, but they can occur for unrelated texts.

Here is the critical insight that makes our implementation efficient: if we normalise all vectors to unit length before storing them (so ||a|| = ||b|| = 1), the formula simplifies to just the dot product:

cos(θ) = a · b [when ||a|| = ||b|| = 1]
A normalised dot product IS cosine similarity — no division needed

This is why we set normalize_embeddings=True in the sentence-transformers call, and why we use FAISS IndexFlatIP (inner product = dot product) rather than IndexFlatL2 (Euclidean distance). For normalised vectors, inner product search IS cosine similarity search.
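The simplification is easy to check numerically. The sketch below uses two small made-up vectors (not real embeddings) and compares the textbook formula against a plain dot product of the normalised vectors; the helper names are illustrative only.

```python
import numpy as np

def cosine_full(a: np.ndarray, b: np.ndarray) -> float:
    """Textbook cosine similarity: dot product divided by both norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def normalise(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit L2 length."""
    return v / np.linalg.norm(v)

# Two arbitrary (hypothetical) vectors standing in for embeddings.
a = np.array([3.0, 4.0, 0.0])
b = np.array([4.0, 3.0, 0.0])

full = cosine_full(a, b)                               # (12 + 12) / (5 * 5) = 0.96
shortcut = float(np.dot(normalise(a), normalise(b)))   # dot product of unit vectors

print(full, shortcut)
```

Both values agree up to floating-point rounding, which is the property that lets an inner-product index stand in for cosine search.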

python — embedder.py (core logic)
import numpy as np
from sentence_transformers import SentenceTransformer

def embed_texts(texts: list[str]) -> np.ndarray:
    """Return L2-normalised float32 matrix, shape (N, 384)."""
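    # Loaded inline to keep this excerpt self-contained; Step 2 shows the
    # get_model() singleton that avoids reloading the model on every call.
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")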
    return model.encode(
        texts,
        normalize_embeddings=True,  # ||v|| = 1 for every row
        convert_to_numpy=True,
    ).astype(np.float32)

def cosine_similarity_matrix(query_vec: np.ndarray, corpus: np.ndarray) -> np.ndarray:
    """
    For unit vectors: inner product == cosine similarity.
    corpus @ query_vec multiplies each row of corpus by query_vec.
    Result shape: (N,) — one score per corpus chunk.
    """
    scores = corpus @ query_vec        # (N, 384) @ (384,) → (N,)
    return np.clip(scores, -1.0, 1.0)  # cosine lies in [-1, 1]; clamp float precision drift

This is the entire cosine similarity implementation: two lines of NumPy. The complexity people associate with "vector search" is all in efficiently indexing millions of vectors. At the scale of one document (hundreds of chunks), brute-force matrix multiplication on CPU completes in under a millisecond.
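To turn those scores into a retrieval result, you only need to pick the k largest. A minimal sketch, assuming unit-length rows; the function name top_k and the 2-D toy vectors are illustrative and not part of the article's modules.

```python
import numpy as np

def top_k(query_vec: np.ndarray, corpus: np.ndarray, k: int = 4) -> list[tuple[int, float]]:
    """Return (row_index, score) pairs for the k highest inner products."""
    scores = corpus @ query_vec            # (N, D) @ (D,) -> (N,)
    k = min(k, len(scores))                # never ask for more rows than exist
    order = np.argsort(-scores)[:k]        # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]

# Hypothetical 2-D unit vectors standing in for 384-D embeddings.
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype=np.float32)
query = np.array([1.0, 0.0], dtype=np.float32)

hits = top_k(query, corpus, k=2)
print(hits)
```

This is the same job an exhaustive inner-product index performs: score every row, keep the best k.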

Why 384 dimensions?

The all-MiniLM-L6-v2 model was designed for sentence-level semantic similarity. Its 384-dimensional output is a deliberate trade-off: large enough to capture rich semantic meaning, small enough to be fast on CPU and memory-efficient at scale. A 1,000-chunk corpus occupies just 1000 × 384 × 4 bytes = 1.5 MB of RAM as a NumPy matrix.

💡 Embedding model parity

Both the raw pipeline and the LCEL pipeline use exactly the same model (all-MiniLM-L6-v2), with normalize_embeddings=True in both cases. This is not a coincidence — you must use the same model at ingestion time and at query time. If you swap the model, all stored vectors become invalid and you must re-embed everything.

🔧Phase 1 — The Raw Python Pipeline

We build the pipeline bottom-up: chunker → embedder → FAISS store → LLM caller → full pipeline class. Each component is a standalone Python file with no LangChain imports.

Step 1: The Chunker

The chunker in chunker.py does one thing: split a large text into smaller overlapping pieces. The algorithm is sentence-aware: it splits on sentence endings (.!?) to avoid cutting a thought mid-sentence, then accumulates sentences until the character budget is exhausted, then carries forward an overlap window so adjacent chunks share context.

python — chunker.py
import re
from models import DocumentChunk

def chunk_text(
    text: str, source: str,
    chunk_size: int = 512,
    chunk_overlap: int = 64,
) -> list[DocumentChunk]:

    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current, current_len, chunk_index = [], [], 0, 0

    def _flush():
        nonlocal current, current_len, chunk_index
        raw = " ".join(current).strip()
        if len(raw.split()) >= 5:
            chunks.append(DocumentChunk.create(raw, source, chunk_index))
            chunk_index += 1
        # Carry forward overlap: keep last N chars of sentences
        overlap_buf, overlap_len = [], 0
        for sent in reversed(current):
            if overlap_len + len(sent) + 1 <= chunk_overlap:
                overlap_buf.insert(0, sent)
                overlap_len += len(sent) + 1
            else: break
        current[:] = overlap_buf
        current_len = overlap_len

    for sentence in sentences:
        if current and current_len + len(sentence) > chunk_size:
            _flush()
        current.append(sentence)
        current_len += len(sentence) + 1

    if current:
        raw = " ".join(current).strip()
        if len(raw.split()) >= 5:
            chunks.append(DocumentChunk.create(raw, source, chunk_index))
    return chunks

The DocumentChunk.create() factory generates a deterministic 16-character hex ID from a SHA-256 hash of source:chunk_index:text[:64]. This means re-ingesting the same document produces identical IDs — useful for detecting and skipping duplicates.
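The ID scheme described above can be sketched in a few lines. This is an illustration of the idea rather than the actual body of DocumentChunk.create(); the function name make_chunk_id is hypothetical.

```python
import hashlib

def make_chunk_id(source: str, chunk_index: int, text: str) -> str:
    """Deterministic 16-hex-char ID from source, position and a text prefix."""
    key = f"{source}:{chunk_index}:{text[:64]}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]

sample = "FAISS is a library for similarity search."
id_a = make_chunk_id("doc.txt", 0, sample)
id_b = make_chunk_id("doc.txt", 0, sample)   # same inputs, same ID
id_c = make_chunk_id("doc.txt", 1, sample)   # different position, different ID

print(id_a, id_a == id_b, id_a == id_c)
```

One consequence of hashing only the first 64 characters: an edit further into the chunk does not change its ID, so a duplicate check based on IDs alone will not notice that change.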

Step 2: The Embedder

The embedder in embedder.py wraps a singleton SentenceTransformer instance (loaded once, reused across calls) and exposes two functions: embed_texts() for raw strings and embed_chunks() for DocumentChunk objects.

python — embedder.py (singleton pattern)
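import logging

from sentence_transformers import SentenceTransformer

from models import DocumentChunk, EmbeddedChunk

logger = logging.getLogger(__name__)
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # same model as embed_texts()

_model: SentenceTransformer | None = None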

def get_model() -> SentenceTransformer:
    global _model
    if _model is None:
        logger.info("Loading embedding model: %s", MODEL_NAME)
        _model = SentenceTransformer(MODEL_NAME)
    return _model  # ~400 MB model, loaded once per process

def embed_chunks(chunks: list[DocumentChunk]) -> list[EmbeddedChunk]:
    texts = [c.text for c in chunks]
    matrix = embed_texts(texts)      # shape (N, 384), normalised
    return [
        EmbeddedChunk(chunk=chunk, embedding=matrix[i].tolist())
        for i, chunk in enumerate(chunks)
    ]

The singleton pattern is important here. Loading the all-MiniLM-L6-v2 model from disk takes about 800ms and allocates ~400 MB of RAM. If get_model() created a new instance on every call and chunks were embedded one at a time, embedding 200 chunks would require 200 model loads: a 160-second overhead (200 × 800ms) versus a few seconds with the singleton.
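The same load-once behaviour can also be had from the standard library. The sketch below is an alternative to the module-level global, not the article's implementation: FakeModel is a hypothetical stand-in for SentenceTransformer so the example runs offline, and load_count exists only to show how often the expensive step runs.

```python
from functools import lru_cache

load_count = 0  # counts how many times the expensive load actually runs

class FakeModel:
    """Hypothetical stand-in for SentenceTransformer so the sketch runs offline."""
    def __init__(self, name: str) -> None:
        self.name = name

@lru_cache(maxsize=1)
def get_model_cached(name: str = "all-MiniLM-L6-v2") -> FakeModel:
    """Load once per process; later calls return the cached instance."""
    global load_count
    load_count += 1
    return FakeModel(name)

first = get_model_cached()
second = get_model_cached()

print(first is second, load_count)
```

Both calls return the same object and the loader body runs once, which is the property that matters for the 800ms load.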

Step 3: FAISS Index Wrapper

The FAISSStore class in faiss_store.py wraps three FAISS operations: add() (insert normalised vectors), search() (embed a query string and retrieve top-k), and save/load persistence.

python — faiss_store.py (search method)
def search(self, query: str, k: int = 4) -> list[RetrievedChunk]:
    if self.index.ntotal == 0:
        return []

    query_vec = embed_texts([query])          # (1, 384)
    faiss.normalize_L2(query_vec)             # safety: double-normalise
    n = min(k, self.index.ntotal)
    distances, indices = self.index.search(query_vec, n)

    results = []
    for rank, (dist, idx) in enumerate(zip(distances[0], indices[0]), start=1):
        if idx == -1:   # FAISS pads with -1 when ntotal < k
            continue
        results.append(RetrievedChunk(
            chunk=self.chunks[idx],
            score=float(np.clip(dist, 0.0, 1.0)),
            rank=rank,
        ))
    return results

Notice faiss.normalize_L2(query_vec): we call this even though embed_texts() already normalises. This is a defensive pattern: if the query vector arrives from any other source, or if floating-point operations have introduced drift, we ensure it is unit-length before the inner-product search. The cost is negligible (one norm computation and 384 divisions per query).

Step 4: Direct LLM Call

The raw pipeline calls the LLM directly using the OpenAI Python SDK — no LangChain involved. The function is intentionally small: format a prompt, call the API, return the string.

python — rag_raw.py (LLM call)
_RAW_PROMPT = """\
You are a precise technical assistant.
Answer the question using ONLY the context passages below.
If the answer is not found in the context, respond with exactly:
  "I don't know based on the provided context."

Context:
{context}

Question: {question}

Answer:"""

def _call_llm_direct(prompt: str) -> str:
    import os
    import openai
    client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=1024,
    )
    return response.choices[0].message.content or ""

Step 5: The Complete Raw Pipeline

With those four components in place, the RawRAGPipeline class in rag_raw.py composes them into ingestion and query methods. Every step is an explicit function call — there is no framework magic:

python — rag_raw.py (query method — the full pipeline)
def query(self, question: str, k: int = 4) -> RAGResponse:
    self._ensure_loaded()

    # Step 1: embed question + FAISS inner-product search
    results = self._store.search(question, k=k)

    # Step 2: format retrieved passages as numbered context block
    context = "\n\n---\n\n".join(
        f"[Passage {r.rank} | similarity={r.score:.3f}]\n{r.chunk.text}"
        for r in results
    )

    # Step 3: inject context into prompt template and call LLM
    prompt = _RAW_PROMPT.format(context=context, question=question)
    answer = _call_llm_direct(prompt)

    # Step 4: wrap result in Pydantic model
    return RAGResponse(
        question=question,
        answer=answer,
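        sources=sorted({r.chunk.source for r in results}),  # sorted for stable output
        chunk_ids=[r.chunk.chunk_id for r in results],
        # Guard against an empty index, where results == [] and results[0] would raise.
        confidence_note=(
            f"Top similarity: {results[0].score:.3f}" if results else "No passages retrieved"
        ),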
        retrieved_chunks=len(results),
        pipeline="raw",
    )

Run it:

bash — ingest then query
# Index your document
python cli.py ingest my_document.txt

# Query using the raw pipeline
python cli.py ask-raw my_document.txt "What is the main conclusion?"

# Output includes pipeline="raw" and source chunk IDs

Phase 2 — FAISS Deep Dive

FAISS is a library for efficient similarity search over dense vectors. It was built at Meta AI Research and is one of the most widely used vector search libraries in production AI systems. Understanding its index types is essential for scaling beyond prototype size.

Index types and when to use them

Index Type | Search Method | Memory | Accuracy | Use When
IndexFlatIP | Exact, brute-force inner product | 4 bytes × D × N | 100% | N < 100K vectors (this article)
IndexFlatL2 | Exact, brute-force Euclidean distance | 4 bytes × D × N | 100% | N < 100K, non-normalised vectors
IndexIVFFlat | Approximate, inverted file + flat scan | Slightly larger than Flat | 95–99% | 100K–10M vectors, fast queries needed
IndexHNSWFlat | Approximate, Hierarchical NSW graph | Flat plus graph links (roughly M × 2 × 4 bytes per vector) | 99%+ | 10M+ vectors, latency-critical queries
IndexIVFPQ | Approximate, IVF + Product Quantisation | 8–32× smaller than Flat | 90–95% | Billion-scale, RAM-constrained

For this article's use case — one document, hundreds to low thousands of chunks — IndexFlatIP is the correct choice. It is exact (100% recall), straightforward to understand, requires no training step, and the brute-force search over 1,000 × 384-float vectors completes in under 2ms on any modern CPU.

⚠️ IVF indexes require training

If you decide to use IndexIVFFlat for larger corpora, you must call index.train(matrix) with a representative sample of your vectors before you can call index.add(). An untrained IVF index reports index.is_trained == False, and adding vectors to it fails with an error rather than building a usable index. The flat indexes (IndexFlatIP, IndexFlatL2) require no training and are always add-ready.

Why IndexFlatIP instead of IndexFlatL2?

Both are exact brute-force searches. The difference is the distance metric:

IP · IndexFlatIP
  • Computes inner product (dot product) between query and each stored vector
  • For unit vectors: inner product = cosine similarity
  • Returns scores in [−1, 1] for unit vectors (mostly positive for text); higher is more similar
  • Use this when normalize_embeddings=True
L2 · IndexFlatL2
  • Computes squared Euclidean distance between query and each stored vector
  • Returns distances; lower is more similar (opposite of IP)
  • Convert to similarity for unit vectors: cosine = 1 - (l2_dist / 2)
  • LangChain's FAISS wrapper uses this by default

LangChain's FAISS wrapper uses IndexFlatL2 by default and handles the distance-to-similarity conversion internally. Our raw pipeline uses IndexFlatIP and avoids that conversion entirely by requiring unit vectors — a simpler mental model.
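The conversion between the two metrics follows from expanding the squared distance for unit vectors: ||a − b||² = ||a||² + ||b||² − 2(a · b) = 2 − 2(a · b), so a · b = 1 − ||a − b||² / 2. A quick numerical check, using random vectors as stand-ins for embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 384))     # two random 384-D vectors (not real embeddings)
a /= np.linalg.norm(a)               # normalise to unit length
b /= np.linalg.norm(b)

inner_product = float(a @ b)                   # what IndexFlatIP would score
squared_l2 = float(np.sum((a - b) ** 2))       # what IndexFlatL2 would return
cosine_from_l2 = 1.0 - squared_l2 / 2.0        # conversion valid for unit vectors only

print(inner_product, cosine_from_l2)
```

The two numbers match up to rounding. Note that the identity only holds when both vectors have norm 1; for non-normalised vectors the two indexes can rank results differently.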

Persistence: how it works

FAISS provides faiss.write_index() and faiss.read_index() for serialising the index to and from disk. However, FAISS only knows about vectors — it has no concept of metadata (source file, chunk text, page numbers). We therefore persist three files together:

index.faiss
The FAISS binary file — contains the raw float32 vectors. The position of each vector in this file corresponds to an integer index that FAISS returns during search.
chunks.pkl
A Python pickle file containing a list[DocumentChunk] in insertion order. The n-th element of this list is the chunk that corresponds to FAISS vector index n.
metadata.json
An IndexMetadata JSON file recording how the index was built: embedding model, chunk size, overlap, total chunk count, creation timestamp.

When FAISS returns indices = [42, 17, 8, 91] for a query, we look up chunks[42], chunks[17], etc. to get the actual text. The index in the list is the bridge between FAISS vector space and our application data.

⚠️ Never deserialise untrusted pickle files

The chunks.pkl file uses Python's pickle protocol, which can execute arbitrary code on load. Only load FAISS stores that you created yourself, from trusted locations. The LangChain FAISS wrapper surfaces this with the allow_dangerous_deserialization=True flag — an explicit acknowledgment that you understand the risk.
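If the pickle risk is a concern, one alternative is to store the chunk list as plain JSON, which cannot execute code on load. The sketch below is an assumption-laden illustration, not the article's FAISSStore: Chunk is a hypothetical stand-in for DocumentChunk with only plain-data fields, and chunks.json is a made-up file name.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path

@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for DocumentChunk with only plain-data fields."""
    chunk_id: str
    source: str
    chunk_index: int
    text: str

def save_chunks(chunks: list[Chunk], path: Path) -> None:
    """Write chunks in insertion order so list position still matches the FAISS index."""
    path.write_text(json.dumps([asdict(c) for c in chunks]), encoding="utf-8")

def load_chunks(path: Path) -> list[Chunk]:
    """Rebuild chunks from plain JSON; nothing in the file can execute on load."""
    return [Chunk(**row) for row in json.loads(path.read_text(encoding="utf-8"))]

chunks = [Chunk("a1", "doc.txt", 0, "first"), Chunk("b2", "doc.txt", 1, "second")]
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "chunks.json"
    save_chunks(chunks, target)
    restored = load_chunks(target)

# FAISS-style integer indices map straight back to list positions.
hits = [restored[i].text for i in [1, 0]]
print(hits)
```

The trade-off is that JSON only handles plain data, so any richer fields on the real model would need explicit serialisation.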

🔗Phase 3 — Refactor to LangChain LCEL

Now that every step of the raw pipeline is clear, we rebuild the same pipeline using LangChain LCEL. The purpose is not to show that LCEL is better — it is to make the mapping between the raw code and the framework code explicit, so you can read LCEL chains fluently.

What LCEL replaces

RAW · Raw Python (~200 lines, 5 files)
  • chunker.py — custom sentence splitter
  • embedder.py — SentenceTransformer singleton + NumPy cosine
  • faiss_store.py — FAISSStore class, add/search/save/load
  • _call_llm_direct() — OpenAI SDK call
  • _build_context() — manual context formatter
  • _RAW_PROMPT — plain string template
LCEL · LangChain LCEL (~35 lines, 1 file)
  • RecursiveCharacterTextSplitter
  • HuggingFaceEmbeddings (wraps sentence-transformers)
  • LangChainFAISS.from_documents() + as_retriever()
  • ChatOpenAI(model="gpt-4o-mini", temperature=0)
  • RunnableLambda(_format_docs)
  • ChatPromptTemplate.from_template()

The LCEL query chain

Here is the complete LCEL query method. The chain itself is the nine-line expression in the middle:

python — rag_lcel.py (query method)
def query(self, question: str, k: int = 4) -> RAGResponse:
    vs = self._load()
    retriever = vs.as_retriever(search_kwargs={"k": k})
    llm = _get_llm()

    # Each | connects one Runnable to the next — this IS the pipeline
    chain = (
        {
            "context":  retriever | RunnableLambda(_format_docs),
            "question": RunnablePassthrough(),
        }
        | _PROMPT
        | llm
        | StrOutputParser()
    )

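    answer = chain.invoke(question)
    # Second retrieval, run only to collect sources for the response;
    # it repeats the embed + search already done inside the chain.
    docs   = retriever.invoke(question)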

    return RAGResponse(
        question=question, answer=answer,
        sources=list({d.metadata.get("source") for d in docs}),
        chunk_ids=[], confidence_note=f"{len(docs)} passages via LCEL",
        retrieved_chunks=len(docs), pipeline="lcel",
    )

Reading this chain top to bottom: the question string flows into a dict with two keys. For "context", it flows through the retriever (embed + FAISS search → list of Documents), then through _format_docs (list of Documents → formatted string). For "question", it passes through unchanged via RunnablePassthrough(). The dict is then injected into the prompt template, which produces a ChatPromptValue. That flows into the LLM, which returns an AIMessage. That flows into StrOutputParser(), which extracts the string content.

Every step in that description corresponds to exactly one step in the raw pipeline. LCEL did not change the logic — it compressed the plumbing.
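To see why a pipe character can express a pipeline at all, it helps to build a toy version. The sketch below is not LangChain's implementation (its Runnable is far richer); Step, parallel, and all the stand-in functions are hypothetical names used only to show how overloading `|` composes functions.

```python
from typing import Any, Callable

class Step:
    """Toy runnable: wraps a function and supports `a | b` composition."""
    def __init__(self, fn: Callable[[Any], Any]) -> None:
        self.fn = fn

    def invoke(self, value: Any) -> Any:
        return self.fn(value)

    def __or__(self, other: "Step") -> "Step":
        # Feed this step's output into the next step's input.
        return Step(lambda value: other.invoke(self.invoke(value)))

def parallel(**branches: Step) -> Step:
    """Run every branch on the same input and collect results in a dict."""
    return Step(lambda value: {key: step.invoke(value) for key, step in branches.items()})

# Hypothetical stand-ins for retriever, formatter, prompt and LLM.
retrieve = Step(lambda q: [f"passage about {q}"])
format_docs = Step(lambda docs: "\n".join(docs))
passthrough = Step(lambda q: q)
prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
fake_llm = Step(lambda text: text.upper())

chain = parallel(context=retrieve | format_docs, question=passthrough) | prompt | fake_llm
result = chain.invoke("faiss")
print(result)
```

The shape mirrors the LCEL chain: a dict of branches fed by the same input, then a prompt, then a model. Each `|` is ordinary function composition behind an operator overload.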

Running the comparison

bash — side-by-side comparison
# Run both pipelines on the same question
python cli.py compare my_document.txt "Summarise the key findings."

# Output: two panels side by side + metrics table
#   Raw Python  |  LCEL
#   The key...  |  The key...
# ─────────────────────────────────────────
#   Chunks/docs: 4      | 4
#   Top score:   0.847  | N/A (LCEL does not expose raw scores)
💡 One practical advantage of the raw pipeline

The raw pipeline returns chunk_ids and a similarity score for every retrieved chunk, so you can inspect exactly which pieces of text influenced the answer. The LCEL pipeline, by default, does not expose these details. For production debugging and answer attribution, this matters: the raw approach gives you a complete audit trail that LCEL requires extra wiring to replicate (for example, calling the vector store's similarity_search_with_score() directly instead of going through the retriever).

⚖️ChromaDB vs FAISS

Article 1 used ChromaDB as the vector store. This article uses FAISS. Both are correct choices — they have different strengths, and the right choice depends on your specific requirements.

Dimension | ChromaDB (Article 1) | FAISS (This Article)
Backend | SQLite (embedded mode) or HTTP server | Pure in-memory binary, files for persistence
Metadata filtering | Yes: filter by page, source, custom fields | No: post-query filtering only
Persistence | Automatic: writes to disk on every add | Manual: call save() explicitly
Update/delete | Supported via collection ID | Limited (remove_ids on some index types); this article rebuilds the index on changes
Scale (vectors) | Up to ~1M comfortably embedded | Flat: up to 1M; IVF/HNSW: billions
Dependencies | Python package only | Python + native C++ library (faiss-cpu/gpu)
LangChain integration | langchain-chroma | langchain-community FAISS wrapper
Index types | One (HNSW under the hood) | 20+ (Flat, IVF, HNSW, PQ, …)
MMR search | Native search_type="mmr" | Requires custom post-processing
Best for | Single-document RAG, prototypes, apps needing metadata filters | Multi-document corpora, scale benchmarks, production at 10M+ vectors

The key practical difference is metadata filtering. In Article 1, we used ChromaDB's where={"page": 3} filter to restrict retrieval to specific pages. FAISS has no such concept — all filtering must happen after retrieval, which is less efficient when you only want results from a subset of the corpus.
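Post-retrieval filtering usually means fetching more candidates than you need and discarding the ones that fail the filter. A minimal sketch of that pattern; Hit, search_with_filter, the overfetch factor, and fake_search are all hypothetical and stand in for whatever your store returns.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Hit:
    """Hypothetical retrieval result: text, similarity score and metadata."""
    text: str
    score: float
    page: int

def search_with_filter(
    search: Callable[[str, int], list[Hit]],
    query: str,
    k: int,
    keep: Callable[[Hit], bool],
    overfetch: int = 4,
) -> list[Hit]:
    """Fetch k * overfetch candidates, drop those failing `keep`, return the top k."""
    candidates = search(query, k * overfetch)
    return [hit for hit in candidates if keep(hit)][:k]

# Stand-in for an unfiltered vector search, already sorted by score.
def fake_search(query: str, n: int) -> list[Hit]:
    corpus = [Hit("a", 0.9, 1), Hit("b", 0.8, 3), Hit("c", 0.7, 2), Hit("d", 0.6, 3)]
    return corpus[:n]

page_three = search_with_filter(fake_search, "q", k=2, keep=lambda h: h.page == 3)
print([h.text for h in page_three])
```

The weakness is visible in the overfetch parameter: if the filter is very selective, even a large multiple may return fewer than k matches, which is the inefficiency a native metadata filter avoids.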

For a single-document Q&A tool serving hundreds of users, ChromaDB's simplicity and built-in metadata filtering make it the better choice. For a multi-tenant system ingesting thousands of documents daily, FAISS's index variety and raw performance at scale tip the balance.

💡 Switching vector stores without changing your RAG logic

In both this article and Article 1, the vector store is isolated behind a thin abstraction (FAISSStore or ChromaDB's vectorstore object). The retrieval interface is always search(query, k) → list[RetrievedChunk]. This means you can swap ChromaDB for FAISS (or Pinecone, or Weaviate) without touching rag_raw.py or cli.py. Design your vector store as a replaceable component from the start.
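One way to make that interface explicit is a typing.Protocol that pipeline code depends on instead of a concrete store. This is a sketch under assumptions: Retrieved is a hypothetical minimal stand-in for RetrievedChunk, and InMemoryStore is a toy word-overlap ranker used only so the example runs without any vector library.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass(frozen=True)
class Retrieved:
    """Hypothetical minimal stand-in for RetrievedChunk."""
    text: str
    score: float
    rank: int

class VectorStore(Protocol):
    """Anything with this search() signature can back the pipeline."""
    def search(self, query: str, k: int = 4) -> list[Retrieved]: ...

class InMemoryStore:
    """Toy store that ranks by shared words; stands in for FAISS or ChromaDB."""
    def __init__(self, texts: list[str]) -> None:
        self.texts = texts

    def search(self, query: str, k: int = 4) -> list[Retrieved]:
        words = set(query.lower().split())
        scored = sorted(
            ((len(words & set(t.lower().split())) / max(len(words), 1), t) for t in self.texts),
            reverse=True,
        )
        return [Retrieved(t, s, r) for r, (s, t) in enumerate(scored[:k], start=1)]

def answer_context(store: VectorStore, question: str, k: int = 2) -> str:
    """Pipeline code depends only on the protocol, not on a concrete store."""
    return "\n".join(hit.text for hit in store.search(question, k=k))

store = InMemoryStore(["faiss stores vectors", "chroma filters metadata", "cats sleep"])
context = answer_context(store, "how does faiss store vectors", k=1)
print(context)
```

Because answer_context only names the protocol, a FAISS-backed or ChromaDB-backed class with the same search() signature can be passed in without changing it.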

🚀Production Decisions

Before deploying a RAG system, you will face a predictable set of design decisions. Here is how to reason about each one.

Common failure modes

⚠ Using different embedding models at ingest and query time
You ingested documents with all-MiniLM-L6-v2, then upgraded to bge-large-en-v1.5. Now queries return irrelevant chunks — the stored vectors are in a different vector space than your query vectors. There is no error; the system silently returns garbage.
Store the embedding model name in IndexMetadata (we do this). On load, assert that the model in metadata matches the model currently configured. Rebuild the index whenever you change models.
⚠ Budgeting in tokens while chunking by characters
LLM context windows are measured in tokens, but our chunker measures characters. At 512 characters per chunk, average English text produces ~100–130 tokens — well within gpt-4o-mini's 128k context window. At chunk_size=8192, you may create chunks of 1,500+ tokens that exceed retrieval assumptions downstream.
For most RAG use cases, character-based chunking at 512–1024 chars with 64 overlap is sufficient. If you need precise token budgets, use tiktoken for OpenAI models to count tokens before splitting.
⚠ Storing embeddings in FAISS but text in a separate unlinked database
A common production pattern is to store vectors in FAISS and text in PostgreSQL, linking them by a UUID primary key. This works until a partial failure — e.g., the FAISS write succeeds but the Postgres write fails. Now you have orphaned vectors with no corresponding text, and FAISS returns indices that resolve to nothing.
Write both stores atomically (same transaction if possible), or write text first and only add to FAISS after the text write is confirmed. Consider a reconciliation job that periodically validates FAISS indices against the text store.
⚠ Loading the embedding model on every request in a web application
In a Flask or FastAPI app, instantiating SentenceTransformer(MODEL_NAME) on each request adds 800ms+ of cold-start latency and may OOM-kill the process if concurrent requests each try to load a 400MB model.
Use a module-level singleton (as in our get_model() function) or a dependency-injection pattern (FastAPI's lifespan event) to load the model once at startup. In containerised deployments, pre-warm the model in the container's startup probe rather than on first request.
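The first failure mode above is the easiest to guard against in code, because the check is a single comparison at load time. A minimal sketch, assuming the metadata JSON carries an "embedding_model" key; that field name and the ModelMismatchError class are hypothetical and should be adapted to whatever IndexMetadata actually stores.

```python
import json

class ModelMismatchError(RuntimeError):
    """Raised when the index was built with a different embedding model."""

def check_embedding_model(metadata_json: str, configured_model: str) -> None:
    """Compare the model recorded at ingest time with the one configured now.

    Assumes an "embedding_model" key in the metadata (hypothetical field name).
    """
    stored = json.loads(metadata_json).get("embedding_model")
    if stored != configured_model:
        raise ModelMismatchError(
            f"Index built with {stored!r} but {configured_model!r} is configured; "
            "rebuild the index before querying."
        )

metadata = json.dumps({"embedding_model": "all-MiniLM-L6-v2", "chunk_size": 512})
check_embedding_model(metadata, "all-MiniLM-L6-v2")   # matching model: no error

try:
    check_embedding_model(metadata, "bge-large-en-v1.5")
    mismatch_detected = False
except ModelMismatchError:
    mismatch_detected = True

print(mismatch_detected)
```

Failing loudly at load time turns a silent quality regression into an immediate, explainable error.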

Production decision tree

CORPUS SIZE?
Under 500K vectors → IndexFlatIP (exact, simple, fast). 500K–10M → IndexIVFFlat (requires training step; use nlist=4096). 10M+ → IndexHNSWFlat or IndexIVFPQ (latency vs RAM trade-off).
METADATA FILTERING?
Yes, frequently → Use ChromaDB (Article 1 stack) — native filter on any metadata field. Rarely, or never → FAISS + post-retrieval Python filter. Complex filters at 10M+ scale → Pinecone or Weaviate with dedicated filter indexes.
DEPLOYMENT TARGET?
Laptop / single server → embedded FAISS or ChromaDB, no infra required. Multi-instance / Kubernetes → shared vector store behind an API (Qdrant, Weaviate) so all replicas read the same index. Serverless / Lambda → S3-backed ChromaDB with cold-start warming strategy.
NEED TO UPDATE CHUNKS?
Documents never change → FAISS (no update support needed). Documents change frequently → ChromaDB (update by ID) or Qdrant (upsert by ID). FAISS requires a full index rebuild on any change — acceptable for batch pipelines, unacceptable for live content.
LATENCY REQUIREMENT?
Under 50ms P99 at 10M+ vectors → GPU FAISS or Approximate Nearest Neighbour (HNSW). Under 200ms P99 at 1M vectors → CPU IndexFlatIP is sufficient. Under 500ms P99 (interactive) → any approach, bottleneck is almost always the LLM call, not retrieval.
FRAMEWORK OR RAW?
Build once, well-understood requirements → LCEL for speed, raw pipeline for auditability. Rapidly iterating on the retrieval strategy → raw pipeline — full visibility, no framework surprises. Team with varied LangChain familiarity → start raw, refactor to LCEL once the design stabilises.

You now understand what the framework abstracts.

You have built a complete RAG system twice: once in 200 lines of raw Python where every step is explicit, and once in 35 lines of LangChain LCEL where the framework handles the plumbing. The logic is identical. The difference is visibility versus brevity.

The next article in the series applies these foundations to a multi-tenant scenario — multiple documents, isolated namespaces, per-user retrieval quotas, and a FastAPI backend. The vector store management patterns you learned here become the foundation for that architecture.

→ Article 3: Multi-Tenant RAG API with FastAPI

All code in this article is production-ready Python 3.11+. Pinned dependency versions tested: faiss-cpu 1.8.0 · sentence-transformers 3.3.1 · numpy 1.26.4 · langchain 0.3.13 · pydantic 2.10.3 · openai 1.58.1.