Why does chunking strategy matter so much in a RAG pipeline?

Chunking is the most impactful decision in a RAG pipeline because it determines whether the right context is retrievable at all. Fixed-size chunking is easy to implement but semantically blind: it can split tables, code blocks, or numbered steps mid-way. In the author's experience, switching from fixed-size to semantic chunking (splitting at paragraph, heading, or sentence boundaries) produced a 25% improvement in answer relevance.

What is HyDE and why does it improve retrieval quality?

HyDE (Hypothetical Document Embeddings) works by first using a cheap LLM to generate a hypothetical answer to the user's question, then embedding that answer instead of the raw question to drive the vector search. Because a hypothetical answer is semantically closer to relevant documents than the question itself, HyDE consistently outperforms direct question embedding and can yield 20–30% gains in retrieval recall.

How many retrieved chunks should be passed to the LLM for generation?

Passing more chunks does not reliably improve answer quality. Research (the 'Lost in the Middle' paper) shows LLMs consistently underweight information placed in the middle of long contexts. Keeping the retrieved context to 3–5 high-quality, reranked chunks rather than 10 or more mediocre ones produces better answers.

What is a reranker and when is it worth the cost?

A reranker is a cross-encoder model that scores (query, document_chunk) pairs directly for relevance — more accurate than vector similarity but too slow to scan a full index. The recommended pattern is to run fast vector search over the full index, retrieve the top 20 candidates, then rerank and select the top 5 for generation. Cohere Rerank is priced at $2 per 1,000 queries, making it practical for most production workloads.

How should RAG systems be monitored in production?

RAG systems degrade silently, so dedicated monitoring is essential. Key signals to track are retrieval latency, context quality score, answer groundedness, and document freshness. Stale documents should be re-indexed automatically via a weekly job, and the embedding model version should be tracked so that embeddings are regenerated after any model upgrade.

RAG Pipeline Production Lessons: What Nobody Tells You

August 2025 · 12 min read

Retrieval-Augmented Generation looks deceptively simple in tutorials: chunk your documents, embed them, store in a vector database, retrieve on query, pass to LLM. Reality: production RAG is a reliability and quality engineering problem that takes months to get right.

Chunking: The Foundation Everything Else Depends On

The most impactful decision in a RAG pipeline is how you chunk documents. My starting point: 512-token chunks with 128-token overlap for dense technical content, 256-token chunks with 64-token overlap for FAQ-style content.
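A minimal sketch of fixed-size chunking with overlap might look like the following. It assumes the document has already been turned into a list of tokens (a real pipeline would use the embedding model's tokenizer; a plain word split is only a rough proxy), and the function name `chunk_fixed` is hypothetical.

```python
def chunk_fixed(tokens, size=512, overlap=128):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks
```

With the defaults, each window starts 384 tokens after the previous one, so the last 128 tokens of one chunk are repeated at the start of the next.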

Semantic Chunking vs Fixed-Size Chunking

Fixed-size chunking is easy to implement but semantically blind — it will split a table in the middle, cut a list of steps, and fragment code blocks. Semantic chunking respects document structure: split at paragraph boundaries, heading boundaries, or sentence boundaries. I saw a 25% improvement in answer relevance when switching from fixed to semantic chunking.
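One simple form of semantic chunking is to pack whole paragraphs into chunks without ever cutting a paragraph. The sketch below assumes paragraphs are separated by blank lines and uses a word count as a stand-in for a token count; `chunk_by_paragraph` is a hypothetical name, and heading-aware or sentence-aware splitting would need more logic than this.

```python
import re

def chunk_by_paragraph(text, max_words=512):
    """Pack whole paragraphs into chunks of at most `max_words` words.

    A paragraph longer than `max_words` becomes its own oversized chunk
    rather than being cut mid-way.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and current_len + n > max_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```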

Document Metadata as a Retrieval Multiplier

Every chunk you store should carry rich metadata: document title, section heading, document type, date created, and domain-specific tags. This metadata enables hybrid retrieval: filter by metadata before or after vector search. Qdrant, Weaviate, and pgvector all support metadata filtering natively.
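To illustrate filtering before vector search, here is a small in-memory sketch. The `Chunk` record, the `filtered_search` function, and the exact-match filter semantics are all assumptions made for illustration; a real vector database applies the same idea inside its own query engine.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    embedding: list
    metadata: dict = field(default_factory=dict)

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filtered_search(chunks, query_vec, where, k=20):
    """Keep chunks whose metadata matches `where` exactly, then rank by similarity."""
    candidates = [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in where.items())
    ]
    candidates.sort(key=lambda c: cosine(c.embedding, query_vec), reverse=True)
    return candidates[:k]
```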

Production RAG Pipeline Architecture

  Documents
      │
      ▼
  ┌──────────────────────────────────────────┐
  │  Ingestion Pipeline                      │
  │  1. Semantic Chunking (respect headings) │
  │  2. Metadata Extraction (title, type,    │
  │     date, tags)                          │
  │  3. Embedding (text-embedding-3-small)   │
  │  4. Store in pgvector / Qdrant           │
  └──────────────────────────────────────────┘

  User Query
      │
      ▼
  ┌──────────────────────────────────────────┐
  │  Query Pipeline                          │
  │                                          │
  │  1. HyDE: Generate hypothetical answer   │
  │     (cheap LLM call)                     │
  │  2. Embed hypothetical answer            │
  │  3. Vector search top-20                 │
  │  4. Metadata filter (optional)           │
  │  5. Rerank top-5 (Cohere / bge)          │
  │  6. Pass to LLM with retrieved context   │
  └──────────────────────────────────────────┘
      │
      ▼
  Generated Answer + Source Citations

The single most effective RAG improvement I've found is adding a HyDE (Hypothetical Document Embeddings) step. First use a cheap LLM to generate a hypothetical answer to the user's question, then embed that hypothetical answer and use it as the search query. In my experience HyDE consistently outperforms direct question embedding because the hypothetical answer, even when it gets facts wrong, is semantically closer to the relevant documents than the bare question is.
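The HyDE flow can be sketched as a small function that takes the three moving parts as callables. Everything here is hypothetical scaffolding: `generate` stands in for the cheap LLM call, `embed` for the embedding model, `search` for the vector store query, and the prompt wording is only an example.

```python
def hyde_search(question, generate, embed, search, k=20):
    """HyDE: search with the embedding of a drafted answer, not of the question.

    generate(prompt) -> str, embed(text) -> vector, search(vector, k) -> results.
    """
    prompt = (
        "Write a short passage that answers the question.\n"
        f"Question: {question}"
    )
    hypothetical = generate(prompt)  # the draft may be wrong; it is never shown to the user
    query_vec = embed(hypothetical)  # embed the draft instead of the raw question
    return search(query_vec, k)
```

Passing the dependencies in as arguments keeps the sketch independent of any particular LLM or vector database client, and makes it easy to compare against plain question embedding by swapping `generate` for a function that returns the question unchanged.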

Vector Database Choice: pgvector vs Pinecone vs Qdrant

I've used three vector databases in production. pgvector is the obvious choice if you're already on PostgreSQL. Pinecone is fully managed and handles billions of vectors, but costs add up fast. Qdrant is self-hostable, has the best filtering performance I've seen, and supports sparse+dense hybrid search natively.

Evaluating Retrieval Quality

The biggest RAG mistake is optimizing end-to-end answer quality without understanding retrieval quality separately. Measure retrieval precision and recall independently using a test set of query-document pairs. I run 50 evaluation queries weekly and track recall@5 separately from answer quality.
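For reference, recall@k over a labelled test set can be computed as in the sketch below. It assumes each evaluation query comes with the IDs of the documents that should be retrieved, and that `retrieve` (a hypothetical callable) returns a ranked list of IDs.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each evaluation query needs at least one relevant document")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)

def mean_recall_at_k(eval_set, retrieve, k=5):
    """Average recall@k over (query, relevant_ids) pairs."""
    scores = [recall_at_k(retrieve(query), relevant, k) for query, relevant in eval_set]
    return sum(scores) / len(scores)
```

Tracking this number on its own makes it possible to tell whether a drop in answer quality comes from retrieval or from generation.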

-- pgvector: HNSW index for production performance
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
  id          BIGSERIAL PRIMARY KEY,
  content     TEXT NOT NULL,
  embedding   VECTOR(1536) NOT NULL,
  metadata    JSONB NOT NULL DEFAULT '{}'
);

-- HNSW index — tune m and ef_construction for your dataset
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 32, ef_construction = 128);

-- Retrieval with metadata filter + vector search
SELECT id, content, metadata,
       1 - (embedding <=> $1::vector) AS similarity
FROM documents
WHERE metadata->>'type' = 'exercise_technique'   -- metadata filter
ORDER BY embedding <=> $1::vector                 -- vector search
LIMIT 20;  -- retrieve 20, then rerank to top-5

Reranking: The Bridge Between Retrieval and Generation

A reranker is a cross-encoder model that takes (query, document_chunk) pairs and scores their relevance directly. Running a reranker on your top-20 vector search results and selecting the top-5 for generation consistently improves answer quality. Cohere Rerank costs $2 per 1,000 queries.
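The retrieve-then-rerank step can be sketched as follows. The `score` argument is a hypothetical stand-in for a cross-encoder or a hosted rerank API, and `overlap_score` is only a toy scorer so the example runs without any model.

```python
def rerank(query, candidates, score, top_n=5):
    """Score every (query, chunk) pair and keep the `top_n` highest-scoring chunks."""
    scored = [(score(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]

def overlap_score(query, chunk):
    """Toy stand-in for a cross-encoder: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))
```

In a real pipeline `candidates` would be the top-20 results from vector search, and only the chunks returned by `rerank` would be placed in the prompt.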

Retrieving more chunks seems like it should always improve quality. In practice, stuffing 10+ chunks into the context window often degrades answer quality. The 'Lost in the Middle' paper showed LLMs consistently underweight information in the middle of long contexts. Keep your retrieved context to 3-5 high-quality, reranked chunks rather than 10+ mediocre ones.

Handling Multi-Step and Multi-Hop Queries

Simple RAG retrieves once and generates. But complex queries require multi-hop retrieval: for example, first retrieve user profiles, then retrieve exercises, then retrieve progress data, with the results of each step shaping the next query. Build your RAG pipeline to support iterative retrieval.
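One way to structure iterative retrieval is a bounded loop in which each hop's query is derived from what has been found so far. This is a sketch under assumptions: `retrieve` and `make_followup` are hypothetical callables (the latter would typically be an LLM call that returns the next query, or `None` when enough context has been gathered).

```python
def multi_hop_retrieve(question, retrieve, make_followup, max_hops=3):
    """Retrieve iteratively, deriving each hop's query from the context so far."""
    context, query = [], question
    for _ in range(max_hops):
        # Ignore results that earlier hops already collected.
        hits = [hit for hit in retrieve(query) if hit not in context]
        if not hits:
            break  # nothing new was found, so further hops would not help
        context.extend(hits)
        query = make_followup(question, context)
        if query is None:
            break  # the follow-up step decided the context is sufficient
    return context
```

The `max_hops` cap keeps latency and cost bounded when the follow-up step never signals that it is done.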

Production Monitoring for RAG Systems

RAG systems degrade silently. Monitor: retrieval latency, context quality score, answer groundedness, and document freshness. Re-index stale documents automatically via a weekly job. Track embedding model version and re-embed when you upgrade the model.
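A scheduled re-indexing job needs some rule for deciding which chunks to re-embed. The sketch below assumes each chunk carries a hypothetical metadata record with `embedding_model`, `indexed_at`, and `source_updated_at` fields, and treats a chunk as stale when its source changed after it was indexed; other definitions of staleness are equally reasonable.

```python
from datetime import datetime, timezone

def needs_reindex(meta, current_model):
    """Return the reason a chunk should be re-embedded, or None if it is current."""
    if meta["embedding_model"] != current_model:
        return "model_upgrade"  # embeddings from different models are not comparable
    if meta["source_updated_at"] > meta["indexed_at"]:
        return "stale_source"   # the source document changed after indexing
    return None

example = {
    "embedding_model": "text-embedding-3-small",
    "indexed_at": datetime(2025, 8, 1, tzinfo=timezone.utc),
    "source_updated_at": datetime(2025, 8, 5, tzinfo=timezone.utc),
}
print(needs_reindex(example, "text-embedding-3-small"))  # stale_source
```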
