Retrieval-Augmented Generation looks deceptively simple in tutorials: chunk your documents, embed them, store them in a vector database, retrieve on query, and pass the results to an LLM. In reality, production RAG is a reliability and quality engineering problem that takes months to get right.
The most impactful decision in a RAG pipeline is how you chunk documents. My starting point: 512-token chunks with 128-token overlap for dense technical content, 256-token chunks with 64-token overlap for FAQ-style content.
Fixed-size chunking is easy to implement but semantically blind — it will split a table in the middle, cut a list of steps, and fragment code blocks. Semantic chunking respects document structure: split at paragraph boundaries, heading boundaries, or sentence boundaries. I saw a 25% improvement in answer relevance when switching from fixed to semantic chunking.
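A minimal sketch of paragraph-boundary chunking with overlap. It assumes a token can be approximated by a whitespace-separated word; a real pipeline would count with the embedding model's tokenizer. The name `chunk_paragraphs` and its parameters are illustrative, not from any library, and the defaults mirror the 512/128 starting point mentioned above.

```python
def chunk_paragraphs(text, max_tokens=512, overlap_tokens=128):
    """Pack whole paragraphs into chunks, carrying trailing paragraphs as overlap.

    Assumption: one "token" is one whitespace-separated word.
    Chunks stay within max_tokens unless a single paragraph alone exceeds it;
    such a paragraph is kept whole rather than cut in the middle.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            # Carry the last paragraphs forward as overlap, up to the budget.
            carried, carried_len = [], 0
            for prev in reversed(current):
                prev_len = len(prev.split())
                if carried_len + prev_len > overlap_tokens:
                    break
                carried.insert(0, prev)
                carried_len += prev_len
            # Drop the overlap if it would push the next chunk over the limit.
            if carried_len + n > max_tokens:
                carried, carried_len = [], 0
            current, current_len = carried, carried_len
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting on blank lines is only the simplest structural boundary; headings, list blocks, and code fences would need their own handling before this step.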
Every chunk you store should carry rich metadata: document title, section heading, document type, date created, and domain-specific tags. This metadata enables hybrid retrieval: filter by metadata before or after vector search. Qdrant, Weaviate, and pgvector all support metadata filtering natively.
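To make the "filter before vector search" idea concrete, here is a small in-memory sketch. The `Chunk` record, `filtered_search`, and the exact-match `where` dictionary are hypothetical names for illustration; a vector database would run the same two steps inside its own query engine.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    embedding: list
    metadata: dict = field(default_factory=dict)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(chunks, query_embedding, where, top_k=5):
    """Keep chunks whose metadata matches every key in `where`, then rank by similarity."""
    candidates = [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in where.items())
    ]
    candidates.sort(key=lambda c: cosine(c.embedding, query_embedding), reverse=True)
    return candidates[:top_k]
```

Filtering first keeps irrelevant document types out of the candidate set entirely, at the cost of returning fewer than `top_k` results when the filter is very selective.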
Production RAG Pipeline Architecture
Documents
    │
    ▼
┌──────────────────────────────────────────┐
│ Ingestion Pipeline                       │
│ 1. Semantic Chunking (respect headings)  │
│ 2. Metadata Extraction (title, type,     │
│    date, tags)                           │
│ 3. Embedding (text-embedding-3-small)    │
│ 4. Store in pgvector / Qdrant            │
└──────────────────────────────────────────┘

User Query
    │
    ▼
┌──────────────────────────────────────────┐
│ Query Pipeline                           │
│                                          │
│ 1. HyDE: Generate hypothetical answer    │
│    (cheap LLM call)                      │
│ 2. Embed hypothetical answer             │
│ 3. Vector search top-20                  │
│ 4. Metadata filter (optional)            │
│ 5. Rerank top-5 (Cohere / bge)           │
│ 6. Pass to LLM with retrieved context    │
└──────────────────────────────────────────┘
    │
    ▼
Generated Answer + Source Citations

The single most effective RAG improvement I've found is adding a Hypothetical Document Embeddings (HyDE) step. First, use a cheap LLM to generate a hypothetical answer to the user's question, then embed that hypothetical answer and use it as the search query. In my experience, HyDE consistently outperforms direct question embedding because the hypothetical answer, even when its details are wrong, is semantically closer to the relevant documents than a short question is.
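The HyDE step described above can be sketched as a thin wrapper around three callables. `generate`, `embed`, and `search` are placeholders you would wire to your own LLM client, embedding model, and vector store; none of them refers to a real library API, and the prompt wording is an assumption.

```python
from typing import Callable, List, Sequence


def hyde_search(
    question: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], List[str]],
    top_k: int = 20,
) -> List[str]:
    """Search with the embedding of a hypothetical answer instead of the raw question."""
    prompt = (
        "Write a short passage that answers the question.\n\n"
        f"Question: {question}\n\nPassage:"
    )
    hypothetical_answer = generate(prompt)     # cheap LLM call
    query_vector = embed(hypothetical_answer)  # embed the answer, not the question
    return search(query_vector, top_k)
```

Passing the dependencies in as functions keeps the step easy to test with stubs and easy to switch off when you want to compare it against plain question embedding.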
I've used three vector databases in production. pgvector is the obvious choice if you're already on PostgreSQL. Pinecone is fully managed and handles billions of vectors, but costs add up fast. Qdrant is self-hostable, has the best filtering performance I've seen, and supports sparse+dense hybrid search natively.
The biggest RAG mistake is optimizing end-to-end answer quality without understanding retrieval quality separately. Measure retrieval precision and recall independently using a test set of query-document pairs. I run 50 evaluation queries weekly and track recall@5 separately from answer quality.
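Retrieval metrics like these are cheap to compute once you have a labeled test set. The sketch below assumes each evaluation item is a `(query, relevant_ids)` pair and that `retrieve` returns a ranked list of document IDs; both shapes are assumptions for illustration.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k results that are relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)


def mean_recall_at_k(eval_set, retrieve, k=5):
    """Average recall@k over (query, relevant_ids) pairs."""
    scores = [recall_at_k(retrieve(query), relevant, k) for query, relevant in eval_set]
    return sum(scores) / len(scores) if scores else 0.0
```

Tracking the mean over a fixed query set from week to week makes a retrieval regression visible even when end-to-end answer quality still looks fine.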
-- pgvector: HNSW index for production performance
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id        BIGSERIAL PRIMARY KEY,
    content   TEXT NOT NULL,
    embedding VECTOR(1536) NOT NULL,
    metadata  JSONB NOT NULL DEFAULT '{}'
);

-- HNSW index: tune m and ef_construction for your dataset
CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 32, ef_construction = 128);

-- Retrieval with metadata filter + vector search
SELECT id, content, metadata,
       1 - (embedding <=> $1::vector) AS similarity
FROM documents
WHERE metadata->>'type' = 'exercise_technique'  -- metadata filter
ORDER BY embedding <=> $1::vector               -- vector search
LIMIT 20; -- retrieve 20, then rerank to top-5

A reranker is a cross-encoder model that takes (query, document_chunk) pairs and scores their relevance directly. Running a reranker on your top-20 vector search results and selecting the top-5 for generation consistently improves answer quality. Cohere Rerank costs $2 per 1,000 queries.
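The retrieve-20, keep-5 step is a small amount of code once a pairwise scorer exists. In this sketch `score_pair` stands in for the cross-encoder (a hosted rerank API or a local model), and `word_overlap` is only a toy scorer so the example runs on its own; both names are hypothetical.

```python
def rerank(query, chunks, score_pair, top_n=5):
    """Score every (query, chunk) pair and keep the top_n highest-scoring chunks."""
    scored = [(score_pair(query, chunk), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]


def word_overlap(query, chunk):
    """Toy stand-in for a cross-encoder: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


candidates = [
    "bench press grip width",
    "squat depth and knee tracking",
    "deadlift setup",
]
best = rerank("proper squat depth", candidates, word_overlap, top_n=2)
```

A real cross-encoder reads the query and chunk together, which is why it can catch relevance that embedding similarity alone misses; the toy scorer here does not have that property.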
Retrieving more chunks seems like it should always improve quality. In practice, stuffing 10+ chunks into the context window often degrades answer quality. The 'Lost in the Middle' paper showed LLMs consistently underweight information in the middle of long contexts. Keep your retrieved context to 3-5 high-quality, reranked chunks rather than 10+ mediocre ones.
Simple RAG retrieves once and generates. Complex queries, however, require multi-hop retrieval, where each step depends on the previous one: for example, first retrieve the user's profile, then the exercises that fit it, then the user's progress data. Build your RAG pipeline to support iterative retrieval.
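One way to structure iterative retrieval is a bounded loop in which each hop's results inform the next query. In this sketch `retrieve` and `next_query` are hypothetical callables: the first wraps your vector search, the second would typically be an LLM call that decides what to look up next, returning `None` when it has enough.

```python
def multi_hop_retrieve(question, retrieve, next_query, max_hops=3):
    """Retrieve iteratively, letting earlier results shape later queries.

    retrieve(query) -> list of chunks
    next_query(question, context) -> follow-up query string, or None to stop
    """
    context = []
    query = question
    for _ in range(max_hops):
        for chunk in retrieve(query):
            if chunk not in context:  # skip chunks already collected
                context.append(chunk)
        query = next_query(question, context)
        if query is None:
            break
    return context
```

The `max_hops` cap matters in practice: without it, a planner that never returns `None` would keep issuing queries and inflate both latency and context size.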
RAG systems degrade silently. Monitor: retrieval latency, context quality score, answer groundedness, and document freshness. Re-index stale documents automatically via a weekly job. Track embedding model version and re-embed when you upgrade the model.
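A staleness check for the re-indexing job might look like the sketch below. The record fields (`embedding_model`, `indexed_at`, `source_updated_at`) and the seven-day `max_age` are assumptions for illustration; the model name is the one shown in the architecture diagram.

```python
from datetime import datetime, timedelta, timezone

CURRENT_EMBEDDING_MODEL = "text-embedding-3-small"  # assumption: swap in your own


def needs_reindex(record, now, max_age=timedelta(days=7),
                  model=CURRENT_EMBEDDING_MODEL):
    """Return True if a stored chunk should be re-embedded.

    A record is stale when it was embedded with a different model, when its
    source changed after it was indexed, or when it is older than max_age.
    """
    if record["embedding_model"] != model:
        return True
    if record["source_updated_at"] > record["indexed_at"]:
        return True
    return now - record["indexed_at"] > max_age


now = datetime(2024, 1, 15, tzinfo=timezone.utc)
record = {
    "embedding_model": "text-embedding-3-small",
    "indexed_at": datetime(2024, 1, 14, tzinfo=timezone.utc),
    "source_updated_at": datetime(2024, 1, 1, tzinfo=timezone.utc),
}
stale = needs_reindex(record, now)
```

Storing the model name next to each embedding is what makes the upgrade path safe: vectors from two different models are not comparable, so mixed versions in one index quietly hurt recall.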