Which chunking strategy does the post recommend as a production baseline?

The post recommends RecursiveCharacterTextSplitter with 512-token chunks and 64-token overlap as a strong baseline for most production use cases. It also advises always storing the original document text alongside the embedded chunk so that surrounding paragraphs can be fetched at query time using the parent document retriever pattern.

How does the two-stage retrieval pattern improve answer quality?

Embedding-based similarity search quickly retrieves a broad candidate set (top-20 chunks), but it can surface semantically similar text that does not actually answer the question. A cross-encoder re-ranker — such as a small BERT model fine-tuned on MS-MARCO — then scores each candidate against the full query and keeps only the top-5, dramatically improving precision before the context is passed to the LLM.

What RAGAS faithfulness threshold signals that a pipeline needs tuning?

The post states that a faithfulness score below 0.8 indicates the prompt or retrieval needs tuning. Faithfulness measures the fraction of answer claims that are supported by the retrieved context, making it the most important RAGAS metric for catching hallucinations before a pipeline goes live.

How can teams reduce LLM API costs and latency in a production RAG pipeline?

The post describes two complementary caching layers: a content-addressed embedding cache (hash the document text, store the embedding in Redis or a dedicated table) to avoid re-embedding unchanged documents, and a semantic query cache that serves stored answers when a new question falls within cosine distance 0.05 of a previously answered one. On the UX side, LangChain streaming callbacks let the frontend display partial responses progressively, reducing perceived latency for answers that take 3–8 seconds to generate.

Building Production RAG Applications: Architecture & Evaluation

Q: When should you choose pgvector over Qdrant for a RAG vector store?

The post recommends pgvector when you already run PostgreSQL and want to minimise infrastructure complexity, noting it is sufficient for most Indonesian startups with fewer than 10 million documents. Qdrant is the better choice for billion-vector workloads that require rich payload filtering and horizontal scalability.

Retrieval-Augmented Generation (RAG) has become the dominant pattern for building LLM applications that need to answer questions about proprietary or frequently updated data — internal documentation, product catalogs, support tickets, and regulatory texts. Getting a prototype working in a notebook is straightforward, but taking a RAG pipeline to production requires careful decisions about chunking strategy, vector store selection, retrieval quality, and evaluation methodology.

RAG Architecture: From Documents to Answers

A production RAG pipeline consists of two distinct phases: an offline ingestion pipeline that processes and indexes documents, and an online query pipeline that retrieves context and generates answers. The ingestion phase handles loading raw documents from various sources (PDFs, databases, APIs), splitting them into chunks, generating embeddings, and storing them in a vector database. The query phase embeds the user's question, performs similarity search to retrieve relevant chunks, constructs a prompt with the retrieved context, and calls the LLM to generate a grounded response.

Document Ingestion and Preprocessing

Raw documents rarely arrive in a clean, chunk-ready format. PDFs contain headers, footers, and multi-column layouts that break naive text extraction. HTML pages include navigation and boilerplate that dilutes semantic relevance. Preprocessing should normalize whitespace, remove boilerplate, and preserve structural metadata (headings, section numbers, source URL, document date) that will be stored alongside each chunk. This metadata is critical for filtering during retrieval and for citing sources in the generated answer.

Chunking Strategies: Fixed-Size vs Semantic vs Hierarchical

Fixed-size chunking splits text every N characters with a configurable overlap — simple but often cuts sentences mid-thought. Semantic chunking uses sentence boundaries and groups sentences until a similarity threshold drops, preserving complete thoughts at the cost of variable chunk sizes. Hierarchical chunking creates parent-child relationships: large summary chunks for broad retrieval with smaller detail chunks for precision, then fetches the parent context when a child chunk matches. For most production use cases, RecursiveCharacterTextSplitter with 512-token chunks and 64-token overlap provides a strong baseline.

Always store the original document text alongside the embedded chunk, not just the chunk itself. When a chunk matches during retrieval, you can fetch its surrounding paragraphs (the 'parent document retriever' pattern) to provide richer context to the LLM without embedding the full document as a single vector.

Vector Store Selection: pgvector vs Qdrant

pgvector extends your existing PostgreSQL database with approximate nearest neighbor (ANN) search using IVFFlat or HNSW indexes — ideal if you already run Postgres and want to minimize infrastructure complexity. Qdrant is a purpose-built vector database with a rich filtering API, payload indexing, and horizontal scalability for billion-vector workloads. For most Indonesian startups with fewer than 10 million documents, pgvector on a well-tuned Postgres instance is sufficient and eliminates an additional service to operate and secure.

Building the LangChain + pgvector Pipeline

LangChain's PGVector integration handles embedding generation, upsert, and similarity search through a consistent interface. The RetrievalQA chain wires together the retriever, a custom prompt template that instructs the LLM to answer only from provided context, and the LLM itself. Using search_type='mmr' (Maximal Marginal Relevance) during retrieval balances relevance to the query with diversity among returned chunks, reducing the risk of retrieving five nearly identical passages that waste context window space.

# rag_pipeline.py
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_postgres import PGVector
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
import psycopg2

# Connection string for pgvector
CONNECTION_STRING = (
    "postgresql+psycopg2://user:password@localhost:5432/ragdb"
)

# 1. Chunking strategy — overlapping chunks with metadata
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " ", ""],
)

def ingest_documents(docs: list[dict]) -> None:
    """Ingest documents into pgvector with metadata enrichment."""
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    vectorstore = PGVector(
        embeddings=embeddings,
        collection_name="knowledge_base",
        connection=CONNECTION_STRING,
    )
    for doc in docs:
        chunks = text_splitter.create_documents(
            texts=[doc["content"]],
            metadatas=[{
                "source": doc["source"],
                "doc_type": doc.get("type", "unknown"),
                "created_at": doc.get("created_at", ""),
            }]
        )
        vectorstore.add_documents(chunks)

def build_rag_chain():
    """Build a production RAG chain with custom prompt."""
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    vectorstore = PGVector(
        embeddings=embeddings,
        collection_name="knowledge_base",
        connection=CONNECTION_STRING,
    )
    retriever = vectorstore.as_retriever(
        search_type="mmr",            # Maximal Marginal Relevance
        search_kwargs={"k": 5, "fetch_k": 20},
    )
    prompt = PromptTemplate(
        input_variables=["context", "question"],
        template=(
            "Use only the context below to answer the question. "
            "If unsure, say you don't know.\n\n"
            "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
        ),
    )
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    return RetrievalQA.from_chain_type(
        llm=llm,
        retriever=retriever,
        chain_type_kwargs={"prompt": prompt},
        return_source_documents=True,
    )

Prompt Engineering for RAG

The RAG prompt must explicitly instruct the LLM to base its answer only on the provided context and to say 'I don't know' when the context is insufficient — without this constraint, capable models will hallucinate plausible-sounding answers from training data. Include the retrieved source metadata (document title, page number, URL) in the context block so the LLM can cite them in its answer. A system prompt that defines the assistant's role (e.g., 'You are a helpful assistant for PT XYZ's internal HR policies') further reduces off-topic hallucinations.

Retrieval Quality and Re-Ranking

Embedding-based similarity search (cosine or dot-product) is fast but imperfect — it retrieves semantically similar text even when it doesn't actually answer the question. Adding a cross-encoder re-ranker (e.g., a small BERT model fine-tuned on MS-MARCO) as a second-stage filter dramatically improves precision by scoring each retrieved chunk against the full query text. This two-stage approach (embed → retrieve top-20 → re-rank → take top-5) is the standard pattern in production systems where answer quality matters.

Hybrid Search: Dense + Sparse Retrieval

Dense vector search excels at semantic similarity but struggles with exact keyword matches — product codes, names, and technical identifiers. Sparse retrieval (BM25 or TF-IDF) handles exact matches well but misses paraphrases. Hybrid search combines both scores using Reciprocal Rank Fusion (RRF) to get the best of both worlds. Qdrant and newer versions of pgvector support hybrid search natively; LangChain's EnsembleRetriever provides a framework-level implementation.

Naive Chunking Destroys Context

Splitting a document at fixed character counts without regard for sentence or paragraph boundaries is the single most common cause of poor RAG performance. A chunk that begins mid-sentence loses its subject; a chunk that cuts a table in half produces garbled context. Always validate your chunking output visually on a sample of real documents before indexing your full corpus. Use overlapping chunks (at minimum 10–15% overlap) to ensure boundary information is captured.

Evaluation with RAGAS Metrics

RAGAS provides a suite of reference-free metrics that evaluate RAG quality without requiring expensive human-labeled ground truth. The four core metrics are faithfulness (fraction of answer claims supported by context), answer relevancy (how well the answer addresses the question), context precision (fraction of retrieved chunks that are actually relevant), and context recall (fraction of relevant information that was retrieved). Running these metrics against a curated question set after each pipeline change gives an objective measure of quality.

Caching Embeddings and Query Results

Embedding generation is the dominant cost in a RAG pipeline — OpenAI's text-embedding-3-small charges per token even for repeated documents. Implement a content-addressed cache (hash the text, store the embedding in Redis or a dedicated table) to avoid re-embedding unchanged documents on incremental ingestion. For query results, a semantic cache (check if the new query is within cosine distance 0.05 of a previously answered query) can serve cached answers for near-duplicate questions, reducing latency and LLM API costs.

Streaming Responses for Better UX

LLM generation latency for a full answer can be 3–8 seconds, which feels unresponsive in a chat UI. LangChain supports streaming callbacks that emit tokens as they are generated, enabling the frontend to display partial responses progressively. Pair streaming with a 'sources' panel that appears immediately (before the LLM finishes) to show users which documents were retrieved while they read the answer — this also builds trust by making the retrieval process transparent.

Run RAGAS evaluation on 50–100 question-answer pairs sampled from your real use cases before going live. RAGAS's faithfulness metric (does the answer only contain claims supported by the context?) is the most important signal — a faithfulness score below 0.8 indicates your prompt or retrieval needs tuning. Automate this evaluation to run on every significant change to the pipeline.

Scaling and Cost Optimization

Production RAG systems must balance answer quality against API cost and latency. The main cost levers are: embedding model selection (text-embedding-3-small at $0.02/1M tokens vs ada-002 at $0.10/1M tokens), context window size (sending fewer but higher-quality chunks), caching, and batching ingestion requests. For high-volume applications, consider running a self-hosted embedding model (e5-large-v2 or BGE-M3) on a GPU instance to eliminate per-token embedding costs entirely.

Monitoring RAG in Production

Instrument your RAG pipeline with latency histograms (retrieval time, LLM time, total response time), faithfulness scores computed async after each response, and a thumbs-up/thumbs-down feedback mechanism in the UI. Export these metrics to Prometheus and visualize in Grafana. Set alert thresholds: if average faithfulness drops below 0.75 over a rolling 1-hour window, page the on-call engineer — it often means new documents with poor chunking have been ingested.

Key terms in this article include RAG, pgvector, RAGAS, and MMR (Maximal Marginal Relevance).

Sources

Frequently Asked Questions

Building Production RAG LLM Applications: Architecture, Chunking, and Evaluation

Frequently Asked Questions

Building Production RAG LLM Applications: Architecture, Chunking, and Evaluation

RAG Architecture: From Documents to Answers

Document Ingestion and Preprocessing

Chunking Strategies: Fixed-Size vs Semantic vs Hierarchical

Vector Store Selection: pgvector vs Qdrant

Building the LangChain + pgvector Pipeline

Prompt Engineering for RAG

Retrieval Quality and Re-Ranking

Hybrid Search: Dense + Sparse Retrieval

Evaluation with RAGAS Metrics

Caching Embeddings and Query Results

Streaming Responses for Better UX

Scaling and Cost Optimization

Monitoring RAG in Production

Related Articles

RAG Architecture: From Documents to Answers

Document Ingestion and Preprocessing

Chunking Strategies: Fixed-Size vs Semantic vs Hierarchical

Vector Store Selection: pgvector vs Qdrant

Building the LangChain + pgvector Pipeline

Prompt Engineering for RAG

Retrieval Quality and Re-Ranking

Hybrid Search: Dense + Sparse Retrieval

Evaluation with RAGAS Metrics

Caching Embeddings and Query Results

Streaming Responses for Better UX

Scaling and Cost Optimization

Monitoring RAG in Production

Related Articles