Building Production RAG LLM Applications: Architecture, Chunking, and Evaluation

Photo by Unsplash

Photo by Unsplash
Retrieval-Augmented Generation (RAG) has become the dominant pattern for building LLM applications that need to answer questions about proprietary or frequently updated data — internal documentation, product catalogs, support tickets, and regulatory texts. Getting a prototype working in a notebook is straightforward, but taking a RAG pipeline to production requires careful decisions about chunking strategy, vector store selection, retrieval quality, and evaluation methodology.
A production RAG pipeline consists of two distinct phases: an offline ingestion pipeline that processes and indexes documents, and an online query pipeline that retrieves context and generates answers. The ingestion phase handles loading raw documents from various sources (PDFs, databases, APIs), splitting them into chunks, generating embeddings, and storing them in a vector database. The query phase embeds the user's question, performs similarity search to retrieve relevant chunks, constructs a prompt with the retrieved context, and calls the LLM to generate a grounded response.
Raw documents rarely arrive in a clean, chunk-ready format. PDFs contain headers, footers, and multi-column layouts that break naive text extraction. HTML pages include navigation and boilerplate that dilutes semantic relevance. Preprocessing should normalize whitespace, remove boilerplate, and preserve structural metadata (headings, section numbers, source URL, document date) that will be stored alongside each chunk. This metadata is critical for filtering during retrieval and for citing sources in the generated answer.
Fixed-size chunking splits text every N characters with a configurable overlap — simple but often cuts sentences mid-thought. Semantic chunking uses sentence boundaries and groups sentences until a similarity threshold drops, preserving complete thoughts at the cost of variable chunk sizes. Hierarchical chunking creates parent-child relationships: large summary chunks for broad retrieval with smaller detail chunks for precision, then fetches the parent context when a child chunk matches. For most production use cases, RecursiveCharacterTextSplitter with 512-token chunks and 64-token overlap provides a strong baseline.
Always store the original document text alongside the embedded chunk, not just the chunk itself. When a chunk matches during retrieval, you can fetch its surrounding paragraphs (the 'parent document retriever' pattern) to provide richer context to the LLM without embedding the full document as a single vector.
pgvector extends your existing PostgreSQL database with approximate nearest neighbor (ANN) search using IVFFlat or HNSW indexes — ideal if you already run Postgres and want to minimize infrastructure complexity. Qdrant is a purpose-built vector database with a rich filtering API, payload indexing, and horizontal scalability for billion-vector workloads. For most Indonesian startups with fewer than 10 million documents, pgvector on a well-tuned Postgres instance is sufficient and eliminates an additional service to operate and secure.
LangChain's PGVector integration handles embedding generation, upsert, and similarity search through a consistent interface. The RetrievalQA chain wires together the retriever, a custom prompt template that instructs the LLM to answer only from provided context, and the LLM itself. Using search_type='mmr' (Maximal Marginal Relevance) during retrieval balances relevance to the query with diversity among returned chunks, reducing the risk of retrieving five nearly identical passages that waste context window space.
# rag_pipeline.py
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_postgres import PGVector
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
import psycopg2
# Connection string for pgvector
CONNECTION_STRING = (
"postgresql+psycopg2://user:password@localhost:5432/ragdb"
)
# 1. Chunking strategy — overlapping chunks with metadata
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=512,
chunk_overlap=64,
separators=["\n\n", "\n", ". ", " ", ""],
)
def ingest_documents(docs: list[dict]) -> None:
"""Ingest documents into pgvector with metadata enrichment."""
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = PGVector(
embeddings=embeddings,
collection_name="knowledge_base",
connection=CONNECTION_STRING,
)
for doc in docs:
chunks = text_splitter.create_documents(
texts=[doc["content"]],
metadatas=[{
"source": doc["source"],
"doc_type": doc.get("type", "unknown"),
"created_at": doc.get("created_at", ""),
}]
)
vectorstore.add_documents(chunks)
def build_rag_chain():
"""Build a production RAG chain with custom prompt."""
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = PGVector(
embeddings=embeddings,
collection_name="knowledge_base",
connection=CONNECTION_STRING,
)
retriever = vectorstore.as_retriever(
search_type="mmr", # Maximal Marginal Relevance
search_kwargs={"k": 5, "fetch_k": 20},
)
prompt = PromptTemplate(
input_variables=["context", "question"],
template=(
"Use only the context below to answer the question. "
"If unsure, say you don't know.\n\n"
"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
),
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
return RetrievalQA.from_chain_type(
llm=llm,
retriever=retriever,
chain_type_kwargs={"prompt": prompt},
return_source_documents=True,
)The RAG prompt must explicitly instruct the LLM to base its answer only on the provided context and to say 'I don't know' when the context is insufficient — without this constraint, capable models will hallucinate plausible-sounding answers from training data. Include the retrieved source metadata (document title, page number, URL) in the context block so the LLM can cite them in its answer. A system prompt that defines the assistant's role (e.g., 'You are a helpful assistant for PT XYZ's internal HR policies') further reduces off-topic hallucinations.
Embedding-based similarity search (cosine or dot-product) is fast but imperfect — it retrieves semantically similar text even when it doesn't actually answer the question. Adding a cross-encoder re-ranker (e.g., a small BERT model fine-tuned on MS-MARCO) as a second-stage filter dramatically improves precision by scoring each retrieved chunk against the full query text. This two-stage approach (embed → retrieve top-20 → re-rank → take top-5) is the standard pattern in production systems where answer quality matters.
Dense vector search excels at semantic similarity but struggles with exact keyword matches — product codes, names, and technical identifiers. Sparse retrieval (BM25 or TF-IDF) handles exact matches well but misses paraphrases. Hybrid search combines both scores using Reciprocal Rank Fusion (RRF) to get the best of both worlds. Qdrant and newer versions of pgvector support hybrid search natively; LangChain's EnsembleRetriever provides a framework-level implementation.
Splitting a document at fixed character counts without regard for sentence or paragraph boundaries is the single most common cause of poor RAG performance. A chunk that begins mid-sentence loses its subject; a chunk that cuts a table in half produces garbled context. Always validate your chunking output visually on a sample of real documents before indexing your full corpus. Use overlapping chunks (at minimum 10–15% overlap) to ensure boundary information is captured.
RAGAS provides a suite of reference-free metrics that evaluate RAG quality without requiring expensive human-labeled ground truth. The four core metrics are faithfulness (fraction of answer claims supported by context), answer relevancy (how well the answer addresses the question), context precision (fraction of retrieved chunks that are actually relevant), and context recall (fraction of relevant information that was retrieved). Running these metrics against a curated question set after each pipeline change gives an objective measure of quality.
Embedding generation is the dominant cost in a RAG pipeline — OpenAI's text-embedding-3-small charges per token even for repeated documents. Implement a content-addressed cache (hash the text, store the embedding in Redis or a dedicated table) to avoid re-embedding unchanged documents on incremental ingestion. For query results, a semantic cache (check if the new query is within cosine distance 0.05 of a previously answered query) can serve cached answers for near-duplicate questions, reducing latency and LLM API costs.
LLM generation latency for a full answer can be 3–8 seconds, which feels unresponsive in a chat UI. LangChain supports streaming callbacks that emit tokens as they are generated, enabling the frontend to display partial responses progressively. Pair streaming with a 'sources' panel that appears immediately (before the LLM finishes) to show users which documents were retrieved while they read the answer — this also builds trust by making the retrieval process transparent.
Run RAGAS evaluation on 50–100 question-answer pairs sampled from your real use cases before going live. RAGAS's faithfulness metric (does the answer only contain claims supported by the context?) is the most important signal — a faithfulness score below 0.8 indicates your prompt or retrieval needs tuning. Automate this evaluation to run on every significant change to the pipeline.
Production RAG systems must balance answer quality against API cost and latency. The main cost levers are: embedding model selection (text-embedding-3-small at $0.02/1M tokens vs ada-002 at $0.10/1M tokens), context window size (sending fewer but higher-quality chunks), caching, and batching ingestion requests. For high-volume applications, consider running a self-hosted embedding model (e5-large-v2 or BGE-M3) on a GPU instance to eliminate per-token embedding costs entirely.
Instrument your RAG pipeline with latency histograms (retrieval time, LLM time, total response time), faithfulness scores computed async after each response, and a thumbs-up/thumbs-down feedback mechanism in the UI. Export these metrics to Prometheus and visualize in Grafana. Set alert thresholds: if average faithfulness drops below 0.75 over a rolling 1-hour window, page the on-call engineer — it often means new documents with poor chunking have been ingested.
Key terms in this article include RAG, pgvector, RAGAS, and MMR (Maximal Marginal Relevance).
Sources