The most common failure mode in AI agent deployments is not intelligence — it is amnesia. An agent that cannot remember what it did yesterday, what the user prefers, or what happened in the last conversation is fundamentally limited. Memory is what separates a stateless chatbot from an agent that actually gets smarter about your specific context over time. I have built memory systems for several AI integrations at Commsult Indonesia, and the pattern choices you make early have significant architectural implications. This post covers the four main memory patterns, when to use each, and the production pitfalls I have run into.
AI agent memory falls into four categories based on scope and persistence. In-context memory (conversation history injected into the prompt) is simplest but limited by context window size and costs tokens on every call. External semantic memory (vector databases like Pinecone, pgvector, or Chroma) enables long-term recall based on semantic similarity. Structured episodic memory (traditional databases storing facts, preferences, events) gives you queryable, auditable history. Procedural memory (learned skills or tools the agent can create and store) is the most advanced pattern, used by systems like Hermes Agent.
For most applications, start with conversation history trimming. Keep the last N turns in context, summarize older content using a lightweight model call, and inject the summary at the top. This works well for single-session agents and requires no external infrastructure. The challenge is cost: every token in history costs money on every API call. With GPT-4o at $2.50/M input tokens, a 50-turn conversation history (roughly 10K tokens) adds $0.025 per subsequent message — acceptable for enterprise use, significant for consumer apps at scale.
┌─────────────────────────────────────────────────────────────┐
│ AI Agent Memory Architecture │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Context Assembly Pipeline (runs on every message) │ │
│ │ │ │
│ │ 1. User Profile (structured DB) │ │
│ │ └─ preferences, role, tenant_id │ │
│ │ │ │
│ │ 2. Semantic Memory (pgvector) │ │
│ │ └─ top-3 relevant past interactions │ │
│ │ │ │
│ │ 3. Recent History (sliding window) │ │
│ │ └─ last 8 conversation turns verbatim │ │
│ │ │ │
│ │ 4. Current Message │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ LLM (with full context) │
│ │ │
│ ▼ │
│ Memory Write Tools: store_fact, update_preference │
└─────────────────────────────────────────────────────────────┘From my experience: implement a two-tier context strategy. Keep the last 5-10 turns verbatim (for immediate context coherence) and maintain a structured summary for older history. The summary should store facts, not dialogue — 'User manages 3 inventory warehouses in Jakarta and prefers metric units' is more token-efficient than reproducing the conversation where that was established.
Vector databases store embeddings of text and enable retrieval by semantic similarity — the agent can recall relevant past interactions even if they use different words. This is ideal for knowledge bases, document retrieval, and remembering user preferences expressed in natural language. For production, I use PostgreSQL with the pgvector extension rather than a separate vector database service. It simplifies the stack considerably: one database for structured data and vector embeddings, with familiar tooling for backups, replication, and access control.
The pattern I use: on every agent interaction, embed the user message and store it alongside metadata (user ID, session ID, timestamp, extracted entities). At the start of each new session, retrieve the top-K most relevant past interactions using cosine similarity on the embedding. Inject these as a 'relevant history' block in the system prompt. For entity extraction, run a lightweight structured extraction call (or use regex for simple cases) to pull out names, dates, and domain-specific terms before embedding.
-- pgvector setup
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE agent_memories (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL REFERENCES users(id),
session_id UUID,
domain TEXT NOT NULL DEFAULT 'general',
content TEXT NOT NULL,
embedding vector(1536), -- OpenAI ada-002 dimensions
metadata JSONB,
created_at TIMESTAMPTZ DEFAULT NOW(),
expires_at TIMESTAMPTZ -- optional TTL
);
-- Index for fast similarity search with metadata filter
CREATE INDEX idx_memories_embedding
ON agent_memories USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
CREATE INDEX idx_memories_user_domain
ON agent_memories (user_id, domain);
-- Retrieve top-K relevant memories
SELECT content, metadata,
1 - (embedding <=> $1::vector) AS similarity
FROM agent_memories
WHERE user_id = $2
AND domain = $3
AND (expires_at IS NULL OR expires_at > NOW())
ORDER BY embedding <=> $1::vector
LIMIT 3;For facts that need to be reliable and queryable — user preferences, completed tasks, decisions made — use a structured database rather than a vector store. Define a schema for what your agent needs to remember: user preferences table (key-value), task history table (task, result, timestamp), entity registry (names, IDs, relationships mentioned by user). The agent uses tools to read and write this memory store explicitly. This pattern gives you auditability, easy deletion (GDPR compliance), and precise retrieval.
Vector similarity recall has a precision problem: if your embedding model is not tuned to your domain, it will retrieve semantically similar but contextually irrelevant memories. I had an agent for an HR ERP system that kept retrieving leave approval conversations when users asked about inventory — because 'approval' has high semantic overlap. The fix was to add metadata filtering (user_id + domain tag) as a pre-filter before vector similarity search. Never rely on semantic search alone; always combine with structured metadata filters.
The architecture I use for production agents combines all three patterns: PostgreSQL with pgvector for both structured facts and semantic embeddings, a sliding window context manager that assembles the prompt from multiple memory tiers, and explicit memory write tools the agent can call to store important information. The context assembly pipeline runs on every message: fetch user profile from structured store, retrieve top-3 semantically relevant past interactions, inject last 8 conversation turns verbatim. Total added context per message: roughly 1,500-2,500 tokens.
Memory poisoning is a real risk: if the agent stores incorrect information (because a user lied, the agent misunderstood, or an indirect injection occurred), that bad memory persists and influences future behavior. Mitigations: add confidence scores to stored facts, implement a memory correction tool users can invoke, periodically re-validate stored facts against authoritative sources, and log all memory writes for audit. Also implement memory TTLs — preferences from 2 years ago may no longer be valid. Build the infrastructure to expire or refresh stale memories.