AI Agents in Production: Lessons from a Year of Agentic Workflows

Photo by Google DeepMind

I use agentic coding tools daily — Claude Code does a meaningful share of my refactoring and infrastructure scripting — and I have shipped agent-shaped features into products. The gap between an agent demo and an agent in production is wider than any other gap in software I have worked on. The demo is a weekend; the production version is months of tool design, guardrails, and evals.
This post is the lessons file I keep updating: what actually breaks, what actually helps, and the order in which to spend your effort. Most of it converges on the advice in Anthropic's building-effective-agents research post, which I consider mandatory reading — but here it is filtered through my own production scars.
Anthropic draws the line precisely: workflows orchestrate LLM calls through predefined code paths, while agents let the model dynamically direct its own process and tool usage. The single most common production failure I see is choosing the second when the task wanted the first. If your task has a known sequence — fetch invoice, extract fields, validate, post to the ERP — write the sequence in code and put the LLM inside the steps, not in charge of them.
Genuine agent territory is where the trajectory cannot be specified in advance: debugging, open-ended research, multi-file code changes. The honest test I apply: can I draw the flowchart? If yes, it is a workflow, and the deterministic version will be cheaper, faster, and dramatically easier to debug. Start simple and add autonomy only when the simpler system demonstrably underperforms.
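The invoice example can be sketched as a plain workflow. This is a minimal illustration, not a real integration: `fetch`, `llm`, and `post_to_erp` are hypothetical callables you would supply, and the `Invoice` fields are assumed for the example. The point is that code owns the sequence and the model is confined to one step.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Invoice:
    vendor: str
    total: float


def extract_fields(raw_text: str, llm: Callable[[str], dict]) -> Invoice:
    """The only step that calls the model, and it has one narrow job."""
    fields = llm(f"Extract vendor and total from this invoice:\n{raw_text}")
    return Invoice(vendor=str(fields["vendor"]), total=float(fields["total"]))


def validate(invoice: Invoice) -> None:
    """Deterministic checks; the model never decides whether these run."""
    if not invoice.vendor.strip():
        raise ValueError("vendor is empty")
    if invoice.total <= 0:
        raise ValueError(f"total must be positive, got {invoice.total}")


def process_invoice(
    invoice_id: str,
    fetch: Callable[[str], str],
    llm: Callable[[str], dict],
    post_to_erp: Callable[[Invoice], None],
) -> Invoice:
    """Fixed order: fetch, extract, validate, post. No step can be skipped."""
    raw_text = fetch(invoice_id)
    invoice = extract_fields(raw_text, llm)
    validate(invoice)
    post_to_erp(invoice)
    return invoice
```

Because every dependency is passed in, the whole path can be tested with stubs and no model call, which is a large part of why this version is easier to debug.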
Agents are exactly as good as their tools. Anthropic's engineering team treats the agent-computer interface with the same rigor as a human interface, and after a year of writing tools I agree with every word. The rules that pay rent:
Fewer, task-shaped tools
Do not wrap every API endpoint. One schedule_meeting tool that handles availability internally beats list_users plus list_events plus create_event — three tools means three chances to compose them wrong.
Descriptions are prompts
Write each description as if onboarding a new teammate, and state when to call the tool, not just what it does. Small wording refinements produce measurable behavior changes — treat descriptions as tunable, evaluated artifacts.
Make mistakes hard to express
Require absolute paths, use enums instead of free strings, validate aggressively, and return actionable error messages. An agent decides its next attempt from what the error tells it; a vague 400 gives it nothing to change, so it loops.
Return lean, meaningful context
Agents have a context budget. Return names instead of cryptic IDs, paginate big results, and filter server-side. Every junk token you return is reasoning capacity you took away.
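As a rough sketch of the "make mistakes hard to express" rule: the handler below is for a hypothetical `export_orders` tool (the name, the `path` argument, and the example path are assumptions for illustration). It rejects relative paths and unknown statuses, and each error says what to send instead.

```python
from pathlib import PurePosixPath

# Assumed set of allowed values, playing the role of an enum in the schema.
ALLOWED_STATUSES = ("pending", "paid", "shipped", "refunded")


def validate_tool_input(args: dict) -> list[str]:
    """Return actionable error messages; an empty list means the input is valid."""
    errors = []
    path = args.get("path")
    if not isinstance(path, str) or not PurePosixPath(path).is_absolute():
        errors.append(
            f"'path' must be an absolute path such as /srv/exports/orders.csv; got {path!r}."
        )
    status = args.get("status")
    if status not in ALLOWED_STATUSES:
        errors.append(
            f"'status' must be one of {list(ALLOWED_STATUSES)}; got {status!r}."
        )
    return errors


def export_orders(args: dict) -> dict:
    """Hypothetical tool handler: validate first, and never return a bare failure."""
    errors = validate_tool_input(args)
    if errors:
        return {"ok": False, "errors": errors}
    # The real export would happen here; the sketch only reports the intent.
    return {"ok": True, "message": f"Would export {args['status']} orders to {args['path']}"}
```

Returning all errors at once, rather than the first one, saves the agent a round trip per mistake.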
A tool definition in that spirit, with a description that says when to call it, not just what it does:

    {
      "name": "search_orders",
      "description": "Search customer orders by status, date range, or customer email. Call this whenever the user asks about a specific order, a refund, or delivery status. Do NOT answer order questions from memory.",
      "input_schema": {
        "type": "object",
        "properties": {
          "query": { "type": "string", "description": "Free-text search, e.g. an order ID or email" },
          "status": { "type": "string", "enum": ["pending", "paid", "shipped", "refunded"] }
        },
        "required": ["query"]
      }
    }

Telling the model to be careful is not a guardrail. The layers that have actually saved me, in priority order:
The design heuristic underneath all four: sort every action by reversibility. Reversible actions can be autonomous; hard-to-reverse actions get gates. This is also why dedicated tools beat a generic bash tool for sensitive operations — a send_email tool is easy to intercept and confirm, while a shell command that happens to call curl is opaque to your harness.
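The reversibility heuristic can be sketched as a small dispatcher. Everything here is illustrative: the `TOOL_POLICY` table, the tool names other than `send_email`, and the `confirm` callback (standing in for whatever human approval step your harness has) are assumptions, not a prescribed design.

```python
from enum import Enum
from typing import Any, Callable


class Reversibility(Enum):
    REVERSIBLE = "reversible"
    IRREVERSIBLE = "irreversible"


# Hypothetical policy table: every tool is classified before it ships.
TOOL_POLICY = {
    "read_file": Reversibility.REVERSIBLE,
    "create_draft": Reversibility.REVERSIBLE,
    "send_email": Reversibility.IRREVERSIBLE,
}


def dispatch(
    tool_name: str,
    args: dict,
    tools: dict[str, Callable[..., Any]],
    confirm: Callable[[str, dict], bool],
) -> dict:
    """Run reversible tools directly; gate everything else behind confirm()."""
    if tool_name not in tools:
        return {"ok": False, "error": f"Unknown tool {tool_name!r}; available: {sorted(tools)}"}
    # Unclassified tools fall back to the strictest policy.
    policy = TOOL_POLICY.get(tool_name, Reversibility.IRREVERSIBLE)
    if policy is Reversibility.IRREVERSIBLE and not confirm(tool_name, args):
        return {
            "ok": False,
            "error": f"{tool_name} was not approved; do not retry without new instructions.",
        }
    return {"ok": True, "result": tools[tool_name](**args)}
```

The gate only works because `send_email` is a named tool the harness can see. The same email sent through a shell command would never reach this function.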
An agent without an eval suite is a system whose quality you discover from users. Before any agent feature ships, I want three layers in place:
Anthropic's tool-writing guide pushes this loop further: use the agent itself to analyze its failed transcripts and propose tool improvements. It works embarrassingly well — agents are good at spotting where their tools confused them, and the fix is often one sentence in a description.
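One possible shape for the cheapest kind of eval, offered as a sketch and not as a description of any particular framework: fixed prompts, the tool each one should trigger, and a pass rate that CI can assert on. The `EvalCase` type and the `agent` callable (assumed to return the name of the first tool the agent calls) are hypothetical. The failures it collects are exactly the material to hand back to the agent for analysis.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected_tool: str


def run_evals(
    agent: Callable[[str], str], cases: list[EvalCase]
) -> tuple[float, list[dict]]:
    """Return (pass_rate, failures). agent(prompt) yields the first tool name called."""
    if not cases:
        raise ValueError("eval suite is empty")
    failures = []
    for case in cases:
        actual = agent(case.prompt)
        if actual != case.expected_tool:
            # Keep enough detail to review the failure later.
            failures.append(
                {"prompt": case.prompt, "expected": case.expected_tool, "actual": actual}
            )
    pass_rate = (len(cases) - len(failures)) / len(cases)
    return pass_rate, failures
```

In CI this becomes a single assertion on `pass_rate` against a threshold you choose, with the failures written out as a build artifact.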
Long-running, multi-step, non-deterministic processes — infrastructure people have a word for this, and it is not intelligence. Production agents need what every queue worker needs: timeouts, retries with idempotency keys so a retried step does not double-send an email, checkpointing so a 20-step run can resume from step 14, and dead-letter handling for tasks that keep failing.
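A minimal sketch of the retry, idempotency-key, and checkpoint pieces, under loud assumptions: `checkpoints` is a plain dict standing in for a durable store, each step is a hypothetical callable that accepts its idempotency key, and timeouts and backoff are left out for brevity.

```python
import hashlib
from typing import Any, Callable


def idempotency_key(run_id: str, step_index: int) -> str:
    """Stable key per (run, step), so a retried step can be deduplicated downstream."""
    return hashlib.sha256(f"{run_id}:{step_index}".encode()).hexdigest()


def run_with_checkpoints(
    run_id: str,
    steps: list[Callable[[str], Any]],
    checkpoints: dict[str, Any],
    max_attempts: int = 3,
) -> list[Any]:
    """Run steps in order, skipping any step whose result is already checkpointed."""
    results = []
    for index, step in enumerate(steps):
        key = idempotency_key(run_id, index)
        if key in checkpoints:
            # Resume path: this step finished in an earlier attempt at the run.
            results.append(checkpoints[key])
            continue
        for attempt in range(1, max_attempts + 1):
            try:
                # The step receives the key so it can pass it to the API it calls.
                result = step(key)
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # Out of retries: the caller routes the run to dead-letter.
        checkpoints[key] = result
        results.append(result)
    return results
```

Calling it again with the same `run_id` and the same `checkpoints` re-executes nothing that already completed, which is what lets a 20-step run pick up at step 14.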
On my stack that means agent runs are jobs in a queue with state in PostgreSQL, traces shipped to Loki, and per-run token spend graphed in Grafana next to CPU charts. The model is the least observable part of the system, which is precisely why everything around it must be the most observable.
The teams succeeding with agents are not the ones with the cleverest prompts — they are the ones doing the most disciplined engineering around the model: boring tools, layered guardrails, evals in CI, and ops-grade observability. Build the right system for your needs, start with the simplest one that works, and earn each step of autonomy with evidence.
Sources and further reading