How do you decide whether to build a workflow or a true AI agent?

Apply the flowchart test: if you can draw the sequence of steps in advance, build a deterministic workflow and place the LLM inside the steps rather than in charge of them. True agent territory is reserved for tasks whose trajectory cannot be specified ahead of time, such as open-ended debugging or multi-file code changes. Start with the simpler system and add autonomy only when that simpler system demonstrably underperforms.

What makes tool design so critical for production agents?

Agents are exactly as good as their tools. Task-shaped tools that handle related operations internally beat a collection of fine-grained API wrappers, because fewer tools means fewer chances to compose them incorrectly. Tool descriptions must state when to call the tool, not just what it does, since small wording changes produce measurable behavior differences and should be treated as tunable, evaluated artifacts.

What guardrails are actually effective in production, and in what order should they be applied?

Effective guardrails form a layered architecture: permission tiers that require human confirmation for irreversible actions (sending email, deleting data, touching money), scoped credentials with minimal grants, per-run budget caps on tool calls and tokens, and full trace logging of every tool call with inputs and outputs. The underlying design principle is to sort every action by reversibility — reversible actions can be autonomous while hard-to-reverse actions require explicit gates.

What evaluation layers should be in place before an agent feature ships?

Three layers are recommended: tool-level tests that verify the model selects the correct tool with the right arguments for a given state, trajectory evals covering 20–50 realistic end-to-end tasks scored on completion, step count, and cost, and daily production sampling of real traces reviewed by a human or LLM judge. A useful technique is using the agent itself to analyze its failed transcripts and propose improvements to tool descriptions.

Why does the post treat production agents as an ops problem rather than an AI problem?

Long-running, multi-step, non-deterministic processes require the same operational primitives as any queue worker: timeouts, retries with idempotency keys, checkpointing so a run can resume mid-sequence, and dead-letter handling for persistently failing tasks. The post runs agent jobs in a queue with state in PostgreSQL, traces shipped to Loki, and per-run token spend graphed in Grafana — because the model is the least observable part of the system, everything around it must be the most observable.

AI Agents in Production: Lessons from a Year of Agentic Workflows

I use agentic coding tools daily — Claude Code does a meaningful share of my refactoring and infrastructure scripting — and I have shipped agent-shaped features into products. The gap between an agent demo and an agent in production is wider than any other gap in software I have worked on. The demo is a weekend; the production version is months of tool design, guardrails, and evals.

This post is the lessons file I keep updating: what actually breaks, what actually helps, and the order in which to spend your effort. Most of it converges on the advice in Anthropic's building-effective-agents research post, which I consider mandatory reading — but here it is filtered through my own production scars.

Lesson One: Most Agents Should Be Workflows

Anthropic draws the line precisely: workflows orchestrate LLM calls through predefined code paths, while agents let the model dynamically direct its own process and tool usage. The single most common production failure I see is choosing the second when the task wanted the first. If your task has a known sequence — fetch invoice, extract fields, validate, post to the ERP — write the sequence in code and put the LLM inside the steps, not in charge of them.

Genuine agent territory is where the trajectory cannot be specified in advance: debugging, open-ended research, multi-file code changes. The honest test I apply: can I draw the flowchart? If yes, it is a workflow, and the deterministic version will be cheaper, faster, and dramatically easier to debug. Start simple and add autonomy only when the simpler system demonstrably underperforms.
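A minimal sketch of the invoice sequence as a workflow, with the model confined to a single step. Every name here (fake_llm_extract, validate, process_invoice, the Invoice fields) is hypothetical, and the extraction step is a stub standing in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Invoice:
    vendor: str
    total: float


def fake_llm_extract(raw_text: str) -> dict:
    """Stand-in for a model call; a real version would prompt an LLM for these fields."""
    vendor, total = raw_text.split(",")
    return {"vendor": vendor.strip(), "total": total.strip()}


def validate(fields: dict) -> Invoice:
    # Plain code, not the model, decides whether the extraction is acceptable.
    total = float(fields["total"])
    if total <= 0:
        raise ValueError(f"total must be positive, got {total}")
    return Invoice(vendor=fields["vendor"], total=total)


def process_invoice(
    raw_text: str,
    extract: Callable[[str], dict],
    post_to_erp: Callable[[Invoice], None],
) -> Invoice:
    # The order of steps is fixed in code; the model never chooses what runs next.
    fields = extract(raw_text)   # the only LLM-backed step
    invoice = validate(fields)   # deterministic
    post_to_erp(invoice)         # deterministic
    return invoice


posted = []  # stand-in for the ERP
result = process_invoice("Acme Ltd, 120.50", fake_llm_extract, posted.append)
```

Because the control flow lives in process_invoice, a failure points at one named step rather than at a transcript of model decisions.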

Lesson Two: Tool Design Is Most of the Job

Agents are exactly as good as their tools. Anthropic's engineering team treats the agent-computer interface with the same rigor as a human interface, and after a year of writing tools I agree with every word. The rules that pay rent:

Fewer, task-shaped tools

Do not wrap every API endpoint. One schedule_meeting tool that handles availability internally beats list_users plus list_events plus create_event: three tools mean three chances to compose them wrong.

Descriptions are prompts

Write each description as if onboarding a new teammate, and state when to call the tool, not just what it does. Small wording refinements produce measurable behavior changes — treat descriptions as tunable, evaluated artifacts.

Make mistakes hard to express

Require absolute paths, use enums instead of free strings, validate aggressively, and return actionable error messages. An agent acts on what the error tells it; a vague 400 gives it nothing to change, so it retries the same call in a loop.
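One way to sketch this kind of input validation, assuming a tool layer written in Python. The function names and the /workspace prefix in the error message are hypothetical; the status values mirror the enum in the search_orders example.

```python
from enum import Enum
from pathlib import PurePosixPath


class OrderStatus(str, Enum):
    PENDING = "pending"
    PAID = "paid"
    SHIPPED = "shipped"
    REFUNDED = "refunded"


def validate_status(value: str) -> OrderStatus:
    """Reject free strings and tell the caller exactly which values are allowed."""
    try:
        return OrderStatus(value)
    except ValueError:
        allowed = ", ".join(s.value for s in OrderStatus)
        raise ValueError(
            f"status={value!r} is not valid. Use one of: {allowed}."
        ) from None


def validate_path(value: str) -> PurePosixPath:
    """Reject relative paths and show what a corrected argument looks like."""
    path = PurePosixPath(value)
    if not path.is_absolute():
        # '/workspace' is an assumed sandbox root, used only to make the hint concrete.
        raise ValueError(
            f"path={value!r} is relative. Pass an absolute path such as '/workspace/{value}'."
        )
    return path
```

Each error names the offending argument and a corrected form, so the next attempt has something concrete to change.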

Return lean, meaningful context

Agents have a context budget. Return names instead of cryptic IDs, paginate big results, and filter server-side. Every junk token you return is reasoning capacity you took away.
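A rough sketch of a lean tool result, assuming order rows that carry an internal customer_id and a separate lookup of customer names. The row shape, the lean_page name, and the page size are illustrative assumptions.

```python
def lean_page(rows: list, customers: dict, page: int = 1, page_size: int = 2) -> dict:
    """Return one small page, with customer names in place of internal IDs."""
    start = (page - 1) * page_size
    chunk = rows[start:start + page_size]
    return {
        "orders": [
            {
                "order": r["number"],
                "customer": customers[r["customer_id"]],  # name, not a cryptic ID
                "status": r["status"],
            }
            for r in chunk
        ],
        "page": page,
        # Tells the agent whether asking for the next page is worthwhile.
        "has_more": start + page_size < len(rows),
    }


rows = [
    {"number": "A-1", "customer_id": 7, "status": "paid", "internal_blob": "..."},
    {"number": "A-2", "customer_id": 9, "status": "shipped", "internal_blob": "..."},
    {"number": "A-3", "customer_id": 7, "status": "refunded", "internal_blob": "..."},
]
customers = {7: "Dana Reyes", 9: "Sam Okoro"}
first = lean_page(rows, customers, page=1)
```

Fields the agent cannot use, such as internal_blob here, never reach the context window.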

// Tool descriptions are prompts. Say WHEN to call it, not just what it does.
{
  "name": "search_orders",
  "description": "Search customer orders by status, date range, or customer email. Call this whenever the user asks about a specific order, a refund, or delivery status. Do NOT answer order questions from memory.",
  "input_schema": {
    "type": "object",
    "properties": {
      "query":  { "type": "string", "description": "Free-text search, e.g. an order ID or email" },
      "status": { "type": "string", "enum": ["pending", "paid", "shipped", "refunded"] }
    },
    "required": ["query"]
  }
}

Lesson Three: Guardrails Are an Architecture, Not a Prompt

Telling the model to be careful is not a guardrail. The layers that have actually saved me, in priority order:

Permission tiers per tool. Read-only tools run free; mutating tools require policy checks; irreversible tools — sending email, deleting data, touching money — require explicit human confirmation in the loop.
Scoped credentials. The agent's database user gets row-level scopes, its API tokens get minimal grants, and its filesystem is a sandbox. Assume the model will eventually do the dumbest permitted thing.
Budget caps per run: max tool calls, max tokens, max wall-clock minutes. Runaway loops should hit a wall you chose, not a bill you discover.
Full trace logging of every tool call with inputs and outputs. When an agent misbehaves, the trace is the difference between a fix and a shrug.
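A per-run budget cap could look roughly like the sketch below. RunBudget and BudgetExceeded are hypothetical names, and wall-clock time is tracked in seconds rather than minutes for simplicity.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a run crosses one of its caps."""


@dataclass
class RunBudget:
    max_tool_calls: int
    max_tokens: int
    max_seconds: float
    tool_calls: int = 0
    tokens: int = 0
    started: float = field(default_factory=time.monotonic)

    def charge(self, tokens: int) -> None:
        """Record one tool call and its token cost, then enforce every cap."""
        self.tool_calls += 1
        self.tokens += tokens
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded(f"tool calls {self.tool_calls} > {self.max_tool_calls}")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"tokens {self.tokens} > {self.max_tokens}")
        elapsed = time.monotonic() - self.started
        if elapsed > self.max_seconds:
            raise BudgetExceeded(f"elapsed {elapsed:.1f}s > {self.max_seconds}s")
```

The harness would call charge before or after each tool call and treat BudgetExceeded as a terminal state for the run, not as an error to retry.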

The design heuristic underneath all four: sort every action by reversibility. Reversible actions can be autonomous; hard-to-reverse actions get gates. This is also why dedicated tools beat a generic bash tool for sensitive operations — a send_email tool is easy to intercept and confirm, while a shell command that happens to call curl is opaque to your harness.
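The reversibility sort can be sketched as a small authorization function. The tier names, the update_address tool, and the three return strings are assumptions for illustration. So is the choice to run the policy check on irreversible tools as well as mutating ones.

```python
from enum import Enum


class Tier(Enum):
    READ_ONLY = "read_only"
    MUTATING = "mutating"
    IRREVERSIBLE = "irreversible"


# Hypothetical registry: every tool is assigned a tier when it is defined.
TOOL_TIERS = {
    "search_orders": Tier.READ_ONLY,
    "update_address": Tier.MUTATING,
    "send_email": Tier.IRREVERSIBLE,
}


def authorize(tool: str, policy_ok: bool, human_confirmed: bool = False) -> str:
    """Return 'allow', 'deny', or 'needs_confirmation' for a proposed tool call."""
    # Unregistered tools fall into the strictest tier rather than the loosest.
    tier = TOOL_TIERS.get(tool, Tier.IRREVERSIBLE)
    if tier is Tier.READ_ONLY:
        return "allow"
    if not policy_ok:
        return "deny"
    if tier is Tier.IRREVERSIBLE and not human_confirmed:
        return "needs_confirmation"
    return "allow"
```

The gate lives in the harness, so it holds regardless of what the prompt says or how the model behaves.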

Lesson Four: No Evals, No Agent

An agent without an eval suite is a system whose quality you discover from users. Before any agent feature ships, I want three layers in place:

Tool-level tests: for a given state, does the model pick the right tool with the right arguments? Cheap to run, catches most regressions from prompt or description changes.
Trajectory evals: a set of 20 to 50 realistic tasks scored end to end — did it finish, in how many steps, at what cost. Run on every prompt change and every model upgrade.
Production sampling: a daily slice of real traces reviewed by a human or an LLM judge, because real users find trajectories your test set never imagined.
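For the trajectory layer, the scoring side can be as small as the sketch below. RunResult, summarize, and regressed are hypothetical names, and the 0.05 tolerance is an arbitrary placeholder, not a recommended threshold.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    task_id: str
    completed: bool
    steps: int
    cost_usd: float


def summarize(results: list) -> dict:
    """Aggregate completion, step count, and cost across one eval run."""
    if not results:
        raise ValueError("no results to summarize")
    n = len(results)
    return {
        "completion_rate": sum(r.completed for r in results) / n,
        "avg_steps": sum(r.steps for r in results) / n,
        "total_cost_usd": round(sum(r.cost_usd for r in results), 4),
    }


def regressed(baseline: dict, candidate: dict, tolerance: float = 0.05) -> bool:
    """True when completion drops by more than the tolerance against the baseline."""
    return candidate["completion_rate"] < baseline["completion_rate"] - tolerance


results = [
    RunResult("refund-lookup", True, 4, 0.10),
    RunResult("address-change", True, 6, 0.20),
    RunResult("bulk-export", False, 8, 0.30),
    RunResult("order-status", True, 10, 0.40),
]
summary = summarize(results)
```

In CI, the baseline summary would be stored before a prompt or model change and compared against the summary produced after it.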

Anthropic's tool-writing guide pushes this loop further: use the agent itself to analyze its failed transcripts and propose tool improvements. It works embarrassingly well — agents are good at spotting where their tools confused them, and the fix is often one sentence in a description.

Lesson Five: Agents Are an Ops Problem Wearing an AI Costume

Long-running, multi-step, non-deterministic processes — infrastructure people have a word for this, and it is not intelligence. Production agents need what every queue worker needs: timeouts, retries with idempotency keys so a retried step does not double-send an email, checkpointing so a 20-step run can resume from step 14, and dead-letter handling for tasks that keep failing.
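Checkpointing and idempotency keys fit together roughly as in the sketch below. The checkpoint here is an in-memory dict standing in for a database row, Outbox stands in for any side-effecting service that deduplicates on a key, and all names are hypothetical. Timeouts and dead-letter handling are left out.

```python
import hashlib
from typing import Optional


def idempotency_key(run_id: str, step_index: int) -> str:
    """Same run and step always produce the same key, so a retry can be detected."""
    return hashlib.sha256(f"{run_id}:{step_index}".encode()).hexdigest()[:16]


class Outbox:
    """Stand-in for an email or payment service that deduplicates on a key."""

    def __init__(self) -> None:
        self.sent = {}

    def send(self, key: str, message: str) -> bool:
        if key in self.sent:
            return False  # already delivered; a retry becomes a no-op
        self.sent[key] = message
        return True


def run_steps(run_id: str, steps: list, checkpoint: dict, outbox: Outbox,
              fail_at: Optional[int] = None) -> int:
    """Resume from the saved step index; returns the index of the next step to run."""
    start = checkpoint.get(run_id, 0)
    for i in range(start, len(steps)):
        outbox.send(idempotency_key(run_id, i), steps[i])
        if i == fail_at:
            # Simulates a crash after the side effect but before the checkpoint write.
            raise RuntimeError(f"simulated crash at step {i}")
        checkpoint[run_id] = i + 1
    return checkpoint.get(run_id, start)
```

If a run crashes between the side effect and the checkpoint write, the resumed run repeats that step, and the key is what keeps the repeat from sending twice.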

On my stack that means agent runs are jobs in a queue with state in PostgreSQL, traces shipped to Loki, and per-run token spend graphed in Grafana next to CPU charts. The model is the least observable part of the system, which is precisely why everything around it must be the most observable.

The Pre-Launch Checklist

Flowchart test passed: anything with a fixed sequence got built as a workflow, not an agent.
Tools are task-shaped, with when-to-use descriptions and enum-constrained inputs.
Permission tiers, scoped credentials, and per-run budget caps are enforced in code.
Tool-level and trajectory evals run in CI; a baseline score is recorded before every model or prompt change.
Every run leaves a complete trace with costs, and irreversible actions show up in a human review queue.

The Takeaway

The teams succeeding with agents are not the ones with the cleverest prompts — they are the ones doing the most disciplined engineering around the model: boring tools, layered guardrails, evals in CI, and ops-grade observability. Build the right system for your needs, start with the simplest one that works, and earn each step of autonomy with evidence.
