I've been using Claude's tool use in production for my AI Gymbro fitness app for several months. The Anthropic documentation is good, but there's a gap between the tutorial examples and the messiness of real production.
Claude's tool use works by providing the model with a JSON Schema describing available tools. The quality of your tool descriptions is the single most important factor in tool call accuracy. A vague description like 'logs a workout' leads to the model guessing argument values.
Use enum types wherever possible to constrain the model's choices. Use minItems/maxItems on arrays to discourage the model from generating empty or excessively long arrays. Add an example or two in the description field; the model leans on these when it is unsure of the expected format. Treat the schema as strong guidance rather than a guarantee, and still validate arguments in your app before executing anything.
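To make the schema advice concrete, here's a small sketch of a tool definition that combines an enum, array bounds, and an example in the description. The `create_workout_plan` tool, its fields, and the limits are hypothetical, made up for illustration rather than taken from the app.

```typescript
// Hypothetical tool, used only to illustrate enum and array constraints.
const createWorkoutPlanTool = {
  name: "create_workout_plan",
  description:
    "Creates a workout plan for a single training day. " +
    "Example input: { goal: 'strength', exercises: ['Barbell Back Squat', 'Cable Row'] }",
  input_schema: {
    type: "object",
    properties: {
      goal: {
        type: "string",
        // enum: the model has to pick one of these values
        enum: ["strength", "hypertrophy", "endurance"],
        description: "Primary training goal for the day",
      },
      exercises: {
        type: "array",
        items: { type: "string", description: "Exercise name from library" },
        // Bounds: at least one exercise, and no runaway lists
        minItems: 1,
        maxItems: 12,
        description: "Ordered list of exercises, e.g. ['Barbell Bench Press', 'Cable Row']",
      },
    },
    required: ["goal", "exercises"],
  },
} as const;
```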
Claude's API supports a tool_choice parameter that lets you control whether the model must use a tool. For agentic flows where you expect the model to always call a specific tool to return structured data, set tool_choice to { type: "tool", name: "<tool name>" }. This forces the model to call that tool, so you never get a free-text response where you expected a tool call.
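As a rough sketch, forcing a specific tool looks like this at the request level. The types are simplified stand-ins written for this example (not the SDK's own types), `buildForcedToolRequest` is a hypothetical helper, and the model id is a placeholder you'd replace with a real one.

```typescript
// Simplified request-shape types for this sketch (not the SDK's own types).
type ToolChoice =
  | { type: "auto" } // model decides whether to call a tool
  | { type: "any" } // model must call some tool
  | { type: "tool"; name: string }; // model must call this tool

interface ToolDef {
  name: string;
  description: string;
  input_schema: object;
}

interface MessagesRequest {
  model: string;
  max_tokens: number;
  tools: ToolDef[];
  tool_choice: ToolChoice;
  messages: { role: "user" | "assistant"; content: string }[];
}

// Hypothetical helper: builds a request body that forces one specific tool.
function buildForcedToolRequest(
  userText: string,
  toolName: string,
  tools: ToolDef[],
): MessagesRequest {
  // Fail early if the forced tool isn't actually in the tool list.
  if (!tools.some((t) => t.name === toolName)) {
    throw new Error(`Unknown tool: ${toolName}`);
  }
  return {
    model: "YOUR_MODEL_ID", // placeholder, not a real model id
    max_tokens: 1024,
    tools,
    tool_choice: { type: "tool", name: toolName },
    messages: [{ role: "user", content: userText }],
  };
}
```

The returned object is what you'd pass as the body of a Messages API call.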
Claude Tool Use Flow
User: "Log 3 sets of 8 reps bench press at 80kg"
                      │
                      ▼
┌────────────────────────────────────────────┐
│ Claude receives:                           │
│  - User message                            │
│  - Tool definitions (JSON Schema)          │
│  - System prompt (cached)                  │
└─────────────────────┬──────────────────────┘
                      │
                      ▼
┌────────────────────────────────────────────┐
│ Claude returns:                            │
│  stop_reason: "tool_use"                   │
│  content: [{                               │
│    type: "tool_use",                       │
│    name: "log_workout_set",                │
│    input: {                                │
│      exercise_name: "Barbell Bench Press", │
│      reps: 8, weight_kg: 80                │
│    }                                       │
│  }, ...]  (one tool_use block per set)     │
└─────────────────────┬──────────────────────┘
                      │
                      ▼
┌────────────────────────────────────────────┐
│ Your App: Execute Tool                     │
│  - Validate args (enum, range)             │
│  - Write to database                       │
│  - Return tool_result                      │
└─────────────────────┬──────────────────────┘
                      │
                      ▼
┌────────────────────────────────────────────┐
│ Second Claude call with result             │
│  → Natural language confirmation           │
│    "Logged 3x8 Barbell Bench Press..."     │
└────────────────────────────────────────────┘
For complex multi-tool workflows in the Claude API, I define a 'done' tool that the agent calls when it has finished its task. This tool takes a 'summary' parameter describing what was accomplished. This pattern solves the 'how do I know when the agent is done' problem cleanly.
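Here's a minimal sketch of an agent loop built around the 'done' tool. It's an illustration under assumptions, not the app's actual loop: `callModel` and `runTool` are hypothetical functions you'd supply (the first wraps the real API call and returns the assistant's content blocks), and the block types are simplified.

```typescript
// Simplified content-block shapes for this sketch.
type ToolUseBlock = { type: "tool_use"; id: string; name: string; input: Record<string, unknown> };
type TextBlock = { type: "text"; text: string };
type ContentBlock = ToolUseBlock | TextBlock;
type ToolResultBlock = { type: "tool_result"; tool_use_id: string; content: string };
type Message = { role: "user" | "assistant"; content: string | ContentBlock[] | ToolResultBlock[] };

// Hypothetical: wraps the real API call and returns the assistant's content blocks.
type CallModel = (messages: Message[]) => Promise<ContentBlock[]>;
// Hypothetical: executes one tool and returns its output as a string.
type RunTool = (name: string, input: Record<string, unknown>) => Promise<string>;

async function runUntilDone(
  callModel: CallModel,
  runTool: RunTool,
  userText: string,
  maxTurns = 10,
): Promise<string> {
  const messages: Message[] = [{ role: "user", content: userText }];
  for (let turn = 0; turn < maxTurns; turn++) {
    const content = await callModel(messages);
    const toolUses = content.filter((b): b is ToolUseBlock => b.type === "tool_use");

    // No tool call at all: return the text instead of looping forever.
    if (toolUses.length === 0) {
      return content
        .filter((b): b is TextBlock => b.type === "text")
        .map((b) => b.text)
        .join("");
    }

    // Run every real tool first, even if 'done' arrived in the same response.
    const results: ToolResultBlock[] = [];
    for (const call of toolUses) {
      if (call.name === "done") continue;
      const output = await runTool(call.name, call.input);
      results.push({ type: "tool_result", tool_use_id: call.id, content: output });
    }

    // The 'done' tool ends the loop; its summary is the final answer.
    const done = toolUses.find((b) => b.name === "done");
    if (done) return String(done.input.summary ?? "");

    messages.push({ role: "assistant", content });
    messages.push({ role: "user", content: results });
  }
  // Hard cap so a confused agent can't run (and bill) indefinitely.
  throw new Error(`Agent did not call 'done' within ${maxTurns} turns`);
}
```

The `maxTurns` cap is a design choice of this sketch: the 'done' tool tells you when the agent thinks it's finished, and the cap covers the case where it never does.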
Tool calls fail for two reasons: the model calls a tool with invalid arguments (your schema validation catches this), or the tool execution itself fails. In both cases, return the error to the model so it can correct itself: send a tool_result that carries the matching tool_use_id, the error details, and is_error: true.
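For the first failure mode, here's a sketch of hand-rolled argument validation for the log_workout_set tool, mirroring the ranges and enum from its schema. The helper names (`validateLogWorkoutSet`, `validationError`) are made up for this example; in practice a JSON Schema validator library would do the same job.

```typescript
type ToolResult = {
  type: "tool_result";
  tool_use_id: string;
  content: string;
  is_error: boolean;
};

const SET_TYPES: readonly string[] = ["working", "warmup", "dropset", "failure"];

// Returns a list of human-readable problems; an empty list means the args are valid.
function validateLogWorkoutSet(args: Record<string, unknown>): string[] {
  const problems: string[] = [];
  const name = args.exercise_name;
  const reps = args.reps;
  const weight = args.weight_kg;
  const setType = args.set_type;

  if (typeof name !== "string" || name.trim() === "") {
    problems.push("exercise_name must be a non-empty string");
  }
  if (typeof reps !== "number" || !Number.isInteger(reps) || reps < 1 || reps > 100) {
    problems.push("reps must be an integer between 1 and 100");
  }
  // weight_kg and set_type are optional, so only check them when present.
  if (weight !== undefined) {
    if (typeof weight !== "number" || !Number.isFinite(weight) || weight < 0 || weight > 1000) {
      problems.push("weight_kg must be a number between 0 and 1000");
    }
  }
  if (setType !== undefined) {
    if (typeof setType !== "string" || !SET_TYPES.includes(setType)) {
      problems.push(`set_type must be one of: ${SET_TYPES.join(", ")}`);
    }
  }
  return problems;
}

// Wraps validation problems in a tool_result the model can read and act on.
function validationError(toolUseId: string, problems: string[]): ToolResult {
  return {
    type: "tool_result",
    tool_use_id: toolUseId,
    content: JSON.stringify({ error: "Invalid arguments", problems }),
    is_error: true,
  };
}
```

Listing every problem at once, rather than stopping at the first, gives the model a chance to fix all of them in a single retry.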
Claude's streaming API sends tool use events as they're generated, but tool call arguments arrive as partial JSON fragments (input_json_delta events). A streaming parser has to buffer those fragments per content block and only trigger tool execution once the block is complete and the message ends with stop_reason: 'tool_use'. Anthropic's official SDK handles this buffering automatically.
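If you're not using the SDK's helpers, the buffering can be sketched roughly like this. The event types are trimmed down to the fields this example needs, and `ToolCallBuffer` is a hypothetical class name. It hands back a tool call when its content block closes; whether to execute right away or wait for the final stop_reason is left to the caller.

```typescript
// Trimmed-down stream event shapes: only the fields this sketch uses.
type StreamEvent =
  | { type: "content_block_start"; index: number; content_block: { type: string; id?: string; name?: string } }
  | { type: "content_block_delta"; index: number; delta: { type: string; partial_json?: string } }
  | { type: "content_block_stop"; index: number }
  | { type: "message_start" | "message_delta" | "message_stop" | "ping" };

type CompletedToolCall = { id: string; name: string; input: unknown };

class ToolCallBuffer {
  // One entry per in-flight tool_use block, keyed by content block index.
  private pending = new Map<number, { id: string; name: string; json: string }>();

  // Feed one event; returns a completed tool call when its block closes, else null.
  push(event: StreamEvent): CompletedToolCall | null {
    switch (event.type) {
      case "content_block_start": {
        const block = event.content_block;
        if (block.type === "tool_use" && block.id && block.name) {
          this.pending.set(event.index, { id: block.id, name: block.name, json: "" });
        }
        return null;
      }
      case "content_block_delta": {
        const entry = this.pending.get(event.index);
        // Fragments are not valid JSON on their own, so just concatenate them.
        if (entry && event.delta.type === "input_json_delta") {
          entry.json += event.delta.partial_json ?? "";
        }
        return null;
      }
      case "content_block_stop": {
        const entry = this.pending.get(event.index);
        if (!entry) return null; // a text block or other non-tool block closed
        this.pending.delete(event.index);
        // Parse only now that the block is complete; an empty buffer means no arguments.
        return { id: entry.id, name: entry.name, input: JSON.parse(entry.json || "{}") };
      }
      default:
        return null; // message-level events carry no tool arguments
    }
  }
}
```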
// Claude tool definition with precise schema
const tools = [
  {
    name: "log_workout_set",
    description:
      "Records a single completed exercise set to the user's workout log. " +
      "Call this ONCE per set, not once per exercise. " +
      "Use the exact exercise name from our library, e.g. 'Barbell Back Squat'.",
    input_schema: {
      type: "object",
      properties: {
        exercise_name: {
          type: "string",
          description: "Exercise name from library, e.g. 'Barbell Bench Press', 'Cable Row'",
        },
        reps: { type: "integer", minimum: 1, maximum: 100 },
        weight_kg: { type: "number", minimum: 0, maximum: 1000 },
        set_type: {
          type: "string",
          enum: ["working", "warmup", "dropset", "failure"],
          description: "Type of set (default 'working' if not specified)",
        },
      },
      required: ["exercise_name", "reps"],
    },
  },
  {
    name: "done",
    description: "Call when you have finished all requested actions.",
    input_schema: {
      type: "object",
      properties: {
        summary: { type: "string", description: "Brief summary of what was accomplished" },
      },
      required: ["summary"],
    },
  },
]
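// Registry mapping tool names to their implementations (handlers omitted here)
const toolHandlers: Record<string, (args: Record<string, unknown>) => Promise<unknown>> = {}
// Handle tool call errors gracefully
async function executeTool(toolUseId: string, name: string, args: Record<string, unknown>) {
  try {
    const handler = toolHandlers[name]
    if (!handler) throw new Error(`Unknown tool: ${name}`)
    const result = await handler(args)
    // tool_use_id must match the id of the tool_use block being answered
    return { type: "tool_result", tool_use_id: toolUseId, content: JSON.stringify(result), is_error: false }
  } catch (error) {
    // Report the failure to the model so it can adjust its arguments and retry
    return {
      type: "tool_result",
      tool_use_id: toolUseId,
      content: JSON.stringify({
        error: String(error),
        suggestion: "Try using a different exercise name or check the arguments",
      }),
      is_error: true,
    }
  }
}
Tool definitions contribute to your input token count. A typical tool with a detailed schema and description costs 100-300 tokens. If you have 10 tools, that's 1,000-3,000 tokens of overhead on every request. I implemented a context-aware tool selector that reduced my tool overhead by 60%.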
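The post doesn't go into how the selector works, so the following is only one possible shape for it, not the app's implementation: tag each tool with trigger keywords and send only the tools whose keywords appear in the user's message. `TaggedTool`, `selectTools`, and the keyword lists are all assumptions made for this sketch.

```typescript
interface ToolDef {
  name: string;
  description: string;
  input_schema: object;
}

// Hypothetical catalog entry: a tool plus the keywords that make it relevant.
interface TaggedTool {
  tool: ToolDef;
  keywords: string[];
  alwaysInclude?: boolean; // e.g. the 'done' tool
}

// Returns only the tools relevant to this message, preserving catalog order.
function selectTools(userMessage: string, catalog: TaggedTool[]): ToolDef[] {
  const text = userMessage.toLowerCase();
  return catalog
    .filter((t) => t.alwaysInclude === true || t.keywords.some((k) => text.includes(k.toLowerCase())))
    .map((t) => t.tool);
}

// Example catalog; schemas are stubbed to keep the sketch short.
const catalog: TaggedTool[] = [
  {
    tool: { name: "log_workout_set", description: "Records a set", input_schema: { type: "object" } },
    keywords: ["log", "set", "reps"],
  },
  {
    tool: { name: "get_progress", description: "Fetches progress stats", input_schema: { type: "object" } },
    keywords: ["progress", "stats", "history"],
  },
  {
    tool: { name: "done", description: "Call when finished", input_schema: { type: "object" } },
    keywords: [],
    alwaysInclude: true,
  },
];

const selected = selectTools("Log 3 sets of 8 reps bench press at 80kg", catalog);
console.log(selected.map((t) => t.name));
```

Keyword matching is crude (an embedding lookup or a cheap classifier would be more robust), and varying the tool list between requests may reduce prompt-cache hits, so it's worth measuring both effects.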
Claude sometimes returns multiple tool calls in a single response. If your tool implementations have side effects or read from shared state, parallel execution can cause race conditions. I learned this when two parallel log_workout_set calls created duplicate workout records because both read the same empty workout state before either had written. Either enforce sequential execution or design tools to be idempotent.
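The sequential option is the simpler of the two. A minimal sketch, assuming a hypothetical `run` function that executes one tool call and returns its output as a string:

```typescript
type ToolCall = { id: string; name: string; input: Record<string, unknown> };
type ToolResult = { type: "tool_result"; tool_use_id: string; content: string };

// Runs tool calls one at a time, in the order the model emitted them.
// Each call finishes (including its writes) before the next one starts reading.
async function executeSequentially(
  calls: ToolCall[],
  run: (call: ToolCall) => Promise<string>,
): Promise<ToolResult[]> {
  const results: ToolResult[] = [];
  for (const call of calls) {
    const output = await run(call); // await inside the loop is deliberate here
    results.push({ type: "tool_result", tool_use_id: call.id, content: output });
  }
  return results;
}
```

Compare this with `Promise.all(calls.map(run))`, which starts every call immediately and is where the duplicate-record race comes from. Sequential execution costs some latency, so read-only tools can still run in parallel if you separate them out.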
Testing LLM applications that rely on tool calls requires a two-layer strategy: mock tests that stub the LLM and verify tool execution logic; and integration tests that use a real LLM call against test conversations with known correct tool call sequences.
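Here's roughly what the mock layer can look like without any test framework. A canned response stands in for the LLM, and the test checks that the tool execution logic does the right thing with it. `dispatch`, the fixture, and the in-memory "database" are all hypothetical stand-ins for this sketch.

```typescript
type ToolUseBlock = { type: "tool_use"; id: string; name: string; input: Record<string, unknown> };
type Handler = (input: Record<string, unknown>) => Promise<unknown>;

// Code under test: routes each tool_use block to its handler.
async function dispatch(blocks: ToolUseBlock[], handlers: Record<string, Handler>): Promise<void> {
  for (const block of blocks) {
    const handler = handlers[block.name];
    if (!handler) throw new Error(`No handler for tool: ${block.name}`);
    await handler(block.input);
  }
}

// Fixture: what the model is expected to return for "log 8 reps of bench at 80kg".
const cannedResponse: ToolUseBlock[] = [
  {
    type: "tool_use",
    id: "toolu_test_1",
    name: "log_workout_set",
    input: { exercise_name: "Barbell Bench Press", reps: 8, weight_kg: 80 },
  },
];

// Mock-layer test: no network call, so it is fast and deterministic.
async function testLogsOneSet(): Promise<void> {
  const db: Record<string, unknown>[] = []; // in-memory stand-in for the database
  const handlers: Record<string, Handler> = {
    log_workout_set: async (input) => {
      db.push(input);
      return { ok: true };
    },
  };
  await dispatch(cannedResponse, handlers);
  if (db.length !== 1) throw new Error(`expected 1 logged set, got ${db.length}`);
  if (db[0].reps !== 8) throw new Error("expected the logged set to have 8 reps");
}
```

The integration layer reuses the same assertions but swaps `cannedResponse` for the content of a real API call, which is slower and non-deterministic, so it's better suited to a smaller, curated set of conversations.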
After six months of Claude tool use in production: tool call success rate is 94.2% on the first attempt and 98.8% after one retry. Average tokens per tool definition: 180. Average tool calls per user session: 4.3. Most common failure mode: the model calls a tool with arguments that pass schema validation but are semantically wrong.