Everyone's blog post about prompt engineering covers chain-of-thought, few-shot examples, and role instructions. That's fine for demos. Production prompts have different requirements: they need to be versioned, tested, monitored for regression, and optimized for cost.
The biggest mistake I made early on was storing prompts inline in application code. The fix: treat prompts as artifacts with their own versioning. I store prompts in a prompts/ directory as Markdown files with YAML frontmatter containing version, model, max_tokens, temperature, and a description.
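As a rough sketch of what loading such a file could look like, the snippet below splits the frontmatter from the prompt body. The `Prompt` class, `parse_prompt` function, and the sample prompt text are hypothetical names for illustration, and the parser only handles flat `key: value` pairs; a real setup would more likely hand the frontmatter to a YAML library.

```python
from dataclasses import dataclass


@dataclass
class Prompt:
    meta: dict  # frontmatter fields such as version, model, temperature
    body: str   # the prompt text that follows the frontmatter


def parse_prompt(text: str) -> Prompt:
    """Split a '---' delimited frontmatter block from the prompt body.

    Simplification: flat `key: value` pairs only, and every value is kept
    as a string. A YAML parser would give typed values.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("prompt file must start with a '---' frontmatter block")
    try:
        end = next(
            i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---"
        )
    except StopIteration:
        raise ValueError("frontmatter block is never closed") from None
    meta = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip().strip('"')
    return Prompt(meta=meta, body="\n".join(lines[end + 1:]).strip())


# Hypothetical file contents, mirroring the frontmatter fields named above.
SAMPLE = '''---
version: "1.4.2"
model: "claude-3-5-haiku-20241022"
temperature: 0.1
max_tokens: 1024
description: "Workout plan generation"
---
You are a fitness coach. Recommend a workout for {{user_profile}}.
'''

prompt = parse_prompt(SAMPLE)
print(prompt.meta["version"], prompt.meta["model"])
```

Keeping the metadata next to the prompt text means a single Git diff shows both the wording change and any change to model or sampling settings.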
Use a proper template library: Handlebars (JavaScript) or Jinja2 (Python) give you reusable prompt components (partials in Handlebars, includes and macros in Jinja2), conditional sections, and loop constructs. Extract common prompt components into shared partials that every prompt pulls in.
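To show the idea without any dependency, here is a stand-in built on the standard library's `string.Template` rather than Handlebars or Jinja2. The partial names, their wording, and `render_prompt` are made up for illustration; a template library would express the conditional and the loop inside the template itself instead of in Python.

```python
from string import Template

# Shared partials that several prompts reuse (hypothetical content).
PARTIALS = {
    "tone": "Be concise and encouraging.",
    "safety": "Never recommend exercises that conflict with a stated injury.",
}

BASE = Template("$tone\n$safety\n\nUser profile: $user_profile\n$equipment_section")


def render_prompt(user_profile: str, equipment: list[str]) -> str:
    """Assemble a prompt from shared partials plus per-request values."""
    # Conditional section: only mention equipment when the user listed some.
    equipment_section = ""
    if equipment:
        # Loop construct: one bullet per item.
        bullets = "\n".join(f"- {item}" for item in equipment)
        equipment_section = f"Available equipment:\n{bullets}"
    return BASE.substitute(
        PARTIALS,
        user_profile=user_profile,
        equipment_section=equipment_section,
    ).strip()


print(render_prompt("beginner, 3 days/week", ["dumbbells", "bench"]))
```

Because `substitute` raises `KeyError` on a missing placeholder, a forgotten variable fails loudly at render time instead of shipping a prompt with a hole in it.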
Few-shot examples are some of the most powerful prompt engineering tools and some of the hardest to manage. Store examples in a separate examples/ directory, tagged with the prompt they belong to and a validity date. Review and refresh examples quarterly.
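One possible way to act on those tags is sketched below. It assumes the validity date is an expiry date, which is my reading rather than something stated above, and the `FewShotExample` fields and sample entries are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class FewShotExample:
    prompt_name: str   # which prompt this example belongs to
    valid_until: date  # assumption: the validity date is an expiry date
    input: str
    expected_output: str


def active_examples(examples, prompt_name, today):
    """Return the examples for one prompt that have not yet expired."""
    return [
        ex
        for ex in examples
        if ex.prompt_name == prompt_name and ex.valid_until >= today
    ]


# Hypothetical entries standing in for files under examples/.
examples = [
    FewShotExample("workout-recommendation", date(2025, 6, 30), "beginner, legs", "squat plan"),
    FewShotExample("workout-recommendation", date(2024, 12, 31), "beginner, arms", "curl plan"),
    FewShotExample("meal-plan", date(2025, 6, 30), "vegetarian", "lentil plan"),
]

current = active_examples(examples, "workout-recommendation", today=date(2025, 1, 15))
print([ex.input for ex in current])
```

Running the same filter with a date one quarter ahead gives a quick list of examples that are due for review.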
Prompt Ops Pipeline
prompts/
├── workout-recommendation.md
│   ├── --- (YAML frontmatter)
│   │   version: "1.4.2"
│   │   model: "claude-3-5-haiku-20241022"
│   │   temperature: 0.1
│   │   max_tokens: 1024
│   │   description: "Workout plan generation"
│   ├── ---
│   └── [prompt content with Handlebars templates]
│
├── examples/
│   └── workout-recommendation/
│       ├── example-01.json   { input, expected_output }
│       └── example-02.json
│
└── evals/
    └── workout-recommendation/
        ├── test-01.json   { input, assertions: [...] }
        └── test-02.json
CI Pipeline:
PR opened
    │
    ▼
Load prompt change diff
    │
    ▼
Run evals (call real LLM on test set)
    │
    ├── Pass rate ≥ 95%? → Allow merge
    └── Pass rate < 95%? → Block + show failures
The most reliable prompt engineering pattern I've found for structured output is providing the output format as a complete example in the prompt, not as a JSON Schema description. Models follow examples better than schemas, and you catch format violations before they reach your JSON parser.
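A minimal sketch of that pattern follows: the same example object is embedded in the prompt and then reused to check the shape of the response. The field names, `PROMPT_SUFFIX`, and `parse_response` are illustrative assumptions, and the check only covers top-level keys and types.

```python
import json

# The output format shown to the model as a complete example.
# Field names and values are illustrative.
OUTPUT_EXAMPLE = {
    "workout_name": "Full-body starter",
    "duration_minutes": 30,
    "exercises": [{"name": "Goblet squat", "sets": 3, "reps": 10}],
}

PROMPT_SUFFIX = (
    "Respond with JSON only, in exactly this shape:\n"
    + json.dumps(OUTPUT_EXAMPLE, indent=2)
)


def parse_response(raw: str) -> dict:
    """Parse a model response and check it against the example's shape."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key, sample in OUTPUT_EXAMPLE.items():
        if key not in data:
            raise ValueError(f"missing key: {key}")
        if not isinstance(data[key], type(sample)):
            raise ValueError(f"wrong type for {key}")
    return data


reply = '{"workout_name": "Leg day", "duration_minutes": 45, "exercises": []}'
print(parse_response(reply)["workout_name"])
```

Using one object for both the prompt and the check keeps the two from drifting apart when the format changes.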
The most common production prompt problem is regression: a change that fixes the failing case you were targeting quietly degrades cases that previously worked. I maintain an evals/ directory with test cases and run them on every prompt change as a CI step.
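The CI step could be sketched roughly as follows. The assertion types (`contains`, `max_length`), the case layout, and `fake_model` are assumptions for illustration; in a real run `call_model` would call the LLM with the prompt under test.

```python
from typing import Callable

PASS_THRESHOLD = 0.95  # the 95% merge gate


def check(output: str, assertion: dict) -> bool:
    """Evaluate one assertion; only two illustrative kinds are supported."""
    kind = assertion["type"]
    if kind == "contains":
        return assertion["value"].lower() in output.lower()
    if kind == "max_length":
        return len(output) <= assertion["value"]
    raise ValueError(f"unknown assertion type: {kind}")


def run_evals(
    cases: list[dict], call_model: Callable[[str], str]
) -> tuple[float, list[str]]:
    """Run every case and return (pass_rate, ids_of_failed_cases)."""
    if not cases:
        raise ValueError("no eval cases to run")
    failures = []
    for case in cases:
        output = call_model(case["input"])
        if not all(check(output, a) for a in case["assertions"]):
            failures.append(case["id"])
    return 1 - len(failures) / len(cases), failures


def fake_model(user_input: str) -> str:
    # Stand-in for the real LLM call so the sketch runs offline.
    return f"Plan for {user_input}: 3 sets of squats"


cases = [
    {"id": "test-01", "input": "beginner, legs",
     "assertions": [{"type": "contains", "value": "squats"}]},
    {"id": "test-02", "input": "beginner, arms",
     "assertions": [{"type": "contains", "value": "curls"}]},
]

pass_rate, failures = run_evals(cases, fake_model)
ok_to_merge = pass_rate >= PASS_THRESHOLD
print(f"pass rate {pass_rate:.0%}, failed: {failures}, merge allowed: {ok_to_merge}")
```

In CI, the script would exit non-zero when `ok_to_merge` is false so the pull request is blocked and the failed case ids show up in the job log.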
Once you have a test suite, you can use automatic prompt optimization: tools like DSPy (Stanford) that automatically generate and test prompt variations. I used DSPy to optimize my workout recommendation prompt and got a 22% improvement in eval score.
# DSPy: automatic prompt optimization
# Assumes a language model has already been configured via dspy.configure(lm=...)
import dspy

# Define your task as a DSPy signature
class WorkoutRecommendation(dspy.Signature):
    """Generate a personalized workout recommendation."""
    user_profile: str = dspy.InputField()
    goals: str = dspy.InputField()
    recommendation: str = dspy.OutputField()

# Build a DSPy program
class WorkoutRecommender(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought(WorkoutRecommendation)

    def forward(self, user_profile, goals):
        return self.generate(user_profile=user_profile, goals=goals)

# Load your eval set (50+ examples)
# load_evals() and exact_match_metric are your own helpers, defined elsewhere
trainset = [
    dspy.Example(
        user_profile="...",   # fill each field from `example`
        goals="...",
        recommendation="..."  # ground truth
    ).with_inputs("user_profile", "goals")
    for example in load_evals()
]

# Optimize: DSPy tries prompt variations automatically
teleprompter = dspy.BootstrapFewShot(metric=exact_match_metric)
optimized = teleprompter.compile(WorkoutRecommender(), trainset=trainset)

# Result: auto-optimized prompt with best few-shot examples
optimized.save("optimized_recommender.json")
Better prompts are often shorter prompts. I ran a prompt audit on my AI Gymbro system prompt and reduced it from 1,100 tokens to 620 tokens with no measurable quality degradation. That's 480 fewer tokens per request. At 10,000 requests/day, that saves 4.8M tokens/day.
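The arithmetic behind those savings, plus a cost estimate, can be worked through as below. The price per million input tokens is a placeholder assumption, not a quoted rate; substitute your provider's actual pricing.

```python
TOKENS_BEFORE = 1_100
TOKENS_AFTER = 620
REQUESTS_PER_DAY = 10_000

# Assumption: placeholder price, not a real quote from any provider.
PRICE_PER_MILLION_INPUT_TOKENS_USD = 1.00

saved_per_request = TOKENS_BEFORE - TOKENS_AFTER      # 480 tokens
saved_per_day = saved_per_request * REQUESTS_PER_DAY  # 4,800,000 tokens
daily_savings_usd = saved_per_day / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS_USD

print(f"{saved_per_day:,} tokens/day saved, about ${daily_savings_usd:.2f}/day")
```

The savings scale linearly with both request volume and price, so the same audit pays off more on higher-traffic prompts or pricier models.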
Setting temperature to 0 is the common advice for production prompts. But temperature 0 can cause quality degradation: it amplifies small prompt changes and produces overconfident outputs. I use temperature 0.1 as my default for production prompts — nearly deterministic, but with enough variance to avoid the deterministic failure modes.
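For intuition about what a temperature of 0.1 does, the sketch below applies standard temperature-scaled softmax to three made-up logits. It illustrates the general sampling formula only; how a given provider implements temperature internally is not something this snippet claims to reproduce.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to sampling probabilities at a given temperature (> 0)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (1.0, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 4) for p in probs])
```

At temperature 1.0 the top token gets a bit under two thirds of the probability mass; at 0.1 it gets nearly all of it, with only a sliver left for the alternatives.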
Prompts don't transfer perfectly between models. Claude responds better to explicit step-by-step thinking instructions. GPT-4o tends to follow format instructions more literally. Gemini models respond well to explicit output examples. When you migrate a prompt from one model to another, budget time for re-optimization.
Long-term, managing prompts at scale requires: version control (Git), testing (evals), CI/CD (automatic eval runs on changes), observability (log prompt versions alongside LLM calls), and a change review process. I use a 'prompt PR' process for all prompt changes.
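For the observability piece, a minimal sketch of logging the prompt version with each call might look like this. The logger name, field names, and `log_llm_call` helper are hypothetical; the point is only that every record carries the prompt name and version.

```python
import json
import logging
import time

logger = logging.getLogger("llm_calls")  # hypothetical logger name


def log_llm_call(prompt_name, prompt_version, model,
                 latency_ms, input_tokens, output_tokens):
    """Emit one structured log record per LLM call and return it."""
    record = {
        "ts": time.time(),
        "prompt_name": prompt_name,
        "prompt_version": prompt_version,  # ties the call to a reviewed prompt change
        "model": model,
        "latency_ms": latency_ms,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
    }
    logger.info(json.dumps(record))
    return record


logging.basicConfig(level=logging.INFO)
entry = log_llm_call(
    "workout-recommendation", "1.4.2", "claude-3-5-haiku-20241022",
    latency_ms=840, input_tokens=620, output_tokens=310,
)
```

With the version in every record, a quality dip in production can be matched to the prompt PR that introduced it.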