Prometheus collects metrics. AlertManager decides who gets paged and when. The combination, done properly, is the difference between knowing about an outage before your users do and finding out when a client phones you at 3 AM. At Commsult Indonesia, I maintain a Prometheus + AlertManager stack that monitors 11 servers across two cloud providers. We get actionable alerts, not noise, because of careful alert rule design and AlertManager routing configuration. This guide covers what actually works in production, including the mistakes I made configuring AlertManager that caused alert floods and missed incidents.
Prometheus fires alerts based on PromQL rules — when a condition evaluates to true for longer than the for duration, Prometheus sends the alert to AlertManager. AlertManager then handles deduplication, grouping, routing, silencing, and inhibition before sending notifications to receivers (Telegram, Slack, PagerDuty, email). The separation of concerns is clean: Prometheus knows what's wrong; AlertManager decides who to tell and how. This architecture means you can have multiple Prometheus instances sending alerts to one AlertManager cluster, and AlertManager coordinates across all of them.
The biggest mistake teams make with Prometheus alerting is writing too many alert rules too quickly. Every rule that fires becomes a notification, and too many notifications cause alert fatigue: engineers start ignoring alerts because they're not actionable. Write alert rules only for conditions that require human action. A good alert has a clear owner, a clear action, and a clear urgency level. 'CPU > 90% for 5 minutes' is actionable if you've established what to do when CPU spikes. 'Disk usage > 80%' is actionable because the action is clear (add storage or clean up files). 'up == 0' for a node is actionable because a server is down. An alert for 'HTTP 4xx rate > 5%' might be noise if your app has expected 404s.
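As a rough illustration of the "clear owner, clear action, clear urgency" idea, here is a minimal sketch of a Prometheus rule file. The file name, the `node` job name, the `action` annotation, and the use of node_exporter filesystem metrics are assumptions for illustration, not details of the setup described here.

```yaml
# rules/node-alerts.yml (hypothetical file name)
groups:
  - name: node-alerts
    rules:
      - alert: NodeDown
        expr: up{job="node"} == 0        # assumes a scrape job named "node"
        for: 1m
        labels:
          severity: critical             # urgency: page someone
        annotations:
          summary: "{{ $labels.instance }} is unreachable"
          action: "Check the host and its network path"
      - alert: DiskUsageHigh
        # Assumes node_exporter filesystem metrics are being scraped.
        expr: |
          (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
             / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 80
        for: 5m
        labels:
          severity: warning              # urgency: review, do not page
        annotations:
          summary: "Disk usage above 80% on {{ $labels.instance }} ({{ $labels.mountpoint }})"
          action: "Add storage or clean up files"
```

The `severity` label is what the AlertManager routing tree matches on, so it pays to decide on it when the rule is written rather than later.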
The AlertManager routing tree determines which alerts go to which receivers. The root route is a catch-all — every alert that doesn't match a more specific route ends up here. Child routes match on alert labels (severity, service, environment) and route to specific receivers. A production routing tree: critical severity alerts go to PagerDuty for immediate on-call paging; warning severity alerts go to a Telegram channel for review; informational alerts go to a low-priority email list. Environment-based routing: prod alerts go to the on-call Telegram group; staging and dev alerts go to a separate channel that fewer people monitor.
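The routing tree described above could look roughly like the following sketch. Receiver names, the SMTP host, addresses, and keys are all placeholders, and this is one possible layout rather than the exact production configuration.

```yaml
# Sketch only: every name, key, and address below is a placeholder.
global:
  smtp_smarthost: "smtp.example.com:587"
  smtp_from: "alertmanager@example.com"

route:
  receiver: lowpri-email               # root catch-all, including info alerts
  routes:
    # Non-prod is matched first so staging/dev criticals never page on-call.
    - matchers:
        - env=~"staging|dev"
      receiver: nonprod-telegram
    - matchers:
        - severity="critical"
      receiver: oncall-pagerduty
    - matchers:
        - severity="warning"
      receiver: prod-telegram

receivers:
  - name: oncall-pagerduty
    pagerduty_configs:
      - routing_key: "YOUR_PAGERDUTY_INTEGRATION_KEY"
  - name: prod-telegram
    telegram_configs:
      - bot_token: "YOUR_BOT_TOKEN"
        chat_id: -1001234567890
  - name: nonprod-telegram
    telegram_configs:
      - bot_token: "YOUR_BOT_TOKEN"
        chat_id: -1009876543210
  - name: lowpri-email
    email_configs:
      - to: "alerts-lowpri@example.com"
```

Child routes are evaluated in order and, without `continue: true`, the first match wins, so the order of the `routes` list is part of the design.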
From my experience: use AlertManager's grouping aggressively to prevent notification floods. When a network switch fails and 20 servers become unreachable, you want one alert ('network failure affecting 20 servers') not 20 individual 'server down' alerts. Configure group_by: [alertname, env, cluster] and group_wait: 30s (wait 30 seconds for more alerts in the same group before sending) and group_interval: 5m (send accumulated new alerts every 5 minutes after the initial notification). This grouping has prevented several alert storms from becoming notification floods for our on-call rotation.
Inhibition rules suppress alerts when a higher-level alert is already firing. The classic use case: if a 'cluster is unreachable' alert fires, inhibit all 'service is down' alerts for services in that cluster — because the root cause is the cluster being unreachable, not individual service failures. Writing inhibition rules requires careful matching to avoid over-inhibition. Use source_matchers (the alert that does the inhibiting) and target_matchers (the alerts being inhibited) with label matching to ensure only the right alerts are suppressed. Over-inhibition can cause you to miss real independent failures.
# alertmanager.yml: production routing configuration
global:
  resolve_timeout: 5m

route:
  group_by: [alertname, env, cluster]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: default-telegram          # root catch-all
  routes:
    - matchers:
        - severity = critical
      receiver: oncall-telegram
      continue: false                 # stop at the first match (the default)
    - matchers:
        - severity = warning
        - env = prod
      receiver: prod-warnings-telegram

inhibit_rules:
  - source_matchers:
      - alertname = ClusterUnreachable
    target_matchers:
      - alertname = ServiceDown
    equal: [cluster, env]             # inhibit only within the same cluster and env

receivers:
  # default-telegram and prod-warnings-telegram must also be defined here,
  # in the same shape as below, or AlertManager will reject the config.
  - name: oncall-telegram
    telegram_configs:
      - bot_token: "YOUR_BOT_TOKEN"
        chat_id: -1001234567890
        message: |
          FIRING: {{ .GroupLabels.alertname }}
          Severity: {{ .CommonLabels.severity }}
          {{ range .Alerts }}{{ .Annotations.summary }}
          {{ end }}

Telegram is the most practical alerting channel for Indonesian DevOps teams: everyone has it installed, it works reliably in Indonesia, and AlertManager has native Telegram support. Create a Telegram bot via @BotFather, create a group chat for alerts, add the bot, and get the chat ID. In AlertManager's receiver configuration, use the telegram_configs block with the bot token and chat_id. Format alert messages with the message template to include the alert name, severity, description, and a link to the relevant Grafana dashboard. A well-formatted Telegram alert tells the on-call engineer exactly what's wrong and where to look within 30 seconds of opening the notification.
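A message template along the lines described above might look like this sketch. It assumes each alert rule sets `description` and `dashboard` annotations (the `dashboard` annotation name is hypothetical) and that the receiver uses HTML parse mode.

```yaml
receivers:
  - name: oncall-telegram
    telegram_configs:
      - bot_token: "YOUR_BOT_TOKEN"
        chat_id: -1001234567890
        parse_mode: HTML
        message: |
          <b>{{ .Status | toUpper }}</b>: {{ .GroupLabels.alertname }}
          Severity: {{ .CommonLabels.severity }}
          {{ range .Alerts }}
          {{ .Annotations.description }}
          <a href="{{ .Annotations.dashboard }}">Grafana dashboard</a>
          {{ end }}
```

Putting the dashboard URL in a rule annotation keeps the link next to the PromQL expression it relates to, so the two are less likely to drift apart.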
┌──────────────────────────────────────────────────────┐
│            Prometheus + AlertManager Flow            │
├──────────────────────────────────────────────────────┤
│                                                      │
│  Prometheus (evaluates PromQL rules every 1m)        │
│    ↓ Alert fires (condition true for "for" period)   │
│  AlertManager Cluster (HA: 2 nodes, gossip protocol) │
│    ↓ Deduplicate + Group (group_wait: 30s)           │
│    ↓ Apply inhibition rules                          │
│    ↓ Route by severity/env labels                    │
│                                                      │
│  critical → Telegram on-call group (immediate)       │
│  warning  → Telegram prod-warnings channel           │
│  info     → Email digest (low priority)              │
└──────────────────────────────────────────────────────┘

Early in my Prometheus setup, I wrote alert rules with no for duration, which means alerts fire immediately when the condition is true, even for momentary metric spikes. The result was a flood of firing-and-resolving alerts during normal traffic variations. A CPU spike to 95% for 3 seconds fires and resolves before anyone can even look at it, but it generates two Telegram messages (FIRING and RESOLVED) and contributes to alert fatigue. Always set a for duration appropriate to the alert's nature: for: 5m for resource saturation alerts, for: 1m for service down alerts, for: 15m for trend-based alerts. The for duration acts as a debounce: it requires the condition to be continuously true for that period before firing.
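To make the debounce concrete, here is a small sketch of two rules with different for durations. The metric names assume node_exporter, and the thresholds and the 6h/24h prediction window are illustrative values, not recommendations from this setup.

```yaml
groups:
  - name: for-duration-examples
    rules:
      - alert: HighCpuUsage
        # Saturation: a 3-second spike never satisfies a 5-minute "for".
        expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU above 90% on {{ $labels.instance }} for 5 minutes"
      - alert: DiskFillingUp
        # Trend: the 6h linear fit predicts no free space within 24 hours.
        expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 24 * 3600) < 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.mountpoint }} on {{ $labels.instance }} is predicted to fill within 24h"
```

While the condition is true but the for period has not elapsed, the alert sits in the pending state in Prometheus and nothing is sent to AlertManager.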
A single AlertManager instance is a single point of failure for your alerting infrastructure. If it goes down, Prometheus keeps evaluating rules and alerts keep firing, but nothing delivers them: you receive no notifications until AlertManager is back. Run AlertManager in an HA cluster of at least two instances, using the --cluster.peer flag to connect them. AlertManager uses a gossip protocol to share notification state across instances, which prevents duplicate notifications when Prometheus sends the same alert to several of them. With two instances, Prometheus sends each alert to both, and the instances coordinate via gossip so that, in normal operation, only one sends the notification.
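On the Prometheus side, an HA pair means listing every AlertManager instance as a target rather than putting them behind a load balancer. A minimal sketch, with placeholder hostnames:

```yaml
# prometheus.yml (fragment); hostnames are placeholders.
# Each AlertManager instance is started with a flag pointing at its peer, e.g.
#   alertmanager --cluster.peer=alertmanager-2.example.internal:9094
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "alertmanager-1.example.internal:9093"
            - "alertmanager-2.example.internal:9093"
```

Listing both targets is what lets each instance see every alert, and that is the precondition for the gossip-based deduplication to work.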
Planned maintenance generates expected alerts — a server reboot causes 'server down' alerts; a database migration causes slow query alerts. AlertManager's silence feature lets you suppress specific alerts for a defined time window. Create silences via the AlertManager web UI or API before maintenance begins. Silences match on label matchers and have an expiry time. A discipline I enforce: every planned maintenance window must have a corresponding AlertManager silence created before it begins. This prevents unnecessary page notifications during expected disruptions and keeps the on-call team's trust that alerts are real issues, not maintenance noise.
After two years of running Prometheus + AlertManager at Commsult Indonesia, good alerting means: receiving fewer than 5 alert notifications per week during normal operations, each notification being actionable within 10 minutes of receipt, and zero 'production is down and nobody was paged' incidents. Getting there required writing alert rules that fire on symptoms rather than causes (alert on 'HTTP 5xx rate > 1%' not 'CPU > 80%'), aggressive grouping and inhibition configuration, and quarterly alert review sessions where we deliberately delete alerts that fired but didn't require action. Alert debt — accumulated rules that nobody responds to — is as damaging to operations as technical debt is to code quality.
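A symptom-based rule of the kind mentioned above could be sketched as follows. The metric name `http_requests_total` and its `status` label are assumptions; the actual names depend on how the application is instrumented.

```yaml
groups:
  - name: symptom-alerts
    rules:
      - alert: HighHttp5xxRate
        # Assumed metric and label names; adjust to your instrumentation.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "HTTP 5xx responses exceed 1% of requests over the last 5 minutes"
```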