What is AlertManager's role in the Prometheus stack, and why is it separate from Prometheus itself?

Prometheus is responsible for collecting metrics and evaluating alert rules — it fires an alert when a PromQL condition stays true longer than the configured 'for' duration. AlertManager then takes over to handle deduplication, grouping, routing, silencing, and inhibition before sending notifications to receivers like Telegram, Slack, or PagerDuty. The separation keeps concerns clean: Prometheus knows what is wrong, while AlertManager decides who to tell and how. This architecture also allows multiple Prometheus instances to send alerts to a single AlertManager cluster, which coordinates across all of them.

How should alert grouping be configured to prevent notification floods?

AlertManager's grouping feature batches related alerts into a single notification rather than sending one message per alert. Configuring 'group_by: [alertname, env, cluster]' with 'group_wait: 30s' and 'group_interval: 5m' means AlertManager waits 30 seconds for additional alerts in the same group before sending the first notification, then waits at least 5 minutes before notifying about alerts newly added to that group. For example, when a network switch failure makes 20 servers unreachable, this produces one notification listing all 20 affected servers instead of 20 individual 'server down' messages. This approach has prevented several alert storms from becoming notification floods for the on-call rotation.

Why should alert rules always include a 'for' duration, and what happens if you omit it?

Without a 'for' duration, Prometheus fires an alert the instant a condition becomes true — including brief, transient spikes. A CPU jump to 95% for 3 seconds triggers both a FIRING and a RESOLVED Telegram message before anyone can act on it, creating noise that trains engineers to ignore alerts. The 'for' duration acts as a debounce, requiring the condition to be continuously true for the entire period before an alert fires. Good starting points are 'for: 5m' for resource saturation, 'for: 1m' for service-down alerts, and 'for: 15m' for trend-based alerts.

What are inhibition rules, and when should they be used?

Inhibition rules suppress lower-level alerts when a higher-level alert is already firing, preventing cascading noise from a single root cause. The classic example is suppressing all 'service is down' alerts for services in a cluster when a 'cluster is unreachable' alert is active — the services appear down because the cluster is unreachable, not due to independent failures. Inhibition rules use 'source_matchers' and 'target_matchers' with label matching to ensure only the right alerts are suppressed. Over-inhibition is a real risk, so matching should be precise enough to avoid hiding genuine independent failures.

What does good alerting look like after tuning, and how is it measured?

After two years running Prometheus and AlertManager at Commsult Indonesia, a well-tuned setup receives fewer than 5 alert notifications per week during normal operations, each notification is actionable within 10 minutes, and there are zero 'production is down and nobody was paged' incidents. Reaching this standard required writing alert rules that fire on symptoms rather than causes — for example, alerting on 'HTTP 5xx rate > 1%' instead of 'CPU > 80%' — combined with aggressive grouping, inhibition, and quarterly review sessions that deliberately remove alerts which fired but required no action.

Prometheus and AlertManager: Building an On-Call Alert System That Works

Prometheus collects metrics. AlertManager decides who gets paged and when. The combination, done properly, is the difference between knowing about an outage before your users do and finding out when your phone rings at 3 AM from a client. At Commsult Indonesia, I maintain a Prometheus + AlertManager stack that monitors 11 servers across two cloud providers. We get actionable alerts — not noise — because of careful alert rule design and AlertManager routing configuration. This guide covers what actually works in production, including the mistakes I made configuring AlertManager that caused alert floods and missed incidents.

AlertManager's Role in the Prometheus Stack

Prometheus fires alerts based on PromQL rules: an alert becomes pending when its condition first evaluates to true and fires once the condition has held for the full 'for' duration, at which point Prometheus sends it to AlertManager. AlertManager then handles deduplication, grouping, routing, silencing, and inhibition before sending notifications to receivers (Telegram, Slack, PagerDuty, email). The separation of concerns is clean: Prometheus knows what's wrong; AlertManager decides who to tell and how. This architecture means you can have multiple Prometheus instances sending alerts to one AlertManager cluster, and AlertManager coordinates across all of them.

Alert Rule Design: Quality Over Quantity

The biggest mistake teams make with Prometheus alerting is writing too many alert rules too quickly. Every rule that fires becomes a notification, and too many notifications cause alert fatigue: engineers start ignoring alerts because they're not actionable. Write alert rules only for conditions that require human action. A good alert has a clear owner, a clear action, and a clear urgency level. 'CPU > 90% for 5 minutes' is actionable if you've established what to do when CPU spikes. 'Disk usage > 80%' is actionable because the action is clear (add storage or clean up files). An alert on 'up == 0' is actionable because a scrape target, and usually the server behind it, is down. An alert for 'HTTP 4xx rate > 5%' might be noise if your app has expected 404s.
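As a rough illustration of a rule with a clear owner, action, and urgency, the disk-usage condition above might be written as follows. This is only a sketch, assuming node_exporter filesystem metrics are being scraped; the rule file name, the team label, the mountpoint filter, and the runbook URL are hypothetical placeholders, not part of the setup described here.

```yaml
# rules/disk.yml: hypothetical rule file, loaded via rule_files in prometheus.yml
groups:
  - name: disk
    rules:
      - alert: DiskUsageHigh
        # Used fraction of the root filesystem (assumes node_exporter metrics).
        expr: |
          (1 - node_filesystem_avail_bytes{mountpoint="/"}
             / node_filesystem_size_bytes{mountpoint="/"}) > 0.80
        for: 5m
        labels:
          severity: warning   # urgency: the label AlertManager routes on
          team: platform      # owner: hypothetical team label
        annotations:
          summary: "Disk usage above 80% on {{ $labels.instance }}"
          # action: hypothetical runbook location
          runbook_url: "https://example.com/runbooks/disk-usage"
```

If a rule cannot be given an owner label, a severity, and a runbook, that is usually a sign it should not page anyone.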

Routing Tree Configuration

The AlertManager routing tree determines which alerts go to which receivers. The root route is a catch-all: every alert that doesn't match a more specific route ends up here. Child routes match on alert labels (severity, service, environment) and route to specific receivers. In a typical production routing tree, critical severity alerts go to PagerDuty for immediate on-call paging (the example configuration later in this article uses a dedicated on-call Telegram group for this tier instead), warning severity alerts go to a Telegram channel for review, and informational alerts go to a low-priority email list. Routing can also be environment-based: prod alerts go to the on-call Telegram group, while staging and dev alerts go to a separate channel that fewer people monitor.
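The tree described above could be sketched roughly like this. It is a fragment, not a complete configuration: the receiver names are hypothetical, and each one would still need a matching entry under receivers before AlertManager would accept the file.

```yaml
route:
  receiver: lowprio-email            # root catch-all (hypothetical receiver name)
  routes:
    # Non-production alerts are diverted first, whatever their severity.
    - matchers:
        - env =~ "staging|dev"
      receiver: nonprod-telegram
    # Production alerts are then split by severity.
    - matchers:
        - severity = "critical"
      receiver: oncall-pagerduty
    - matchers:
        - severity = "warning"
      receiver: prod-warnings-telegram
    # Anything else (for example severity = "info") falls through to the root receiver.
```

Child routes are evaluated in order and the first match wins unless continue: true is set, which is why the non-production route is listed before the severity routes.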

From my experience: use AlertManager's grouping aggressively to prevent notification floods. When a network switch fails and 20 servers become unreachable, you want one notification that lists all 20 affected servers, not 20 individual 'server down' messages. Configure group_by: [alertname, env, cluster] so that alerts sharing those label values are batched together, group_wait: 30s so that AlertManager waits 30 seconds for more alerts in the same group before sending the first notification, and group_interval: 5m so that alerts added to the group afterwards are sent at most every 5 minutes. This grouping has prevented several alert storms from becoming notification floods for our on-call rotation.

Inhibition Rules: Preventing Cascading Alert Noise

Inhibition rules suppress alerts when a higher-level alert is already firing. The classic use case: if a 'cluster is unreachable' alert fires, inhibit all 'service is down' alerts for services in that cluster, because the root cause is the cluster being unreachable, not individual service failures. Writing inhibition rules requires careful matching to avoid over-inhibition. Use source_matchers (the alert that does the inhibiting) and target_matchers (the alerts being inhibited), plus the equal list, which restricts inhibition to source and target alerts that carry the same values for the listed labels (cluster and env in the configuration that follows). Without that restriction, an unreachable cluster would silence 'service is down' alerts everywhere, and over-inhibition can cause you to miss real independent failures.

# alertmanager.yml — Production routing configuration
global:
  resolve_timeout: 5m

route:
  group_by: [alertname, env, cluster]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: default-telegram

  routes:
    - matchers:
        - severity = critical
      receiver: oncall-telegram
      continue: false
    - matchers:
        - severity = warning
        - env = prod
      receiver: prod-warnings-telegram

inhibit_rules:
  - source_matchers:
      - alertname = ClusterUnreachable
    target_matchers:
      - alertname = ServiceDown
    equal: [cluster, env]

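receivers:
  # Every receiver named in the routing tree must be defined here,
  # otherwise AlertManager refuses to load the configuration.
  - name: default-telegram
    telegram_configs:
      - bot_token: "YOUR_BOT_TOKEN"
        chat_id: -1000000000001  # placeholder: catch-all chat
  - name: oncall-telegram
    telegram_configs:
      - bot_token: "YOUR_BOT_TOKEN"
        chat_id: -1001234567890
        # .Status is "firing" or "resolved", so resolved notifications
        # are no longer labelled FIRING.
        message: |
          {{ .Status | toUpper }}: {{ .GroupLabels.alertname }}
          Severity: {{ .CommonLabels.severity }}
          {{ range .Alerts }}- {{ .Annotations.summary }}
          {{ end }}
  - name: prod-warnings-telegram
    telegram_configs:
      - bot_token: "YOUR_BOT_TOKEN"
        chat_id: -1000000000002  # placeholder: prod warnings chat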

Telegram Integration for Indonesian Teams

Telegram is the most practical alerting channel for Indonesian DevOps teams: everyone has it installed, it works reliably in Indonesia, and AlertManager has native Telegram support (the telegram_configs receiver, available since AlertManager v0.24). Create a Telegram bot via @BotFather, create a group chat for alerts, add the bot to the group, and note the group's chat ID (a negative number for groups). In AlertManager's receiver configuration, use the telegram_configs block with the bot token and chat_id. Format alert messages with the message template to include the alert name, severity, description, and a link to the relevant Grafana dashboard. A well-formatted Telegram alert tells the on-call engineer exactly what's wrong and where to look within 30 seconds of opening the notification.
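A message template along these lines could carry the fields listed above. Treat it as a sketch: the description and dashboard_url annotations are assumptions that would have to be set on each alert rule (dashboard_url is a hypothetical name, not a built-in), and the bot token and chat ID are placeholders.

```yaml
receivers:
  - name: oncall-telegram
    telegram_configs:
      - bot_token: "YOUR_BOT_TOKEN"
        chat_id: -1001234567890
        send_resolved: true   # also notify when the alert clears
        message: |
          {{ .Status | toUpper }}: {{ .CommonLabels.alertname }} ({{ .CommonLabels.severity }})
          {{ range .Alerts }}
          {{ .Annotations.summary }}
          {{ .Annotations.description }}
          Dashboard: {{ .Annotations.dashboard_url }}
          {{ end }}
```

The dashboard link is left as a bare URL so that Telegram can auto-link it; if annotations may contain characters such as '<', it is worth checking how the configured parse_mode treats them before relying on this in production.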

┌──────────────────────────────────────────────────────┐
│         Prometheus + AlertManager Flow                │
├──────────────────────────────────────────────────────┤
│                                                      │
│  Prometheus (evaluates PromQL rules every 1m)        │
│       ↓ Alert fires (condition true for "for" period)│
│  AlertManager Cluster (HA: 2 nodes, gossip protocol) │
│       ↓ Deduplicate + Group (group_wait: 30s)        │
│       ↓ Apply inhibition rules                       │
│       ↓ Route by severity/env labels                 │
│                                                      │
│  critical → Telegram on-call group (immediate)       │
│  warning  → Telegram prod-warnings channel           │
│  info      → Email digest (low priority)             │
└──────────────────────────────────────────────────────┘

Early in my Prometheus setup, I wrote alert rules with no 'for' duration, which means an alert fires as soon as a rule evaluation finds the condition true, even for momentary metric spikes. The result was a flood of firing-and-resolving alerts during normal traffic variations. A CPU spike to 95% for 3 seconds fires and resolves before anyone can even look at it, but it generates two Telegram messages (FIRING and RESOLVED) and contributes to alert fatigue. Always set a 'for' duration appropriate to the alert's nature: 'for: 5m' for resource saturation alerts, 'for: 1m' for service-down alerts, 'for: 15m' for trend-based alerts. The 'for' duration acts as a debounce: the condition must be true at every evaluation throughout that period before the alert fires.
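The three tiers above might look like this in a rule file. This is a minimal sketch, assuming node_exporter metrics; the alert names, thresholds, and the 24-hour prediction window are illustrative choices rather than values from the setup described here.

```yaml
groups:
  - name: for-duration-examples
    rules:
      # Resource saturation: tolerate short spikes.
      - alert: HighCpuSaturation
        expr: '1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.95'
        for: 5m
        labels:
          severity: warning
      # Service down: react quickly, but ride out a single failed scrape.
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
      # Trend-based: only fire if the projection stays bad for a while.
      - alert: DiskWillFillIn24h
        expr: 'predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24 * 3600) < 0'
        for: 15m
        labels:
          severity: warning
```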

High-Availability AlertManager Cluster

A single AlertManager instance is a single point of failure for your alerting infrastructure. If AlertManager goes down, Prometheus keeps evaluating rules and keeps trying to send firing alerts, but nothing reaches your receivers, so you receive no notifications during the outage. Run AlertManager in an HA cluster of at least two instances using the --cluster.peer flag to connect them. AlertManager uses a gossip protocol to share notification state and silences across instances, which prevents duplicate notifications when Prometheus sends an alert to multiple AlertManager instances. With two AlertManager instances, configure Prometheus to send each alert to both directly rather than through a load balancer; the instances coordinate via gossip to ensure only one sends the notification.
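On the Prometheus side, a two-instance setup could be wired up roughly as follows. The hostnames are hypothetical, and the ports shown are the AlertManager defaults (9093 for the API, 9094 for cluster gossip).

```yaml
# prometheus.yml (excerpt)
alerting:
  alertmanagers:
    - static_configs:
        # List every instance directly; each one must receive every alert.
        - targets:
            - alertmanager-1.example.internal:9093
            - alertmanager-2.example.internal:9093

# Each AlertManager instance is started with the other as a peer, for example
# on alertmanager-1:
#   alertmanager --config.file=alertmanager.yml \
#     --cluster.listen-address=0.0.0.0:9094 \
#     --cluster.peer=alertmanager-2.example.internal:9094
```

Both instances should load the same alertmanager.yml, since the gossip layer shares notification state and silences but not configuration.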

Silence Management for Planned Maintenance

Planned maintenance generates expected alerts — a server reboot causes 'server down' alerts; a database migration causes slow query alerts. AlertManager's silence feature lets you suppress specific alerts for a defined time window. Create silences via the AlertManager web UI or API before maintenance begins. Silences match on label matchers and have an expiry time. A discipline I enforce: every planned maintenance window must have a corresponding AlertManager silence created before it begins. This prevents unnecessary page notifications during expected disruptions and keeps the on-call team's trust that alerts are real issues, not maintenance noise.

My Take: What Good Alerting Looks Like

After two years of running Prometheus + AlertManager at Commsult Indonesia, good alerting means: receiving fewer than 5 alert notifications per week during normal operations, each notification being actionable within 10 minutes of receipt, and zero 'production is down and nobody was paged' incidents. Getting there required writing alert rules that fire on symptoms rather than causes (alert on 'HTTP 5xx rate > 1%' not 'CPU > 80%'), aggressive grouping and inhibition configuration, and quarterly alert review sessions where we deliberately delete alerts that fired but didn't require action. Alert debt — accumulated rules that nobody responds to — is as damaging to operations as technical debt is to code quality.
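For reference, the symptom-based 'HTTP 5xx rate > 1%' alert mentioned above could be sketched like this. The metric name http_requests_total and its status label are assumptions about how the application is instrumented, and the 'for' duration and severity are illustrative.

```yaml
groups:
  - name: symptoms
    rules:
      - alert: HighHttp5xxRate
        # Share of requests answered with a 5xx status over the last 5 minutes.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "More than 1% of HTTP requests are failing with 5xx"
```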
