As a solo DevOps engineer managing production workloads for Commsult Indonesia, I cannot afford a dedicated SRE team or enterprise APM tooling. What I can afford is a self-hosted monitoring stack that costs under /month to run and tells me about problems before clients notice. After trying several combinations, I settled on Prometheus + Grafana for infrastructure metrics, Uptime Kuma for external availability monitoring, and Telegram for alerts. This combination covers inside (server health) and outside (is the site reachable?) with a setup time of under two hours.
Effective monitoring requires two perspectives. Inside monitoring (Prometheus + Grafana) tells you what is happening on your server: CPU spikes, memory pressure, disk saturation, query latency, and application-level metrics. Outside monitoring (Uptime Kuma) tells you what users experience: is the site loading, does the API respond, is the TLS certificate valid? You need both because inside metrics can look healthy while the site is down (firewall block, DNS issue) and the site can be up while inside metrics show a slow memory leak building toward a crash.
Run Prometheus, Node Exporter, and Grafana as Docker containers on a dedicated lightweight VPS (a /month DigitalOcean Basic Droplet handles monitoring for 5-10 hosts), and run Node Exporter on every additional host you want Prometheus to scrape. Prometheus scrapes Node Exporter every 15 seconds for 1000+ host metrics. Import Grafana dashboard ID 1860 (Node Exporter Full) as soon as Prometheus is connected as a data source: it gives you 30+ panels covering CPU, memory, disk, and network without building anything. Then set up alert rules for the four critical thresholds: CPU >85% for 5 minutes, available memory <10%, disk usage >90%, and load average >4 (scale the load threshold to the host's core count, since a load of 4 is heavy on 2 cores and light on 16).
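As a rough sketch of the 15-second scrape and the four thresholds above, this is what the two Prometheus files could look like. The rule file name alerts.yml, its path, the severity label, and every for: duration except the CPU one are my assumptions for illustration, and the rule file would need its own volume mount next to prometheus.yml. I have written the thresholds as Prometheus alerting rules; if you alert from Grafana instead, the same expressions can serve as the queries behind Grafana alert rules.

```yaml
# prometheus.yml (sketch)
global:
  scrape_interval: 15s       # the 15-second scrape described above
  evaluation_interval: 15s   # how often the alert rules are evaluated

rule_files:
  - /etc/prometheus/alerts.yml   # assumed path; mount this file into the container

scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - node-exporter:9100   # compose service name, default Node Exporter port
          # - 10.0.0.12:9100     # hypothetical additional host
```

```yaml
# alerts.yml (sketch; "for" values other than the CPU rule are assumptions)
groups:
  - name: host-critical
    rules:
      - alert: HighCpu
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "CPU above 85% for 5 minutes on {{ $labels.instance }}"

      - alert: LowMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Less than 10% memory available on {{ $labels.instance }}"

      - alert: DiskAlmostFull
        expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 90
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Filesystem {{ $labels.mountpoint }} above 90% on {{ $labels.instance }}"

      - alert: HighLoad
        expr: node_load5 > 4
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "5-minute load average above 4 on {{ $labels.instance }}"
```

The for: clause is what keeps a short spike from paging you: the condition has to hold for the whole window before the alert fires.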
Uptime Kuma is a self-hosted monitoring tool with a clean UI and built-in Telegram/Slack/Discord notifications. Deploy it via Docker (docker run -d --restart always -p 3001:3001 -v uptime-kuma:/app/data louislam/uptime-kuma:1) and set up monitors for every public endpoint: HTTP(S) checks for your web apps and APIs, DNS checks for your domains, certificate expiry checks for TLS certs (alert at 14 days remaining), and TCP port checks for databases on private networks. Uptime Kuma’s status pages let you share a public availability dashboard with clients without exposing your Grafana internals.
┌─────────────────────────────────────────────────────┐
│            SOLO DEVOPS MONITORING STACK             │
└─────────────────────────────────────────────────────┘

  INSIDE VIEW                     OUTSIDE VIEW
───────────────                 ─────────────────
 Node Exporter                     Uptime Kuma
       │                                │
       ▼                                ▼
  Prometheus ─────────────────►    HTTP checks
       │                           TLS expiry
       ▼                           DNS checks
    Grafana                             │
(Dashboard 1860)                        │
       │                                │
       └────────────────────────────────┘
                        │
                        ▼
                  Telegram Bot
                 (Alert Channel)

From my experience managing solo DevOps for Commsult Indonesia: set up a dedicated Telegram bot and channel for production alerts, separate from the main work chat. Production alerts in the main chat get ignored; a dedicated alerts channel with a distinct notification sound trains you to respond immediately. Use Grafana contact points and notification policies to route critical alerts (disk >90%, site down) to the alerts channel and informational alerts (high CPU for <5 minutes) to a separate noise channel you can mute at night.
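The two-channel routing described above could be provisioned from a file rather than clicked together in the UI. This is a minimal sketch, assuming Grafana's file-based alerting provisioning and a severity label on your alert rules. The file path, the contact point names, the uid values, and the three environment variable names are all hypothetical, and the field names are written as I understand the provisioning format, so check them against the documentation for your Grafana version before relying on this.

```yaml
# provisioning/alerting/telegram.yml (assumed location inside Grafana's provisioning directory)
apiVersion: 1

contactPoints:
  - orgId: 1
    name: telegram-alerts                     # dedicated production alerts channel
    receivers:
      - uid: telegram-alerts
        type: telegram
        settings:
          bottoken: $TELEGRAM_BOT_TOKEN       # hypothetical env var, never a literal token
          chatid: "$TELEGRAM_ALERTS_CHAT_ID"  # hypothetical env var
  - orgId: 1
    name: telegram-noise                      # informational channel you can mute at night
    receivers:
      - uid: telegram-noise
        type: telegram
        settings:
          bottoken: $TELEGRAM_BOT_TOKEN
          chatid: "$TELEGRAM_NOISE_CHAT_ID"   # hypothetical env var

policies:
  - orgId: 1
    receiver: telegram-noise                  # default route for everything not matched below
    group_by: ['alertname', 'instance']
    routes:
      - receiver: telegram-alerts             # critical alerts go to the dedicated channel
        object_matchers:
          - ['severity', '=', 'critical']
```

Keeping the noise channel as the default route means a rule that is missing its severity label ends up somewhere quiet instead of waking you up.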
Alert fatigue kills monitoring. If your alerts fire constantly for non-urgent conditions, you start ignoring them — including the ones that matter. Design your alerts with three principles: actionability (every alert should have a clear response action), urgency separation (P1 alerts wake you up at 3am, P3 alerts are checked in the morning), and suppression (avoid duplicate alerts for the same root cause). For a solo operator, I recommend: P1 (site down, disk >95%, OOM imminent) triggers immediate Telegram notification, P2 (disk >80%, high error rate) triggers Telegram at business hours, P3 (informational trends) appears in Grafana only.
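To make the three principles concrete, here is a sketch of the disk thresholds written as Prometheus rules. It assumes Node Exporter metrics and a root filesystem mounted at /. The recording rule name, the priority label, the action annotation, and the for: durations are my own illustrative choices rather than a standard.

```yaml
# disk-priorities.yml (sketch; names and durations are assumptions)
groups:
  - name: disk-priorities
    rules:
      # Compute usage once so both alerts share one definition.
      - record: instance:root_disk_used:percent
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100

      - alert: DiskCritical
        expr: instance:root_disk_used:percent > 95
        for: 5m
        labels:
          priority: P1    # urgency separation: notify immediately
        annotations:
          summary: "Root disk above 95% on {{ $labels.instance }}"
          action: "Free space now: prune Docker images, rotate logs, or grow the volume"

      - alert: DiskWarning
        # Suppression: stop firing once DiskCritical covers the same condition.
        expr: instance:root_disk_used:percent > 80 and instance:root_disk_used:percent <= 95
        for: 30m
        labels:
          priority: P2    # deliver during business hours
        annotations:
          summary: "Root disk above 80% on {{ $labels.instance }}"
          action: "Schedule a cleanup or a volume resize this week"
```

The action annotation is the actionability test in practice: if I cannot fill it in, the rule does not get created.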
Node Exporter covers host metrics. For application-level observability, add the prom-client npm package to your NestJS app and expose a /metrics endpoint. Track: HTTP request duration (histogram), request count by route and status code, database query duration, cache hit/miss ratio, and any business-critical metrics (e.g., orders processed per minute for an ERP). These application metrics catch problems that host metrics miss: a memory leak in your NestJS heap, slow database queries, or a broken payment integration.
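Once the app exposes /metrics, Prometheus needs a scrape job for it, and the request-duration histogram can drive a latency alert. This is a sketch under assumptions: the job name, the target address, the metric name http_request_duration_seconds, the route label, and the 1-second threshold are all hypothetical and depend on how you define the metrics in your app.

```yaml
# prometheus.yml: extra scrape job for application metrics (sketch)
scrape_configs:
  - job_name: nestjs-app           # hypothetical job name
    metrics_path: /metrics         # endpoint exposed through prom-client
    scrape_interval: 15s
    static_configs:
      - targets:
          - app.internal:3000      # hypothetical host:port of the NestJS app
```

```yaml
# app-alerts.yml (sketch; metric and label names are assumptions)
groups:
  - name: app-latency
    rules:
      - alert: SlowRequests
        # p95 request latency per route over the last 5 minutes, in seconds
        expr: histogram_quantile(0.95, sum by (le, route) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p95 latency above 1s on route {{ $labels.route }}"
```

Alerting on a percentile rather than the average is the reason to record durations as a histogram: one slow route shows up even when most requests are fast.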
# docker-compose.yml: Prometheus + Grafana + Node Exporter + Uptime Kuma
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

  node-exporter:
    image: prom/node-exporter:latest
    pid: host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    # Point the exporter at the host mounts above; without these flags it
    # reads the container's own /proc, /sys, and root filesystem.
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana

  uptime-kuma:
    image: louislam/uptime-kuma:1
    ports:
      - "3001:3001"
    volumes:
      - uptime-kuma:/app/data

volumes:
  prometheus_data:
  grafana_data:
  uptime-kuma:

Full log aggregation (the ELK stack, or even Loki) adds resource overhead that a small monitoring host may not have room for. For a solo operator, a pragmatic approach is: use journalctl -u <service> -f for real-time log tailing, ship critical application errors to Sentry (the free tier is generous, with excellent error grouping and stack traces), and add Loki only if you need to search historical logs across multiple servers. Loki + Promtail + Grafana is the lightweight log stack of the Prometheus ecosystem: much simpler than ELK, and it plugs into your existing Grafana setup.
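If you do reach the point where Loki is worth it, the Promtail side can stay small. This is a minimal sketch that ships the systemd journal to Loki, assuming a Loki instance reachable at http://loki:3100 (a hypothetical address) and a Promtail build with journal support. The ports, positions file path, and label names are illustrative, and it is worth checking the current Grafana documentation for the recommended log agent before adopting this.

```yaml
# promtail-config.yml (sketch; addresses and paths are assumptions)
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml      # where Promtail remembers how far it has read

clients:
  - url: http://loki:3100/loki/api/v1/push   # hypothetical Loki address

scrape_configs:
  - job_name: journal
    journal:
      max_age: 12h                   # ignore entries older than this on first start
      path: /var/log/journal
      labels:
        job: systemd-journal
    relabel_configs:
      # Expose the systemd unit as a label so logs can be filtered per service
      - source_labels: ['__journal__systemd_unit']
        target_label: unit
```

With the unit label in place, a Grafana query such as {unit="myapp.service"} gives the cross-server equivalent of journalctl -u, where myapp.service stands in for your own unit name.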
Early in my monitoring journey at Commsult Indonesia I added 40+ Grafana panels and 20+ alert rules to feel comprehensive. The result was dashboard paralysis and alert fatigue — I stopped checking Grafana because there was too much noise to extract signal. I now follow a rule: if I cannot describe the exact action I would take when this alert fires, I do not create the alert. Start with 5-6 critical alerts, learn from them for a month, then add more. Less is more in solo monitoring.
My current stack for Commsult Indonesia: Prometheus scraping Node Exporter on 4 servers (2 DigitalOcean Droplets, 2 GCP instances), Grafana with 2 dashboards (infrastructure overview and per-service application metrics), Uptime Kuma checking 12 HTTP endpoints and 3 TLS certificates, Sentry for application error tracking, and Telegram for all alert delivery. Total infrastructure cost: ~/month for a dedicated monitoring Droplet. Total setup time for a new project: ~90 minutes. This stack has caught 3 disk space issues, 2 memory leaks, and 1 expired certificate before they became incidents.
Fastest path to useful monitoring: deploy Uptime Kuma first (10 minutes), add HTTP monitors for all your production endpoints, configure Telegram notifications. Then deploy Prometheus + Node Exporter via Docker Compose on your largest server (15 minutes), connect to Grafana Cloud free tier (no self-hosting needed), import dashboard 1860, and set one alert for disk usage >85%. This gives you external availability monitoring and basic host metrics in 30 minutes with zero ongoing cost if you use Grafana Cloud free tier.
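For the Grafana Cloud route, the local Prometheus only has to scrape Node Exporter and forward samples with remote_write. This is a sketch with placeholders: the URL, username, and password come from your own Grafana Cloud stack page and are deliberately left unfilled here, and the target assumes the node-exporter compose service from the file above.

```yaml
# prometheus.yml for the quick-start path (sketch; credentials are placeholders)
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['node-exporter:9100']   # compose service name, default port

remote_write:
  - url: "<remote-write URL from your Grafana Cloud stack page>"
    basic_auth:
      username: "<instance ID from the same page>"
      password: "<access token>"          # keep this out of version control
```

With this in place, dashboard 1860 and the disk usage >85% alert are configured on the Grafana Cloud side, so nothing beyond Prometheus and Node Exporter runs on your own server.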