I had my disaster at 11:47 PM on a Tuesday. A DigitalOcean Droplet running a client's production web application went unresponsive: no SSH, no ping, no response from the monitoring dashboard. An uncapped log file had filled the Droplet's disk to 100%, which corrupted the ext4 filesystem. No automatic backups had been configured (we had manual snapshots, but the last one was 18 days old). Recovery took 4 hours and resulted in a loss of 3 days of user data. After that incident, I implemented a proper disaster recovery plan across every server we manage. This guide is that plan.
Recovery Time Objective (RTO) is the maximum acceptable time from disaster declaration to service restoration. Recovery Point Objective (RPO) is the maximum acceptable data loss measured in time — how far back can you roll back without unacceptable business impact? These numbers drive every other DR decision. For a customer-facing e-commerce API with transaction data: RTO might be 1 hour, RPO 15 minutes. For an internal reporting dashboard: RTO 8 hours, RPO 24 hours. Do not configure backups and assume you're covered — explicitly agree with stakeholders on RTO and RPO targets and design your DR infrastructure to meet them.
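An RPO target is easier to enforce once it is checkable. The sketch below compares the age of the newest backup against an RPO in minutes. It assumes you can obtain the last backup time as a Unix timestamp; the function name rpo_met and the sample values are illustrative and not part of any tool.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if the newest backup is within the RPO window.
# Arguments: last backup time and current time (Unix seconds), RPO in minutes.
rpo_met() {
  local last_backup="$1" now="$2" rpo_minutes="$3"
  local age_minutes=$(( (now - last_backup) / 60 ))
  [ "$age_minutes" -le "$rpo_minutes" ]
}

# Example: a backup taken 10 minutes ago, checked against a 15-minute RPO.
now="$(date +%s)"
if rpo_met "$(( now - 600 ))" "$now" 15; then
  echo "RPO met"
else
  echo "RPO violated"
fi
```

Run from cron against the real timestamp of your latest backup, a check like this turns the agreed RPO into an alert condition instead of a number in a document.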
Not all systems need the same DR investment. Tier 1 systems (customer-facing, revenue-generating, compliance-relevant) need aggressive RTO (<1 hour) and RPO (<1 hour) targets — use automated hourly backups with tested recovery runbooks and hot standby. Tier 2 systems (internal tools, dashboards, dev/staging environments) can tolerate 4-8 hour RTO and 24-hour RPO — daily backups, documented recovery procedures. Tier 3 systems (throwaway environments, one-time batch jobs) may need no DR at all. Classify your systems honestly and allocate DR budget proportionally.
DigitalOcean offers automated weekly Droplet backups for 20% of the Droplet cost — a $24/month Droplet costs $4.80/month for weekly backups. Enable this for every Tier 1 and Tier 2 Droplet immediately. DigitalOcean backups capture the entire Droplet state as a snapshot. For more granular RPO, use DigitalOcean Managed Database automated backups (daily full backup + continuous binary log backup = point-in-time recovery to any second within the retention period). For application data not in a managed database (uploaded files, user-generated content), use restic or rclone to push incremental backups to DigitalOcean Spaces or R2 on a schedule.
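If you manage more than a couple of Droplets, enabling backups from the command line is less error-prone than clicking through the control panel. This is a minimal sketch that assumes doctl is installed and authenticated; the function names and the Droplet IDs in the usage note are placeholders.

```bash
#!/usr/bin/env bash
# Enable DigitalOcean automated backups for each Droplet ID passed in.
# Assumes doctl is installed and authenticated (doctl auth init).
enable_backups() {
  local id
  for id in "$@"; do
    doctl compute droplet-action enable-backups "$id"
  done
}

# Monthly backup cost at 20% of the Droplet price.
# Works in cents to avoid floating point: 2400 means $24.00.
backup_cost_cents() {
  echo $(( $1 * 20 / 100 ))
}
```

Usage would look like enable_backups 123456 234567 (hypothetical IDs). backup_cost_cents 2400 prints 480, matching the $4.80/month figure above.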
From my experience: configure disk usage alerts at an 80% threshold for every VPS you manage. The most common cause of unplanned downtime I see is a full disk, caused by log files, accumulated container images, or database transaction logs. Set up a cron job that runs du -sh /var/log/* and alerts via Telegram if any log directory exceeds a size limit you choose. This 30-minute setup prevents 90% of the 'disk full at 2 AM' incidents. Also, configure logrotate for all application log files with a maximum size of 100MB and retention of 7 days.
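The disk alert can be sketched as a few small shell functions. This is one possible shape, not a finished monitoring tool: the function names, the TELEGRAM_TOKEN and TELEGRAM_CHAT_ID environment variables, and the 80% default are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Print the use% (digits only) of the filesystem that holds the given path.
disk_usage_pct() {
  df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# Succeed when usage is at or above the threshold.
is_over_threshold() {
  [ "$1" -ge "$2" ]
}

# Send a Telegram message. TELEGRAM_TOKEN and TELEGRAM_CHAT_ID are
# hypothetical environment variables you would set yourself.
notify_telegram() {
  curl -fsS -X POST "https://api.telegram.org/bot${TELEGRAM_TOKEN}/sendMessage" \
    -d "chat_id=${TELEGRAM_CHAT_ID}" --data-urlencode "text=$1" >/dev/null
}

# Alert if the filesystem holding $1 is at or above $2 percent (default 80).
check_disk() {
  local path="$1" threshold="${2:-80}" pct
  pct="$(disk_usage_pct "$path")"
  if is_over_threshold "$pct" "$threshold"; then
    notify_telegram "Disk at ${pct}% on ${HOSTNAME} (${path})"
  fi
}
```

A cron entry would source this file and call check_disk / 80 every few minutes. Send cron's output to a log file instead of /dev/null so a broken check is visible.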
restic is a fast, encrypted, deduplicated backup tool that runs on any Linux server and supports any S3-compatible storage backend. Set up daily restic backups from your VPS to DigitalOcean Spaces or Cloudflare R2 (both S3-compatible). restic automatically deduplicates across backup snapshots, so after the first full backup, incremental snapshots only upload changed data. Encrypt backups with a restic password stored in a secrets manager. Schedule via cron or a systemd timer. Send backup completion status to a Telegram channel so you know immediately if a backup fails.
#!/bin/bash
# Daily restic backup to Cloudflare R2 (S3-compatible)
# Stop at the first failing step; pipefail makes a failed pg_dump fail its pipeline
set -euo pipefail

export RESTIC_REPOSITORY="s3:https://ACCOUNT_ID.r2.cloudflarestorage.com/backups"
export AWS_ACCESS_KEY_ID="r2_access_key"
export AWS_SECRET_ACCESS_KEY="r2_secret_key"
export RESTIC_PASSWORD_COMMAND="gcloud secrets versions access latest --secret=restic-password"

# Alert via Telegram when any step below fails, not only the last one
notify_failure() {
  curl -s -X POST "https://api.telegram.org/botTOKEN/sendMessage" \
    -d "chat_id=CHAT_ID" --data-urlencode "text=Backup failed on $(hostname)"
}
trap notify_failure ERR

# Backup application data
restic backup /opt/myapp /etc/nginx /etc/letsencrypt --tag "daily" --tag "$(hostname)" --exclude="/opt/myapp/node_modules"

# Backup PostgreSQL dump
pg_dump -U postgres mydb | restic backup --stdin --stdin-filename mydb.sql

# Prune old snapshots (keep 7 daily, 4 weekly, 3 monthly)
restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 3 --prune

# Verify backup integrity
restic check

For PostgreSQL, use pg_dump for logical backups (schema + data in SQL format) and WAL archiving for point-in-time recovery. A daily pg_dump to S3 provides a clean, importable snapshot that restores into the same or a newer PostgreSQL version. WAL archiving (archive_command in postgresql.conf) continuously ships transaction logs to object storage, which enables recovery to any point in time, not just the daily snapshot. For MySQL, use mysqldump for logical backups and archived binary logs for point-in-time recovery (PITR). Never rely solely on filesystem-level snapshots for databases: they can capture an inconsistent state unless the database is quiesced first.
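To make the archive_command idea concrete, here is a minimal sketch of an archive script. It assumes the archive destination is a local or mounted directory for simplicity; the script path, the function name archive_wal, and the destination directory are hypothetical. Shipping to object storage would replace the cp step with an upload tool of your choice.

```bash
#!/usr/bin/env bash
# Hypothetical archive script, referenced from postgresql.conf as:
#   archive_mode = on
#   archive_command = '/usr/local/bin/archive_wal.sh %p %f'
# PostgreSQL substitutes %p (path to the WAL file) and %f (its file name).
archive_wal() {
  local wal_path="$1" wal_name="$2" dest_dir="$3"
  # Never overwrite an existing segment. A non-zero exit tells PostgreSQL
  # the archive attempt failed, so it keeps the segment and retries later.
  [ ! -f "${dest_dir}/${wal_name}" ] || return 1
  # Copy under a temporary name, then rename, so a partial copy is never
  # visible under the final name.
  cp "$wal_path" "${dest_dir}/${wal_name}.tmp" &&
    mv "${dest_dir}/${wal_name}.tmp" "${dest_dir}/${wal_name}"
}
```

The real script would end with a call such as archive_wal "$1" "$2" /mnt/wal-archive. The important property is the exit code: PostgreSQL only recycles a WAL segment after the archive command returns zero.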
After the disaster I described in the intro, I implemented automated backups on every server. Three months later I tested one of those backups — and it failed to restore. The restic repository had been created but the backup jobs had been failing silently (the cron output went to /dev/null and I hadn't checked the logs). I had three months of what I believed were daily backups that were actually empty snapshots. The lesson: schedule a quarterly DR test. Spin up a new Droplet, restore from backup, verify data integrity, test application functionality, and document the actual restoration time. Only tested backups are real backups.
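A restore drill can be partly scripted so that it produces a measured restoration time. The sketch below assumes the restic environment variables from the backup script are already exported; the function name restore_drill and the idea of checking one known file are illustrative, and a real drill would also start the application and query the database.

```bash
#!/usr/bin/env bash
# Restore the latest snapshot into a scratch directory, confirm that a file
# you know must exist is present and non-empty, and report how long it took.
# Arguments: target directory, absolute path of a file expected in the backup.
restore_drill() {
  local target="$1" must_exist="$2" start end
  start="$(date +%s)"
  restic restore latest --target "$target" || return 1
  # restic recreates the original absolute paths underneath the target.
  if [ ! -s "${target}${must_exist}" ]; then
    echo "restore finished but ${must_exist} is missing or empty" >&2
    return 1
  fi
  end="$(date +%s)"
  echo "restore_seconds=$(( end - start ))"
}
```

On a fresh Droplet this might be called as restore_drill /tmp/restore-test /opt/myapp/package.json (a hypothetical file). Record the reported time in the runbook: it is your measured RTO for this layer.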
A DR runbook is a step-by-step procedure document for restoring services after a failure. It should be simple enough to follow under stress at 2 AM. Include: (1) Initial triage — check monitoring dashboard, ping servers, SSH attempt, check cloud console for alerts; (2) Decision tree — is this a network issue, a disk issue, a process crash, or a hardware failure?; (3) Recovery procedures for each failure type — for a crashed process, restart via systemd; for a full disk, free space and restart; for a hardware failure, restore from snapshot to new Droplet. Store the runbook in a location accessible without the failed server — a shared Notion page, a GitHub repo, or even a PDF in Google Drive.
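The decision tree can also live next to the runbook as a small script, so the first classification does not depend on memory at 2 AM. This is a simplified sketch: the function name, the input flags, and the 95% disk cutoff are assumptions, and real incidents overlap (a full disk can also make a host stop answering ping).

```bash
#!/usr/bin/env bash
# Hypothetical triage helper mirroring the runbook's decision tree.
# Arguments: ping_ok (0/1), ssh_ok (0/1), disk_pct, service_active (0/1).
classify_failure() {
  local ping_ok="$1" ssh_ok="$2" disk_pct="$3" service_active="$4"
  if [ "$ssh_ok" -ne 1 ]; then
    if [ "$ping_ok" -eq 1 ]; then
      echo "network: host answers ping, check firewall and sshd from the cloud console"
    else
      echo "host-down: check the cloud console, restore a snapshot to a new Droplet if needed"
    fi
  elif [ "$disk_pct" -ge 95 ]; then
    echo "disk: free space, then restart services"
  elif [ "$service_active" -ne 1 ]; then
    echo "process: restart via systemd"
  else
    echo "unknown: escalate"
  fi
}
```

For example, classify_failure 1 1 97 1 points at the disk procedure, and classify_failure 1 1 40 0 points at the process restart.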
┌─────────────────────────────────────────────────────┐
│            VPS Disaster Recovery Layers             │
├─────────────────────────────────────────────────────┤
│                                                     │
│  Layer 1: DO Weekly Snapshots (fast VM restore)     │
│           RPO: ~7 days | RTO: ~30 min               │
│                                                     │
│  Layer 2: Daily restic → Cloudflare R2              │
│           RPO: ~24h | RTO: ~1-2 hours               │
│                                                     │
│  Layer 3: pg_dump + WAL archiving                   │
│           RPO: ~1h | RTO: ~2-4 hours                │
│                                                     │
│  Layer 4: Disk monitoring (80% alert)               │
│           Prevention > Recovery                     │
└─────────────────────────────────────────────────────┘

On GCP, disaster recovery has more managed options than DigitalOcean. Compute Engine disk snapshots can be scheduled via the Cloud Console (daily snapshots retained for 7 days is a common baseline). Cloud SQL supports automated daily backups (7 retained by default, configurable up to 365) and point-in-time recovery within the transaction log retention window, which is limited to a number of days that depends on the Cloud SQL edition. For a multi-region DR setup, Cloud SQL's cross-region read replicas can be promoted to primary in a regional outage. Cloud Storage objects in a regional bucket are stored redundantly across zones within that region by default. For cross-region redundancy, use a dual-region or multi-region bucket location, or copy objects to a separate bucket in a different region with Storage Transfer Service.
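The daily snapshot baseline can also be created from the command line. This sketch assumes gcloud is installed and authenticated; the function name, the 03:00 start time, and every resource name in the usage note are placeholders.

```bash
#!/usr/bin/env bash
# Create a daily snapshot schedule with 7-day retention and attach it to a disk.
# Assumes gcloud is installed and authenticated; all names are placeholders.
create_daily_snapshot_schedule() {
  local policy="$1" region="$2" disk="$3" zone="$4"
  gcloud compute resource-policies create snapshot-schedule "$policy" \
    --region="$region" \
    --daily-schedule \
    --start-time=03:00 \
    --max-retention-days=7 &&
  gcloud compute disks add-resource-policies "$disk" \
    --resource-policies="$policy" \
    --zone="$zone"
}
```

A call might look like create_daily_snapshot_schedule daily-7d us-central1 app-disk us-central1-a. The schedule is a regional resource, so the disk's zone has to sit inside the region you pass.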
After that disaster, here's my current baseline for every VPS I manage: automated daily restic backups to Cloudflare R2 (off-site, different provider), DigitalOcean weekly automated Droplet snapshots (for fast VM restore), disk usage monitoring with 80% alerts, log rotation for all application logs, and a quarterly DR test scheduled in the calendar. The total cost for a standard $24 Droplet: $4.80/month for DigitalOcean backups + $0.50/month for R2 storage = $5.30/month for peace of mind. The true cost of not having proper DR is measured in client trust, not server bills.