What is the difference between RTO and RPO, and why do they matter for VPS disaster recovery?

RTO (Recovery Time Objective) is the maximum acceptable time from declaring a disaster to restoring service, while RPO (Recovery Point Objective) is the maximum acceptable data loss measured in time. These two numbers drive every other DR decision — for example, a customer-facing e-commerce API might require an RTO of 1 hour and RPO of 15 minutes, while an internal reporting dashboard could tolerate an RTO of 8 hours and RPO of 24 hours. Without explicitly agreeing on these targets with stakeholders, you cannot design DR infrastructure that actually meets business requirements.

How much does a proper backup setup cost for a standard DigitalOcean Droplet?

For a $24/month Droplet, DigitalOcean automated weekly backups cost $4.80/month (20% of the Droplet price), and offsite storage on Cloudflare R2 adds roughly $0.50/month — a total of $5.30/month. This covers both automated weekly Droplet snapshots for fast VM restore and daily restic backups to an off-site, different-provider location. The author frames this as a fixed, predictable cost compared to the far higher cost of losing client trust after an unplanned outage.

Why are untested backups considered unreliable, and what does a proper DR test look like?

The author discovered that three months of what appeared to be daily restic backups were actually empty snapshots — the backup jobs had been failing silently because cron output was redirected to /dev/null and logs were never checked. This real-world failure illustrates why untested backups are not real backups. A quarterly DR test should involve spinning up a new Droplet, restoring from backup, verifying data integrity, testing application functionality, and recording the actual restoration time.

What is the most common cause of unplanned VPS downtime mentioned in the post, and how can it be prevented?

The most common cause cited is disk filling completely — from uncapped log files, container image accumulation, or database transaction logs. The author recommends setting disk usage alerts at an 80% threshold for every VPS, configuring a cron job that monitors log directory sizes and alerts via Telegram, and configuring logrotate for all application logs with a 100MB maximum size and 7-day retention. This 30-minute setup is described as preventing 90% of 'disk full at 2 AM' incidents.

How should systems be tiered for disaster recovery, and what backup strategy applies to each tier?

The post defines three tiers based on business criticality. Tier 1 systems (customer-facing, revenue-generating, compliance-relevant) require aggressive RTO and RPO under 1 hour, meaning automated hourly backups, tested runbooks, and hot standby. Tier 2 systems (internal tools, dashboards, dev/staging) can tolerate 4–8 hour RTO and 24-hour RPO with daily backups and documented procedures. Tier 3 systems (throwaway environments, one-time batch jobs) may need no DR at all. Allocating DR budget proportionally to this classification avoids over-engineering low-criticality systems.

Disaster Recovery Planning for VPS: Backups, RTO, and Actual Tested Recovery

I had my disaster at 11:47 PM on a Tuesday. A DigitalOcean Droplet running a client's production web application went unresponsive: no SSH, no ping, no response from the monitoring dashboard. The Droplet's disk had filled to 100% because of an uncapped log file, which corrupted the ext4 filesystem. No automatic backups had been configured (we had manual snapshots, but the last one was 18 days old). Recovery took 4 hours and resulted in a loss of 3 days of user data. After that incident, I implemented a proper disaster recovery plan across every server we manage. This guide is that plan.

Defining Your RTO and RPO Before Configuring Anything

Recovery Time Objective (RTO) is the maximum acceptable time from disaster declaration to service restoration. Recovery Point Objective (RPO) is the maximum acceptable data loss measured in time — how far back can you roll back without unacceptable business impact? These numbers drive every other DR decision. For a customer-facing e-commerce API with transaction data: RTO might be 1 hour, RPO 15 minutes. For an internal reporting dashboard: RTO 8 hours, RPO 24 hours. Do not configure backups and assume you're covered — explicitly agree with stakeholders on RTO and RPO targets and design your DR infrastructure to meet them.
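One way to make an RPO target operational is to compare the age of the most recent backup against it. The sketch below is only an illustration: the function name `rpo_met` and the epoch-second inputs are assumptions for this example, not part of any backup tool.

```shell
# Hypothetical helper: succeeds (exit 0) if the newest backup is inside the RPO window.
# Arguments: backup timestamp (epoch seconds), current time (epoch seconds), RPO in seconds.
rpo_met() {
  local last_backup_epoch="$1" now_epoch="$2" rpo_seconds="$3"
  local age=$(( now_epoch - last_backup_epoch ))
  [ "$age" -le "$rpo_seconds" ]
}

# Example: a 15-minute RPO (900 s) checked against a backup taken 10 minutes ago.
now=$(date +%s)
if rpo_met "$(( now - 600 ))" "$now" 900; then
  echo "RPO met"
else
  echo "RPO violated"
fi
```

Run on a schedule, a check like this turns the agreed RPO into something that can page you before a disaster rather than after it.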

Tier Your Systems by Criticality

Not all systems need the same DR investment. Tier 1 systems (customer-facing, revenue-generating, compliance-relevant) need aggressive RTO (<1 hour) and RPO (<1 hour) targets — use automated hourly backups with tested recovery runbooks and hot standby. Tier 2 systems (internal tools, dashboards, dev/staging environments) can tolerate 4-8 hour RTO and 24-hour RPO — daily backups, documented recovery procedures. Tier 3 systems (throwaway environments, one-time batch jobs) may need no DR at all. Classify your systems honestly and allocate DR budget proportionally.
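The tier classification can be written down as a small lookup so that backup schedules are derived from the tier instead of being chosen per server. This is a minimal sketch; `backup_interval_hours` is a hypothetical name, and the values simply restate the targets above (hourly for Tier 1, daily for Tier 2, none for Tier 3).

```shell
# Hypothetical lookup: prints the backup interval in hours for a tier (0 means no backups).
backup_interval_hours() {
  case "$1" in
    1) echo 1 ;;    # Tier 1: hourly backups to support an RPO under 1 hour
    2) echo 24 ;;   # Tier 2: daily backups for a 24-hour RPO
    3) echo 0 ;;    # Tier 3: no DR investment
    *) echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

backup_interval_hours 2
```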

DigitalOcean Backup Configuration

DigitalOcean offers automated weekly Droplet backups for 20% of the Droplet cost: a $24/month Droplet costs $4.80/month for weekly backups. Enable this for every Tier 1 and Tier 2 Droplet immediately as a baseline; on its own, a weekly backup only gives an RPO of up to 7 days. DigitalOcean backups capture the entire Droplet state as a snapshot. For more granular RPO, use DigitalOcean Managed Database automated backups (daily full backup + continuous transaction log backup = point-in-time recovery within the retention period). For application data not in a managed database (uploaded files, user-generated content), use restic or rclone to push incremental backups to DigitalOcean Spaces or R2 on a schedule.
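Enabling backups by hand does not scale past a few Droplets, so here is a rough sketch that does it with the doctl CLI for every Droplet carrying a given tag. The function name and the `tier1` tag are assumptions for illustration; the `DOCTL` variable exists only so the command can be swapped out for a dry run.

```shell
# Command to invoke; override DOCTL to stub it out for a dry run.
DOCTL="${DOCTL:-doctl}"

# Hypothetical helper: enables automated backups on every Droplet with the given tag.
enable_backups_for_tag() {
  local tag="$1" id
  $DOCTL compute droplet list --tag-name "$tag" --format ID --no-header |
  while read -r id; do
    if [ -n "$id" ]; then
      $DOCTL compute droplet-action enable-backups "$id"
    fi
  done
}

# Usage (requires an authenticated doctl):
#   enable_backups_for_tag tier1
```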

From my experience: configure disk usage alerts at 80% threshold for every VPS you manage. The most common cause of unplanned downtime I see is disk fill — either log files, container image accumulation, or database transaction logs. Set up a cron job that runs du -sh /var/log/* and alerts via Telegram if any log directory exceeds a threshold. This 30-minute setup prevents 90% of the 'disk full at 2 AM' incidents. Also, set logrotate for all application log files with a maximum size of 100MB and retention of 7 days.
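The disk check and the log rotation policy described above might look roughly like this. All function names are hypothetical, the log path passed to `write_logrotate_config` is a placeholder, and the logrotate directives are one plausible reading of "100MB maximum size, 7-day retention" rather than the exact configuration I run.

```shell
# Succeeds when usage (integer percent) is at or above the threshold.
usage_exceeds() {
  [ "$1" -ge "$2" ]
}

# Prints the used-space percentage of a mount point, using POSIX df output.
disk_usage_percent() {
  df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# Prints an ALERT line and returns 1 when the mount point crosses the threshold (default 80).
check_disk() {
  local mount="$1" threshold="${2:-80}" pct
  pct=$(disk_usage_percent "$mount")
  if usage_exceeds "$pct" "$threshold"; then
    echo "ALERT: $mount is at ${pct}% (threshold ${threshold}%)"
    return 1
  fi
  echo "OK: $mount is at ${pct}%"
}

# Writes a logrotate policy: daily rotation, 7 rotations kept, early rotation above 100 MB.
write_logrotate_config() {
  local log_glob="$1" out_file="$2"
  cat > "$out_file" <<EOF
$log_glob {
    daily
    rotate 7
    maxsize 100M
    compress
    missingok
    notifempty
}
EOF
}

# Usage:
#   check_disk / 80
#   write_logrotate_config "/var/log/myapp/*.log" /etc/logrotate.d/myapp
```

Note that `maxsize` is only evaluated when logrotate actually runs, so a log that can grow by gigabytes within a day needs logrotate scheduled more often than the usual daily cron job. The ALERT line can be forwarded with the same Telegram curl call used in the backup script later in this post.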

Automated Backup with restic and Offsite Storage

restic is a fast, encrypted, deduplicated backup tool that runs on any Linux server and supports any S3-compatible storage backend. Set up daily restic backups from your VPS to DigitalOcean Spaces or Cloudflare R2 (both S3-compatible). restic automatically deduplicates across backup snapshots, so after the first full backup, incremental snapshots only upload changed data. Encrypt backups with a restic password stored in a secrets manager. Schedule via cron or a systemd timer. Send backup completion status to a Telegram channel so you know immediately if a backup fails.

#!/bin/bash
# Daily restic backup to Cloudflare R2 (S3-compatible)

# Stop at the first failing command; pipefail makes a failed pg_dump fail its pipeline
set -euo pipefail

export RESTIC_REPOSITORY="s3:https://ACCOUNT_ID.r2.cloudflarestorage.com/backups"
export AWS_ACCESS_KEY_ID="r2_access_key"
export AWS_SECRET_ACCESS_KEY="r2_secret_key"
export RESTIC_PASSWORD_COMMAND="gcloud secrets versions access latest --secret=restic-password"

# Alert on failure via Telegram (the ERR trap fires for any failing step below,
# not only for the final integrity check)
notify_failure() {
  curl -s -X POST "https://api.telegram.org/botTOKEN/sendMessage" \
    -d "chat_id=CHAT_ID" \
    --data-urlencode "text=Backup failed on $(hostname)"
}
trap notify_failure ERR

# Backup application data
restic backup /opt/myapp /etc/nginx /etc/letsencrypt \
  --tag "daily" \
  --tag "$(hostname)" \
  --exclude="/opt/myapp/node_modules"

# Backup PostgreSQL dump
pg_dump -U postgres mydb | restic backup --stdin --stdin-filename mydb.sql

# Prune old snapshots (keep 7 daily, 4 weekly, 3 monthly)
restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 3 --prune

# Verify backup integrity
restic check

Database Backup Strategies

For PostgreSQL, use pg_dump for logical backups (schema + data in SQL format) and WAL archiving for point-in-time recovery. A daily pg_dump to S3 provides a clean, importable snapshot that can be loaded into the same or a newer PostgreSQL version. WAL archiving (archive_command in postgresql.conf), combined with a periodic base backup, continuously ships transaction logs to object storage, enabling recovery to any point in time rather than only to the daily snapshot. For MySQL, use mysqldump for logical backups and retained binary logs (replayed with mysqlbinlog) for PITR. Never rely solely on filesystem-level snapshots for databases: they can capture an inconsistent state unless the database is quiesced first.
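To make the archive_command idea concrete, here is a minimal sketch of a script PostgreSQL could call for each completed WAL segment. It assumes the archive is a local or mounted directory; shipping straight to object storage would replace the `cp` with an upload tool. The function name and the install path in the comment are hypothetical.

```shell
# postgresql.conf (assumed settings for this sketch):
#   archive_mode = on
#   archive_command = '/usr/local/bin/archive_wal.sh %p %f /mnt/wal_archive'
#
# Hypothetical script body. Arguments: source path (%p), segment file name (%f), archive directory.
archive_wal() {
  local src="$1" name="$2" archive_dir="$3"
  # Never overwrite a segment that is already archived; report failure instead.
  [ ! -f "$archive_dir/$name" ] || return 1
  # Copy under a temporary name, then rename, so a partial copy is never mistaken for a full segment.
  cp "$src" "$archive_dir/$name.tmp" && mv "$archive_dir/$name.tmp" "$archive_dir/$name"
}

# In the script file, the last line would be:
#   archive_wal "$@"
```

The non-zero exit on an existing file matters: PostgreSQL treats a zero exit status as "safely archived" and may then recycle the segment, so the script should only return success when the copy really completed.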

After the disaster I described in the intro, I implemented automated backups on every server. Three months later I tested one of those backups — and it failed to restore. The restic repository had been created but the backup jobs had been failing silently (the cron output went to /dev/null and I hadn't checked the logs). I had three months of what I believed were daily backups that were actually empty snapshots. The lesson: schedule a quarterly DR test. Spin up a new Droplet, restore from backup, verify data integrity, test application functionality, and document the actual restoration time. Only tested backups are real backups.
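A restore drill along those lines can be partly scripted. The sketch below restores the latest snapshot into a scratch directory, refuses to call the drill a success if nothing but empty files came back, and reports how long the restore took. Function names are hypothetical, and the script assumes the restic repository and credentials are already set in the environment.

```shell
# Command to invoke; override RESTIC to stub it out when testing the drill itself.
RESTIC="${RESTIC:-restic}"

# Succeeds only if the directory contains at least one non-empty regular file.
restore_has_data() {
  [ -n "$(find "$1" -type f -size +0 2>/dev/null | head -n 1)" ]
}

# Restores the latest snapshot into a scratch directory and prints the elapsed seconds.
dr_restore_test() {
  local target="$1" start end
  start=$(date +%s)
  $RESTIC restore latest --target "$target" || return 1
  end=$(date +%s)
  if ! restore_has_data "$target"; then
    echo "FAIL: restore produced no data"
    return 1
  fi
  echo "OK: restored in $(( end - start ))s"
}

# Usage:
#   dr_restore_test /tmp/restore-test
```

`latest` resolves to the most recent snapshot in the repository, so when one repository holds several backup sets, narrow it with restic's `--host` or `--path` filters. A non-empty restore is only the first gate; checking data integrity and application behaviour still has to happen on the restored Droplet.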

The DR Runbook: What to Do When Everything Is Down

A DR runbook is a step-by-step procedure document for restoring services after a failure. It should be simple enough to follow under stress at 2 AM. Include: (1) Initial triage — check monitoring dashboard, ping servers, SSH attempt, check cloud console for alerts; (2) Decision tree — is this a network issue, a disk issue, a process crash, or a hardware failure?; (3) Recovery procedures for each failure type — for a crashed process, restart via systemd; for a full disk, free space and restart; for a hardware failure, restore from snapshot to new Droplet. Store the runbook in a location accessible without the failed server — a shared Notion page, a GitHub repo, or even a PDF in Google Drive.
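The decision tree in step (2) is simple enough to encode, which keeps the 2 AM version of you from improvising. This is a sketch only: the function names, the labels, and the 95% cutoff for "disk full" are assumptions for illustration, and the inputs are expected to come from the triage checks in step (1).

```shell
# Hypothetical classifier. Inputs: ping_ok (0/1), ssh_ok (0/1), disk_pct (integer), service_active (0/1).
classify_failure() {
  local ping_ok="$1" ssh_ok="$2" disk_pct="$3" service_active="$4"
  if [ "$ping_ok" -eq 0 ] || [ "$ssh_ok" -eq 0 ]; then
    echo "host-down"
  elif [ "$disk_pct" -ge 95 ]; then
    echo "disk-full"
  elif [ "$service_active" -eq 0 ]; then
    echo "process-crash"
  else
    echo "healthy"
  fi
}

# Maps each label to the recovery procedure named in the runbook.
recovery_action() {
  case "$1" in
    host-down)     echo "check cloud console; if the host is gone, restore snapshot to a new Droplet" ;;
    disk-full)     echo "free disk space, then restart the service" ;;
    process-crash) echo "restart the service via systemd" ;;
    healthy)       echo "no action; re-check monitoring" ;;
    *)             echo "unknown state: escalate" ;;
  esac
}

recovery_action "$(classify_failure 1 1 97 1)"
```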

┌─────────────────────────────────────────────────────┐
│            VPS Disaster Recovery Layers             │
├─────────────────────────────────────────────────────┤
│                                                     │
│  Layer 1: DO Weekly Snapshots (fast VM restore)     │
│           RPO: ~7 days | RTO: ~30 min               │
│                                                     │
│  Layer 2: Daily restic → Cloudflare R2              │
│           RPO: ~24h   | RTO: ~1-2 hours             │
│                                                     │
│  Layer 3: pg_dump + WAL archiving                   │
│           RPO: ~1h    | RTO: ~2-4 hours             │
│                                                     │
│  Layer 4: Disk monitoring (80% alert)               │
│           Prevention > Recovery                     │
└─────────────────────────────────────────────────────┘

GCP Disaster Recovery Patterns

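On GCP, disaster recovery has more managed options than on DigitalOcean. Compute Engine persistent disk snapshots can be scheduled through snapshot schedules in the Cloud Console (daily snapshots retained for 7 days is a common baseline). Cloud SQL takes automated daily backups and keeps 7 of them by default (configurable up to 365); point-in-time recovery relies on retained transaction logs, whose retention window is set separately and is typically up to 7 days, depending on the Cloud SQL edition. For a multi-region DR setup, a Cloud SQL cross-region read replica can be promoted to primary during a regional outage. Cloud Storage objects in a regional bucket are stored redundantly across zones within that region; for cross-region redundancy, use a dual-region or multi-region bucket location, or replicate to a separate bucket in a different region with a transfer or replication job. Object Lifecycle Management only changes storage classes or deletes objects; it does not copy them between buckets.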
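For the Compute Engine part, the same daily, 7-day baseline can be set up from the command line instead of the Console. The wrapper functions below are a sketch: their names, the schedule name, region, zone, disk name, and the 04:00 start time are all placeholders, and the `GCLOUD` variable exists only so the commands can be printed for review before running them.

```shell
# Command to invoke; set GCLOUD=echo to print the commands instead of running them.
GCLOUD="${GCLOUD:-gcloud}"

# Creates a daily snapshot schedule that keeps snapshots for 7 days.
create_daily_snapshot_schedule() {
  local name="$1" region="$2"
  $GCLOUD compute resource-policies create snapshot-schedule "$name" \
    --region "$region" \
    --max-retention-days 7 \
    --daily-schedule \
    --start-time 04:00
}

# Attaches an existing snapshot schedule to a persistent disk.
attach_schedule_to_disk() {
  local disk="$1" zone="$2" name="$3"
  $GCLOUD compute disks add-resource-policies "$disk" \
    --zone "$zone" \
    --resource-policies "$name"
}

# Usage:
#   create_daily_snapshot_schedule daily-7d us-central1
#   attach_schedule_to_disk my-disk us-central1-a daily-7d
```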

My Take: What I Do Differently Now

After that disaster, here's my current baseline for every VPS I manage: automated daily restic backups to Cloudflare R2 (off-site, different provider), DigitalOcean weekly automated Droplet snapshots (for fast VM restore), disk usage monitoring with 80% alerts, log rotation for all application logs, and a quarterly DR test scheduled in the calendar. The total cost for a standard $24 Droplet: $4.80/month for DigitalOcean backups + $0.50/month for R2 storage = $5.30/month for peace of mind. The true cost of not having proper DR is measured in client trust, not server bills.
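The cost arithmetic generalizes to other Droplet sizes. Here is a small sketch that works in cents to avoid floating point; the function name is hypothetical, and the 20% rate and the roughly $0.50 R2 figure are the ones quoted above.

```shell
# Prints the monthly DR cost in dollars. Arguments: Droplet price and offsite storage cost, both in cents.
monthly_dr_cost() {
  local droplet_cents="$1" offsite_cents="$2"
  local backup_cents=$(( droplet_cents * 20 / 100 ))   # weekly backups at 20% of the Droplet price
  local total=$(( backup_cents + offsite_cents ))
  printf '%d.%02d\n' "$(( total / 100 ))" "$(( total % 100 ))"
}

monthly_dr_cost 2400 50   # prints 5.30
```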
