At what daily log volume does self-hosting ELK become more cost-effective than a managed service?

Self-hosted ELK typically becomes cost-effective at around 10–20 GB of log data per day. Below that threshold, managed services like Elastic Cloud or AWS OpenSearch Service are often competitive on cost once operational overhead is factored in. At roughly 10–50 GB/day, managed services can cost $500–2,000/month versus $150–400/month for equivalent self-hosted infrastructure.

Why must the Elasticsearch JVM heap never exceed 31 GB, even on servers with much more RAM?

Once the heap crosses a cutoff just under 32 GB, the JVM switches from compressed object pointers to uncompressed pointers, which wastes heap space and slows garbage collection; 31 GB is the safe practical ceiling. In the author's deployment, a 34 GB heap on a 64 GB server tripled GC pause times compared with a 31 GB heap on the same hardware. The recommended setting is 50% of available RAM, capped at 31 GB, configured in the jvm.options file rather than through environment variables.

When should you bypass Logstash and send logs directly from Filebeat to Elasticsearch?

For simple log ingestion that does not require complex transformation — such as grok parsing, timestamp normalization, or geoip enrichment — routing Filebeat directly to Elasticsearch reduces infrastructure components and operational overhead. Logstash is worth the added complexity only when you need its filter stage capabilities. Eliminating Logstash also removes one failure point and lowers memory consumption on the pipeline servers.

What is the difference between Elastic License 2.0 and Apache 2.0 in the context of self-hosted ELK versus OpenSearch?

Elastic Stack 8.x uses the Elastic License 2.0, which is free for internal self-hosted deployments but restricts offering Elasticsearch as a managed service to third parties. OpenSearch, AWS's fork of Elasticsearch 7.10 created in 2021, is Apache 2.0 licensed and freely usable for any commercial purpose without those restrictions. The practical recommendation is to choose Elastic Stack if you need proprietary features such as ML alerting or security analytics, and OpenSearch if you prefer Apache 2.0 licensing or need deep AWS integration.

Elastic Stack (ELK) Self-Hosted: Production Setup and Operational Lessons

What happens when Elasticsearch's flood-stage disk watermark is reached, and how should you prevent it?

When disk usage on a node hits the default 95% flood-stage watermark, Elasticsearch puts a read-only block (index.blocks.read_only_allow_delete) on every index that has a shard on that node. Writes are rejected, so Logstash can no longer deliver incoming log events and may drop them, and Kibana shows error banners. Recovery starts with freeing disk space. Elasticsearch 7.4 and later then lift the block automatically once usage falls back below the high watermark; older versions need the block cleared by hand through the index settings API. To prevent this, set disk usage alerts at 75% and 85% thresholds, configure ILM delete phases conservatively, and verify those phases are actually running.

The Elastic Stack (Elasticsearch, Logstash, and Kibana, or ELK) is one of the most widely deployed centralized logging solutions in production. According to Elastic's own metrics, Elasticsearch is downloaded over 600 million times per year. But self-hosted ELK has a reputation for being resource-hungry, operationally complex, and prone to specific failure modes that catch new operators off guard. I deployed a self-hosted ELK stack at Commsult Indonesia for a client compliance requirement (log retention for 2 years in a specific jurisdiction) and spent the first three months learning its operational characteristics the hard way. This guide covers what I wish I'd known at the start.

When Self-Hosted ELK Makes Sense

Managed Elasticsearch services (Elastic Cloud, AWS OpenSearch Service, GCP's managed offering) eliminate most operational overhead but cost significantly more than self-hosting at high data volumes. At roughly 10-50 GB of log data ingested per day, managed services can cost $500-2,000/month, while self-hosting on equivalent infrastructure costs $150-400/month. For most teams, the break-even point where self-hosting becomes cheaper is typically 10-20 GB/day. Other reasons to self-host: data residency requirements (for example, data that must stay on specific hardware in Indonesia), compliance requirements for air-gapped environments, or specific customization needs that managed services don't allow.

Cluster Architecture for Production

A single-node Elasticsearch cluster is not production-grade: it has no fault tolerance and no high availability. The minimum production setup is three master-eligible nodes. That gives quorum-based master election (2 of 3 votes, so the cluster survives the loss of any one node) and lets every primary shard keep its replica on a different node. Each node needs at least 8 GB of RAM (16 GB recommended), with the JVM heap set to 50% of available RAM and never more than 31 GB, because of JVM pointer compression. SSD storage is effectively required: Elasticsearch's random read/write pattern makes HDD performance unacceptable for production query latency.
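The 2-of-3 arithmetic generalizes to any number of master-eligible nodes. The sketch below is plain majority math in Python; the function names are illustrative and not part of any Elasticsearch API.

```python
def quorum(master_eligible_nodes: int) -> int:
    """Votes needed for a strict majority of master-eligible nodes."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


def tolerated_failures(master_eligible_nodes: int) -> int:
    """Master-eligible nodes that can be lost while a majority remains."""
    return master_eligible_nodes - quorum(master_eligible_nodes)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"{n} nodes: quorum={quorum(n)}, tolerates={tolerated_failures(n)}")
```

Two nodes tolerate no more failures than one, which is why three is the practical minimum.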

The Logstash Pipeline

Logstash is the data processing pipeline — it receives log data from Beats agents (Filebeat on each server), parses and transforms it, and forwards to Elasticsearch. The core Logstash pipeline stages are input (receive data from Filebeat or direct sources), filter (parse, enrich, and transform — grok patterns for log parsing, date filter for timestamp normalization, geoip for IP enrichment), and output (write to Elasticsearch). For simple log ingestion without complex transformation, consider Filebeat → Elasticsearch directly, bypassing Logstash — this reduces infrastructure and operational overhead at the cost of transformation flexibility.
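To make the filter stage concrete, here is a rough Python equivalent of what a grok pattern plus a date filter do to one access-log line. It is not Logstash configuration: the regular expression, the field names, and the sample log format are assumptions chosen for illustration, and real grok patterns handle many more edge cases.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Simplified pattern for a common/combined-style access log line (assumed format).
ACCESS_LOG = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_access_line(line: str) -> Optional[dict]:
    """Split a raw line into fields (grok-like) and normalize its timestamp (date-like)."""
    match = ACCESS_LOG.match(line)
    if match is None:
        return None  # Logstash would tag such an event with _grokparsefailure instead
    event = match.groupdict()
    parsed = datetime.strptime(event.pop("timestamp"), "%d/%b/%Y:%H:%M:%S %z")
    event["@timestamp"] = parsed.astimezone(timezone.utc).isoformat()
    event["status"] = int(event["status"])
    event["bytes"] = 0 if event["bytes"] == "-" else int(event["bytes"])
    return event


if __name__ == "__main__":
    sample = '203.0.113.7 - - [10/Oct/2024:13:55:36 +0700] "GET /api/orders HTTP/1.1" 502 1234'
    print(parse_access_line(sample))
```

If none of your logs need this kind of parsing, that is the signal that Filebeat can ship straight to Elasticsearch.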

From my experience: set JVM heap to half your server RAM, but never more than 31GB. On a 32GB RAM server, set -Xms16g -Xmx16g for Elasticsearch. Once the heap crosses a cutoff just under 32GB, the JVM switches from compressed object pointers to uncompressed ones, which wastes heap space and slows garbage collection, so 31GB is the safe ceiling. I discovered this the hard way after setting a 34GB heap on a 64GB server and seeing GC pause times triple compared to a 31GB heap on the same server. Set heap in the JVM options files (jvm.options, or a custom file under jvm.options.d/ on recent versions), not as environment variables, to ensure it applies correctly.
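The sizing rule is simple enough to encode. A minimal sketch, assuming RAM is given in whole gigabytes; the helper names are hypothetical and only the -Xms/-Xmx flags come from the text.

```python
MAX_HEAP_GB = 31  # ceiling from the text: stay below the compressed-pointer cutoff


def recommended_heap_gb(ram_gb: int) -> int:
    """Half of RAM, capped at 31 GB."""
    if ram_gb <= 0:
        raise ValueError("ram_gb must be positive")
    return min(ram_gb // 2, MAX_HEAP_GB)


def jvm_heap_flags(ram_gb: int) -> list:
    """Matching -Xms/-Xmx lines; min and max are kept equal, as in the text."""
    heap = recommended_heap_gb(ram_gb)
    return [f"-Xms{heap}g", f"-Xmx{heap}g"]


if __name__ == "__main__":
    for ram in (16, 32, 64, 128):
        print(ram, jvm_heap_flags(ram))
```

The other half of RAM is not wasted: the operating system uses it for the filesystem cache that Elasticsearch reads depend on.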

Index Management and Data Lifecycle

Elasticsearch's Index Lifecycle Management (ILM) is essential for production operations. Without ILM, indices grow indefinitely until disk fills up. ILM defines phases: hot (active writes, replicas enabled), warm (read-only, can move to slower storage), cold (infrequent access, frozen index), and delete (remove old data). A typical log index policy: hot phase for 7 days (active, replicated), warm phase for 30 days (moved to warm nodes, compressed), cold phase for 90 days (optional, minimal storage), delete phase at 365 days. Attach this policy to your index template so every new index automatically follows the lifecycle.
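One way to sanity-check a policy before applying it is to ask which phase an index of a given age should be in. The sketch below assumes the example figures are durations, so warm begins at day 7, cold at day 37, and delete at day 365; that reading, the threshold table, and the function name are assumptions, so substitute the min_age values from your own policy.

```python
# Age in days at which each phase begins (one reading of the example policy).
# In ILM, min_age is measured from rollover when the policy uses rollover.
DEFAULT_MIN_AGE_DAYS = {"hot": 0, "warm": 7, "cold": 37, "delete": 365}


def phase_for_age(age_days: float, min_age_days: dict = DEFAULT_MIN_AGE_DAYS) -> str:
    """Return the latest phase whose min_age the index has already reached."""
    if age_days < 0:
        raise ValueError("age_days cannot be negative")
    current = "hot"
    for phase, min_age in sorted(min_age_days.items(), key=lambda item: item[1]):
        if age_days >= min_age:
            current = phase
    return current


if __name__ == "__main__":
    for age in (0, 6, 7, 40, 364, 365):
        print(age, phase_for_age(age))
```

Comparing this expectation against what the cluster reports for each index is a cheap way to notice a lifecycle that has stalled.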

# elasticsearch.yml — Production cluster configuration
cluster.name: prod-logs
node.name: es-node-1
network.host: 0.0.0.0
http.port: 9200

# Cluster discovery (3-node HA cluster)
discovery.seed_hosts:
  - es-node-1.internal:9300
  - es-node-2.internal:9300
  - es-node-3.internal:9300
# Only needed for the very first cluster bootstrap; remove once the cluster has formed
cluster.initial_master_nodes:
  - es-node-1
  - es-node-2
  - es-node-3

# JVM heap (half of RAM, max 31GB)
# Set in jvm.options:
# -Xms16g
# -Xmx16g

# Index Lifecycle Management policy (via API)
# PUT _ilm/policy/logs-policy
# {
#   "policy": {
#     "phases": {
#       "hot":    { "min_age": "0ms", "actions": { "rollover": { "max_age": "7d" } } },
#       "warm":   { "min_age": "7d",  "actions": { "shrink":   { "number_of_shards": 1 } } },
#       "delete": { "min_age": "365d","actions": { "delete":   {} } }
#     }
#   }
# }

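# Check cluster health (run from a shell; this line is not part of elasticsearch.yml)
# Add --cacert with the path to your CA file if the cluster uses a self-signed CA
curl -u elastic:password 'https://localhost:9200/_cluster/health?pretty'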

Monitoring Elasticsearch Cluster Health

Elasticsearch exposes a REST API for cluster health: GET /_cluster/health returns a green/yellow/red status. Green means all primary and replica shards are allocated. Yellow means all primary shards are allocated but some replicas are not (functional but with degraded redundancy). Red means at least one primary shard is not allocated, so some data is unavailable and may be lost. Add Elasticsearch health monitoring to your Prometheus + Grafana stack using the Prometheus elasticsearch_exporter (packaged as the prometheus-elasticsearch-exporter Helm chart), which exposes Elasticsearch metrics in Prometheus format. Alert on cluster status red (immediate action required) and on yellow lasting more than 10 minutes (investigation required).
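The alerting rule above can be written as a small decision function. This is a sketch of the logic only, not a Prometheus rule; the severity labels "page" and "investigate" and the function name are made up for illustration.

```python
from typing import Optional

YELLOW_GRACE_MINUTES = 10  # from the text: yellow is tolerated for up to 10 minutes


def health_alert(status: str, minutes_in_status: float) -> Optional[str]:
    """Map the cluster health status and how long it has lasted to an alert level."""
    if status == "red":
        return "page"  # immediate action required
    if status == "yellow":
        if minutes_in_status > YELLOW_GRACE_MINUTES:
            return "investigate"
        return None  # brief yellow is normal during restarts and shard relocation
    if status == "green":
        return None
    raise ValueError(f"unexpected cluster status: {status!r}")


if __name__ == "__main__":
    print(health_alert("yellow", 3), health_alert("yellow", 25), health_alert("red", 0))
```

The grace period on yellow matters because every rolling restart turns the cluster yellow for a short while.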

┌──────────────────────────────────────────────────────┐
│            Elastic Stack Data Flow                    │
├──────────────────────────────────────────────────────┤
│                                                      │
│  App Servers                                         │
│  [Filebeat] → reads logs from /var/log/              │
│       ↓                                              │
│  [Logstash] → grok parse + geoip + timestamp         │
│       ↓                                              │
│  [Elasticsearch Cluster] (3 nodes, HA)               │
│  ├── hot:  es-node-1 (SSD, active writes)            │
│  ├── warm: es-node-2 (HDD, read-only)                │
│  └── cold: es-node-3 (compressed, infrequent)        │
│       ↓                                              │
│  [Kibana] → dashboards, Discover, Alerting           │
└──────────────────────────────────────────────────────┘

Elasticsearch has a flood-stage disk watermark (default 95% disk usage). When a node exceeds it, every index with a shard on that node gets a read-only block (index.blocks.read_only_allow_delete) so the disk cannot fill completely. When this happens, Logstash can't write to Elasticsearch and starts dropping log events, and Kibana shows an error banner. Recovery starts with freeing disk space. On Elasticsearch 7.4 and later the block is lifted automatically once usage falls back below the high watermark; on older versions you have to clear it yourself via the index settings API. I hit this at 2 AM when log volume spiked unexpectedly. Set disk usage alerts at 75% and 85% thresholds, not just 90%. Configure ILM delete phases conservatively and verify they're running. Elasticsearch disk fill is silent until it's an emergency.
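The thresholds translate directly into a check you can run from cron or a monitoring agent. A minimal sketch using only the Python standard library; the level names are assumptions, and the percentage it computes from the filesystem may differ slightly from what Elasticsearch reports for the same disk.

```python
import shutil

WARN_PCT = 75    # first alert threshold from the text
CRIT_PCT = 85    # second alert threshold from the text
FLOOD_PCT = 95   # Elasticsearch's default flood-stage watermark


def disk_alert_level(used_pct: float) -> str:
    """Classify a disk usage percentage against the three thresholds."""
    if not 0 <= used_pct <= 100:
        raise ValueError("used_pct must be between 0 and 100")
    if used_pct >= FLOOD_PCT:
        return "flood-stage"
    if used_pct >= CRIT_PCT:
        return "critical"
    if used_pct >= WARN_PCT:
        return "warning"
    return "ok"


def used_percent(path: str) -> float:
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    pct = used_percent(".")  # point this at the Elasticsearch data path in practice
    print(f"{pct:.1f}% used: {disk_alert_level(pct)}")
```

The gap between 85% and 95% is the time you have to react, so size it against your worst observed ingest spike.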

Kibana Dashboards for Operational Visibility

Kibana's Discover view for ad-hoc log searching and Dashboard view for operational dashboards cover most day-to-day needs. For an Nginx access log index, create dashboards showing: request rate by status code, top slow endpoints (95th percentile response time), top error paths (5xx status), geographic distribution of visitors (using geoip enrichment in Logstash), and bot/crawler traffic identification. Kibana's Lens editor makes dashboard creation accessible without deep Elasticsearch query knowledge. Export dashboards as JSON and store in version control — dashboards are configuration and should be reproducible.

Security: TLS and Authentication for Production

Recent Elasticsearch versions (8.x) enable security by default: TLS for inter-node communication and HTTP, plus password-based authentication. The xpack.security.enabled setting defaults to true in Elasticsearch 8.x. On 8.x the password for the elastic user is generated automatically on first startup, and the elasticsearch-reset-password utility sets or resets passwords for built-in users (elastic, kibana_system, logstash_system); the older elasticsearch-setup-passwords utility is deprecated. Create dedicated Elasticsearch roles for each component: a logstash_writer role with write access only to log indices, and a read-only role for dashboard viewers. Never use the elastic superuser account for application access, because it has full cluster admin rights.
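A least-privilege writer role can be expressed as a small JSON document and sent to the role API (PUT /_security/role/logstash_writer). The sketch below only builds and prints that body. The index pattern logs-* and the exact privilege list are assumptions to treat as a starting point: a Logstash output that also manages index templates or ILM needs additional privileges.

```python
import json


def logstash_writer_role(index_patterns: list) -> dict:
    """Build a role body that can create and write to log indices, and nothing else."""
    return {
        "cluster": ["monitor"],  # read-only view of cluster state
        "indices": [
            {
                "names": list(index_patterns),
                "privileges": ["create_index", "write"],  # no read, delete_index, or manage
            }
        ],
    }


if __name__ == "__main__":
    body = logstash_writer_role(["logs-*"])  # hypothetical index pattern
    print(json.dumps(body, indent=2))
```

Keeping the role definition in code or version control makes it reviewable in the same way as the exported dashboards.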

My Take: When to Consider OpenSearch Instead

AWS forked Elasticsearch 7.10 in 2021 to create OpenSearch after Elastic changed licensing. OpenSearch is Apache 2.0 licensed, meaning it's freely usable for commercial purposes without licensing concerns. For new self-hosted deployments where Elastic's features beyond the open-source subset aren't required, OpenSearch is a viable alternative with a growing ecosystem. However, Elastic Stack 8.x on the Elastic License 2.0 is free for self-hosted use — the license only restricts providing Elasticsearch as a managed service to third parties. For internal self-hosted deployments, the Elastic License 2.0 is generally compatible. My recommendation: use Elastic Stack if you need Elastic's proprietary features (ML alerting, security analytics) and are comfortable with the license; use OpenSearch if you prefer Apache 2.0 licensing or need AWS integration.
