The Elastic Stack (Elasticsearch, Logstash, and Kibana, commonly called ELK) is one of the most widely deployed centralized logging solutions in production. According to Elastic's own metrics, Elasticsearch is downloaded over 600 million times per year. But self-hosted ELK has a reputation for being resource-hungry, operationally complex, and prone to specific failure modes that catch new operators off guard. I deployed a self-hosted ELK stack at Commsult Indonesia for a client compliance requirement (log retention for 2 years in a specific jurisdiction) and spent the first three months learning its operational characteristics the hard way. This guide covers what I wish I'd known at the start.
Managed Elasticsearch services (Elastic Cloud, AWS OpenSearch Service, GCP's managed offering) eliminate most operational overhead but cost significantly more than self-hosted for high data volumes. At roughly 10-50GB of log data per day ingested, managed services can cost $500-2000/month. Self-hosted on equivalent infrastructure costs $150-400/month. The break-even point where self-hosted becomes cost-effective versus managed is typically 10-20GB/day for most teams. Other reasons to self-host: data residency requirements (data must stay on specific hardware in Indonesia), compliance requirements for air-gapped environments, or specific customization needs that managed services don't allow.
A single-node Elasticsearch cluster is not production-grade: it has no fault tolerance and no high availability. The minimum production setup is three master-eligible nodes in one cluster. With one replica per index, Elasticsearch always places a replica shard on a different node from its primary, so losing a single node loses no data, and master election is quorum-based (2 of 3 master-eligible nodes must agree before the cluster can elect a master and commit cluster-state changes). Each node requires at minimum 8GB of RAM (16GB recommended), with the JVM heap set to 50% of available RAM and never above 31GB, because larger heaps lose JVM compressed object pointers. SSD storage is effectively required: Elasticsearch's random read/write pattern makes HDD performance unacceptable for production query latency.
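The 2-of-3 figure is just majority arithmetic. A minimal sketch of that arithmetic follows; the function names are mine for illustration and are not part of any Elasticsearch API.

```python
def quorum(master_eligible_nodes: int) -> int:
    """Smallest majority of the master-eligible nodes."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


def tolerated_failures(master_eligible_nodes: int) -> int:
    """How many master-eligible nodes can fail while a majority remains."""
    return master_eligible_nodes - quorum(master_eligible_nodes)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"{n} nodes: quorum={quorum(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

By this arithmetic a two-node cluster tolerates no more failures than a single node does, which is why three nodes is the practical minimum.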
Logstash is the data processing pipeline — it receives log data from Beats agents (Filebeat on each server), parses and transforms it, and forwards to Elasticsearch. The core Logstash pipeline stages are input (receive data from Filebeat or direct sources), filter (parse, enrich, and transform — grok patterns for log parsing, date filter for timestamp normalization, geoip for IP enrichment), and output (write to Elasticsearch). For simple log ingestion without complex transformation, consider Filebeat → Elasticsearch directly, bypassing Logstash — this reduces infrastructure and operational overhead at the cost of transformation flexibility.
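To make the filter stage concrete, here is a rough Python sketch of what a grok pattern plus a date filter do to one access-log line. This is not Logstash code: the regular expression, the field names, and the sample line are my own assumptions for illustration, and a real pipeline would use Logstash's grok and date filters instead.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Hypothetical pattern for the start of an Nginx/Apache "combined" log line.
LINE_RE = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_access_line(line: str) -> Optional[dict]:
    """Parse one log line into a structured event, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None  # comparable to a grok parse failure
    event = match.groupdict()
    # Timestamp normalization: convert the local log time to UTC ISO 8601.
    parsed = datetime.strptime(event.pop("timestamp"), "%d/%b/%Y:%H:%M:%S %z")
    event["@timestamp"] = parsed.astimezone(timezone.utc).isoformat()
    event["status"] = int(event["status"])
    event["bytes"] = 0 if event["bytes"] == "-" else int(event["bytes"])
    return event


if __name__ == "__main__":
    sample = ('203.0.113.7 - - [10/Oct/2024:13:55:36 +0700] '
              '"GET /api/orders HTTP/1.1" 200 512')
    print(parse_access_line(sample))
```

The point of the sketch is the shape of the work: unstructured text goes in, typed fields and a normalized timestamp come out, and lines that fail to parse need their own handling.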
From my experience: set JVM heap to no more than half your server RAM, and never more than 31GB. On a 32GB RAM server, set -Xms16g -Xmx16g for Elasticsearch. Somewhere just below 32GB (the exact cutoff depends on the JVM and platform), the JVM switches from compressed object pointers to uncompressed ones, which wastes heap space and slows garbage collection. I discovered this the hard way after setting a 34GB heap on a 64GB server and seeing GC pause times triple compared to a 31GB heap on the same server. Set heap in jvm.options (on recent versions, a custom file under config/jvm.options.d/) rather than through environment variables, so the setting is persistent and applies however the service is started.
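The sizing rule fits in a few lines. This is a sketch of the rule of thumb described above, not an official Elastic tool; the function names and the constant are mine.

```python
# Rule-of-thumb ceiling from the text: stay at or below 31GB of heap.
COMPRESSED_OOPS_CEILING_GB = 31


def recommended_heap_gb(ram_gb: int) -> int:
    """Half of RAM, capped at the compressed-pointer ceiling."""
    if ram_gb < 2:
        raise ValueError("need at least 2GB of RAM to size a heap this way")
    return min(ram_gb // 2, COMPRESSED_OOPS_CEILING_GB)


def jvm_heap_options(ram_gb: int) -> list:
    """Lines to place in jvm.options; min and max heap are kept equal."""
    heap = recommended_heap_gb(ram_gb)
    return [f"-Xms{heap}g", f"-Xmx{heap}g"]


if __name__ == "__main__":
    for ram in (16, 32, 64, 128):
        print(ram, "GB RAM ->", " ".join(jvm_heap_options(ram)))
```

For a 64GB server this yields a 31GB heap rather than 32GB, which is the case that bit me.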
Elasticsearch's Index Lifecycle Management (ILM) is essential for production operations. Without ILM, indices grow indefinitely until the disk fills up. ILM defines phases: hot (active writes, replicas enabled), warm (read-only, can move to slower storage), cold (infrequent access, minimal resources), and delete (remove old data); recent versions also offer an optional frozen phase between cold and delete. A typical log index policy: hot phase for 7 days (active, replicated), warm phase for 30 days (moved to warm nodes, compressed), cold phase for 90 days (optional, minimal storage), delete phase at 365 days. Attach this policy to your index template so every new index automatically follows the lifecycle.
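A small sketch of how phase thresholds map an index's age to a phase. The thresholds are my reading of the typical policy above (warm from day 7, cold from day 37, delete at day 365), and other readings of those durations are possible. Real ILM also measures min_age from rollover when rollover is used and only acts on its polling interval, so treat this as a mental model rather than a simulation.

```python
# (phase name, min_age in days): assumed thresholds, see the note above.
PHASES = [("hot", 0), ("warm", 7), ("cold", 37), ("delete", 365)]


def phase_for_age(age_days: float, phases=PHASES) -> str:
    """Return the last phase whose min_age the index has reached."""
    if age_days < 0:
        raise ValueError("age cannot be negative")
    current = phases[0][0]
    for name, min_age_days in phases:
        if age_days >= min_age_days:
            current = name
    return current


if __name__ == "__main__":
    for age in (0, 6, 7, 36, 37, 364, 365):
        print(f"day {age}: {phase_for_age(age)}")
```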
# elasticsearch.yml — Production cluster configuration
cluster.name: prod-logs
node.name: es-node-1
network.host: 0.0.0.0
http.port: 9200
# Cluster discovery (3-node HA cluster)
discovery.seed_hosts:
- es-node-1.internal:9300
- es-node-2.internal:9300
- es-node-3.internal:9300
cluster.initial_master_nodes:
- es-node-1
- es-node-2
- es-node-3
# JVM heap (half of RAM, max 31GB)
# Set in jvm.options:
# -Xms16g
# -Xmx16g
# Index Lifecycle Management policy (via API)
# PUT _ilm/policy/logs-policy
# {
#   "policy": {
#     "phases": {
#       "hot":    { "min_age": "0ms",  "actions": { "rollover": { "max_age": "7d" } } },
#       "warm":   { "min_age": "7d",   "actions": { "shrink": { "number_of_shards": 1 } } },
#       "delete": { "min_age": "365d", "actions": { "delete": {} } }
#     }
#   }
# }
# Check cluster health (run from a shell; not part of elasticsearch.yml)
curl -u elastic:password "https://localhost:9200/_cluster/health?pretty"
Elasticsearch exposes a REST API for cluster health: GET /_cluster/health returns a green/yellow/red status. Green means all primary and replica shards are allocated. Yellow means all primary shards are allocated but some replicas are not (functional, but with degraded redundancy). Red means some primary shards are not allocated, so part of the data is unavailable and may be lost. Add Elasticsearch health monitoring to your Prometheus + Grafana stack using the prometheus-community elasticsearch_exporter (packaged as the prometheus-elasticsearch-exporter Helm chart), which exposes Elasticsearch metrics in Prometheus format. Alert on cluster status red (immediate action required) and on yellow lasting more than 10 minutes (investigation required).
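The alerting rule above can be written down as a tiny decision function. This is a sketch of the policy only; in practice the same logic would live in a Prometheus alert rule, and the level names ("page", "investigate", "none") are placeholders of mine.

```python
YELLOW_GRACE_MINUTES = 10  # from the text: yellow for more than 10 minutes


def alert_level(status: str, minutes_in_status: float) -> str:
    """Map a cluster health status and its duration to an alert level."""
    status = status.lower()
    if status == "red":
        return "page"  # immediate action required
    if status == "yellow":
        if minutes_in_status > YELLOW_GRACE_MINUTES:
            return "investigate"
        return "none"  # brief yellow is normal during restarts and relocation
    if status == "green":
        return "none"
    raise ValueError(f"unknown cluster status: {status!r}")


if __name__ == "__main__":
    print(alert_level("yellow", 3))    # short-lived yellow
    print(alert_level("yellow", 25))   # persistent yellow
    print(alert_level("red", 0))       # red always alerts
```

The grace period on yellow matters because a rolling restart or shard relocation turns the cluster yellow briefly without anything being wrong.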
┌──────────────────────────────────────────────────────┐
│               Elastic Stack Data Flow                │
├──────────────────────────────────────────────────────┤
│                                                      │
│  App Servers                                         │
│  [Filebeat] → reads logs from /var/log/              │
│       ↓                                              │
│  [Logstash] → grok parse + geoip + timestamp         │
│       ↓                                              │
│  [Elasticsearch Cluster] (3 nodes, HA)               │
│    ├── hot:  es-node-1 (SSD, active writes)          │
│    ├── warm: es-node-2 (HDD, read-only)              │
│    └── cold: es-node-3 (compressed, infrequent)      │
│       ↓                                              │
│  [Kibana] → dashboards, Discover, Alerting           │
└──────────────────────────────────────────────────────┘
Elasticsearch has a flood-stage disk watermark (default 95% disk usage). When a node exceeds it, Elasticsearch places a read-only block (index.blocks.read_only_allow_delete) on every index with a shard on that node, so that writes cannot fill the disk completely. When this happens, Logstash can no longer write to Elasticsearch, so events back up and are eventually dropped, and Kibana shows an error banner. Recovery requires freeing disk space. On versions before 7.4 you must then clear the read-only block manually via the index settings API; 7.4 and later release it automatically once usage falls back below the high watermark. I hit this at 2 AM when log volume spiked unexpectedly. Set disk usage alerts at 75% and 85% thresholds, not just 90%. Configure ILM delete phases conservatively and verify they're running. Elasticsearch disk fill is silent until it's an emergency.
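A sketch of the threshold check, using the alert levels from this section and the default flood-stage value. The function names are mine, and shutil's used/total figure is only an approximation of how Elasticsearch itself measures disk usage. The last constant is the settings body that clears the read-only block when it has to be done by hand (sent with PUT to an index's _settings endpoint).

```python
import json
import shutil

WARNING_PERCENT = 75      # first alert threshold suggested in the text
CRITICAL_PERCENT = 85     # second alert threshold suggested in the text
FLOOD_STAGE_PERCENT = 95  # Elasticsearch's default flood-stage watermark

# Body for PUT /<index>/_settings to remove the read-only block manually.
CLEAR_READ_ONLY_BODY = json.dumps({"index.blocks.read_only_allow_delete": None})


def disk_used_percent(path: str) -> float:
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def classify_disk(used_percent: float) -> str:
    """Map a usage percentage to an alert level."""
    if used_percent >= FLOOD_STAGE_PERCENT:
        return "flood-stage"
    if used_percent >= CRITICAL_PERCENT:
        return "critical"
    if used_percent >= WARNING_PERCENT:
        return "warning"
    return "ok"


if __name__ == "__main__":
    used = disk_used_percent(".")  # point this at the Elasticsearch data path
    print(f"{used:.1f}% used -> {classify_disk(used)}")
    print("manual unblock body:", CLEAR_READ_ONLY_BODY)
```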
Kibana's Discover view for ad-hoc log searching and Dashboard view for operational dashboards cover most day-to-day needs. For an Nginx access log index, create dashboards showing: request rate by status code, top slow endpoints (95th percentile response time), top error paths (5xx status), geographic distribution of visitors (using geoip enrichment in Logstash), and bot/crawler traffic identification. Kibana's Lens editor makes dashboard creation accessible without deep Elasticsearch query knowledge. Export dashboards as JSON and store in version control — dashboards are configuration and should be reproducible.
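For the export step, Kibana's saved objects export API can be scripted. The sketch below only builds the request and does not send it; the Kibana URL is a placeholder, authentication is deliberately left out, and the output file name in the comment is hypothetical.

```python
import json
import urllib.request


def build_dashboard_export_request(kibana_url: str) -> urllib.request.Request:
    """Build (but do not send) a request that exports all dashboards."""
    body = json.dumps({
        "type": "dashboard",
        "includeReferencesDeep": True,  # also export visualizations they use
    }).encode("utf-8")
    return urllib.request.Request(
        url=kibana_url.rstrip("/") + "/api/saved_objects/_export",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "kbn-xsrf": "true"},
    )


if __name__ == "__main__":
    # Placeholder host; add credentials before sending in a real environment.
    request = build_dashboard_export_request("https://kibana.internal:5601")
    print(request.get_method(), request.full_url)
    # Sending it with urllib.request.urlopen(request) returns NDJSON, which
    # could be written to a file such as dashboards.ndjson and committed.
```

The response is newline-delimited JSON rather than a single JSON document, which diffs reasonably well in version control.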
Recent Elasticsearch versions (8.x) enable security by default: TLS for inter-node communication and HTTP, plus password-based authentication. The xpack.security.enabled setting defaults to true in Elasticsearch 8.x. During installation or first startup, 8.x generates a password for the elastic user and prints it once; use the elasticsearch-reset-password utility to set or rotate passwords for built-in users such as elastic, kibana_system, and logstash_system (the older elasticsearch-setup-passwords utility is deprecated in 8.x). Create dedicated Elasticsearch roles for each component: a logstash_writer role with write access only to log indices, and a read-only role for dashboard viewers (the built-in viewer role covers the common case). Never use the elastic superuser account for application access, because it has full cluster admin rights.
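As an illustration of a narrowly scoped role, here is a sketch that builds a role body for the security API (PUT /_security/role/logstash_writer). The index pattern and the exact privilege set are my assumptions. If Logstash manages its own index templates or ILM policies it needs additional cluster privileges, which this sketch assumes are handled elsewhere.

```python
import json


def logstash_writer_role(index_patterns) -> dict:
    """Role body granting write access to the given index patterns only."""
    return {
        "cluster": ["monitor"],
        "indices": [
            {
                "names": list(index_patterns),
                # Enough to create indices and write documents, nothing more.
                "privileges": ["create_index", "write"],
            }
        ],
    }


if __name__ == "__main__":
    # "logs-*" is a hypothetical pattern; use whatever your indices are named.
    print(json.dumps(logstash_writer_role(["logs-*"]), indent=2))
```

The design point is what the role leaves out: no read, no delete_index, and no access to indices outside the log pattern.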
AWS forked Elasticsearch 7.10 in 2021 to create OpenSearch after Elastic moved new releases from Apache 2.0 to the SSPL and the Elastic License. OpenSearch is Apache 2.0 licensed, meaning it's freely usable for commercial purposes without licensing concerns. For new self-hosted deployments where Elastic's features beyond the open-source subset aren't required, OpenSearch is a viable alternative with a growing ecosystem. However, Elastic Stack 8.x under the Elastic License 2.0 is free for self-hosted use; the license's main restriction is on offering Elasticsearch as a managed service to third parties. For internal self-hosted deployments, the Elastic License 2.0 generally poses no obstacle. My recommendation: use Elastic Stack if you need Elastic's proprietary features (ML alerting, security analytics) and are comfortable with the license; use OpenSearch if you prefer Apache 2.0 licensing or need AWS integration.
Sources & Further Reading