What cgroups v2 Really Does When You Set Docker Memory Limits

Photo by Rémy

The first time one of my containers got OOM-killed in production, the logs told me nothing. The app just vanished, Docker restarted it, and the only evidence was a single line in dmesg about the kernel sacrificing a process. I had set --memory=512m believing it was a polite suggestion to Docker. It is not. It is a kernel-enforced hard limit written into a file, and the enforcement mechanism — control groups v2 — is worth understanding before it teaches you the hard way.
This post unpacks what actually happens on a modern Linux host when you set Docker memory and CPU limits: which cgroup files get written, what the kernel does at each threshold, why CPU limits throttle while memory limits kill, and how to size limits for Node.js and JVM workloads so the runtime and the kernel stop fighting each other. Everything here applies to any recent distro running cgroups v2 — the default since roughly 2021 on Ubuntu, Debian, and Fedora.
Control groups are the kernel's mechanism for partitioning resources — CPU time, memory, IO — among process trees. Docker does not implement resource limiting itself; when you pass --memory or --cpus, the daemon creates a cgroup for your container and writes your numbers into the corresponding interface files. The kernel does the rest.
You can watch this happen. Start a container with limits and read its cgroup directly:
# every docker run flag becomes a file in the cgroup tree
docker run -d --name api --memory=512m --cpus=1.5 myapp:latest
CID=$(docker inspect -f '{{.Id}}' api)
cat /sys/fs/cgroup/system.slice/docker-$CID.scope/memory.max
# 536870912 <- your --memory=512m, in bytes
cat /sys/fs/cgroup/system.slice/docker-$CID.scope/cpu.max
# 150000 100000 <- your --cpus=1.5: 150ms of CPU per 100ms window
That cpu.max line is the most useful mental model in this whole topic: --cpus=1.5 means the container's processes may consume at most 150 milliseconds of CPU time in every 100-millisecond window, across all cores combined. There is no core pinning involved. It is a bandwidth quota enforced by the scheduler, which is why the effects show up as throttling rather than as missing CPUs.
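The quota arithmetic is easy to check by hand. Here is a minimal sketch, assuming the default 100000-microsecond period shown in the cpu.max output above; cpus_to_cpu_max is a hypothetical helper name of mine, not a Docker command:

```shell
#!/bin/sh
# Hypothetical helper: convert a --cpus value into the "quota period" pair
# found in cpu.max. Assumes the default 100000 us (100 ms) period unless a
# second argument overrides it.
cpus_to_cpu_max() {
    cpus="$1"
    period="${2:-100000}"
    # quota (us) = cpus * period (us); awk handles the fractional CPU count,
    # and the + 0.5 rounds to the nearest microsecond before truncation.
    awk -v c="$cpus" -v p="$period" 'BEGIN { printf "%d %d\n", c * p + 0.5, p }'
}

cpus_to_cpu_max 1.5    # prints: 150000 100000
cpus_to_cpu_max 0.25   # prints: 25000 100000
```

Reading it the other way round is just as useful: divide the first number in cpu.max by the second and you get the CPU count the container is actually allowed.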
Cgroups v2 gives memory control several distinct knobs, and Docker's flags map onto them directly:
| Docker flag | cgroup v2 file | Kernel behavior at the threshold |
|---|---|---|
| --memory | memory.max | Hard cap. The kernel first tries to reclaim pages; if usage cannot be brought down, the OOM killer is invoked inside the cgroup and your process dies with exit code 137. |
| (no direct flag) | memory.high | Soft ceiling. Processes are throttled and put under heavy reclaim pressure, but never OOM-killed. Orchestrators use this for graceful degradation before the hard limit. |
| --memory-swap | memory.swap.max | Controls swap on top of RAM. Setting --memory-swap equal to --memory disables swap for the container — usually what you want for latency-sensitive services. |
The crucial subtlety: memory.max counts page cache, not just your process heap. A container that reads large files can show high memory usage that is actually reclaimable cache, and conversely your app can be OOM-killed while its heap looks fine because anonymous memory plus cache crossed the line together. When a limit seems to fire too early, check memory.stat inside the cgroup before blaming your app.
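To see how much of the charged memory is heap-like and how much is cache, the anon and file fields of memory.stat are usually the place to start. The following is a rough sketch, not a full accounting (kernel memory and other fields are ignored); mem_breakdown is a hypothetical helper name, and the example path assumes the systemd cgroup layout used earlier:

```shell
#!/bin/sh
# Hypothetical helper: split a cgroup's charged memory into anonymous pages
# and page cache, in MiB, using the "anon" and "file" fields of memory.stat.
# Argument: path to a memory.stat file.
mem_breakdown() {
    awk '
        $1 == "anon" { anon = $2 }   # heap, stacks, other anonymous memory (bytes)
        $1 == "file" { file = $2 }   # page cache charged to the cgroup (bytes)
        END {
            printf "anon_mib=%d file_mib=%d\n", anon / 1048576, file / 1048576
        }
    ' "$1"
}

# Example on a live host:
# mem_breakdown "/sys/fs/cgroup/system.slice/docker-$CID.scope/memory.stat"
```

A large file_mib next to a modest anon_mib suggests the usage is mostly reclaimable cache; the reverse points at the application itself.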
Exit code 137 with no application error is the OOM killer's signature. Confirm with docker inspect (State.OOMKilled will be true) and resist the temptation of --oom-kill-disable. Disabling the killer on a memory-capped container does not free memory; on cgroup v1 it deadlocks the container in permanent reclaim instead of restarting it cleanly, and on cgroup v2 hosts Docker cannot honor the flag at all and discards it with a warning.
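The kernel also keeps its own tally of kills in the cgroup's memory.events file, which helps when the victim was a child process and the container itself never exited. This is a small sketch for reading it; oom_kills is a hypothetical helper name, and the commented paths assume the systemd cgroup layout and the container name api from the earlier example:

```shell
#!/bin/sh
# Hypothetical helper: print the oom_kill counter from a cgroup v2
# memory.events file (prints 0 if the field is absent).
# Argument: path to a memory.events file.
oom_kills() {
    awk '$1 == "oom_kill" { n = $2 } END { print n + 0 }' "$1"
}

# Example on a live host:
# docker inspect -f '{{.State.OOMKilled}}' api
# oom_kills "/sys/fs/cgroup/system.slice/docker-$CID.scope/memory.events"
```

The counter belongs to the cgroup, so it only exists while the container's cgroup does; read it before the container is removed.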
Memory is non-negotiable — a page either exists or it does not — so the kernel kills. CPU is time-sliced, so the kernel can simply make you wait. Three Docker flags cover the practical space:
Throttling is visible and measurable: cpu.stat inside the cgroup reports nr_throttled and throttled_usec counters. A web service can sit at 40 percent average CPU and still have terrible p99 latency because it bursts into its quota ceiling on every request spike and spends the rest of each period frozen.
Rule of thumb from my own dashboards: for latency-sensitive services, alert on throttling counters, not on CPU usage. cAdvisor exposes container_cpu_cfs_throttled_periods_total to Prometheus; sustained throttling above a few percent of periods means the quota is too tight even if average utilization looks comfortable.
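If you do not have Prometheus wired up yet, the same ratio can be read straight from cpu.stat. This is a minimal sketch that assumes the nr_periods and nr_throttled fields are present (they appear when the cpu controller is enabled for the cgroup); throttle_pct is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: percentage of enforcement periods in which the cgroup
# was throttled, computed from nr_periods and nr_throttled in cpu.stat.
# Argument: path to a cpu.stat file.
throttle_pct() {
    awk '
        $1 == "nr_periods"   { periods = $2 }
        $1 == "nr_throttled" { throttled = $2 }
        END {
            if (periods > 0) printf "%.1f\n", 100 * throttled / periods
            else print "0.0"   # no periods recorded yet, nothing to report
        }
    ' "$1"
}

# Example on a live host (path assumes the systemd cgroup driver layout):
# throttle_pct "/sys/fs/cgroup/system.slice/docker-$CID.scope/cpu.stat"
```

Both counters are cumulative since the cgroup was created, so for a current picture sample the file twice and compare the differences rather than the raw totals.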
If your runbooks or Stack Overflow answers date from the v1 era, three differences matter operationally:
The kernel enforces the cap, but your runtime allocates against assumptions. Older runtimes read host memory and size their heaps accordingly — a JVM on a 32 GB host inside a 1 GB container would happily plan a multi-gigabyte heap, then die at 137. The fix is telling the runtime the truth:
# Node.js: heap limit must fit inside memory.max
docker run -d --memory=512m \
-e NODE_OPTIONS="--max-old-space-size=384" myapp
# JVM: let it read the cgroup instead of guessing
docker run -d --memory=1g \
-e JAVA_TOOL_OPTIONS="-XX:MaxRAMPercentage=75" myservice
Modern JVMs (10+) are container-aware by default and read the cgroup, though cgroup v2 support specifically arrived in JDK 15 and was backported to later 11 and 8 update releases, so check your exact version. The percentage is still worth setting explicitly: the gap between the heap limit and memory.max must hold thread stacks, metaspace, native buffers, and page cache. My defaults are 75 percent of the container limit for both Node's max-old-space-size and the JVM's MaxRAMPercentage, narrowing only after profiling.
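The 75 percent rule is plain integer arithmetic on the byte value in memory.max. A sketch, assuming the limit is passed in as bytes; heap_mib is a hypothetical helper name, and the 75 default is my own starting point rather than anything the runtimes require:

```shell
#!/bin/sh
# Hypothetical helper: derive a heap size in MiB as a percentage of a memory
# limit given in bytes (the value found in memory.max).
# Usage: heap_mib <limit_bytes> [percent, default 75]
heap_mib() {
    limit_bytes="$1"
    pct="${2:-75}"
    # memory.max reads "max" when no limit is set; refuse anything non-numeric.
    case "$limit_bytes" in
        ''|*[!0-9]*) echo "no numeric limit: $limit_bytes" >&2; return 1 ;;
    esac
    echo $(( limit_bytes * pct / 100 / 1048576 ))
}

heap_mib 536870912      # prints: 384  (the --max-old-space-size used above)
heap_mib 1073741824 75  # prints: 768
```

For the JVM the flag does this calculation for you; for Node the result is the number to put in --max-old-space-size.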
How I set limits for a new service, in order:
On Docker Swarm the same kernel mechanics apply through the deploy.resources block: limits become the cgroup caps described here, and reservations drive the scheduler's placement decisions. Setting reservations honestly is what stops Swarm from packing three memory-hungry services onto one 4 GB node and letting cgroups referee the resulting fight.
Docker resource flags stop being mysterious once you see them as what they are: numbers in cgroup v2 interface files, enforced by a kernel with very predictable rules. Memory crosses memory.max and something dies; CPU crosses cpu.max and something waits. Size the limits from measurements, make your runtime agree with the kernel about how much memory exists, and alert on OOM kills and throttling rather than on raw usage. Containers without limits are not generous — they are just deferring the negotiation to the worst possible moment.
Sources and further reading