Containers aren't real. They are just processes lying to the kernel. CVE-2019-5736 exploits this lie by tricking the host's container runtime (`runc`) into exposing its own binary file descriptor to the container it is managing. An attacker can overwrite the `runc` binary on the host with a malicious payload, achieving root execution on the host system the next time `runc` is used.
A fundamental design flaw in how `runc` handles file descriptors allows a malicious container to overwrite the host `runc` binary, resulting in complete host compromise upon subsequent execution.
Let's dispel a myth before we start: a Docker container is not a Virtual Machine. It does not have its own kernel. It is simply a process on the host system that has been put in a "time-out" corner using Linux Namespaces and cgroups. The babysitter enforcing this time-out is `runc`.
`runc` is the low-level CLI tool that actually spawns and runs containers (used by Docker, Kubernetes, containerd, etc.). When you run `docker exec`, `runc` has to effectively dive into the shark tank (the container) to set up the environment for your command.
Here is the problem: when `runc` dives into the tank to check on the sharks, it brings its ID badge (its own binary executable) with it. CVE-2019-5736 is the story of how the sharks learned to steal that badge and walk out the front door.
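The vulnerability relies on a quirky feature of the Linux kernel: `/proc/self/exe`. This looks like a symbolic link that points to the binary of the currently running process. If you are running `/bin/ls`, `/proc/self/exe` points to `/bin/ls`. Unlike an ordinary symlink, though, the kernel resolves it directly to the underlying file instead of re-walking a path, so it still reaches the original binary from inside a different mount namespace.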
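To make the `/proc/self/exe` behavior concrete, here is a minimal sketch that asks the kernel which binary backs the current process. The helper name `current_exe_path` is hypothetical, and the path returned by `readlink()` is only a human-readable label; opening the link hands you the file itself.

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Resolve /proc/self/exe into buf (hypothetical helper for illustration).
 * Returns the length of the path, or -1 on error.
 */
ssize_t current_exe_path(char *buf, size_t size)
{
    if (size == 0)
        return -1;

    /* readlink() does not NUL-terminate, so keep one byte in reserve. */
    ssize_t len = readlink("/proc/self/exe", buf, size - 1);
    if (len < 0)
        return -1;

    buf[len] = '\0';
    return len;
}
```

Calling this from a small `main` and printing `buf` should show the absolute path of the program you just ran.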
When you execute a command inside a container using `docker exec`, the `runc` process on the host starts up, enters the container's namespaces, and then executes the requested command. However, for a brief window after entering the namespace but before executing the target command, the `runc` process is susceptible to manipulation.
If the attacker inside the container plants a symbolic link or a modified library that `runc` interacts with, they can trick `runc` into executing itself: the target resolves through `/proc/self/exe` back to the host `runc` binary. Once the attacker has a file descriptor (FD) pointing to the host's `runc` binary, they win. They can't overwrite it while it's running (Linux returns `ETXTBSY`, "Text file busy"), but they can hold the door open and wait for it to leave.
The fix for this issue was both elegant and slightly paranoid. The developers realized that they couldn't trust the container environment with the real runc binary. So, they decided to send in a clone.
In the patched version, before `runc` enters the container's namespace, it copies its own binary into a temporary, anonymous file in memory (using `memfd_create` if available). It then re-executes itself from this memory-resident clone.
Here is the logic flow in the patch:
```c
// From libcontainer/nsenter/cloned_binary.c (simplified)
// Create a memfd (anonymous file in RAM)
int fd = memfd_create("runc_cloned", MFD_CLOEXEC | MFD_ALLOW_SEALING);
// Copy the current runc binary into the memfd; the last argument is the
// maximum number of bytes to transfer, not the running total
sent = sendfile(fd, self_fd, NULL, RUNC_SENDFILE_MAX);
// Seal the memfd so it cannot be modified
fcntl(fd, F_ADD_SEALS, F_SEAL_SEAL | F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE);
// Execute the clone
fexecve(fd, argv, envp);
```

By the time `runc` enters the hostile container environment, `/proc/self/exe` points to this throwaway memory file, not the critical binary on the host disk. If the attacker goes after it, they are just vandalizing a temporary ghost in RAM.
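The sealing step is what makes the clone tamper-proof, and it can be observed in isolation. The sketch below is not runc code: it creates its own small memfd (the name `demo_clone` and the helper `write_after_seal_errno` are made up for illustration), applies the same four seals, and then attempts a write. It assumes Linux with a glibc that exposes `memfd_create`.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Create a memfd, fill it, seal it, then try to overwrite its first byte.
 * Returns the errno of the rejected write, 0 if the write unexpectedly
 * succeeded, or -1 if the setup itself failed.
 */
int write_after_seal_errno(void)
{
    const char payload[] = "pretend this is the runc binary";

    int fd = memfd_create("demo_clone", MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (fd < 0)
        return -1;

    if (write(fd, payload, sizeof payload) != (ssize_t)sizeof payload) {
        close(fd);
        return -1;
    }

    /* Same seal set as in the patch: no writes, no resizing, no new seals. */
    if (fcntl(fd, F_ADD_SEALS,
              F_SEAL_SEAL | F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE) < 0) {
        close(fd);
        return -1;
    }

    /* Overwrite attempt at offset 0, which F_SEAL_WRITE should reject. */
    int result = 0;
    if (pwrite(fd, "x", 1, 0) < 0)
        result = errno;

    close(fd);
    return result;
}
```

On a kernel with memfd sealing support, the write should be refused with `EPERM`, which is the property the patched `runc` relies on.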
The exploitation process is a race against time and file locks. It requires a malicious container image or a compromised container where the attacker has root privileges (inside the container).
Here is the attack chain:
1. The attacker replaces a binary inside the container (e.g., `/bin/sh` or the container entrypoint) with a symbolic link to `/proc/self/exe`.
2. An administrator or automated tool runs `docker exec -it malicious-container /bin/sh`.
3. `runc` enters the container and tries to execute `/bin/sh`. Because of the symlink, `runc` actually re-executes itself inside the container context.
4. The attacker's code notices that `runc` has started. It opens `/proc/[runc-pid]/exe` for reading. This gives the attacker a file descriptor (FD) pointing to the real host `runc` binary.
5. Opening that file for writing fails at first with `ETXTBSY` because `runc` is still running. The attacker script sits in a loop, hammering the `open()` syscall.
6. Once the `runc` process exits, the write lock is released. The attacker's script successfully opens the FD for writing and overwrites the host `runc` binary with a malicious payload (e.g., a reverse shell wrapper).

This isn't a simple data leak; this is infrastructure destruction. By overwriting `runc`, the attacker has effectively backdoored the entire container host.
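The kernel behavior behind the "hold the door open" step is that a descriptor opened read-only can later be re-opened for writing through `/proc/self/fd/<n>`, provided the caller has write permission on the file. The sketch below shows only that mechanism, on a scratch file it creates itself under `/tmp`; the function name and file name are hypothetical and nothing here touches `runc`.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Returns 1 if data written through the re-opened descriptor is visible
 * through the original read-only descriptor, 0 if not, -1 if a step fails.
 */
int reopen_readonly_fd_for_write(void)
{
    char path[] = "/tmp/fd_reopen_demo_XXXXXX";
    int tmp = mkstemp(path);
    if (tmp < 0)
        return -1;
    if (write(tmp, "original", 8) != 8) {
        close(tmp);
        unlink(path);
        return -1;
    }
    close(tmp);

    /* Keep only a read-only descriptor, then remove the path entirely. */
    int rfd = open(path, O_RDONLY);
    unlink(path);
    if (rfd < 0)
        return -1;

    /* Re-open the same file for writing via its /proc/self/fd entry. */
    char fd_link[64];
    snprintf(fd_link, sizeof fd_link, "/proc/self/fd/%d", rfd);
    int wfd = open(fd_link, O_WRONLY | O_TRUNC);
    if (wfd < 0) {
        close(rfd);
        return -1;
    }
    ssize_t written = write(wfd, "replaced", 8);
    close(wfd);

    /* Read back through the original read-only descriptor. */
    char buf[16] = {0};
    ssize_t got = pread(rfd, buf, sizeof buf - 1, 0);
    close(rfd);
    if (written != 8 || got < 0)
        return -1;

    return strcmp(buf, "replaced") == 0 ? 1 : 0;
}
```

Note that the re-open goes through a normal permission check, which is why the mitigations that take write permission away from the container's root user are effective.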
The next time any container is started, stopped, or exec'd into on that machine—whether by an admin or an automated orchestration tool like Kubernetes—the malicious runc binary will execute.
Typically, the malicious payload will spawn a reverse shell to the attacker and then execute the original runc logic so nobody notices anything is wrong. In a Kubernetes cluster, compromising one node often leads to compromising the orchestration credentials, allowing the attacker to pivot to every other node in the fleet.
The primary fix is to update runc (and by extension Docker/containerd). However, there are architectural defenses that render this class of bug irrelevant.
1. User Namespaces (userns-remap): This is the closest thing to a silver bullet. By mapping root inside the container to a non-privileged UID on the host (e.g., one from a subordinate UID range), the attack fails at the write step. Even if the attacker gets a file descriptor to the host binary, the kernel says, "Nice try, but you don't have write permissions to this file."
2. Read-Only Root Filesystem: Running containers with a read-only root filesystem prevents the initial symlink setup, though sophisticated attackers might find other writable locations to stage the attack.
3. SELinux: If you are on a Red Hat based system, SELinux policies for containers usually prevent the container process from writing to `runc`, effectively blocking the exploit.
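To check whether a process is actually running with a remapped root, you can read `/proc/self/uid_map` from inside the container. The sketch below is an illustrative helper (the name `root_is_remapped` is made up) and only inspects the first mapping line, whose format is "UID inside the namespace, UID outside, range length". The "outside" value is relative to the parent user namespace, which is the host only when namespaces are not nested.

```c
#include <stdio.h>

/*
 * Returns 1 if UID 0 in this namespace maps to a non-zero outside UID,
 * 0 if the first mapping line does not show a remapped root,
 * and -1 if the map cannot be read or parsed.
 */
int root_is_remapped(void)
{
    FILE *f = fopen("/proc/self/uid_map", "r");
    if (!f)
        return -1;

    unsigned long inside, outside, count;
    int fields = fscanf(f, "%lu %lu %lu", &inside, &outside, &count);
    fclose(f);

    if (fields != 3)
        return -1;

    return (inside == 0 && outside != 0) ? 1 : 0;
}
```

A result of 0 inside a container suggests that container root is host root, which is the configuration this exploit needs.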
CVSS:3.1/AV:L/AC:L/PR:N/UI:R/S:C/C:H/I:H/A:H

| Product | Affected Versions | Fixed Version |
|---|---|---|
| runc (OpenContainers) | <= 1.0-rc6 | 1.0-rc7 |
| Docker (Docker Inc.) | < 18.09.2 | 18.09.2 |

| Attribute | Detail |
|---|---|
| CWE ID | CWE-269 |
| Attack Vector | Local (requires container execution) |
| CVSS | 8.6 (High) |
| EPSS Score | 55.56% |
| Impact | Container Escape / Host Root Compromise |
| Exploit Status | Weaponized / PoC Available |
CWE-269: Improper Privilege Management