A flaw in runc's `WORKDIR` handling allows a malicious container to escape its sandbox. By racing runc's initialization process, the container can trick it into opening a file handle to the host's filesystem. This provides a direct path to host access, leading to a full container escape and RCE on the node.
CVE-2024-21626 is a critical vulnerability in runc, the low-level container runtime underpinning Docker, Kubernetes, and other major containerization platforms. The flaw stems from a race condition and file descriptor leak when processing the `WORKDIR` instruction for a new container or when using `runc exec`. A malicious actor can craft a container image that tricks runc into retaining a handle to the host filesystem, allowing the container to break out of its isolation and achieve full remote code execution on the underlying host machine, completely shattering the security boundary of containerization.
Before we dive into the guts of this bug, let's talk about the unsung hero—or in this case, the villain—of the container world: runc. Most developers interact with Docker or Kubernetes, blissfully unaware of the grimy engine room that actually makes their containers go. That engine room is runc. It’s the low-level tool that takes a container configuration and actually spawns the isolated process, setting up all the Linux kernel magic like namespaces, cgroups, and capabilities.
Think of Docker as the ship's captain giving orders, and runc as the chief engineer who actually turns the valves and shovels coal into the furnace. It's a small, focused, command-line tool written in Go, and its singular purpose is to run containers according to the Open Container Initiative (OCI) specification. Because it's responsible for creating the very sandbox that isolates containers, it operates with god-like privileges on the host system. It has to.
This privileged position is precisely what makes it such a juicy target. A flaw in runc isn't just a bug in some user-space application; it's a crack in the very foundation of container security. An attacker who can manipulate runc doesn't just compromise a container; they compromise the entire host, and by extension, every other container running on it. The whole security model of modern cloud infrastructure rests on the assumption that this little Go binary does its job flawlessly.
Unfortunately, as we're about to see, 'flawless' is a word that rarely applies to software written by humans. The developers of runc had to perform a delicate, high-wire ballet of system calls, file descriptor management, and process state manipulation. One tiny slip, one misstep in the sequence of operations, and the entire performance comes crashing down. CVE-2024-21626 is the story of one such slip.
At its heart, this vulnerability is a classic case of TOCTOU (time-of-check to time-of-use), wrapped in the arcane complexity of Linux file descriptors. To understand the bug, you first need to appreciate how profoundly weird the `/proc` filesystem is on Linux. Specifically, the `/proc/self/fd/` directory. This isn't a normal directory; it's a magic window into the file descriptors held open by the current process. Each entry, like `/proc/self/fd/7`, is a symbolic link to the actual file or resource that file descriptor 7 points to.
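To make the `/proc/self/fd` behaviour concrete, here is a minimal Go sketch (Linux-only, assuming procfs is mounted at `/proc`). The helper name `openAndResolve` is made up for this illustration and is not part of runc.

```go
package main

import (
	"fmt"
	"os"
)

// openAndResolve opens path, then asks procfs what the resulting file
// descriptor points to by reading the /proc/self/fd/<n> symbolic link.
// Hypothetical helper for illustration only.
func openAndResolve(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	link := fmt.Sprintf("/proc/self/fd/%d", f.Fd())
	return os.Readlink(link)
}

func main() {
	target, err := openAndResolve("/")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	// The link resolves to whatever the descriptor was opened on.
	fmt.Println("descriptor points to:", target)
}
```

The point to take away is that the link target is decided by the open descriptor, not by the name used to reach it.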
So, what went wrong in runc? The vulnerability lies in the sequence of events when runc sets up the working directory for a container, either during creation or via `runc exec`. The process, simplified, looks something like this: runc starts up with high privileges, opens a handle to the container's root filesystem (which is just a directory on the host), and prepares to pivot into the container's isolated world.
A key step in this process is honoring the `WORKDIR` instruction from a Dockerfile. This instruction tells the container what its default directory should be. The vulnerable version of runc would read this path from the container's configuration and issue a `chdir` call to change its own current working directory. The catastrophic mistake was *when* it did this: it changed directory before it had fully sandboxed itself and closed all its privileged file descriptors pointing to the host system.
This created a race condition. A malicious container could specify a `WORKDIR` of, for instance, `/proc/self/fd/7`. If the container's code could win the race against runc's initialization, it could ensure that when runc performed the `chdir`, file descriptor 7 was still a handle to a directory on the host filesystem (such as `/`). runc, now tricked, would have its current working directory set to that host directory. From there, any subsequent relative path operations would be relative to the host, not the container. The jailbreak was complete before the cell door was even fully shut.
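The effect of changing directory through a descriptor link can be reproduced harmlessly outside any container. The sketch below (Linux-only) uses a scratch directory as a stand-in for the host directory; `chdirViaFD` and `runDemo` are hypothetical names for this illustration, not runc functions.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// chdirViaFD opens dir and then changes the working directory using the
// /proc/self/fd/<n> path rather than the directory's real name. It returns
// the working directory reported afterwards.
func chdirViaFD(dir string) (string, error) {
	handle, err := os.Open(dir)
	if err != nil {
		return "", err
	}
	defer handle.Close()

	magic := fmt.Sprintf("/proc/self/fd/%d", handle.Fd())
	if err := os.Chdir(magic); err != nil {
		return "", err
	}
	return os.Getwd()
}

// runDemo creates a scratch directory, enters it through its descriptor
// link, and returns both the expected and the observed working directory.
func runDemo() (expected, cwd string, err error) {
	original, err := os.Getwd()
	if err != nil {
		return "", "", err
	}
	scratch, err := os.MkdirTemp("", "fd-demo-")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(scratch)
	defer os.Chdir(original) // restore the caller's directory first

	expected, err = filepath.EvalSymlinks(scratch)
	if err != nil {
		return "", "", err
	}
	cwd, err = chdirViaFD(scratch)
	return expected, cwd, err
}

func main() {
	expected, cwd, err := runDemo()
	if err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Println("directory behind the descriptor:", expected)
	fmt.Println("working directory after chdir:  ", cwd)
}
```

The process never names the scratch directory in the `chdir` call, yet it ends up there. Swap the scratch directory for a leaked host handle and the same mechanism places a process on the host filesystem.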
Talk is cheap. Let's look at the code. The patch that fixed CVE-2024-21626 is beautifully simple, which is often a sign of a truly devious bug. It’s all about reordering operations to eliminate the race window. Let's dissect the logic.
In the vulnerable code, the working directory was processed something like this within the libcontainer setup logic:
```go
// BEFORE THE PATCH - libcontainer/standard_init_linux.go
// ... some setup ...
if config.Cwd != "" {
	// THIS IS THE VULNERABLE STEP!
	// The Chdir happens while runc still holds sensitive host FDs.
	// If config.Cwd is malicious (e.g., "/proc/self/fd/X"),
	// we are now operating on the host filesystem.
	if err := syscall.Chdir(config.Cwd); err != nil {
		return fmt.Errorf("cannot chdir to cwd (%q) - %w", config.Cwd, err)
	}
}
// ... more setup, including dropping privileges and closing FDs ...
```

The fundamental design flaw is right there. The `syscall.Chdir(config.Cwd)` call acts on user-controlled input (`config.Cwd`) before the process has been fully sanitized. The attacker provides the path, runc blindly follows, and suddenly its internal state points outside the intended container root.
Now, let's look at the fix. The developers realized they couldn't trust the path until after the process was safely inside the container's new root filesystem. The patch refactors the logic significantly.
```go
// AFTER THE PATCH (simplified illustration)
// ... setup, but NO Chdir yet ...
// The process now enters the container's mount namespace and performs a pivot_root.
// At this point, the process's view of the filesystem is locked to the container.
if err := finalizeNamespace(config); err != nil {
	return err
}
// ... inside finalizeNamespace or a similar subsequent function ...
// NOW it's safe to change the directory.
if config.Cwd != "" {
	// The process is already jailed, and no host FDs should remain open
	// for a malicious config.Cwd to point at.
	if err := syscall.Chdir(config.Cwd); err != nil {
		return fmt.Errorf("cannot chdir to cwd (%q) - %w", config.Cwd, err)
	}
}
```

The fix is to delay the `chdir` until after `pivot_root` and the finalization of the mount namespace. Ordering alone is not the whole story, though: an open file descriptor keeps referring to whatever it was opened on, regardless of which root the process currently sees, so a leaked host handle would still resolve to the host. That is why runc 1.1.12 also closes its internal descriptors (or marks them close-on-exec) before handing control to the container, and verifies that the resulting working directory is inside the container. With both pieces in place, a `WORKDIR` of `/proc/self/fd/7` has nothing on the host left to point at. The race is over, and the house wins. This reordering, combined with strict descriptor hygiene, is the difference between a secure sandbox and a wide-open door.
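Descriptor hygiene is the other half of closing this kind of hole: a privileged process should not carry stray handles across an `execve`. The following is a minimal Linux-only sketch of that idea, not runc's actual implementation. The names `markCloexecFrom`, `hasCloexec` and `demoCloexec` are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"syscall"
)

// markCloexecFrom sets FD_CLOEXEC on every open descriptor numbered minFD
// or higher, so none of them survive an execve into another program.
func markCloexecFrom(minFD int) error {
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		return err
	}
	for _, e := range entries {
		fd, err := strconv.Atoi(e.Name())
		if err != nil || fd < minFD {
			continue
		}
		// No error is reported here; the descriptor that was used to
		// list the directory is already closed at this point.
		syscall.CloseOnExec(fd)
	}
	return nil
}

// hasCloexec reports whether FD_CLOEXEC is set on fd.
func hasCloexec(fd int) (bool, error) {
	flags, _, errno := syscall.Syscall(syscall.SYS_FCNTL, uintptr(fd), syscall.F_GETFD, 0)
	if errno != 0 {
		return false, errno
	}
	return flags&syscall.FD_CLOEXEC != 0, nil
}

// demoCloexec opens "/" without O_CLOEXEC (a deliberately leaky handle),
// then shows the flag before and after the sweep.
func demoCloexec() (before, after bool, err error) {
	fd, err := syscall.Open("/", syscall.O_RDONLY|syscall.O_DIRECTORY, 0)
	if err != nil {
		return false, false, err
	}
	defer syscall.Close(fd)

	if before, err = hasCloexec(fd); err != nil {
		return false, false, err
	}
	if err = markCloexecFrom(3); err != nil {
		return false, false, err
	}
	after, err = hasCloexec(fd)
	return before, after, err
}

func main() {
	before, after, err := demoCloexec()
	if err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Println("close-on-exec before sweep:", before)
	fmt.Println("close-on-exec after sweep: ", after)
}
```

Starting the sweep at descriptor 3 leaves stdin, stdout and stderr alone, which is the usual convention when the next program still needs its standard streams.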
So, how does an attacker actually turn this theoretical flaw into a practical host takeover? It's surprisingly straightforward. The entire exploit can be packaged into a malicious Docker image, waiting for an unsuspecting admin to run it or for an automated CI/CD pipeline to pick it up.
First, you craft a Dockerfile. The beauty of this exploit is its simplicity. All you need are two key instructions: WORKDIR and CMD (or ENTRYPOINT).
```dockerfile
# Malicious Dockerfile for CVE-2024-21626
FROM ubuntu:latest
# The magic trick. This path is the key.
# We are telling runc to change its directory to a file descriptor path.
# An attacker might iterate through several common FDs (e.g., 3 through 10).
WORKDIR /proc/self/fd/7
# The payload. Once WORKDIR succeeds, this command runs
# with the host's root as its current directory.
CMD ["ls", "-la", "."]
```

This Dockerfile tells runc to set the working directory to `/proc/self/fd/7`. Now, the attacker needs to win the race. The `CMD` here is benign (`ls -la .`), but it demonstrates the exploit. When an admin runs `docker run leaky-vessel`, runc fires up. It opens various files on the host, one of which might land on file descriptor 7. The vulnerable runc then executes `Chdir("/proc/self/fd/7")`, and suddenly its working directory is the host's `/`.
Then, runc executes the container's command: `ls -la .` Since the current directory (`.`) is now the host's root, the command doesn't list the container's files; it lists the contents of the host's root directory (`/etc`, `/home`, `/root`, etc.). Game over. An attacker would, of course, use a more destructive `CMD`, such as one that writes their SSH key to `/root/.ssh/authorized_keys` or executes a reverse shell.
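From the defender's side, a working directory that routes through a procfs descriptor link is easy to flag before a container ever starts. The Go sketch below is a heuristic only; the function name `suspiciousCwd` is hypothetical, and the input is assumed to be the working directory taken from an image's `WORKDIR` or an OCI config's `process.cwd`. It deliberately does not collapse `..` components, because doing so lexically could hide the very pattern it is looking for.

```go
package main

import (
	"fmt"
	"strings"
)

// isDigits reports whether s is a non-empty run of ASCII digits.
func isDigits(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		if r < '0' || r > '9' {
			return false
		}
	}
	return true
}

// isProcSelector matches the component after /proc that selects a process:
// "self", "thread-self" or a numeric PID.
func isProcSelector(s string) bool {
	return s == "self" || s == "thread-self" || isDigits(s)
}

// suspiciousCwd reports whether cwd contains a proc/<selector>/fd/<n>
// sequence. Empty and "." components are skipped; ".." is left untouched.
func suspiciousCwd(cwd string) bool {
	var parts []string
	for _, p := range strings.Split(cwd, "/") {
		if p != "" && p != "." {
			parts = append(parts, p)
		}
	}
	for i := 0; i+3 < len(parts); i++ {
		if parts[i] == "proc" && isProcSelector(parts[i+1]) &&
			parts[i+2] == "fd" && isDigits(parts[i+3]) {
			return true
		}
	}
	return false
}

func main() {
	for _, cwd := range []string{"/app", "/proc/self/fd/7", "/proc/self/fd/8/../../.."} {
		fmt.Printf("%-28s suspicious=%v\n", cwd, suspiciousCwd(cwd))
	}
}
```

A check like this belongs in an admission or image-scanning step as an extra signal. It is not a substitute for patching, since it only recognises the pattern described in this write-up.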
> [!WARNING]
> A more advanced attacker wouldn't rely on a static FD number. They could write a small C program as the container's entrypoint that sprays open files and then iterates through `/proc/self/fd/*` to find a directory handle, making the exploit far more reliable. The core principle remains the same: trick `runc` into using a path that resolves outside the container's intended root.
Let's not mince words here. The impact of a successful CVE-2024-21626 exploit is total, catastrophic host compromise. This isn't a data leak or a denial of service; it's the keys to the kingdom. The very concept of containerization is that processes inside the container are jailed, unable to see or affect the host system or other containers. This vulnerability dynamites that jail wall.
Once an attacker has escaped, they are running with the privileges of the container runtime, which is often root (or close to it). They have full read/write access to the host's entire filesystem. Think about what that means. They can steal application source code, database files, and TLS certificates. They can read Kubernetes secrets mounted onto the host. They can install persistent backdoors, rootkits, or cryptominers that will survive reboots.
In a shared tenancy environment like a Kubernetes cluster, the stakes are even higher. An attacker who compromises one tenant's pod can escape to the node. From there, they can often leverage the node's credentials (like the kubelet's service account token) to attack the Kubernetes API server itself. A single pod compromise can quickly escalate into a full cluster takeover, allowing the attacker to destroy workloads, steal data from every running application, and use the cluster's compute resources for their own nefarious purposes.
This vulnerability effectively undoes decades of progress in operating system security and process isolation. It reminds us that our complex, abstracted cloud-native stacks are often perched precariously on a few critical, low-level components. When one of those components fails, the entire tower can come tumbling down. This is why runtime security and defense-in-depth are not optional luxuries; they are fundamental requirements for running containers in production.
The immediate, obvious, and only real fix is to patch. You need to update runc to version 1.1.12 or newer. Since runc is bundled with higher-level tools, this means you need to update your entire container stack. For Docker users, that means updating Docker Engine. For Kubernetes users, it means updating containerd and ensuring your node images are patched. There is no clever workaround for this one; you are vulnerable, and you must patch.
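For fleet audits, the version comparison is simple enough to script. The sketch below assumes you already have the bare version string (for example, taken from the output of `runc --version`); `parseVersion` and `hasFix` are hypothetical helper names. It ignores pre-release suffixes and cannot detect distribution packages that backport the fix under an older version number, so treat a negative result as a prompt to check your vendor's advisory.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion extracts major, minor and patch numbers from strings such as
// "1.1.12", "v1.1.11" or "1.2.0-rc.1". Pre-release and build suffixes are
// dropped, which is a simplification.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	v = strings.TrimPrefix(strings.TrimSpace(v), "v")
	if i := strings.IndexAny(v, "-+"); i >= 0 {
		v = v[:i]
	}
	fields := strings.Split(v, ".")
	if len(fields) != 3 {
		return out, fmt.Errorf("unexpected version %q", v)
	}
	for i, f := range fields {
		n, err := strconv.Atoi(f)
		if err != nil {
			return out, fmt.Errorf("unexpected version %q: %w", v, err)
		}
		out[i] = n
	}
	return out, nil
}

// hasFix reports whether version is at least 1.1.12, the release that
// fixes CVE-2024-21626.
func hasFix(version string) (bool, error) {
	got, err := parseVersion(version)
	if err != nil {
		return false, err
	}
	fixed := [3]int{1, 1, 12}
	for i := range got {
		if got[i] != fixed[i] {
			return got[i] > fixed[i], nil
		}
	}
	return true, nil
}

func main() {
	for _, v := range []string{"1.1.11", "1.1.12", "1.2.0"} {
		ok, err := hasFix(v)
		if err != nil {
			fmt.Println(v, "error:", err)
			continue
		}
		fmt.Printf("runc %s patched=%v\n", v, ok)
	}
}
```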
However, this incident is a fantastic, if painful, lesson in defense-in-depth. Relying solely on the container runtime for security is like relying on a single lock on your front door. A determined attacker will find a way to pick it. You need more layers. For example, a properly configured Seccomp profile could have mitigated this attack. Seccomp can restrict the system calls a container is allowed to make. A strict profile might block the Chdir syscall entirely for most applications, or at least generate an audit log that something deeply weird is happening.
Similarly, mandatory access control systems like SELinux or AppArmor are designed for exactly this scenario. A strong AppArmor profile would have prevented the container process from accessing files outside of its designated directories, even after the runc process was tricked. The escape would have failed because the kernel's MAC policy would have stepped in as a second line of defense.
Finally, the principle of least privilege is paramount. Don't run your containers as the root user. While this specific bug allowed escape regardless of the user inside the container, many other escape vulnerabilities are thwarted if the containerized process is unprivileged. By running as a non-root user, you reduce the attack surface and limit the immediate capabilities of an attacker, potentially giving you time to detect and respond before they can escalate to full root on the host.
As for re-exploitation, security researchers should now be scrutinizing every single code path in runc and other runtimes where a path from the container's configuration is used by the privileged runtime process. Any instance of file access, directory change, or path resolution that occurs before the final pivot_root is a potential goldmine for similar vulnerabilities. The fix for this bug was simple, but the flawed design pattern may exist elsewhere.
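One pattern worth looking for in such audits is whether a resolved path is checked against the intended root after the fact. The Go sketch below shows a userspace version of that check under stated assumptions: `insideRoot` and `demoInsideRoot` are hypothetical names, the check resolves symbolic links with `filepath.EvalSymlinks`, and it is itself subject to races if the filesystem changes between the check and the use. It illustrates the idea rather than reproducing how any runtime implements it.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// insideRoot reports whether candidate, after symlink resolution, is root
// itself or a location below it.
func insideRoot(root, candidate string) (bool, error) {
	r, err := filepath.EvalSymlinks(root)
	if err != nil {
		return false, err
	}
	c, err := filepath.EvalSymlinks(candidate)
	if err != nil {
		return false, err
	}
	rel, err := filepath.Rel(r, c)
	if err != nil {
		return false, err
	}
	escapes := rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator))
	return !escapes, nil
}

// demoInsideRoot builds a throwaway root with one subdirectory and checks
// both that subdirectory and the real "/" against it.
func demoInsideRoot() (subOK, hostOK bool, err error) {
	root, err := os.MkdirTemp("", "rootfs-")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(root)

	sub := filepath.Join(root, "app")
	if err = os.Mkdir(sub, 0o755); err != nil {
		return false, false, err
	}
	if subOK, err = insideRoot(root, sub); err != nil {
		return false, false, err
	}
	hostOK, err = insideRoot(root, "/")
	return subOK, hostOK, err
}

func main() {
	subOK, hostOK, err := demoInsideRoot()
	if err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Println("subdirectory accepted:", subOK)
	fmt.Println("host root accepted:   ", hostOK)
}
```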
CVSS:3.1/AV:L/AC:L/PR:N/UI:R/S:C/C:H/I:H/A:H

| Product | Affected Versions | Fixed Version |
|---|---|---|
| runc (Open Container Initiative) | < 1.1.12 | 1.1.12 |

| Attribute | Detail |
|---|---|
| CWE ID | CWE-22: Improper Limitation of a Pathname to a Restricted Directory ('Path Traversal') |
| Attack Vector | Local (Attacker must be able to run a malicious container on the host) |
| CVSS Score | 8.6 (High) |
| CVSS Vector | CVSS:3.1/AV:L/AC:L/PR:N/UI:R/S:C/C:H/I:H/A:H |
| Impact | Container Escape, Host RCE |
| Exploit Status | Active Exploitation / Public PoC |
| KEV Status | Listed in CISA KEV Catalog |
| EPSS Score | 90.25% (0.90252) |