Traefik Jam: The Eternal Handshake Denial of Service
Jan 15, 2026
Executive Summary (TL;DR)
Traefik's "fast path" for handling Let's Encrypt certificate validations (TLS-ALPN-01) failed to implement timeouts during the internal TLS handshake. By opening a connection whose ClientHello advertises the acme-tls/1 ALPN protocol and then halting communication, an attacker can force the server to wait indefinitely. Each stalled connection pins a goroutine and a file descriptor, and enough of them eventually leave the load balancer unable to accept traffic.
A resource exhaustion vulnerability in Traefik's ACME TLS-ALPN-01 challenge handler allows unauthenticated attackers to trigger infinite hangs during the TLS handshake, consuming file descriptors and goroutines.
The Hook: The VIP Lane for Certificates
In the world of modern reverse proxies, Traefik is the cool kid on the block. It integrates seamlessly with Kubernetes, Docker, and—crucially for this story—Let's Encrypt. One of the sleekest features it offers is the TLS-ALPN-01 challenge.
Normally, proving you own a domain involves putting a file on a server (HTTP-01) or messing with DNS records (DNS-01). But TLS-ALPN-01 is special. It happens entirely within the TLS handshake on port 443. It's cleaner, faster, and requires less infrastructure plumbing.
To make this work, Traefik employs a "fast path." When a new TCP connection hits the entry point, Traefik peeks at the ClientHello. If it sees the specific ALPN protocol acme-tls/1, it says, "Hold on, this isn't web traffic, this is a certificate challenge!" and hijacks the connection, passing it to a dedicated internal handler. It's a VIP lane for ACME bots. Unfortunately, Traefik forgot to put a bouncer in the VIP lane.
The Flaw: A Handshake That Never Ends
Here is the architectural sin: Assumption of Benevolence.
When Traefik detects the acme-tls/1 protocol, it hands the connection off to acmeTLSALPNHandler in pkg/server/router/tcp/router.go. Because this is a special internal process, the developers explicitly cleared the socket deadlines. They likely thought, "We are just finishing a handshake we already started; it will take milliseconds."
They used the standard Go tls.Server(conn, config).Handshake() method. The problem? That method is synchronous and blocking. If the underlying TCP connection doesn't provide data, Handshake() sleeps.
By clearing the deadlines (SetDeadline(time.Time{})) and not using a Context with a timeout, Traefik effectively told the operating system: "I will wait for this client to finish talking until the heat death of the universe." An attacker simply has to say "Hello" and then go silent. Traefik keeps the line open, the memory allocated, and the file descriptor locked forever.
The Code: The Smoking Gun
Let's look at the diff. It's a classic example of "Go concurrency gotchas."
The Vulnerable Code: Note the lack of context or timeout logic. It just attempts the handshake on a naked connection.
// The old way: Optimistic and dangerous
return tcp.HandlerFunc(func(conn tcp.WriteCloser) {
	// DANGER: This blocks forever if the client stops sending bytes
	_ = tls.Server(conn, r.httpsTLSConfig).Handshake()
})
The Fix (Commit e9f3089e9045812bcf1b410a9d40568917b26c3d):
The patch introduces three critical safety mechanisms: a defer to ensure cleanup, a strict 2-second timeout, and a context-aware handshake.
// The fixed way: Paranoid and safe
return tcp.HandlerFunc(func(conn tcp.WriteCloser) {
	tlsConn := tls.Server(conn, r.httpsTLSConfig)
	// 1. Ensure we close the FD no matter what happens
	defer tlsConn.Close()
	// 2. Define a hard deadline. 2 seconds is plenty for a bot.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	// 3. Use the Context-aware handshake
	if err := tlsConn.HandshakeContext(ctx); err != nil {
		log.FromContext(ctx).WithError(err).Debug("Error during ACME-TLS/1 handshake")
	}
})
It's a textbook remediation: never perform network I/O without a timeout.
The Exploit: Ghosting the Server
To exploit this, we don't need fancy buffer overflows. We just need to be a bad conversationalist. This is a Slowloris-style attack, but targeted specifically at the ACME logic.
The Recipe:
- Connect: Open a standard TCP socket to the target IP on port 443.
- Deceive: Construct a TLS ClientHello packet. The critical payload is the ALPN extension containing the string acme-tls/1.
- Ghost: Send the ClientHello and then... do nothing. Do not send the rest of the handshake. Do not close the socket. Just wait.
Why it works:
Traefik sees the ALPN tag, clears the socket timeout, and calls Handshake(). Because we sent the ClientHello, the server logic engages. But because we haven't sent the Finished message (or the Key Exchange), the server waits for more bytes.
Repeat this loop 5,000 times. You will consume 5,000 file descriptors (FDs). Once the server hits its ulimit (often 1024 or 4096 default on older systems, though higher in K8s), it stops accepting legitimate traffic. The load balancer is now a brick.
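To put numbers on that, a process can read its own descriptor limit. The sketch below is Unix-only and reports the limit of whatever process runs it, not Traefik's; the helper headroom and the figure of 100 descriptors already in use are assumptions for illustration.

```go
package main

import (
	"fmt"
	"syscall"
)

// headroom (hypothetical helper) returns how many more descriptors can be
// opened before the soft limit is reached.
func headroom(softLimit, inUse uint64) uint64 {
	if inUse >= softLimit {
		return 0
	}
	return softLimit - inUse
}

func main() {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		fmt.Println("getrlimit failed:", err)
		return
	}
	soft := uint64(lim.Cur)
	fmt.Printf("soft RLIMIT_NOFILE: %d\n", soft)
	// Assumed baseline: 100 descriptors already in use by normal traffic.
	fmt.Printf("stalled handshakes needed to exhaust it: %d\n", headroom(soft, 100))
}
```

With a soft limit of 1024 and 100 descriptors in use, 924 stalled handshakes are enough; at 4096 the number rises to 3,996, still well within the 5,000-connection loop described above.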
The Impact: Death by a Thousand Goroutines
The severity here is labeled "Medium" (CVSS 5.9) largely because it requires the ACME TLS-ALPN configuration to be active. However, for environments that rely on this feature, the availability impact is high: the proxy can be taken offline entirely.
When the exploit runs:
- Memory Leak: Each "stuck" connection consumes a Goroutine (2KB+ stack) and the associated TLS state structures.
- FD Exhaustion: This is the real killer. Linux has a finite number of open file descriptors per process. Once you consume them all, Traefik cannot open new sockets to upstream backends, cannot accept new client connections, and often cannot even write to its own log files.
- Cascading Failure: If this Traefik instance is an Ingress Controller in Kubernetes, the liveness probes might fail (if they rely on a new connection), causing Kubernetes to restart the pod. The attacker simply targets the new pod, creating a boot-loop denial of service.
The Fix: Timeouts Are Not Optional
If you are running a Traefik 2.x release below 2.11.35, or a 3.x release (3.0.0-beta1 onward) below 3.6.7, you are vulnerable. The patch is simple but mandatory.
Immediate Mitigation:
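If you cannot upgrade immediately, check your traefik.yaml or static configuration. If you are using Let's Encrypt (ACME), check which challenge type is enabled on each certificate resolver. If a resolver has a tlsChallenge entry (written as tlsChallenge: {} in YAML, or as --certificatesresolvers.<name>.acme.tlschallenge=true on the command line), you are exposed.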
Switching to httpChallenge or dnsChallenge neutralizes the attack vector because the vulnerable code path (acmeTLSALPNHandler) will never be triggered. However, this may require infrastructure changes (e.g., opening port 80 or configuring DNS provider API keys).
Long Term: Update to the patched versions. The fix implements a hard 2-second timeout. If the handshake doesn't complete in 2 seconds, Traefik ruthlessly severs the connection.
Technical Appendix
CVSS:3.1/AV:N/AC:H/PR:N/UI:N/S:U/C:N/I:N/A:H
Affected Systems
Affected Versions Detail
| Product | Affected Versions | Fixed Version |
|---|---|---|
| Traefik Proxy (Traefik) | < 2.11.35 | 2.11.35 |
| Traefik Proxy (Traefik) | >= 3.0.0-beta1, < 3.6.7 | 3.6.7 |

| Attribute | Detail |
|---|---|
| CWE | CWE-770 (Allocation of Resources Without Limits or Throttling) |
| CVSS v3.1 | 5.9 (Medium) |
| Attack Vector | Network (Remote) |
| Impact | Denial of Service |
| Protocol | TLS / ACME |
| Fix Complexity | Low (Update) |
MITRE ATT&CK Mapping
Allocation of Resources Without Limits or Throttling