CVEReports

Automated vulnerability intelligence platform. Comprehensive reports for high-severity CVEs generated by AI.

© 2026 CVEReports. All rights reserved.

Made with love by Amit Schendel & Alon Barad



CVE-2026-22691
CVSS 5.3 · EPSS 0.02%

Death by a Thousand Spaces: pypdf DoS Deep Dive

Alon Barad
Software Engineer

Feb 21, 2026 · 5 min read

PoC Available

Executive Summary (TL;DR)

pypdf < 6.6.0 contains multiple DoS vectors in its error recovery logic. Malformed PDFs with broken xref tables, missing Root objects, or circular page trees can cause 100% CPU usage or crashes.

A trio of resource exhaustion vulnerabilities in the popular pypdf library allows attackers to trigger Denial of Service via malformed PDF files. By exploiting the library's 'helpful' error recovery logic, attackers can force infinite loops, recursion errors, or catastrophic backtracking.

The Hook: PDFs Are Cursed

Let's be honest: the PDF specification isn't so much a 'standard' as it is a crime scene that we've all agreed to ignore. It is a complex, hierarchical beast capable of embedding everything from JavaScript to 3D models. Because the spec is so convoluted, PDF parsers often have to be incredibly forgiving. They try to 'heal' broken files so the user doesn't see an error. This is where pypdf lives—a pure-Python library designed to manipulate these digital monstrosities.

But here is the golden rule of secure coding: Benevolence breeds bugs.

CVE-2026-22691 is a classic example of what happens when a library tries too hard to be helpful. When pypdf encounters a malformed PDF in its default (non-strict) mode, it doesn't just reject it. It rolls up its sleeves and attempts to reconstruct the missing pieces. It scans for objects, hunts for the Root catalog, and tries to flatten the page tree. The vulnerability lies in how it does this—using inefficient algorithms that assume the file isn't actively trying to kill the process.

The Flaw: A Trilogy of Errors

This isn't just one bug; it's a bundle of three distinct ways to make a Python process cry. They all stem from the library's error recovery mechanisms (CWE-400 and CWE-1333).

1. The Regex Bomb (ReDoS): When the cross-reference (xref) table is broken, pypdf tries to find objects manually. It used a regular expression to scan the binary stream. The regex [\r\n \t][ \t]*(\d+)[ \t]+(\d+)[ \t]+obj looks innocent enough, but that greedy whitespace quantifier right at the front is deadly. If an attacker injects a massive block of spaces or tabs without a valid object definition, every whitespace character becomes a candidate starting point. From each one, the engine consumes the rest of the run, fails to find a digit, and backtracks one character at a time. The total work grows roughly quadratically with the length of the run, eating CPU cycles like popcorn.

2. The Count to Infinity: If the PDF's trailer says the file has a huge /Size (e.g., 2 billion objects) but the /Root (Catalog) is missing, the library enters a recovery loop. It iterates from 0 to /Size, checking every single ID to see if it might be the Catalog. It's an O(N) search where N is controlled by the attacker. Spoiler: 2 billion iterations in Python take a while.

3. The Ouroboros (Infinite Recursion): The _flatten function is responsible for turning the hierarchical tree of PDF pages into a linear list. Prior to the fix, it blindly followed the /Kids array. If a malicious PDF defined a page node that listed itself or a parent as a child, the function would recurse until it hit the stack depth limit, crashing the application with a RecursionError.
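
The usual defence against the Ouroboros is to remember which nodes have already been visited. Here is a minimal sketch of that idea. It is not pypdf's actual patch: the dictionary-based page tree, the `flatten_pages` name, and the choice to raise `ValueError` are all assumptions made for illustration.

```python
from typing import Dict, List, Set

# Hypothetical in-memory page tree: node id -> {"type": ..., "kids": [...]}
PageTree = Dict[int, dict]


def flatten_pages(tree: PageTree, root_id: int) -> List[int]:
    """Return leaf page ids in document order, refusing to visit a node twice."""
    pages: List[int] = []
    visited: Set[int] = set()
    stack = [root_id]  # explicit stack, so depth is not tied to the recursion limit

    while stack:
        node_id = stack.pop()
        if node_id in visited:
            # Either a genuine cycle or the same node referenced twice.
            raise ValueError(f"node {node_id} reached twice (cycle or duplicate)")
        visited.add(node_id)

        node = tree[node_id]
        if node["type"] == "Page":
            pages.append(node_id)
        else:
            # Push children reversed so they pop in their original order.
            stack.extend(reversed(node.get("kids", [])))
    return pages


if __name__ == "__main__":
    healthy = {
        1: {"type": "Pages", "kids": [2, 3]},
        2: {"type": "Page"},
        3: {"type": "Pages", "kids": [4]},
        4: {"type": "Page"},
    }
    print(flatten_pages(healthy, 1))
```

Using an explicit stack rather than recursion also means a very deep (but acyclic) tree cannot exhaust the interpreter's call stack.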

The Code: Regex vs. Reality

The most interesting fix is the move away from regex. Regular expressions on binary streams are often a trap. Here is the vulnerable logic that was responsible for the ReDoS:

# Vulnerable Code (Pre-6.6.0)
# Looking for "1 0 obj"
re.finditer(
    rb'[\r\n \t][ \t]*(\d+)[ \t]+(\d+)[ \t]+obj',
    f_
)
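
To get a feel for the cost, the pattern can be timed against runs of spaces of increasing length. The sketch below reuses the regex from the snippet above; the helper names `find_objects` and `time_whitespace_scan` are made up for this example, and the input sizes are kept small so it finishes quickly. Absolute timings depend on the machine and interpreter, so none are quoted here.

```python
import re
import time

# Same pattern as the vulnerable snippet above.
OBJ_HEADER = re.compile(rb"[\r\n \t][ \t]*(\d+)[ \t]+(\d+)[ \t]+obj")


def find_objects(data: bytes) -> list[tuple[int, int]]:
    """Return (object id, generation) pairs the pattern finds in data."""
    return [(int(m.group(1)), int(m.group(2))) for m in OBJ_HEADER.finditer(data)]


def time_whitespace_scan(length: int) -> float:
    """Seconds spent scanning `length` spaces that contain no object header."""
    payload = b" " * length
    start = time.perf_counter()
    find_objects(payload)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Doubling the input should noticeably more than double the time.
    for length in (1_000, 2_000, 4_000):
        print(f"{length:>6} spaces: {time_whitespace_scan(length):.4f}s")
```

If the time roughly quadruples each time the input doubles, that is the signature of quadratic behaviour, and it explains why a multi-megabyte block of whitespace is enough to pin a worker.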

The fix, introduced in commit 294165726b646bb7799be1cc787f593f2fdbcf45, abandons the regex entirely. Instead, the developers switched to a manual byte-scanning approach using Python's native bytes.find() method. It's less 'elegant' to read, perhaps, but it runs in linear time and is immune to backtracking.

# Patched Code (6.6.0)
@classmethod
def _find_pdf_objects(cls, data: bytes) -> Iterable[tuple[int, int, int]]:
    index = 0
    while True:
        # Simple string search. Fast. Safe.
        index = data.find(b" obj", index)
        if index == -1:
            return
        # ... [Logic to parse ID/Generation backwards from the match] ...
        index += 4
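
The backwards-parsing step is elided in the excerpt above, so here is one way it could look. This is a standalone sketch, not pypdf's code: the function name, the order of the yielded tuple (offset, object id, generation), and the `MAX_DIGITS` cap are all assumptions. The point is only that each byte is inspected a bounded number of times, so a wall of whitespace costs a single pass.

```python
from typing import Iterator, Tuple

SPACE_OR_TAB = b" \t"
LEADING_WHITESPACE = b"\r\n \t"
MAX_DIGITS = 10  # assumption: longer digit runs are treated as garbage


def _is_digit(byte: int) -> bool:
    return 48 <= byte <= 57  # ASCII '0'..'9'


def find_pdf_objects(data: bytes) -> Iterator[Tuple[int, int, int]]:
    """Yield (offset, object id, generation) for each 'ID GEN obj' header."""
    index = 0
    while True:
        index = data.find(b" obj", index)
        if index == -1:
            return
        pos = index
        index += 4  # always move past this match before deciding anything

        # Walk left over any extra spaces/tabs in front of " obj".
        while pos > 0 and data[pos - 1] in SPACE_OR_TAB:
            pos -= 1
        gen_end = pos
        while pos > 0 and _is_digit(data[pos - 1]):
            pos -= 1
        gen_start = pos

        # Walk left over the separator between the id and the generation.
        while pos > 0 and data[pos - 1] in SPACE_OR_TAB:
            pos -= 1
        id_end = pos
        while pos > 0 and _is_digit(data[pos - 1]):
            pos -= 1
        id_start = pos

        gen_len, id_len = gen_end - gen_start, id_end - id_start
        if not (0 < gen_len <= MAX_DIGITS and 0 < id_len <= MAX_DIGITS):
            continue  # one of the numbers is missing or implausibly long
        if id_start > 0 and data[id_start - 1] not in LEADING_WHITESPACE:
            continue  # header must start the data or follow whitespace
        yield id_start, int(data[id_start:id_end]), int(data[gen_start:gen_end])
```

The backwards walks can never cross the letters of a previous "obj" match, which is what keeps the total work proportional to the input size.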

Additionally, for the "Count to Infinity" bug, they added a hard cap. The PdfReader now accepts a root_object_recovery_limit parameter (default 10,000). If it can't find the Root in 10k tries, it gives up. Sometimes, quitting is the best option.
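
The shape of that cap is easy to show in isolation. The following is a sketch under stated assumptions, not the library's implementation: `find_root_candidate`, the `is_catalog` callback, and the `RecoveryLimitExceeded` exception are hypothetical names, and raising an exception when the limit is hit is a design choice made for this example.

```python
from typing import Callable, Optional


class RecoveryLimitExceeded(Exception):
    """Raised when the search gives up before finding a catalog."""


def find_root_candidate(
    declared_size: int,
    is_catalog: Callable[[int], bool],
    recovery_limit: int = 10_000,
) -> Optional[int]:
    """Scan ids 0..declared_size-1 for a catalog, checking at most recovery_limit ids."""
    for object_id in range(declared_size):
        if object_id >= recovery_limit:
            # The attacker controls declared_size; we control recovery_limit.
            raise RecoveryLimitExceeded(
                f"no catalog found in the first {recovery_limit} objects"
            )
        if is_catalog(object_id):
            return object_id
    return None  # every declared object was checked and none was a catalog
```

However large the declared /Size is, the loop body runs at most `recovery_limit` times, so the cost of a hostile trailer is bounded by a number the application picked.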

The Exploit: Weaponizing Whitespace

Exploiting this doesn't require advanced memory corruption techniques. You don't need to know the stack alignment or bypass ASLR. You just need a text editor (or a hex editor) and a bad attitude.

Scenario: A web application allows users to upload PDF invoices. The backend uses pypdf to read the metadata.

Attack 1: The Whitespace Bomb

  1. Take a valid PDF.
  2. Corrupt the startxref pointer at the end of the file so pypdf triggers recovery mode.
  3. Append 50MB of spaces (0x20) and tabs (0x09) to the end of the file.
  4. Upload. The server's CPU spikes to 100% as the regex engine tries to match that whitespace against the complex pattern, effectively freezing the worker process.

Attack 2: The Loop of Death

  1. Create a minimal PDF.
  2. In the trailer, set /Size 2147483647.
  3. Delete the /Root key from the trailer.
  4. Upload. The worker thread enters a loop attempting to iterate 2 billion times. In Python, this will hang the process for hours, blocking other requests.
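
For a rough sense of scale on Attack 2, one can time a small sample of iterations and extrapolate to the declared /Size. The sketch below uses a dictionary miss as a stand-in for the per-object check, which is an assumption: the real recovery loop does considerably more work per iteration, so treat whatever this prints as a floor rather than a prediction.

```python
import time


def estimate_loop_seconds(total_iterations: int, sample_size: int = 200_000) -> float:
    """Time sample_size stand-in lookups and scale the result to total_iterations."""
    objects: dict = {}  # empty stand-in for the reader's object table
    start = time.perf_counter()
    for object_id in range(sample_size):
        objects.get(object_id)  # every lookup misses, like probing ids that do not exist
    elapsed = time.perf_counter() - start
    return elapsed / sample_size * total_iterations


if __name__ == "__main__":
    seconds = estimate_loop_seconds(2_147_483_647)
    print(f"floor for an empty loop body: {seconds / 60:.1f} minutes")
```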

The Fix: Upgrade or Strict Mode

The remediation is straightforward. If you are using pypdf, you are likely bundling it with your application. You need to update your requirements.txt or pyproject.toml immediately.

1. Update to 6.6.0+: This version includes the manual scanner, the recursion depth checks, and the iteration limits.

2. Enable Strict Mode: If you cannot update immediately, you can mitigate the issue by forcing strict compliance. This disables the 'helpful' recovery logic that contains the vulnerabilities.

# Mitigation
from pypdf import PdfReader

reader = PdfReader("suspicious.pdf", strict=True)

Note: Using strict=True means valid-but-slightly-broken PDFs will fail to load. This is a trade-off between availability and security. In a hostile environment (processing public uploads), strict=True should be the default anyway.
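
To confirm which side of the fix a deployment is on, a startup check can compare the installed version against 6.6.0. This is a minimal sketch: the helper names are made up, and the version parsing is deliberately simple (it ignores pre-release suffixes such as "rc1"). The third-party `packaging` library handles those cases properly if it is available.

```python
from importlib import metadata
from typing import Optional, Tuple

FIXED_VERSION = (6, 6, 0)


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '6.5.0' into (6, 5, 0); stops at the first non-numeric component."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    while len(parts) < 3:
        parts.append(0)  # pad so that '6.6' compares equal to '6.6.0'
    return tuple(parts)


def is_patched(version: str) -> bool:
    """True if the given pypdf version is 6.6.0 or later."""
    return parse_version(version) >= FIXED_VERSION


def installed_pypdf_version() -> Optional[str]:
    """Return the installed pypdf version string, or None if it is not installed."""
    try:
        return metadata.version("pypdf")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_pypdf_version()
    if version is None:
        print("pypdf is not installed")
    elif is_patched(version):
        print(f"pypdf {version}: patched")
    else:
        print(f"pypdf {version}: upgrade to 6.6.0 or later")
```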

Official Patches

pypdf: Release notes for version 6.6.0


Technical Appendix

CVSS Score: 5.3 / 10
CVSS Vector: CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:L
EPSS Probability: 0.02%
Top 95% most exploited

Affected Systems

  • Python applications using pypdf < 6.6.0
  • Document processing pipelines
  • Email gateways analyzing PDF attachments

Affected Versions Detail

Product | Affected Versions | Fixed Version
pypdf (py-pdf) | < 6.6.0 | 6.6.0

Attribute | Detail
CWE ID | CWE-400 / CWE-1333
CVSS v3.1 | 5.3 (Medium)
Attack Vector | Network (via File Upload)
Impact | Denial of Service (DoS)
Exploit Status | PoC Available (Trivial)
Patch | v6.6.0

MITRE ATT&CK Mapping

T1499: Endpoint Denial of Service (Tactic: Impact)
CWE-400: Uncontrolled Resource Consumption

Uncontrolled Resource Consumption and Inefficient Regular Expression Complexity

Known Exploits & Detection

Manual Analysis: Technical analysis of the fix reveals reproducible ReDoS vectors.

Vulnerability Timeline

2026-01-09: Patch released in v6.6.0
2026-01-10: CVE Published

