Feb 21, 2026 · 5 min read
pypdf < 6.6.0 contains multiple DoS vectors in its error recovery logic. Malformed PDFs with broken xref tables, missing Root objects, or circular page trees can cause 100% CPU usage or crashes.
A trio of resource exhaustion vulnerabilities in the popular pypdf library allows attackers to trigger Denial of Service via malformed PDF files. By exploiting the library's 'helpful' error recovery logic, attackers can force infinite loops, recursion errors, or catastrophic backtracking.
Let's be honest: the PDF specification isn't so much a 'standard' as it is a crime scene that we've all agreed to ignore. It is a complex, hierarchical beast capable of embedding everything from JavaScript to 3D models. Because the spec is so convoluted, PDF parsers often have to be incredibly forgiving. They try to 'heal' broken files so the user doesn't see an error. This is where pypdf lives—a pure-Python library designed to manipulate these digital monstrosities.
But here is the golden rule of secure coding: Benevolence breeds bugs.
CVE-2026-22691 is a classic example of what happens when a library tries too hard to be helpful. When pypdf encounters a malformed PDF in its default (non-strict) mode, it doesn't just reject it. It rolls up its sleeves and attempts to reconstruct the missing pieces. It scans for objects, hunts for the Root catalog, and tries to flatten the page tree. The vulnerability lies in how it does this—using inefficient algorithms that assume the file isn't actively trying to kill the process.
This isn't just one bug; it's a bundle of three distinct ways to make a Python process cry. They all stem from the library's error recovery mechanisms (CWE-400 and CWE-1333).
1. The Regex Bomb (ReDoS): When the cross-reference (xref) table is broken, pypdf tries to find objects manually. Before the fix, it used a regular expression to scan the binary stream. The regex [\r\n \t][ \t]*(\d+)[ \t]+(\d+)[ \t]+obj looks innocent enough, but its back-to-back whitespace quantifiers (* and +) are deadly. If an attacker injects a massive block of spaces or tabs without a valid object definition, the regex engine retries the match from every whitespace character and backtracks through the rest of the run each time. The work grows roughly quadratically with the length of the run, eating CPU cycles like popcorn.
2. The Count to Infinity: If the PDF's trailer says the file has a huge /Size (e.g., 2 billion objects) but the /Root (Catalog) is missing, the library enters a recovery loop. It iterates from 0 to /Size, checking every single ID to see if it might be the Catalog. It's an O(N) search where N is controlled by the attacker. Spoiler: 2 billion iterations in Python take a while.
3. The Ouroboros (Infinite Recursion): The _flatten function is responsible for turning the hierarchical tree of PDF pages into a linear list. Prior to the fix, it blindly followed the /Kids array. If a malicious PDF defined a page node that listed itself or a parent as a child, the function would recurse until it hit the stack depth limit, crashing the application with a RecursionError.
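
To make the third one concrete, here is a minimal sketch of a flattener that survives a self-referencing tree. It is not pypdf's _flatten, and it is not necessarily how the real patch works: plain dicts stand in for PDF objects, and flatten_pages and CyclicPageTreeError are names invented for this illustration. The swapped-in technique is a visited set plus an explicit stack, so neither a cycle nor a very deep tree can run into Python's recursion limit.

```python
from typing import Any

# Hypothetical, simplified page tree: plain dicts stand in for PDF objects.
Node = dict[str, Any]


class CyclicPageTreeError(ValueError):
    """Raised when the traversal reaches the same node a second time."""


def flatten_pages(root: Node) -> list[Node]:
    """Return the /Page leaves in document order, rejecting repeated nodes."""
    pages: list[Node] = []
    seen: set[int] = set()      # identities of nodes already visited
    stack: list[Node] = [root]  # explicit stack instead of recursion

    while stack:
        node = stack.pop()
        if id(node) in seen:
            # A node listed twice (itself, a parent, or a shared child)
            # is treated as malformed rather than followed again.
            raise CyclicPageTreeError("page tree references a node twice")
        seen.add(id(node))

        if node.get("/Type") == "/Page":
            pages.append(node)
        else:
            # Push kids in reverse so they are popped in document order.
            stack.extend(reversed(node.get("/Kids", [])))

    return pages


if __name__ == "__main__":
    leaf = {"/Type": "/Page"}
    tree = {"/Type": "/Pages", "/Kids": [leaf]}
    tree["/Kids"].append(tree)  # the Ouroboros: the node lists itself
    try:
        flatten_pages(tree)
    except CyclicPageTreeError as exc:
        print(f"rejected: {exc}")
```

The trade-off in this sketch is strictness: any node reachable twice is rejected, which is acceptable for a page tree because each node is supposed to have exactly one parent.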
The most interesting fix is the move away from regex. Regular expressions on binary streams are often a trap. Here is the vulnerable logic that was responsible for the ReDoS:

```python
import re

# Vulnerable code (pre-6.6.0)
# Looking for "1 0 obj"; f_ holds the bytes being scanned
re.finditer(
    rb'[\r\n \t][ \t]*(\d+)[ \t]+(\d+)[ \t]+obj',
    f_
)
```

The fix, introduced in commit 294165726b646bb7799be1cc787f593f2fdbcf45, abandons the regex entirely. Instead, the developers switched to a manual byte-scanning approach using Python's native bytes.find() method. It's less 'elegant' to read, perhaps, but it runs in linear time and is immune to backtracking.

```python
from collections.abc import Iterable

# Patched code (6.6.0), excerpt from a class body
@classmethod
def _find_pdf_objects(cls, data: bytes) -> Iterable[tuple[int, int, int]]:
    index = 0
    while True:
        # Simple string search. Fast. Safe.
        index = data.find(b" obj", index)
        if index == -1:
            return
        # ... [Logic to parse ID/Generation backwards from the match] ...
        index += 4
```

Additionally, for the "Count to Infinity" bug, they added a hard cap. The PdfReader now accepts a root_object_recovery_limit parameter (default 10,000). If it can't find the Root in 10k tries, it gives up. Sometimes, quitting is the best option.
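
The capped search is easy to picture in isolation. The sketch below is a stand-in under stated assumptions, not pypdf's code: recover_root, load_object, and RootRecoveryLimitError are hypothetical names, and only the default of 10,000 comes from the text above.

```python
from typing import Callable, Optional

# Hypothetical stand-in for "fetch object N from the file"; returns None
# when the object does not exist. Not pypdf's internal API.
Loader = Callable[[int], Optional[dict]]


class RootRecoveryLimitError(RuntimeError):
    """Raised when the scan gives up without finding a /Catalog object."""


def recover_root(
    load_object: Loader,
    declared_size: int,
    limit: Optional[int] = 10_000,
) -> dict:
    """Scan object IDs for a /Catalog, inspecting at most `limit` of them.

    `declared_size` is the attacker-controlled /Size from the trailer;
    passing limit=None reproduces the old unbounded behaviour.
    """
    for object_id in range(declared_size):
        if limit is not None and object_id >= limit:
            raise RootRecoveryLimitError(
                f"no /Catalog among the first {limit} objects"
            )
        obj = load_object(object_id)
        if obj is not None and obj.get("/Type") == "/Catalog":
            return obj
    raise RootRecoveryLimitError("no /Catalog object found")


if __name__ == "__main__":
    # A trailer claiming two billion objects, none of which exist.
    try:
        recover_root({}.get, declared_size=2_147_483_647)
    except RootRecoveryLimitError as exc:
        print(f"gave up early: {exc}")
```

With the cap in place, the cost of a hostile /Size is bounded by the limit rather than by whatever number the attacker wrote into the trailer.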
Exploiting this doesn't require advanced memory corruption techniques. You don't need to know the stack alignment or bypass ASLR. You just need a text editor (or a hex editor) and a bad attitude.
Scenario: A web application allows users to upload PDF invoices. The backend uses pypdf to read the metadata.
Attack 1: The Whitespace Bomb

1. Corrupt the startxref pointer at the end of the file so pypdf triggers recovery mode.
2. Append a large block of spaces (0x20) and tabs (0x09) to the end of the file.

Attack 2: The Loop of Death

1. Set the trailer's /Size to 2147483647.
2. Remove the /Root key from the trailer.

The remediation is straightforward. If you are using pypdf, you are likely bundling it with your application. You need to update your requirements.txt or pyproject.toml immediately.
1. Update to 6.6.0+: This version includes the manual scanner, the recursion depth checks, and the iteration limits.
2. Enable Strict Mode: If you cannot update immediately, you can mitigate the issue by forcing strict compliance. This disables the 'helpful' recovery logic that contains the vulnerabilities.
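
```python
from pypdf import PdfReader

# Mitigation: strict mode rejects malformed files instead of repairing them
reader = PdfReader("suspicious.pdf", strict=True)
```

> [!NOTE]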
> Using strict=True means valid-but-slightly-broken PDFs will fail to load. This is a trade-off between availability and security. In a hostile environment (processing public uploads), strict=True should be the default anyway.
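
If you want a deploy-time guard on top of that, a small version check is one option. This is a sketch, not an official tool: parse_version, is_patched, and installed_pypdf_version are made-up helper names, the 6.6.0 threshold is taken from this article, and the parser assumes plain dotted release strings (a pre-release suffix such as rc1 is ignored).

```python
from importlib import metadata
from itertools import takewhile
from typing import Optional

FIXED_VERSION = (6, 6, 0)  # first patched release according to this article


def parse_version(version: str) -> tuple[int, ...]:
    """Turn '6.5.0' into (6, 5, 0), padding short strings like '6.6' with zeros."""
    numbers: list[int] = []
    for piece in version.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))  # leading digits only
        if not digits:
            break
        numbers.append(int(digits))
    numbers += [0] * (3 - len(numbers))
    return tuple(numbers)


def is_patched(version: str) -> bool:
    """True if the given version string is at or above the fixed release."""
    return parse_version(version) >= FIXED_VERSION


def installed_pypdf_version() -> Optional[str]:
    """Return the installed pypdf version, or None if it is not installed."""
    try:
        return metadata.version("pypdf")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_pypdf_version()
    if found is None:
        print("pypdf is not installed in this environment")
    elif is_patched(found):
        print(f"pypdf {found} includes the fix")
    else:
        print(f"pypdf {found} is affected; upgrade to 6.6.0 or later")
```

Running this in CI or at application start-up turns a forgotten dependency pin into a loud failure instead of a quiet exposure.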
CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:L

| Product | Affected Versions | Fixed Version |
|---|---|---|
| pypdf (py-pdf) | < 6.6.0 | 6.6.0 |

| Attribute | Detail |
|---|---|
| CWE ID | CWE-400 / CWE-1333 |
| CVSS v3.1 | 5.3 (Medium) |
| Attack Vector | Network (via File Upload) |
| Impact | Denial of Service (DoS) |
| Exploit Status | PoC Available (Trivial) |
| Patch | v6.6.0 |

CWE-400: Uncontrolled Resource Consumption. CWE-1333: Inefficient Regular Expression Complexity.