Infinite Loops & ReDoS: Crashing pypdf with Malformed Files
Jan 11, 2026·7 min read
Executive Summary (TL;DR)
The pypdf library, in its attempt to be helpful and read broken PDF files (non-strict mode), exposed itself to three separate Denial of Service vectors: catastrophic backtracking in regex, infinite loops in object recovery, and recursion crashes in cyclic page trees. Fixed in version 6.6.0.
A deep dive into how pypdf's lenient recovery logic for broken PDFs created a perfect storm for Denial of Service via ReDoS and infinite recursion loops.
The Hook: The Road to Hell is Paved with Good Intentions
PDFs are notoriously broken. The specification is less of a rigid set of rules and more of a suggestion that Adobe Acrobat ignores whenever it feels like it. Because of this, PDF parsing libraries like pypdf have a difficult job: they can't just reject malformed files, or they'd reject half the internet. So, they implement "recovery" modes—logic designed to guess what a broken file meant to say.
In pypdf, this is the default behavior. Unless you explicitly pass strict=True, the library puts on its detective hat and tries to fix cross-reference tables and missing objects. It's a noble goal, really. The problem is that this detective is gullible, easily distracted, and prone to walking into walls forever.
CVE-2026-22691 isn't a buffer overflow or a remote code execution exploit. It's a classic Denial of Service (DoS) born from algorithmic complexity. It turns out that if you ask pypdf to fix a specifically crafted "broken" file, it will happily consume 100% of your CPU trying to solve an unsolvable puzzle, effectively locking up your web worker or processing thread until the heat death of the universe (or a kill -9).
The Flaw: A Triple Threat of Incompetence
This vulnerability is actually a hydra with three heads. The developers didn't just leave one door open; they left the windows and the chimney accessible too. Here is how the recovery logic failed in three distinct ways:
1. The ReDoS Trap (CWE-1333)
When pypdf encounters a PDF with a corrupted Cross-Reference (xref) table, it attempts to rebuild it by scanning the file for object definitions (1 0 obj). To do this, it used a Regular Expression. The regex was rb"[\r\n \t][ \t]*(\d+)[ \t]+(\d+)[ \t]+obj". If you know anything about regex, you likely spotted the issue: [\r\n \t] followed immediately by a greedy [ \t]*. Every whitespace byte is a valid place to start a match, and from each of those starting points the engine swallows the rest of the whitespace run, finds no digits, and backs out one byte at a time before giving up. If an attacker sends a file with megabytes of spaces, that is $O(n^2)$ work on an $n$-byte run. It is not exponential, but at 50 million spaces it is on the order of $10^{15}$ steps, which is forever as far as your worker is concerned.
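To see the growth for yourself, here is a small timing sketch using the pattern quoted above. The helper name `time_search` and the buffer sizes are my own choices for illustration, not anything from pypdf. Each doubling of the input should cost roughly four times as long on CPython, which is the signature of quadratic growth. Keep the sizes small; this gets slow quickly.

```python
import re
import time

# Pattern quoted in the article for locating "N G obj" markers.
OBJ_PATTERN = re.compile(rb"[\r\n \t][ \t]*(\d+)[ \t]+(\d+)[ \t]+obj")


def time_search(n_spaces: int) -> float:
    """Return the seconds spent searching a buffer made only of spaces."""
    data = b" " * n_spaces
    started = time.perf_counter()
    match = OBJ_PATTERN.search(data)
    elapsed = time.perf_counter() - started
    assert match is None  # there is no object marker to find
    return elapsed


if __name__ == "__main__":
    # Hypothetical sizes, chosen so the script finishes in seconds.
    for n in (1_000, 2_000, 4_000, 8_000):
        print(f"{n:>6} spaces: {time_search(n):.4f} s")
```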
2. The Infinite Object Hunt (CWE-400)
If the PDF trailer is messed up, pypdf tries to find the Root (Catalog) object manually. It looks at the /Size parameter in the trailer to know how many objects to check. The flaw? It trusted the /Size parameter implicitly. If I hand you a 1KB PDF but write /Size 999999999 in the trailer, pypdf says "You got it, boss!" and enters a massive loop trying to recover objects that don't exist. It had no timeout and no reasonable upper bound.
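The shape of the problem is easy to model. The sketch below is not pypdf's code: `find_root` and `load_object` are hypothetical names, and the loader is a stand-in for whatever actually fetches an object. It only shows how clamping an attacker-controlled count to a fixed ceiling bounds the work, using the 10,000 default this article quotes for the fix.

```python
from typing import Callable, Optional

RECOVERY_LIMIT = 10_000  # the default the article quotes for the fix


def find_root(
    declared_size: int,
    load_object: Callable[[int], Optional[dict]],
    limit: int = RECOVERY_LIMIT,
) -> Optional[dict]:
    """Scan object ids for a /Catalog, never probing more than `limit` ids."""
    # declared_size comes from the untrusted trailer, so clamp it first.
    for obj_id in range(min(max(declared_size, 0), limit)):
        obj = load_object(obj_id)
        if obj is not None and obj.get("/Type") == "/Catalog":
            return obj
    return None
```

With a trailer claiming /Size 999999999 and no catalog anywhere, this version gives up after `limit` lookups instead of a billion.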
3. The Ouroboros Page Tree
PDF page structures are technically trees (Directed Acyclic Graphs). But pypdf didn't enforce the "Acyclic" part during recovery. The _flatten function, responsible for flattening the page tree into a list, would recursively visit the /Kids of every page node. An attacker could define a page that lists itself as a child. pypdf would happily recurse A -> A -> A until Python screamed RecursionError and the process crashed.
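A toy model makes the failure concrete. In the sketch below the page tree is just a dict from object id to child ids, and both functions are hypothetical illustrations rather than pypdf's `_flatten`. The guarded variant uses a visited set, which is a more general technique than the parent/self comparison shown later in this article.

```python
from typing import Dict, List, Optional, Set

# Toy page tree: object id -> list of child ids. Ids without kids are pages.
PageTree = Dict[int, List[int]]


def flatten_naive(tree: PageTree, node: int) -> List[int]:
    """Recursively collect leaf pages, trusting the tree to be acyclic."""
    kids = tree.get(node, [])
    if not kids:
        return [node]
    pages: List[int] = []
    for kid in kids:
        pages.extend(flatten_naive(tree, kid))
    return pages


def flatten_guarded(
    tree: PageTree, node: int, seen: Optional[Set[int]] = None
) -> List[int]:
    """Same traversal, but refuse to visit any node a second time."""
    seen = set() if seen is None else seen
    if node in seen:
        # In a real tree no node is reachable twice, so this is malformed.
        raise ValueError(f"cyclic page reference at object {node}")
    seen.add(node)
    kids = tree.get(node, [])
    if not kids:
        return [node]
    pages: List[int] = []
    for kid in kids:
        pages.extend(flatten_guarded(tree, kid, seen))
    return pages
```

Feeding `{2: [2]}` to `flatten_naive` ends in a `RecursionError`, while `flatten_guarded` rejects it on the second visit.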
The Code: The Smoking Gun
Let's look at the fix in commit 2941657. The difference between the vulnerable code and the secure code highlights just how expensive "convenience" features can be.
1. Killing the Regex
The developers realized that using Regex on binary streams is often a trap. They replaced the backtracking nightmare with a manual byte scanner.
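To make "manual byte scanner" concrete, here is a minimal sketch of the idea: find each " obj" marker with `bytes.find`, then walk backwards over the generation and object numbers. This is my own illustration under stated assumptions, not the code from the pypdf commit. The helper names, the `MAX_DIGITS` cap, and the returned tuple layout are all hypothetical.

```python
from typing import Iterator, Tuple

BLANKS = b" \t"
LEADING = b"\r\n \t"
DIGITS = b"0123456789"
MAX_DIGITS = 10  # assumption: longer digit runs are treated as garbage


def _skip_blanks_back(data: bytes, end: int) -> int:
    """Move `end` left past any spaces or tabs that precede it."""
    while end > 0 and data[end - 1] in BLANKS:
        end -= 1
    return end


def _read_int_back(data: bytes, end: int) -> Tuple[int, int]:
    """Parse the digit run ending at `end`; value is -1 if it is unusable."""
    start = end
    while start > 0 and data[start - 1] in DIGITS:
        start -= 1
    if start == end or end - start > MAX_DIGITS:
        return -1, start
    return int(data[start:end]), start


def find_objects(data: bytes) -> Iterator[Tuple[int, int, int]]:
    """Yield (object_number, generation, offset) for every 'N G obj' marker."""
    pos = data.find(b" obj")
    while pos != -1:
        gen_end = _skip_blanks_back(data, pos)
        generation, gen_start = _read_int_back(data, gen_end)
        num_end = _skip_blanks_back(data, gen_start)
        # Require at least one blank between the two numbers.
        if generation >= 0 and num_end < gen_start:
            number, num_start = _read_int_back(data, num_end)
            preceded = num_start > 0 and data[num_start - 1] in LEADING
            if number >= 0 and preceded:
                yield number, generation, num_start
        pos = data.find(b" obj", pos + 1)
```

Each backwards walk stops at the first byte that is neither a blank nor a digit, so it cannot cross the previous marker and the total work stays proportional to the input size. A long run of spaces is skipped by `find` without any backtracking.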
Vulnerable (The ReDoS Monster):
# Quadratic on long whitespace runs: every space is a candidate start,
# and the greedy [ \t]* backtracks one byte at a time from each of them
re.search(rb"[\r\n \t][ \t]*(\d+)[ \t]+(\d+)[ \t]+obj", data)
Fixed (The Linear Scanner):
# Simple, linear time search. O(N).
start = data.find(b" obj", start)
# ... followed by manual backwards pointer arithmetic
# to find the object numbers preceding the " obj" marker.
2. Breaking the Cycle
To stop the infinite recursion in the page tree, they added a simple history check. It's Computer Science 101: detecting cycles in a graph.
The Logic Added to _flatten:
if indirect_ref in (self.indirect_ref, parent):
    # If the child is actually the parent (or self),
    # stop immediately.
    raise PdfReadError("Detected cyclic page references.")
3. Capping the Greed
Finally, to stop the infinite loop when searching for the Root object, they introduced a hard limit. Trust, but verify.
The Fix:
# Introduced a default limit of 10,000 objects for recovery
root_object_recovery_limit = 10000
if i >= root_object_recovery_limit:
    logger.warning("Root object recovery limit reached.")
    break
The Exploit: Weaponizing Whitespace
Exploiting this requires no authentication and no network access beyond the ability to upload a file. If you have an endpoint that accepts PDF resumes, invoices, or receipts, and it uses pypdf in default mode, you are vulnerable.
Attack Scenario: The "Blank" Check
We want to trigger the ReDoS. We need a file that looks like a PDF but forces the recovery logic to kick in (by corrupting the xref table), and then feeds the regex engine a massive buffer of whitespace.
Step 1: The Header
Start with a standard header: %PDF-1.7.
Step 2: The Payload
Insert a corrupted object definition. We start with a valid object start, but then pad it with 50MB of spaces before the obj keyword.
%PDF-1.7
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
... [corrupt xref] ...
startxref
0
%%EOF
[... 50,000,000 spaces ...] 9 0 obj
Step 3: The Trigger
When pypdf loads this, it sees the broken xref. It calls _rebuild_xref_table. It starts regex matching. It hits the wall of text. The Python process spikes to 100% CPU. If this is a web worker (like Gunicorn or Uvicorn), that worker is now dead to the world. Repeat 5 times to starve the worker pool.
Attack Scenario: The Infinite Pages
Alternatively, we can just crash the stack. We create a PDF where Object 2 (the Page Tree) claims that one of its kids is... Object 2.
2 0 obj
<< /Type /Pages /Kids [ 2 0 R ] /Count 1 >>
endobj
When reader.pages is accessed, pypdf attempts to list all pages. It goes 2 -> 2 -> 2... until the stack overflows.
The Impact: Low Score, High Drama
The CVSS score is 2.7 (Low). Do not let this fool you. CVSS scores for library vulnerabilities often fail to capture context. The metric assumes Availability: Low because the attack "only" causes a hang or crash. However, in the context of a modern web application, this is catastrophic.
Consider an automated invoice processing system handling thousands of files an hour. An attacker can perform a purely asymmetric attack: they upload a 2KB file that consumes 100% of a CPU core for minutes or hours. With a handful of requests, they can completely saturate a server cluster's processing power.
There is no data exfiltration here—no one is stealing your database credentials. But there is a very real cost in downtime and compute credits. If you are running on auto-scaling infrastructure (like AWS Lambda), this vulnerability could be used to perform a "Wallet Denial of Service," forcing your infrastructure to scale out infinitely to handle the "load" generated by stalled processes.
The Fix: Patching the Hull
The remediation is straightforward, but it requires action. pypdf does not auto-update.
Option 1: The Upgrade (Recommended) Bump your requirement to version 6.6.0 or higher.
pip install "pypdf>=6.6.0"
This version includes the non-regex scanner, the cycle detection, and the recovery limits. (The quotes matter: without them most shells treat >= as output redirection.)
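If you want a deployment check rather than a hope, something like the following sketch can run at startup. `importlib.metadata.version` is standard library. The helpers `parse_release` and `is_patched` are hypothetical names of mine, and the parsing is deliberately naive: it ignores pre-release suffixes, so a string like 6.6.0rc1 would count as patched.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple

FIXED_VERSION = (6, 6, 0)  # first fixed release according to the article


def parse_release(text: str) -> Tuple[int, ...]:
    """Extract leading numeric components, e.g. '6.5.0' -> (6, 5, 0)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_patched(installed: str) -> bool:
    """True when the version string is at or above FIXED_VERSION."""
    release = parse_release(installed)
    release = release + (0,) * (3 - len(release))  # pad '6.6' to (6, 6, 0)
    return release >= FIXED_VERSION


def installed_pypdf_is_patched() -> Optional[bool]:
    """Check the installed pypdf; None means it is not installed at all."""
    try:
        return is_patched(version("pypdf"))
    except PackageNotFoundError:
        return None
```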
Option 2: The Workaround (Strict Mode)
If you cannot upgrade immediately (perhaps due to legacy dependencies), you can mitigate the issue by disabling the helpful-but-dangerous recovery logic. Instantiate your readers with strict=True.
from pypdf import PdfReader

# Tell pypdf to stop trying to be a hero
reader = PdfReader("untrusted.pdf", strict=True)
Warning: This will cause your application to reject legitimately broken PDFs that it used to accept. You are trading availability (for malformed files) for security.
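Strict mode changes which files are accepted; it does not by itself put a clock on a parse. As a complementary sketch, and not something the pypdf advisory prescribes, untrusted parsing can be pushed into a child process with a wall-clock budget. `run_isolated` is a hypothetical helper built on the standard library's `subprocess.run`, and the `parse_pdf.py` script in the usage comment is assumed, not provided.

```python
import subprocess
import sys
from typing import Sequence


def run_isolated(cmd: Sequence[str], timeout_s: float) -> bool:
    """Run `cmd` in a child process; False if it fails or exceeds the budget."""
    try:
        completed = subprocess.run(
            cmd,
            timeout=timeout_s,  # the child is killed when this expires
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return False
    return completed.returncode == 0


if __name__ == "__main__":
    # Hypothetical usage: parse_pdf.py would open the file with pypdf.
    ok = run_isolated([sys.executable, "parse_pdf.py", "untrusted.pdf"], 10.0)
    print("accepted" if ok else "rejected")
```

The trade-off is process startup cost per file, in exchange for a stuck parse costing a bounded number of seconds instead of a worker.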
Technical Appendix
CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:N/VC:N/VI:N/VA:L/SC:N/SI:N/SA:N/E:U
Affected Systems
Affected Versions Detail
| Product | Affected Versions | Fixed Version |
|---|---|---|
| pypdf (py-pdf) | < 6.6.0 | 6.6.0 |
| Attribute | Detail |
|---|---|
| CWE IDs | CWE-1333 (ReDoS), CWE-400 (Resource Consumption) |
| CVSS v4.0 | 2.7 (Low) |
| Attack Vector | Local / Remote (via file upload) |
| Impact | Denial of Service (CPU Exhaustion / Crash) |
| Affected Component | pypdf.PdfReader (Non-strict mode) |
| Fix Version | 6.6.0 |
MITRE ATT&CK Mapping
The software uses a regular expression that is inefficient when processing specific inputs (ReDoS) or fails to restrict resource allocation (Infinite Loops).