This vulnerability isn't just one bug; it's a Hydra of three distinct logic flaws, all stemming from the library's "non-strict" read mode. When pypdf detects a broken startxref (the pointer to the file's map), it triggers a fallback routine to scan the entire file for objects. Here is where the demons lived.

1. The Regex Bomb

To find objects, the library used a regular expression to scan binary data. The regex looked something like rb"[\r\n \t][ \t]*(\d+)[ \t]+(\d+)[ \t]+obj". If you know anything about ReDoS (Regular Expression Denial of Service), your spidey senses should be tingling. The leading character class and the [ \t]* that follows both match spaces and tabs, so on a long run of whitespace that is not followed by digits the engine restarts the match at every position and rescans the rest of the run each time. By injecting massive blocks of whitespace or null bytes between object definitions, an attacker could force the regex engine into this backtracking or simply make Python churn through gigabytes of useless scans, pinning the CPU at 100%.

2. The Infinite Page Loop

PDFs are trees. A /Pages object points to /Kids, which can be other /Pages or leaf /Page nodes. The vulnerability here was simple: recursion without a cycle check. If a /Pages object pointed back to itself (or to a parent) in its /Kids array, pypdf would happily traverse that loop forever, or at least until it hit the Python recursion limit.

3. The Wild Goose Chase

If the PDF Catalog (/Root) was missing, pypdf would attempt to find it by iterating through every potential object ID up to the number defined in /Size in the trailer. Since /Size is just a number in the file, an attacker could set it to 999,999,999. The library would then attempt to look up nearly a billion non-existent objects, turning a millisecond parse into a runtime measured in heat-death-of-the-universe epochs.



GHSA-4F6G-68PF-7VHV


Death by PDF: Infinite Loops and Regex Nightmares in pypdf

Alon Barad

Software Engineer

Feb 22, 2026 · 7 min read

PoC Available

Executive Summary (TL;DR)

pypdf < 6.6.0 is vulnerable to DoS. If you feed it a malformed PDF with a broken cross-reference table or a cyclic page tree, the library enters runaway recursion or a CPU-exhausting regex scan while trying to 'fix' the file. Either outcome takes the worker process out of service.

A deep dive into CVE-2026-22691, a Denial of Service vulnerability in the pypdf library caused by catastrophic backtracking in regex and infinite recursion loops during error recovery. When pypdf tries too hard to fix a broken file, it breaks your server instead.

Attack Flow Diagram

The Hook: When "Helpful" Becomes "Harmful"

We often praise software for being "robust." If I hand a library a slightly malformed file, I want it to figure it out, not throw a tantrum and crash. But in the world of parsing—especially the hellscape that is the PDF specification—robustness is often a synonym for "attack surface."

Enter pypdf, a pure-Python library used by thousands of backend services to merge, split, and scrape PDFs. It has a feature that sounds great on paper: if a PDF is corrupted (missing its Cross-Reference Table or Root object), pypdf will roll up its sleeves and try to rebuild the missing structures byte-by-byte. It tries to be the hero.

CVE-2026-22691 is the story of that heroism backfiring. By crafting a PDF that looks broken in just the right way, we can trigger these recovery mechanisms. The problem? The recovery logic contained algorithmic complexities ranging from $O(N^2)$ regex searches to literal infinite recursion. It's the computational equivalent of asking for directions and being told to drive until the earth runs out of road.

The Flaw: A Trinity of Resource Exhaustion

The Code: Autopsy of a Fix

The patch provided in version 6.6.0 (Commit 294165726b646bb7799be1cc787f593f2fdbcf45) is a masterclass in performant defensive coding. The maintainers didn't just tweak the regex; they ripped it out entirely.

Killing the Regex

The old code relied on a regular expression to find the next object. The new code implements a manual scanner, _find_pdf_objects, built on data.find(b" obj"). This is not only safer but also much faster on hostile input, because bytes.find is a plain substring search that never backtracks.

import re

# OLD (Vulnerable)
# relying on regex to parse structure
regex = re.compile(rb"[\r\n \t][ \t]*(\d+)[ \t]+(\d+)[ \t]+obj")

# NEW (Fixed, simplified)
# Manual byte scanning with bounds checking; `data` holds the raw PDF bytes
loc = 0
while True:
    loc = data.find(b" obj", loc)
    if loc == -1:
        break
    # ... extensive logic to verify it's a real object ...
    loc += len(b" obj")  # advance past this hit so the scan always terminates
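To make the "verify it's a real object" step more concrete, here is a minimal sketch of a find-based scanner. It is not the code from the patch: the names `find_objects`, `_back_over`, and `MAX_DIGITS` are hypothetical, and the validation is deliberately simpler than what a real parser needs. The idea is to locate each ` obj` marker with `bytes.find`, then walk backwards over the generation and object numbers.

```python
from typing import Iterator, Tuple

MAX_DIGITS = 10  # assumed cap; longer digit runs are cut to their last 10 digits


def _back_over(data: bytes, end: int, chars: bytes, limit: int) -> int:
    """Walk left from `end` over bytes contained in `chars`, at most `limit` steps."""
    start = end
    while start > 0 and end - start < limit and data[start - 1] in chars:
        start -= 1
    return start


def find_objects(data: bytes) -> Iterator[Tuple[int, int, int]]:
    """Yield (object number, generation, offset) for each 'N G obj' header."""
    loc = 0
    while True:
        loc = data.find(b" obj", loc)
        if loc == -1:
            return
        marker = loc
        loc += len(b" obj")  # always advance, so the loop terminates
        gen_end = _back_over(data, marker, b" \t", len(data))
        gen_start = _back_over(data, gen_end, b"0123456789", MAX_DIGITS)
        num_end = _back_over(data, gen_start, b" \t", len(data))
        num_start = _back_over(data, num_end, b"0123456789", MAX_DIGITS)
        if gen_start == gen_end or num_end == gen_start or num_start == num_end:
            continue  # not a well-formed "N G obj" header
        yield int(data[num_start:num_end]), int(data[gen_start:gen_end]), num_start


if __name__ == "__main__":
    sample = b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\n" + b" " * 1000 + b"\n12 3 obj\n"
    print(list(find_objects(sample)))
```

Each backward walk stops at the first byte that is neither whitespace nor a digit, so the work done per marker is bounded by the gap since the previous marker and the whole scan stays linear in the file size.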

Stopping the Loops

For the infinite recursion in the page tree, the fix involved adding a visited set to track object IDs during the flattening process. If the code encounters a node it has already seen in the current traversal path, it raises a PdfReadError immediately instead of recursing into oblivion.

# Simplified logic from the patch
if indirect_ref.idnum in visited_pages:
    raise PdfReadError("Cyclic reference detected in page tree")
visited_pages.add(indirect_ref.idnum)
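The guard can be exercised without a real PDF. The following is a sketch under stated assumptions: a plain dictionary stands in for the parsed objects, `flatten` is a hypothetical helper, and the local `PdfReadError` class is only a stand-in for `pypdf.errors.PdfReadError`.

```python
class PdfReadError(Exception):
    """Stand-in for pypdf.errors.PdfReadError, so the sketch has no dependencies."""


def flatten(objects, obj_id, visited=None, pages=None):
    """Collect leaf /Page ids under obj_id, refusing to visit any node twice."""
    visited = set() if visited is None else visited
    pages = [] if pages is None else pages
    if obj_id in visited:
        raise PdfReadError("Cyclic reference detected in page tree")
    visited.add(obj_id)
    node = objects[obj_id]
    if node["Type"] == "/Page":
        pages.append(obj_id)
    else:
        for kid in node.get("Kids", []):
            flatten(objects, kid, visited, pages)
    return pages


if __name__ == "__main__":
    healthy = {
        2: {"Type": "/Pages", "Kids": [3, 4]},
        3: {"Type": "/Page"},
        4: {"Type": "/Page"},
    }
    print(flatten(healthy, 2))

    cyclic = {2: {"Type": "/Pages", "Kids": [2]}}  # object 2 lists itself as a kid
    try:
        flatten(cyclic, 2)
    except PdfReadError as exc:
        print("rejected:", exc)
```

Like the simplified snippet above, this version never removes ids from the set, so it also rejects a node that is merely referenced twice without forming a loop. That is a stricter policy than tracking only the current path, and it may or may not match the library's exact behavior.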

Capping the Recovery

Finally, they introduced root_object_recovery_limit (defaulting to 10,000). Even if the file says it has a billion objects, pypdf will now only search a reasonable number before giving up and declaring the file dead.
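The effect of such a cap is easy to model. This is a minimal sketch, not pypdf's implementation: `find_catalog`, `RecoveryLimitExceeded`, and the dictionary of objects are all hypothetical, and only the default of 10,000 is taken from the description above.

```python
class RecoveryLimitExceeded(Exception):
    """Hypothetical error raised when the recovery search runs out of budget."""


def find_catalog(objects, declared_size, limit=10_000):
    """Return the id of the first /Catalog object, checking at most `limit` ids."""
    for checked, obj_id in enumerate(range(1, declared_size)):
        if checked >= limit:
            raise RecoveryLimitExceeded(
                f"no /Catalog found within the first {limit} object ids"
            )
        node = objects.get(obj_id)
        if node is not None and node.get("Type") == "/Catalog":
            return obj_id
    return None  # every id below the declared /Size was checked


if __name__ == "__main__":
    # The trailer lies about /Size, but the search still stops after `limit` ids.
    try:
        find_catalog({}, declared_size=999_999_999)
    except RecoveryLimitExceeded as exc:
        print("gave up:", exc)
```

The attacker still controls the declared size, but no longer controls how much work the search performs.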

The Exploit: Crafting the Poisoned Chalice

To exploit this, we don't need complex shellcode or heap grooming. We just need a text editor and a disregard for PDF standards. We are targeting the "error recovery" path, so step one is to break the file intentionally.

Recipe 1: The Infinite Loop (Recursion Error)

We create a valid PDF structure but mess with the Page Tree.

1. Define a Root object 1 0 obj pointing to Pages 2 0 obj.
2. Define Pages 2 0 obj with a /Kids array.
3. Put a reference to 2 0 R inside the /Kids of 2 0 obj.
4. Corrupt the startxref so pypdf is forced to traverse the tree manually to rebuild structure.

When pypdf evaluates len(reader.pages) or accesses a page, it dives into object 2 0, sees 2 0 listed as a kid, dives into that, and repeats until the process crashes.
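The failure mode can be reproduced without touching a real PDF. A minimal sketch, assuming a plain dictionary stands in for the parsed objects (the `objects` layout, `count_pages_naive`, and `demo` are hypothetical and are not pypdf code):

```python
# Hypothetical in-memory stand-in for PDF objects, keyed by object number.
objects = {
    1: {"Type": "/Catalog", "Pages": 2},
    2: {"Type": "/Pages", "Kids": [2]},  # object 2 lists itself as a kid
}


def count_pages_naive(obj_id):
    """Count leaf pages by plain recursion, with no record of visited nodes."""
    node = objects[obj_id]
    if node["Type"] == "/Page":
        return 1
    return sum(count_pages_naive(kid) for kid in node["Kids"])


def demo():
    """Return how the naive traversal ends on the cyclic tree."""
    try:
        count_pages_naive(objects[1]["Pages"])
    except RecursionError:
        return "RecursionError"
    return "finished"


if __name__ == "__main__":
    print(demo())
```

Nothing in the traversal remembers that object 2 has already been entered, so the only thing that stops it is the interpreter's recursion limit.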

Recipe 2: The CPU Burner (Regex/Scan Exhaustion)

1. Create a header %PDF-1.7.
2. Insert 10MB of spaces or tab characters.
3. Insert a broken object definition at the end.
4. Corrupt the startxref.

When the library tries to rebuild the XREF table, the leading [\r\n \t][ \t]* of the vulnerable regex attempts to match that massive block of whitespace, and it retries from every position inside the block. Depending on the engine implementation and the exact pattern, this can tie up the execution thread for an impractically long time.
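The growth rate is easy to measure on a small scale. The sketch below times the pattern quoted earlier against runs of spaces of increasing length. `time_search` is a hypothetical helper, and the sizes are kept tiny so the script finishes quickly. If the cost is quadratic, doubling the input should roughly quadruple the time, although real numbers depend on the interpreter and the hardware.

```python
import re
import time

# Pattern as quoted in the article; the real one in pypdf may have differed slightly.
OBJ_HEADER = re.compile(rb"[\r\n \t][ \t]*(\d+)[ \t]+(\d+)[ \t]+obj")


def time_search(n_spaces):
    """Return the seconds spent searching a run of n_spaces spaces."""
    data = b" " * n_spaces
    start = time.perf_counter()
    match = OBJ_HEADER.search(data)
    elapsed = time.perf_counter() - start
    assert match is None  # whitespace alone never contains an "N G obj" header
    return elapsed


if __name__ == "__main__":
    for n in (1_000, 2_000, 4_000):
        print(f"{n:>6} spaces: {time_search(n):.4f} s")
```

Extrapolating from these small inputs to a 10MB block gives a sense of why the scan never finishes in practice.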

The Impact: Why Denial of Service Matters

It is easy to dismiss DoS bugs as "low severity" compared to RCE, but in the context of modern document processing pipelines, this is a killer.

Imagine a Fintech company that allows users to upload invoices for automated processing. They run a Celery worker or a Kubernetes pod that picks up PDFs and uses pypdf to extract text. An attacker uploads a single 4KB malicious PDF.

The worker picks it up. It enters the recovery loop. That CPU core hits 100%. The worker stops responding to heartbeats. The orchestration layer kills it and starts a new one. If the job queue is persistent (which it usually is), the new worker picks up the same bad file and dies again.

This is the "Poison Message" pattern. A single file can effectively jam an entire processing queue, denying service to all legitimate users until an engineer manually intervenes and purges the queue. For cloud-based services, this also translates to direct financial damage as autoscalers spin up more (useless) compute nodes to handle the perceived load.

The Fix: Update or Strict Mode

Remediation is straightforward, but it requires action. The primary fix is to update the library.

1. Upgrade to 6.6.0+

This is the only way to get the optimized scanning logic and the recursion guards.

# Quote the requirement so the shell does not treat ">" as output redirection
pip install "pypdf>=6.6.0"

2. Use Strict Mode (Mitigation)

If you are stuck on an older version (perhaps due to legacy dependencies), you must instantiate the PdfReader in strict mode. This disables the "helpful" recovery features that contain the vulnerabilities. If a file is broken, it will simply raise an error rather than hanging your server.

from pypdf import PdfReader
 
# SAFE: Will raise an exception on malformed files instead of hanging
reader = PdfReader("suspicious.pdf", strict=True)

Do not rely on catching RecursionError alone as a mitigation, as that does not protect against the CPU exhaustion caused by the regex scanning or the O(N) object search.
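Because an exception handler cannot interrupt a CPU-bound scan, one complementary safeguard is to parse untrusted files in a separate process with a hard time budget. This is a sketch of that idea and not something the advisory or the pypdf project prescribes: `run_isolated` and `CHILD_CODE` are hypothetical names, and the 10-second budget and file name are arbitrary assumptions.

```python
import subprocess
import sys
from typing import Optional

# Child script (assumed workload): open the PDF in strict mode and print the page count.
CHILD_CODE = """
import sys
from pypdf import PdfReader
print(len(PdfReader(sys.argv[1], strict=True).pages))
"""


def run_isolated(code: str, arg: str, timeout_s: float) -> Optional[str]:
    """Run `code` in a fresh interpreter; return its stdout, or None on timeout or error."""
    try:
        done = subprocess.run(
            [sys.executable, "-c", code, arg],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None  # the child was killed after exceeding its time budget
    return done.stdout.strip() if done.returncode == 0 else None


if __name__ == "__main__":
    result = run_isolated(CHILD_CODE, "suspicious.pdf", timeout_s=10)
    print("page count:" if result is not None else "rejected:", result)
```

A hung or crashing parse then costs one child process and a bounded amount of time instead of the worker itself.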

Official Patches

pypdf: pypdf 6.6.0 Release Notes


Technical Appendix

CVSS Score

5.3 / 10

CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:L

EPSS Probability

0.02%

Top 95% most exploited

Affected Systems

pypdf < 6.6.0

Affected Versions Detail

Product	Affected Versions	Fixed Version
pypdf	< 6.6.0	6.6.0

Attribute	Detail
CWE ID	CWE-1333 (Inefficient Regular Expression Complexity)
CWE ID	CWE-400 (Uncontrolled Resource Consumption)
Attack Vector	Network / File Upload
CVSS v3.1	5.3 (Medium)
Impact	Denial of Service (DoS)
Exploit Status	PoC Available (in test suite)
KEV Status	Not Listed

MITRE ATT&CK Mapping

T1499: Endpoint Denial of Service (Tactic: Impact)

T1499.003: Application Exhaustion Flood (Tactic: Impact)

CWE-1333

Inefficient Regular Expression Complexity

The software uses a regular expression that can be made to run in exponential time, or processes recursive structures without adequate depth limiting.

Known Exploits & Detection

GitHub (pypdf Test Suite): The library's own regression tests (test_rebuild_xref_table__speed) serve as a PoC for the DoS condition.

Vulnerability Timeline

2026-01-09: pypdf 6.6.0 released with fix

2026-01-10: CVE-2026-22691 Assigned

2026-01-22: Advisory Analysis Published