Feb 25, 2026
pypdf < 6.7.2 fails to track visited offsets when parsing PDF cross-reference tables. A malicious PDF whose `/Prev` pointer references an offset the parser has already visited creates an infinite loop that pins a CPU core until the process is killed.
A high-severity Denial of Service (DoS) vulnerability (CVSS 7.5) exists in the `pypdf` library, a ubiquitous tool for PDF manipulation in the Python ecosystem. By crafting a PDF with a circular cross-reference (xref) chain, an attacker can trap the parser in an infinite loop. The result is immediate 100% utilization of a CPU core and a hung process, potentially taking down document processing pipelines, web services, or serverless functions.
PDFs are not documents; they are containers of sorrow. The PDF specification is a sprawling, decades-old beast that supports features most people have never heard of, including incremental updates. When you edit a PDF, the software doesn't necessarily rewrite the whole file. Instead, it appends a new 'body' to the end, containing the changed objects and a new 'cross-reference' (xref) section.
To read the file, a parser starts at the end (the trailer) and works its way backward, following pointers to previous versions of the document. This is handled by the /Prev key in the trailer dictionary, which points to the byte offset of the previous xref table.
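To make those pointers concrete, here is a minimal sketch that pulls the final `startxref` value and a trailer's `/Prev` value out of raw bytes with regular expressions. The helper names `find_startxref` and `find_prev` are hypothetical and not part of pypdf, and a real parser handles far more cases (cross-reference streams, for one); treat this as an illustration only.

```python
import re
from typing import Optional

# Hypothetical helpers for illustration; not part of pypdf.
_STARTXREF = re.compile(rb"startxref\s+(\d+)\s+%%EOF")
_PREV = re.compile(rb"/Prev\s+(\d+)")


def find_startxref(data: bytes) -> Optional[int]:
    """Return the offset named by the last startxref entry, if any."""
    matches = _STARTXREF.findall(data)
    return int(matches[-1]) if matches else None


def find_prev(trailer: bytes) -> Optional[int]:
    """Return the /Prev offset from a trailer dictionary, if present."""
    match = _PREV.search(trailer)
    return int(match.group(1)) if match else None


tail = b"trailer\n<< /Size 10 /Root 1 0 R /Prev 400 >>\nstartxref\n1000\n%%EOF\n"
print(find_startxref(tail))  # offset of the newest xref section
print(find_prev(tail))       # offset of the previous xref section
```

For this sample tail the two calls yield 1000 and 400: the parser would start at offset 1000 and then hop to offset 400.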
pypdf, one of the most popular Python libraries for handling PDFs, is tasked with traversing this chain to build a complete map of the document. It’s a standard linked-list traversal problem. But as any computer science freshman knows, if you traverse a linked list without checking for cycles, you are one bad pointer away from eternity. That is exactly what happened here.
The vulnerability lies in the _read_xref_tables_and_trailers method within pypdf/_reader.py. This function is responsible for reconstructing the document structure by hopping from the most recent xref table to the oldest one via the /Prev attribute.
The logic was deceptively simple: start at the startxref offset, parse the table, look for a /Prev key, update startxref to that new value, and repeat until startxref is None. It looks standard, but it lacks a critical defensive programming concept: distrust.
The code blindly assumed that the chain of /Prev pointers would eventually terminate or at least move linearly backward through the file. It did not account for a malicious PDF where the trailer at offset 1000 points to a trailer at offset 500, which in turn points back to offset 1000. Once the parser enters this loop, it spins forever, repeatedly re-parsing and re-caching the same objects and consuming 100% of a CPU core until the process is killed or the universe ends.
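The 1000 → 500 → 1000 cycle can be modeled in a few lines. This is a toy sketch, not pypdf code: the chain is a plain dictionary mapping each xref offset to its `/Prev` target, and `naive_walk` and its `max_steps` cap are assumptions added only so the demo terminates.

```python
from typing import Dict, Optional

# Toy model: each key is an xref offset, each value is its /Prev target.
CYCLIC_CHAIN: Dict[int, Optional[int]] = {1000: 500, 500: 1000}


def naive_walk(chain: Dict[int, Optional[int]], start: int, max_steps: int) -> int:
    """Follow /Prev links with no cycle check; return how many tables were parsed.

    max_steps exists only so this demo stops; the vulnerable loop had no such cap.
    """
    steps = 0
    offset: Optional[int] = start
    while offset is not None and steps < max_steps:
        offset = chain.get(offset)
        steps += 1
    return steps


print(naive_walk(CYCLIC_CHAIN, 1000, max_steps=10_000))            # burns the whole budget
print(naive_walk({1000: 500, 500: None}, 1000, max_steps=10_000))  # stops after 2 tables
```

A well-formed chain ends after two hops, while the cyclic one consumes every step it is given. Remove the cap and it never returns.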
Let's look at the vulnerable code. It's a textbook example of an infinite loop waiting to happen. The variable startxref is the control condition, but there is no history of where we've been.
```python
# Vulnerable implementation in pypdf/_reader.py
def _read_xref_tables_and_trailers(self, stream, startxref):
    # ... initialization ...
    while startxref is not None:
        # The parser seeks to the offset blindly
        stream.seek(startxref, 0)
        # ... complex parsing logic ...
        # The parser reads the next link in the chain
        # If this points to an offset we just visited, we are doomed.
        startxref = trailer.get("/Prev")
```

The fix, introduced in version 6.7.2, is elegant and standard: keep a set of visited offsets. If we see an offset again, we know we are in a loop, and we bail out.
```python
# Patched implementation
def _read_xref_tables_and_trailers(self, stream, startxref):
    # ... initialization ...
    visited_xref_offsets: set[int] = set()  # added in 6.7.2
    while startxref is not None:
        # Added in 6.7.2: check if we've been here before
        if startxref in visited_xref_offsets:
            logger_warning(
                f"Circular xref chain detected at offset {startxref}, stopping",
                __name__,
            )
            break
        visited_xref_offsets.add(startxref)
        stream.seek(startxref, 0)
        # ... parsing logic ...
```

This simple addition of `visited_xref_offsets` completely neutralizes the attack. It transforms an infinite loop into a logged warning.
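Because the patched snippet elides the actual parsing, here is a self-contained model of the same visited-set idea that can be run on its own. The function name `walk_xref_chain` and the dictionary representation of the chain are assumptions for illustration; only the visited-set check mirrors the fix.

```python
from typing import Dict, List, Optional, Tuple


def walk_xref_chain(
    chain: Dict[int, Optional[int]], start: Optional[int]
) -> Tuple[List[int], bool]:
    """Return (offsets visited in order, whether a cycle was detected)."""
    visited = set()
    order: List[int] = []
    offset = start
    while offset is not None:
        if offset in visited:
            return order, True  # seen before: stop instead of looping
        visited.add(offset)
        order.append(offset)
        offset = chain.get(offset)  # next /Prev target, or None at the end
    return order, False


print(walk_xref_chain({1000: 400, 400: None}, 1000))  # well-formed chain
print(walk_xref_chain({1000: 500, 500: 1000}, 1000))  # two-table cycle
print(walk_xref_chain({550: 550}, 550))               # self-reference
```

Each table is processed at most once, so the walk is bounded by the number of distinct offsets no matter what the file claims.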
Exploiting this is trivial if you understand raw PDF syntax. We don't need complex heap massaging or ROP chains; we just need a text editor. A PDF trailer usually looks like this:
```
trailer
<< /Size 10 /Root 1 0 R /Prev 400 >>
startxref
1000
%%EOF
```
In a valid file, the /Prev 400 implies there is another cross-reference table at byte offset 400. To exploit pypdf, we construct a file where the cross-reference table points to itself, or points to a second table that points back to the first.
Here is the logic flow of the attack:
1. Place an `xref` keyword at a known position (let's say it's at offset 550).
2. In the `trailer` dictionary, set `/Prev 550`.
3. Make `startxref` point to 550. The parser reads the table, finds `/Prev 550`, sets `startxref` to 550, and loops. And loops. And loops.

The PoC provided in the repository does exactly this, using Python string formatting to inject the calculated offsets dynamically. It creates a 'self-referential' PDF that is syntactically valid enough to trick the parser into the loop but logically broken.
You might think, "It's just an infinite loop, who cares?" In the age of automated document processing, you should care deeply. Imagine a backend service that accepts user-uploaded resumes or invoices. This service likely spins up a worker process (or an AWS Lambda) to parse the PDF, extract text, or render a thumbnail.
If an attacker uploads a 5KB malformed PDF, the worker that parses it pins a CPU core at 100% and never returns until the process is killed.
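One common way to contain a hang like this, independent of the library fix, is to parse untrusted files in a child process with a hard timeout. The sketch below is an assumption about how such a wrapper might look: `run_isolated` and `pdf_parses_in_time` are hypothetical names, and the one-line parse command assumes that constructing `pypdf.PdfReader` from a path is enough to trigger xref parsing.

```python
import subprocess
import sys
from typing import Sequence


def run_isolated(argv: Sequence[str], timeout_s: float) -> bool:
    """Run a command in a child process; True only if it exits cleanly in time."""
    try:
        completed = subprocess.run(
            list(argv),
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so nothing keeps spinning.
        return False
    return completed.returncode == 0


# Assumed usage: open the untrusted file with pypdf in a throwaway interpreter.
PARSE_SNIPPET = "import sys; from pypdf import PdfReader; PdfReader(sys.argv[1])"


def pdf_parses_in_time(path: str, timeout_s: float = 10.0) -> bool:
    """Hypothetical pre-check: does pypdf finish opening this file within the limit?"""
    return run_isolated([sys.executable, "-c", PARSE_SNIPPET, path], timeout_s)
```

The timeout turns an unbounded hang into a bounded cost per upload. It does not replace upgrading, since every malicious file still burns CPU for the full timeout.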
CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H

| Product | Affected Versions | Fixed Version |
|---|---|---|
| pypdf (py-pdf) | < 6.7.2 | 6.7.2 |

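A quick way to check an environment against the affected range is to compare the installed version with 6.7.2. This is a rough sketch: `parse_version`, `is_affected`, and `installed_pypdf_is_affected` are hypothetical helpers, and the parsing assumes plain dotted-integer versions (pre-release suffixes are not handled).

```python
from importlib import metadata
from typing import Optional, Tuple

FIXED_VERSION = (6, 7, 2)  # first patched release


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '6.7.1' into (6, 7, 1); assumes plain dotted integers."""
    return tuple(int(part) for part in version.split(".")[:3])


def is_affected(version: str) -> bool:
    """True if the given pypdf version predates the fix."""
    return parse_version(version) < FIXED_VERSION


def installed_pypdf_is_affected() -> Optional[bool]:
    """Check the installed pypdf, or return None if it is not installed."""
    try:
        return is_affected(metadata.version("pypdf"))
    except metadata.PackageNotFoundError:
        return None


print(is_affected("6.7.1"), is_affected("6.7.2"))
```
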
| Attribute | Detail |
|---|---|
| CWE | CWE-835 (Infinite Loop) |
| CVSS v3.1 | 7.5 (High) |
| Attack Vector | Network (via file upload) |
| Impact | Denial of Service (CPU Exhaustion) |
| Exploit Status | PoC Available |
| EPSS Score | 0.04% |
CWE-835: Loop with Unreachable Exit Condition ('Infinite Loop')