CVEReports

Automated vulnerability intelligence platform. Comprehensive reports for high-severity CVEs generated by AI.


CVE-2025-66516
CVSS 8.4 · EPSS 2.73%

Tika Taka Boom: The Core XXE Hiding in Your PDFs

Amit Schendel
Senior Security Researcher

Feb 10, 2026·6 min read·5 visits

PoC Available

Executive Summary (TL;DR)

Critical XXE in Apache Tika (tika-core < 3.2.2). Attackers can embed malicious XML payloads inside PDF XFA forms. When Tika parses the PDF to extract metadata or text, the XML payload triggers, allowing file read (LFI) or network requests (SSRF). Upgrade `tika-core` immediately.

Apache Tika, the ubiquitously trusted content analysis toolkit, suffers from a critical XML External Entity (XXE) vulnerability within its core library. Specifically affecting how Tika handles XFA (XML Forms Architecture) data embedded within PDF files, this flaw allows attackers to exfiltrate local files or perform Server-Side Request Forgery (SSRF) simply by submitting a malicious document. This is a scope expansion of the earlier CVE-2025-54988, revealing that the rot wasn't just in the PDF module, but deep in `tika-core`.

The Hook: The Swiss Army Knife with a Loose Blade

Apache Tika is the silent workhorse of the enterprise world. It is the library that powers Solr, Elasticsearch, and thousands of corporate Content Management Systems (CMS). Its job is simple but thankless: take a binary blob—be it a Word doc, a PDF, or an image—and extract the text and metadata. It is the "universal translator" of file formats.

But here is the problem with being a universal translator: you have to speak everyone's language, even the dangerous ones. And no language is quite as dangerous in the Java ecosystem as XML. Tika is designed to trust input, parse it, and return structured data. When you feed it a PDF, Tika doesn't just read the text; it looks for forms, metadata, and embedded objects.

This vulnerability exploits a specific feature in PDFs called XFA (XML Forms Architecture). XFA is essentially a way to embed a dynamic XML document inside a static PDF. If Tika blindly passes that embedded XML to a parser that hasn't been strictly disciplined, you get XXE. And in this case, the disciplinarian—tika-core—was asleep at the wheel.

The Flaw: A Tale of Two CVEs

To understand this bug, you have to appreciate the comedy of errors that is XML parsing in Java. By default, Java XML parsers are often promiscuous—they will happily resolve external entities defined in a DTD (Document Type Definition). This allows an XML file to say, "Hey, before you parse me, go fetch the contents of /etc/passwd and stick it in this variable."

This vulnerability is actually the "Director's Cut" of a previous bug, CVE-2025-54988. Originally, developers thought the issue was isolated to the tika-pdf-module. They patched it there. But as security researchers dug deeper, they realized the rot went further down. The unsafe XML configuration wasn't just in the PDF handling logic; it was in the tika-core utility that the PDF module relied on.

This means if you updated your PDF parser but left tika-core on an older version, you were still vulnerable. It’s like locking your front door (the PDF module) but realizing your house frame (the Core) is made of cheese. The specific failure was in not setting the XMLConstants.FEATURE_SECURE_PROCESSING correctly or, more likely, failing to explicitly disable DOCTYPE declarations when processing the extracted XFA stream.

The Code: XML Parser Configurations

Let's look at what typically goes wrong in these scenarios. In Java, instantiating a SAXParserFactory or DocumentBuilderFactory looks innocent enough, but it's a loaded gun if you don't toggle the safety catch.

The Vulnerable Pattern (Conceptual):

// Inside tika-core logic handling XFA
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setNamespaceAware(true);
// Missing: The lines that tell the parser to ignore DTDs
SAXParser parser = factory.newSAXParser();
parser.parse(xfaInputStream, handler);

When Tika encounters a PDF with XFA data, it extracts the XML stream and feeds it to this parser. Because the disallow-doctype-decl feature isn't set, the parser obeys the attacker's DTD.
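
To make the conceptual pattern concrete, here is a minimal sketch that has nothing to do with Tika's real code. It assumes a JDK whose default JAXP settings still permit external entities, and it uses a harmless temp file instead of a sensitive path. The class and method names (DefaultParserDemo, extractText, readMarkerThroughEntity) are hypothetical, and the marker must not contain markup characters such as < or &.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; it is not part of Apache Tika.
class DefaultParserDemo {

    // Parses the XML with an unconfigured factory and returns the character data it saw.
    static String extractText(String xml) {
        StringBuilder text = new StringBuilder();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            // Deliberately no hardening features, mirroring the conceptual pattern.
            SAXParser parser = factory.newSAXParser();
            parser.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new DefaultHandler() {
                    @Override
                    public void characters(char[] ch, int start, int length) {
                        text.append(ch, start, length);
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return text.toString();
    }

    // Writes a harmless marker to a temp file and references it through an external entity.
    static String readMarkerThroughEntity(String marker) {
        try {
            Path file = Files.createTempFile("xxe-demo", ".txt");
            try {
                Files.write(file, marker.getBytes(StandardCharsets.UTF_8));
                String xml = "<?xml version=\"1.0\"?>"
                    + "<!DOCTYPE r [<!ENTITY xxe SYSTEM \"" + file.toUri() + "\">]>"
                    + "<r>&xxe;</r>";
                return extractText(xml);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readMarkerThroughEntity("harmless-marker"));
    }
}
```

If the returned text contains the marker, the parser followed the entity to the file system. That is the behavior an attacker relies on when the same kind of parser is fed an XFA stream.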

The Fix (Apache Tika 3.2.2):

The remediation involves rigorously locking down the parser factory in tika-core. The patch ensures that any XML parsing initiated by this utility explicitly forbids external entities.

// The Hardened Configuration
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
// The critical line that kills XXE dead:
factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);

This change in tika-core propagates safety to all modules consuming these utilities, finally closing the hole that the previous CVE missed.
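
A self-contained way to check that these settings behave as described is sketched below. It assumes the JDK's built-in JAXP implementation, which recognizes all of the feature URIs listed above. The class and method names (HardenedParserDemo, hardenedFactory, accepts) are hypothetical and are not Tika's API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper; it is not the actual tika-core patch.
class HardenedParserDemo {

    // Builds a factory with the same features as the hardened configuration.
    static SAXParserFactory hardenedFactory() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
        factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
        factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        return factory;
    }

    // Returns true if the document parses and false if the parser rejects it.
    static boolean accepts(String xml) {
        SAXParser parser;
        try {
            parser = hardenedFactory().newSAXParser();
        } catch (Exception e) {
            // A parser that cannot be hardened should fail loudly, not parse anyway.
            throw new IllegalStateException("hardening features not supported", e);
        }
        try {
            parser.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // includes the fatal error raised for a DOCTYPE
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String payload = "<!DOCTYPE r [<!ENTITY xxe SYSTEM \"file:///etc/passwd\">]><r>&xxe;</r>";
        System.out.println("benign accepted: " + accepts("<r>ok</r>"));
        System.out.println("payload accepted: " + accepts(payload));
    }
}
```

Failing hard when a feature cannot be set is a deliberate choice in this sketch: silently falling back to an unhardened parser would recreate the original problem.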

The Exploit: Crafting the Poisoned PDF

Exploiting this requires a bit of "Inception." We aren't sending an XML file directly; we are wrapping an XML payload inside a PDF envelope. The target (Tika) will unwrap it for us.

Step 1: The Container. We create a valid PDF structure. Inside the PDF Catalog, we define an /AcroForm dictionary that references an /XFA stream. This tells the PDF reader (or parser), "Hey, the form data for this document is stored in this XML blob."

Step 2: The Payload. Inside that XFA stream, we inject the classic XXE payload. Here is a simplified view of the malicious object structure:

5 0 obj
<< /XFA 6 0 R >> % Reference to the XML stream
endobj
 
6 0 obj
<< /Length 123 >>
stream
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xdp:xdp [
  <!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<xdp:xdp xmlns:xdp="http://ns.adobe.com/xdp/">
  <template>
    <field name="stealing_your_data">
      <value><text>&xxe;</text></value>
    </field>
  </template>
</xdp:xdp>
endstream
endobj

Step 3: The Trigger. We upload this PDF to a Tika endpoint (e.g., PUT /tika). Tika accepts the PDF, parses the structure, finds the XFA stream, and passes it to its XML parser. The parser sees &xxe;, resolves it to the contents of /etc/passwd, and places that text into the <value> field.

Step 4: The Exfiltration. Tika then returns the "text content" of the PDF to the user (or logs it, or indexes it). The attacker simply reads the response body, and there, amidst the PDF metadata, is the content of /etc/passwd.
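
On the defensive side, the structure described in the steps above suggests a cheap pre-filter that can run before a file ever reaches Tika. The following is a rough heuristic sketch, not a detection rule from the advisory. The class and method names are hypothetical, and it only sees uncompressed content, so an XFA stream stored with a filter such as /FlateDecode will slip past it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical pre-filter; a heuristic only, not a replacement for patching.
class XfaDoctypeHeuristic {

    // Flags PDFs that reference XFA data and also carry a raw DTD or entity declaration.
    static boolean looksSuspicious(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data survives decoding.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        boolean hasXfa = raw.contains("/XFA");
        boolean hasDtd = raw.contains("<!DOCTYPE") || raw.contains("<!ENTITY");
        return hasXfa && hasDtd;
    }

    public static void main(String[] args) {
        byte[] sample = ("%PDF-1.7\n5 0 obj\n<< /XFA 6 0 R >>\nendobj\n"
            + "6 0 obj\nstream\n<!DOCTYPE x [<!ENTITY xxe SYSTEM \"file:///etc/passwd\">]>\nendstream\n")
            .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(looksSuspicious(sample));
    }
}
```

A hit is a reason to quarantine the file for inspection. A miss proves nothing, which is why upgrading tika-core remains the actual fix.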

The Impact: Why Should You Panic?

An XXE in a library like Tika is catastrophic because of where Tika lives. It is rarely sitting on a user's laptop; it is usually deployed in the backend of massive data pipelines.

1. Cloud Metadata Theft (SSRF): If Tika is running in AWS, GCP, or Azure, an attacker doesn't care about /etc/passwd. They will point the entity at the instance metadata service instead; on AWS, for example, that is http://169.254.169.254/latest/meta-data/iam/security-credentials/ (reachable with a plain GET when IMDSv1 is enabled). Now they have your IAM keys. They own your bucket.

2. Denial of Service (Billion Laughs): Even if data exfiltration is blocked, an attacker can use the "Billion Laughs" attack (nested entity expansion, where each entity expands to several copies of the previous one) to consume all memory on the Tika server, crashing the indexing pipeline for the entire organization.

3. Internal Network Scanning: Blind XXE can be used to scan internal ports. The Tika server acts as a proxy, allowing an external attacker to probe the internal network for open services (Redis, Elastic, etc.).
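
The denial-of-service item can be tried safely on a workstation. The sketch below generates a Billion Laughs style document and parses it with XMLConstants.FEATURE_SECURE_PROCESSING enabled, assuming the JDK's built-in parser, which caps entity expansions and aborts with a SAXException once the cap is exceeded. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo of nested entity expansion under secure processing.
class EntityExpansionDemo {

    // Builds a document where each entity expands to ten copies of the previous one.
    static String nestedEntities(int levels) {
        StringBuilder xml = new StringBuilder("<!DOCTYPE r [<!ENTITY e0 \"ha\">");
        for (int i = 1; i <= levels; i++) {
            xml.append("<!ENTITY e").append(i).append(" \"");
            for (int j = 0; j < 10; j++) {
                xml.append("&e").append(i - 1).append(';');
            }
            xml.append("\">");
        }
        xml.append("]><r>&e").append(levels).append(";</r>");
        return xml.toString();
    }

    // Returns true if the document parses, false if the parser aborts.
    static boolean parses(String xml) {
        SAXParser parser;
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            parser = factory.newSAXParser();
        } catch (Exception e) {
            throw new IllegalStateException("secure processing not supported", e);
        }
        try {
            parser.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // expansion limit exceeded
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("2 levels parse: " + parses(nestedEntities(2)));
        System.out.println("9 levels parse: " + parses(nestedEntities(9)));
    }
}
```

Nine levels would expand to a billion copies of the innermost string if nothing stopped it. Note that this sketch leaves DOCTYPE declarations enabled on purpose; a parser that disallows them outright never gets far enough to expand anything.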

The Fix: Patching the Core

This is a mandatory update. You cannot configure your way out of this easily without recompiling code.

Immediate Action: Update your dependencies. You need tika-core version 3.2.2 or higher. If you are using a build tool like Maven, check your dependency tree (mvn dependency:tree). You might have updated tika-parsers but still be pulling in an old tika-core transitively. Explicitly force the version:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>3.2.2</version>
</dependency>
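
For inventory scripts or a startup self-check, the version rule above (anything below 3.2.2 is affected) can be expressed as a small comparison helper. This is a sketch under assumptions: the class name is hypothetical, how you obtain the version string is up to you, and qualifiers such as -SNAPSHOT are simply ignored.

```java
// Hypothetical helper for comparing a tika-core version string against the fixed release.
class TikaVersionCheck {

    static final String FIRST_FIXED = "3.2.2";

    // Drops anything after the first '-', e.g. "3.2.2-SNAPSHOT" becomes "3.2.2".
    private static String stripQualifier(String version) {
        int dash = version.indexOf('-');
        return (dash < 0 ? version : version.substring(0, dash)).trim();
    }

    // Compares dotted numeric versions component by component; missing parts count as 0.
    static int compareVersions(String a, String b) {
        String[] x = stripQualifier(a).split("\\.");
        String[] y = stripQualifier(b).split("\\.");
        int n = Math.max(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int xi = i < x.length ? Integer.parseInt(x[i]) : 0;
            int yi = i < y.length ? Integer.parseInt(y[i]) : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    // True when the version is at or above the first fixed release.
    static boolean isPatched(String version) {
        return compareVersions(version, FIRST_FIXED) >= 0;
    }

    public static void main(String[] args) {
        for (String v : new String[] {"1.28.5", "3.2.1", "3.2.2", "3.10.0"}) {
            System.out.println(v + " patched: " + isPatched(v));
        }
    }
}
```

Comparing numerically rather than as strings matters here: a plain string comparison would rank 3.10.0 below 3.2.2.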

Legacy Users (1.x branch): The 1.x branch is End-of-Life (EOL) but widely used. The tika-parsers module in 1.x is vulnerable. There is no official fix for 1.x. You must migrate to the 3.x branch. If you cannot migrate, you are sitting on a time bomb. Your only mitigation is to firewall the Tika service and ensure it cannot make outbound connections, though this does not stop local file theft.

Official Patches

Apache: Official Apache Advisory

Technical Appendix

CVSS Score: 8.4 / 10
Vector: CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:L/A:L
EPSS Probability: 2.73% (top 14% most exploited)

Affected Systems

  • Apache Tika Core < 3.2.2
  • Apache Tika PDF Module < 3.2.2
  • Apache Tika Parsers (1.x branch)
  • Enterprise Search (Solr, Elasticsearch) using vulnerable Tika plugins
  • Content Management Systems (CMS) with document preview features

Affected Versions Detail

Product | Vendor | Affected Versions | Fixed Version
Apache Tika Core | Apache Software Foundation | 1.13 - 3.2.1 | 3.2.2
Apache Tika PDF Module | Apache Software Foundation | 2.0.0 - 3.2.1 | 3.2.2

Attribute | Detail
CWE ID | CWE-611
Attack Vector | Network (via File Upload)
CVSS | 8.4 (High)
EPSS Score | 2.73% (High Percentile)
Impact | Information Disclosure / SSRF
Exploit Status | PoC Available / Active
Patch Status | Fixed in 3.2.2

MITRE ATT&CK Mapping

T1190: Exploit Public-Facing Application (Initial Access)
T1552.001: Unsecured Credentials: Credentials In Files (Credential Access)
T1212: Exploitation for Credential Access (Credential Access)

CWE-611: Improper Restriction of XML External Entity Reference

The software processes an XML document that can contain XML entities with URIs that resolve to documents outside of the intended sphere of control, causing the product to embed incorrect documents into its output.

Known Exploits & Detection

GitHub (chasingimpact): Full writeup and PoC generation script
Nuclei Templates: Automated detection template using base64 encoded PDF payload

Vulnerability Timeline

2025-08-29: Initial XXE identification in PDF module
2025-12-04: CVE-2025-66516 Published (Scope Expanded)
2025-12-04: Apache Tika 3.2.2 Released with Core Fix
2025-12-05: Public PoC released

Related Vulnerabilities
CVE-2025-54988
