Feb 12, 2026 · 6 min read
The RecursiveUrlLoader in LangChain JS used `startsWith()` to validate URLs, allowing attackers to bypass domain restrictions and scan internal networks or steal cloud credentials.
A logic flaw in the LangChain JS `@langchain/community` package allows for Server-Side Request Forgery (SSRF) within the `RecursiveUrlLoader`. By bypassing a weak string-prefix validation check, attackers can force the crawler to access internal network resources, local loopback interfaces, or cloud metadata services. Since the output of this loader is typically fed into an LLM for summarization or processing, this vulnerability transforms a simple network scan into a high-fidelity data exfiltration pipeline.
In the modern AI ecosystem, data is oxygen. We build agents, give them tools, and tell them to "go forth and learn." One of the most popular tools in the LangChain arsenal is the RecursiveUrlLoader. It's essentially a web spider in a box—you point it at a documentation site or a wiki, and it recursively scrapes every link it finds to build a knowledge base for your RAG (Retrieval-Augmented Generation) pipeline.
But here's the thing about giving a robot a web browser: unless you put a leash on it, it's going to wander into your backyard. The developers knew this. They implemented a preventOutside flag, enabled by default, intended to keep the spider inside the garden fence (the target domain). Ideally, if you point it at https://docs.example.com, it shouldn't wander off to https://pornhub.com or, more importantly, http://169.254.169.254 (the AWS metadata service).
CVE-2026-26019 is the story of how that leash was made of wet paper. It turns out that validating URLs is hard, and using string manipulation to do it is almost always a death sentence for security.
The root cause of this vulnerability is a classic developer mistake: confusing a string prefix with a security boundary. To enforce the preventOutside rule, the code needed to check if a discovered link belonged to the same origin as the base URL.
Instead of parsing the URL into its component parts (protocol, hostname, port) and comparing them semantically, the code did this:
```typescript
// The 'Security' Check
const isAllowed = !this.preventOutside || link.startsWith(baseUrl);
```

If you are a security researcher, you are likely grinning right now. If you are a developer, let me explain why this is catastrophic. In the world of URLs, `startsWith` is meaningless for origin validation. If my `baseUrl` is `https://example.com`, obviously `https://example.com/page2` passes. But so does `https://example.com.attacker.com`.
This is the "Golden Key" bypass. By simply registering a domain that starts with the target string, or utilizing certain URL formatting tricks (like the @ symbol for authentication segments), an attacker can fool the crawler into thinking an external, malicious, or internal IP is actually part of the allowed zone.
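To make the failure mode concrete, here is a minimal sketch (with invented attacker hostnames) contrasting the prefix check with what a real URL parser sees:

```typescript
// Hypothetical hostnames for illustration only
const baseUrl = "https://example.com";

// Legitimate same-origin link: passes, as intended
console.log("https://example.com/page2".startsWith(baseUrl)); // true

// Attacker-registered domain: ALSO passes the prefix check
console.log("https://example.com.attacker.com/payload".startsWith(baseUrl)); // true

// Userinfo trick: everything before '@' is credentials; the real host is attacker.com
const tricky = "https://example.com@attacker.com/";
console.log(tricky.startsWith(baseUrl)); // true
console.log(new URL(tricky).hostname);   // "attacker.com"
```

A semantic comparison via `new URL(...).hostname` immediately exposes both tricks, which is exactly why string prefixes are the wrong tool here.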
Let's look at the smoking gun. The vulnerability existed in RecursiveUrlLoader prior to version 1.1.14. The fix involved ripping out the naive string comparison and replacing it with a dedicated SSRF protection module that actually understands what a URL is.
The Vulnerable Code:

```typescript
// recursive_url.ts (Pre-patch)
// 🚩 Narrative: "Looks like the base URL? Must be safe!"
if (this.preventOutside && !link.startsWith(this.url)) {
  continue;
}
```

The Patched Code:
```typescript
// recursive_url.ts (Patched)
// 🛡️ Narrative: "Check origin, check IP, check everything."
import { isSameOrigin, validateSafeUrl } from "@langchain/core/utils/ssrf";

// ... inside the loop
if (this.preventOutside && !isSameOrigin(link, this.url)) {
  continue;
}

// The real MVP: Proactive IP validation before fetch
if (!(await validateSafeUrl(link))) {
  throw new Error("Potentially unsafe URL detected");
}
```

The patch introduced `validateSafeUrl`, which does the heavy lifting. It blocks private IP ranges (RFC 1918), loopback addresses (127.0.0.1), and the notorious cloud metadata IP (169.254.169.254). It also handles the semantic origin check correctly, ensuring that `example.com.evil.com` is treated as a different origin than `example.com`.
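The internals of the library's SSRF utilities aren't reproduced here, but the two core ideas are easy to sketch. The following is a simplified, hypothetical approximation using Node's built-in `URL`, `dns`, and `net.BlockList`; the function names `sameOrigin` and `checkUrl` are my own, not the library's:

```typescript
import { lookup } from "node:dns/promises";
import { BlockList } from "node:net";

// Idea 1 — semantic origin comparison: protocol + hostname + port, never string prefixes
function sameOrigin(link: string, base: string): boolean {
  const a = new URL(link);
  const b = new URL(base);
  return a.protocol === b.protocol && a.hostname === b.hostname && a.port === b.port;
}

// Idea 2 — deny-list of private, loopback, and link-local ranges
const blocked = new BlockList();
blocked.addSubnet("10.0.0.0", 8);
blocked.addSubnet("172.16.0.0", 12);
blocked.addSubnet("192.168.0.0", 16);
blocked.addSubnet("127.0.0.0", 8);
blocked.addSubnet("169.254.0.0", 16); // covers the metadata IP 169.254.169.254

// Resolve the hostname BEFORE fetching, so a DNS name can't smuggle in a private IP
async function checkUrl(link: string): Promise<boolean> {
  const { address } = await lookup(new URL(link).hostname);
  return !blocked.check(address);
}
```

Note that resolving once and fetching later still leaves a DNS-rebinding window; production-grade protections pin the resolved IP for the actual request rather than resolving twice.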
So, how do we weaponize this? We need two things: a way to feed the crawler a starting URL (or control a page it visits), and a target worth stealing. In a cloud environment, the target is almost always the Instance Metadata Service (IMDS).
The Setup: The application exposes a feature that uses RecursiveUrlLoader to scrape a URL provided by the user (or a URL the attacker can modify).

The Attack Chain:
1. **Bypass the Scope:** The attacker hosts a malicious page at https://target-site.com.attacker-controlled.net. If the user points the loader here, or if the loader crawls here from a valid site (via an open redirect or comment section), the startsWith check passes because the string matches the prefix.
2. **The Redirection (SSRF):** On the attacker's page, we place a link or an HTTP redirect to http://169.254.169.254/latest/meta-data/iam/security-credentials/.
3. **The Exfiltration:** This is where LangChain makes the vulnerability worse than a standard SSRF. A standard SSRF might just be "blind" (you can send requests but not see responses). But the entire purpose of this component is to read the response body and return it as text to the LLM.
4. **The Payday:** The loader fetches the metadata, believing it to be just another web page. It passes the AWS keys to the LLM. The attacker then asks the LLM: "Summarize the documents you just crawled." The LLM happily replies: "I found some JSON data containing an AccessKeyId and a SecretAccessKey..."
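The redirection step needs nothing more than an HTTP redirect. A hypothetical attacker server (hostname and hosting details invented for illustration) can be as small as this:

```typescript
import { createServer } from "node:http";

// Hypothetical attacker server, e.g. hosted behind example.com.attacker-controlled.net.
// Any crawler that follows redirects gets bounced straight into the metadata service.
const server = createServer((req, res) => {
  res.writeHead(302, {
    Location: "http://169.254.169.254/latest/meta-data/iam/security-credentials/",
  });
  res.end();
});

server.listen(0); // ephemeral port, for demonstration only
```

Because the pre-patch loader only string-checked the initial link, nothing re-validates where a redirect actually lands before the body is read.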
SSRF in AI agents is a distinct beast. In traditional web apps, SSRF is often used to port scan internal networks or hit legacy administrative interfaces. In AI, it is an Identity Theft vector.
Because the RecursiveUrlLoader is designed to ingest unstructured text, it bypasses many of the format-based protections that might stop other SSRF attacks (like expecting HTML). It will happily ingest JSON, XML, or raw text credentials.
If this runs in a cloud environment (AWS, GCP, Azure) without strict IMDSv2 token requirements, a single crawl can result in full account takeover. Furthermore, because the CVSS vector requires User Interaction (someone has to tell the bot to crawl), the score of 4.1 deceptively masks the critical nature of the flaw. If you are building a "Chat with Website" feature, this is a Critical vulnerability for your infrastructure.
The immediate fix is simple: Upgrade @langchain/community to version 1.1.14 or later. This pulls in the new SSRF protections.
However, code fixes are only one layer of defense. You should treat the environment your AI agents run in as hostile.
- **Network egress filtering:** Use iptables or Security Groups to block egress traffic to 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, and 169.254.169.254.
- **Enforce IMDSv2:** IMDSv2 requires a PUT request with specific headers to get a token before reading metadata. Simple GET-based SSRF (like this web crawler) cannot generate those headers, effectively neutralizing the cloud credential theft vector.

CVSS Vector: `CVSS:3.1/AV:N/AC:L/PR:L/UI:R/S:C/C:L/I:N/A:N`

| Product | Affected Versions | Fixed Version |
|---|---|---|
| `@langchain/community` (langchain-ai) | < 1.1.14 | 1.1.14 |
| Attribute | Detail |
|---|---|
| CWE ID | CWE-918 |
| Attack Vector | Network |
| CVSS Score | 4.1 (Medium) |
| Impact | Confidentiality (Low), Scope Changed |
| Vulnerable Logic | String.startsWith() bypass |
| Target Component | RecursiveUrlLoader |