Apr 17, 2026
LangChain's HTML text splitter fails to validate HTTP redirects during content retrieval, enabling attackers to bypass SSRF protections and extract internal network data or cloud IAM credentials.
The `langchain-text-splitters` package prior to version 0.3.5 is vulnerable to Server-Side Request Forgery (SSRF) in the `HTMLHeaderTextSplitter.split_text_from_url` method. The vulnerability arises from an incomplete validation mechanism that checks the initial URL but fails to restrict subsequent HTTP redirects, allowing an attacker to access restricted internal resources and cloud metadata services.
The langchain-text-splitters package, a component of the broader LangChain ecosystem, provides utilities for dividing text into smaller, semantically meaningful chunks. This functionality is required for processing large documents before embedding them into vector stores or feeding them to Large Language Models (LLMs). The HTMLHeaderTextSplitter class specifically targets HTML documents, parsing the Document Object Model (DOM) and splitting content based on header tags (<h1>, <h2>, etc.) to maintain logical structure.
A Server-Side Request Forgery (SSRF) vulnerability exists in the `split_text_from_url` method of this class. The flaw is tracked as GHSA-FV5P-P927-QMXR and carries a CVSS v3.1 score of 6.5. An attacker can bypass the initial URL validation by using HTTP redirects, forcing the server to make unauthorized requests to internal network resources.
The component attempts to restrict outbound requests to safe, public IP addresses to prevent SSRF. It implements a validation step that checks the user-supplied URL against a blocklist of restricted ranges, such as local loopback addresses and cloud metadata service IPs. However, this validation is performed only on the initial URL provided in the method invocation, failing to account for subsequent network routing events.
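The kind of blocklist check described above can be sketched as follows. This is a minimal illustration and not the library's actual validator: the names `BLOCKED_NETWORKS`, `is_blocked_ip`, and `is_safe_url` are hypothetical, and the list of ranges is an assumption based on the addresses this article mentions.

```python
import ipaddress
import socket
from urllib.parse import urlparse

# Assumed blocklist: loopback, RFC 1918 private ranges, and link-local
# (which contains the 169.254.169.254 metadata address).
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "127.0.0.0/8",
        "10.0.0.0/8",
        "172.16.0.0/12",
        "192.168.0.0/16",
        "169.254.0.0/16",
        "::1/128",
    )
]


def is_blocked_ip(address: str) -> bool:
    """Return True if the address falls inside any blocked network."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in BLOCKED_NETWORKS)


def is_safe_url(url: str) -> bool:
    """Check the URL's host once. Note that this sees only the initial URL."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        # Resolve the host; IP literals are returned without a DNS lookup.
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False
    # Reject the URL if any resolved address is in a blocked range.
    return not any(is_blocked_ip(info[4][0]) for info in infos)
```

A check of this shape rejects `http://169.254.169.254/` when it is passed in directly, but it has no opportunity to inspect a `Location` header that arrives later in the request.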
The root cause of this vulnerability lies in the implementation of the HTTP request lifecycle within the split_text_from_url method. When a user supplies a URL to this method, the underlying code executes a validation routine to ensure the destination is not a restricted or internal network address. If the URL passes this check, the method proceeds to fetch the content using an HTTP client.
The critical flaw is the failure to restrict or re-validate HTTP redirects. By default, standard HTTP clients automatically follow 3xx redirect status codes, such as 301 Moved Permanently or 302 Found. The underlying client transparently processes the Location header provided in the remote server's response and initiates a secondary request to the new destination.
Because the anti-SSRF validation logic only inspects the initial input string, it remains blind to any subsequent destinations introduced during the redirect chain. An attacker can supply a URL pointing to an external server they control. This server passes the initial validation but responds with a redirect pointing to a restricted internal IP address, circumventing the security control entirely.
To understand the mechanical failure, we examine the sequence of operations in the vulnerable code path. The initial implementation performs a synchronous check on the URL string, verifying its host component against known restricted CIDR blocks. This logic correctly identifies and blocks explicit attempts to access addresses like 127.0.0.1 or 169.254.169.254.

    # Conceptual representation of the vulnerable pattern
    # (is_safe_url and split_text stand in for the library's internals)
    import requests

    def split_text_from_url(url: str):
        if not is_safe_url(url):
            raise ValueError("Unsafe URL")
        # Flaw: The default HTTP client follows redirects without re-validation
        response = requests.get(url)
        return split_text(response.text)

The patch introduced in Pull Request #35960 addresses this discrepancy by hardening the anti-SSRF mechanisms. The fix modifies the request execution strategy to explicitly control redirect behavior. It disables automatic redirects or implements a custom redirect handler that recursively validates each hop in the redirect chain before proceeding.

    # Conceptual representation of the patched pattern
    # (is_safe_url and split_text stand in for the library's internals)
    from urllib.parse import urljoin

    import requests

    MAX_REDIRECTS = 5  # illustrative limit, not taken from the patch

    def split_text_from_url(url: str):
        for _ in range(MAX_REDIRECTS + 1):
            # Validate every hop, not only the initial URL
            if not is_safe_url(url):
                raise ValueError("Unsafe URL")
            # Disable auto-redirects so each Location header can be re-verified
            response = requests.get(url, allow_redirects=False)
            if response.status_code not in (301, 302, 303, 307, 308):
                return split_text(response.text)
            # Resolve relative Location values against the current URL
            url = urljoin(url, response.headers["Location"])
        raise ValueError("Too many redirects")

Exploiting this SSRF vulnerability requires the attacker to control an external web server and pass its URL into an application that uses the `HTMLHeaderTextSplitter.split_text_from_url` method. The attacker configures their server to respond to incoming HTTP GET requests with a 302 Found status code. The response includes a Location header pointing to the targeted internal resource.
When the vulnerable LangChain application processes the attacker's input, it first validates the external domain. Since the domain resolves to a public IP address, the validation check passes. The application then issues the GET request. The underlying HTTP library receives the 302 response and automatically issues a secondary request to the URL specified in the Location header.
The application retrieves the content from the internal resource, processes it through the HTML splitting logic, and incorporates the resulting text chunks into its normal execution flow. Depending on the application's design, this data may be reflected directly back to the attacker in an HTTP response, stored in a database, or processed by an LLM, making the exfiltrated data accessible to the attacker.
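The redirect step in this sequence can be reproduced locally with the standard library alone. The sketch below is an assumption-laden stand-in: both servers bind to loopback (one plays the attacker's public server, the other the internal resource), the handler and function names are hypothetical, and `urllib` is used in place of whichever HTTP client the application actually relies on. Like most clients, `urllib` follows the 302 by default.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class InternalHandler(BaseHTTPRequestHandler):
    """Stand-in for an internal-only service."""

    def do_GET(self):
        body = b"<h1>internal-only data</h1>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def make_redirector(target_url):
    """Build a handler that answers every GET with a 302 to target_url."""

    class RedirectHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(302)
            self.send_header("Location", target_url)
            self.send_header("Content-Length", "0")
            self.end_headers()

        def log_message(self, *args):
            pass

    return RedirectHandler


def serve(handler):
    """Start a server on a free loopback port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def run_demo():
    internal = serve(InternalHandler)
    internal_url = f"http://127.0.0.1:{internal.server_port}/secret"
    attacker = serve(make_redirector(internal_url))
    start_url = f"http://127.0.0.1:{attacker.server_port}/page"
    # ProxyHandler({}) ignores proxy environment variables; the default
    # redirect handler is still installed, so the 302 is followed.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(start_url, timeout=5) as response:
            return start_url, response.geturl(), response.read().decode()
    finally:
        for server in (attacker, internal):
            server.shutdown()
            server.server_close()
```

Calling `start, final, body = run_demo()` returns a `final` URL that differs from `start` and a `body` containing the internal page, even though the caller only ever supplied the first URL.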
The primary impact of this vulnerability is the unauthorized disclosure of internal network configuration, local services, and sensitive credentials. The risk is highest when the vulnerable application runs in a cloud environment such as AWS, Google Cloud Platform, or Microsoft Azure, because these providers expose instance metadata services at fixed link-local IP addresses such as 169.254.169.254.
By targeting these metadata endpoints, an attacker can extract temporary Identity and Access Management (IAM) credentials, instance configuration details, and user-data scripts. If the IAM role attached to the compute instance is over-privileged, the attacker can use these credentials to pivot into the wider cloud environment and compromise additional infrastructure.
Beyond cloud metadata, the SSRF flaw enables an attacker to map the internal network architecture. They systematically probe local ports to identify services bound to 127.0.0.1 or scan adjacent internal subnets. Unauthenticated internal services, such as internal Redis databases, REST APIs, or administrative consoles, become reachable from the external attacker's perspective.
The definitive remediation for this vulnerability is upgrading the langchain-text-splitters package to version 0.3.5 or later. This release incorporates the anti-SSRF hardening implemented in PR #35960. Development teams must audit their dependencies and verify that their Python environments execute the patched version.
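One way to verify the installed version from inside a Python environment is sketched below. The helper names are hypothetical, and the version parsing is deliberately simplified: it compares only the leading numeric release segment, so a pre-release such as `0.3.5rc1` would be treated the same as `0.3.5`.

```python
import re
from importlib import metadata

FIXED_VERSION = (0, 3, 5)  # first fixed release, per this advisory


def parse_release(version: str) -> tuple:
    """Extract the leading numeric release segment, e.g. '0.3.4' -> (0, 3, 4)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def is_patched(version: str) -> bool:
    """Return True if the version is at or above the fixed release."""
    return parse_release(version) >= FIXED_VERSION


def installed_is_patched(package: str = "langchain-text-splitters"):
    """Return True/False for the installed package, or None if it is absent."""
    try:
        return is_patched(metadata.version(package))
    except metadata.PackageNotFoundError:
        return None
```

Running `installed_is_patched()` in each deployment environment, for example as part of a CI step, gives a quick signal that complements a lockfile audit.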
In scenarios where immediate patching is not feasible, organizations must implement defense-in-depth measures at the network level. Strict egress filtering on the host or container running the LangChain application limits what a followed redirect can reach. The egress firewall must deny all outbound traffic to 169.254.169.254 and block traffic to internal subnets (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) unless explicitly required by business logic.
Applications that accept arbitrary URLs for processing should add a matching control at the application layer. Fetching external content through a dedicated proxy service keeps the application from interacting directly with untrusted remote servers. Such proxies enforce strict routing policies, refuse redirects, and strip sensitive headers.
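The "refuse redirects" policy can also be enforced in the fetching code itself. The following sketch uses the standard library's `urllib.request.HTTPRedirectHandler`; the class and function names are hypothetical, and it illustrates the policy rather than the behavior of any particular proxy product.

```python
import urllib.error
import urllib.request


class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Turn every 3xx response into an error instead of following it."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        raise urllib.error.HTTPError(
            req.full_url, code, f"Redirect to {newurl} refused", headers, fp
        )


def build_no_redirect_opener():
    """Build an opener whose redirect handling is replaced by the refusal above."""
    # Passing a subclass of HTTPRedirectHandler replaces the default handler.
    return urllib.request.build_opener(NoRedirectHandler)
```

With this opener, `build_no_redirect_opener().open(url)` raises `urllib.error.HTTPError` as soon as the remote server answers with a redirect, so the caller can decide explicitly whether the new destination is acceptable.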
CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:L/I:L/A:N

| Product | Affected Versions | Fixed Version |
|---|---|---|
| langchain-text-splitters (LangChain) | < 0.3.5 | 0.3.5 |

| Attribute | Detail |
|---|---|
| CWE ID | CWE-918 |
| Attack Vector | Network |
| CVSS Score | 6.5 |
| Impact | Confidentiality, Integrity |
| Exploit Status | Proof-of-Concept |
| KEV Status | Not Listed |
CWE-918 (Server-Side Request Forgery): The web server receives a URL or similar request from an upstream component and retrieves the contents of this URL, but it does not sufficiently ensure that the request is being sent to the expected destination.
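As a cross-check, the 6.5 score can be recomputed from the vector string listed above using the CVSS v3.1 base score equations. The sketch below covers only the Scope: Unchanged case that applies here; the function names are illustrative, while the metric weights and rounding rule follow the published specification.

```python
import math

# CVSS v3.1 metric weights (PR values are those for Scope: Unchanged)
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "PR": {"N": 0.85, "L": 0.62, "H": 0.27},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}


def roundup(value: float) -> float:
    """Round up to one decimal place, as defined in the v3.1 specification."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector: str) -> float:
    """Compute the base score for a Scope: Unchanged CVSS v3.1 vector."""
    metrics = dict(part.split(":") for part in vector.split("/")[1:])
    if metrics["S"] != "U":
        raise ValueError("This sketch only covers Scope: Unchanged")
    w = {name: WEIGHTS[name][metrics[name]] for name in WEIGHTS}
    iss = 1 - (1 - w["C"]) * (1 - w["I"]) * (1 - w["A"])
    impact = 6.42 * iss
    exploitability = 8.22 * w["AV"] * w["AC"] * w["PR"] * w["UI"]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


print(base_score("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:L/I:L/A:N"))  # 6.5
```

For this vector the impact sub-score is about 2.51 and the exploitability sub-score about 3.89, which sum to roughly 6.40 and round up to 6.5.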