Feb 21, 2026 · 5 min read
The `KnowledgeBaseWebReader` in LlamaIndex failed to increment or check recursion depth during web crawls. An attacker can supply a URL pointing to a page with circular links, causing the Python process to hit its recursion limit and crash (DoS). Fixed in `llama-index-readers-web` version 0.3.6.
In the race to feed Large Language Models (LLMs) with data, developers often overlook the basics of web crawling safety. CVE-2025-1752 is a stark reminder of this: a High-severity Denial of Service vulnerability in LlamaIndex's web reader component. By failing to track recursion depth during a crawl, the library allows attackers to trap the ingestion process in an infinite loop, leading to a stack exhaustion crash (`RecursionError`). This affects any application using `KnowledgeBaseWebReader` to ingest content from untrusted URLs.
We live in the age of RAG (Retrieval-Augmented Generation). Everyone and their dog is building pipelines to scrape the internet, chunk the text, embed it, and shove it into a vector database so their AI chatbot knows about the latest company policy or news article. LlamaIndex is one of the premier frameworks for this orchestration. It's the shovel we use to feed the beast.
But here's the thing about shovels: if you aren't careful, you can whack yourself in the face. Specifically, the `KnowledgeBaseWebReader` component—designed to crawl knowledge bases and documentation sites—had a fatal flaw in how it walked the web. It assumed that web pages are trees. They aren't. The web is a graph, and graphs have cycles.
This vulnerability isn't complex memory corruption. It's not a buffer overflow in C. It's a logic error in Python that turns a simple URL ingestion task into a process-killing weapon. If you are allowing users to submit URLs for your AI to 'learn' from, you just handed them a kill switch for your worker nodes.
The vulnerability (CWE-674: Uncontrolled Recursion) lies in the `get_article_urls` method. The developers intended to limit the crawl depth; they even included a parameter named `max_depth` in the function signature. It looks safe on paper. You see `max_depth=100` and think, 'Ah, good, it won't crawl the entire internet.'
But here is the punchline: they never actually checked or incremented a depth counter. They passed the static `max_depth` value down to every child call, but never passed the current depth.
Imagine telling a runner, 'Stop running after 10 miles,' but never giving them a watch or mile markers. They just keep running until they collapse. In computer science terms, this is the difference between a `while` loop with a broken condition and a recursive function without a base case. The Python interpreter, thankfully, has a fail-safe: `sys.getrecursionlimit()` (usually 1000). When the code hits that wall, it doesn't just stop gracefully; it raises a `RecursionError` and crashes the program.
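You can watch that fail-safe fire in a few lines. A minimal sketch (the function and its `limit_miles` parameter are illustrative, not LlamaIndex code; they mirror the "limit carried but never checked" pattern):

```python
import sys

# Python's guard rail: recursion deeper than this raises RecursionError.
print(sys.getrecursionlimit())  # commonly 1000

def run(limit_miles=10):
    # The runner with no watch: the limit is passed along
    # but never consulted, so this never terminates on its own.
    return run(limit_miles)

try:
    run()
except RecursionError as exc:
    print(f"collapsed: {exc}")
```

The process survives here only because we catch the exception ourselves; an unguarded caller simply dies.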
Let's look at the code. It is almost comical how close they were to getting it right, yet how completely they missed it.
The Vulnerable Code (Simplified):

```python
def get_article_urls(self, browser, root_url, current_url, max_depth=100):
    # ... scraping logic ...
    for link in links:
        # CRITICAL FAIL: max_depth is passed, but no 'current_depth' is tracked.
        # The function has no idea how deep it is.
        article_urls.extend(
            self.get_article_urls(browser, root_url, link, max_depth)
        )
```

Do you see it? `max_depth` is a constant. If I call this with `max_depth=10`, the child calls it with `max_depth=10`. The grandchild calls it with `max_depth=10`. There is no depth counter increasing (0, 1, 2...).
The Fix (Commit 3c65db2):

```python
# Now we track 'depth' and increment it.
def get_article_urls(self, browser, root_url, current_url, max_depth=100, depth=0):
    # The base case! Finally!
    if depth >= max_depth:
        return []
    # ... scraping logic ...
    for link in links:
        article_urls.extend(
            # Increment depth!
            self.get_article_urls(browser, root_url, link, max_depth, depth + 1)
        )
```

The fix is standard CS 101: add a `depth` argument, default it to 0, check `if depth >= max_depth`, and recurse with `depth + 1`.
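You can confirm the patched pattern terminates on a cycle with a standalone sketch. The `PAGES` dict stands in for real scraping, and the tiny `max_depth` is just to keep the output short; this is not the library's code:

```python
# Two pages that link to each other: a cycle, not a tree.
PAGES = {"/alpha": ["/omega"], "/omega": ["/alpha"]}

def get_article_urls(url, max_depth=5, depth=0):
    # The patched pattern: check depth, recurse with depth + 1.
    if depth >= max_depth:
        return []
    urls = [url]
    for link in PAGES[url]:
        urls.extend(get_article_urls(link, max_depth, depth + 1))
    return urls

print(get_article_urls("/alpha"))
# ['/alpha', '/omega', '/alpha', '/omega', '/alpha']
```

Note the duplicates: depth limiting stops the crash, but it does not deduplicate pages that the crawl revisits.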
To exploit this, we don't need advanced fuzzing or heap spraying. We just need two HTML files. Let's create a deadly trap for the `KnowledgeBaseWebReader`.
Step 1: The Trap
We set up a simple Flask server hosting two pages: /alpha and /omega.
- `/alpha` contains a link to `/omega`.
- `/omega` contains a link to `/alpha`.

Step 2: The Trigger

We feed `http://attacker.com/alpha` to the LlamaIndex reader.
The Execution Flow:
1. `get_article_urls(..., url='/alpha')` runs.
2. It finds the link to `/omega` and calls `get_article_urls(..., url='/omega')`.
3. That page links back to `/alpha`, so it calls `get_article_urls(..., url='/alpha')`.
4. Back to `/omega`...

Since the vulnerable code has no `visited` list (a set of URLs already processed) AND no depth limit enforcement, this ping-pong continues instantly until the Python stack explodes.
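If you want to stand the trap up without any dependencies, the two circular pages can be served from the standard library alone (the handler and `serve` helper are illustrative; a two-route Flask app works identically):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Each page links to the other: the smallest possible cycle.
LINKS = {"/alpha": "/omega", "/omega": "/alpha"}

class TrapHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        target = LINKS.get(self.path)
        if target is None:
            self.send_response(404)
            self.end_headers()
            return
        body = f'<html><body><a href="{target}">next</a></body></html>'.encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

def serve(port=8000):
    # Blocks forever, serving the /alpha <-> /omega loop.
    HTTPServer(("127.0.0.1", port), TrapHandler).serve_forever()
```

Point an unpatched `KnowledgeBaseWebReader` at `/alpha` on this server and the crawl never finds a leaf page to stop on.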
This crashes the thread. If you are running this in a Celery worker or a synchronous API handler, that worker is dead. If you don't have robust supervisor processes, your service starts dropping requests.
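Until you can upgrade, a cheap worker-side mitigation is to treat `RecursionError` as a per-URL failure instead of a process death. A minimal sketch (`safe_ingest` is a hypothetical wrapper, not a LlamaIndex API):

```python
def safe_ingest(ingest_fn, url):
    # Run one ingestion task; a crawl that blows the stack
    # fails that single URL instead of killing the worker.
    try:
        return ingest_fn(url)
    except RecursionError:
        # Log and move on; the worker stays alive.
        print(f"ingestion of {url} exceeded recursion limit, skipping")
        return None
```

Pair this with hard time limits (Celery's time limits, request timeouts): guarding against stack exhaustion does nothing about a crawl that is merely slow.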
The immediate fix is to upgrade `llama-index-readers-web` to version 0.3.6 or later, which introduces the depth-tracking logic shown in the code section above.
> [!NOTE]
> Researcher's Note: Even with the patch, the crawler doesn't seem to implement a visited set (deduplication) based on the diff analysis. While the depth parameter prevents infinite recursion crashes, a site with a massive number of links within the max_depth range could still cause performance degradation (a 'wide' traversal DoS rather than a 'deep' one). However, the crash vector is resolved.
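Closing that remaining gap is straightforward: carry a visited set alongside the depth counter. A sketch under the same assumptions (`fetch_links` is a stand-in for the scraping step; this is not the library's code):

```python
def crawl(url, fetch_links, max_depth=100, depth=0, visited=None):
    # The depth check stops 'deep' traversals;
    # the visited set stops 'wide' revisiting of the same pages.
    if visited is None:
        visited = set()
    if depth >= max_depth or url in visited:
        return []
    visited.add(url)
    urls = [url]
    for link in fetch_links(url):
        urls.extend(crawl(link, fetch_links, max_depth, depth + 1, visited))
    return urls
```

On the two-page /alpha ↔ /omega trap, this returns each URL exactly once instead of ping-ponging until the limit.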
Defensive Strategy:
- Upgrade: `pip install --upgrade llama-index-readers-web`
- Validate before you ingest: check the `Content-Length` or structure of the target page before fully committing resources.
- Enforce hard timeouts: `func_timeout` or Celery's time limits are your friends here.

CVSS Vector: `CVSS:3.0/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H`

| Product | Affected Versions | Fixed Version |
|---|---|---|
| `llama-index-readers-web` (LlamaIndex) | < 0.3.6 | 0.3.6 |
| Attribute | Detail |
|---|---|
| CWE ID | CWE-674 (Uncontrolled Recursion) |
| CVSS v3.0 | 7.5 (High) |
| Attack Vector | Network |
| Impact | Denial of Service (DoS) |
| Exploit Status | POC Available |
| Patch Date | 2025-02-27 |
MITRE's description of CWE-674 sums it up: the product does not properly control the amount of recursion that takes place, consuming excessive resources, such as memory or the program stack.