Jun 16, 2026
Mozilla Bleach versions up to and including 6.3.0 fail to sanitize URLs that contain non-ASCII or invisible Unicode characters in the scheme prefix. This allows blocked protocols like 'javascript:' to bypass sanitization filters, creating stored Cross-Site Scripting (XSS) risks in downstream environments that normalize or strip Unicode data.
Mozilla Bleach is an open-source HTML sanitizing library for Python. Versions up to and including 6.3.0 contain an incomplete filtering implementation in the URI validation logic ('sanitize_uri_value'). This logic fails to detect disallowed protocols, such as 'javascript:', if they contain Unicode invisible characters, whitespace characters, or characters with a code point greater than U+00A0. While standard-compliant web browsers do not directly execute invalid URI schemes containing these non-standard characters, downstream systems that normalize Unicode text by stripping invisible or non-ASCII characters can unintentionally reactivate the 'javascript:' prefix, causing Cross-Site Scripting (XSS). Additionally, this behavior violates Bleach's core sanitization contract by outputting URIs that bypass protocol allowlists configured by the caller.
Mozilla Bleach is a Python-based HTML sanitization library that parses, cleans, and filters HTML fragments from untrusted inputs. It operates as a primary defense boundary in web applications, validating allowed tags, attributes, and URI schemes to prevent Cross-Site Scripting (XSS) and data injection vulnerabilities.
Under normal execution, when developers enable anchor tags (<a>) and link attributes (href), Bleach checks the value of the href attribute. It guarantees that the parsed scheme matches an explicit allowlist of safe protocols, such as http, https, or mailto. Any URI containing a disallowed scheme, such as javascript:, is stripped or nullified.
The vulnerability tracked as GHSA-8RFP-98V4-MMR6 resides in the preprocessing loop of the URI validation logic. When processing attributes, Bleach fails to identify blocked protocols if they contain specific high-range Unicode code points, invisible characters, or non-standard whitespace. Consequently, the invalid URI bypasses validation checks and is preserved in the output, which violates Bleach's security guarantees and compromises downstream rendering components.
The underlying technical flaw lies within the sanitize_uri_value function inside the bleach/sanitizer.py component. Prior to passing a URI string to Python's standard library parser (urllib.parse.urlparse), Bleach performs an initial cleanup operation to strip backticks, control characters, and standard whitespace.
This cleanup is executed using the regular expression re.sub(r"[`\000-\040\177-\240\s]+", "", normalized_uri). This regex matches only the backtick, ASCII control characters and the space (up to \040), whitespace characters matched by \s, and characters in the \177-\240 range (DEL through U+00A0). It fails to match other non-ASCII characters, including invisible formatting characters such as the Zero-Width Space (\u200b), Byte-Order Mark (\ufeff), or Soft Hyphen (\u00ad).
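The gap in the cleanup pattern can be checked in isolation with the standard library. The sketch below applies the pattern quoted above to a handful of sample strings; the sample values and variable names are illustrative assumptions, not part of Bleach.

```python
import re

# Cleanup pattern quoted in the text: backtick, C0 controls and space,
# DEL through U+00A0, and anything matched by \s.
CLEANUP = re.compile(r"[`\000-\040\177-\240\s]+")

# Illustrative inputs (assumed for this sketch), one separator character each.
samples = {
    "tab": "java\tscript:alert(1)",
    "nbsp": "java\u00a0script:alert(1)",
    "em space": "java\u2003script:alert(1)",
    "zero-width space": "java\u200bscript:alert(1)",
    "byte-order mark": "java\ufeffscript:alert(1)",
    "soft hyphen": "java\u00adscript:alert(1)",
}

cleaned = {name: CLEANUP.sub("", value) for name, value in samples.items()}

# Any result that still contains a non-ASCII character slipped past the pattern.
survivors = sorted(name for name, value in cleaned.items() if not value.isascii())

for name, value in cleaned.items():
    print(f"{name:>16}: {value!r}")
print("not stripped:", survivors)
```

The tab, no-break space, and em space are removed because they fall inside the explicit ranges or are matched by \s, while the three formatting characters named above are left in place.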
When an attacker constructs a link containing an invisible Unicode character inside the protocol scheme name (e.g., javascript\u200b:alert(1)), the character is not removed during the regex cleanup stage. The resulting string is subsequently passed to urllib.parse.urlparse. Because the string contains a non-ASCII character within the scheme sequence, urlparse fails to recognize javascript\u200b as a valid scheme under RFC 3986 rules and parses the value as a relative path.
Since urlparse does not extract javascript as the scheme, Bleach treats the input as a relative URL, which it accepts when http or https is in the protocol allowlist. The sanitizer therefore lets the obfuscated href attribute pass through unchanged, and the dangerous payload persists in the sanitized output.
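The parsing behavior described above can be reproduced with urllib.parse alone. In the sketch below, is_allowed is a hypothetical, simplified stand-in for an allowlist check and is not Bleach's actual code; it only assumes that a URI without a detected scheme is treated as relative and accepted when http or https is allowed.

```python
from urllib.parse import urlparse


def scheme_seen_by_parser(uri):
    """Return the scheme urlparse extracts, or '' when it finds none."""
    return urlparse(uri).scheme


def is_allowed(uri, allowed_protocols):
    """Hypothetical, simplified allowlist check (not Bleach's implementation)."""
    scheme = scheme_seen_by_parser(uri.lower())
    if scheme:
        return scheme in allowed_protocols
    # No scheme detected: assume a relative URL, acceptable if http(s) is allowed.
    return bool({"http", "https"} & set(allowed_protocols))


allowed = {"http", "https"}
plain = "javascript:alert(1)"
obfuscated = "javascript\u200b:alert(1)"

print(repr(scheme_seen_by_parser(plain)), is_allowed(plain, allowed))
print(repr(scheme_seen_by_parser(obfuscated)), is_allowed(obfuscated, allowed))
```

The plain payload is rejected because its scheme is recognized and not on the allowlist; the obfuscated one yields an empty scheme and falls through to the relative-URL branch.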
In version 6.3.0 and prior, the preprocessing code in bleach/sanitizer.py is configured as follows:
```python
# Vulnerable Code: bleach/sanitizer.py (<= v6.3.0)
def sanitize_uri_value(self, value, allowed_protocols):
    # Convert HTML entities to raw characters
    normalized_uri = html5lib_shim.convert_entities(value)
    # Remove backtick, whitespace, and control characters (up to U+00A0)
    normalized_uri = re.sub(r"[`\000-\040\177-\240\s]+", "", normalized_uri)
    # Remove REPLACEMENT characters
    normalized_uri = normalized_uri.replace("\ufffd", "")
    # Lowercase for pattern matching
    normalized_uri = normalized_uri.lower()
```

The vulnerability was remediated in commit 7c4867c32344d1c961107fae62240a6f0dc680dc by removing the specific replacement of the \ufffd character and introducing a regular expression that strips all non-ASCII characters from the URI prior to the validation phase:
```python
# Patched Code: bleach/sanitizer.py (v6.4.0)
def sanitize_uri_value(self, value, allowed_protocols):
    # Convert HTML entities to raw characters
    normalized_uri = html5lib_shim.convert_entities(value)
    # Strip backtick, whitespace, and control characters
    normalized_uri = re.sub(r"[`\000-\040\177-\240\s]+", "", normalized_uri)
    # Strip non-ASCII characters so that urlparse can parse the url into
    # components correctly. This drops invisible and whitespace unicode
    # characters among other things.
    normalized_uri = re.sub(r"[^\x00-\x7f]", "", normalized_uri)
    # Lowercase value to make matching easier
    normalized_uri = normalized_uri.lower()
```

This remediation ensures that any non-ASCII character, including invisible formatting characters, is removed before scheme parsing. An input of javascript\u200b:alert(1) is reduced to javascript:alert(1) during preprocessing. urlparse then parses the collapsed string as the javascript scheme, which Bleach flags as a blocked protocol and strips from the tag.
While this fix closes the Unicode-based bypass, it does so by discarding every non-ASCII character during validation, which also affects legitimate internationalized domain names (IDNs) and paths that use non-ASCII characters. The maintainers accepted this functional limitation as a necessary trade-off for security enforcement in the final release of the project.
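The effect of the patched normalization can be sketched outside Bleach. The function below is a hypothetical helper that mirrors only the two substitutions and the lowercasing shown in the patched excerpt; it omits the entity-conversion step, which relies on Bleach internals, and the IDN sample URL is an assumption chosen for illustration.

```python
import re
from urllib.parse import urlparse


def normalize_for_validation(uri):
    """Hypothetical helper mirroring the substitutions in the patched excerpt."""
    # Strip backtick, whitespace, and control characters.
    uri = re.sub(r"[`\000-\040\177-\240\s]+", "", uri)
    # Strip every remaining non-ASCII character.
    uri = re.sub(r"[^\x00-\x7f]", "", uri)
    return uri.lower()


obfuscated = normalize_for_validation("javascript\u200b:alert(1)")
idn = normalize_for_validation("https://m\u00fcnchen.example/stra\u00dfe")

print(repr(obfuscated), urlparse(obfuscated).scheme)
print(repr(idn), urlparse(idn).scheme)
```

The obfuscated payload collapses to a string whose scheme is now detected as javascript, while the IDN example shows the trade-off: the normalized form used for validation loses its non-ASCII letters even though the scheme is still recognized as https.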
Exploiting this vulnerability requires a multi-tier environment where Bleach's output is processed by a downstream component before rendering. Standard browsers do not execute schemes containing raw invisible Unicode characters directly, as they interpret them as relative paths or unrecognized schemes.
The vulnerability is triggered when the database, template formatter, or downstream backend processor normalizes the Unicode input (e.g., stripping non-ASCII characters, applying compatibility decompositions, or running character cleanup scripts). If a backend normalizes the string javascript\u200b:alert(document.cookie) by removing non-ASCII elements, the invisible character is discarded, resulting in the executable payload javascript:alert(document.cookie).
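Compatibility normalization is one of the downstream steps named above, and its effect can be sketched with the standard library. The example assumes a stored value that begins with a fullwidth letter (a code point above U+00A0) and a hypothetical downstream stage that applies NFKC; neither is taken from a specific application.

```python
import unicodedata
from urllib.parse import urlparse

# U+FF4A is FULLWIDTH LATIN SMALL LETTER J; the stored value is an assumed example.
stored = "\uff4aavascript:alert(1)"

# Hypothetical downstream step: compatibility normalization (NFKC).
normalized = unicodedata.normalize("NFKC", stored)

before = urlparse(stored).scheme      # no scheme is recognized
after = urlparse(normalized).scheme   # the scheme is recognized after folding

# NFKC alone does not remove a zero-width space; that case needs a stripping step.
zwsp_survives = "\u200b" in unicodedata.normalize("NFKC", "javascript\u200b:alert(1)")

print(repr(before), repr(after), zwsp_survives)
```

The two cases differ in which downstream operation reactivates the scheme: fullwidth letters are folded to ASCII by NFKC, whereas the zero-width space only disappears when a component strips invisible or non-ASCII characters.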
A Python proof of concept demonstrates how Bleach preserves the payload while downstream execution collapses it into a functional exploit vector:
```python
import bleach
import re

# Configure sanitizer allowlists
allowed_tags = ['a']
allowed_attrs = {'a': ['href']}
allowed_protocols = ['http', 'https']

# Construct input payload with Zero-Width Space (\u200b)
raw_input = '<a href="javascript\u200b:alert(document.cookie)">Target Link</a>'

# Stage 1: Bleach sanitization (v6.3.0)
sanitized_output = bleach.clean(
    raw_input,
    tags=allowed_tags,
    attributes=allowed_attrs,
    protocols=allowed_protocols,
)
print("Sanitized Output:", repr(sanitized_output))
# Result: '<a href="javascript\u200b:alert(document.cookie)">Target Link</a>'

# Stage 2: Downstream Normalization (dropping non-ASCII characters)
final_html = re.sub(r"[^\x00-\x7f]", "", sanitized_output)
print("Final Rendered HTML:", repr(final_html))
# Result: '<a href="javascript:alert(document.cookie)">Target Link</a>'
```

The CVSS Base Score is evaluated as 0.0 (Low) because triggering the vulnerability in typical web browser environments depends on downstream processing. However, in applications that perform multi-stage processing, database character conversion, or Unicode normalization after sanitization, this represents a severe stored Cross-Site Scripting (XSS) pathway.
An attacker who achieves XSS through this bypass runs script inside the vulnerable application's origin, so the Same-Origin Policy (SOP) no longer protects the victim. The attacker can hijack user sessions, read confidential application data, extract session cookies, or execute arbitrary authenticated actions on behalf of the victim.
Furthermore, this vulnerability breaks the absolute safety contract promised by Bleach to callers. Developers rely on sanitizers to output clean, safe fragments regardless of the downstream processing architecture. Failing to filter these vectors introduces security degradation across complex application architectures.
The immediate remediation is to upgrade to Mozilla Bleach version 6.4.0. This is the final release of the library and patches the vulnerability by stripping non-ASCII characters prior to scheme parsing.
```
pip install --upgrade bleach==6.4.0
```

Because Bleach is officially deprecated and has reached End-of-Life (EOL), organizations should plan a migration path to an actively maintained alternative. The recommended alternative is 'nh3', which provides Python bindings to the Rust-based 'ammonia' HTML sanitizer library. Ammonia enforces strict, RFC-compliant URI validation and is not subject to this Unicode parsing issue.
Where upgrading is not immediately possible, implement a pre-processing filter to strip Unicode invisible characters (such as \u200b, \ufeff, and \u00ad) before submitting strings to Bleach:
```python
import re

def pre_sanitize(html_input):
    # Remove common invisible Unicode characters prior to Bleach processing.
    # Note: this acts on literal characters only; entity-encoded forms
    # (e.g. &#x200b;) are decoded by Bleach afterwards and are not covered.
    unicode_pattern = re.compile(r"[\u200b\ufeff\u00ad]")
    return unicode_pattern.sub("", html_input)
```

Deploy a robust Content Security Policy (CSP) header to mitigate potential script execution if an active scheme bypasses the filters. Ensure the policy prevents inline scripts and script execution via URI schemes:
```
Content-Security-Policy: default-src 'self'; script-src 'self';
```

CVSS:3.1/AV:N/AC:H/PR:N/UI:R/S:U/C:N/I:N/A:N

| Product | Affected Versions | Fixed Version |
|---|---|---|
| bleach (Mozilla) | <= 6.3.0 | 6.4.0 |

| Attribute | Detail |
|---|---|
| CWE ID | CWE-184 (Incomplete List of Disallowed Inputs) |
| Attack Vector | Network (AV:N) |
| CVSS v3.1 Score | 0.0 (Low due to indirect downstream dependency) |
| Impact | Bypass of protocol validation filters / Secondary stored XSS |
| Exploit Status | Proof-of-Concept (PoC) available |
| KEV Status | Not listed in CISA KEV |
The product or library uses a list of disallowed inputs but does not fully validate or normalize characters, allowing attackers to bypass protection mechanisms using alternative Unicode representations.