Executive Summary (TL;DR)

Mozilla Bleach versions up to 6.3.0 fail to sanitize URLs containing high-plane Unicode or invisible characters in the scheme prefix. This allows blocked protocols like 'javascript:' to bypass sanitization filters, creating stored Cross-Site Scripting (XSS) risks in downstream environments that normalize or strip Unicode data.

The underlying technical flaw lies within the sanitize_uri_value function inside the bleach/sanitizer.py component. Prior to passing a URI string to Python's standard library parser (urllib.parse.urlparse), Bleach performs an initial cleanup operation to strip backticks, control characters, and standard whitespace.

This cleanup is executed using the regular expression re.sub(r"[`\000-\040\177-\240\s]+", "", normalized_uri). The pattern matches the backtick, ASCII control characters and the space (\000-\040), the range from DEL through the no-break space (\177-\240), and whitespace matched by \s. It does not match invisible formatting characters above U+00A0 such as the Zero-Width Space (\u200b), Byte-Order Mark (\ufeff), or Soft Hyphen (\u00ad), nor any other non-whitespace code point above that range.
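A quick way to see the gap is to run the quoted pattern against a handful of characters. This is a minimal sketch that assumes the pattern includes the backtick, as in the code excerpt; the sample set and the helper name `survives` are illustrative and not part of Bleach.

```python
import re

# Pattern as quoted in the Bleach <= 6.3.0 excerpt.
LEGACY_STRIP = re.compile(r"[`\000-\040\177-\240\s]+")

# Illustrative sample characters (an assumption, chosen for this sketch).
samples = {
    "space": " ",
    "no-break space": "\u00a0",
    "em space": "\u2003",
    "soft hyphen": "\u00ad",
    "zero-width space": "\u200b",
    "byte-order mark": "\ufeff",
}

def survives(ch):
    """Return True if `ch` is still present after the legacy cleanup."""
    return LEGACY_STRIP.sub("", f"java{ch}script") != "javascript"

survivors = [name for name, ch in samples.items() if survives(ch)]
print(survivors)
```

The three characters that survive are all Unicode format characters rather than whitespace, which is why neither the explicit ranges nor \s removes them.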

When an attacker constructs a link containing an invisible Unicode character inside the protocol scheme name (e.g., javascript\u200b:alert(1)), the character is not removed during the regex cleanup stage. The resulting string is subsequently passed to urllib.parse.urlparse. Because the string contains a non-ASCII character within the scheme sequence, urlparse fails to recognize javascript\u200b as a valid scheme under RFC 3986 rules and parses the value as a relative path.
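The parsing behavior can be checked with the standard library directly. This sketch calls `urllib.parse.urlparse` on a plain and an obfuscated value; the variable names are illustrative.

```python
from urllib.parse import urlparse

plain = urlparse("javascript:alert(1)")
obfuscated = urlparse("javascript\u200b:alert(1)")

# The plain value yields a scheme; the obfuscated one does not, because
# U+200B is not a legal scheme character, so the whole value becomes the path.
print(repr(plain.scheme), repr(plain.path))
print(repr(obfuscated.scheme), repr(obfuscated.path))
```

With an empty scheme, any check of the form "is the scheme on the blocklist or allowlist" has nothing to compare against.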

Since urlparse does not extract javascript as the scheme, Bleach treats the input as a safe, relative path. The sanitizer permits the obfuscated href attribute to pass through unchanged. This allows the dangerous payload to reside in the cleaned dataset.

GHSA-8RFP-98V4-MMR6: Protocol-Filtering Bypass via Unicode Obfuscation in Mozilla Bleach

Amit Schendel

Senior Security Researcher

Jun 16, 2026

Executive Summary (TL;DR)

Mozilla Bleach is an open-source HTML sanitizing library for Python. Versions up to and including 6.3.0 contain an incomplete filtering implementation in the URI validation logic ('sanitize_uri_value'). This logic fails to detect disallowed protocols, such as 'javascript:', if they contain Unicode invisible characters, whitespace characters, or characters with a code point greater than U+00A0. While standard-compliant web browsers do not directly execute invalid URI schemes containing these non-standard characters, downstream systems that normalize Unicode text by stripping invisible or non-ASCII characters can unintentionally reactivate the 'javascript:' prefix, causing Cross-Site Scripting (XSS). Additionally, this behavior violates Bleach's core sanitization contract by outputting URIs that bypass protocol allowlists configured by the caller.

Attack Flow Diagram

Vulnerability Overview

Mozilla Bleach is a Python-based HTML sanitization library that parses, cleans, and filters HTML fragments from untrusted inputs. It operates as a primary defense boundary in web applications, validating allowed tags, attributes, and URI schemes to prevent Cross-Site Scripting (XSS) and data injection vulnerabilities.

Under normal execution, when developers enable anchor tags (<a>) and link attributes (href), Bleach checks the value of the href attribute. It guarantees that the parsed scheme matches an explicit allowlist of safe protocols, such as http, https, or mailto. Any URI containing a disallowed scheme, such as javascript:, is stripped or nullified.

The vulnerability tracked as GHSA-8RFP-98V4-MMR6 resides in the preprocessing step of the URI validation logic. When processing attributes, Bleach fails to identify blocked protocols if they contain specific high-range Unicode code points, invisible characters, or non-standard whitespace. Consequently, the invalid URI bypasses validation checks and is preserved in the output, which violates Bleach's security guarantees and exposes downstream rendering components.

Root Cause Analysis

Code Analysis and Patch Verification

In version 6.3.0 and prior, the preprocessing code in bleach/sanitizer.py is configured as follows:

# Vulnerable Code: bleach/sanitizer.py (<= v6.3.0)
def sanitize_uri_value(self, value, allowed_protocols):
    # Convert HTML entities to raw characters
    normalized_uri = html5lib_shim.convert_entities(value)
 
    # Remove backtick, whitespace, and control characters (misses invisible characters above U+00A0)
    normalized_uri = re.sub(r"[`\000-\040\177-\240\s]+", "", normalized_uri)
 
    # Remove REPLACEMENT characters
    normalized_uri = normalized_uri.replace("\ufffd", "")
 
    # Lowercase for pattern matching
    normalized_uri = normalized_uri.lower()

The vulnerability was remediated in commit 7c4867c32344d1c961107fae62240a6f0dc680dc by removing the specific replacement of the \ufffd character and introducing a regular expression that strips all non-ASCII characters from the URI prior to the validation phase:

# Patched Code: bleach/sanitizer.py (v6.4.0)
def sanitize_uri_value(self, value, allowed_protocols):
    # Convert HTML entities to raw characters
    normalized_uri = html5lib_shim.convert_entities(value)
 
    # Strip backtick, whitespace, and control characters
    normalized_uri = re.sub(r"[`\000-\040\177-\240\s]+", "", normalized_uri)
 
    # Strip non-ASCII characters so that urlparse can parse the url into
    # components correctly. This drops invisible and whitespace unicode
    # characters among other things.
    normalized_uri = re.sub(r"[^\x00-\x7f]", "", normalized_uri)
 
    # Lowercase value to make matching easier
    normalized_uri = normalized_uri.lower()

This remediation ensures that any high-plane Unicode or non-ASCII invisible character is removed before scheme parsing. An input of javascript\u200b:alert(1) is sanitized to javascript:alert(1) during preprocessing. This collapsed string is successfully parsed by urlparse as the javascript scheme, which Bleach flags as a blocked protocol and correctly strips from the tag.
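The effect of the extra pass can be sketched by running both normalizations side by side. The two `normalize_*` helpers mirror the regex steps in the excerpts above but omit the entity-conversion call, and `scheme_is_acceptable` is a hypothetical, simplified decision rule written for this illustration; it is not Bleach's actual code.

```python
import re
from urllib.parse import urlparse

STRIP_BASIC = r"[`\000-\040\177-\240\s]+"

def normalize_legacy(uri):
    # Mirrors the <= 6.3.0 excerpt: basic strip, then drop U+FFFD.
    uri = re.sub(STRIP_BASIC, "", uri)
    uri = uri.replace("\ufffd", "")
    return uri.lower()

def normalize_patched(uri):
    # Mirrors the 6.4.0 excerpt: basic strip, then drop every non-ASCII character.
    uri = re.sub(STRIP_BASIC, "", uri)
    uri = re.sub(r"[^\x00-\x7f]", "", uri)
    return uri.lower()

def scheme_is_acceptable(normalized_uri, allowed_protocols):
    # Hypothetical rule (assumption): accept allowlisted schemes and
    # treat scheme-less values as relative URLs.
    scheme = urlparse(normalized_uri).scheme
    return scheme == "" or scheme in allowed_protocols

payload = "javascript\u200b:alert(1)"
allowed = {"http", "https"}

legacy_ok = scheme_is_acceptable(normalize_legacy(payload), allowed)
patched_ok = scheme_is_acceptable(normalize_patched(payload), allowed)
print("legacy accepts payload:", legacy_ok)
print("patched accepts payload:", patched_ok)
```

Under these assumptions the legacy path accepts the payload as a relative value, while the patched path exposes the javascript scheme and rejects it.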

While this fix closes the Unicode-based bypass, stripping every non-ASCII character also affects legitimate internationalized domain names (IDNs) and paths that use non-ASCII characters. The maintainers accepted this functional limitation as a necessary trade-off for security enforcement in the project's final release.

Exploitation Methodology

Exploiting this vulnerability requires a multi-tier environment where Bleach's output is processed by a downstream component before rendering. Standard browsers do not execute schemes containing raw invisible Unicode characters directly, as they interpret them as relative paths or unrecognized schemes.

The vulnerability is triggered when the database, template formatter, or downstream backend processor normalizes the Unicode input (e.g., stripping non-ASCII characters, applying compatibility decompositions, or running character cleanup scripts). If a backend normalizes the string javascript\u200b:alert(document.cookie) by removing non-ASCII elements, the invisible character is discarded, resulting in the executable payload javascript:alert(document.cookie).

A Python proof of concept demonstrates how Bleach preserves the payload while downstream execution collapses it into a functional exploit vector:

import bleach
import re
 
# Configure sanitizer allowlists
allowed_tags = ['a']
allowed_attrs = {'a': ['href']}
allowed_protocols = ['http', 'https']
 
# Construct input payload with Zero-Width Space (\u200b)
raw_input = '<a href="javascript\u200b:alert(document.cookie)">Target Link</a>'
 
# Stage 1: Bleach sanitization (v6.3.0)
sanitized_output = bleach.clean(raw_input, tags=allowed_tags, attributes=allowed_attrs, protocols=allowed_protocols)
print("Sanitized Output:", repr(sanitized_output))
# Result: '<a href="javascript\u200b:alert(document.cookie)">Target Link</a>'
 
# Stage 2: Downstream Normalization (dropping non-ASCII characters)
final_html = re.sub(r"[^\x00-\x7f]", "", sanitized_output)
print("Final Rendered HTML:", repr(final_html))
# Result: '<a href="javascript:alert(document.cookie)">Target Link</a>'
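Not every downstream transformation collapses every payload, so it can help to test the specific normalization a pipeline applies. The sketch below compares ASCII stripping with NFKC compatibility normalization from the standard `unicodedata` module. The fullwidth-letter input is an assumption added for illustration; the text above only discusses invisible characters, and whether a given Bleach version passes that input is not tested here.

```python
import re
import unicodedata

def strip_non_ascii(text):
    """Drop every character outside the ASCII range."""
    return re.sub(r"[^\x00-\x7f]", "", text)

def nfkc(text):
    """Apply Unicode compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", text)

zero_width = "javascript\u200b:alert(1)"   # Zero-Width Space inside the scheme
fullwidth = "\uff4aavascript:alert(1)"     # fullwidth "j" (illustrative assumption)

target = "javascript:alert(1)"
print("strip_non_ascii(zero_width):", strip_non_ascii(zero_width) == target)
print("nfkc(zero_width):          ", nfkc(zero_width) == target)
print("nfkc(fullwidth):           ", nfkc(fullwidth) == target)
```

ASCII stripping reactivates the zero-width payload, NFKC leaves it untouched, and NFKC does fold the fullwidth letter into a plain "j". Which obfuscations matter therefore depends on the exact downstream step.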

Impact & Security Risk Assessment

The CVSS v3.1 base score is 0.0. The report labels this Low, although 0.0 corresponds to "None" on the CVSS v3.1 qualitative scale. The score reflects the reliance on downstream processing to trigger the vulnerability in typical web browser environments. However, in applications that perform multi-stage processing, database character conversion, or Unicode normalization after sanitization, this represents a severe stored Cross-Site Scripting (XSS) pathway.

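An attacker who achieves XSS through this bypass runs script in the context of the vulnerable origin, which sidesteps the protection the Same-Origin Policy (SOP) would otherwise provide. From there the attacker can hijack user sessions, read confidential application data, extract session cookies that are not marked HttpOnly, or perform arbitrary authenticated actions on behalf of the victim.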

Furthermore, this vulnerability breaks the absolute safety contract promised by Bleach to callers. Developers rely on sanitizers to output clean, safe fragments regardless of the downstream processing architecture. Failing to filter these vectors introduces security degradation across complex application architectures.

Remediation and Migration Guidance

The immediate remediation is to upgrade to Mozilla Bleach version 6.4.0. This is the final release of the library and patches the vulnerability by stripping non-ASCII characters prior to scheme parsing.

pip install --upgrade bleach==6.4.0

Because Bleach is officially deprecated and has reached End-of-Life (EOL), organizations should plan a migration path to an actively maintained alternative. The recommended alternative is 'nh3', which provides Python bindings to the Rust-based 'ammonia' HTML sanitizer library. Ammonia enforces strict, RFC-compliant URI validation and is not subject to this Unicode parsing issue.

Where upgrading is not immediately possible, implement a pre-processing filter to strip Unicode invisible characters (such as \u200b, \ufeff, and \u00ad) before submitting strings to Bleach:

import re
 
def pre_sanitize(html_input):
    # Remove common invisible Unicode characters prior to Bleach processing
    unicode_pattern = re.compile(r"[\u200b\ufeff\u00ad]")
    return unicode_pattern.sub("", html_input)
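A three-character list is easy to outgrow. One broader variant, sketched here as an alternative technique rather than anything the Bleach project prescribes, removes every character in Unicode general category Cf (format characters) using the standard `unicodedata` module. The function name `strip_format_chars` is hypothetical.

```python
import unicodedata

def strip_format_chars(text):
    """Remove all Unicode format characters (general category Cf)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

# Zero-Width Space and Word Joiner are both category Cf and are removed.
cleaned = strip_format_chars("javascript\u200b:\u2060alert(1)")
print(repr(cleaned))

# Ordinary non-ASCII letters are left alone.
print(repr(strip_format_chars("caf\u00e9")))
```

This trades precision for coverage: Cf also includes characters with legitimate uses, such as the zero-width joiner in emoji sequences and bidirectional marks, so applying it to whole documents may alter visible text. It also does not address non-format lookalikes, and a text-level filter like this may miss characters written as HTML entities (for example `&#x200b;`), so it is a stopgap rather than a substitute for upgrading.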

Deploy a robust Content Security Policy (CSP) header to mitigate potential script execution if an active scheme bypasses the filters. Ensure the policy prevents inline scripts and script execution via URI schemes:

Content-Security-Policy: default-src 'self'; script-src 'self';
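How the header gets attached depends on the web stack. As one framework-neutral illustration, the sketch below wraps a WSGI application so that every response carries the policy shown above. The names `with_csp`, `demo_app`, and `protected_app` are hypothetical, and a real deployment may prefer to set the header in the framework or at the reverse proxy.

```python
CSP_VALUE = "default-src 'self'; script-src 'self';"

def with_csp(app, policy=CSP_VALUE):
    """Wrap a WSGI app so each response includes the given CSP header."""
    def wrapped(environ, start_response):
        def start_with_csp(status, headers, exc_info=None):
            # Drop any existing CSP header so exactly one policy is sent.
            kept = [(name, value) for name, value in headers
                    if name.lower() != "content-security-policy"]
            kept.append(("Content-Security-Policy", policy))
            return start_response(status, kept, exc_info)
        return app(environ, start_with_csp)
    return wrapped

def demo_app(environ, start_response):
    # Placeholder application used only for this illustration.
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [b"<a href='/profile'>Profile</a>"]

protected_app = with_csp(demo_app)
```

Because the policy omits 'unsafe-inline' from script-src, browsers that enforce CSP refuse to run javascript: URLs even if one reaches the page.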

Official Patches

Mozilla: Commit implementing non-ASCII character removal before validation

Technical Appendix

CVSS Score

0.0 / 10

CVSS:3.1/AV:N/AC:H/PR:N/UI:R/S:U/C:N/I:N/A:N

Affected Systems

Mozilla Bleach <= 6.3.0

Affected Versions Detail

Product	Affected Versions	Fixed Version
bleach (Mozilla)	<= 6.3.0	6.4.0

Attribute	Detail
CWE ID	CWE-184 (Incomplete List of Disallowed Inputs)
Attack Vector	Network (AV:N)
CVSS v3.1 Score	0.0 (Low due to indirect downstream dependency)
Impact	Bypass of protocol validation filters / Secondary stored XSS
Exploit Status	Proof-of-Concept (PoC) available
KEV Status	Not listed in CISA KEV

MITRE ATT&CK Mapping

Technique ID	Technique	Tactic
T1204.001	User Execution: Malicious Link	Execution
T1036	Masquerading	Defense Evasion

CWE-184: Incomplete List of Disallowed Inputs

The product or library uses a list of disallowed inputs but does not fully validate or normalize characters, allowing attackers to bypass protection mechanisms using alternative Unicode representations.

Known Exploits & Detection

Research Context: Demonstrates Bleach v6.3.0 bypass using Zero-Width Space in the protocol scheme to evade urlparse and trigger XSS on downstream normalization.

Vulnerability Timeline

2026-03-16: Fix commit developed and merged into the main development branch
2026-06-05: Bleach version 6.4.0 tagged and published, announcing EOL deprecation
2026-06-16: GHSA-8RFP-98V4-MMR6 advisory published
