Mar 29, 2026
OpenCC versions <= 1.1.9 fail to validate the bounds of truncated UTF-8 strings, resulting in heap out-of-bounds reads that cause DoS or information disclosure. The issue is patched in version 1.2.0 via strict length clamping.
The OpenCC (Open Chinese Convert) library contains two independent heap-based out-of-bounds read vulnerabilities, tracked as GHSA-7FQQ-Q52P-2JJG and affecting all versions up to and including 1.1.9. Both flaws reside in the library's UTF-8 parsing utilities and manifest when the software processes malformed or truncated multi-byte character sequences. Exploitation results in denial-of-service conditions or the disclosure of adjacent heap memory.
The root cause of both vulnerabilities stems from insufficient bounds checking during UTF-8 sequence decoding. The library relies on a utility function to determine character length based entirely on the leading byte of a sequence. This function does not verify if the required subsequent bytes exist within the allocated buffer boundaries.
When an attacker supplies a truncated UTF-8 sequence, the parsing logic calculates a sequence length that exceeds the actual remaining buffer size. This incorrect length calculation propagates to downstream components, specifically the segmentation and conversion modules. The resulting out-of-bounds reads lead to either a denial-of-service condition or the disclosure of adjacent heap memory.
The vulnerability originates in the UTF8Util::NextCharLength(const char* pstr) function. This function determines the expected length of a UTF-8 character, which ranges from one to six bytes, by examining the bit pattern of the first byte. The implementation assumes that the input buffer contains a well-formed UTF-8 string and does not accept a bounds parameter to verify the actual remaining buffer length.
In the MaxMatchSegmentation component, the segmentation logic tracks remaining buffer bytes using an unsigned integer variable named length. When NextCharLength() returns a size greater than the actual remaining bytes, the code executes the subtraction length -= matchedLength;. This operation causes an integer underflow, wrapping the length variable around to a value near SIZE_MAX.
The Conversion component exhibits a different failure mode driven by the same underlying parsing logic. The Convert function iterates over the input string using a loop that increments a pointer by the calculated character length. When encountering a truncated character immediately preceding a null terminator, the pointer advances past the null terminator, entirely missing the loop's exit condition.
The vulnerable code in the Conversion component uses a standard for loop to iterate through the input string. The pointer pstr is incremented by the value returned from UTF8Util::NextCharLength without verifying whether the increment pushes the pointer beyond the string's null terminator.
```cpp
// Vulnerable implementation pattern
for (const char* pstr = phrase; *pstr != '\0';) {
  size_t matchedLength = UTF8Util::NextCharLength(pstr);
  // Process character...
  pstr += matchedLength;  // Vulnerable: increments past '\0'
}
```

The patch implemented in OpenCC version 1.2.0 introduces explicit boundary tracking. The fix calculates the end of the input string, phraseEnd, prior to entering the processing loops. During each iteration, the code calculates the exact number of bytes remaining in the buffer.
```cpp
// Patched implementation
const char* phraseEnd = phrase + strlen(phrase);
for (const char* pstr = phrase; pstr < phraseEnd;) {
  size_t remainingLength = phraseEnd - pstr;
  size_t matchedLength = UTF8Util::NextCharLength(pstr);
  if (matchedLength > remainingLength) {
    matchedLength = remainingLength;  // Bounds clamping applied
  }
  // Process character...
  pstr += matchedLength;
}
```

This clamping mechanism guarantees that matchedLength never exceeds the valid boundaries of the allocated buffer. Identical defensive bounds checking was added to the dictionary matching routines to prevent the integer underflow in the segmentation logic.
Exploitation of these vulnerabilities requires the ability to supply untrusted input strings to the OpenCC processing pipeline. An attacker constructs a payload ending with a malformed UTF-8 sequence, such as the bytes \xE5\xB9. The byte \xE5 indicates a three-byte UTF-8 sequence, but the payload only provides two bytes before the null terminator.
When the Conversion component processes this input, NextCharLength returns a length of three. The internal pointer, initially positioned at the \xE5 byte, increments by three bytes. This operation advances the pointer directly over the terminating null byte at index two, landing on adjacent heap memory.
Once the pointer bypasses the null terminator, the processing loop continues to read from the heap until it happens to encounter another null byte. The library processes this leaked heap data and appends it to the legitimate conversion output. The attacker receives this output, achieving information disclosure of sensitive data residing in adjacent heap allocations.
The exploitability and impact of the information disclosure vulnerability depend heavily on the structure and state of the application's heap allocator. When the pointer bypasses the null terminator, the adjacent memory blocks determine the contents of the leaked data. Heap allocation patterns are dictated by the underlying operating system and the specific memory allocator in use.
In heavily utilized server applications, the heap contains a dense mixture of data structures, including network request buffers, active database connection strings, and cryptographic session keys. Because the out-of-bounds read is sequential, the memory immediately following the vulnerable buffer is disclosed byte by byte until a null byte terminates the read.
Attackers manipulate the heap layout prior to exploitation to maximize the value of the disclosed data. By issuing a specific sequence of valid requests, an attacker forces the allocator to place sensitive data structures immediately adjacent to the buffer used for UTF-8 conversion. This technique transforms a random memory leak into a targeted data extraction primitive.
The mitigation introduced in OpenCC version 1.2.0 entirely eliminates this attack vector by constraining memory reads to the bounds of the original allocation. The explicit boundary calculation ensures that the pointer logic remains strictly within the intended buffer, regardless of the underlying heap topology or attacker-controlled allocation patterns.
The primary impact of the Conversion component vulnerability is the disclosure of sensitive heap memory. Because the read operation continues until an arbitrary null byte is encountered, the amount of leaked data depends entirely on the state of the heap at the time of exploitation. This memory frequently contains data from other user sessions, application configuration secrets, or internal memory pointers.
The disclosure of internal memory pointers facilitates the bypass of memory layout randomization protections, such as ASLR. This information serves as a critical primitive when chaining vulnerabilities to achieve remote code execution. The attacker effectively gains a reliable memory oracle by observing the converted string output.
The vulnerability in the MaxMatchSegmentation component results in a denial-of-service condition. The integer underflow causes the application to pass a massive length value to the dictionary matching routines. The subsequent out-of-bounds read rapidly encounters unmapped memory pages, triggering a segmentation fault and crashing the host process.
The authoritative remediation for this vulnerability is upgrading the OpenCC library to version 1.2.0 or later. This release contains the comprehensive bounds checking and length clamping logic required to safely process malformed UTF-8 sequences. Development teams must recompile statically linked applications to ensure the patched library is fully integrated.
If immediate upgrading is not feasible, organizations must implement strict input validation at the application boundary. All user-supplied strings must be verified as well-formed UTF-8 before being passed to any OpenCC API. Modern programming languages provide standard library functions to efficiently validate UTF-8 encoding prior to processing.
Security teams should also review the deployment architecture of applications utilizing OpenCC. Sandboxing the processing environment or running the conversion logic in an isolated microservice minimizes the impact of potential denial-of-service conditions. These architectural controls limit the scope of heap data accessible during an information disclosure event.
CVSS v3.1 vector: CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:N/A:H

| Product | Affected Versions | Fixed Version |
|---|---|---|
| OpenCC (BYVoid) | <= 1.1.9 | 1.2.0 |
| Attribute | Detail |
|---|---|
| CWE ID | CWE-125 |
| Attack Vector | Network |
| CVSS Score | 7.5 |
| Impact | Denial of Service / Information Disclosure |
| Exploit Status | Proof of Concept available |
| KEV Status | Not Listed |
This behavior matches CWE-125 (Out-of-bounds Read): the software reads data past the end, or before the beginning, of the intended buffer.