CVEReports

CVE-2026-1260

Broken Tokens: Heap Corruption in Google Sentencepiece

Alon Barad
Software Engineer

Jan 23, 2026 · 5 min read

Executive Summary (TL;DR)

Google Sentencepiece, the backbone of many NLP pipelines, was treating non-null-terminated string views as C-strings. By crafting a malicious model file, an attacker can trick the library into reading past heap boundaries, leading to crashes, information leaks, or RCE. Fixed in version 0.2.1.

A classic C-string assumption error in Google's Sentencepiece library allows malicious model files to trigger a heap-based buffer over-read and a potential overflow, which may lead to arbitrary code execution.

The Hook: Your AI Model is a Trojan Horse

In the gold rush of Large Language Models (LLMs), everyone is downloading terabytes of weights and tokenizer files from HuggingFace, trusting that model.bin is just math. It isn't. It's data parsed by C++ code, and C++ code hates you.

Google Sentencepiece is a de facto standard for unsupervised text tokenization. It's used by ALBERT, T5, and countless custom implementations. It sits right at the entrance of your NLP pipeline, chewing on raw text and model definitions before your neural network even wakes up.

CVE-2026-1260 is a reminder that even modern C++ libraries written by top-tier engineers can fall victim to the oldest trick in the book: assuming a string ends with a null byte. This isn't just a parser bug; it's a loaded gun pointing at the memory heap of any server processing untrusted models.

The Flaw: A Tale of Two String Types

The root cause is a type confusion—not in the compiler sense, but in the developer's mental model. The code uses absl::string_view, a modern C++ construct that represents a string as a pointer and a length. It is not guaranteed to be null-terminated. It's often just a slice of a larger buffer.
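To make the pointer-plus-length point concrete, here is a minimal sketch. It assumes std::string_view (C++17) as a stand-in for absl::string_view, and both helper names are made up for illustration. The backing buffer is a std::string, which keeps a terminator at the end of the whole buffer, so the scan stays in bounds here; an arbitrary heap slice offers no such guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Hypothetical helper: the length a C-string API would compute from view.data().
// It ignores view.size() and scans until the first '\0' in the backing memory.
// Only call this when the backing buffer is known to be null-terminated.
std::size_t LengthSeenAsCString(std::string_view view) {
  return std::strlen(view.data());
}

// Hypothetical helper: returns a view of the first five bytes of a longer buffer.
// No bytes are copied and no terminator is written after the slice.
std::string_view FirstFiveBytes(const std::string& buffer) {
  assert(buffer.size() >= 5);
  return std::string_view(buffer.data(), 5);
}
```

For a buffer holding "tokenizer", the slice reports a size of 5, while a C-string scan starting at the same pointer keeps going until the end of the whole buffer.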

However, the code interfaces with Darts::DoubleArray, an older trie-building library that acts like it's still 1995. Its build method accepts an optional array of key lengths. If you pass nullptr for that array, Darts shrugs and assumes, "Okay, these must be standard C-strings then," and proceeds to read memory until it hits a \0.

In src/normalizer.cc, the PrefixMatcher constructor takes a set of absl::string_view objects (the dictionary keys) and passes them to Darts. The fatal mistake? It passed nullptr for the lengths. The moment a dictionary key lacks a null terminator (which is common for string_view), the Darts builder goes on a heap-walking adventure, reading adjacent memory until it finds a lucky zero or crashes.
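The fallback described above can be modeled in a few lines. This is a toy sketch, not the Darts source: EffectiveKeyLengths is a hypothetical function that only mirrors the convention "no lengths array means strlen()".

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Toy model of the "lengths may be nullptr" convention (NOT the Darts source).
// With a lengths array, each key's length is taken as given.
// Without one, each key's length comes from strlen(), i.e. a scan for '\0'.
std::vector<std::size_t> EffectiveKeyLengths(std::size_t num_keys,
                                             const char* const* keys,
                                             const std::size_t* lengths) {
  assert(keys != nullptr || num_keys == 0);
  std::vector<std::size_t> result;
  result.reserve(num_keys);
  for (std::size_t i = 0; i < num_keys; ++i) {
    result.push_back(lengths != nullptr ? lengths[i] : std::strlen(keys[i]));
  }
  return result;
}
```

With three two-byte keys sliced out of the single terminated buffer "abcdef", explicit lengths give 2, 2, 2, while the nullptr path gives 6, 4, 2: every key silently swallows whatever follows it. If the buffer had no terminator at all, the scan would continue into neighboring memory.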

The Code: The Smoking Gun

Let's look at the crime scene in src/normalizer.cc. The developer extracted raw pointers from safe string_view objects and fed them to a hungry C-API without protection.

The Vulnerable Code:

PrefixMatcher::PrefixMatcher(const std::set<absl::string_view> &dic) {
  std::vector<const char *> key;
  key.reserve(dic.size());
  for (const auto &it : dic) key.push_back(it.data()); // <--- Danger: Raw pointer extraction
 
  trie_ = std::make_unique<Darts::DoubleArray>();
  // The third argument is 'lengths'. Passing nullptr implies null-terminated strings.
  // This is a lie.
  if (trie_->build(key.size(), const_cast<char **>(&key[0]), nullptr, nullptr) != 0) {
     // ...
  }
}

The Fix:

The patch is simple: stop lying to the library. Explicitly calculate and pass the length of each string. This forces Darts to respect the boundaries of the string_view.

PrefixMatcher::PrefixMatcher(const std::set<absl::string_view> &dic) {
  std::vector<const char *> key;
  std::vector<size_t> lengths; // <--- The safety net
  // ...
  for (const auto &it : dic) {
    key.push_back(it.data());
    lengths.push_back(it.size()); // <--- Capture actual size
  }
  
  trie_ = std::make_unique<Darts::DoubleArray>();
  // Pass lengths.data() instead of nullptr
  if (trie_->build(key.size(), const_cast<char **>(key.data()),
                   const_cast<size_t *>(lengths.data()), nullptr) != 0) {
    // ...
  }
}
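The same pattern can be tried without Abseil or Darts. The sketch below assumes std::string_view in place of absl::string_view; KeyTable and MakeKeyTable are hypothetical names, not part of Sentencepiece.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string_view>
#include <vector>

// Parallel arrays in the shape a pointer-plus-length C API expects.
struct KeyTable {
  std::vector<const char*> keys;
  std::vector<std::size_t> lengths;
};

// Hypothetical helper: records each view's pointer together with its size,
// so the consumer never has to search for a terminator.
KeyTable MakeKeyTable(const std::set<std::string_view>& dic) {
  KeyTable table;
  table.keys.reserve(dic.size());
  table.lengths.reserve(dic.size());
  for (const std::string_view entry : dic) {
    table.keys.push_back(entry.data());
    table.lengths.push_back(entry.size());
  }
  assert(table.keys.size() == table.lengths.size());
  return table;
}
```

Keeping the two vectors in one struct makes it harder to pass the pointers along while forgetting the lengths, which is the mistake the patch corrects.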

The Exploit: Building a Poisoned Model

To exploit this, we don't send a network packet; we bake a cake. The "cake" is a Sentencepiece model file (protobuf-based). We need to craft a model where the normalizer_spec or dictionary contains strings that are packed tightly in the binary blob without null terminators.

When the victim loads our Evil-GPT.model:

  1. The protobuf parser allocates a large heap chunk for the string data.
  2. PrefixMatcher is initialized with string_views pointing into this chunk.
  3. The vulnerable build() call is triggered.
  4. Darts reads past the end of our strings.
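The property that steps 1 and 2 rely on is that entries sliced out of one buffer sit back to back, with nothing between them. The sketch below uses a toy length-prefixed layout (one length byte, then the payload); this layout and the name SliceLengthPrefixed are assumptions for illustration, not the real model format or Sentencepiece's parsing code.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Toy layout (an assumption): each entry is one length byte followed by that
// many payload bytes, with no terminator. Returns views that point directly
// into `blob`. Parsing stops at the first truncated entry.
std::vector<std::string_view> SliceLengthPrefixed(std::string_view blob) {
  std::vector<std::string_view> entries;
  std::size_t pos = 0;
  while (pos < blob.size()) {
    const std::size_t len = static_cast<unsigned char>(blob[pos]);
    ++pos;
    if (len > blob.size() - pos) break;  // truncated entry: never read past the blob
    assert(pos + len <= blob.size());
    entries.push_back(blob.substr(pos, len));
    pos += len;
  }
  return entries;
}
```

In a blob holding the entries "ab" and "cde", the byte right after "ab" is the length prefix of the next entry, not a \0. Any consumer that ignores the view's size and scans for a terminator runs straight into the neighboring data.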

The Impact:

At best, you get a segfault (DoS). At worst, if an attacker carefully grooms the heap before the model is loaded, the out-of-bounds read could be used to leak heap addresses (bypassing ASLR) or, due to the nature of trie construction, corrupt the internal state of the DoubleArray being built. Since this object manages its own memory, corrupting its structure could lead to a secondary heap buffer overflow when it tries to resize or write nodes based on the invalid length data it just read.

The Fix: Mitigation & Survival

If you are running an inference API that allows users to upload custom tokenizers and your Sentencepiece build predates 0.2.1, you are vulnerable. The fix is binary: update or die.

Immediate Actions:

  1. Patch: Update to Sentencepiece v0.2.1 immediately. This version includes the fix where lengths are explicitly passed to the underlying Darts builder.
  2. Audit: Scan your environment for libsentencepiece.so, including copies bundled inside other packages. If its version is below 0.2.1, flag it.
  3. Sanitize: Treat model files like executable binaries. Do not load them from untrusted sources. If you must, run the loading process in a disposable sandbox with strict memory limits and no network access.
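For the audit step, the version comparison itself is easy to get wrong with plain string comparison ("0.10.0" sorts before "0.2.1" as text). Below is a minimal sketch of a numeric check against the fixed version 0.2.1. The function names are hypothetical, and how you obtain the installed version string is left to your environment.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <exception>
#include <sstream>
#include <string>

// Parses "major.minor.patch" into three integers.
// Missing or non-numeric components are treated as 0.
std::array<int, 3> ParseVersion(const std::string& text) {
  std::array<int, 3> parts = {0, 0, 0};
  std::istringstream stream(text);
  std::string field;
  for (std::size_t i = 0; i < parts.size() && std::getline(stream, field, '.'); ++i) {
    try {
      parts[i] = std::stoi(field);
    } catch (const std::exception&) {
      parts[i] = 0;  // non-numeric component
    }
  }
  return parts;
}

// True when `installed` is numerically lower than the fixed release 0.2.1.
bool IsBelowFixedVersion(const std::string& installed) {
  assert(!installed.empty() && "version string must not be empty");
  const std::array<int, 3> fixed = {0, 2, 1};
  return ParseVersion(installed) < fixed;  // lexicographic compare, part by part
}
```

Under these assumptions, "0.2.0" and "0.1.99" are flagged, while "0.2.1" and "0.10.0" are not.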

This vulnerability highlights the fragility of the AI supply chain. We treat models as "assets," but to the parser, they are just complex input streams that can be weaponized.

Official Patches

Google: Sentencepiece v0.2.1 Release

Technical Appendix

CVSS Score
8.5 / 10
CVSS:4.0/AV:L/AC:L/AT:N/PR:N/UI:P/VC:H/VI:H/VA:H/SC:N/SI:N/SA:N

Affected Systems

  • Google Sentencepiece < 0.2.1
  • PyTorch (via bundled sentencepiece)
  • TensorFlow (via bundled sentencepiece)
  • HuggingFace Transformers (dependent on sentencepiece)

Affected Versions Detail

Product | Affected Versions | Fixed Version
sentencepiece (Google) | < 0.2.1 | 0.2.1

Attribute | Detail
CWE | CWE-119 (Improper Restriction of Operations within the Bounds of a Memory Buffer)
CVSS v4.0 | 8.5 (High)
Attack Vector | Local (User Interaction Required)
Impact | Arbitrary Code Execution / Information Disclosure
Affected Component | src/normalizer.cc (PrefixMatcher)
Patch Commit | d856b67fdb3492e035489abf9b3aaf486144b2c0

MITRE ATT&CK Mapping

  • T1204.002 User Execution: Malicious File (Tactic: Execution)
  • T1190 Exploit Public-Facing Application (Tactic: Initial Access)

CWE-119: Improper Restriction of Operations within the Bounds of a Memory Buffer

The software performs operations on a memory buffer, but it can read from or write to a memory location that is outside of the intended boundary of the buffer.

Known Exploits & Detection

Theory: Exploitation requires crafting a protobuf model with non-null-terminated dictionary entries.

Vulnerability Timeline

  • 2025-08-03: Fix committed to master branch
  • 2026-01-22: CVE-2026-1260 assigned and publicly disclosed
