CVE-2026-1260

Broken Tokens: Heap Corruption in Google Sentencepiece

Alon Barad
Software Engineer

Jan 23, 2026

Executive Summary (TL;DR)

Google Sentencepiece, the backbone of many NLP pipelines, was treating non-null-terminated string views as C-strings. By crafting a malicious model file, an attacker can trick the library into reading past heap boundaries, leading to crashes, information leaks, or RCE. Fixed in version 0.2.1.

A classic C-string assumption error in Google's Sentencepiece library allows malicious model files to trigger a heap-based buffer over-read and potential overflow, opening the door to arbitrary code execution.

The Hook: Your AI Model is a Trojan Horse

In the gold rush of Large Language Models (LLMs), everyone is downloading terabytes of weights and tokenizer files from HuggingFace, trusting that model.bin is just math. It isn't. It's data parsed by C++ code, and C++ code hates you.

Google Sentencepiece is the de facto standard for unsupervised text tokenization. It's used by BERT, ALBERT, T5, and countless custom implementations. It sits right at the entrance of your NLP pipeline, chewing on raw text and model definitions before your neural network even wakes up.

CVE-2026-1260 is a reminder that even modern C++ libraries written by top-tier engineers can fall victim to the oldest trick in the book: assuming a string ends with a null byte. This isn't just a parser bug; it's a loaded gun pointing at the memory heap of any server processing untrusted models.

The Flaw: A Tale of Two String Types

The root cause is a type confusion—not in the compiler sense, but in the developer's mental model. The code uses absl::string_view, a modern C++ construct that represents a string as a pointer and a length. It is not guaranteed to be null-terminated. It's often just a slice of a larger buffer.

However, the code interfaces with Darts::DoubleArray, an older trie-building library that acts like it's still 1995. The Darts::build method accepts an optional array of lengths. If you pass nullptr for that array, Darts shrugs and assumes, "Okay, these must be standard C-strings then," and proceeds to read memory until it hits a \0.

In src/normalizer.cc, the PrefixMatcher constructor takes a set of absl::string_view objects (the dictionary keys) and passes them to Darts. The fatal mistake? It passes nullptr for the lengths. The moment a dictionary key lacks a null terminator (which is common for string_view), the Darts builder goes on a heap-walking adventure, reading adjacent memory until it finds a lucky zero or crashes.

The Code: The Smoking Gun

Let's look at the crime scene in src/normalizer.cc. The developer extracted raw pointers from safe string_view objects and fed them to a hungry C-API without protection.

The Vulnerable Code:

PrefixMatcher::PrefixMatcher(const std::set<absl::string_view> &dic) {
  std::vector<const char *> key;
  key.reserve(dic.size());
  for (const auto &it : dic) key.push_back(it.data()); // <--- Danger: Raw pointer extraction
 
  trie_ = std::make_unique<Darts::DoubleArray>();
  // The third argument is 'lengths'. Passing nullptr implies null-terminated strings.
  // This is a lie.
  if (trie_->build(key.size(), const_cast<char **>(&key[0]), nullptr, nullptr) != 0) {
     // ...
  }
}

The Fix:

The patch is simple: stop lying to the library. Explicitly calculate and pass the length of each string. This forces Darts to respect the boundaries of the string_view.

PrefixMatcher::PrefixMatcher(const std::set<absl::string_view> &dic) {
  std::vector<const char *> key;
  std::vector<size_t> lengths; // <--- The safety net
  // ...
  for (const auto &it : dic) {
    key.push_back(it.data());
    lengths.push_back(it.size()); // <--- Capture actual size
  }
  
  trie_ = std::make_unique<Darts::DoubleArray>();
  // Pass lengths.data() instead of nullptr
  if (trie_->build(key.size(), const_cast<char **>(key.data()),
                   const_cast<size_t *>(lengths.data()), nullptr) != 0) {
    // ...
  }
}

The Exploit: Building a Poisoned Model

To exploit this, we don't send a network packet; we bake a cake. The "cake" is a Sentencepiece model file (protobuf-based). We need to craft a model where the normalizer_spec or dictionary contains strings that are packed tightly in the binary blob without null terminators.

When the victim loads our Evil-GPT.model:

  1. The protobuf parser allocates a large heap chunk for the string data.
  2. PrefixMatcher is initialized with string_views pointing into this chunk.
  3. The vulnerable build() call is triggered.
  4. Darts reads past the end of our strings.

The Impact:

At best, you get a SegFault (DoS). At worst, if we carefully groom the heap before loading the model, the OOB read can be used to leak heap addresses (bypassing ASLR) or, due to the nature of Trie construction, corrupt the internal state of the DoubleArray being built. Since this object manages its own memory, corrupting its structure can lead to a secondary Heap Buffer Overflow when it tries to resize or write nodes based on the invalid length data it just read.

The Fix: Mitigation & Survival

If you are running an inference API that allows users to upload custom tokenizers, you are currently vulnerable. The fix is binary: update or die.

Immediate Actions:

  1. Patch: Update to Sentencepiece v0.2.1 immediately. This version includes the fix where lengths are explicitly passed to the underlying Darts builder.
  2. Audit: Scan your environment for libsentencepiece.so. If it's older than Jan 2026, flag it.
  3. Sanitize: Treat model files like executable binaries. Do not load them from untrusted sources. If you must, run the loading process in a disposable sandbox with strict memory limits and no network access.

This vulnerability highlights the fragility of the AI supply chain. We treat models as "assets," but to the parser, they are just complex input streams that can be weaponized.


Technical Appendix

CVSS Score
8.5 / 10
CVSS:4.0/AV:L/AC:L/AT:N/PR:N/UI:P/VC:H/VI:H/VA:H/SC:N/SI:N/SA:N

Affected Systems

Google Sentencepiece < 0.2.1
PyTorch (via bundled sentencepiece)
TensorFlow (via bundled sentencepiece)
HuggingFace Transformers (dependent on sentencepiece)

Affected Versions Detail

Product: sentencepiece (Google)
Affected Versions: < 0.2.1
Fixed Version: 0.2.1

CWE: CWE-119 (Improper Restriction of Operations within the Bounds of a Memory Buffer)
CVSS v4.0: 8.5 (High)
Attack Vector: Local (User Interaction Required)
Impact: Arbitrary Code Execution / Information Disclosure
Affected Component: src/normalizer.cc (PrefixMatcher)
Patch Commit: d856b67fdb3492e035489abf9b3aaf486144b2c0
CWE-119
Improper Restriction of Operations within the Bounds of a Memory Buffer

The software performs operations on a memory buffer, but it can read from or write to a memory location that is outside of the intended boundary of the buffer.

Vulnerability Timeline

Fix committed to master branch: 2025-08-03
CVE-2026-1260 Assigned and Public Disclosure: 2026-01-22
