AI security auditors can be fooled by code they cannot see past

When Cloudflare’s Cloudforce One team began noticing unusual patterns in Workers scripts in March 2026, the anomaly was not a new strain of malware or a novel exfiltration technique. It was text. Thousands of lines of repetitive, multilingual natural language instructions, embedded inside malicious code and addressed directly to the AI systems scanning it.

The technique, known as indirect prompt injection, exploits a fundamental property of large language models: they treat all text in their context window as potential instructions, regardless of whether it was placed there by a legitimate user or by an adversary operating at one remove. By embedding deceptive “Notice to AI” safety lures inside scripts the AI was already tasked with auditing, attackers were attempting to manipulate the model into misclassifying malicious code as benign.

Cloudforce One’s response was to turn the attack into an experiment. Across 18,400 API calls, seven LLMs, and a controlled dataset of 100 confirmed malicious Workers scripts, the team systematically varied the number, placement, and language of injected bypass comments to map how different models responded under pressure. The findings, published in April 2026, complicate the assumption that frontier AI models are broadly resilient to social engineering.

The bypass is narrower than it looks

The study’s most operationally significant discovery is what the researchers call the “1% bypass zone.” When deceptive safety comments accounted for less than 1% of a file’s total content, average detection rates across all tested models dropped from a baseline of 67.3% to just 53.3%. At that ratio, the lures are subtle enough to nudge a model’s reasoning without triggering the kind of repetition alarm that flags heavily saturated files as suspicious.

The inverse is equally revealing. Once safety claims exceed roughly 25% of file content, they function as a detection fingerprint rather than a bypass, pushing identification rates to 99% as models recognise the saturation as a high-confidence marker of malicious intent. The pattern forms what researchers describe as a U-curve: moderate injection volumes are most dangerous; extremes at either end either have no effect or actively betray the attacker.

The threshold for that curve varies sharply by model tier. Non-frontier models, the high-speed, cost-optimised systems, showed the greatest vulnerability: detection fell from 90% to 67% with as few as 20 injected comments. At 1,000 comments, performance began to recover as the volume became its own signal. Frontier models showed considerably greater resilience to low-volume linguistic deception, though several exhibited a separate failure mode at extreme saturation: rather than being deceived, they produced garbled output or refused to return a structured verdict at all.

That second failure mode carries its own implications. Automated security pipelines that depend on structured responses, such as binary malicious/benign verdicts returned as JSON, can be neutralised not by convincing the AI that a file is safe, but by inducing a state in which the model fails to issue any verdict at all.
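To see why that matters, consider how an automated pipeline typically consumes model output. The sketch below is purely illustrative and assumes a hypothetical verdict schema of the form {"verdict": "malicious" | "benign"}; it is not drawn from the Cloudforce One report. The point is the fallback branch, which routes garbled or missing verdicts to human review instead of treating them as a pass.

```python
import json

def parse_verdict(raw_model_output: str) -> str:
    """Interpret a model response as a triage decision.

    Expects JSON of the form {"verdict": "malicious"} or {"verdict": "benign"}.
    Anything else -- garbled output, a refusal, a missing field -- is routed
    to human review rather than silently treated as benign.
    """
    try:
        data = json.loads(raw_model_output)
    except json.JSONDecodeError:
        return "needs_human_review"  # fail closed: no verdict is not the same as safe

    verdict = str(data.get("verdict", "")).lower() if isinstance(data, dict) else ""
    if verdict in ("malicious", "benign"):
        return verdict
    return "needs_human_review"
```

The design choice that matters is the default: under a saturation attack, the absence of a structured verdict should never be allowed to collapse into “benign.”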

Volume defeats reasoning where language cannot

The most consequential finding in the research has nothing to do with the cleverness of the deceptive text. It concerns file size. Detection rates for Workers scripts under 500KB remained near 100% across the tested models. For files exceeding 3MB, those same rates collapsed to between 12% and 18%.

The mechanism is structural rather than linguistic. Large scripts, particularly those incorporating legitimate framework libraries such as React or TON SDKs, effectively exhaust a model’s attention across thousands of lines of benign code. When a short malicious payload, such as Telegram-based data exfiltration logic, is buried within that volume, the AI’s capacity to isolate the harmful signal degrades regardless of whether the file fits entirely within the context window. As Cloudforce One notes, even with context windows of 2.8MB, models consistently failed to locate malicious logic surrounded by the noise of legitimate framework code.

This means that an attacker does not need to convince a model that their code is safe. They need only ensure the malicious signal is too small for the model to find.

Language carries its own biases

The study also documented significant model-specific sensitivities to the language of injected comments, findings that the researchers attribute to what they describe as internal “stereotypes” developed during training.

One non-frontier model showed the strongest language effect: Russian-language comments increased malicious detection by 14 percentage points. Two frontier models flagged scripts carrying Chinese- and Arabic-language comments at substantially higher rates than the English baseline. By contrast, low-resource languages such as Estonian were associated with lower detection in several models, suggesting that familiarity, or the lack of it, influences how models weight comment text during classification.

One outlier among the non-frontier models uniquely treated English-language safety notices as its strongest detection signal. For all other models, English lures were the least likely to increase suspicion, a finding with direct implications for which language a skilled attacker would choose.

A pipeline problem, not just a model problem

Cloudforce One’s defensive recommendations are notably pre-model in focus. Automated removal of comments before analysis eliminates linguistic lures entirely without requiring any change to the underlying model. Intentional truncation strategies, prioritising functional code blocks over boilerplate and SDK libraries, directly address the volume problem. Variable anonymisation, replacing descriptive names with neutral identifiers, removes the emotional anchoring that can subtly shift model verdicts. And replacing general-purpose prompts with vector-specific queries produces meaningfully more reliable results.
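As a rough illustration of what that pre-model layer could look like, the sketch below applies two of those steps, comment stripping and identifier anonymisation, to a JavaScript-style Worker script before analysis, then truncates to a fixed budget. The regexes are deliberately naive (they would mishandle comment markers inside string literals), the head-first truncation simply assumes functional code precedes bundled SDK blobs, and none of it reflects Cloudflare’s actual implementation.

```python
import re

def strip_comments(source: str) -> str:
    """Drop // line comments and /* block */ comments, the channel where
    'Notice to AI' lures live. Naive: ignores comment markers that appear
    inside string literals."""
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", "", source)

def anonymise_identifiers(source: str, identifiers: list[str]) -> str:
    """Replace descriptive, emotionally loaded names (e.g. a hypothetical
    'trustedSafetyCheck') with neutral placeholders so wording cannot
    anchor the model's verdict."""
    for i, name in enumerate(identifiers):
        source = re.sub(rf"\b{re.escape(name)}\b", f"id_{i}", source)
    return source

def denoise(source: str, identifiers: list[str], max_chars: int = 500_000) -> str:
    """Pre-model pass: remove linguistic lures, neutralise identifiers,
    then cap the volume the model has to attend to."""
    cleaned = anonymise_identifiers(strip_comments(source), identifiers)
    return cleaned[:max_chars]
```

A production pipeline would parse the source rather than pattern-match it, but even a crude pass of this kind removes the two levers the study found most persuasive: comment text and identifier wording.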

The underlying argument is that AI auditing pipelines need a denoising layer upstream of the model, not just better models downstream. The signal exists in the malicious code; the challenge is ensuring the model can actually see it.

As AI-powered security operations continue to displace human analysts in triage workflows, the findings from Cloudforce One reframe the threat model for those systems. The risk is not that adversaries will always find the perfect sentence to fool an AI. The risk is that they will simply add enough noise that the model runs out of attention before it finds the problem.

Sindhu V Kashyap

Global Technology Journalist & Multimedia Storyteller | Covering Founders, Investors & Leaders Reshaping Tech | Writer · Interviewer · Moderator · Editor
