← Back to article

Sources

Grounding, citations, and further reading for The Instruction You Didn't Write.

All of this is optional. These are the sources used to write the article, shown here as grounding for the research behind each claim. Nothing on this page is required reading, and you do not need to purchase any of these books.

The article itself is self-contained. This page exists so that the work is properly cited and so that anyone who wants to go deeper on a specific topic knows where to look.

About the Sources

Lakshmanan & Hapke: Generative AI Design Patterns

Lakshmanan, Valliappa & Hannes Hapke. O'Reilly Media, 2025. GitHub issue #11

Ch. 7 covers agentic patterns with injection mitigations, including Action-Selector, Dual-LLM, and Context-Minimization patterns. Ch. 9 presents Pattern 32 Guardrails, covering LLM Guard's PromptInjection scanner and LLM-as-Judge postprocessing for output validation.

Ozdemir: Building Agentic AI

Ozdemir, Sinan. "Building Agentic AI: Workflows, Fine-Tuning, Optimization, and Deployment." Addison-Wesley, 2025. GitHub issue #12

Covers security in agentic workflows where tool-calling amplifies injection risk. Particularly relevant to the MCP amplification and multi-agent propagation sections of the article.

Albada: Building Applications with AI Agents

Albada, Michael. O'Reilly Media, 2025. GitHub issue #9

Covers agent sandboxing, permission scoping, and human-in-the-loop patterns. Provides practical architecture guidance for the defensive measures discussed in the practitioner section.

Widdows & Cohen: Large Language Models

Widdows, Dominic & Trevor Cohen. "Large Language Models: How They Work and Why They Matter." SemanticVectors Publishing, 2025. GitHub issue #45

Ch. 4-5 cover the attention mechanism and sequence modeling that explain why instruction/data separation is architecturally impossible. The core technical argument of this article rests on the transformer architecture they describe.

Introduction

3SpAIware: Persistent Data Exfiltration via ChatGPT Memory ↑ article

Rehberger's SpAIware disclosure revealed that ChatGPT's memory feature could be weaponized through indirect prompt injection. A crafted web page, when summarized by the model, planted a persistent instruction that silently exfiltrated all subsequent user messages. OpenAI's initial triage decision, classifying it as a "model safety issue" rather than a security vulnerability, illustrates how the industry classified prompt injection in 2024: as a nuisance, not a vulnerability. The CVE system had no category for it. OWASP had to create one.

Rehberger, J. "SpAIware: Persistent Data Exfiltration via ChatGPT Memory." Embrace The Red, 2024.

The Core Problem

17NCSC: Prompt Injection May Never Be Totally Mitigated ↑ article

The UK's National Cyber Security Centre stated in December 2025 that prompt injection "may never be totally mitigated." Their recommendation was to stop seeking a silver-bullet solution and instead reduce risk through system design. The SQL injection analogy is the single most misleading framing in LLM security discourse. Every time someone says "we'll solve prompt injection like we solved SQL injection," they are implicitly claiming that a channel-separation mechanism exists for LLMs. It does not. NCSC is right to call this out explicitly.

NCSC. "Mistaking AI Vulnerability Could Lead to Large-Scale Breaches." UK National Cyber Security Centre, 2025.

Indirect Injection

1Greshake et al.: Indirect Prompt Injection in LLM-Integrated Applications ↑ article

Greshake's paper reframed the entire threat model for prompt injection. The key insight was that the attacker does not need access to the user's input field; they need access to any data the model will process. The paper was presented at Black Hat USA 2023 and documented in MITRE ATLAS as case study AML.CS0020. The fact that it took a formal MITRE classification to get the industry to treat this as a real attack class, rather than a curiosity, is telling.

Greshake, K., et al. "Not What You've Signed Up For." arXiv:2302.12173, 2023.

The Incidents

2ASCII Smuggling Against Microsoft 365 Copilot ↑ article

Rehberger discovered that Unicode Tags (U+E0000 to U+E007F) are invisible to humans in every major user interface but readable by language models. He built a four-step attack chain against Microsoft 365 Copilot that could exfiltrate email contents, MFA codes, and any document in the user's environment. Microsoft initially classified the vulnerability as "low severity," a pattern the industry repeats: the CVE system struggles with prompt injection because no single step in the chain looks dangerous in isolation. It is only the composition that is lethal.

Rehberger, J. "Microsoft Copilot: From Prompt Injection to Data Exfiltration Using ASCII Smuggling." Embrace The Red, 2024.

6GeminiJack: Zero-Click Enterprise Data Exfiltration ↑ article

Noma Labs discovered that hidden instructions embedded in shared Google Docs, Calendar events, or emails could commandeer Gemini's enterprise search. A single injection could reach Gmail, Calendar, Docs, Drive, and Vertex AI Search. Google's response was structural: they separated Vertex AI Search from Gemini Enterprise entirely, fundamentally restructuring how the two systems interact with retrieval.

Levi, S. "GeminiJack: Zero-Click Enterprise Data Exfiltration." Noma Security, 2025.

5EchoLeak: Zero-Click Prompt Injection in Microsoft 365 Copilot ↑ article

EchoLeak (CVE-2025-32711) is the most instructive incident because Microsoft had already deployed multiple defensive layers against exactly this kind of attack. The researchers at Aim Labs bypassed all four: the XPIA Classifier, link redaction, image auto-fetch blocking, and content security policy. EchoLeak should be the case study in every security architecture course that covers LLM integration. It is not a failure of engineering effort; Microsoft had invested heavily in exactly the right defenses. It is a failure of the defensive model itself. You cannot reliably filter instructions out of data when the processing engine treats all input identically.

Reddy, P. and Gujral, A. "EchoLeak: Zero-Click Prompt Injection in Microsoft 365 Copilot." arXiv:2509.10540, 2025.

4GitHub Copilot: Wormable Remote Code Execution ↑ article

CVE-2025-53773 demonstrated that prompt injection in a code repository could achieve remote code execution on a developer's machine. The attack modified VS Code settings to disable all user confirmation dialogs, then executed shell commands without prompting the user. The attack is wormable: a single compromised GitHub repository can infect any developer who opens it in VS Code with Copilot enabled. CVSS score: 7.8 (HIGH).

Rehberger, J. "GitHub Copilot Remote Code Execution via Prompt Injection." Embrace The Red, 2025.

The MCP Amplification

14Unit 42: MCP Attack Vectors ↑ article

Unit 42 at Palo Alto Networks documented three primary MCP attack vectors: resource theft through hidden instructions that burn API credits, conversation hijacking through persistent instructions embedded across turns, and covert tool invocation through hidden prompts that trigger file operations without user consent. The MCP protocol's design assumed benign servers, and each of these vectors exploits that assumption.

Unit 42. "New Prompt Injection Attack Vectors Through MCP." Palo Alto Networks, 2025.

15AuthZed: Timeline of MCP Breaches ↑ article

The MCP breach timeline maintained by AuthZed reads like a controlled demolition of trust assumptions. WhatsApp chat exfiltration in April. GitHub private repo access in May. Cross-tenant data leakage in June. RCE in the MCP Inspector itself in June. Supply chain compromise in September. Registry-wide control in October. Each incident exploits a different trust boundary, and together they demonstrate that the protocol's design assumed benign servers.

AuthZed. "Timeline of MCP Breaches." AuthZed Blog, 2025.

16Willison: The Lethal Trifecta for AI Agents ↑ article

Simon Willison, who coined the term "prompt injection," summarized the MCP situation directly: the protocol creates systems that combine access to private data, exposure to untrusted content, and the ability to communicate externally. He calls this combination the "lethal trifecta," and argues that any system exhibiting all three properties is trivially exploitable. This framework provides the simplest useful heuristic for practitioners evaluating their own systems.

Willison, S. "The Lethal Trifecta for AI Agents." Substack, 2025.

Multi-Agent Propagation

7Lee & Tiwari: Prompt Infection in Multi-Agent Systems ↑ article

Lee and Tiwari showed that prompt injection in multi-agent systems behaves like a computer virus. A compromised agent propagates malicious instructions to other agents through inter-agent messages, with success rates above 80% against GPT-4o. Most prompt injection discourse still assumes a single model processing a single request. In agentic architectures, a single injected document can cascade through an orchestrator, infect sub-agents, trigger tool calls across multiple services, and exfiltrate data from sources the original agent never had access to.

Lee, D. and Tiwari, M. "Prompt Infection: LLM-to-LLM Prompt Injection in Multi-Agent Systems." arXiv:2410.07283, 2024.

18Schneider: From LLM to Agentic AI Prompt Injection ↑ article

Schneider framed the escalation clearly: what was a single manipulated output in a monolithic LLM becomes an orchestrated multi-tool kill chain across a multi-agent system. The injection does not just produce a wrong answer. It triggers a sequence of tool calls, each of which extends the attack's reach. The attack surface scales with the number of agents in the system.

Schneider, C. "From LLM to Agentic AI: Prompt Injection Got Worse." 2025.

RAG Poisoning

8PoisonedRAG: Knowledge Corruption Attacks ↑ article

The Penn State researchers demonstrated that injecting just five malicious documents into a retrieval database could manipulate the model's answers with a 97% success rate on the NQ dataset and 99% on HotpotQA. The paper's conclusion was blunt: "RAG is extremely vulnerable to knowledge corruption attacks." This is not a theoretical concern; the GeminiJack attack against Google Workspace was RAG poisoning in the wild.

Zou, W., et al. "PoisonedRAG: Knowledge Corruption Attacks on Retrieval-Augmented Generation." USENIX Security, 2025.

What the Defenders Have Built

9Microsoft: Spotlighting and Layered Defense ↑ article

Microsoft Research introduced "spotlighting" in 2024: a family of techniques that mark untrusted input so the model can distinguish it from instructions. The most effective variant, datamarking, reduces attack success rates to near zero in controlled tests. By 2025 Microsoft had built a four-layer production defense stack on top of this. EchoLeak then bypassed all four layers, demonstrating that controlled-test performance does not transfer to adaptive adversaries.

Hines, K., et al. "Defending Against Indirect Prompt Injection Attacks With Spotlighting." Microsoft Research, 2024.

10Google DeepMind: Adversarial Training for Gemini ↑ article

Google's findings were sobering. Against an undefended Gemini 2.0, at least one attack succeeded on over 70% of test examples. The most effective automated attack (TAP) achieved close to 100% success at under $10 cost. Their critical finding was that better instruction-following makes models easier to attack. This is the most important sentence in the 2025 defense literature: the very capability that makes models useful, their responsiveness to instructions, is the same capability that makes them vulnerable. You cannot train one away without degrading the other.

Google DeepMind. "Lessons from Defending Gemini Against Indirect Prompt Injections." arXiv:2505.14534, 2025.

11Anthropic: Constitutional Classifiers ↑ article

Anthropic's Constitutional Classifiers reduced jailbreak success rates from 86% to 4.4%. A human red team of 183 participants spent over 3,000 hours attempting to find a universal jailbreak against the prototype and found none. The overhead was modest: a 0.38% increase in false refusals and 23.7% additional compute. This represents the strongest published defensive result as of 2025, though Anthropic's own disclosure emphasizes it is progress, not a solution.

Anthropic. "Constitutional Classifiers." Anthropic Research, 2025.

12Anthropic: Prompt Injection Defenses for Browser Agents ↑ article

For browser-based agents, Anthropic combined adversarial reinforcement learning with classifier-based detection. Claude Opus 4.5 achieved a 1% attack success rate against their internal adaptive attacker. Their disclosure was notably honest: "No browser agent is immune to prompt injection, and we share these findings to demonstrate progress, not to claim the problem is solved."

Anthropic. "Prompt Injection Defenses for Browser Agents." Anthropic Research, 2025.

The OWASP Framework

13OWASP Top 10 for LLM Applications ↑ article

The 2025 edition placed prompt injection at position #1 for the second consecutive edition. OWASP's assessment was unusually direct for a standards body: "Prompt Injection vulnerabilities are possible due to the nature of generative AI. Given the stochastic influence at the heart of the way models work, it is unclear if there are fool-proof methods of prevention." The 2025 list also separated system prompt leakage into its own category (LLM07) and added vector/embedding weaknesses (LLM08), reflecting the growth of RAG-specific attacks.

OWASP. "LLM01:2025 Prompt Injection." OWASP Top 10 for LLM Applications, 2025.

Perplexity

19Brave: Prompt Injection in Perplexity's Comet Browser ↑ article

Brave Research documented prompt injection vulnerabilities in Perplexity's Comet browser, demonstrating that the attack surface extends to any product that integrates LLM-based browsing or summarization. The findings reinforce the broader pattern: wherever a model reads untrusted web content as part of its workflow, indirect injection becomes possible.

Brave Research. "Prompt Injection in Perplexity's Comet Browser." Brave Blog, 2025.