← Back to article

Sources

Grounding, citations, and further reading for When Tools Become Attack Surface.

All of this is optional. These are the sources used to write the article, shown here as grounding for the research behind it. Nothing on this page is required reading.

The article itself is self-contained. This page exists so that the work is properly cited and so that anyone who wants to go deeper on a specific topic knows where to look.

About the Sources

SLP3: Jurafsky & Martin

Jurafsky, Daniel & James H. Martin. Speech and Language Processing, 3rd ed. (draft).

The standard academic textbook for NLP. Freely available in draft form at web.stanford.edu/~jurafsky/slp3/. Chapter 7 provides the formal treatment of autoregressive generation and alignment training that underpins the security analysis in this article.

Widdows & Cohen: Large Language Models

Widdows, Dominic & Trevor Cohen. SemanticVectors Publishing, 2025.

Accessible and mathematically grounded survey of LLM architecture and behavior. Chapter 6 is particularly relevant here, covering LLM security vulnerabilities, sycophancy, and the limitations of guardrails as safety mechanisms.

Alammar & Grootendorst: Hands-On Large Language Models

Alammar, Jay & Maarten Grootendorst. O'Reilly Media, 2024.

Practitioner-oriented survey covering tool use, agent architectures, and ethical considerations in LLM deployment. Chapter 7 discusses how tools extend LLM capabilities and why authorization must be external to the model.

Farris et al.: LLM Textbook

Course textbook for agentic systems.

Contains the car dealership $1 sale example and arguments for adversarial thinking as the first step in LLM deployment. Referenced for the threat model framing.

The Threat Model

1Tool calls are still token prediction

Jurafsky and Martin (SLP3, Ch. 7, Section 7.1) describe decoder-only LLMs as systems that generate text autoregressively, one token at a time. The critical security insight is that function calling does not add a separate execution module to this architecture. The model is still performing next-token prediction; it is simply predicting tokens that happen to form JSON tool calls instead of prose. This means every vulnerability of text generation (hallucination, sycophancy, prompt sensitivity) applies equally to tool-call generation. The model "decides" to delete a file the same way it "decides" to write a sentence: by predicting the most probable next token given the context.

SLP3 Ch. 7, §7.1.

2Tool-equipped LLMs as a qualitatively different risk category

Widdows and Cohen draw a sharp distinction in Ch. 6 between "using LLMs to suggest code" and "trusting them as components in production systems," calling the latter "quite another" matter entirely. They note that LLMs "have been shown to exhibit and enable all sorts of security vulnerabilities," and point to NVIDIA's garak project for tracking this evolving space.

Widdows & Cohen, Ch. 6.

3The $1 car dealership example

The textbook's most vivid security example: a car company integrated an LLM into their website to help sell cars. Within a day, users convinced it to sell a car for $1. Farris et al. use this to argue that adversarial thinking should be the first step in any LLM deployment, not an afterthought.

Farris et al., Ch. 8.

4Why safety training resists direct injection

Jurafsky and Martin (SLP3, Ch. 7, Sections 7.5.2-7.5.3) describe the two-stage process that produces this resistance. First, instruction tuning (SFT) trains the model on curated examples of safe behavior. Then, preference alignment via RLHF or DPO trains the model to prefer safe outputs over unsafe ones by learning from human feedback. Direct injection asks the model to override both layers simultaneously, which is why it tends to fail. But this defense is probabilistic, not absolute. The same sections note that alignment can be reversed through fine-tuning, and that safety training may create a "veneer" that sophisticated attacks can penetrate.

SLP3 Ch. 7, §7.5.2-7.5.3.

5Sycophancy and indirect injection vulnerability

Widdows and Cohen provide useful context for why models are susceptible to indirect injection. In Ch. 6, they discuss LLM sycophancy: the tendency of assistant-trained models "to agree with viewpoints presented to them." This same tendency to comply rather than challenge makes models especially vulnerable to injected instructions embedded in retrieved content.

Widdows & Cohen, Ch. 6.

Indirect Injection

6Why indirect injection works at the architectural level

Jurafsky and Martin (SLP3, Ch. 7, Section 7.2) formalize why this attack works at the architectural level. Conditional generation computes P(y|x), where x is the entire context: system prompt, user message, and any retrieved content. The model makes no structural distinction between "instructions from the developer" and "text from a retrieved document." Both are tokens in x. The autoregressive mechanism (Section 7.4) conditions on all preceding tokens equally, so injected instructions in retrieved content have the same conditioning weight as legitimate system prompt instructions. This is not a bug that can be patched; it is inherent to how conditional generation works.

SLP3 Ch. 7, §7.2, §7.4.

7RAG and poisoned-document attacks

Widdows and Cohen discuss RAG in Ch. 3 and Ch. 5, noting that retrieved documents can lead models astray. They caution that RAG "is easily misinterpreted," since it helps produce more factual answers but "doesn't constrain [the model] to produce only sentences that are equally authoritative." This is the exact vulnerability exploited in poisoned-document attacks.

Widdows & Cohen, Ch. 3, Ch. 5.

8Why filtering cannot solve prompt injection

Jurafsky and Martin (SLP3, Ch. 7, Section 7.5.1) explain why filtering cannot solve this. Pretraining on massive text corpora teaches the model to process language in all its forms: formal, informal, encoded, multilingual, even intentionally obfuscated. The same generalization ability that lets the model understand "plz check my grade" alongside "Please retrieve my grade" also lets it understand Base64-encoded instructions and Pig Latin attacks. You cannot selectively disable the model's ability to understand certain encodings without degrading its general language capabilities.

SLP3 Ch. 7, §7.5.1.

9The spam filtering arms race as historical parallel

Widdows and Cohen describe the spam filtering arms race in Ch. 1 as a direct historical parallel. Email admins wrote rules to block words like "Free" and "Offer," but spammers adapted, and legitimate messages like "I'm free on Tuesday" got caught. The authors note that "bureaucratic rule-based processes sometimes demand expert human attention, and still deliver clearly-inappropriate blanket outcomes." This is exactly the pattern-matching arms race described for prompt injection filtering, and it failed for spam too, giving way to probabilistic approaches.

Widdows & Cohen, Ch. 1.

Least Privilege and Sandboxing

10LLMs for autonomous social engineering

Widdows and Cohen discuss in Ch. 6 how LLMs can be deliberately exploited for social engineering by "those with harmful intentions." They cite a Microsoft Research paper showing GPT-4 (before alignment) generating a multi-step plan to spread anti-vaccine misinformation, including identifying target communities and crafting emotional appeals. If such a model also had access to send_email or post_to_forum tools, the attack would execute autonomously rather than requiring human distribution.

Widdows & Cohen, Ch. 6.

11Empirical evidence for human-in-the-loop

Widdows and Cohen cite a CMU simulation called TheAgentStudy (Ch. 6) showing that "AI agents are still deeply unreliable when it comes to carrying out tasks responsibly." This provides empirical support for the human-in-the-loop pattern: if agents cannot be trusted to act responsibly even in benign scenarios, confirmation gates for destructive actions are essential safeguards.

Widdows & Cohen, Ch. 6.

12Non-deterministic code generation and sandboxing

Jurafsky and Martin (SLP3, Ch. 7, Section 7.4) describe how sampling with temperature introduces randomness into generation: higher temperature means more diverse, less predictable outputs. When the model generates code, this same sampling process means the exact code produced is stochastic. Two identical prompts with temperature > 0 can generate different code. Sandboxing is essential precisely because you cannot predict at design time what code the model will generate at runtime.

SLP3 Ch. 7, §7.4.

13Hallucinated variable names in generated code

Widdows and Cohen warn in Ch. 6 that LLMs have "a habit of inventing convenient names for imaginary variables," for example suggesting cet_server_DEBUG_LOGS=True when no such variable exists. Their advice: "Don't assume that any code works that you haven't explicitly tested and seen work!" When the code execution tool runs such hallucinated code, the failure mode can range from benign errors to unintended system state changes.

Widdows & Cohen, Ch. 6.

Authorization and Monitoring

14Perplexity as a model for anomaly detection

Jurafsky and Martin (SLP3, Ch. 3, Section 3.3, Eq. 3.14-3.17) define perplexity as a measure of how "surprised" a language model is by a sequence. The anomaly detection pattern described in the article is conceptually parallel: you are building a model of expected tool-call sequences and flagging sequences that are "surprising." A support agent that suddenly calls send_email after processing retrieved content is a high-perplexity event in the tool-call distribution.

SLP3 Ch. 3, §3.3.

15Agent autonomy and external authorization

Alammar and Grootendorst describe how tools extend LLM capabilities to interact with the real world, and how agents determine their own actions. This autonomy is precisely why authorization layers must be external to the model: an agent that decides its own actions cannot also be trusted to decide its own permissions.

Alammar & Grootendorst, Ch. 7.

16Alignment as probabilistic veneer, not enforceable rule

Jurafsky and Martin (SLP3, Ch. 7, Sections 7.5.2-7.5.3) describe how alignment training teaches the model to prefer certain behaviors, but alignment operates on the model's outputs, not on an internal concept of permission. Zhan et al. (2024) showed that alignment can be reversed through fine-tuning, and SLP3 notes that alignment may be more "veneer" than structure (Section 7.7). Authorization requires deterministic enforcement that no amount of alignment training can provide.

SLP3 Ch. 7, §7.5.2-7.5.3, §7.7.

17Why guardrails cannot substitute for external authorization

Widdows and Cohen provide a sobering example in Ch. 6: guardrails placed to redirect users showing suicide risk were bypassed, because "the space of possible conversations between a person and an LLM is too broad to cordon off exhaustively for safety." If instruction-level guardrails cannot reliably prevent a model from generating harmful text, they certainly cannot be trusted to prevent harmful tool calls.

Widdows & Cohen, Ch. 6.

Defense in Depth

18Internal versus external layered defense

Jurafsky and Martin (SLP3, Ch. 7, Section 7.5) describe the model's own three-stage training pipeline as a form of layered defense: pretraining establishes general language capabilities, instruction tuning teaches safe behavior, and preference alignment trains the model to prefer safe outputs. The defense-in-depth strategy is the external counterpart to this internal pipeline. The model's internal layers are probabilistic and bypassable (Section 7.7 discusses reversal). The external layers (tool minimization, argument validation, authorization checks, sandboxing) are deterministic and enforceable.

SLP3 Ch. 7, §7.5, §7.7.

19Ethical concerns operationalized as engineering layers

Alammar and Grootendorst discuss ethical considerations in LLM development, including the risks of harmful content generation and the need for transparency about model capabilities. The defense-in-depth approach operationalizes those ethical concerns: each layer is a concrete implementation of the principle that LLM capabilities must be bounded by external safeguards.

Alammar & Grootendorst, Ch. 1.