Sources

Grounding, citations, and further reading for How NOT to Write a Prompt.

All of this is optional. These are the sources behind the article. Nothing on this page is required reading, and you do not need to purchase any of these books.

The article itself is self-contained. This page exists so that the work is properly cited and so that anyone who wants to go deeper knows where to look.

References

1Brown, T

Brown, T., Mann, B., Ryder, N., et al. (2020). "Language Models are Few-Shot Learners." arXiv:2005.14165.

2Wei, J

Wei, J., Wang, X., Schuurmans, D., et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." arXiv:2201.11903.

3Liu, P

Liu, P., Yuan, W., Fu, J., et al. (2021). "Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in NLP." arXiv:2107.13586.

4Liang, P

Liang, P., Bommasani, R., et al. (2022). "Holistic Evaluation of Language Models." arXiv:2211.09110. Stanford CRFM.

8Anthropic

Anthropic. (2024). "Prompt Engineering Guide." Anthropic Documentation.

9OpenAI

OpenAI. (2024). "Prompt Engineering." OpenAI Platform Documentation.

10OpenAI

OpenAI. (2024). "Techniques to Improve Reliability." OpenAI Cookbook.

Introduction

11Grounding note

The broken feedback loop is the core insight. Bad code crashes. Bad SQL returns empty results. Bad prompts return something plausible. This is why prompt engineering is harder to learn than most people think: the error signal is invisible unless you have ground truth to compare against.

Anti-Pattern 1: The Vague Request

12Grounding note

OpenAI's prompt engineering guide lists "write clear instructions" as strategy #1. Anthropic's guide leads with "be specific about what you want." Both of them are saying the same thing in different words: the model cannot read your mind. If you leave ambiguity, the model fills it. You might not like what it fills it with.

Anti-Pattern 2: No Examples

13Grounding note

The GPT-3 paper is one of the most-cited in AI history and the core finding is astonishingly simple: show the model what you want. That's it. The entire few-shot paradigm reduces to "give examples." The fact that most production prompts still skip this step tells you something about how little the research literature penetrates into practice.

Anti-Pattern 3: Asking "What" Without Showing "How"

14Grounding note

Wei et al. (2022) showed that chain-of-thought prompting isn't just marginally better. On arithmetic, commonsense, and symbolic reasoning, it's a qualitative leap. The model goes from "unreliable guessing" to "systematic reasoning." The technique is trivial to implement and the payoff is enormous. If you're asking a model to reason and not using CoT, you're leaving performance on the table.

15Grounding note

I have watched senior engineers send letter-counting tasks to GPT-4 in production code. Something about working with an LLM lowers the collective IQ by 20 points. The model becomes a hammer, and suddenly every string operation looks like a nail.

16Grounding note

There's also the PII angle that people forget in the excitement. Every prompt you send to an API is data leaving your network. If the input contains customer names, medical records, financial details, or anything covered by GDPR/HIPAA/SOC2, you've just created a compliance event. For deterministic tasks, the LLM adds risk with zero upside.

Anti-Pattern 4: The Kitchen Sink Prompt

17Grounding note

The kitchen sink prompt is the engineering equivalent of a requirements document that tries to specify everything in one paragraph. The OpenAI Cookbook recommends breaking complex tasks into subtasks. This is the prompt engineering version of the single responsibility principle: each prompt should do one thing well, not fourteen things adequately.

Anti-Pattern 5: Write One Prompt, Ship It

18Grounding note

HELM tested 30+ language models on 42 scenarios with 7 metrics each. The variance across prompt phrasings was often larger than the variance across models. Let that sink in: how you phrase the prompt can matter more than which model you use. And yet most teams spend weeks evaluating models and minutes evaluating prompts.

Anti-Pattern 6: Ignoring the System Prompt

19Grounding note

Both OpenAI and Anthropic document the system prompt as the mechanism for setting behavior, persona, tone, and guardrails. It's not a suggestion. It's the primary control surface for production deployments. Teams that put everything in the user message are using a screwdriver to hammer nails.

Anti-Pattern 7: Trusting the Input

20Grounding note

The indirect injection paper changed the threat model entirely. Before it, prompt injection was a user-facing problem: don't let users type "ignore previous instructions." After it, the attack surface includes every piece of external data the model touches. RAG pipelines are particularly exposed because the whole point is feeding the model content it hasn't seen before.

21Grounding note

HackAPrompt is the red-team-at-scale paper. The fact that they ran it as a competition and still found bypasses for every defensive strategy tells you something about the maturity of prompt-level defenses. Defense in depth isn't optional. A single layer of prompt defense is a single point of failure.

For Practitioners

22Grounding note

This is test-driven development for prompts. Write the test first. Then write the prompt to pass the test. Most teams do it backwards: write the prompt, eyeball the output, call it done. TDD for prompts isn't glamorous, but it's the single highest-leverage practice for production prompt engineering.

23Grounding note

I keep coming back to the HELM finding: how you phrase the prompt can matter more than which model you use. Teams spend weeks evaluating GPT-4 vs Claude vs Gemini and minutes evaluating their prompt formulations. The leverage is inverted. Fix the prompt first. Then evaluate the model.