The Anatomy of a Prompt
Language models don't follow instructions. They complete text. System prompts, few-shot examples, and chain-of-thought are three layers of context that make the desired output the most probable continuation.
Most teams treat prompts as magic strings embedded in application code, then wonder why their LLM features break silently after every edit. Prompts deserve the same discipline as source code: version control, review, testing, and deployment pipelines.
The model reads the rule, restates it in its own words, agrees to follow it, and then violates it in the next turn. The acknowledgment is a text-generation event, and the action is a separate text-generation event, connected only by attention.
Hallucination, refusal, instruction drift, format non-compliance, and prompt injection. A taxonomy of the five most common failure modes and a systematic protocol for diagnosing each one.
The most underrated prompt engineering skill is recognizing when you do not need prompt engineering at all. A decision framework for choosing the right tool before you build the wrong system.
Seven research-backed anti-patterns that make models fail, hallucinate, and leak instructions. Vague requests, missing examples, kitchen-sink prompts, and trusting untrusted input.
Persona prompting, template patterns, meta-prompting, and self-consistency. The reusable structures that solve most prompting problems, and how they compose.
LLM systems fail in ways that traditional software does not. Six documented failure types, each with a case study and a post-mortem template that surfaces the detection gap.
The documentation hierarchy and the runtime hierarchy disagree. Specificity and recency outrank the stated precedence order, and the gap is where your rules quietly fail.
From Taylor's 1953 Cloze procedure through Shannon, BERT, few-shot learning, chain-of-thought, and RLHF. Seventy years of the same problem, with an evolving substrate.
In 1983, a cognitive psychologist wrote five pages about why automating factories makes operators worse at their jobs. In 2026, every company deploying agentic AI is learning the same lesson for the first time.
An AMD engineer published the most detailed empirical study of AI model degradation ever conducted in production. The tool she used to catch the model cutting corners was a bash script.
CLAUDE.md is a letter to the model. Hooks are a law. A practical guide to deterministic enforcement for anyone responsible for safeguarding AI usage.
Three system shapes for the same task, walked from a toy classifier through function calling to a production AWS Step Functions workflow. The architectural argument for code-as-orchestrator with the LLM constrained to leaf operations where its strength lives.
Function calling is usually described as the moment models gained the ability to use tools. Mechanically, it is the opposite: the model emits a small structured object, and the surrounding code does every part of the work that involves the world. The four-move cycle and the three mistakes the mainstream framing hides.
Lines of code never measured productivity, and token consumption does not measure it either. The companies selling the tokens have a reason to suggest otherwise. A toy example shows what happens at the function level when a developer internalizes the narrative.
A tool is only as reliable as the schema behind it: vague descriptions produce vague calls, constrained types produce correct ones. The discipline of schema design that turns most reliability problems into the model's strength rather than its weakness.
One tool call rarely finishes the job. Real workflows chain calls, run them in parallel, and recover from intermediate failures. The four-move cycle, the five distinct ways a tool call can fail, and the runtime discipline that separates a working demo from a system you can trust.
Prompt injection is not a bug class that will be patched. It is a consequence of how language models process input. Four documented incidents (Microsoft Copilot exfiltration, SpAIware persistent memory injection, GitHub Copilot RCE, EchoLeak), the lethal-trifecta framework for production defense, and an honest assessment of what no major lab has solved.
The four-move cycle is invariant across providers; the wire format is not. A side-by-side reference for the same tool defined five different ways across OpenAI, Anthropic, Gemini, Bedrock, and Ollama, with an honest accounting of where the differences leak through the wrappers that promise to hide them.
Tool use rarely fails the way the headlines suggest. Seven representative incidents (idempotency cascades, out-of-order parallel calls, hallucinated tools, confused-deputy authorization, schema drift, runaway loops, silent coercion) analyzed through a postmortem template, with the unifying observation that all seven live at a tool boundary.
For two years, every LLM application reinvented tool integration from scratch. MCP is the attempt to make that stop. The integration tax behind the protocol, what it actually is and is not, the architecture and capabilities, the security surface that comes with standardization, and what it leaves unsolved.
Most RAG tutorials open with a vector database. This one does not. Elasticsearch keyword search, top-3 passages, one LLM call, forty lines of Python. With real cost numbers, real latency numbers, and Anthropic's own Contextual Retrieval data showing that even frontier-lab RAG recipes use BM25 as a core component.
Embedding models, vector indexes, and chunking as three layers of one stack. Part 1 walks the embedding model landscape (OpenAI, Cohere, BGE, E5, MTEB) and the six-step selection-and-fine-tuning framework. Part 2 is the layered descent through HNSW, IVF, and product quantization. Part 3 is chunking strategies and the lost-in-the-middle effect. Part 4 is the net-new section: how a decision in any one layer cascades into the other two. Sits on top of the lexical floor established in Classic Search.
You cannot improve what you cannot measure. Two distinct measurement disciplines depending on the retriever: a closed BM25 loop you run inside your own engineering (five steps, k1 and b sweep, stratified metrics), and the MTEB external coordinate system that the field uses to compare embedding models nobody owns end-to-end. Extended preface contrasting the two disciplines, the BM25 evaluation loop step by step, the MTEB user manual (datasets, extensions, the maintainers' reproducibility paper, the BRIGHT 59.0 to 18.3 nDCG drop), and three rules for reading the leaderboard without being misled.
RAG is the most-taught pattern in production LLM systems. It is also the wrong default for the workload most people actually have: a handful of PDFs, ten minutes of questions, never opened again. The case for Cache-Augmented Generation, prompt caching, and Self-Route routing, with a decision matrix that maps corpus size, query volume, and document persistence to the correct pattern.
In high-cost industries, the path to the answer is the answer. A four-field schema (source, confidence, timestamp, agent_id) that turns retrieved chunks into traceable evidence and turns conflicts between sources into a tractable resolution problem rather than an averaging exercise. With an oil and gas worked example where a geologist and a petrophysicist disagree on the same well log.
Classic search is not pre-AI. It is the lexical retriever sitting underneath most production RAG systems in 2026, in the same Elasticsearch engine that also runs dense vectors and ELSER. Analyzer, inverted index, BM25, top-k, and the binary heuristic for when pure-vector retrieval is a mistake. Built around the classic-search-walkthrough demo embedded in the article.
For workloads where the user query is the problem rather than the retriever. Six LLM-driven query-side patterns that run over a pure-lexical BM25 retriever: multi-query retrieval with RRF, HyDE, step-back prompting, Query2doc, query decomposition for multi-hop, and rewrite-retrieve-read. Cost ledger and decision framework. No dense vectors at the retrieval layer; the LLM only touches the query side.
The two-stage retrieval pattern that most production RAG systems converge on: a fast bi-encoder first pass returns a broad candidate set, a slower cross-encoder reranks the top with full attention over query and document. Covers the architectural tradeoff, reciprocal rank fusion across retrievers, the 2026 reranker landscape, and worked examples showing where reranking earns its compute budget.
The structural alternative to vector RAG. The index is a typed knowledge graph extracted by a language model at ingest; retrieval is graph traversal or community-summary aggregation rather than similarity search. Walks the global-versus-local query distinction, the unusual economics where expensive indexing buys cheaper per-query inference, and the decision framework for when GraphRAG actually beats vector RAG.
The piece that makes GraphRAG do real work. A curated ontology (taxonomy plus relation schema) commits at ingest time to what entities and relations the corpus contains. Walks how typed extraction differs from raw NER, the W3C standards landscape (RDF, OWL, SKOS, schema.org), an oil-and-gas worked example, and the operational cost of curation. Companion to GraphRAG.
When the answer lives in a relational database rather than documents or a graph. Covers Text-to-SQL patterns, schema access, the sandboxing discipline (arbitrary SQL never touches production), and the query-router layer that ties vector, graph, and relational backends together with a single provenance schema.
The RAGAS framework for end-to-end RAG evaluation: faithfulness, answer relevance, context precision, and context recall. Covers implementation patterns, threshold selection, golden evaluation datasets, and stratified reporting so per-category failures do not hide inside an aggregate score.
Using models to evaluate models: rubric design, calibration against human judgment, and the known failure modes of automated evaluation. Covers session isolation between generator and judge so a same-session self-review does not produce confirmation-biased verdicts.
Annotation guidelines, inter-rater reliability, and the moments when human evaluation is irreplaceable. Walks field-level confidence, the schema a reviewer needs to make a structured decision, and the escalation triggers that prove well-calibrated in practice versus the ones that do not.
In 1954, three years before Firth's famous one-line aphorism, a Penn linguistics professor named Zellig Harris published the seventeen pages of math behind it. Then the GPUs arrived, and the framework that scaled with corpus size was the teacher's.
In 1992, five researchers at IBM Yorktown Heights published a twelve-page paper on grouping English vocabulary into classes. Two of the authors would walk out of that group and help build the most profitable hedge fund in history.
From Shannon's hand-picked letters to modern LLMs. The real outputs from ELIZA, RACTER, char-rnn, and GPT, and why each generation felt like a breakthrough.
How a 1994 data compression algorithm became the foundation of modern AI. The untold story of Byte Pair Encoding's journey from C Users Journal to GPT-4.
In June 1949, Alan Turing delivered a three-page paper at the inaugural EDSAC conference. He proposed flowchart assertions and variant functions for termination, the first written method for proving a program correct by checking its pieces. The paper was lost for thirty-five years before Floyd, Hoare, and Dijkstra independently rediscovered the same machinery.
In 1953, a psychologist deleted every fifth word from a paragraph and asked people to guess what was missing. Seventy years later, every large language model on earth runs a mechanized version of the same experiment.
Understanding how LLMs transform text into tokens, and why this seemingly simple process has profound implications for cost, context limits, and model behavior.
You learned how tokenization works and why context windows are a hard constraint. Here is a tool with 22.4K GitHub stars built entirely around the premise that most of those tokens are wasted.
A description of things that exist and how they relate to each other. From Aristotle's Categories to W3C OWL, with a Middle-earth worked example and the building blocks every ontology shares: classes, properties, relationships, constraints.
BPE, WordPiece, SentencePiece, Unigram. Four algorithms, four trade-offs, and none of them knows what a word is.
If you dumped every word of Pride and Prejudice into a hat and drew them out at random, the vocabulary would match the novel and the prose would be gibberish. The gap between the hat and Austen is exactly what PMI measures.
Tamil speakers pay 7x more tokens than English speakers for the same meaning. The hidden cost of tokenization and why morphology sets a compression ceiling.
Over a million AI agents registered on a social network built exclusively for them. They formed religions and drafted constitutions. A critical analysis of what is actually happening, and what the security implications mean.
Reddit usernames that break GPT. Invisible characters that bypass filters. The edge cases where tokenization fails spectacularly.
You know how to push code. But pushing code is not collaboration. Issues, branches, commits that reference those issues, pull requests, code review, and merge: the workflow every engineering team uses daily.
Most developers treat GitHub like a filing cabinet. When you start working with LLMs seriously, a question emerges that most people skip past: where should your project's memory actually live?
How a 2001 method for comparing corpora became a detector for AI-generated text, pasted content, and ghostwriters. Chi-squared drift detection across sliding windows.
251,022 tokens across five books, measured against the British National Corpus. The frequency data draws a portrait of a man who wrote with his lungs and his skin.
Everyone wants their chatbot to sound like them. The problem is that "sounding like you" means different things depending on how you actually write.
The word "the" should appear about 6,185 times in every 100,000 words of English. When it doesn't, something interesting is happening.
How words turn into coordinates, and why "king minus man plus woman" equals "queen". The story of how meaning became geometry.
Static embeddings gave "oracle" one vector for priestess, database, and Matrix character. Then attention learned to compute meaning from context.
Explore how LLMs manage context windows, from quadratic attention scaling to truncation strategies, and why the most expensive tokens are often the ones you never meant to send.
A model with a one-million-token window does not actually use one million tokens. The Chroma study, Lost in the Middle, RULER, NoLiMa, and BABILong all measure the same gap between marketed and effective context. Five failure modes, three mechanistic causes, and what context engineering can and cannot fix.
A review of the Transformer for engineers who call LLM APIs every day but have never looked inside the box. Connects tokenization, embeddings, attention, FFN, and sampling into a single system.
You type a question. A second later, words start appearing. Between your keypress and that first token lies a pipeline that most practitioners never examine. This is what happens inside the model during that second.