Writing - Craig Trim

The Anatomy of a Prompt

Language models don't follow instructions. They complete text. System prompts, few-shot examples, and chain-of-thought are three layers of context that make the desired output the most probable continuation.

14 min read 2026

Prompting

Prompts Are Code

Most teams treat prompts as magic strings embedded in application code, then wonder why their LLM features break silently after every edit. Prompts deserve the same discipline as source code: version control, review, testing, and deployment pipelines.

12 min read 2026

Prompting

Acknowledgment Is Not Adherence

The model reads the rule, restates it in its own words, agrees to follow it, and then violates it in the next turn. The acknowledgment is a text-generation event, and the action is a separate text-generation event, connected only by attention.

16 min read 2026

Prompting

When Prompts Fail

Hallucination, refusal, instruction drift, format non-compliance, and prompt injection. A taxonomy of the five most common failure modes and a systematic protocol for diagnosing each one.

14 min read 2026

Prompting

When to Prompt, When to Code, When to Train

The most underrated prompt engineering skill is recognizing when you do not need prompt engineering at all. A decision framework for choosing the right tool before you build the wrong system.

18 min read 2026

Prompting

How NOT to Write a Prompt

Seven research-backed anti-patterns that make models fail, hallucinate, and leak instructions. Vague requests, missing examples, kitchen-sink prompts, and trusting untrusted input.

14 min read 2026

Prompting

The Prompt Engineer's Pattern Book

Persona prompting, template patterns, meta-prompting, and self-consistency. The reusable structures that solve most prompting problems, and how they compose.

16 min read 2026

Prompting

What Breaks

LLM systems fail in ways that traditional software does not. Six documented failure types, each with a case study and a post-mortem template that surfaces the detection gap.

18 min read 2026

Prompting

System-Prompt Priority Inversion

The documentation hierarchy and the runtime hierarchy disagree. Specificity and recency outrank the stated precedence order, and the gap is where your rules quietly fail.

10 min read 2026

History

The Academic History of Prompt Engineering

From Taylor's 1953 Cloze procedure through Shannon, BERT, few-shot learning, chain-of-thought, and RLHF. Seventy years of the same problem, with an evolving substrate.

15 min read 2026

Human Factors

The Ironies of Automation, Forty Years Later

In 1983, a cognitive psychologist wrote five pages about why automating factories makes operators worse at their jobs. In 2026, every company deploying agentic AI is learning the same lesson for the first time.

18 min read 2026

Evaluation

Unexpected Model Behavior

An AMD engineer published the most detailed empirical study of AI model degradation ever conducted in production. The tool she used to catch the model cutting corners was a bash script.

20 min read 2026

Tool Security

The Line Claude Cannot Cross

CLAUDE.md is a letter to the model. Hooks are a law. A practical guide to deterministic enforcement for anyone responsible for safeguarding AI usage.

8 min read 2026

Function Calling

From Rules to LLMs: Three Architectures

Three system shapes for the same task, walked from a toy classifier through function calling to a production AWS Step Functions workflow. The architectural argument for code as the orchestrator, with the LLM constrained to leaf operations where its strengths lie.

20 min read 2026

Function Calling

From Prompts to Actions

Function calling is usually described as the moment models gained the ability to use tools. Mechanically, it is the opposite: the model emits a small structured object, and the surrounding code does every part of the work that involves the world. The four-move cycle and the three mistakes the mainstream framing hides.

14 min read 2026

Cost

Token Burn Is Not Productivity

Lines of code never measured productivity, and token consumption does not measure it either. The companies selling the tokens have a reason to suggest otherwise. A toy example shows what happens at the function level when a developer internalizes the narrative.

18 min read 2026

Tool Schemas

Schemas That Models Can Follow

A tool is only as reliable as the schema behind it: vague descriptions produce vague calls, constrained types produce correct ones. The discipline of schema design that shifts most reliability problems away from the model's weaknesses and onto its strengths.

14 min read 2026

Function Calling

Tool Loops: Multi-Step and Parallel Calls

One tool call rarely finishes the job. Real workflows chain calls, run them in parallel, and recover from intermediate failures. The four-move cycle, the five distinct ways a tool call can fail, and the runtime discipline that separates a working demo from a system you can trust.

16 min read 2026

Security

The Instruction You Didn't Write

Prompt injection is not a bug class that will be patched. It is a consequence of how language models process input. Four documented incidents (Microsoft Copilot exfiltration, SpAIware persistent memory injection, GitHub Copilot RCE, EchoLeak), the lethal-trifecta framework for production defense, and an honest assessment of what no major lab has solved.

18 min read 2026

Function Calling

Function Calling Across Providers

The four-move cycle is invariant across providers; the wire format is not. A side-by-side reference for the same tool defined five different ways across OpenAI, Anthropic, Gemini, Bedrock, and Ollama, with an honest accounting of where the differences leak through the wrappers that promise to hide them.

16 min read 2026

Tool Use

Tool Use Postmortems

Tool use rarely fails the way the headlines suggest. Seven representative incidents (idempotency cascades, out-of-order parallel calls, hallucinated tools, confused-deputy authorization, schema drift, runaway loops, silent coercion) analyzed through a postmortem template, with the unifying observation that all seven live at a tool boundary.

18 min read 2026

MCP

The Protocol Layer

For two years, every LLM application reinvented tool integration from scratch. MCP is the attempt to make that stop. The integration tax behind the protocol, what it actually is and is not, the architecture and capabilities, the security surface that comes with standardization, and what it leaves unsolved.

22 min read 2026

RAG

The Simplest Possible RAG

Most RAG tutorials open with a vector database. This one does not. Elasticsearch keyword search, top-3 passages, one LLM call, forty lines of Python. With real cost numbers, real latency numbers, and Anthropic's own Contextual Retrieval data showing that even frontier-lab RAG recipes use BM25 as a core component.

20 min read 2026

RAG

Vector RAG: Inside the Dense-Vector Retrieval Stack

Embedding models, vector indexes, and chunking as three layers of one stack. Part 1 walks the embedding model landscape (OpenAI, Cohere, BGE, E5, MTEB) and the six-step selection-and-fine-tuning framework. Part 2 is the layered descent through HNSW, IVF, and product quantization. Part 3 is chunking strategies and the lost-in-the-middle effect. Part 4 is the net-new section: how a decision in any one layer cascades into the other two. Sits on top of the lexical floor established in Classic Search.

60 min read 2026

RAG

Measuring Retrieval

You cannot improve what you cannot measure. Two distinct measurement disciplines depending on the retriever: a closed BM25 loop you run inside your own engineering (five steps, k1 and b sweep, stratified metrics), and the MTEB external coordinate system that the field uses to compare embedding models nobody owns end-to-end. Extended preface contrasting the two disciplines, the BM25 evaluation loop step by step, the MTEB user manual (datasets, extensions, the maintainers' reproducibility paper, the BRIGHT 59.0 to 18.3 nDCG drop), and three rules for reading the leaderboard without being misled.

35 min read 2026

RAG

The Amortization Assumption

RAG is the most-taught pattern in production LLM systems. It is also the wrong default for the workload most people actually have: a handful of PDFs, ten minutes of questions, never opened again. The case for Cache-Augmented Generation, prompt caching, and Self-Route routing, with a decision matrix that maps corpus size, query volume, and document persistence to the correct pattern.

22 min read 2026

RAG

Retrieval Provenance

In high-cost industries, the path to the answer is the answer. A four-field schema (source, confidence, timestamp, agent_id) that turns retrieved chunks into traceable evidence and turns conflicts between sources into a tractable resolution problem rather than an averaging exercise. With an oil-and-gas worked example where a geologist and a petrophysicist disagree on the same well log.

20 min read 2026

RAG

What Classic Search Does Before the LLM

Classic search is not pre-AI. It is the lexical retriever sitting underneath most production RAG systems in 2026, in the same Elasticsearch engine that also runs dense vectors and ELSER. Analyzer, inverted index, BM25, top-k, and the binary heuristic for when pure-vector retrieval is a mistake. Built around the classic-search-walkthrough demo embedded in the article.

22 min read 2026

RAG

Fixing the Query: LLM-Driven Transformation over BM25

For workloads where the user query is the problem rather than the retriever. Six LLM-driven query-side patterns that run over a pure-lexical BM25 retriever: multi-query retrieval with RRF, HyDE, step-back prompting, Query2doc, query decomposition for multi-hop questions, and rewrite-retrieve-read. Cost ledger and decision framework. Retrieval stays purely lexical, with no dense vectors; the LLM touches only the query side.

30 min read 2026

RAG

Re-ranking: The Second Chance

The two-stage retrieval pattern that most production RAG systems converge on: a fast bi-encoder first pass returns a broad candidate set, and a slower cross-encoder reranks the top candidates with full attention over the query and each document. Covers the architectural tradeoff, reciprocal rank fusion across retrievers, the 2026 reranker landscape, and worked examples showing where reranking earns its compute budget.

22 min read 2026

RAG

GraphRAG: When the Index Is a Graph

The structural alternative to vector RAG. The index is a typed knowledge graph extracted by a language model at ingest; retrieval is graph traversal or community-summary aggregation rather than similarity search. Walks the global-versus-local query distinction, the unusual economics where expensive indexing buys cheaper per-query inference, and the decision framework for when GraphRAG actually beats vector RAG.

35 min read 2026

RAG

Ontology-Driven Parsing for Retrieval

The piece that makes GraphRAG do real work. A curated ontology (taxonomy plus relation schema) commits at ingest time to what entities and relations the corpus contains. Walks how typed extraction differs from raw NER, the W3C standards landscape (RDF, OWL, SKOS, schema.org), an oil-and-gas worked example, and the operational cost of curation. Companion to GraphRAG.

20 min read 2026

RAG

Structured Data RAG: Routing and Text-to-SQL

When the answer lives in a relational database rather than documents or a graph. Covers Text-to-SQL patterns, schema access, the sandboxing discipline (arbitrary SQL never touches production), and the query-router layer that ties vector, graph, and relational backends together with a single provenance schema.

25 min read 2026

Evaluation

RAGAS Evaluation

The RAGAS framework for end-to-end RAG evaluation: faithfulness, answer relevance, context precision, and context recall. Covers implementation patterns, threshold selection, golden evaluation datasets, and stratified reporting so per-category failures do not hide inside an aggregate score.

28 min read 2026

Evaluation

LLM-as-Judge

Using models to evaluate models: rubric design, calibration against human judgment, and the known failure modes of automated evaluation. Covers session isolation between generator and judge so a same-session self-review does not produce confirmation-biased verdicts.

30 min read 2026

Evaluation

Human Evaluation Frameworks

Annotation guidelines, inter-rater reliability, and the moments when human evaluation is irreplaceable. Walks field-level confidence, the schema a reviewer needs to make a structured decision, and the escalation triggers that prove well-calibrated in practice versus the ones that do not.

26 min read 2026

History

Language Has a Distributional Structure

In 1954, three years before Firth's famous one-line aphorism, a Penn linguistics professor named Zellig Harris published the seventeen pages of math behind it. Then the GPUs arrived, and the framework that scaled with corpus size was the teacher's.

15 min read 2026

History

The Paper That Funded a Fortune

In 1992, five researchers at IBM Yorktown Heights published a twelve-page paper on grouping English vocabulary into classes. Two of the authors would walk out of that group and help build the most profitable hedge fund in history.

20 min read 2026

History

A Brief History of Text Generation

From Shannon's hand-picked letters to modern LLMs. The real outputs from ELIZA, RACTER, char-rnn, and GPT, and why each generation felt like a breakthrough.

13 min read 2026

History

The 30-Year Journey of an Algorithm That Accidentally Learned to Read

How a 1994 data compression algorithm became the foundation of modern AI. The untold story of Byte Pair Encoding's journey from The C Users Journal to GPT-4.

15 min read 2026

History

How Testing Proves Code

In June 1949, Alan Turing delivered a three-page paper at the inaugural EDSAC conference. He proposed flowchart assertions and variant functions for termination, the first written method for proving a program correct by checking its pieces. The paper went largely unnoticed for thirty-five years, and in the meantime Floyd, Hoare, and Dijkstra independently rediscovered the same machinery.

14 min read 2026

History

The Blank That Predicted GPT

In 1953, a psychologist deleted every fifth word from a paragraph and asked people to guess what was missing. Seventy years later, every large language model on earth runs a mechanized version of the same experiment.

11 min read 2026

NLP

The Elegant Hack Powering Modern AI

Understanding how LLMs transform text into tokens, and why this seemingly simple process has profound implications for cost, context limits, and model behavior.

12 min read 2026

NLP

RTK: Token Compression in Practice

You learned how tokenization works and why context windows are a hard constraint. Here is a tool with 22.4K GitHub stars built entirely around the premise that most of those tokens are wasted.

9 min read 2026

NLP

What Is an Ontology?

A description of things that exist and how they relate to each other. From Aristotle's Categories to W3C OWL, with a Middle-earth worked example and the building blocks every ontology shares: classes, properties, relationships, constraints.

15 min read 2026

Algorithms

Breaking Text

BPE, WordPiece, SentencePiece, Unigram. Four algorithms, four trade-offs, none of them know what a word is.

6 min read 2026

Algorithms

Pointwise Mutual Information and the Independence Baseline

If you dumped every word of Pride and Prejudice into a hat and drew them out at random, the vocabulary would match the novel and the prose would be gibberish. The gap between the hat and Austen is exactly what PMI measures.

14 min read 2026

Equity

Why Non-English Speakers Pay More for AI

Tamil speakers pay for 7x more tokens than English speakers to express the same meaning. The hidden cost of tokenization and why morphology sets a compression ceiling.

10 min read 2026

Security

What Happens When Agents Get a Social Network

Over a million AI agents registered on a social network built exclusively for them. They formed religions and drafted constitutions. A critical analysis of what is actually happening, and what it means for security.

20 min read 2026

Security

When Tokens Glitch and Users Attack

Reddit usernames that break GPT. Invisible characters that bypass filters. The edge cases where tokenization fails spectacularly.

8 min read 2026

Engineering

How Developers Actually Collaborate

You know how to push code. But pushing code is not collaboration. Issues, branches, commits that reference those issues, pull requests, code review, and merge: the workflow most engineering teams use every day.

10 min read 2026

Engineering

GitHub as Infrastructure

Most developers treat GitHub like a filing cabinet. When you start working with LLMs seriously, a question emerges that most people skip past: where should your project's memory actually live?

12 min read 2026

Algorithms

When Writing Changes Voice and Statistics Listen

How a 2001 method for comparing corpora became a detector for AI-generated text, pasted content, and ghostwriters. Chi-squared drift detection across sliding windows.

12 min read 2026

Stylometry

Exhaled, Trembling, Dark

251,022 tokens across five books, measured against the British National Corpus. The frequency data draws a portrait of a man who wrote with his lungs and his skin.

12 min read 2026

Stylometry

Attaining Tonality

Everyone wants their chatbot to sound like them. The problem is that "sounding like you" means different things depending on how you actually write.

10 min read 2026

Stylometry

Every Word Has a Price Tag

The word "the" should appear about 6,185 times in every 100,000 words of English. When it doesn't, something interesting is happening.

9 min read 2026

NLP

The Hidden Geography of Language

How words turn into coordinates, and why "king minus man plus woman" lands nearest to "queen". The story of how meaning became geometry.

9 min read 2026

NLP

Words Learning the Company They Keep

Static embeddings gave "oracle" one vector for priestess, database, and Matrix character. Then attention learned to compute meaning from context.

11 min read 2026

Algorithms

The Invisible Boundaries of AI Conversation

Explore how LLMs manage context windows, from quadratic attention scaling to truncation strategies, and why the most expensive tokens are often the ones you never meant to send.

11 min read 2026

Algorithms

Context Rot

A model with a one-million-token window does not actually use one million tokens. The Chroma study, Lost in the Middle, RULER, NoLiMa, and BABILong all measure the same gap between marketed and effective context. Five failure modes, three mechanistic causes, and what context engineering can and cannot fix.

16 min read 2026

Architecture

Inside the Decoder-Only Transformer

A review of the Transformer for engineers who call LLM APIs every day but have never looked inside the box. Connects tokenization, embeddings, attention, FFN, and sampling into a single system.

18 min read 2026

Architecture

From Prompt to Token

You type a question. A second later, words start appearing. Between your keypress and that first token lies a pipeline that most practitioners never examine. This is what happens inside the model during that second.

16 min read 2026