Sources
Grounding, citations, and further reading for When to Prompt, When to Code, When to Train.
All of this is optional. These are the sources used to ground the decision framework in the article. Nothing on this page is required reading, and you do not need to purchase or access any of these resources.
The article itself is self-contained. This page exists so that practitioners who want to go deeper on a specific strategy, benchmark, or framework know where to look.
About the Sources
Yan: Patterns for Building LLM-based Systems & Products
Widely cited practitioner guide identifying seven core patterns for production LLM systems. Yan's emphasis on evals as the differentiator between "hot garbage" and serious products directly informed the article's framing of evaluation as strategy-agnostic. Available at eugeneyan.com.
Huyen: Building a Generative AI Platform
Platform-level analysis of LLM development challenges. Huyen identifies three properties that make LLM engineering difficult: ambiguity of natural language, stochasticity, and rapid model evolution. Her observation that "analysis on cost-latency, build vs. buy, prompting vs. fine-tuning must be redone constantly" is the motivation for anchoring the article's framework to task properties rather than model properties. Available at huyenchip.com.
Aggarwal et al.: Embeddings vs. Prompting for Classification
The primary quantitative source for the embed-to-classify strategy. Benchmarks embedding-based classification against LLM prompting on accuracy, latency, cost, and calibration. The 44.1% vs. 29.5% accuracy gap and the 81x latency difference are the most cited numbers in the article. Available at arxiv.org.
Five Strategies
8When traditional tools outperform LLMs ↩ Back to article
Cook's blog post makes the case concisely: "Sometimes it takes a low-tech tool to find problems with a high-tech tool." Regex runs in O(n) time using a finite-state automaton. An LLM runs through billions of parameters to approximate the same computation, slower, more expensive, and less reliably. The gap between deterministic string processing and probabilistic generation is not marginal; it is orders of magnitude in every dimension that matters for production systems.
Cook, J. D. (2024). "LLMs and Regular Expressions."
3The production-readiness challenge ↩ Back to article
Huyen frames the core tension of hybrid systems: "It's easy to build something cool with LLMs; very hard to build production-ready." The production-readiness comes from the deterministic wrapper, not from the LLM itself. Input validation, retries, fallback logic, output schema enforcement, and logging are all conventional engineering that makes the probabilistic step reliable. The KreditFlow system at Maryville University is a concrete example: 330 Lambda functions, four Claude models, twelve prompt templates. If you stripped out the LLM calls, you would still have a functioning data pipeline. The LLMs add the intelligence; the code provides the skeleton.
Huyen, C. (2024). "Building a Generative AI Platform."
7The distillation pipeline in practice ↩ Back to article
OpenAI's own cookbook demonstrates the prompt-to-train-to-distill pipeline. GPT-4o generates labeled completions on real-world inputs. Those completions become supervised training data for GPT-4o-mini. The distilled model achieved 79.3% accuracy compared to 64.7% for the non-distilled baseline: a 22% relative improvement. The key principle: teacher output quality is the ceiling for student performance. Simon Willison publicly questioned whether anyone has a fine-tuning success story that beats "prompting existing hosted models or waiting for next-gen models." For most tasks, the answer is no. The exception is when you need the cost structure, not just the accuracy. At 63 million requests per year, the math does not care about model quality parity.
OpenAI. (2024). "Leveraging Model Distillation to Fine-Tune a Model." OpenAI Cookbook.
9Data requirements for distillation ↩ Back to article
Predibase's guide to distilling smaller models emphasizes diverse, non-repetitive, balanced datasets. The more scenarios covered, the better generalization. Quality of the teacher's output serves as an upper limit for the student's performance, which means maximizing teacher quality before distillation is a non-negotiable prerequisite. The guide also addresses a common failure mode: teams that distill from a narrow subset of production data end up with a model that performs well on the common cases but fails on the edge cases that matter most.
Predibase. (2024). "12 Best Practices for Distilling Smaller Models."
6Embeddings vs. prompting: the numbers ↩ Back to article
Aggarwal et al. benchmark embedding-based classification against LLM prompting across multiple datasets. The results are unambiguous for classification tasks: embeddings achieve 44.1% accuracy versus 29.5% for prompting (+49.5% relative), with 81x faster text classification latency (15ms vs 1,220ms) and 10x lower annual cost at 63 million requests ($9,179 vs $92,325). Crucially, embeddings also produce well-calibrated probability distributions suitable for confidence thresholds, while LLM prompting returns concentrated scores that are "mostly uninformative" for real-world decisions. This means embeddings not only classify better but also tell you how confident they are in the classification.
Aggarwal, M., et al. (2025). "Beyond the Hype: Embeddings vs. Prompting for Classification Tasks." arXiv.
The Decision Flowchart
2Evals as the differentiator ↩ Back to article
Yan identifies evaluation as "a major differentiator between folks rushing out hot garbage and those seriously building products." His seven-pattern framework (evals, RAG, fine-tuning, caching, guardrails, defensive UX, user feedback) maps to the strategies in this article: evals sit inside every strategy, RAG is a variant of the hybrid pattern, fine-tuning is the distillation strategy. The vocabulary differs; the engineering principles converge. His emphasis on hybrid retrieval (traditional BM25 + semantic search) also supports the article's point that embedding-based approaches and deterministic code are not alternatives to LLMs but complements.
Yan, E. (2023). "Patterns for Building LLM-based Systems & Products."
The Lifecycle Dimension
1Meta's decision framework ↩ Back to article
Meta's engineering blog articulates the lifecycle trajectory: start with prompting if baseline accuracy exceeds 85%, add RAG for knowledge-intensive problems, and fine-tune "only when the model's intrinsic behavior must be altered robustly and persistently." They note that hosting a fine-tuned model can cost more than the base model due to infrastructure overhead, making distillation to smaller models often more economical. Their recommendation of hybrid solutions combining fine-tuning and RAG as "often yield the best results" aligns with the article's lifecycle diagram showing systems migrating through strategies as they mature.
Meta AI. (2024). "When to Fine-Tune LLMs vs Other Techniques." Meta AI Blog.
5Where development time actually goes ↩ Back to article
Husain reports that 60-80% of development time in real LLM projects is spent on error analysis and evaluation, not building features. His recommended workflow is striking in its simplicity: thirty minutes reviewing 20-50 outputs after each change, with a single domain expert as "benevolent dictator" evaluating pass/fail. This holds regardless of which strategy you choose. The evaluation discipline described in the companion Week 3 articles (Prompts Are Code, promptfoo walkthrough) implements this loop with tooling; Husain's point is that the loop itself matters more than the tooling.
Husain, H. (2024). "LLM Evals FAQ."
The Moving Goalpost
10The hundred-fold price drop ↩ Back to article
Featherless AI's 2026 pricing survey documents the shift: competitive models like DeepSeek V3.2 charge $0.28 per million input tokens, compared to GPT-4's $30 per million in 2024. That is a hundred-fold reduction in two years. The report notes that "80% of applications work fine with $0.40-$2.50/MTok range; only coding agents, reasoning, multi-step planning benefit from $10+/MTok models." This price collapse recalibrates every decision in the framework. Tasks that were "too expensive to prompt" are now cheap. Tasks that were "worth fine-tuning" may no longer justify the upfront investment. The framework survives because it asks "what does the task require?" not "what does the model cost?" But the answers change. Revisit quarterly.
Featherless AI. (2026). "LLM API Pricing Comparison 2026."
What the Commentators Get Right
4The data-flywheel pattern ↩ Back to article
Shankar's data-flywheel pattern (evaluate, monitor, continually improve) describes the lifecycle within each strategy. Her SPADE framework for data quality assertions applies whether you are evaluating prompt outputs, embedding model accuracy, or fine-tuned model behavior. The PROMPTEVALS dataset (2,087 prompts, 12,623 assertions) is a useful benchmark for evaluating models' ability to generate guardrails. The tooling is strategy-agnostic; the practice is universal.
Shankar, S. (2024). "The AI Engineering Flywheel."