When to Prompt, When to Code, When to Train
The most underrated prompt engineering skill is recognizing when you do not need prompt engineering at all. A decision framework for choosing the right tool before you build the wrong system.
The Question Before the Prompt
The other articles in this week's reading list teach you how to write prompts well. How to structure them (The Anatomy of a Prompt), how to version and test them (Prompts Are Code), how to diagnose failure (When Prompts Fail), how to avoid common anti-patterns (How NOT to Write a Prompt). All of that assumes you have already decided that a prompt is the right solution.
This article sits upstream of the others. It asks the question you should answer first: given this task, should you prompt an LLM at all, or is there a better tool?
The default instinct, especially after a week studying prompt engineering, is to prompt everything. You have a hammer; everything looks like a nail. The engineering discipline is knowing when the nail is actually a screw.
Five Strategies
Every task that might involve an LLM can be solved by one of five approaches. They are not ranked from worst to best. They are ranked from most deterministic to most flexible, and the right choice depends on the properties of the task, not the properties of the tool.
Strategy 1: Write Code
If the task has a deterministic specification, the answer is almost always code. Counting the letters in a word. Validating an email address with a regex. Parsing a date string. Computing a running total. Sorting a list.
An LLM can do all of these things. It will get most of them right most of the time. But "most of the time" is not a specification. Ask GPT-4 to count the letters in "Gandalf" and it will sometimes say 7 and sometimes say 6. len("Gandalf") will say 7 every time, in zero milliseconds, at zero cost.
Those are trivial examples. The principle extends to non-trivial ones. Parsing a complex nested JSON schema, normalizing inconsistent date formats across twelve regional conventions, implementing a scoring rubric with weighted criteria: these are real engineering problems, but they are still finite and deterministic. The input space is bounded. The correct output for any valid input can be specified. The function, once written and tested, will produce that output every time.
The common objection is that the code is "too hard to write." In 2026, that objection has lost most of its force. You can use an LLM to generate the code, write test cases to prove it works, and then deploy the deterministic function. The LLM helped you build it; the LLM does not need to run it. You get the flexibility of natural-language specification during development and the reliability of deterministic execution in production.
This is not a minor distinction. A prompt that appears to solve the same problem carries hidden costs downstream: non-deterministic outputs require fuzzy assertions in your test suite, output validation in your pipeline, and retry logic for the cases where the model drifts. A function requires none of that. The apparent ease of writing a prompt obscures the ongoing cost of managing one.8
The decision criterion: if the task is finite and the correct output can be specified for every valid input, write a function. Use an LLM to help you write it if you need to. Then call the function, not the model.
Strategy 2: Prompt an LLM
Prompting is the right choice when the task requires natural-language understanding, the input space is too broad to enumerate, and the cost of occasional imprecision is acceptable.
Good prompting tasks share certain properties. The output is subjective or context-dependent (summarize this document, classify the tone of this message). The input varies enough that hard-coding rules would be brittle (extract the key claim from a customer complaint). The task benefits from world knowledge that would be impractical to encode in a knowledge base (identify the source work for a literary quote).
Prompting is also the right starting point when you do not yet know the shape of the problem. Before you build a classifier, a retrieval pipeline, or a fine-tuned model, you need to understand the data. A prompt is a prototype. It lets you explore the output space quickly, build intuition about what works and what fails, and collect examples that inform whatever you build next.
The cost of prompting is real. Every call costs money, adds latency, and introduces non-determinism. For a customer-facing chatbot handling 10,000 requests per day, those costs compound. For a research tool processing 50 documents, they are negligible. The decision depends on the volume, not the technique.
Strategy 3: Hybrid (Code Wraps LLM)
Most production LLM applications are not pure prompting. They are deterministic pipelines with a small LLM-powered step in the middle. The code handles input validation, output parsing, error handling, and business logic. The LLM handles the one step that requires language understanding.
Consider a system that routes customer support tickets. The routing logic is deterministic: tickets about billing go to the billing team, tickets about shipping go to logistics. But classifying the intent of a free-text ticket requires language understanding. The hybrid architecture uses an LLM to classify the intent (one API call, structured JSON output) and code to route the ticket based on the classification.
This is the pattern Chip Huyen describes as the core challenge of production LLM systems: "It's easy to build something cool with LLMs; very hard to build production-ready." The production-readiness comes from the code wrapping the LLM, not from the LLM itself. Input validation, retries, fallback logic, output schema enforcement, logging: all of this is deterministic code that makes the probabilistic step reliable.3
Strategy 4: Prompt-to-Train-to-Distill
This is the strategy most practitioners skip, and it is often the most cost-effective for high-volume production systems.
The workflow: start with a large model (GPT-4, Claude) and a well-crafted prompt. Use it to process hundreds or thousands of examples. Collect the outputs. Use those outputs as training data to fine-tune a smaller, cheaper model (GPT-4o-mini, Llama 3.2, a custom model). Deploy the fine-tuned model.
OpenAI's own distillation cookbook demonstrates the pattern. In their example, GPT-4o generates labeled completions on real-world inputs. Those completions become supervised training data for GPT-4o-mini. The distilled model achieved 79.3% accuracy, compared to 64.7% for the non-distilled baseline: a 22% relative improvement, at a fraction of the per-call cost.7
The economics are clear. If your system handles millions of requests, every cent per API call compounds. A fine-tuned 7B model running on a single GPU can serve the same request for a hundredth of the cost. The upfront investment (creating training data, running the fine-tuning job) is fixed. The marginal cost of each subsequent inference approaches hardware cost only.
The prerequisite is data. You need enough labeled examples (typically hundreds to low thousands) and enough diversity in those examples to cover the distribution of real inputs. The prompt-to-train pipeline generates this data, but the quality of the teacher's output serves as an upper limit for the student's performance. Maximize the teacher's quality before distilling.9
This strategy is explored in depth in Week 6 (LoRA and the Efficiency Revolution, From Raw Data to Fine-Tuned Model). For now, the point is that it exists as an option, and the decision to use it hinges on volume and cost.
Strategy 5: Embed-to-Classify
Many tasks that look like generation problems are actually classification or retrieval problems in disguise. "Read this ticket and tell me the category" sounds like it needs language understanding. It does. But it does not need a generative model to produce the answer.
An embedding model converts text into a vector. A classifier trained on those vectors maps the vector to a label. This is faster, cheaper, and often more accurate than asking an LLM to generate the label as text.
The numbers are striking. A recent benchmark compared embedding-based classification to LLM prompting (OpenAI) across multiple datasets. Embeddings achieved 44.1% accuracy; prompting achieved 29.5%. Embeddings were 14x faster for image classification and 81x faster for text. At 63 million annual requests, the cost difference was $9,179 (GPU embeddings) versus $92,325 (OpenAI API): ten times cheaper for the approach that was also more accurate.6
The embed-to-classify approach is explored in Weeks 4-5, where embedding models, vector stores, and retrieval pipelines are covered in depth. The point here is recognition: when you catch yourself writing a prompt that ends in "classify this as one of the following categories," stop and ask whether an embedding model would do the job better.
The Task-Property Matrix
The five strategies map to five task properties. Not every property matters for every task, but together they form a decision surface that points toward the right strategy.
| Property | Write Code | Prompt LLM | Hybrid | Distill | Embed |
|---|---|---|---|---|---|
| Determinism required | Yes | No | Partially | No | Mostly |
| Input space | Enumerable | Open-ended | Structured + open | Open-ended | Open-ended |
| Latency budget | < 10ms | 1-10s acceptable | 100ms-2s | 50-500ms | 10-50ms |
| Error cost | High (must be correct) | Low-medium | Medium | Medium | Medium |
| Volume | Any | Low-medium | Medium-high | High (millions) | High |
| Data available | N/A | No (zero-shot) | Some examples | Hundreds-thousands | Labeled set |
| Time to production | Hours-days | Hours | Days-weeks | Weeks-months | Days |
| Marginal cost/call | ~$0 | $0.001-0.10 | $0.001-0.10 | $0.0001 | $0.0001 |
Read the table by columns, not rows. For any given task, check which column best matches your constraints. If your task requires determinism and sub-10ms latency, the answer is code, regardless of how good the latest model is. If your task requires open-ended language understanding at low volume and you have no training data, the answer is prompting.
Worked Examples
Abstract frameworks are useful. Concrete examples are better. Six tasks, each mapped to a strategy using the properties above. Flip through them.
The Decision Flowchart
If the matrix feels like too many dimensions, here is the same logic as a sequential filter. Start at the top. Each question is a gate: if the answer is yes, you have your strategy. Only fall through to the next question if the answer is no.
The order of the questions matters. The flowchart starts with the cheapest, fastest, most deterministic option and moves toward the most expensive and flexible. Each step is a filter, not a recommendation. You only reach "prompt an LLM" after ruling out the alternatives.2
The Lifecycle Dimension
The five strategies are not permanent decisions. Most production LLM systems evolve through them over time.
A common trajectory: you start with Strategy 2 (prompt an LLM) to explore the problem and validate the approach. As the system matures, you move to Strategy 3 (hybrid) by wrapping the LLM call in deterministic code. As volume grows, you move to Strategy 4 (distillation) by fine-tuning a smaller model on the accumulated outputs. If the task stabilizes further, you might replace the model entirely with Strategy 5 (embeddings) or even Strategy 1 (code), if the patterns are regular enough to hard-code.
Meta's own engineering guidance describes this trajectory explicitly: start with prompting if baseline accuracy exceeds 85%, add RAG for knowledge-intensive problems, and fine-tune "only when the model's intrinsic behavior must be altered robustly and persistently."1
Hamel Husain, who has worked across the full spectrum from notebook prototypes to production fine-tuning, reports that 60-80% of development time in real LLM projects is spent on error analysis and evaluation, not building features. That ratio holds regardless of which strategy you choose. The decision framework tells you where to invest. The evaluation discipline described in the other Week 3 articles tells you how to invest.5
The Moving Goalpost
The cost and capability profiles of these strategies shift with every model release.10 A task that was too expensive to prompt in 2024 may be cheap in 2026. A task that required fine-tuning in 2024 may be solvable with a zero-shot prompt today.
This is why the framework is defined in terms of task properties (determinism, latency, volume, error cost, data availability) rather than model properties (pricing, capability, speed). Task properties are stable. Model properties change quarterly. A framework anchored to task properties survives model turnover. A framework anchored to "GPT-4 costs X" is obsolete before you finish reading it.
What the Commentators Get Right
The practitioner community has converged on a few principles that this framework formalizes. The vocabulary varies; the engineering does not. Yan's seven-pattern taxonomy maps cleanly onto the strategies above: RAG is a variant of hybrid; fine-tuning is distillation; evals sit inside every strategy.
Shankar's data-flywheel pattern describes the lifecycle within each strategy, and her SPADE framework for data quality assertions applies whether you are evaluating prompt outputs, embedding accuracy, or fine-tuned behavior. The discipline travels.
Husain's prescription is almost aggressively analog: manual error analysis with a human in the loop. It applies to prompt iteration, embedding evaluation, and fine-tuning alike. None of the three argue for a single strategy. They argue for the discipline that surrounds whichever one you choose.
For Practitioners
Default to the simplest strategy that meets the spec. If a function solves the problem, write the function. If an embedding classifier solves it, train the classifier. Reach for a generative LLM only when you genuinely need generation.
Start with prompting, but plan the migration. Prompting is the fastest way to validate an idea. It should not be the permanent architecture for a high-volume system. When you write a prompt, also write a note: "At what volume does this become too expensive? What would the distilled version look like?"
Measure the actual cost, not the theoretical cost. Run the numbers. How many API calls per day? What is the per-call cost? What would the same task cost with embeddings, with a fine-tuned model, or with code? Chip Huyen's observation is correct: this analysis must be redone constantly, because the pricing landscape shifts quarterly.3
Separate the language step from the pipeline. If your system is 95% deterministic code and 5% language understanding, the hybrid pattern (Strategy 3) almost always wins. Wrapping an LLM call in validation, retries, and schema enforcement makes the probabilistic step predictable enough for production.
Revisit the decision after every model release. The task that justified fine-tuning at GPT-3.5 prices may not justify it at GPT-4o-mini prices. The classification task that needed embeddings may work well enough with a cheap prompt now. The framework is stable, but the answers it produces are not.
References
- Meta AI. "When to Fine-Tune LLMs vs Other Techniques." Meta AI Blog, 2024.
- Yan, E. "Patterns for Building LLM-based Systems & Products." 2023.
- Huyen, C. "Building a Generative AI Platform." 2024.
- Shankar, S. "The AI Engineering Flywheel." 2024.
- Husain, H. "LLM Evals FAQ." 2024.
- Aggarwal, M., et al. "Beyond the Hype: Embeddings vs. Prompting for Classification Tasks." arXiv, 2025.
- OpenAI. "Leveraging Model Distillation to Fine-Tune a Model." OpenAI Cookbook, 2024.
- Cook, J. D. "LLMs and Regular Expressions." 2024.
- Predibase. "12 Best Practices for Distilling Smaller Models." 2024.
- Featherless AI. "LLM API Pricing Comparison 2026." 2026.
Further Reading
- Jurafsky, Daniel & James H. Martin. "Speech and Language Processing," 3rd ed. (draft). Chapters 7 (prompting, conditional generation, temperature) and 10 (in-context learning, fine-tuning).
- Widdows, Dominic & Trevor Cohen. "Large Language Models: How They Work and Why They Matter." SemanticVectors Publishing, 2025. Chapters 1, 4-7.
- Alammar, Jay & Maarten Grootendorst. "Hands-On Large Language Models." O'Reilly Media, 2024. Chapter 6.
- Raschka, Sebastian. "Build a Large Language Model (From Scratch)." Manning, 2024. Chapter 7.
- Extended grounding notes for all citations: Sources.