← All Articles

When to Prompt, When to Code, When to Train

The most underrated prompt engineering skill is recognizing when you do not need prompt engineering at all. A decision framework for choosing the right tool before you build the wrong system.

In Brief

Before deciding how to write a prompt, decide whether you need a prompt at all. Every task can be solved by one of five approaches: write deterministic code for finite problems with enumerable solutions, prompt an LLM for open-ended tasks where occasional imprecision is acceptable and volume is low, wrap an LLM step inside a deterministic pipeline for medium-volume production, distill a large model into a fine-tuned smaller one for high-volume cost-sensitive inference, or use embedding-based classification for tasks that are really classification problems disguised as generation. The right choice depends on task properties: whether the problem is deterministic or probabilistic, whether the input space is bounded, what the error cost looks like, what the volume is, how tight the latency budget is, and whether labeled training data is available. Skipping this diagnosis leads to systems that are either over-engineered (calling an expensive LLM API for a problem that needs a regex) or under-engineered (squeezing generative output quality out of what is really a classification task).

The economics get sharp once full cost is accounted for. A prompt that costs a fraction of a cent per call looks cheap until it is multiplied by millions of daily requests, while an embedding model on a single GPU costs a few dollars a day and often achieves higher accuracy on classification tasks. The upfront cost of building training data, running distillation, or implementing a retrieval system is real, but it amortizes quickly past a certain scale. The discipline is running through the task-property matrix before reaching for a prompt: a deterministic problem with a finite input space calls for code, deterministic classification calls for embeddings, high-volume generation calls for distillation to a fine-tuned smaller model, and only low-volume open-ended understanding genuinely calls for prompting. This upstream decision prevents the common trap of defaulting to prompting because it is familiar, only to discover months later that the system would have been simpler, cheaper, and more reliable as code, as embeddings, or as a fine-tuned model.

. . .

The Question Before the Prompt

The other articles in this week's reading list teach you how to write prompts well. How to structure them (The Anatomy of a Prompt), how to version and test them (Prompts Are Code), how to diagnose failure (When Prompts Fail), how to avoid common anti-patterns (How NOT to Write a Prompt). All of that assumes you have already decided that a prompt is the right solution.

This article sits upstream of the others. It asks the question you should answer first: given this task, should you prompt an LLM at all, or is there a better tool?

A small person holding a single peanut up toward an enormous elephant whose trunk waits patiently — Perhaps a different approach.

The default instinct, especially after a week studying prompt engineering, is to prompt everything. You have a hammer; everything looks like a nail. The engineering discipline is knowing when the nail is actually a screw.

. . .

Five Strategies

Every task that might involve an LLM can be solved by one of five approaches. They are not ranked from worst to best. They are ranked from most deterministic to most flexible, and the right choice depends on the properties of the task, not the properties of the tool.

Jump to a strategy

Strategy 1: Write Code

If the task has a deterministic specification, the answer is almost always code. Counting the letters in a word. Validating an email address with a regex. Parsing a date string. Computing a running total. Sorting a list.

An LLM can do all of these things. It will get most of them right most of the time. But "most of the time" is not a specification. Ask GPT-4 to count the letters in "Gandalf" and it will sometimes say 7 and sometimes say 6. len("Gandalf") will say 7 every time, in zero milliseconds, at zero cost.

Those are trivial examples. The principle extends to non-trivial ones. Parsing a complex nested JSON schema, normalizing inconsistent date formats across twelve regional conventions, implementing a scoring rubric with weighted criteria: these are real engineering problems, but they are still finite and deterministic. The input space is bounded. The correct output for any valid input can be specified. The function, once written and tested, will produce that output every time.

The common objection is that the code is "too hard to write." In 2026, that objection has lost most of its force. You can use an LLM to generate the code, write test cases to prove it works, and then deploy the deterministic function. The LLM helped you build it; the LLM does not need to run it. You get the flexibility of natural-language specification during development and the reliability of deterministic execution in production.

This is not a minor distinction. A prompt that appears to solve the same problem carries hidden costs downstream: non-deterministic outputs require fuzzy assertions in your test suite, output validation in your pipeline, and retry logic for the cases where the model drifts. A function requires none of that. The apparent ease of writing a prompt obscures the ongoing cost of managing one.⁸

The decision criterion: if the task is finite and the correct output can be specified for every valid input, write a function. Use an LLM to help you write it if you need to. Then call the function, not the model.

Strategy 2: Prompt an LLM

Prompting is the right choice when the task requires natural-language understanding, the input space is too broad to enumerate, and the cost of occasional imprecision is acceptable.

Good prompting tasks share certain properties. The output is subjective or context-dependent (summarize this document, classify the tone of this message). The input varies enough that hard-coding rules would be brittle (extract the key claim from a customer complaint). The task benefits from world knowledge that would be impractical to encode in a knowledge base (identify the source work for a literary quote).

Prompting is also the right starting point when you do not yet know the shape of the problem. Before you build a classifier, a retrieval pipeline, or a fine-tuned model, you need to understand the data. A prompt is a prototype. It lets you explore the output space quickly, build intuition about what works and what fails, and collect examples that inform whatever you build next.

The cost of prompting is real. Every call costs money, adds latency, and introduces non-determinism. For a customer-facing chatbot handling 10,000 requests per day, those costs compound. For a research tool processing 50 documents, they are negligible. The decision depends on the volume, not the technique.

Strategy 3: Hybrid (Code Wraps LLM)

Most production LLM applications are not pure prompting. They are deterministic pipelines with a small LLM-powered step in the middle. The code handles input validation, output parsing, error handling, and business logic. The LLM handles the one step that requires language understanding.

Consider a system that routes customer support tickets. The routing logic is deterministic: tickets about billing go to the billing team, tickets about shipping go to logistics. But classifying the intent of a free-text ticket requires language understanding. The hybrid architecture uses an LLM to classify the intent (one API call, structured JSON output) and code to route the ticket based on the classification.

This is the pattern Chip Huyen describes as the core challenge of production LLM systems: "It's easy to build something cool with LLMs; very hard to build production-ready." The production-readiness comes from the code wrapping the LLM, not from the LLM itself. Input validation, retries, fallback logic, output schema enforcement, logging: all of this is deterministic code that makes the probabilistic step reliable.³

Strategy 4: Prompt-to-Train-to-Distill

This is the strategy most practitioners skip, and it is often the most cost-effective for high-volume production systems.

The workflow: start with a large model (GPT-4, Claude) and a well-crafted prompt. Use it to process hundreds or thousands of examples. Collect the outputs. Use those outputs as training data to fine-tune a smaller, cheaper model (GPT-4o-mini, Llama 3.2, a custom model). Deploy the fine-tuned model.

OpenAI's own distillation cookbook demonstrates the pattern. In their example, GPT-4o generates labeled completions on real-world inputs. Those completions become supervised training data for GPT-4o-mini. The distilled model achieved 79.3% accuracy, compared to 64.7% for the non-distilled baseline: a 22% relative improvement, at a fraction of the per-call cost.⁷

The economics are clear. If your system handles millions of requests, every cent per API call compounds. A fine-tuned 7B model running on a single GPU can serve the same request for a hundredth of the cost. The upfront investment (creating training data, running the fine-tuning job) is fixed. The marginal cost of each subsequent inference approaches hardware cost only.

The prerequisite is data. You need enough labeled examples (typically hundreds to low thousands) and enough diversity in those examples to cover the distribution of real inputs. The prompt-to-train pipeline generates this data, but the quality of the teacher's output serves as an upper limit for the student's performance. Maximize the teacher's quality before distilling.⁹

This strategy is explored in depth in Week 6 (LoRA and the Efficiency Revolution, From Raw Data to Fine-Tuned Model). For now, the point is that it exists as an option, and the decision to use it hinges on volume and cost.

Strategy 5: Embed-to-Classify

Many tasks that look like generation problems are actually classification or retrieval problems in disguise. "Read this ticket and tell me the category" sounds like it needs language understanding. It does. But it does not need a generative model to produce the answer.

An embedding model converts text into a vector. A classifier trained on those vectors maps the vector to a label. This is faster, cheaper, and often more accurate than asking an LLM to generate the label as text.

The numbers are striking. A recent benchmark compared embedding-based classification to LLM prompting (OpenAI) across multiple datasets. Embeddings achieved 44.1% accuracy; prompting achieved 29.5%. Embeddings were 14x faster for image classification and 81x faster for text. At 63 million annual requests, the cost difference was $9,179 (GPU embeddings) versus $92,325 (OpenAI API): ten times cheaper for the approach that was also more accurate.⁶

For classification tasks, embeddings are more accurate, faster, and cheaper at scale.

The embed-to-classify approach is explored in Weeks 4-5, where embedding models, vector stores, and retrieval pipelines are covered in depth. The point here is recognition: when you catch yourself writing a prompt that ends in "classify this as one of the following categories," stop and ask whether an embedding model would do the job better.

. . .

The Task-Property Matrix

The five strategies map to five task properties. Not every property matters for every task, but together they form a decision surface that points toward the right strategy.

Property	Write Code	Prompt LLM	Hybrid	Distill	Embed
Determinism required	Yes	No	Partially	No	Mostly
Input space	Enumerable	Open-ended	Structured + open	Open-ended	Open-ended
Latency budget	< 10ms	1-10s acceptable	100ms-2s	50-500ms	10-50ms
Error cost	High (must be correct)	Low-medium	Medium	Medium	Medium
Volume	Any	Low-medium	Medium-high	High (millions)	High
Data available	N/A	No (zero-shot)	Some examples	Hundreds-thousands	Labeled set
Time to production	Hours-days	Hours	Days-weeks	Weeks-months	Days
Marginal cost/call	~$0	$0.001-0.10	$0.001-0.10	$0.0001	$0.0001

Task-property matrix mapping task characteristics to the five strategies.

Read the table by columns, not rows. For any given task, check which column best matches your constraints. If your task requires determinism and sub-10ms latency, the answer is code, regardless of how good the latest model is. If your task requires open-ended language understanding at low volume and you have no training data, the answer is prompting.

. . .

Worked Examples

Abstract frameworks are useful. Concrete examples are better. Six tasks, each mapped to a strategy using the properties above. Flip through them.

Strategy: Write Code

Count the words in a paragraph

The specification is exact. The input space is text strings. The correct answer is a single integer. len(text.split()) is 100% accurate, runs in microseconds, and costs nothing. There is no reason to involve a neural network.

Strategy: Prompt an LLM

Classify the faction of a Star Wars character from a user-submitted description

The input is free-text, the description could be worded any way, and the classification requires world knowledge ("a short green creature with a lightsaber" maps to "Jedi" via knowledge the model already has). Volume is low (a classroom demo). Occasional misclassification is not catastrophic.

Strategy: Hybrid

Extract structured metadata from academic paper abstracts

The pipeline reads a PDF, extracts the abstract (code), sends it to an LLM with a structured output prompt (prompt), validates the JSON schema (code), writes to a database (code). The LLM handles the one step that requires language understanding. Everything else is deterministic.

Strategy: Embed-to-classify

Classify customer support tickets for a company processing 50,000 per day

At 50,000 tickets per day, an LLM API call per ticket costs $50-500 daily. An embedding model on a single GPU handles the same volume for a few dollars. The task is classification, not generation. Embeddings produce calibrated confidence scores that let you route low-confidence tickets to human review.

Strategy: Prompt-to-train-to-distill

Translate short product descriptions for a multilingual e-commerce site

Use GPT-4 to translate 5,000 product descriptions across target languages. Validate the translations with native speakers. Fine-tune a smaller multilingual model on the validated outputs. Deploy the fine-tuned model. The initial investment in teacher-generated data pays for itself within weeks at e-commerce volume.

Strategy: Write Code

Detect whether a Doctor Who episode synopsis mentions a specific villain

A regex or keyword search for "Dalek" or "Cyberman" is faster, cheaper, and more reliable than asking an LLM. The input space is constrained (episode synopses have known structure), and the target is a string match, not semantic understanding.⁸

1 of 6

Six tasks, five strategies. Click the dots or arrows to flip through.

. . .

The Decision Flowchart

If the matrix feels like too many dimensions, here is the same logic as a sequential filter. Start at the top. Each question is a gate: if the answer is yes, you have your strategy. Only fall through to the next question if the answer is no.

Each question is a gate. Fall through only on "no."

The order of the questions matters. The flowchart starts with the cheapest, fastest, most deterministic option and moves toward the most expensive and flexible. Each step is a filter, not a recommendation. You only reach "prompt an LLM" after ruling out the alternatives.²

. . .

The Lifecycle Dimension

The five strategies are not permanent decisions. Most production LLM systems evolve through them over time.

A common trajectory: you start with Strategy 2 (prompt an LLM) to explore the problem and validate the approach. As the system matures, you move to Strategy 3 (hybrid) by wrapping the LLM call in deterministic code. As volume grows, you move to Strategy 4 (distillation) by fine-tuning a smaller model on the accumulated outputs. If the task stabilizes further, you might replace the model entirely with Strategy 5 (embeddings) or even Strategy 1 (code), if the patterns are regular enough to hard-code.

Most production systems migrate rightward as volume grows and patterns stabilize.

Meta's own engineering guidance describes this trajectory explicitly: start with prompting if baseline accuracy exceeds 85%, add RAG for knowledge-intensive problems, and fine-tune "only when the model's intrinsic behavior must be altered robustly and persistently."¹

Hamel Husain, who has worked across the full spectrum from notebook prototypes to production fine-tuning, reports that 60-80% of development time in real LLM projects is spent on error analysis and evaluation, not building features. That ratio holds regardless of which strategy you choose. The decision framework tells you where to invest. The evaluation discipline described in the other Week 3 articles tells you how to invest.⁵

. . .

The Moving Goalpost

The cost and capability profiles of these strategies shift with every model release.¹⁰ A task that was too expensive to prompt in 2024 may be cheap in 2026. A task that required fine-tuning in 2024 may be solvable with a zero-shot prompt today.

This is why the framework is defined in terms of task properties (determinism, latency, volume, error cost, data availability) rather than model properties (pricing, capability, speed). Task properties are stable. Model properties change quarterly. A framework anchored to task properties survives model turnover. A framework anchored to "GPT-4 costs X" is obsolete before you finish reading it.

. . .

What the Commentators Get Right

The practitioner community has converged on a few principles that this framework formalizes. The vocabulary varies; the engineering does not. Yan's seven-pattern taxonomy maps cleanly onto the strategies above: RAG is a variant of hybrid; fine-tuning is distillation; evals sit inside every strategy.

Shankar's data-flywheel pattern describes the lifecycle within each strategy, and her SPADE framework for data quality assertions applies whether you are evaluating prompt outputs, embedding accuracy, or fine-tuned behavior. The discipline travels.

Husain's prescription is almost aggressively analog: manual error analysis with a human in the loop. It applies to prompt iteration, embedding evaluation, and fine-tuning alike. None of the three argue for a single strategy. They argue for the discipline that surrounds whichever one you choose.

. . .

For Practitioners

Default to the simplest strategy that meets the spec. If a function solves the problem, write the function. If an embedding classifier solves it, train the classifier. Reach for a generative LLM only when you genuinely need generation.

Start with prompting, but plan the migration. Prompting is the fastest way to validate an idea. It should not be the permanent architecture for a high-volume system. When you write a prompt, also write a note: "At what volume does this become too expensive? What would the distilled version look like?"

Measure the actual cost, not the theoretical cost. Run the numbers. How many API calls per day? What is the per-call cost? What would the same task cost with embeddings, with a fine-tuned model, or with code? Chip Huyen's observation is correct: this analysis must be redone constantly, because the pricing landscape shifts quarterly.³

Separate the language step from the pipeline. If your system is 95% deterministic code and 5% language understanding, the hybrid pattern (Strategy 3) almost always wins. Wrapping an LLM call in validation, retries, and schema enforcement makes the probabilistic step predictable enough for production.

Revisit the decision after every model release. The task that justified fine-tuning at GPT-3.5 prices may not justify it at GPT-4o-mini prices. The classification task that needed embeddings may work well enough with a cheap prompt now. The framework is stable, but the answers it produces are not.

. . .

References

Meta AI. "When to Fine-Tune LLMs vs Other Techniques." Meta AI Blog, 2024.
Yan, E. "Patterns for Building LLM-based Systems & Products." 2023.
Huyen, C. "Building a Generative AI Platform." 2024.
Shankar, S. "The AI Engineering Flywheel." 2024.
Husain, H. "LLM Evals FAQ." 2024.
Aggarwal, M., et al. "Beyond the Hype: Embeddings vs. Prompting for Classification Tasks." arXiv, 2025.
OpenAI. "Leveraging Model Distillation to Fine-Tune a Model." OpenAI Cookbook, 2024.
Cook, J. D. "LLMs and Regular Expressions." 2024.
Predibase. "12 Best Practices for Distilling Smaller Models." 2024.
Featherless AI. "LLM API Pricing Comparison 2026." 2026.

When to Prompt, When to Code, When to Train

The Question Before the Prompt

Five Strategies

Strategy 1: Write Code

Strategy 2: Prompt an LLM

Strategy 3: Hybrid (Code Wraps LLM)

Strategy 4: Prompt-to-Train-to-Distill

Strategy 5: Embed-to-Classify

The Task-Property Matrix

Worked Examples

The Decision Flowchart

The Lifecycle Dimension

The Moving Goalpost

What the Commentators Get Right

For Practitioners

References

Further Reading