
Token Burn Is Not Productivity

Lines of code never measured productivity, and token consumption does not measure it either. The companies selling the tokens have a reason to suggest otherwise.

In Brief

Token burn is the new lines-of-code: easy to count, correlated with effort, leaderboard-friendly, and wrong in every direction. The bragging genre is industry-wide; the empirical productivity case is weak; and the companies whose pricing depends on token consumption are also the loudest voices explaining how that consumption is the leading indicator of engineering progress. Anyone who has watched any other measurement-and-incentive cycle in the last thirty years already knows what is coming.

This article walks the structural argument and then makes it concrete with a toy example. A sort_numbers function that used to be five lines of defensive code has been quietly rewritten in production codebases as a runtime LLM call. Same signature, different ontology. The example exaggerates to make the pattern visible, but the pattern is real, and the question worth asking is how much code in the last two years has migrated across that line in the wrong direction.

We all have a version of the same FOMO. We see the posts. An engineer at a frontier AI lab burned $150,000 in Claude Code tokens in a single month. A startup founder went on LinkedIn to brag that the company's monthly AI bill had hit $113,000.

Figure: A small figure stands holding a shovel beside a burning pile of paper. Smoke rises from the pile and billows into a vast mushroom cloud that fills the sky and dwarfs the figure entirely. Caption: The leaderboard rewards plumes.

The CEO of Nvidia said on stage that he would be worried about a $500,000 engineer who did not consume at least $250,000 in tokens. The CEO of a fintech told an audience that anyone who did not use Claude Code "would not make it". The implication is everywhere, even when nobody states it outright. If you are not burning tokens, you are falling behind.

It is a poor way to work, and the case is not subtle. Where the empirical evidence has been gathered, it does not support the bragging. METR's randomized trial of experienced open-source developers, published in July 2025, found that AI tools made them 19% slower at completing real tasks. The same developers, after the experience, still believed AI had sped them up by 20%. The narrative outpaced the measurement by nearly forty points.

We have done this before, with a different metric and the same logic.

From the late 1970s through the early 2000s, certain shops measured engineer output by lines of code shipped per week. The metric was easy to count, correlated with effort, and made productive engineers visible on a leaderboard. It was also wrong in every direction: engineers padded code to score better, refactoring that reduced line count was punished, and the best engineers, the ones whose three-line fix replaced a thousand-line module, scored worst.

Matt Calkins, the CEO of Appian, has compared the current pattern to "the Soviet practice of judging the quality of chandeliers by their weight." The judgment metric was easy to apply, and the chandeliers were terrible.

The bragging genre, in seven posts

Andrej Karpathy on X

February 2, 2025

"I 'Accept All' always, I don't read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it. The code grows beyond my usual comprehension."

New York Times (via summary)

March 2026

"An OpenAI engineer who processed 210 billion tokens in a week, and an Anthropic employee who ran up a Claude Code bill of $150,000 in a month."

Amos Bar-Joseph, CEO, Swan AI

April 2026

"Our AI bill just hit $113,000 in a single month."

Jensen Huang, CEO, Nvidia

April 23, 2026

"Would be concerned if a $500,000 engineer didn't consume at least $250,000 in tokens."

Barney Hussey-Yeo, CEO, Cleo

April 23, 2026

"Anyone who does not use Claude Code to improve productivity and ways of working will not make it."

Doug Kerwin, on Medium

December 4, 2025

"My spend today is at least $800/mo and rising. AI accelerates real work by 3-5x. If AI makes that developer even 20% more productive, that's already 3x the cost of the AI tooling. At that multiplier, $1K/month is a bargain."

Anthropic, official product page

Current

"Agentic, not autocomplete. Describe what you want to build, test, iterate, or ship. Claude Code handles the rest. The execution loop runs independently, saving days of work."

Token burn is the same metric in new language: easy to count, correlated with effort, leaderboard-friendly, and wrong in every direction. The companies whose pricing depends on token consumption are also the loudest voices explaining how that consumption is the leading indicator of engineering progress. Anyone who has watched any other measurement-and-incentive cycle in the last thirty years already knows what is coming.

It is worth seeing the pricing math in plain numbers. Anthropic charges $5 per million input tokens and $25 per million output tokens on its top-tier model: five times as much for the words the model writes as for the words it reads. Every product feature that encourages models to write more, whether extended thinking, multi-shot reasoning, autonomous agent loops, or computer use, sits on the high-margin side of that asymmetry. When the latest model rolled out a new tokenizer that uses up to 35% more tokens for the same input text, that was up to 35% more revenue on the affected tokens of an identical workload, with no underlying change in the work being done. The productivity narrative is not separable from the pricing structure that produces it.
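The arithmetic is short enough to write down. The sketch below uses the two prices quoted above; the workload size is a made-up example, and it assumes, for illustration only, that the tokenizer change inflates every token count by the full 35%.

```python
# Prices quoted in the article, in dollars per million tokens.
INPUT_PRICE_PER_MTOK = 5.00
OUTPUT_PRICE_PER_MTOK = 25.00


def workload_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one workload at the quoted per-token prices."""
    return (
        input_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK
        + output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MTOK
    )


# Hypothetical workload: 2M tokens read, 1M tokens written.
baseline = workload_cost(2_000_000, 1_000_000)

# Assumption: the same text now counts as 35% more tokens on both sides.
inflated = workload_cost(2_700_000, 1_350_000)

print(f"baseline: ${baseline:.2f}")
print(f"inflated: ${inflated:.2f}")
print(f"increase: {inflated / baseline - 1:.0%}")
```

In this made-up workload the bill moves from $35.00 to $47.25 with no change in the text processed. Output is a third of the tokens and $25 of the original $35.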

And the prices most engineers see today are not the prices Anthropic intends to charge them. The $200 monthly Max plan reportedly covers users running $1,000 to $5,000 per day in API-equivalent workloads. The current pricing is a loss leader. The habits being formed under it will be billed at retail when the subsidy runs out.

What follows is a small example of what happens when a developer internalizes the narrative.

I have been reading a lot of production Python this year, and a particular regression keeps showing up. A function that existed in clean, deterministic form for three decades has been rewritten to call a frontier model. Same name, same signature, very different function.

Take this one: sort_numbers(xs: list[int]) -> list[int]. The pre-2023 implementation was a handful of defensive lines. The version I keep finding in 2026 codebases is thirty. The line count is not the interesting part. What the function actually does now is.

. . .

Before

def sort_numbers(xs: list[int] | None) -> list[int]:
    if xs is None:
        return []
    if len(xs) == 0:
        return []
    if len(xs) == 1:
        return list(xs)         # nothing to sort, but return a fresh list
    return sorted(xs)

The pre-2023 implementation: deterministic, free, instant, and contract-holding.

That function does almost nothing. The actual sort is one line. The other six lines, three guards of two lines each, are the reason the function exists at all.

The None guard catches upstream parse failures so callers do not have to wrap every invocation in a try. The empty-list guard separates "no data" from "single point" semantically. The single-item shortcut signals a known-trivial case and skips the sort call in a hot loop. The fresh list on the trivial path makes sure the caller never gets back a mutable reference to their own input, because mutation surprises are how production bugs get filed at three in the morning.

The function existed in some form, in some library, in some language, since before most developers writing code today were born. It was understood, tested, and correct. You could call it on a 200-million-element vector of timestamps from the entire Star Wars expanded universe and it would not blink.

. . .

After

def sort_numbers(xs: list[int]) -> list[int]:
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": """You are a precise sorting assistant. Your sole task is to
return the input list of integers, sorted in ascending order, and nothing else.

Rules:
- Return every input element exactly once. Do not omit, deduplicate, or merge.
- Do not invent elements that were not in the input.
- Do not normalize, round, truncate, or modify the values.
- Treat negative numbers as smaller than positive numbers.
- Do not include any prose, explanation, apology, or follow-up question.
- Do not return the answer in a language other than English.
- Output a single JSON object: {"sorted": [<int>, ...]}. Nothing else.
- If the input is empty, return {"sorted": []}.
- If you cannot parse the input, return {"sorted": []}. Do not refuse."""},
            {"role": "user", "content": f"Sort: {xs}"},
        ],
    )
    return resp.choices[0].message.content

The post-2023 implementation: shorter than it should be, longer than it needs to be, and lying about its return type.

That version takes about 1.4 seconds on a good day, costs roughly three hundredths of a cent per call, depends on a network and a vendor SLA, and trusts a frontier model to obey a system prompt that begs it, in writing, not to apologize. The line count is the least of its problems.
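Those per-call figures look harmless until they are multiplied. A back-of-envelope sketch: the cost and latency are the ones quoted above, while the call volume of one million per day is a hypothetical chosen for illustration, not a measurement.

```python
COST_PER_CALL_USD = 0.0003   # "three hundredths of a cent", from the text
LATENCY_PER_CALL_S = 1.4     # "about 1.4 seconds on a good day", from the text

calls_per_day = 1_000_000    # hypothetical volume, not from the article

daily_cost_usd = COST_PER_CALL_USD * calls_per_day
daily_wait_hours = LATENCY_PER_CALL_S * calls_per_day / 3600

print(f"daily cost: ${daily_cost_usd:,.2f}")
print(f"cumulative wait: {daily_wait_hours:,.0f} hours")
```

At that assumed volume the function costs about $300 a day and adds roughly 389 hours of cumulative waiting, for work that sorted() does for free.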

There are no input guards. None, an empty list, and a single element all get formatted into the prompt as Python str representations and shipped to the model untouched. There is no validation on the way out. The function returns resp.choices[0].message.content, a str, even though the signature still claims list[int]. And there is no fallback to sorted() when the model returns garbage, because the developer who wrote this version did not think it would.

A more cautious developer would have added Pydantic schemas, retry logic, fallback providers, and Prometheus instrumentation. A developer that cautious would not have replaced sorted() in the first place. This code exists because the person who wrote it does not see the problem.

Same signature, different ontology: the LLM version adds a network, a vendor, a non-determinism source, and four new failure modes that the deterministic call had ruled out by construction.

. . .

Authoring vs Runtime

The sort_numbers example is exaggerated. Few production codebases route a sort through a frontier model. The point of the example is not that this exact function exists in shipped software but that the pattern does. Over the last two years I have read functions of this shape doing increasingly serious work: deduplication of customer records, reconciliation of financial entries, classification of inputs that have a closed-form rule, validation of structured data against a Pydantic model that already exists in the same codebase.

A distinction worth holding in view: this article is not about LLM-assisted authoring. Using a model to write code, generate scaffolding, or refactor boilerplate is something else, and the article takes no position on it. Code produced with model assistance ends up as deterministic source. It runs without any model present, ships once, and pays its inference cost only at the moment of authorship. That is fine. That is, in many cases, productive.

What this article is about is runtime LLM usage. The function in the example does not call a model at build time to produce a sort. It calls a model every time the function executes. Every input pays the latency, the dollar cost, and the distribution of whatever the model decides to do that day. The deterministic algorithm has been replaced not in the source code but in the trace of every running invocation.
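The difference shows up as a call count. In the sketch below, fake_model is a hypothetical stand-in that only counts how often it is invoked; no real model or vendor API is involved, and both function names are invented for the illustration.

```python
calls = {"model": 0}


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a paid model call; it only counts invocations."""
    calls["model"] += 1
    return "sorted(xs)"


# Authoring time: one call, whose output is committed as ordinary source.
generated_body = fake_model("write the body of sort_numbers")


def sort_numbers_authored(xs: list[int]) -> list[int]:
    return sorted(xs)  # the committed result; no model involved from here on


def sort_numbers_runtime(xs: list[int]) -> list[int]:
    fake_model(f"Sort: {xs}")  # one model call on every invocation
    return sorted(xs)          # stand-in for whatever the model would return


for _ in range(1000):
    sort_numbers_authored([3, 1, 2])
authoring_calls = calls["model"]

for _ in range(1000):
    sort_numbers_runtime([3, 1, 2])
runtime_calls = calls["model"] - authoring_calls

print(authoring_calls, runtime_calls)
```

A thousand invocations of the authored version cost the one call made at authoring time. A thousand invocations of the runtime version cost a thousand.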

The same authoring story produces two different runtime traces. The architectural question is not what the code says but where the LLM call appears in the lifecycle.

That distinction collapses the moment you stop watching for it. A reasonable developer using a reasonable assistant can produce a function whose source reads fine and whose runtime is bizarre. The migration worth worrying about is the one that happens silently: a function that was a thirty-year-old primitive in version one becomes an LLM call in version two, both versions look reasonable on read, and only the production trace tells you which is which.

. . .

The New Failure Modes

The runtime LLM call is also exposed to a category of failure the pre-LLM function literally could not experience.

Schema drift. JSON mode guarantees the output parses, not that it has the right shape. The model can return {} with no sorted key at all, {"answer": [1, 2, 3]} with the right value under the wrong name, {"sorted": "1, 2, 3"} with the values flattened into a string, or {"sorted": [1, 2, 3], "explanation": "ascending order"} with the right value plus commentary the caller did not request. Each is valid JSON. None is the contract. The function does not parse the JSON before returning it, so the caller never sees the discrepancy until something downstream tries to iterate the result.
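For contrast, here is roughly what checking the shape would take. This is a minimal sketch using only the standard library; parse_sorted_payload is a hypothetical helper of mine, not something found in the code under discussion.

```python
import json


def parse_sorted_payload(raw: str) -> list[int]:
    """Accept only {"sorted": [<int>, ...]}; raise ValueError for anything else."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("response is not valid JSON") from exc
    if not isinstance(payload, dict) or set(payload) != {"sorted"}:
        raise ValueError(f"unexpected top-level shape: {payload!r}")
    values = payload["sorted"]
    if not isinstance(values, list):
        raise ValueError("'sorted' is not a list")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if any(isinstance(v, bool) or not isinstance(v, int) for v in values):
        raise ValueError("'sorted' contains non-integers")
    return values
```

All four drifted shapes listed above raise ValueError here. The check still says nothing about whether the integers are the right ones, or in the right order.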

Hallucinated arguments. When the function is exposed through a tool-use schema, the model can invoke sort_numbers(xs=[1,2,3], reverse=True, key="age") even though the schema only defined xs. Now the function has to reject the call, ignore the extras silently, or fail in a stack trace whose root cause is that the model took initiative.
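Rejecting the call deliberately might look like the sketch below. The TOOLS registry and dispatch_tool_call are hypothetical names for this illustration; real tool-use frameworks have their own dispatch layers, and this is not a description of any of them.

```python
import inspect


def sort_numbers(xs: list[int]) -> list[int]:
    return sorted(xs)


# Hypothetical registry of functions exposed to the model as tools.
TOOLS = {"sort_numbers": sort_numbers}


def dispatch_tool_call(name: str, arguments: dict):
    """Run a model-requested call, rejecting arguments the function does not declare."""
    func = TOOLS[name]
    allowed = set(inspect.signature(func).parameters)
    extras = set(arguments) - allowed
    if extras:
        raise TypeError(f"{name} got undeclared arguments: {sorted(extras)}")
    return func(**arguments)
```

Plain Python would also raise a TypeError on the unexpected keywords. The explicit check only makes the rejection intentional and names every extra argument in one message, which helps when the root cause is the model's initiative.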

Wrong tool selection. With more than one tool registered, the model can confidently call count_numbers when the user asked for a sort, return the right type, and produce a totally wrong answer that the type system has no objection to. The bug looks exactly like a working function until somebody checks.

Confidently wrong, looks right. The 50-item list comes back as a 50-item list that looks ascending at a glance, with the right minimum and the right maximum. Three numbers in the middle are silently transposed. One number was replaced with a near neighbor. The test on a 5-item input passed in CI six months ago.

# Input (20 items)
[42, 17, 8, 23, 4, 91, 56, 2, 33, 88, 19, 67, 5, 71, 14, 39, 28, 6, 50, 11]

# Output (19 items, sorted, no error raised)
[2, 4, 5, 6, 8, 11, 14, 17, 19, 23, 28, 33, 39, 50, 56, 67, 71, 88, 91]

Twenty in, nineteen out, the minimum and maximum correct, and the 42 vanished without anyone noticing.

That output is sorted. It is also wrong. There is no exception, no validator complaint, no test failure. The downstream consumer sums the list, draws a chart, files a report, and something further along the pipeline finally notices that a count is off by one. The investigation that follows will not start at the function that sorted the list, because that function is the one part of the system that "always works."
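A check that would have caught it is small. The sketch below compares the result against the input as a multiset; is_valid_sort is a hypothetical helper, and the two lists are the ones from the example above.

```python
from collections import Counter


def is_valid_sort(original: list[int], candidate: list[int]) -> bool:
    """True only if candidate is ascending and holds exactly the original elements."""
    ascending = all(a <= b for a, b in zip(candidate, candidate[1:]))
    same_elements = Counter(candidate) == Counter(original)
    return ascending and same_elements


xs = [42, 17, 8, 23, 4, 91, 56, 2, 33, 88, 19, 67, 5, 71, 14, 39, 28, 6, 50, 11]
model_output = [2, 4, 5, 6, 8, 11, 14, 17, 19, 23, 28, 33, 39, 50, 56, 67, 71, 88, 91]

# Counter subtraction keeps only what the output is missing.
missing = Counter(xs) - Counter(model_output)

print(is_valid_sort(xs, model_output))
print(dict(missing))
```

The check rejects the output and names the missing 42. It is also equivalent to comparing against sorted(xs), so verifying the model's answer costs as much as never asking for it.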

. . .

Type Annotation and Runtime Disagreement

Read the last line of the post-LLM function:

    return resp.choices[0].message.content

A single expression that returns a str from a function whose signature declares list[int].

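That expression evaluates to a str. The function's annotation declares list[int]. Nothing in the function body has noticed. A static checker such as mypy would flag the mismatch, but only if it is actually run over this module with the SDK's types resolved; in a notebook, or wherever the client resolves to Any, nothing complains. The runtime will not flag it either, because Python type hints are advisory.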

The caller assigned the result to a variable typed list[int]. The caller iterates the result. Each iteration step yields a single character of a JSON string: the first element is {, the second is ", the third is s. Several stack frames downstream, an integer-shaped consumer crashes with an error message that names neither sort_numbers nor OpenAI.
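The failure is easy to reproduce without a model. In the sketch below, raw is a hard-coded stand-in for the string the function hands back; nothing here calls a network.

```python
import json

raw = '{"sorted": [1, 2, 3]}'  # stand-in for what the function returns: a str

# Iterating the "list" yields characters, not integers.
first_three = [ch for ch in raw][:3]

# An integer-shaped consumer fails far from the cause.
try:
    total = sum(raw)
except TypeError as exc:
    error = str(exc)

# The step the function skipped.
parsed = json.loads(raw)["sorted"]

print(first_three)
print(parsed)
```

The first three elements are the brace, the quote, and the letter s. The single json.loads call is the difference between the annotation being true and being decoration.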

The pre-LLM function called sorted(). The post-LLM function calls a frontier model and hands back the raw text without parsing it. Same signature. Different ontology.

In the developer's notebook, calling sort_numbers([3, 1, 2]) shows '{"sorted": [1, 2, 3]}' on the screen, the values look right, and the function ships. Several services adopt it. The first integration that treats the return value as a list iterates over a string and crashes seventeen frames deep, in code nobody wrote. By the time the bug is traced, the function is load-bearing in three places, and the cost of pulling it out is higher than the cost of leaving it in.

. . .

Conclusion

This is fine, sometimes. There are problems where the LLM call earns its weight: ambiguous parsing, semantic similarity, fuzzy classification, anything where the inputs do not have a clean closed-form answer. Sorting integers is not on that list. Neither is counting characters, parsing dates with a known format, validating an email against a regex that has worked for fifteen years, or comparing two booleans.

The interesting question is how much code in the last two years has migrated across that line in the wrong direction. The companies billing for the inference are not the right place to ask.

The lines-of-code era ended when enough engineers refused to participate, and either the token-burn era ends the same way or it ends when the subsidies run out. The engineers who never made the migration in the wrong direction will be the ones with working code at the bottom of the cliff.

. . .

References

AI Productivity News. (March 2026). "Tokenmaxxing: Engineers Burn $150K/Month on AI Compute." Summary of New York Times reporting.
TechJuice. (April 2026). "Startups Spending More on AI Than Employees: The Tokenmaxxing Trend." Coverage of Amos Bar-Joseph (Swan AI) on the $113,000 monthly AI bill.
Trending Topics. (April 23, 2026). "Tokenmaxxing: Is AI Token Consumption a Productivity Metric or Vanity Trap?" Source for Jensen Huang ($250K of tokens), Barney Hussey-Yeo (Cleo), and Matt Calkins (Appian) on the Soviet-chandeliers metaphor.
Karpathy, A. (February 2, 2025). "I 'Accept All' always..." Post on X.
Anthropic. (Current). "Claude Code." Anthropic Product Page.
METR. (July 10, 2025). "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity." METR Blog. Randomized trial showing 19% slowdown perceived as 20% speedup.
Anthropic. (2025). "Pricing." Claude API Documentation. Source for the input-vs-output token pricing asymmetry.
Desight Studio. (2025). "Claude Costs: Who Really Pays the AI Bill?" Source for the Max plan loss-leader analysis.
Kerwin, D. (December 4, 2025). "$1K/Month per Developer: The AI Cost Nobody's Prepared For." Medium.
