Why LLMs Struggle with Arithmetic

Large language models are surprisingly bad at arithmetic. This isn't a bug that more training will fix. It's structural.

Wrong direction

Humans add numbers right-to-left. Start with the ones column, carry to the tens. LLMs generate text left-to-right. To output 157 + 286 = 443, the model must produce the "4" in the hundreds place before calculating whether the tens place generates a carry.

It's being asked to write the answer before computing it.
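
A minimal sketch in plain Python (illustrative only, not a claim about how transformers compute internally): column addition naturally produces digits least-significant-first, so emitting the answer most-significant-first means every downstream carry must already be resolved.

def add_right_to_left(a, b):
    # Column addition: emit digits least-significant-first, carrying as you go.
    # Each digit depends only on columns already processed.
    digits = []
    carry = 0
    while a or b or carry:
        column = a % 10 + b % 10 + carry
        digits.append(column % 10)
        carry = column // 10
        a //= 10
        b //= 10
    return digits[::-1]  # reverse at the end to read it normally

add_right_to_left(157, 286)   # [4, 4, 3]

# A left-to-right generator has no "reverse at the end" step: it must commit
# to the leading 4 before the 7 + 6 and 5 + 8 carries have been computed.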

No scratchpad

Arithmetic requires working memory. Hold the carry. Track your position. Transformers don't have this. They have attention over previous tokens, which is powerful but not the same thing.

Chain-of-thought prompting works because it externalizes the scratchpad into the token stream itself.
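
A rough illustration (a hand-written trace, not actual model output): a chain-of-thought answer to 157 + 286 writes the column work and carries into the visible text, so every later token can attend back to them.

cot_trace = """157 + 286
ones: 7 + 6 = 13, write 3, carry 1
tens: 5 + 8 + 1 = 14, write 4, carry 1
hundreds: 1 + 2 + 1 = 4, write 4
answer: 443"""
# Every carry now lives in the context window, where attention can reach it,
# instead of in working memory the model doesn't have.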

Pattern matching, not algorithms

LLMs learn that "7 + 8 =" is often followed by "15" in training data. They haven't learned the addition algorithm. They've learned a statistical approximation of its outputs.

This works for common cases and degrades for rare ones.
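
A loose analogy in code (a caricature of the behavior, not of transformer internals): a lookup table built from frequently seen examples handles the common cases and simply has no entry for the rare ones, while the algorithm covers everything.

# "Memorized" facts: dense coverage of the small sums seen constantly in training.
memorized = {(a, b): a + b for a in range(100) for b in range(100)}

def pattern_match(a, b):
    return memorized.get((a, b))   # None when this pair was never "seen"

def algorithm(a, b):
    return a + b                   # works for any operands

pattern_match(7, 8)             # 15
pattern_match(847293, 392847)   # None: outside the memorized region
algorithm(847293, 392847)       # 1240140

In practice the failure is softer than a missing entry: the model returns a plausible-looking wrong answer rather than nothing.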

Tokenization makes it worse

12345 might become ["123", "45"]. The digit structure that makes arithmetic tractable has been obscured before the model sees it.

The model must reason about positional place value across arbitrary chunk boundaries.

enc.encode("12345")   # ["123", "45"]
enc.encode("12344")   # ["123", "44"]

# To compare these numbers, the model sees:
# "123" + "45" vs "123" + "44"
# It must reconstruct digit positions from chunks.

Training data sparsity

2 + 2 = 4 appears constantly. 847293 + 392847 = 1240140 appears rarely.

The model has strong priors on small numbers and weak priors on large ones.
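
A back-of-the-envelope count (my arithmetic, not a corpus measurement) shows why coverage collapses: the space of possible problems explodes with operand length.

# Distinct ordered pairs of k-digit operands: (9 * 10**(k - 1)) ** 2.
for k in (1, 2, 3, 6):
    print(k, (9 * 10 ** (k - 1)) ** 2)
# 1 81
# 2 8100
# 3 810000
# 6 810000000000
# Single-digit sums repeat endlessly in a web-scale corpus; most six-digit
# sums never appear even once.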

The result

LLMs can do arithmetic. They just do it the hard way—through learned statistical patterns rather than algorithmic execution.

This is reliable for small numbers but unreliable for large ones, and fundamentally limited in ways that adding more parameters won't solve.