The Blank That Predicted GPT

In 1953, a psychologist deleted every fifth word from a paragraph and asked people to guess what was missing. Seventy years later, every large language model on earth runs a mechanized version of the same experiment.

[Image: ducklings crossing a crosswalk in single file, one missing from the pattern.]
Readability is not a property of words. It is a property of patterns.

Wilson Taylor did not set out to invent the conceptual ancestor of GPT. He was measuring readability. His tool was simple: take a passage, knock out words at regular intervals, hand the mutilated text to readers, count how many blanks they fill correctly. He called each successful guess a "cloze unit," borrowing from Gestalt psychology's concept of closure, the human tendency to complete incomplete patterns.

The results were striking. Cloze scores ranked passages by readability more accurately than the formula-based methods of the day. They worked across different deletion systems, different scoring methods, and different populations of readers. And they measured something the formulas could not: the degree of fit between a writer's patterns and a reader's expectations.

Taylor published his findings in Journalism Quarterly, Fall 1953. The paper is 19 pages long, methodical, and almost entirely forgotten outside of reading-comprehension research. It should not be.

. . .

The Paper

Taylor's method was disarmingly simple. Take a passage. Delete every nth word. Replace each deleted word with a blank of fixed length (so the blank size does not hint at the answer). Give the mutilated passage to readers. Count how many blanks they fill with the exact original word.
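A minimal sketch of the procedure in Python. The function names are mine, and the case-and-punctuation normalization is one reasonable reading of "exact match"; the paper itself specifies only fixed-length blanks and exact-word scoring:

```python
import re

def make_cloze(passage: str, n: int = 5, blank: str = "_____") -> tuple[str, list[str]]:
    """Delete every nth word, replacing it with a fixed-width blank
    so the blank's size gives no hint about the answer."""
    words = passage.split()
    answers = []
    for i in range(n - 1, len(words), n):
        answers.append(words[i])
        words[i] = blank
    return " ".join(words), answers

def score_cloze(guesses: list[str], answers: list[str]) -> float:
    """Exact-match scoring only; Taylor found synonym credit unprofitable.
    Case and punctuation are normalized so 'Salt.' still matches 'salt'."""
    norm = lambda w: re.sub(r"\W", "", w).lower()
    hits = sum(norm(g) == norm(a) for g, a in zip(guesses, answers))
    return hits / len(answers)

mutilated, answers = make_cloze(
    "The man is coming this way now and he is walking very fast indeed"
)
print(mutilated)
# The man is coming _____ way now and he _____ walking very fast indeed
```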

He called the procedure "cloze" and ran it through three pilot studies and two formal experiments. He tested every-fifth, every-seventh, and every-tenth deletion systems. He tested random deletion. He tested whether scoring synonyms as "half credit" improved discrimination. (It did not.) He tested whether the order of passage presentation affected scores. (It did not.)

The core finding: cloze scores ranked passages by readability with high statistical significance, and they did so more accurately than the Flesch and Dale-Chall formulas on passages that violated formula assumptions.

The Gertrude Stein Problem

Taylor's most interesting result involved a passage from Gertrude Stein's Geography and Plays. Both readability formulas rated it as among the easiest passages in the study. The Flesch method scored it as "very easy." The Dale-Chall formula placed it within the comprehension level of fourth or fifth graders.

The cloze scores told a different story. Readers could barely fill in any blanks. The passage used short, common words in short sentences, which is exactly what the formulas measure. But the word sequences were unpredictable. The patterns violated expectation at every turn. Easy words, impossible combinations.

This is the result that separates cloze from formula. Formulas count surface features: syllables, sentence length, word frequency. Cloze measures the thing that actually matters: can a reader predict what comes next? Readability is not a property of individual words. It is a property of patterns.

The Original, All 19 Pages

Taylor's paper was published in Journalism Quarterly, Volume 30, Issue 4, Fall 1953, pages 415-433. The full text is preserved here from the scanned journal pages.

. . .

From Blanks to [MASK] to Next-Token Prediction

Taylor's insight was that comprehension reveals itself through prediction. If you can guess the missing word, you have understood the pattern. The better the writing fits your expectations, the more blanks you fill. The cloze score is a measurement of pattern alignment between writer and reader.

This is exactly what a language model does.

In 2019, BERT mechanized the cloze task at scale. Select 15% of tokens at random and hide them, most behind a special [MASK] token. Train the model to predict what was hidden. The entire pre-training process is Taylor's procedure automated: here is a sentence with blanks, fill them in. The BERT paper even cites Taylor directly, noting that its masked language modeling objective "is often referred to as a Cloze task in the literature (Taylor, 1953)." It is a cloze test running billions of times over a massive corpus.
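To see the correspondence concretely, here is a one-item cloze test administered to BERT. A sketch assuming Hugging Face's transformers library and the bert-base-uncased checkpoint; the fill-mask pipeline is Taylor's item format, graded by a model instead of a reader:

```python
from transformers import pipeline

# The fill-mask pipeline takes a sentence with one blank and returns
# the model's best guesses, each with the probability it assigns.
fill = pipeline("fill-mask", model="bert-base-uncased")

for candidate in fill("Please pass the [MASK]."):
    print(f"{candidate['token_str']:>12}  {candidate['score']:.3f}")
```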

GPT took it one step further. Instead of masking random positions and predicting them, GPT masks everything to the right of the current position and predicts the next token. Every token is a blank. The entire generation process is a continuous cloze procedure, running left to right, one token at a time.
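The same idea in autoregressive form, again assuming transformers, this time with GPT-2: given a left context, the model returns a probability for every possible fill of the next blank.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tok("Merry", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits[0, -1]   # scores for the next position only
probs = torch.softmax(logits, dim=-1)

# the top guesses for the blank; "Christmas" should rank near the top
top = torch.topk(probs, 5)
for p, i in zip(top.values, top.indices):
    print(f"{tok.decode(i.item())!r}  {p.item():.3f}")
```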

The connection is not metaphorical. It is structural. Taylor measured the probability that a human could predict a missing word given surrounding context. A language model computes the probability of a token given preceding context. The mathematics differ. The principle is identical: comprehension is prediction.

What Taylor Understood That Formulas Did Not

Taylor's theoretical framework anticipated modern NLP in specific, concrete ways. He wrote about three concepts that map directly to contemporary machine learning:

Redundancy. "Man coming" means the same as "A man is coming this way now." The longer version is redundant: it encodes the same information multiple times. Taylor observed that redundancy makes cloze easier, because the missing information can be recovered from what remains. In information theory, this is Shannon's redundancy: English carries far fewer bits per symbol than its alphabet could, which is exactly what makes deleted pieces recoverable. In language modeling, this is why models can predict tokens at all: natural language is massively redundant.

Transitional probabilities. "Merry Christmas" is more probable than "Merry birthday." "Please pass the ___" is far more often completed by "salt" than by "sodium chloride" or "blowtorch." Taylor explicitly cited these conditional probabilities as the mechanism underlying cloze performance. This is the bigram model stated in plain English, decades before n-gram statistics became a workhorse of computational linguistics; a toy version appears in the sketch after these three concepts.

Dispositional language habits. Each person develops their own set of "bundles of skill sequences" that reflect the redundancies and transitional probabilities of their language experience. When a writer's habits match a reader's habits, communication is effortless. When they diverge, comprehension breaks down. This is the distributional hypothesis in embryonic form: meaning lives in usage patterns, not in words themselves.
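Taylor's transitional probabilities fit in a few lines of code. A toy sketch with an invented corpus, counting how often each word follows another:

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count transitional probabilities: how often each word follows another."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for cur, nxt in zip(words, words[1:]):
            counts[cur][nxt] += 1
    return counts

corpus = [
    "please pass the salt",
    "please pass the bread",
    "please pass the salt shaker",
]
bigrams = train_bigrams(corpus)

following = bigrams["the"]
total = sum(following.values())
for word, count in following.most_common():
    print(f"P({word!r} | 'the') = {count / total:.2f}")
# P('salt' | 'the') = 0.67
# P('bread' | 'the') = 0.33
```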

. . .

Try It Yourself

The interactive demo below implements Taylor's procedure exactly as described in the paper. Four passages of varying difficulty. Every nth word deleted. Fixed-width blanks. Exact-match scoring only (Taylor found synonyms "unprofitable"). Your cloze score for each passage measures its readability, not your vocabulary.

Pay particular attention to the Dr. Seuss passage. It uses short, common words in short sentences. A readability formula would rate it as elementary-school text. Your cloze score will tell a different story.

[Interactive demo: Taylor's (1953) cloze procedure. Fill in the blanks and compare your scores across passages.]

. . .

Why a 1953 Readability Paper Matters for LLM Practitioners

The connection from Taylor to GPT is not just historical trivia. It is a lens for understanding what language models actually do.

When GPT generates text, it is running a continuous cloze procedure. Each token is a "blank" that the model fills based on everything to its left. When the model assigns high probability to the correct next token, that is a high cloze score. When it assigns low probability, that is a low cloze score. Perplexity, the standard metric for language model quality, is aggregate cloze performance turned inside out: the exponentiated average negative log-probability of each true token given its context, so the better the model fills the blanks, the lower the perplexity.
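A sketch of that equivalence, assuming transformers and GPT-2. Passing labels=input_ids asks the model to grade its own fill for every position against the true token; exponentiating the mean negative log-likelihood gives perplexity:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Aggregate cloze performance, inverted: well-filled blanks = low perplexity."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # labels=ids scores each position against the true next token
        loss = model(ids, labels=ids).loss   # mean negative log-likelihood
    return torch.exp(loss).item()

print(perplexity("Merry Christmas to one and all"))         # predictable: lower
print(perplexity("Merry birthday to blowtorch the salt"))   # Stein-like: higher
```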

Taylor's Stein result also explains something practitioners encounter daily: models are worse at unpredictable text. Code with non-standard patterns, domain jargon, creative writing that violates convention, text in under-represented languages. These are all Gertrude Stein passages. The model's "cloze score" on them is low, because the patterns do not match what it learned during training. This is not a bug in the model. It is the cloze principle operating at scale.

Understanding the cloze principle also reframes prompt engineering. A good prompt is one that creates a highly predictable context for the model's next-token prediction. Few-shot examples work because they establish a pattern that the model can continue. Chain-of-thought works because intermediate reasoning tokens create a more predictable path to the answer. System prompts work because they constrain the distribution of likely continuations.

Every prompting technique is, at its core, a strategy for raising the model's cloze score on the desired output.
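One way to make that concrete is to measure the total log-probability a model assigns to a desired answer under different prompts. A sketch with GPT-2 as a stand-in; the helper name, the prompts, and the task are all invented for illustration:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def answer_logprob(prompt: str, answer: str) -> float:
    """Total log-probability of the answer tokens given the prompt:
    the model's 'cloze score' on the desired output."""
    full = tok(prompt + answer, return_tensors="pt").input_ids
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logprobs = torch.log_softmax(model(full).logits[0], dim=-1)
    # the logits at position p - 1 predict the token at position p
    return sum(logprobs[p - 1, full[0, p]].item()
               for p in range(n_prompt, full.shape[1]))

bare = "The capital of France is"
few_shot = ("Q: capital of Italy? A: Rome\n"
            "Q: capital of Spain? A: Madrid\n"
            "Q: capital of France? A:")

print(answer_logprob(bare, " Paris"))
print(answer_logprob(few_shot, " Paris"))  # the few-shot pattern should score higher
```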

. . .

References

  1. Taylor, W.L. (1953). "Cloze Procedure": A New Tool for Measuring Readability. Journalism Quarterly, 30(4), 415-433.
  2. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT 2019.
  3. Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners. OpenAI Technical Report.
  4. Brown, T., et al. (2020). Language Models are Few-Shot Learners. NeurIPS 2020.
  5. Petroni, F., et al. (2019). Language Models as Knowledge Bases? EMNLP 2019.
  6. Shannon, C.E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379-423.