
Language Has a Distributional Structure

In 1954, three years before Firth's famous one-line aphorism, a Penn linguistics professor named Zellig Harris published the seventeen pages of math behind it. His most famous student would build a different framework that dominated linguistics for half a century. Then the GPUs arrived, and the framework that scaled with corpus size was the teacher's.

Every word vector inside every large language model is doing something a Penn linguistics professor described in 1954, in plain English, in a journal almost no one reads. The professor's name was Zellig Harris. The journal was WORD. The paper was titled "Distributional Structure," and it ran 17 pages in volume 10, issues 2 and 3.

First page of Harris (1954) Distributional Structure as published in WORD
Read the original: Harris, Z.S. (1954). "Distributional Structure." WORD, 10(2-3), 146-162.

Most pop accounts of the distributional hypothesis credit a different linguist for a different reason. They credit J.R. Firth for the line "you shall know a word by the company it keeps," published in 1957. The line is memorable. It fits in a tweet. It is the kind of phrase that gets quoted in conference talks, embedded in PowerPoint slides, and stenciled on the walls of NLP labs.

Firth had a memorable phrase, but Harris had the argument behind it: seventeen pages of formal definitions, worked examples, and a method for empirical verification. Most pop accounts credit Firth anyway, because his line is quotable and Harris's apparatus is not, which is exactly the kind of credit allocation that shapes textbook footnotes more than it shapes the underlying technology.

Craig Trim

I went looking for the original Harris paper after I noticed how often it gets cited as a parenthetical: "(Firth 1957; Harris 1954)." Always second. Always Firth's idea, with Harris as a footnote. Then I read both. Firth's "quote" appears once, in passing, with no formal definition behind it. Harris's paper is 17 pages of careful argument with worked examples. The credit allocation is backwards.

This article is about the paper, the man, and the quiet seventy-year arc from a structural-linguistics journal nobody read to the foundation of an industry worth trillions of dollars. It is also about the strangest detail in the whole story: the man whose competing framework would dominate linguistics for fifty years was, at the time the 1954 paper was being written, sitting in Harris's seminar room as a graduate student.

. . .

The Paper

Harris opened the paper with a question, not a thesis: "Does language have a distributional structure?" He defined his terms immediately. The "distribution" of a linguistic element means the sum of all the environments in which it occurs. An "environment" is the array of co-occurring elements, each in a particular position, that surround the element to form an utterance.

That is the entire conceptual apparatus. A word's distribution is the set of contexts it appears in. The set of contexts is observable. The contexts can be counted, compared, and aggregated. Harris's claim was that this observable, countable, comparable structure was sufficient to derive everything else about the word: its grammatical class, its syntactic role, and even, with some care, its meaning.

The paper proceeds through a series of formal results. Harris shows that distribution determines whether two sounds are different phonemes or variants of the same phoneme. He shows that distribution determines morpheme boundaries. He shows that distribution determines word classes. Each level of linguistic structure can be derived, in principle, from observable patterns of co-occurrence.

Craig Trim

This is a structuralist research program in the strict sense: the position of an element in the system is the only thing that matters. Saussure had said something similar at the level of philosophy. Harris was the one who tried to make it operationally precise.

The boldest claim arrives later in the paper. Harris extended distributional analysis from grammar to meaning itself. This was the part most linguists of the time considered impossible. Bloomfield had famously refused to define meaning, treating it as outside the scope of linguistics. Bar-Hillel, a logician and friend of Harris's, considered the project doomed. Harris went ahead anyway.

The Money Quote

On page 156 of the paper, Harris stated the principle that would, six decades later, become the operating logic of every word embedding system on earth:

If we consider words or morphemes A and B to be more different in meaning than A and C, then we will often find that the distributions of A and B are more different than the distributions of A and C. In other words, difference of meaning correlates with difference of distribution.

That sentence is the operating principle of word2vec, BERT, and every embedding produced by every transformer model in production today. The mathematical operation those systems perform on a pair of word vectors, computing a distance and interpreting that distance as a semantic difference, is exactly what Harris described in English prose in 1954.

Craig Trim

Cosine similarity is just the formal version of "how different are these distributions." Levy & Goldberg (2014) proved that word2vec's skip-gram is implicitly factoring a PMI matrix, which is itself a normalized measurement of distributional difference. The math caught up to Harris's prose six decades after the fact.
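The aside above can be made concrete with a toy corpus. The sketch below is purely illustrative (every sentence and count is hypothetical, chosen to echo Harris's own example): it computes PMI from per-sentence co-occurrence counts, the same quantity Levy & Goldberg showed skip-gram implicitly factors.

```python
import math
from collections import Counter
from itertools import combinations

# Toy corpus: each "sentence" is a bag of words (hypothetical data).
corpus = [
    ["oculist", "examined", "eye"],
    ["eye-doctor", "examined", "eye"],
    ["lawyer", "filed", "brief"],
    ["oculist", "prescribed", "glasses"],
    ["eye-doctor", "prescribed", "glasses"],
]

# Count unigrams and unordered within-sentence co-occurrences.
word_counts = Counter(w for sent in corpus for w in sent)
pair_counts = Counter()
for sent in corpus:
    for a, b in combinations(sorted(set(sent)), 2):
        pair_counts[(a, b)] += 1

n_sents = len(corpus)

def pmi(a, b):
    """Pointwise mutual information: log of observed vs. chance co-occurrence."""
    p_ab = pair_counts[tuple(sorted((a, b)))] / n_sents
    p_a = word_counts[a] / n_sents  # crude per-sentence probabilities
    p_b = word_counts[b] / n_sents
    if p_ab == 0:
        return float("-inf")
    return math.log2(p_ab / (p_a * p_b))

print(pmi("oculist", "glasses"))  # positive: co-occur beyond chance
print(pmi("oculist", "brief"))    # -inf: never co-occur in this corpus
```

On real corpora the negative-infinity cells are the practical problem, which is why positive PMI (clamping at zero) is the variant most distributional pipelines actually use.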

The next page contains the worked example that, more than anything else in the paper, deserves to be famous. Harris compares the words oculist and eye-doctor:

If we consider oculist and eye-doctor we find that, as our corpus of actually-occurring utterances grows, these two occur in almost the same environments... If A and B have almost identical environments except chiefly for sentences which contain both, we say they are synonyms.

Then he compares oculist and lawyer:

If A and B have some environments in common and some not (e.g. oculist and lawyer) we say that they have different meanings, the amount of meaning difference corresponding roughly to the amount of difference in their environments.

This is the cosine-similarity demonstration that every word embedding tutorial begins with. Oculist and eye-doctor are nearby in vector space because they appear in similar contexts. Oculist and lawyer are farther apart because their contexts diverge. Harris was running this thought experiment in 1954 with no computers, no corpus tools, and no neural networks. He had pencil and paper and a careful mind.
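Harris's thought experiment is easy to rerun as arithmetic. The sketch below uses hypothetical context counts for the three words (how often each appears near a handful of context terms); cosine similarity stands in for "sameness of environments."

```python
import math

# Hypothetical context counts: how often each word appears near each context term.
contexts = ["examined", "prescribed", "glasses", "courtroom", "filed"]
vectors = {
    "oculist":    [4, 3, 5, 0, 0],
    "eye-doctor": [5, 3, 4, 0, 0],
    "lawyer":     [0, 0, 0, 6, 5],
}

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, 0.0 for no shared contexts."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine(vectors["oculist"], vectors["eye-doctor"]))  # 0.98: near-synonyms
print(cosine(vectors["oculist"], vectors["lawyer"]))      # 0.0: disjoint contexts
```

The numbers land exactly where Harris's prose predicts: near-identical environments score near 1.0, disjoint environments score 0.0, and everything in between measures "amount of difference in their environments."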

Craig Trim

The footnote on the oculist/eye-doctor example is worth its own article. Harris credits Yehoshua Bar-Hillel for the example, then notes that Bar-Hillel "considers that distributional correlates of meaning differences cannot be established." Bar-Hillel handed Harris an example to disprove the theory. Harris used it to prove the theory. There is something delightful about quietly demolishing a colleague's objection by promoting that colleague's own counterexample to evidence.

. . .

The Student

Portrait of Zellig S. Harris
Zellig S. Harris (1909-1992)
Portrait of Noam Chomsky
Noam Chomsky (1928-)

The strangest fact about "Distributional Structure" is buried in its third footnote. Harris cites a "forthcoming article by Noam Chomsky, Some Comments on Simplicity and the Form of Grammars." That Chomsky was Harris's PhD student. Chomsky had arrived at the University of Pennsylvania in 1945 and fallen under Harris's tutelage as an undergraduate. By 1954, when "Distributional Structure" appeared, Chomsky was a graduate student writing his dissertation under Harris's supervision. He would receive his PhD from Penn in 1955.

The footnote is small, but the fact behind it is not. The most influential linguist of the twentieth century was, at the moment Harris was laying out the distributional hypothesis, sitting in Harris's seminar.

Craig Trim

According to Henry Hiż, who also taught at Penn, "the primary teacher of Noam was Zellig Harris." Harris convinced Chomsky to major in linguistics in the first place. The two men's intellectual relationship was foundational to everything Chomsky would do later, even as he came to repudiate the framework.

What happened next is one of the great intellectual reversals of twentieth-century linguistics. Chomsky received his doctorate in 1955. In 1957 he published Syntactic Structures, the slim monograph that founded generative grammar. The book argued, against Harris and against the entire structuralist tradition, that distribution was insufficient. Language could not be captured by surface patterns alone. There were deep structures, transformations, innate grammatical knowledge. The mind, not the corpus, was the seat of linguistic competence.

The break was philosophical, not personal. Harris and Chomsky remained on professional terms. But the two frameworks were fundamentally incompatible. Harris said: derive structure from observable distribution. Chomsky said: posit structure as a property of mind, then test predictions against observation. Harris was an empiricist. Chomsky was a rationalist. Harris built upward from data. Chomsky built downward from theory.

In the academic linguistics of the 1960s and 1970s, this was not a fair fight. Chomsky's framework offered explanatory ambition that distributional analysis could not match. Generative grammar promised to explain why human languages are the way they are, not just to describe how they pattern. It connected linguistics to philosophy of mind, cognitive science, and the rapidly growing field of artificial intelligence. It also offered something Harris's approach did not: the prospect of understanding language without ever leaving your office to gather a corpus.

Craig Trim

Chomsky famously dismissed statistical approaches with the example "Colorless green ideas sleep furiously," arguing that the sentence is grammatical but has zero probability under any n-gram model trained on real corpora. He was right at the time. Sixty years later, Pereira (2000) showed that with the right model, the sentence is about 200,000 times more probable than its scrambled counterpart. The argument from grammaticality lost its force as the models got better.

For the next several decades, Chomsky's framework dominated academic linguistics. By the 1970s, generative grammar was the mainstream paradigm. Harris's distributional analysis was treated as a relic of the structuralist past, useful for fieldwork descriptions but unfit for the deeper theoretical questions Chomsky had reframed as the central object of study. Computational linguistics, when it existed at all, mostly tried to implement Chomsky's grammars in code. The statistical, distributional approach went into hibernation.

It would stay there for thirty years.

. . .

The Wilderness Years

From the publication of Syntactic Structures in 1957 to the publication of "Indexing by Latent Semantic Analysis" in 1990, distributional methods had a quiet, mostly underground existence. They survived in information retrieval, where Karen Spärck Jones and Gerard Salton built TF-IDF on the assumption that word frequencies carried information. They survived in lexicography, where Patrick Hanks and others used distributional patterns to compile dictionaries. They survived in the IBM speech recognition group, where Frederick Jelinek's team was discovering that statistics worked better than linguistics.

None of this was called "distributional semantics." That label would come later. At the time, it was just "what people in IR and statistics-flavored NLP do," and it was considered intellectually unfashionable. The cool work was in syntax, transformations, government and binding. The cool work was Chomsky's.

Craig Trim

Jelinek's quote, "Every time I fire a linguist, the performance of the speech recognizer goes up," is from this period. The exact wording is debated and Jelinek later softened it. But the underlying complaint was real: Chomskyan linguistics was actively unhelpful to the statistical NLP that actually worked. The distributional approach was vindicated piecemeal, in industrial labs, while academic linguistics looked the other way.

Harris himself remained productive but increasingly outside the mainstream. He continued working on operator grammar and information transformation, frameworks that built on his distributional commitments but found smaller and smaller audiences. He retired from Penn in 1979. He died in 1992, two years after Deerwester and his colleagues at Bellcore published the first paper that put real numbers behind his 1954 claims.

Harris did not live to see the resurrection. He missed it by twenty years.

. . .

The Resurrection

Latent Semantic Analysis was the first paper to translate Harris's prose into linear algebra. Deerwester, Dumais, Furnas, Landauer, and Harshman built a matrix where each row was a word and each column was a document, with the entries indicating how often each word appeared in each document. Then they ran singular value decomposition on the matrix to compress it into a lower-dimensional space. Words that appeared in similar documents ended up nearby in the compressed space. Documents about similar topics clustered together even when they shared no exact words.

This was Harris's distribution turned into a coordinate system. The "environment" of a word was now the set of documents containing it, and the "difference between distributions" was now a distance (in practice, usually cosine) in a compressed space of a hundred or so factors. The oculist/eye-doctor example, which Harris had run in his head with corpus intuitions, could now be run on actual corpora and produce actual numbers.
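Here is the LSA recipe in miniature, on a hypothetical five-word, three-document matrix (real LSA matrices run to tens of thousands of rows; the toy counts below are invented for illustration):

```python
import numpy as np

# Toy term-document matrix: rows are words, columns are documents (hypothetical).
words = ["oculist", "eye-doctor", "lawyer", "glasses", "courtroom"]
X = np.array([
    [2, 1, 0],   # oculist:    appears in the eye-care documents
    [1, 2, 0],   # eye-doctor: same documents
    [0, 0, 3],   # lawyer:     a different topic
    [2, 2, 0],   # glasses
    [0, 0, 2],   # courtroom
], dtype=float)

# SVD: X = U @ diag(s) @ Vt. Truncating to k factors compresses the space.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
word_vecs = U[:, :k] * s[:k]   # each row is a word's LSA vector

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

i = {w: idx for idx, w in enumerate(words)}
print(cosine(word_vecs[i["oculist"]], word_vecs[i["eye-doctor"]]))  # high
print(cosine(word_vecs[i["oculist"]], word_vecs[i["lawyer"]]))      # near zero
```

The compression is the point: after truncation, words that never share a document can still end up nearby if their documents overlap in vocabulary, which is how LSA retrieves documents that share no exact terms with the query.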

Year   Event
1954   Harris publishes "Distributional Structure"
1955   Chomsky earns PhD under Harris at Penn
1957   Firth publishes "you shall know a word by the company it keeps"
1957   Chomsky publishes Syntactic Structures, breaking with Harris
1965   Chomsky publishes Aspects of the Theory of Syntax
1990   Deerwester et al. publish Latent Semantic Analysis
1992   Harris dies
2003   Bengio et al. publish neural language model with learned word vectors
2013   Mikolov et al. publish word2vec; distributional hypothesis goes mainstream
2014   Levy & Goldberg prove word2vec implicitly factors a PMI matrix
2018   BERT applies distributional principle contextually
2020+  Every transformer LLM is operationalizing Harris (1954)

LSA was a starting point, not an endpoint. The 1990s and 2000s saw a steady accumulation of techniques that pushed Harris's idea further. Brown clustering grouped words by their contexts. PMI matrices quantified co-occurrence strength. Latent Dirichlet Allocation modeled topics as distributions over words and documents as distributions over topics. Each method was a different way of making "distributional similarity" computable.

Craig Trim

Church & Hanks (1990) is another candidate for "the paper that already was word2vec." They used pointwise mutual information to measure word association strength, and Levy & Goldberg later showed that this is mathematically what skip-gram learns. The 1990s were full of papers that anticipated the neural revolution by two decades. The neural revolution just made the math cheaper to run.

The watershed was 2013. Mikolov, Sutskever, Chen, Corrado, and Dean at Google published the word2vec paper, demonstrating that a shallow neural network trained on a corpus would learn word representations with an extraordinary property. The vectors supported analogical reasoning. The vector for "king" minus the vector for "man" plus the vector for "woman" was approximately the vector for "queen." The vector for "Paris" minus the vector for "France" plus the vector for "Italy" was approximately the vector for "Rome."

The result was electric. It also turned out to be Harris's principle, scaled up and made differentiable. Levy and Goldberg proved this rigorously in 2014. The word2vec skip-gram model is implicitly factoring a pointwise mutual information matrix. PMI is a normalized measurement of how much two words co-occur beyond chance, which is Harris's "amount of difference in their environments" with information theory laid on top.
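The analogy arithmetic itself is one line of vector algebra. The sketch below uses hand-built two-dimensional vectors (purely illustrative; learned word2vec vectors are 300-dimensional and far noisier) to show the mechanics:

```python
import numpy as np

# Hand-built toy vectors (hypothetical): dimension 0 ~ "royalty", dimension 1 ~ "gender".
vecs = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

def nearest(target, exclude):
    """Vocabulary word whose vector has highest cosine similarity to target."""
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return max((w for w in vecs if w not in exclude),
               key=lambda w: cos(vecs[w], target))

# king - man + woman lands on queen; the query words are excluded,
# as in the standard word2vec analogy evaluation.
result = nearest(vecs["king"] - vecs["man"] + vecs["woman"],
                 exclude={"king", "man", "woman"})
print(result)  # queen
```

Excluding the query words from the search is not a cheat unique to this toy: the original word2vec analogy benchmark does the same, because the nearest neighbor of the offset vector is very often one of the inputs.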

Every embedding trick since has been a refinement of the same approach. GloVe used global word-context co-occurrence statistics. ELMo and BERT introduced contextual embeddings, where a word's vector depends on its specific surrounding sentence rather than its average over the corpus. Modern transformers compute attention weights between every pair of tokens in a sequence, dynamically constructing distributional representations on the fly. The whole apparatus is a continuous, differentiable, thousand-dimensional version of Harris's central claim.
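A minimal version of that attention step, stripped of the learned query/key/value projections and multi-head structure that real transformers add (the token vectors are hypothetical):

```python
import numpy as np

def attention(X):
    """X: (seq_len, d) token vectors. Returns contextualized vectors in which
    each token is a similarity-weighted average of all tokens in the sequence."""
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                  # pairwise similarity
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over each row
    return weights @ X                             # weighted mix of contexts

# Toy embeddings: tokens 0 and 1 are distributionally similar, token 2 is not.
X = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0]])
out = attention(X)
# After attention, tokens 0 and 1 are pulled closer together than they started.
```

Each row of the softmax output is literally a distribution over contexts, recomputed per token per sequence, which is why contextual embeddings can give "bank" a different vector in every sentence.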

Difference of meaning correlates with difference of distribution.

. . .

What the Framework Captures

The right way to understand the modern vindication of Harris is not as proof that he was correct and Chomsky was wrong. The two men were asking different questions, and the answers run on different timescales. What changed is that one of those questions turned out to be tractable with billions of dollars of GPU compute, while the other did not. The vindication of Harris is the vindication of a framework, not the falsification of a rival.

It is worth being specific about which of Harris's claims his framework captures cleanly, because the capture is partial.

The structural claim. Harris argued that distribution is sufficient to recover linguistic structure. Modern transformer training is the strongest possible empirical confirmation: given enough text, a model with no grammatical priors learns syntax, morphology, semantic similarity, and pragmatic conventions purely from distributional patterns. The model is doing Harris's program at industrial scale, and the fact that it works at all is the most remarkable scientific result of the past decade.

The semantic claim. Harris argued that meaning differences correlate with distributional differences. Word embeddings demonstrate this almost too cleanly. Cosine similarity between embedding vectors is a reasonably good approximation of human judgments of semantic similarity, not perfect and not always interpretable, but good enough that the entire field of dense vector retrieval is built on it and works.

The methodological claim. Harris argued that you can study language without first solving the problem of meaning. You can describe distribution, then derive meaning from distribution, rather than starting with intuitions about meaning and trying to formalize them. This is exactly how language models are trained. Nobody hand-codes "what does oculist mean" into BERT, and the model learns it from contexts the same way Harris said it could be learned.

The synonymy claim. Harris's specific test for synonymy ("if A and B have almost identical environments except for sentences containing both, they are synonyms") is essentially the modern definition. Two words are synonyms in a vector space if their vectors are nearly identical. The "except for sentences containing both" caveat handles the case where two synonyms rarely appear together because they would be redundant, which is a remarkably sophisticated observation for 1954.

Craig Trim

The "except for sentences containing both" detail is the kind of thing you only notice if you actually read the paper. It anticipates the problem of "redundancy avoidance" in distributional semantics: true synonyms are often anti-correlated in usage because writers do not use both in the same sentence. Modern embedding methods quietly assume this.

What Harris Could Not Predict

Harris was not omniscient. He missed several things that modern NLP has had to figure out the hard way.

He did not anticipate the role of polysemy. A single word like "bank" has multiple meanings, and a single distributional vector is a poor representation of all of them at once. Harris's framework treats each word as having one distribution. Modern contextual embeddings (BERT, GPT) generate a different vector for each occurrence of a word, depending on its neighbors. This is a major refinement that Harris never proposed.

He did not anticipate compositional meaning at the phrase or sentence level. Distributional analysis works well for individual words. Combining word vectors into sentence vectors is harder, and progress on this front has come from neural architectures (transformers, attention) rather than from Harris's structural framework directly.

He did not anticipate the bitter lesson of scale. Harris's framework was elegant and minimal. It assumed that careful linguistic analysis would yield insight from modest amounts of data. The actual lesson of modern NLP is that brute-force scale beats elegance: train on a trillion tokens with a simple objective and the embeddings emerge automatically. Harris would, I think, have been pleased by the result and surprised by the means.

He did not anticipate the limits of his own framework. A purely distributional system has no grounding in the world. It knows that "oculist" and "eye-doctor" pattern alike, but it does not know what an eye is. Modern multimodal systems are starting to address this by joining text distributions with image distributions, but the grounding problem remains live.

. . .

Why Firth Got the Quote (And Harris Did Not)

If Harris's paper is so much more complete than Firth's one-line aphorism, why does Firth get cited first in every NLP textbook? Three reasons, all of them depressing in roughly equal measure.

Quotability. Firth wrote a sentence you can stencil on a wall, while Harris wrote a 17-page argument with formal definitions and footnotes. The compression ratio favors Firth, and pop accounts of any field tend to converge on whoever produced the most repeatable line, even when the substantive contribution belongs elsewhere.

Geography and reputation. Firth was the founder of the London School of linguistics, with an established name and an institutional circle around him. Harris was a Penn structuralist whose framework was about to be eclipsed by his own student. By the time distributional methods came back into fashion, Firth's reputation had survived and Harris's had faded, and the cited author is often the one whose name still rings a bell.

Convenience. When word2vec needed an intellectual ancestor, the field reached for the shortest plausible quote, and Firth had it. A footnote citation to Firth (1957) is faster to write than a paragraph explaining what Harris (1954) actually said, and the path of least resistance won.

Craig Trim

This is a general pattern in academic credit allocation. The person whose phrase travels well gets cited as the originator, and the person whose framework actually does the work gets cited as a "see also." Compare Bayes and Laplace: Bayes wrote one short note about inverse probability, Laplace built the entire theory of statistical inference, but the field is called Bayesian.

This is not a complaint. It is an observation about how attribution actually works. The right response to discovering it is not to demand that everyone start citing Harris instead. The right response is to read both papers, see the full picture, and credit the contribution that fits the context.

For the record: Firth had the quote, Harris had the math, both deserve to be in the citation. The standard footnote "Firth 1957; Harris 1954" is fine. The thing to remember is that the second name in that footnote is the one whose paper actually built the theory.

. . .

Why This Matters Now

There is an obvious historical pleasure in this story: the obscure 1954 paper that secretly powered a trillion-dollar industry. There is a sharper professional lesson underneath it.

Modern NLP is not just doing Harris's program. It is doing it in a particular way that has consequences. Every word embedding is a hypothesis about meaning that is grounded entirely in textual co-occurrence. The embedding does not know what an eye is. It knows that "eye" appears near "see" and "blink" and "color" and "doctor." Its understanding of "eye" is a position in a high-dimensional space defined by the company "eye" keeps in a corpus.

This is exactly what Harris said language is. He was making an empirical claim about how linguistic structure could be derived. He was not making a metaphysical claim about how meaning ultimately works. There is a difference between "you can recover synonymy from distributional patterns" and "synonymy is identical to distributional pattern." Harris stopped at the first claim. Some commentators have leapt to the second. The leap is unjustified by the paper, and the failure modes of modern language models (hallucination, lack of grounding, confident wrong answers) are exactly what you would expect from a system that has only the first kind of knowledge.

When a language model confidently misstates a fact, it is performing Harris's procedure faithfully on a corpus that contained the fact incorrectly. When a language model confuses two similar words in different contexts, it is doing exactly what Harris's framework predicts: words with overlapping distributions are semantically similar, and similarity is not identity. The systems are operationalizing Harris well. The expectations we put on them often exceed what Harris's framework can actually provide.

Craig Trim

This is the part of the story that should make practitioners careful. Distributional methods are powerful precisely because they are bounded. They tell you about co-occurrence patterns. They do not tell you about the world. When you build a system that depends on a language model knowing facts, you are betting that the corpus encoded the facts correctly and the model recovered the encoding. Both bets fail more often than people expect.

The other reason this matters is for practitioners trying to evaluate where the field is going. The current moment in NLP looks like a series of architectural breakthroughs, GPT, Claude, Gemini, the transformer, attention, multimodal models. But underneath the architectural variation, the underlying theory of meaning has not changed since Harris articulated it in 1954. Every model is a different implementation of "difference of meaning correlates with difference of distribution." The progress has come from running this principle on more data, with more compute, in more flexible computational structures.

. . .

The Debate Is Still Live

It would be easy to read this story as resolved: the empiricist won, the rationalist lost, the GPUs settled it. That reading is wrong. The reason it is wrong is that some of the people best positioned to evaluate the situation are now publicly betting against the framework that scaled.

In March 2023, Chomsky co-authored an op-ed in the New York Times with Ian Roberts and Jeffrey Watumull titled "The False Promise of ChatGPT." The piece argued that LLMs are "a lumbering statistical engine for pattern matching" and that the human mind, by contrast, "is a surprisingly efficient and elegant system that operates with small amounts of information; it seeks not to infer brute correlations among data points but to create explanations." The complaint is that LLMs can describe and predict but cannot explain. They can tell you what the corpus says happened, but not what would happen if conditions changed, and not why.

First page of Chomsky, Roberts, and Watumull (2023) The False Promise of ChatGPT
Read the original: Chomsky, N., Roberts, I., & Watumull, J. (2023). "The False Promise of ChatGPT." The New York Times, March 8, 2023.

This is exactly the limit Harris's framework predicts. Distribution captures co-occurrence patterns, and co-occurrence patterns capture an enormous amount of structure, but they do not encode causal models of the world. Chomsky is making a Chomsky-shaped argument against LLMs, and the argument is internally consistent with the framework he has defended for sixty years. Whether it is also correct is the part that matters.

The same suspicion arrived from a different direction in 2021, when Emily Bender, Timnit Gebru, Angelina McMillan-Major, and Margaret Mitchell published "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" The phrase "stochastic parrot" stuck. It describes a system that probabilistically reassembles patterns from its training data without any model of what the patterns refer to. The paper is not Chomskyan in the strict sense, but Bender is a linguist, and the linguistic grievance behind the parrot metaphor has a long history that runs through Chomsky's tradition. The Chomsky op-ed and the Bender paper are different arguments aimed at the same target.

Cartoon of a scholarly parrot in a bow tie perched on a stack of books in a library
A stochastic parrot, after Bender et al. (2021). It has read everything and understood nothing.

The Chomsky position also has practical company in industry. In November 2025, Yann LeCun left Meta after twelve years as chief AI scientist. LeCun had spent years arguing publicly that LLMs were architecturally insufficient for what he calls "advanced machine intelligence," and his position grew harder to maintain as Meta restructured around commercial LLM products under Meta Superintelligence Labs. Within months he had raised $1.03 billion at a $3.5 billion pre-money valuation for AMI Labs, a startup explicitly built on the bet that "world models," systems that learn from grounded physical interaction rather than text prediction, are a better path forward than scaling autoregressive language models.

LeCun is not Chomsky. He is not arguing that language is innate or that grammars are generative in the technical sense. But the structural shape of his bet is in the same family. Both men believe that distribution alone, no matter how cleverly compressed or scaled, will hit a ceiling that requires a different kind of representation to break through. The Meta exit is the most expensive philosophical disagreement in the history of AI research, and it is a disagreement about whether Harris's framework, scaled to a trillion parameters and a trillion tokens, is enough.

Meta itself, along with OpenAI, Anthropic, and Google DeepMind, is in the Harris camp by default. Their bet is that more data and more compute applied to next-token prediction will continue to produce results that look increasingly like understanding, even if the underlying mechanism is still distributional. So far, the bet has paid out spectacularly. Whether it continues to pay out at the same rate is the open question on which the LeCun exit and the Chomsky op-ed are different ways of placing the same wager.

Craig Trim

The pleasing symmetry: in the 1950s and 1960s, Harris's framework was the unfashionable underdog and Chomsky's was the establishment. In 2026, Chomsky is the unfashionable underdog and the distributional framework is the establishment. The roles flipped, but the underlying disagreement is unchanged. Both men are still arguing that the other's approach misses the part of language that matters most.

. . .

Where the Framework Runs Out

If you want to know where current LLM systems will hit a wall, look for the parts of language that distribution alone cannot capture: grounded reference to physical objects, causal structure, counterfactual reasoning about facts the corpus never recorded, the ability to construct an explanation rather than retrieve one. These are exactly the places Chomsky's tradition has been pointing at since 1957, and they remain unresolved.

Harris's framework was never meant to handle them. His claim was empirical, not metaphysical: distribution recovers linguistic structure; it does not exhaust what language is. The systems are operationalizing Harris faithfully. The expectations placed on them often exceed what his framework was ever designed to deliver.

For now, every time a transformer computes attention weights and every time a vector database finds nearest neighbors, a 1954 linguistics paper is being executed in silicon. The student's questions have not been answered. They have been postponed by a framework that turned out to scale.

Cartoon contrasting Harris reading calmly in an armchair with Chomsky buried in stacks of books and papers
Harris's empiricism (left) and Chomsky's rationalism (right).
. . .

References

  1. Harris, Z.S. (1954). "Distributional Structure." WORD, 10(2-3), 146-162.
  2. Firth, J.R. (1957). "A Synopsis of Linguistic Theory, 1930-1955." In Studies in Linguistic Analysis, pp. 1-32. Philological Society, Oxford.
  3. Chomsky, N. (1957). Syntactic Structures. Mouton, The Hague.
  4. Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., & Harshman, R.A. (1990). "Indexing by Latent Semantic Analysis." Journal of the American Society for Information Science, 41(6), 391-407.
  5. Church, K.W. & Hanks, P. (1990). "Word Association Norms, Mutual Information, and Lexicography." Computational Linguistics, 16(1), 22-29.
  6. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., & Dean, J. (2013). "Distributed Representations of Words and Phrases and their Compositionality." NeurIPS 2013.
  7. Levy, O. & Goldberg, Y. (2014). "Neural Word Embedding as Implicit Matrix Factorization." NeurIPS 2014.
  8. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). "BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding." NAACL-HLT 2019.
  9. Sahlgren, M. (2008). "The Distributional Hypothesis." Italian Journal of Linguistics, 20(1), 33-53.
  10. Pereira, F. (2000). "Formal Grammar and Information Theory: Together Again?" Philosophical Transactions of the Royal Society A, 358(1769), 1239-1253.
  11. Chomsky, N., Roberts, I., & Watumull, J. (2023). "The False Promise of ChatGPT." The New York Times, March 8, 2023.
  12. Bender, E.M., Gebru, T., McMillan-Major, A., & Mitchell, M. (2021). "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" Proceedings of FAccT '21, 610-623.
  13. Heaven, W.D. (2026). "Yann LeCun's new venture is a contrarian bet against large language models." MIT Technology Review, January 22, 2026.