
The Paper That Funded a Fortune

In 1992, five researchers at IBM Yorktown Heights published a twelve-page paper on grouping English vocabulary into classes. Two of the authors would walk out of that group and help build the most profitable hedge fund in history. The algorithm they left behind became the standard NLP feature for fifteen years, the conceptual ancestor of word2vec, and a footnote in textbooks that mostly skip past it.

Assumed background: Pointwise Mutual Information and the Independence Baseline. That companion piece covers PMI, the independence null hypothesis, and the aggregate mutual information of a bigram matrix. This article treats those as given.

The paper is called "Class-Based n-gram Models of Natural Language." It ran twelve pages in volume 18, issue 4 of Computational Linguistics, December 1992. The byline reads Peter F. Brown, Vincent J. Della Pietra, Peter V. deSouza, Jenifer C. Lai, and Robert L. Mercer.

First page of Brown et al. (1992) Class-Based n-gram Models of Natural Language as published in Computational Linguistics
Read the original: Brown, P.F., Della Pietra, V.J., deSouza, P.V., Lai, J.C., & Mercer, R.L. (1992). "Class-Based n-gram Models of Natural Language." Computational Linguistics, 18(4), 467-479.

All five worked in the speech recognition group at IBM Thomas J. Watson Research Center under Frederick Jelinek. Within a year of publication, Brown and Mercer would leave IBM for a small Long Island hedge fund called Renaissance Technologies, where they would help build a trading system that turned the firm into a printing press for the next three decades.

This article is about the paper, the people, and the strange double life of an algorithm that quietly powered NLP feature engineering through the 2000s while its authors quietly bought half of East Setauket. It is also about a question worth asking when you read any 1990s statistical NLP paper: how much of what we call modern word embeddings was already on the page, waiting for the GPUs to catch up?

. . .

The Paper

Brown and his coauthors started with a problem that the IBM speech group had been wrestling with for a decade: n-gram language models do not have enough data. A trigram model needs to estimate the probability of every word given every two-word history. For a vocabulary of 260,000 words, the number of possible trigrams is roughly seventeen quadrillion. Even with hundreds of millions of training words, the overwhelming majority of those trigrams are never observed. The model has to either smooth aggressively or back off to shorter contexts, and both options leak information.
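
To make the gap concrete, here is the back-of-envelope arithmetic in a few lines of Python, using the vocabulary and corpus sizes reported in the paper (a 260,741-word vocabulary, 365 million words of training text); the script is only illustrative.

```python
# Trigram sparsity, back of the envelope.
V = 260_741                  # vocabulary size from Brown et al. (1992)
possible = V ** 3            # distinct word trigrams the model must cover
observed_max = 365_000_000   # a corpus of N tokens contains at most ~N trigram instances

print(f"possible trigrams : {possible:.2e}")                 # ~1.8e16
print(f"observed at most  : {observed_max:.2e}")             # ~3.7e8
print(f"coverage ceiling  : {observed_max / possible:.1e}")  # ~2e-8
```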

97.6% of trigrams in a 358-word famous passage appear exactly once.
Tolkien and Dickens share only 3.2% of their trigrams.

The Brown team proposed a different escape route. Instead of estimating probabilities for individual words, group the words into classes and estimate probabilities for the classes. A trigram of classes is much more likely to have been observed, because the number of distinct classes is small. Each word's contribution to the model becomes the product of two simpler quantities: the probability of its class given the previous classes, and the probability of the word given its class. The vocabulary problem becomes tractable because the model is no longer trying to remember every word in every context. It is trying to remember every kind of word in every kind of context.
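
In symbols, for the bigram case and writing c(w) for the class of word w, the factorization the paper uses is

\[
P(w_i \mid w_{i-1}) \;=\; P\bigl(c(w_i) \mid c(w_{i-1})\bigr)\,P\bigl(w_i \mid c(w_i)\bigr)
\]

With 1,000 classes and a 260,741-word vocabulary, the first factor needs on the order of a million class-transition parameters instead of the tens of billions a word-bigram table would need, and the second factor needs just one emission probability per word.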

The interesting question is which classes to use. The obvious answer in 1992 was: borrow them from a linguist. Use parts of speech, semantic categories, hand-built ontologies. The Brown team rejected the obvious answer. They wanted to derive the classes from the data, in the same spirit Zellig Harris had described thirty-eight years earlier in "Distributional Structure": words that appear in similar contexts belong to the same class, and the contexts can be observed without any prior linguistic theory.

What Brown and his coauthors added to Harris's idea was a numerical objective: the aggregate mutual information of the class bigram distribution, treated as a quantity to be maximized. A good clustering preserves as much of the original word-by-word predictability as possible after the words have been replaced by class labels. Cluster well and the class transitions stay informative. Cluster badly and they collapse toward independence, which is exactly the null hypothesis the bigram matrix was supposed to be telling you a story against.
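
In the notation of the companion piece, the objective is the average mutual information of adjacent classes: the expectation, over class bigrams, of the same log ratio that defines PMI for word bigrams,

\[
I(C_1; C_2) \;=\; \sum_{c_1,\,c_2} p(c_1, c_2)\,\log\frac{p(c_1, c_2)}{p(c_1)\,p(c_2)}
\]

where p(c1, c2) is the probability that a randomly drawn bigram has its first word in class c1 and its second word in class c2. A clustering that destroys all predictability drives every term to zero, which is the independence baseline.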

The Algorithm

The procedure is hierarchical agglomerative clustering driven by that mutual information criterion. Start with each word in its own class. At every step, find the pair of classes whose merger costs the least mutual information, and merge them. Continue until you have the desired number of classes, or continue all the way down to a single class and keep the merge tree as a binary hierarchy. Every word ends up at a leaf of the tree, and the path from the root to the leaf is a binary code that places the word inside a sequence of progressively more specific groupings.
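
The Python sketch below is a deliberately naive rendering of that loop, written for readability on toy corpora. The paper, and later implementations such as Liang's, use bookkeeping tricks so the objective is not recomputed from scratch for every candidate merge, and real runs restrict attention to the most frequent words first; none of that is reproduced here.

```python
from collections import Counter
from itertools import combinations
import math

def brown_clusters(tokens, n_clusters):
    """Greedy agglomerative Brown clustering over a list of word tokens.

    Naive version: every candidate merge re-evaluates the full objective,
    so it is only practical on toy data, but the objective itself is the
    aggregate mutual information of class bigrams from the 1992 paper."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(bigrams.values())
    clusters = {i: {w} for i, w in enumerate(sorted(set(tokens)))}

    def class_bigram_mi(clusters):
        # Aggregate MI of the class-bigram distribution induced by `clusters`.
        word2class = {w: c for c, members in clusters.items() for w in members}
        joint, left, right = Counter(), Counter(), Counter()
        for (w1, w2), n in bigrams.items():
            c1, c2 = word2class[w1], word2class[w2]
            joint[(c1, c2)] += n
            left[c1] += n
            right[c2] += n
        return sum(
            (n / total) * math.log2(n * total / (left[c1] * right[c2]))
            for (c1, c2), n in joint.items()
        )

    merges = []  # (surviving class, absorbed class) history: the binary tree
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(clusters, 2):
            trial = dict(clusters)
            trial[a] = clusters[a] | trial.pop(b)
            mi = class_bigram_mi(trial)
            if best is None or mi > best[0]:
                best = (mi, a, b)
        _, a, b = best
        clusters[a] = clusters[a] | clusters.pop(b)
        merges.append((a, b))
    return clusters, merges
```

Running it all the way down to a single class and recording the merge order gives the binary tree the paper describes; the path from the root to a word's leaf is its bit string.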

Brown clustering running step by step on the Jane Eyre bigram matrix, with the inspection panel on the right.

Brown and his coauthors trained the algorithm on 365 million words of running text from a 1988 to 1989 sample of Associated Press news wire. They produced a thousand-class partition of the most frequent 260,741 words. Then they printed a sample of the classes in the paper, and the sample is the part of the paper that has done the most to keep it alive in collective memory.

. . .

The Famous Tables

This is the page that anyone who has ever taught a statistical NLP course has shown to a room of students at least once. The classes are not labeled. Nobody told the algorithm what a day of the week was, or what a month was, or what a unit of time was, or what a relative was. The algorithm received only one input: how often each word appeared next to each other word in 365 million words of news copy. The categories fell out anyway.

Cluster theme (post hoc label) | Members
Days of the week | Friday Monday Thursday Wednesday Tuesday Saturday Sunday
Months | June March July April January December October November September August
Human collectives | people guys folks fellows CEOs chaps doubters commies unfortunates blokes
Substances | water gas coal liquid acid sand carbon steam shale iron
Size adjectives | great big vast sudden mere sheer gigantic lifelong scant colossal
Directions | down backwards ashore sideways southward northward overboard aloft downwards adrift
Family and roles | mother wife father son husband brother daughter sister boss uncle
Personal names | John George James Bob Robert Paul William Jim David Mike
Units of measure | feet miles pounds degrees inches barrels tons acres meters bytes
Titles and officials | director chief professor commissioner commander treasurer founder superintendent dean custodian
Reproduced from Brown et al. (1992), Table 6. The labels in the first column are mine, not the algorithm's.

The classes are not perfect. Some are clean (days, months, personal names, units of measure). Some are noisy in interesting ways. The "family and roles" cluster puts mother and brother next to boss and uncle, because in news copy these words all occur in the same possessive frame ("his mother", "his boss"). The "substances" cluster groups water with shale and iron, because the algorithm has no chemistry, only context. The "directions" cluster is arguably the strangest of the clean ones: it collects motion adverbs like ashore, aloft, and adrift, a category that no hand-written ontology would have thought to carve out, and that turns out to matter when you are modeling verbs of movement.

The honest description of what the table shows is this: a thousand-way clustering of vocabulary, derived without supervision from a large news corpus, recovered most of the obvious semantic categories that a careful linguist would have hand-coded, and produced a few bonus categories that were genuinely informative. It did this in 1992.

And it still does it now. The demo below reruns the Brown 1992 algorithm on 415 million tokens of Associated Press newswire text, roughly the same corpus size the paper used, and the same clusters fall out. Drag the K slider to change the granularity; click any word to see its binary path through the merge tree and its nearest neighbors.

Brown 1992 clustering of the top 2,000 words over 415 million tokens of AP newswire. Same algorithm, modern corpus, same kinds of categories. Pipeline and full artifact at github.com/craigtrim/brown-clustering.
. . .

What Brown Clusters Actually Did for the Field

Most papers from 1992 had the lifespan of cut flowers. Brown clusters had the lifespan of an institution. Between 1992 and roughly 2013, "use Brown clusters as features" was one of the standard moves in any structured prediction task in NLP. The reason was not theoretical elegance. The reason was that it worked, on small training sets, with no fuss.

The pattern looked like this. You had a labeled corpus for some task: named entity recognition, part-of-speech tagging, dependency parsing, chunking. The labeled corpus was small, often a few hundred thousand tokens. You also had access to a much larger unlabeled corpus, hundreds of millions of words. You ran Brown clustering on the unlabeled corpus, producing a class label for every word in your vocabulary. Then you trained your supervised model with two new kinds of features for each word: its Brown cluster, and various prefixes of its cluster's binary code. The cluster gave the model coarse semantic information that it could not have learned from the small labeled set alone. The prefixes gave the model the ability to back off from a fine cluster to a coarser one when the fine cluster was rare.
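
A minimal sketch of that feature template, with hypothetical feature names; the specific prefix lengths follow the 4/6/10/20 convention used in later work such as Turian et al. (2010), not anything mandated by the 1992 paper.

```python
def brown_features(bit_string, prefix_lengths=(4, 6, 10, 20)):
    """Cluster features for one token, given its binary path in the merge
    tree (e.g. '0110111010'). The full path names the fine cluster; shorter
    prefixes name coarser ancestors, so a rare fine cluster can back off to
    a well-observed coarse one."""
    feats = {f"brown_full={bit_string}"}
    for k in prefix_lengths:
        feats.add(f"brown_prefix_{k}={bit_string[:k]}")
    return feats

# e.g. brown_features("0110111010") for the token "Tuesday"
# -> {'brown_full=0110111010', 'brown_prefix_4=0110',
#     'brown_prefix_6=011011', 'brown_prefix_10=0110111010',
#     'brown_prefix_20=0110111010'}
```

In a CRF tagger or a discriminative parser, each of these strings becomes one more indicator feature alongside the usual word and tag templates.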

The pivotal demonstration was Koo, Carreras, and Collins (2008), "Simple Semi-supervised Dependency Parsing." They added Brown cluster features to a discriminative dependency parser and reported large gains across English and Czech, with the gains concentrated on words that the labeled training data had rarely or never seen. The paper was published at ACL, won broad attention, and helped establish Brown clusters as the default semi-supervised feature in structured NLP. Liang's master's thesis at MIT (2005) had already shown the same pattern for named entity recognition. Turian, Ratinov, and Bengio (2010) compared Brown clusters head to head against neural word embeddings on the same NER tasks, and Brown clusters held their own. In 2010 the question of whether neural embeddings were going to displace Brown clusters was an open empirical question. It was not yet obvious which side would win.

Word2vec changed the answer in 2013. The Mikolov paper showed that a shallow neural network trained for a few hours on a billion tokens produced dense vectors with arithmetic structure ("king" minus "man" plus "woman" approximately "queen") that no Brown cluster could match. The dense representation also slotted into deep learning pipelines more naturally than a discrete cluster ID. Within two years, "Brown clusters" had stopped appearing in the features section of new papers. Within five years, the technique was a footnote in textbooks. The displacement was so quick that an entire generation of NLP practitioners now finishes a PhD without ever running Brown clustering on anything.

Year | Event
1988 | Jelinek's IBM speech group hits the limits of pure n-gram smoothing
1990 | Brown et al. publish "A Statistical Approach to Machine Translation", the first IBM statistical MT paper
1992 | Brown et al. publish "Class-Based n-gram Models of Natural Language"
1993 | Robert Mercer leaves IBM for Renaissance Technologies
1993 | Peter Brown follows Mercer to Renaissance
2005 | Liang demonstrates Brown clusters for NER in his MIT thesis
2008 | Koo, Carreras, and Collins establish Brown clusters as default parsing features
2010 | Brown and Mercer become co-CEOs of Renaissance Technologies
2010 | Turian et al. compare Brown clusters with neural embeddings, head to head
2013 | Mikolov et al. publish word2vec; the displacement begins
2017 | Brown clusters effectively gone from new NLP papers
2018 | Mercer steps down as Renaissance co-CEO under public pressure
2026 | Renaissance still trades; the 1992 paper still gets a few citations a year
. . .

The Speech Group That Became a Hedge Fund

The byline of Brown et al. (1992) is also the partial roster of the most consequential career switch in the history of statistical natural language processing. In the early 1990s, the IBM Yorktown speech group was the best statistical NLP team in the world, by some distance. They had built the first serious statistical machine translation system. They had built the first speech recognizer that broke the 5,000-word vocabulary ceiling. They had pioneered the use of mutual information, EM algorithms, and bilingual alignment in a field that had been dominated for thirty years by handwritten grammars. And they were paid like research scientists at IBM in the early 1990s, which is to say competently and not extraordinarily.

Then a man named James Simons noticed.

Simons was a Cold War codebreaker turned mathematician turned hedge fund founder. He had run the math department at Stony Brook before starting Renaissance, and he had a clear thesis: the people who could make money in markets were the people who could find statistical regularities in noisy time series. Speech recognition is statistical regularities in noisy time series. The IBM speech group was the largest concentration of that talent in the country, working on a problem with adjacent mathematics and worse pay.

Simons recruited Robert Mercer first. Mercer joined Renaissance in 1993, less than a year after the class-based n-gram paper appeared, and Peter Brown joined the same year. Vincent Della Pietra and Stephen Della Pietra (Vincent's brother, also from the IBM group) followed. The migration was so complete that for a stretch in the mid-1990s, more former IBM speech researchers were working on the Medallion Fund than on the speech recognizer that had nominally given them their reputations.

What they built at Renaissance was the Medallion Fund, which from 1988 to 2018 produced average annual returns north of 39 percent net of fees, the most consistent excess return in the recorded history of investing. The fund's actual mechanics are secret. The publicly known framing is that Medallion treats market data the way the IBM speech group treated audio data: as a noisy signal whose underlying structure can be modeled statistically, with the modeling driven by huge amounts of data and tiny improvements in predictive accuracy compounded over millions of trades. The connection from the Brown 1992 paper to the trading floor is not direct (nobody was clustering vocabulary at Renaissance), but it is methodological and it is real. The same instinct that says "derive the categories from the data, do not impose them in advance" is the instinct that built both products.

Brown and Mercer became co-CEOs of Renaissance in 2010. Both are now, by most estimates, worth several billion dollars apiece. Simons died in 2024 a multi-billionaire. The five-name byline on a 1992 NLP paper turned out to be the seed crystal for one of the largest fortunes ever to grow out of academic computer science.

The historical pleasure of the migration is real, but the more important point is what it tells us about the people who wrote the paper. They were not academic linguists or theory-builders. They were applied probabilists with engineering instincts, working in an industrial lab on a problem that had not yet been solved. They were the kind of people who would notice that a clustering algorithm with a clean information-theoretic objective and a tractable approximation could be useful in domains far beyond the one that motivated it.

. . .

What Brown Captured, and What It Could Not

The temptation when telling this kind of story is to claim that Brown clustering "was already word2vec." It was not, and the differences matter. Brown clusters and dense embeddings are doing related things in different ways, and the differences explain why one displaced the other.

Discrete versus continuous. A Brown cluster is a hard categorical assignment. A word belongs to exactly one class. A word2vec vector is a point in a 300-dimensional continuous space, and every operation downstream of the embedding is differentiable. The discrete assignment is easier to interpret and easier to use as a feature in a log-linear model. The continuous representation is easier to plug into a neural network and easier to compose with other vectors. As soon as the rest of the pipeline went neural, the continuous representation won by default, regardless of accuracy on any specific task.

One sense per word. Brown clusters assume that each word has one cluster, and therefore one distributional profile. Polysemy is invisible. Bank goes into one class even though it means river edge in some sentences and financial institution in others. Word2vec inherited this limitation. BERT and the other contextual embeddings finally addressed it in 2018 by computing a different vector for each occurrence of a word, depending on the surrounding sentence. Brown clustering could be extended to handle polysemy in principle, but in practice nobody seriously tried, and the extension would have been ugly inside the discrete framework.

Local context only. Brown clustering uses bigram statistics. The class of a word depends on its immediate left and right neighbors, nothing more. Word2vec uses a small window (typically five tokens), which is also local. Modern transformers attend over the entire input, which is global. The progression from bigram to small window to full attention is a steady widening of the context that gets folded into the representation. Brown was the narrowest of the three, and the narrowness shows up as semantic categories that are dominated by syntactic frame ("the X said") rather than topical content.

No analogical structure. The most striking property of word2vec was vector arithmetic: king - man + woman = queen. Brown clusters cannot do this. The cluster IDs are categorical labels with no internal geometry. The hierarchical merge tree has some structure, but it is a tree and not a vector space, and you cannot subtract one branch from another. The arithmetic property of dense embeddings was a genuinely new affordance that Brown clustering had no path to.

What Brown captured cleanly was the part Harris had already described in 1954: words that occur in similar contexts belong together, and the grouping can be derived from observation alone. The 1992 paper was the first time anyone made that claim numerically precise, ran it on a corpus of meaningful size, and showed that the resulting groupings looked like meaningful semantic categories. Everything that came afterward, including word2vec, GloVe, ELMo, and BERT, is a refinement of that move along axes (continuous, contextual, deeper, larger) that Brown clustering did not happen to choose.

. . .

Why It Still Matters

There are three things in Brown 1992 that a working NLP practitioner in 2026 should still take seriously, even if they will never run the algorithm itself.

The first is a refusal to ask a linguist. Brown and his coauthors did not begin with a linguistic theory of word classes. They started with the data, defined a numerical objective, optimized it, and let the categories emerge. The categories were not perfect, but they were good enough to be useful in production NLP for fifteen years.

The same instinct, applied at vastly larger scale to a vastly more flexible function class, is what made transformer language models work. Brown 1992 is one of the cleanest early demonstrations that statistical induction beats hand-crafted ontology in NLP, and the demonstration appeared in print before "The Mathematics of Statistical Machine Translation" (Brown et al., 1993), the IBM translation paper that usually gets the credit for the same lesson.

The second is the value of a tractable objective. The mutual information criterion in Brown clustering is not the only objective you could imagine for unsupervised word grouping, but it is the one that admits a clean optimization story. The team spent most of the paper on the optimization, because they understood that an objective you cannot compute is not really an objective.

This is a discipline that is easy to forget in an era when stochastic gradient descent on a billion parameters is a ten-line PyTorch script. Brown 1992 is a useful reminder that the design of the loss matters more than the size of the model, and that the right loss is the one you can actually drive to a meaningful minimum with the compute you have.

The third is the question of where the categories come from. Modern language models do not produce explicit word classes. They produce token embeddings that are implicitly partitioned into regions of vector space, and the partition is opaque.

Brown clusters had the opposite property: the categories were explicit, inspectable, and interpretable. You could read the cluster, name it, argue about it, and use the argument to debug the corpus. The interpretability disappeared along with the discrete representation, and we are slowly relearning that we miss it. Mechanistic interpretability research, sparse autoencoders, dictionary learning, the work on monosemantic features at Anthropic, all of it is in some sense an attempt to recover what Brown clustering had for free in 1992: a list of categories you can read.

Brown 1992 is not the most influential paper in the history of NLP. It is not in the top ten. It is, however, in the small set of papers that anyone who wants to claim to understand the lineage of word embeddings needs to read carefully. The line from Harris (1954) to Brown (1992) to Mikolov (2013) is straight. The Brown stop on the line is the one most often skipped, and it is the one where the abstract idea of distributional similarity first turned into a concrete algorithm with output that you could photocopy and tape to a wall.

The other reason to read it is historical. The paper is a snapshot of the moment when the statistical approach to language stopped being a niche commitment of a few IBM researchers and started being a method that worked. Five years before the paper, the speech recognition community was still arguing about whether linguistic theory was necessary. Five years after the paper, almost everyone working on the problem had quietly conceded that it was not.

. . .

Coda: Two Career Paths

The five names on the byline of Brown et al. (1992) split, after publication, into two distinct career paths. The first path led to Renaissance Technologies and the Medallion Fund. The second path stayed in research, ran the algorithm a few more times, watched it become a standard NLP feature, and watched it get displaced by word2vec.

A professor in a tuxedo holding a champagne glass: from clustering words to clustering wealth.

Both paths trace back to the same paper, and the paper is a better starting point for both than either path is for the other. The lesson the practitioners took from it was about transfer: a method good enough to find statistical regularities in 1992 newswire text was good enough to find statistical regularities in 1993 commodity prices.

The lesson the researchers took from it was about features: a fast clustering algorithm with an information-theoretic objective could give you free generalization on small labeled corpora, and that was useful to anyone building a tagger or parser through the 2000s.

Neither lesson is wrong. Both lessons are smaller than the paper. The thing the paper actually says is that if you write down a clean objective and put enough data behind it, the categories that fall out will look an awful lot like the categories you would have invented by hand, and that this fact has consequences. Some of the consequences turn into NLP textbooks. Some of them turn into hedge funds. Some of them turn into the entire research program that powers GPT and Claude and the system you used yesterday to draft your last email.

All of them start in the same place: five names on a byline at IBM Yorktown, in 1992, writing a paper about how to group words.

. . .

References

  1. Brown, P.F., Della Pietra, V.J., deSouza, P.V., Lai, J.C., & Mercer, R.L. (1992). "Class-Based n-gram Models of Natural Language." Computational Linguistics, 18(4), 467-479.
  2. Harris, Z.S. (1954). "Distributional Structure." WORD, 10(2-3), 146-162.
  3. Brown, P.F., Cocke, J., Della Pietra, S.A., Della Pietra, V.J., Jelinek, F., Lafferty, J.D., Mercer, R.L., & Roossin, P.S. (1990). "A Statistical Approach to Machine Translation." Computational Linguistics, 16(2), 79-85.
  4. Liang, P. (2005). "Semi-Supervised Learning for Natural Language." Master's thesis, Massachusetts Institute of Technology.
  5. Koo, T., Carreras, X., & Collins, M. (2008). "Simple Semi-supervised Dependency Parsing." Proceedings of ACL-08: HLT, 595-603.
  6. Turian, J., Ratinov, L., & Bengio, Y. (2010). "Word Representations: A Simple and General Method for Semi-Supervised Learning." Proceedings of ACL 2010, 384-394.
  7. Church, K.W. & Hanks, P. (1990). "Word Association Norms, Mutual Information, and Lexicography." Computational Linguistics, 16(1), 22-29.
  8. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., & Dean, J. (2013). "Distributed Representations of Words and Phrases and their Compositionality." NeurIPS 2013.
  9. Levy, O. & Goldberg, Y. (2014). "Neural Word Embedding as Implicit Matrix Factorization." NeurIPS 2014.
  10. Stratos, K., Collins, M., & Hsu, D. (2014). "A Spectral Algorithm for Learning Class-Based n-gram Models of Natural Language." Proceedings of UAI 2014.
  11. Zuckerman, G. (2019). The Man Who Solved the Market: How Jim Simons Launched the Quant Revolution. Portfolio.