
Pointwise Mutual Information and the Independence Baseline

A technical introduction to PMI, the independence null hypothesis, and how the observed bigram statistics of a real corpus deviate from it. Companion prelim to the class-based n-grams article.

English is not random. That sentence is the whole reason statistical language modeling exists, and it is also the whole reason this article has to be written at all.

If you took every word in Jane Austen's Pride and Prejudice, wrote each one on a slip of paper, dumped the slips into a hat, and pulled them back out one at a time, you would reconstruct a text with exactly the same vocabulary distribution as the novel. The word the would still show up 4,331 times. Pemberley would still show up a few dozen. Nothing would be missing and nothing would be added. And yet you would never mistake the result for a novel, or for a grocery list, or for any piece of writing anyone had ever produced, because the hat destroys the one thing about real English that makes it English, which is the part where some words want to be near certain other words and some do not.

A Victorian top hat overflowing with paper slips cut from a book, each scrap holding a single word.
It is a truth universally acknowledged that language is not random.

Austen writes to be more than ten times as often as the hat would.

She writes the mr zero times, against the hat's prediction of nearly 28.

She writes of the about three and a half times as often as chance alone would have arranged.

Every one of those gaps is a piece of structure the hat cannot see, and every technique in statistical NLP, from collocation extraction through Brown clustering through word embeddings through the transformer you used to draft your last email, can be read as an elaborate machine for measuring how large the gap is and exploiting what it contains.

The demo below takes every word from Austen's Pride and Prejudice, writes each one on a slip of paper, dumps the slips into a hat, and pulls them back out at random. The vocabulary distribution is identical to the novel's. The word order is not.

Source text: Pride and Prejudice, Jane Austen, 1813.
Tokenization: lowercased alphabetic words only, punctuation stripped.
Shuffle: Fisher-Yates on a fresh copy each draw. No two draws are the same.
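The demo's pipeline can be sketched in a few lines of Python. The function names and the sample sentence are illustrative; only the tokenization rule (lowercased alphabetic words, punctuation stripped) and the Fisher-Yates shuffle come from the description above.

```python
import random
import re

def tokenize(text: str) -> list[str]:
    # Lowercased alphabetic words only, punctuation stripped,
    # matching the demo's stated tokenization.
    return re.findall(r"[a-z]+", text.lower())

def hat_draw(tokens: list[str], rng: random.Random) -> list[str]:
    # Fisher-Yates shuffle on a fresh copy: same unigram
    # distribution as the source, word order destroyed.
    slips = list(tokens)
    for i in range(len(slips) - 1, 0, -1):
        j = rng.randrange(i + 1)
        slips[i], slips[j] = slips[j], slips[i]
    return slips

tokens = tokenize("It is a truth universally acknowledged, that a single man...")
shuffled = hat_draw(tokens, random.Random(0))
assert sorted(shuffled) == sorted(tokens)  # vocabulary preserved exactly
```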

The Independence Null Hypothesis

A null hypothesis is the boring story about your data. Specifically, it is the story you would tell if nothing structural were happening inside it. You compute what the data should look like under the boring story, you compare to what the data actually shows, and the size of the gap is your evidence that something non-boring is going on.

For adjacent-word statistics, the natural boring story is the one the hat tells. Every word position is an independent draw from the novel's unigram distribution. No memory, no context, no preference for what comes next. The hat is the null hypothesis in physical form.

Under that story, if two words \(X\) and \(Y\) have marginal probabilities \(P(X)\) and \(P(Y)\) estimated from the corpus, then the probability of seeing \(X\) immediately followed by \(Y\) at any given position is just \(P(X) \cdot P(Y)\). That's it. The model has no way of knowing that certain words prefer certain other words, because the independence assumption defined that knowledge out of existence before any bigrams were considered.
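As a concrete sketch, the independence prediction needs nothing beyond unigram counts. The miniature corpus below is hypothetical; the arithmetic is exactly the formula above.

```python
from collections import Counter

# Hypothetical miniature corpus; the article's real numbers come
# from the full Pride and Prejudice token stream.
tokens = "it is a truth universally acknowledged that it is a truth".split()
N = len(tokens)
unigram = Counter(tokens)

def p(word: str) -> float:
    # Marginal (unigram) probability estimated from the corpus.
    return unigram[word] / N

def p_independent(x: str, y: str) -> float:
    # Probability of x immediately followed by y under the
    # independence null hypothesis: just the product of marginals.
    return p(x) * p(y)

print(round(p_independent("it", "is"), 4))  # → 0.0331
```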

No one actually writes that way. A real English writer is constrained by a stack of regularities the hat has never heard of: syntactic frames, fixed collocations, topical consistency, the conventions of register.

Side-by-side comparison of ordered Austen prose and randomly shuffled words.
Left: Austen. Right: the hat.

None of that is in the null hypothesis. Which is exactly why the null hypothesis is useful. Every one of those regularities shows up as a measurable gap between what the hat predicts and what Austen actually wrote.

Ratios of Observed to Expected

The simplest way to measure the gap is the most obvious one. Divide the observed bigram count by what the hat would have predicted, and look at the ratio.

Most pairs sit close to one. That means the pair shows up about as often as chance would predict, and there is nothing particularly interesting about it. The pairs that matter are the ones where the ratio is far from one, and in practice those can reach ten times chance, forty times chance, or several hundred times chance. The ratio is a measure of how much the two words behave as if they knew the other existed, compared to a world in which neither had any idea the other was in the vocabulary.

Concrete case. In the 123,520 tokens of Pride and Prejudice, the word of occurs 3,585 times and the word the occurs 4,331 times. The independence prediction for the bigram of the is therefore:

\[ P(\textit{of}) \cdot P(\textit{the}) \;=\; \frac{3585}{123520} \cdot \frac{4331}{123520} \;\approx\; 0.00102 \]

That probability, multiplied by the novel's 123,519 adjacent positions, works out to roughly 126 expected occurrences of the pair across the whole novel.

Austen writes of the 462 times. That is 3.6 times the independence prediction, which makes it a mildly sticky pair. Mild because of and the are both so common that they would collide frequently even if Austen had been drawing from the hat.

to be tells a stronger story. Predicted: about 42. Observed: 438. Ratio: roughly ten. to is an order of magnitude more likely to be followed by be than the hat would ever guess.

it was lands in similar territory. Predicted: about 23. Observed: 254. Ratio: roughly eleven.

None of these are outliers. They are the bread and butter of English prose, which is made almost entirely of syntactic frames, fixed collocations, and topical consistencies, each of them placing a weight on the scale that the hat has no way of knowing about.
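The of the arithmetic above, reproduced in Python with the counts quoted in this section:

```python
# The article's counts from Pride and Prejudice.
N = 123_520               # total tokens in the novel
c_of, c_the = 3_585, 4_331
observed = 462            # actual "of the" bigram count

# Independence prediction: product of marginals, scaled by the
# number of adjacent positions (N - 1).
p_pair_independent = (c_of / N) * (c_the / N)
expected = p_pair_independent * (N - 1)

ratio = observed / expected
print(round(expected))    # → 126
```

With these exact counts the ratio comes out just under 3.7; the 3.6× quoted above is the same quantity with intermediate rounding.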

From Ratio to Pointwise Mutual Information

Ratios are fine for a single pair but awkward to aggregate across many of them. They can range from zero to arbitrarily large, and they do not combine linearly in any natural way. The standard fix is to take the log. Logarithms convert multiplicative structure into additive structure and produce a number whose sign tells you whether the pair is sticky or repulsive. The resulting quantity is what the literature calls pointwise mutual information:

\[ \operatorname{PMI}(X, Y) \;=\; \log_2 \frac{P(X, Y)}{P(X)\, P(Y)} \]

The numerator is the joint probability of seeing \(X\) followed by \(Y\), estimated as the bigram count divided by the total number of bigram positions. The denominator is the independence prediction from the previous section. The ratio between the two is the quantity the earlier examples already worked through, and the logarithm just reports that ratio in a more convenient form.
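A minimal count-based estimator, following the definitions just given (bigram count over bigram positions in the numerator, unigram frequencies in the denominator). The function name and toy input are illustrative.

```python
import math
from collections import Counter

def pmi_bits(tokens: list[str], x: str, y: str) -> float:
    """PMI(x, y) in bits for adjacent-word bigrams, estimated
    from raw counts as described in the text."""
    n_big = len(tokens) - 1                 # number of bigram positions
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    p_xy = bigram[(x, y)] / n_big           # joint probability
    p_x = unigram[x] / len(tokens)
    p_y = unigram[y] / len(tokens)
    if p_xy == 0:
        return float("-inf")                # zero observed count
    return math.log2(p_xy / (p_x * p_y))

toks = "to be or not to be".split()
print(round(pmi_bits(toks, "to", "be"), 2))  # → 1.85
```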

Base-2 logarithms are the convention, so PMI is reported in bits. The numerical correspondences are worth memorizing: a ratio of 1× is 0 bits, 2× is +1 bit, 4× is +2 bits, 8× is +3 bits; below chance, 1/2× is −1 bit, and an observed count of exactly zero is −∞.

The bigrams from the previous section, recomputed in these terms against the Austen corpus, produce the following table.

Pair     Observed   Expected   Ratio    PMI (bits)   Interpretation
of the   462        ~126       3.6×     +1.85        Mild positive association
to be    438        ~42        10.4×    +3.38        Strong syntactic frame
it was   254        ~23        11.0×    +3.46        Narrative collocation
in the   385        ~80        4.8×     +2.27        Preposition plus determiner
of her   266        ~40        6.6×     +2.72        Possessive frame
the mr   0          ~27.6      0×       −∞           Absent by register
Bigram association statistics from Pride and Prejudice, with PMI in bits.

Each row in the table is a countable claim about how English behaves in this particular corpus. None of them are predictions the independence null hypothesis could have produced. That is exactly why the deviations carry the signal.

The Bigram Matrix in the Browser

The handful of pairs above is a small sample of the full bigram matrix. A 50-word vocabulary gives you 2,500 cells. A 5,000-word vocabulary gives you 25 million. Inspecting any reasonable number of cells by hand would be tedious beyond the point of usefulness, so the demo below renders the entire 50-by-50 matrix as a colored grid, loaded with Pride and Prejudice by default and swappable for any text you paste or upload.

The color mapping is straightforward: each cell's hue encodes the sign of its PMI and its intensity the magnitude, so sticky pairs and repulsive pairs read as two different colors at a glance.

Every cell is clickable. Clicking pulls up the observed count, the independence prediction, the ratio, the PMI in bits, and every occurrence of the bigram in the novel with its surrounding context.

The 50 × 50 bigram co-occurrence matrix, loaded with Pride and Prejudice by default. Every cell is clickable for its PMI math and the actual in-text occurrences. Use the ↻ Upload your own text button in the header to rerun the same pipeline on any corpus you care about: paste text, drop a file, or pick one of the built-in classics. The real Brown 1992 tokenization and bigram counting runs on the server and returns a fresh matrix in a few seconds.

A few minutes of clicking around the matrix will do more for your intuition about PMI than any amount of formula reading. The structure is directly visible as a pattern of colors. Strong stickiness and strong repulsion are both readable at a glance, and any individual cell's numerical justification is one click away. The sections that follow will refer back to the matrix as a way of grounding claims that would otherwise stay abstract.

Consider the results of clicking the cell at the intersection of had and been:

Tooltip from the bigram matrix demo showing the had/been cell expanded: bigram count 204, P(had) = 0.00950, P(been) = 0.00417, actual 0.001652, independence baseline 0.000040, ratio 41.7x, PMI +5.38 bits. Right panel shows twenty of the 204 had-been occurrences in their surrounding Austen sentences.
The had been cell of the Austen bigram matrix, with the PMI arithmetic on the left and twenty of the 204 in-context occurrences on the right.

The left panel walks the arithmetic out step by step.

had appears in about 1 word in every 100 positions of the novel, and been in about 1 in 240.

If Austen had written by drawing words from the hat, the pair had been should have landed adjacent about 4 times in every 100,000 positions, by pure multiplication of those two chances.

In the actual novel it lands 165 times in every 100,000. That is 41.7 times more often than the hat would have guessed, which gives a PMI of +5.38 bits.
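The same walk-through as arithmetic on the tooltip's numbers:

```python
import math

# Checking the had/been cell with the figures from the tooltip.
N = 123_520
p_had, p_been = 0.00950, 0.00417
observed = 204

baseline = p_had * p_been       # independence prediction per position
actual = observed / (N - 1)     # observed joint probability
ratio = actual / baseline
pmi = math.log2(ratio)
print(round(ratio, 1), round(pmi, 2))  # → 41.7 5.38
```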

The right panel shows twenty of those 204 occurrences in their surrounding Austen sentences, the bigram highlighted in each one, so you can see that had been is doing exactly the narrative work the PMI value claims it is: past perfect continuations of narrative action, the backbone of nineteenth-century free indirect prose.

Every other sticky cell in the matrix has a corresponding story, and the panel on the right is where you read it.

Negative PMI and the Forms It Takes

PMI can easily go negative, and that tends to read like an impossibility on first encounter. It is not. A negative value means the pair shows up in real text less often than two independent draws from the unigram distribution would predict. Something in the structure of the language is actively working against the pair. Three distinct phenomena can cause it, and the difference matters.

Absence by Convention

The cleanest source of strongly negative PMI is categorical absence. Both words are common on their own, but grammar, usage, or register rules out the pair.

The corpus-specific example in Austen is the mr. the is about 3.5 percent of all tokens in the novel. mr is about 0.6 percent. The independence prediction works out to:

\[ P(\textit{the}) \cdot P(\textit{mr}) \cdot (N-1) \;\approx\; 0.0353 \cdot 0.00634 \cdot 123{,}519 \;\approx\; 27.6 \]

The hat expects about 28 occurrences. The actual count is zero. The logarithm of zero is negative infinity, which is the PMI value the formula returns for any pair whose observed count is exactly zero.
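The expected-count arithmetic, using the percentages above:

```python
# The hat's prediction for "the mr" under independence.
N = 123_520
p_the, p_mr = 0.0353, 0.00634

expected = p_the * p_mr * (N - 1)   # expected adjacent occurrences
print(round(expected, 1))           # → 27.6; the observed count is 0
```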

The interesting thing about the mr is that the zero count is not a rule of English grammar. It is a feature of Austen's register, which treats Mr as a title attached directly to a surname. That is why it is a better example than the the, which is zero in nearly every English corpus because the grammar does not permit two determiners in a row. A modern corpus would contain plenty of the Mr. Right, the Mr. Darcy we are discussing, the Mr. Smith you mentioned. The zero count in Austen is about her, not about English.

This category of negative PMI reflects structural scaffolding rather than meaning. A collocation extractor that drops strongly negative pairs is treating them as noise, which is usually the right call.

Topical Separation

The next source of negative PMI is topical separation. Two words both belong to the corpus vocabulary, but they live in different topical neighborhoods and rarely find themselves adjacent.

Imagine a corpus that combines The Lord of the Rings with The Hitchhiker's Guide to the Galaxy. The two books share many function words and almost no content words. Sauron appears near Shire and Mordor. Vogon appears near poetry and bulldozer. Any cross-book bigram receives a negative PMI score, because the independence null hypothesis does not know about topical gravity while the real corpus is thoroughly organized by it.

Topical PMI is rarely as extreme as grammatical PMI. Cross-topical pairs can still occur in transitional passages. But it is responsible for most of the moderate negative values visible in a typical bigram matrix.

Synonym Competition

The most interesting source of negative PMI is synonym competition. Two words are so similar in meaning and distribution that they compete for the same syntactic slot, and they avoid appearing next to each other in anything a human actually wrote.

believe and suppose illustrate the phenomenon. Both are common. Both are reliably followed by that. Both are reliably preceded by I. Both mean approximately the same thing in almost all contexts. The independence hypothesis would predict that believe suppose occurs occasionally, simply because both words are in the vocabulary.

In real text, no one writes I believe suppose he left. Writers pick one synonym and move on. The pair has a negative PMI value despite the fact that the two words are semantically and distributionally almost indistinguishable.

This is the observation distributional semantics was built to explain.

Zellig Harris noted it explicitly in his 1954 paper Distributional Structure. Two words are synonyms, he wrote, "if they have almost identical environments except chiefly for sentences which contain both." That qualifying clause is the prose observation that synonyms avoid each other, made forty years before anyone had the computational resources to verify it at scale.

The phenomenon is sometimes called the synonymy paradox. The pair-level PMI is negative for exactly the words whose broader distributional profiles are most similar. Resolving it requires looking past any single cell of the bigram matrix and thinking about the whole row.

Mutual Information for the Whole Matrix

Everything so far has been about a single cell of the bigram matrix. But PMI also has an aggregate form: a single number that summarizes how much structure the whole matrix contains.

A note on notation before the formula. PMI is the name for a single cell of the matrix. One pair of words, one log-ratio, one number.

The quantity we are about to compute is not that. It is the sum of every PMI across every cell, weighted by how often each pair actually occurs. That is a different thing, so it gets a different symbol.

The literature calls the single-cell version pointwise mutual information, specifically to mark it as the per-cell value. The one without "pointwise" is the aggregate, and it is usually written \(I\). Same family of quantity, different level of resolution.

Brown's 1992 paper uses \(I\) for the aggregate, and so will this article. If we called both quantities PMI, the reader would lose track of which level of resolution is being discussed at any given moment.

The aggregate version sums the weighted PMI across every bigram position in the corpus.

\[ I \;=\; \sum_{X, Y} P(X, Y)\, \log_2 \frac{P(X, Y)}{P(X)\, P(Y)} \]

The sum runs over every pair of words \((X, Y)\) in the vocabulary. Each pair's PMI is weighted by its joint probability. The result is the expected number of bits of information that adjacent-word positions carry above the independence baseline.

A corpus whose words are statistically independent has \(I = 0\) by construction. A corpus with structure has \(I > 0\). In practice the numbers for a real English corpus over a typical 5,000-word vocabulary land in the single digits. The literature calls this quantity the average mutual information of the bigram distribution.
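A sketch of the aggregate computation, summing weighted PMI over the bigrams that actually occur (pairs with zero count contribute nothing, since their weight \(P(X, Y)\) is zero). The function name and toy corpus are illustrative.

```python
import math
from collections import Counter

def average_mutual_information(tokens: list[str]) -> float:
    """Aggregate I in bits: the P(X, Y)-weighted sum of per-pair
    PMI over every bigram observed in the token stream."""
    n_big = len(tokens) - 1
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    total = 0.0
    for (x, y), c in bigram.items():
        p_xy = c / n_big
        p_x = unigram[x] / len(tokens)
        p_y = unigram[y] / len(tokens)
        total += p_xy * math.log2(p_xy / (p_x * p_y))
    return total

# A perfectly repetitive corpus is maximally structured, so I > 0.
print(average_mutual_information("a b a b a b a b".split()) > 0)  # → True
```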

This single number is the endpoint of the PMI story. It is also the starting point of the next one. Brown clustering, and every other technique that builds on distributional similarity, treats the matrix-wide \(I\) as a fixed budget and asks which transformations of the matrix preserve it best. That is where the companion article picks up: Class-Based n-gram Models of Natural Language.

Summary

Pointwise mutual information is the log-ratio of observed to expected bigram frequency. The expected value is computed under the assumption that the two words are drawn independently from the corpus's unigram distribution.

Positive PMI indicates a pair that appears more often than the independence baseline would predict. It usually reflects syntactic frames, fixed collocations, or tight topical association between the two words.

Negative PMI indicates a pair that appears less often than the baseline would predict. It can reflect grammatical prohibition, topical separation, or synonym competition, depending on the words in question.

Single cells of the bigram matrix are the finest unit of analysis, and they are noisy. Aggregating them into a single matrix-wide score, the average mutual information, gives a stable number that describes how much structure the corpus contains above the independence baseline. That aggregate number is what makes PMI the foundation for every downstream technique that treats distributional similarity as a computable quantity rather than a slogan.

. . .

References

  1. Harris, Z. (1954). "Distributional Structure." Word, 10(2-3), 146-162.
  2. Church, K. W., & Hanks, P. (1990). "Word Association Norms, Mutual Information, and Lexicography." Computational Linguistics, 16(1), 22-29.
  3. Brown, P. F., deSouza, P. V., Mercer, R. L., Della Pietra, V. J., & Lai, J. C. (1992). "Class-Based n-gram Models of Natural Language." Computational Linguistics, 18(4), 467-479.
  4. Austen, J. (1813). Pride and Prejudice. 123,520 tokens, used as the corpus for the bigram-matrix demo.